Combuster wrote:
Software TLBs are interesting, but they also pose an additional challenge (one I'm willing to try one day).
I'm willing to try the challenge, too. But for now I'll stick to x86. And I don't think we're going to see software TLBs there in the (near) future.
Combuster wrote:
Also, there is no proper way to manually manage a TLB in software, since it may evict entries at random. There is also the problem of cache colouring when the TLB is not fully associative. Also, a cleared A or D bit will cause a write to the page table when the entry is used (for writing in case of the D bit). If you can guarantee that writebacks and TLB invalidations do not occur randomly, then yes this can work. But then again, I don't have all the TLB-related details, and there's always SMM that has potential of trashing your precious TLB.
I fully agree. The random eviction of entries wouldn't be a (big) problem in our case. And the A and D bits could be set in advance, so no problem there either. So for our situation we're fine. But you're right that this wouldn't be the case in typical systems. Good point.
Combuster wrote:
Turning a hardware TLB into a software TLB is like wasting computing resources. There's only very little chance that custom TLB management is faster.
That's mainly the thing that I'm concerned about.
Combuster wrote:
Every miss will generate a fault which takes an order of magnitude more cycles than the hardware approach.
Exactly this was my point in our discussion. Do you - or someone else here - have an estimate of what it costs (in time) to enter a page fault handler and return from it? (The handling itself could be tweaked, but not the hardware side.)
Combuster wrote:
Then again, the difference will only really matter if whatever application does not fit within the size of the TLB, since only a fixed amount of pagefaults are needed after a task switch (and you could possibly do that greedily for more effect)
This is also the reason why nobody is doing it - the x86 has a hardware TLB and nobody in his right mind is going to break performance by overriding that. You'd only do that when the platform requires you to.
The counter-argument against me was that it would significantly reduce the size of the paging structures. Imagine a 64-bit world where each process has its own page tables (okay, that's common) and every process uses a very large portion of the address space. Then a great deal of RAM is "wasted" on the paging structures alone, and the space needed grows with both the number of processes and the amount of address space used. With the approach I outlined in my original post, the space needed grows only with the number of processes (and the space needed per process is very small), or is even fixed (and extremely small).
What do you - and the others - think? Is that a valid argument? Or is it negligible/less severe?
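To put a rough number on the space argument: here is a back-of-envelope calculation of how much RAM the classic x86-64 4-level paging structures consume when a process densely maps a large address space. The figures (64 GiB mapped per process, 100 processes) are illustrative assumptions, not measurements.

```python
# Back-of-envelope: RAM consumed by x86-64 4-level page tables when a
# process densely maps a large address space with 4 KiB pages.
# The 64 GiB / 100-process numbers are illustrative assumptions.

PAGE = 4096       # bytes per page and per paging-structure page
ENTRIES = 512     # 8-byte entries per paging-structure page

def table_pages(mapped_bytes):
    """Paging-structure pages needed to map `mapped_bytes` densely
    across all four levels (PT, PD, PDPT, PML4)."""
    entries = -(-mapped_bytes // PAGE)   # ceil: leaf PTEs needed
    total = 0
    for _ in range(4):                   # walk up the four levels
        pages = -(-entries // ENTRIES)   # ceil: pages at this level
        total += pages
        entries = pages                  # each page needs one entry above
    return total

mapped = 64 * 2**30                      # 64 GiB per process (assumption)
per_process = table_pages(mapped) * PAGE
print(per_process // 2**20, "MiB of page tables per process")
print(100 * per_process // 2**30, "GiB for 100 such processes")
```

With these assumptions each process needs roughly 128 MiB of page tables, so 100 such processes burn over 12 GiB on translation metadata alone, which is the growth the space argument is about.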
Combuster wrote:
That said, if I were to design a CPU, it would come with SW TLBs (of variable size).
Full agreement. That was more or less what I said in the discussion: "I don't think turning HW TLBs into SW TLBs is the right way (because of performance impact). But generally I would favour SW TLBs over HW TLBs."
Colonel Kernel wrote:
Hyperdrive wrote:
We then build some page table structures that will map the virtual address to the appropriate physical address. Make the relevant path for the transition marked "present".
That's why performance would probably be poor. TLB misses are way, way more frequent than page faults. You want TLB miss handlers to run really, really fast.
Yes and no. I agree with you on the facts (TLB misses are far more frequent, and TLB miss handlers have to be really fast), but I think the TLB miss handler I outlined can be made really fast. Whether it would be fast enough, I don't know. My gut feeling is that it wouldn't be; and I'm afraid only a complete implementation can prove or disprove the estimates.
Colonel Kernel wrote:
IMO the main reasons to go with software-managed TLBs are hardware-related, not software-related. The logic required to implement the x86 page table support and co-ordinate it across multiple processors is non-trivial, and has some bugs (http://download.intel.com/design/mobile ... 922214.pdf). Those transistors would be better spent on bigger TLBs. Also, if there are bugs in the TLB handlers, you can actually fix them.
Right. That's my opinion, too. You can change software much more easily (recompile and it's done). So we all agree on this point: big software TLBs would be nice.
From your two opinions (thank you for those!) I feel affirmed in my position. Well, we are researchers, so the emulated software TLB will most likely be realized - just to see how good or bad it is (I'll bet on "bad", and from our discussion here I think you would, too - but we'll see). If you'd like, I'll keep you informed.
One more quick question: Do the emulators emulate TLBs?
- Bochs
- Qemu
- VirtualBox
- VirtualPC
- VMWare
- ...
I couldn't find really reliable information on that.
--TS