Benk wrote:See the other thread; there is nothing wrong with the MMU and paging, which are good and give you copy-on-write. It's the TLB, doing the virtual-to-physical address lookup, that causes most of the pain.
Turn off the MMU and you get a 5-15% performance gain; PM me for sources.
The MMU takes 25% of the die space on an Arm chip; that space could give you more cores instead, so is it worth it? E.g. on a dual-CPU Arm board with 6 cores each you would get 3 more cores instead of the MMUs.
The TLB lookup is a power-hungry parallel search (across all entries, since a linear search or hash would be too slow) on every lookup.
Nothing that can't be done more efficiently in software.
TLB contamination in IPC between cores
TLB contamination and weakening under virtualization, which is increasing the TLB entry count (and hence power and die size)
TLB flushes on context switch
TLB miss on page table lookup.
A TLB miss requires an expensive page table walk (which can itself miss) and contaminates the cache.
Anyway, this is RESEARCH we are talking about, which aims to investigate it.
Sorry, but I found some of these to be... less than solid, if you'll excuse me.
"TLB flushes on context switch" : That depends on how the CPU implements paging, and how much it tries to manage the TLB itself. On an arch without hardware assisted TLB loading, the kernel can, especially for a special purpose kernel, which has say, only one task, easily partition the TLB and manage entries on it own. So it splits up the TLB half and half and doesn't flush TLB entries on context switch. Simply because x86 walks the page tables automatically and therefore has to flush on each context switch doesn't mean that 'flushing on context switch' is an attribute of CPUs with TLBs.
"TLB miss on Page Table Lookup" : Even in software, a cache would be of a fixed size, otherwise it would be useless. If the cache is allowed to grow indefinitely then lookup time increases. So it would be expected that a TLB could only have N entries. Again, simply because x86 walks the tables on its own, and decides to leave the programmer out of the whole deal, doesn't mean the the idea of a PMMU is bad: on an arch which does not have hardware assisted TLB loading, one can easily, in one's TLB management code, when 'flushing' an entry from a TLB, store it in a fast cache of flushed entries in RAM. The idea would be that "These entries were at one time in the TLB, and will likely be accessed again. So storing in them in some fast data structure in software would speed up TLB loads in software."
"TLB miss requires expensive page table walk ... proliferate cache cooling" : True. But when software is written with caches in mind, the programmer tends to have the process image as condensed as is possible. Also, the need for a TLB lookup is only really existent when a new page is accessed. That means that a 4K boundary of instructions or data fetches must be crossed before a new TLB load may need to be initiated. 4K is a lot of bytes of instructions, and a modest bit of data, too. You may say: "Jumps will easily invalidate that argument", but I would say: "A data/instruction cache line is usually of a pretty small size, so really, the number of potential cache misses is high whether or not you have to walk page tables."
"The TLB lookup is a power hungry parrallel search ( on all entries in the list since a linear search or hash would be too slow) for each lookup.
Nothing that cant be done in software more efficiently." : I can't speak with much authority on this one, but generally a hash works well for any kind of cache that has a tuple field which can be sensibly split on some modulo value. Virtual addresses in a TLB can be hashed, and associativity can be used to, again, speed up the lookups.
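For example, a hashed, 2-way set-associative software TLB only compares the two ways of one set instead of searching every entry in parallel. This is just a sketch with parameters I made up (256 sets, 4 KiB pages):

#include <stdbool.h>
#include <stdint.h>

#define SETS 256            /* power of two, indexed by hashing the VPN */
#define WAYS 2              /* associativity                            */

struct way {
    uintptr_t vpn;
    uintptr_t pfn;
    bool      valid;
};

static struct way tlb_cache[SETS][WAYS];

/* Look up a virtual address: hash the virtual page number to pick a set,
 * then compare only the ways in that set. */
bool tlb_cache_lookup(uintptr_t vaddr, uintptr_t *paddr)
{
    uintptr_t vpn = vaddr >> 12;                    /* assume 4 KiB pages */
    struct way *set = tlb_cache[vpn & (SETS - 1)];  /* cheap hash: mask   */

    for (unsigned w = 0; w < WAYS; w++) {
        if (set[w].valid && set[w].vpn == vpn) {
            *paddr = (set[w].pfn << 12) | (vaddr & 0xfff);
            return true;
        }
    }
    return false;   /* miss: fall back to the page table walk */
}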
I'm not sure where your research paper came from, but maybe those people need to redo their research and come up with some better reasons for not using PMMUs. Also, they haven't considered the great alleviation that is the removal of the problem of address space fragmentation. Without paging, you have one fixed address space within which to load all processes. When a process is killed, it leaves a hole in this fixed linear address space. As you create and kill more processes, you'll end up with processes scattered everywhere, and eventually, due to placement problems, you'll need to move one or more processes around to find room for new ones (but how do you do that quickly?), or else just tell the user: "Close one or more programs, and we'll play lotto to see whether that frees enough contiguous space to load a new process."
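A tiny, contrived example of that placement problem (the hole sizes are made up): total free space can exceed what a new process needs while no single hole is big enough, so something has to be relocated, or you play the lotto.

#include <stdio.h>

int main(void)
{
    const unsigned holes[] = { 96, 128, 64 };   /* free holes, in KiB      */
    const unsigned need = 200;                  /* new process image, KiB  */
    unsigned total = 0, biggest = 0;

    for (unsigned i = 0; i < sizeof holes / sizeof holes[0]; i++) {
        total += holes[i];
        if (holes[i] > biggest)
            biggest = holes[i];
    }

    printf("total free: %u KiB, largest hole: %u KiB, need: %u KiB\n",
           total, biggest, need);
    if (total >= need && biggest < need)
        printf("enough memory overall, but fragmented: a process must move\n");
    return 0;
}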
--Nice read altogether,
gravaera