Hot OS topic

wererabit · Post by **wererabit** » Sun Apr 11, 2010 4:11 pm

hi guys,

I am not sure if this the right place to ask, so sorry if it isnt. I love OS development (and maybe a bit security) and want to find a research topic that can work on for my thesis. I have been searching for a while but apparently, there is not many research topic in OS.

So just wonder if any of you guys have any suggestions?

Thanks

NickJohnson · Post by **NickJohnson** » Sun Apr 11, 2010 5:26 pm

Concurrency is probably a good general area for research, due to the rapid changes that widespread multithreading and multiple cores are causing. I don't know exactly what you could research though...

Edit:
Maybe what you should be researching is a way of formally describing an operating system, just like you can formally define a programming language. One of the reasons it's such a slow field (as far as I can tell) is that OS implementation is relatively arcane and ad-hoc. I wouldn't be too surprised if there's a reasonable "algebra" of continuation operations that could be used to produce a theoretically complete threading system, for example.

Benk · Post by **Benk** » Sat May 08, 2010 11:39 pm

Actually research is pretty interesting though most focus around concurrency.
Very few async systems have been built and those mostly fall back on sync after 1 message is queued.

- Asynch IPC ( for the concurrency)
- App run time for Asynch IPC , esp with no wait handles and user threads.
- no MMU , Software based MM.
- Language based/OO OS ( eg Memory safe and Type Safe)
- Capability security
- Kernel Verification
- 3 rd gen micro kernels with direct user to user IPC
- How to replace shared memory for IPC.

Ben

SOOOS operating system.

Neolander · Post by **Neolander** » Sun May 09, 2010 1:45 am

Benk wrote:- no MMU , Software based MM.

I don't understand why so much people hate MMUs. It's fast. It's powerful. It's useful. It's fairly inexpensive. So why getting rid of it ?

Benk · Post by **Benk** » Sun May 09, 2010 8:12 am

See other thread , nothing wrong with MMU and paging which are good and give your copy on write its the TLB for VM to actual address lookup that causes most of the pain.

Turn off MMU get +5-15% performance gain ..pm me for sources
MMU takes 25% die space on an Arm chip , that could give you more cores is it worth it. Eg on a dual CPU Arm board with 6 cores each you would get 3 more cores instead of the MMU.
The TLB lookup is a power hungry parrallel search ( on all entries in the list since a linear search or hash would be too slow) for each lookup.
Nothing that cant be done in software more efficiently.
TLB contamination in IPC between cores
TLB contamination and weakening for Virtualization which is increasing the TLB count ( and hence power and die size)
TLB flushes on context switch
TLB miss on page table lookup.
TLB miss requires expenisve page table walk (which can also miss) and contaminates the cache ,

Anyway this is RESEARCH we are talking about which aims to investigate it.

gerryg400 · Post by **gerryg400** » Sun May 09, 2010 8:46 am

Benk,

Have you seen that Nehalem addresses some of your concerns? In particular the new architecture (perhaps not fully implemented yet) has a context id with each tlb entry so the entire tlb is not flushed on a context switch.

In fact if there are many cores, (let's say as many cores as there are running processes) and each core has it's own tlb and if processes have tight core affinity, there will be few tlb flushes.

On a quad-core Nehalem, the mmu takes about 10%, just a bit less than one core.

- gerryg400

gravaera · Post by **gravaera** » Sun May 09, 2010 8:48 am

Benk wrote:See other thread , nothing wrong with MMU and paging which are good and give your copy on write its the TLB for VM to actual address lookup that causes most of the pain.

Turn off MMU get +5-15% performance gain ..pm me for sources
MMU takes 25% die space on an Arm chip , that could give you more cores is it worth it. Eg on a dual CPU Arm board with 6 cores each you would get 3 more cores instead of the MMU.
The TLB lookup is a power hungry parrallel search ( on all entries in the list since a linear search or hash would be too slow) for each lookup.
Nothing that cant be done in software more efficiently.
TLB contamination in IPC between cores
TLB contamination and weakening for Virtualization which is increasing the TLB count ( and hence power and die size)
TLB flushes on context switch
TLB miss on page table lookup.
TLB miss requires expenisve page table walk (which can also miss) and contaminates the cache ,

Anyway this is RESEARCH we are talking about which aims to investigate it.

Sorry, but I found some of these to be...less than solid, if you excuse me.

"TLB flushes on context switch" : That depends on how the CPU implements paging, and how much it tries to manage the TLB itself. On an arch without hardware assisted TLB loading, the kernel can, especially for a special purpose kernel, which has say, only one task, easily partition the TLB and manage entries on it own. So it splits up the TLB half and half and doesn't flush TLB entries on context switch. Simply because x86 walks the page tables automatically and therefore has to flush on each context switch doesn't mean that 'flushing on context switch' is an attribute of CPUs with TLBs.

"TLB miss on Page Table Lookup" : Even in software, a cache would be of a fixed size, otherwise it would be useless. If the cache is allowed to grow indefinitely then lookup time increases. So it would be expected that a TLB could only have N entries. Again, simply because x86 walks the tables on its own, and decides to leave the programmer out of the whole deal, doesn't mean the the idea of a PMMU is bad: on an arch which does not have hardware assisted TLB loading, one can easily, in one's TLB management code, when 'flushing' an entry from a TLB, store it in a fast cache of flushed entries in RAM. The idea would be that "These entries were at one time in the TLB, and will likely be accessed again. So storing in them in some fast data structure in software would speed up TLB loads in software."

"TLB miss requires expensive page table walk ... proliferate cache cooling" : True. But when software is written with caches in mind, the programmer tends to have the process image as condensed as is possible. Also, the need for a TLB lookup is only really existent when a new page is accessed. That means that a 4K boundary of instructions or data fetches must be crossed before a new TLB load may need to be initiated. 4K is a lot of bytes of instructions, and a modest bit of data, too. You may say: "Jumps will easily invalidate that argument", but I would say: "A data/instruction cache line is usually of a pretty small size, so really, the number of potential cache misses is high whether or not you have to walk page tables."

"The TLB lookup is a power hungry parrallel search ( on all entries in the list since a linear search or hash would be too slow) for each lookup.
Nothing that cant be done in software more efficiently." : I can't speak with much authority on this one, but generally a hash works well for any kind of cache that has a tuple field which can be sensibly split on some modulo value. Virtual addresses in a TLB can be hashed, and associativity can be used to, again, speed up the lookups.

I'm not sure where your research paper came from, but maybe those people need to redo their research and come up with some better reasons for not using PMMUs. Also, they haven't considered the great alleviation that is the removal of the problem of address space fragmentation. Without paging, you have one fixed address space within which to load all processes. When a process is killed, it leaves a hole in this fixed linear address space. As you create and kill more processes, you'll end up having processes scattered everywhere, and eventually, due to placement problems, you'll need to move one or more processes around to find room for other new ones (but how do you do that quickly?), or else just tell the user: "Close one of more programs, and we'll play lotto to see whether that will free enough contiguous space to load a new process."

--Nice read altogether,
gravaera

Owen · Post by **Owen** » Sun May 09, 2010 10:15 am

Benk wrote:The TLB lookup is a power hungry parrallel search ( on all entries in the list since a linear search or hash would be too slow) for each lookup.
Nothing that cant be done in software more efficiently.

Since when? Most x86 TLBs are 4 way associative, so addresses have to be compared (Plus perhaps one for 4MB pages, an architectural complication, not one required by paging). Many other architectures do things differently. Regardless, please tell me how this is any different from cache?

And the TLB area is mostly an issue of architectures which do the page table walks in hardware. PowerPC's hash table walks are much smaller, and the SPARC and MIPS software managed TLBs are absolutely tiny.

That x86 TLB is absolutely insane and sometimes has to do the following lookup chain does not mean that other architectures share that property

Guest PML4E -> Host PML4E -> Host PDPTE -> Host PDE -> Host PTE
Guest PDPTE -> Host PML4E -> Host PDPTE -> Host PDE -> Host PTE
Guest PDE -> Host PML4E -> Host PDPTE -> Host PDE -> Host PTE
Guest PTE -> Host PML4E -> Host PDPTE -> Host PDE -> Host PTE

Selenic · Post by **Selenic** » Sun May 09, 2010 12:35 pm

Owen wrote:Regardless, please tell me how this is any different from cache?

Elaborating on this, the Wikipedia page on caching mentions that some processors use virtual addresses in their L1 caches, so the TLB doesn't even (necessarily) slow it down.

Also, notice this. It demonstrates a lot about x86 vs. ARM and not a lot about MMUs:

Owen wrote:That x86 TLB is absolutely insane ...

gerryg400 wrote:On a quad-core Nehalem, the mmu takes about 10%, just a bit less than one core.

Benk wrote:MMU takes 25% die space on an Arm chip , that could give you more cores is it worth it. Eg on a dual CPU Arm board with 6 cores each you would get 3 more cores instead of the MMU.

Finally, about that last quote, 12 * 1.25 = 15 (your figure) < 12 / 0.75 = 16 (what you would actually get).
In addition, with ARMs (at least, when you're using them for a specialized embedded system), there are three memory-management options to pick from anyway (IIRC): no MMU, protection only and full paging+TLB.

gerryg400 · Post by **gerryg400** » Sun May 09, 2010 5:33 pm

I suspect, thought I can't find it actually written anywhere that the Nehalem chips use virtual addresses in both the L1 and L2 caches. So TLB is only probably only searched when going to the L3 cache, which is common to all cores on the die. Furthermore the L3 cache lookup and the TLB access will be partly done in parallel since even with paging enabled 12 address lines don't change.

- gerryg400

Benk · Post by **Benk** » Sun May 09, 2010 7:21 pm

gerryg400 wrote:Benk,

Have you seen that Nehalem addresses some of your concerns? In particular the new architecture (perhaps not fully implemented yet) has a context id with each tlb entry so the entire tlb is not flushed on a context switch.

In fact if there are many cores, (let's say as many cores as there are running processes) and each core has it's own tlb and if processes have tight core affinity, there will be few tlb flushes.

On a quad-core Nehalem, the mmu takes about 10%, just a bit less than one core.

- gerryg400

The Nehalem , uses these because it is needed most of the perf loss from Virtualization is due to the higher TLB miss rate. IMHO i makes the problem worse as the TLB becomes bigger ( most of the MMU is TLB)
.
On the Nehalem its 10% ( less on previous) but thats because x86 cores are huge and wastefull , on arm and similar its 25%. Now consider 10 years from now when we are looking at 200-1000 core thats 20 -250 extra cores . It will make for major changes. I suspect that IF OS can overcome the concurrency issues ( which requires changing the way we write apps - eg fully non waiting asynch) than large amounts of simpler cores will become more dominant ( Sun believes this also and has put a lot of investment over the years in such cores) .

Anyway I may be wrong but i think so and its an interesting topic to debate , its not good to always rely on preconceived ideas

Benk · Post by **Benk** » Sun May 09, 2010 7:32 pm

Selenic wrote:
Owen wrote:Regardless, please tell me how this is any different from cache?
Elaborating on this, the Wikipedia page on caching mentions that some processors use virtual addresses in their L1 caches, so the TLB doesn't even (necessarily) slow it down.

Also, notice this. It demonstrates a lot about x86 vs. ARM and not a lot about MMUs:
Yes i think Arm will be very strong in the future and in a highly parralel world may challenge intel on the server and the fact you can use their chips with various MMU states is great.

Finally, about that last quote, 12 * 1.25 = 15 (your figure) < 12 / 0.75 = 16 (what you would actually get).
In addition, with ARMs (at least, when you're using them for a specialized embedded system), there are three memory-management options to pick from anyway (IIRC): no MMU, protection only and full paging+TLB.

You are correct thats 4 more cores if the TLB takes 25% , 12-> 16

Benk · Post by **Benk** » Sun May 09, 2010 8:06 pm

Owen wrote:
Benk wrote:The TLB lookup is a power hungry parrallel search ( on all entries in the list since a linear search or hash would be too slow) for each lookup.
Nothing that cant be done in software more efficiently.
Since when? Most x86 TLBs are 4 way associative, so addresses have to be compared (Plus perhaps one for 4MB pages, an architectural complication, not one required by paging). Many other architectures do things differently. Regardless, please tell me how this is any different from cache?

And the TLB area is mostly an issue of architectures which do the page table walks in hardware. PowerPC's hash table walks are much smaller, and the SPARC and MIPS software managed TLBs are absolutely tiny.

That x86 TLB is absolutely insane and sometimes has to do the following lookup chain does not mean that other architectures share that property

Guest PML4E -> Host PML4E -> Host PDPTE -> Host PDE -> Host PTE

Guest PDPTE -> Host PML4E -> Host PDPTE -> Host PDE -> Host PTE

Guest PDE -> Host PML4E -> Host PDPTE -> Host PDE -> Host PTE

Guest PTE -> Host PML4E -> Host PDPTE -> Host PDE -> Host PTE

Agree the software managed TLBs are much better , but the question is why do it in HW ? When we run 1000 core in 10-20 years time , you can dedicate 50 cores to it and you would still be ahead. Also it greatly increases OS portability.

TLB counts have gone up old sparc chips were small but Virtualization and the desire to use 4k pages ( for copy and write , shared ipc etc) means more and more entries .

Look at sparc 64 ( compared to the old one)

CPU die measuring 202 mm2,
MMU die 103 mm2, ( though it has some cache)
CACHE die 84 mm2

On intel the MMU is small because of their bloated cored but we dont notice becuase they have the best manufacturing by far 10* die size and much smaller. With the same Transister count you could fit a LOT of no mmu arm cores
eg a 6 core i7 has 1,170,000,000 transistors , Armv6 has 35,000 if you scaled it down to a similar process as x86 it would run at 2Ghs but have 33000 cores. Having 33,000 MMUs makes no sense or even 10+ .

Why dont we use this because OS dont allow apps to solve the concurrency problem though with Asyncronous non blocking and use of much more immutable data it should be possible

the "associative lookup" is a parralel search in HW on all the TLB entries. Yes a similar mechanism is used on cache but it is doubled up.

Anyway this is predicated on OS helping apps solve the concurrency issues which i think will happen.

gerryg400 · Post by **gerryg400** » Sun May 09, 2010 8:59 pm

I suspect that IF OS can overcome the concurrency issues ( which requires changing the way we write apps - eg fully non waiting asynch) than large amounts of simpler cores will become more dominant ( Sun believes this also and has put a lot of investment over the years in such cores) .

Isn't this a bit of a chicken and egg thing ? Who's going to write an OS and new apps for hardware that doesn't yet exist ? And who's going to build hardware for an OS and apps that don't exist. It's been tried - even Intel has been burned that way. A profitable migration path forward would be needed. But is it necessary to change our architecture at all ? I'm entirely happy with the performance of my PC.

BTW, what if the tlb for each core was big enough to hold all the pdirs, pgtbls for every process, so that there is never a tlb miss ? Wouldn't that be cool ?

- gerryg400

gerryg400 · Post by **gerryg400** » Sun May 09, 2010 9:03 pm

Agree the software managed TLBs are much better , but the question is why do it in HW ? When we run 1000 core in 10-20 years time , you can dedicate 50 cores to it and you would still be ahead. Also it greatly increases OS portability.

So you're suggesting devoting 5% of the die to software TLB ? Current Nehalem uses much less than that for a hardware TLB.

- gerryg400

OSDev.org

Hot OS topic

Hot OS topic

Re: Hot OS topic

Re: Hot OS topic

Re: Hot OS topic

Re: Hot OS topic

Re: Hot OS topic

Re: Hot OS topic

Re: Hot OS topic

Re: Hot OS topic

Re: Hot OS topic

Re: Hot OS topic

Re: Hot OS topic

Re: Hot OS topic

Re: Hot OS topic

Re: Hot OS topic