Hot OS topic

Discussions on more advanced topics such as monolithic vs micro-kernels, transactional memory models, and paging vs segmentation should go here. Use this forum to expand and improve the wiki!
Benk
Member
Posts: 62
Joined: Thu Apr 30, 2009 6:08 am

Re: Hot OS topic

Post by Benk »

Sorry, but I found some of these to be... less than solid, if you'll excuse me.
Some were more solid than others. The biggest one IMHO is that when we go to 100 cores, 100 MMUs don't make sense.
"TLB flushes on context switch" : That depends on how the CPU implements paging, and how much it tries to manage the TLB itself. On an arch without hardware assisted TLB loading, the kernel can, especially for a special purpose kernel, which has say, only one task, easily partition the TLB and manage entries on it own. So it splits up the TLB half and half and doesn't flush TLB entries on context switch. Simply because x86 walks the page tables automatically and therefore has to flush on each context switch doesn't mean that 'flushing on context switch' is an attribute of CPUs with TLBs.
Yes, some archs do it better; what you describe above is a software-driven hardware TLB.
It also depends on the OS: some OSes avoid switching the user address space for exactly this reason.
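For what it's worth, here is a minimal sketch (mine, not from any real kernel) of how a context switch can avoid the flush entirely on a hypothetical CPU with an ASID-tagged, software-visible TLB; tlb_set_current_asid() and tlb_flush_all() are stand-ins for whatever the real architecture provides:

[code]
#include <stdint.h>

#define MAX_ASID 255u               /* assumed hardware limit             */

struct task {
    uint8_t asid;                   /* 0 means "no ASID assigned yet"     */
};

/* Assumed architecture hooks (inline asm on real hardware). */
void tlb_set_current_asid(uint8_t asid);
void tlb_flush_all(void);

static unsigned next_asid = 1;

void switch_address_space(struct task *next)
{
    if (next->asid == 0) {
        if (next_asid > MAX_ASID) {
            /* ASID space exhausted: only now do we pay for a full flush.
             * (A real kernel would also bump a generation counter so the
             * other tasks pick up fresh ASIDs on their next switch.)     */
            tlb_flush_all();
            next_asid = 1;
        }
        next->asid = (uint8_t)next_asid++;
    }
    /* No per-switch flush: stale entries belong to other ASIDs and simply
     * never hit while this task runs.                                    */
    tlb_set_current_asid(next->asid);
}
[/code]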
"TLB miss on Page Table Lookup" : Even in software, a cache would be of a fixed size, otherwise it would be useless. If the cache is allowed to grow indefinitely then lookup time increases. So it would be expected that a TLB could only have N entries. Again, simply because x86 walks the tables on its own, and decides to leave the programmer out of the whole deal, doesn't mean the the idea of a PMMU is bad: on an arch which does not have hardware assisted TLB loading, one can easily, in one's TLB management code, when 'flushing' an entry from a TLB, store it in a fast cache of flushed entries in RAM. The idea would be that "These entries were at one time in the TLB, and will likely be accessed again. So storing in them in some fast data structure in software would speed up TLB loads in software."
Yes, this isn't an either/or, but IMHO more of it, if not all, should be done in software.
You can have a SPARC-like software-managed TLB,
an OS patching addresses in software,
position-independent code,
or even software emulating a virtual address system (using patching and relocation to achieve the goal).

While x86s are the worst offenders due to their huge core size, they have one of the smallest MMUs relative to the core.
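To make the "keep evicted translations in a fast in-RAM cache" idea quoted above concrete, here is a rough sketch of a software TLB-miss path for a trap-on-miss (MIPS/SPARC-style) architecture. It probes a small hashed software cache before paying for the full page-table walk; every name here (tlb_write_random, walk_page_tables, ...) is a placeholder, not a real kernel API:

[code]
#include <stdbool.h>
#include <stdint.h>

#define STLB_SIZE  1024u                    /* power of two                */
#define PAGE_SHIFT 12

struct stlb_entry {
    uintptr_t vpn;                          /* virtual page number         */
    uintptr_t pte;                          /* cached translation          */
    bool      valid;
};

static struct stlb_entry stlb[STLB_SIZE];   /* software second-level TLB   */

/* Placeholders for the real architecture/kernel interfaces. */
void      tlb_write_random(uintptr_t vpn, uintptr_t pte);
uintptr_t walk_page_tables(uintptr_t vaddr);            /* slow path       */

static inline unsigned stlb_hash(uintptr_t vpn)
{
    return (unsigned)(vpn ^ (vpn >> 10)) & (STLB_SIZE - 1);
}

/* Called by the TLB replacement code when it evicts an entry: remember the
 * translation in RAM so a later miss can reload it cheaply.               */
void tlb_evicted(uintptr_t vpn, uintptr_t pte)
{
    struct stlb_entry *e = &stlb[stlb_hash(vpn)];
    e->vpn = vpn;
    e->pte = pte;
    e->valid = true;
}

/* The TLB-miss trap handler: a couple of loads and a compare on the fast
 * path, a full page-table walk only when the software cache misses too.   */
void tlb_miss_handler(uintptr_t fault_vaddr)
{
    uintptr_t vpn = fault_vaddr >> PAGE_SHIFT;
    struct stlb_entry *e = &stlb[stlb_hash(vpn)];

    if (e->valid && e->vpn == vpn) {
        tlb_write_random(vpn, e->pte);
        return;
    }
    tlb_write_random(vpn, walk_page_tables(fault_vaddr));
}
[/code]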
"TLB miss requires expensive page table walk ... proliferate cache cooling" : True. But when software is written with caches in mind, the programmer tends to have the process image as condensed as is possible. Also, the need for a TLB lookup is only really existent when a new page is accessed. That means that a 4K boundary of instructions or data fetches must be crossed before a new TLB load may need to be initiated. 4K is a lot of bytes of instructions, and a modest bit of data, too. You may say: "Jumps will easily invalidate that argument", but I would say: "A data/instruction cache line is usually of a pretty small size, so really, the number of potential cache misses is high whether or not you have to walk page tables."
Only the most highly tuned code can be written with caches in mind, and it's time-consuming and expensive beyond the old -Os (having just tuned a CAS producer/consumer queue, I can say most code will never get this treatment). The industry is moving towards higher-level languages; witness Python, PHP, Ruby and JavaScript. These also need to run fast, and they cross page boundaries frequently (and with little consideration).
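(For the curious, the kind of cache tuning I mean looks roughly like the sketch below: a single-producer/single-consumer ring where the two indices are padded onto separate cache lines so the producer and consumer cores do not false-share. This is a generic C11 illustration, not code from any particular project.)

[code]
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE  1024u                /* power of two                    */
#define CACHE_LINE 64

struct spsc_ring {
    _Alignas(CACHE_LINE) atomic_size_t head;   /* advanced by the consumer */
    _Alignas(CACHE_LINE) atomic_size_t tail;   /* advanced by the producer */
    _Alignas(CACHE_LINE) void *slots[RING_SIZE];
};

bool spsc_push(struct spsc_ring *r, void *item)   /* producer side only    */
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_SIZE)
        return false;                             /* full                  */
    r->slots[tail & (RING_SIZE - 1)] = item;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

void *spsc_pop(struct spsc_ring *r)               /* consumer side only    */
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return NULL;                              /* empty                 */
    void *item = r->slots[head & (RING_SIZE - 1)];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return item;
}
[/code]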
"The TLB lookup is a power hungry parrallel search ( on all entries in the list since a linear search or hash would be too slow) for each lookup.
Nothing that cant be done in software more efficiently." : I can't speak with much authority on this one, but generally a hash works well for any kind of cache that has a tuple field which can be sensibly split on some modulo value. Virtual addresses in a TLB can be hashed, and associativity can be used to, again, speed up the lookups.
A hash is too slow (remember, like a cache, you really want this to complete in a few cycles), so it uses a mechanism like a cache's (but faster), and like a cache it makes the chip more power-hungry. Again (I know I'm repeating myself), 100 MMUs, like 100 caches, don't make sense. We may end up with a small L1 TLB and a slower, shared L2 TLB in such a scenario, meaning TLB misses will be even more expensive.

I'm not sure where your research paper came from, but maybe those people need to redo their research and come up with some better reasons for not using PMMUs. Also, they haven't considered the great relief that comes from removing the problem of address-space fragmentation. Without paging, you have one fixed address space within which to load all processes. When a process is killed, it leaves a hole in this fixed linear address space. As you create and kill more processes, you'll end up having processes scattered everywhere, and eventually, due to placement problems, you'll need to move one or more processes around to find room for new ones (but how do you do that quickly?), or else just tell the user: "Close one or more programs, and we'll play lotto to see whether that will free enough contiguous space to load a new process."
All these things are taken care of and it works. A GC makes it easy (and compacts); stopping the threads and moving an app to remove fragmentation is quicker than a disk write. An OS can also detect an idle process, put it to sleep, and page out the whole or part of the process; witness how fast an entire machine can hibernate compared to how it just dies when you have 2× the memory load and are swapping. A big sequential write and read is much better than a large number of random-access pages.
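As a toy sketch of the "stop the threads and slide the app" step in a single flat address space (assuming position-independent or base-relative code; struct proc and the suspend/resume calls are invented for the example):

[code]
#include <stdint.h>
#include <string.h>

struct proc {
    uint8_t *base;      /* start of this process image in the flat space   */
    size_t   size;      /* image size in bytes                             */
};

/* Assumed scheduler hooks. */
void suspend_all_threads(struct proc *p);
void resume_all_threads(struct proc *p);

/* Slide one process down to new_base to close the hole left by a killed
 * process. This runs at memory-copy speed, so it is typically far cheaper
 * than paging the process out to disk and back in.                        */
void compact_one(struct proc *p, uint8_t *new_base)
{
    suspend_all_threads(p);
    memmove(new_base, p->base, p->size);
    p->base = new_base;   /* all accesses are base-relative / position independent */
    resume_all_threads(p);
}
[/code]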

In terms of running out of memory, the other option we suffer from today is death by swapping, from a bug in an allocation routine or from overcommitment; my machine has suffered from this a few times. Anyway, swapping is a different discussion.


--Nice read altogether,
And all of the above is predicated on solving the concurrency problem at high core counts, which requires some changes to the way we write apps.

Personally I think there will be a big break with the past in 5-10 years, and the following all require changing the way we write apps:

- Capability security (no ambient authority and little global data)
- Asynchronous runtime (to improve concurrency and work better in a network-connected world)
- Immutable data, e.g. OCaml (to improve concurrency)
- Software memory protection (no untrusted code, though compiled type-safe and memory-safe code is fine)

So I'm bundling it all into one, though note I'm cheating: I still allow a few C user apps/libs to ease the migration process. The above will create an OS that is far faster on multicore as well as far more reliable (including self-healing and sub-application failure, allowing the app to heal and continue) and more secure, though it will be "strange".
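As a toy illustration of the "no ambient authority" point: a task cannot name a file by global path and open it, it can only act on handles (capabilities) it was explicitly granted, each carrying its own rights. The types and calls below are invented for the example, not any real kernel's API.

[code]
#include <stddef.h>
#include <stdint.h>

typedef struct capability {
    uint32_t object_id;        /* kernel object this handle refers to      */
    uint32_t rights;           /* bitmask of what the holder may do        */
} cap_t;

#define CAP_READ   (1u << 0)
#define CAP_WRITE  (1u << 1)

/* Hypothetical kernel calls. Note there is no open("/some/global/path"):
 * the only way to get a cap_t is to be handed one.                        */
long  cap_read(cap_t file, void *buf, size_t len);
cap_t cap_derive(cap_t parent, uint32_t fewer_rights);   /* attenuation    */

long read_config(cap_t config_file, char *buf, size_t len)
{
    /* The callee can do nothing with objects it was not given. A careful
     * caller can pass a read-only derivative made with
     * cap_derive(config_file, CAP_READ).                                  */
    return cap_read(config_file, buf, len);
}
[/code]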
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: Hot OS topic

Post by gerryg400 »

Personally I think there will be a big break with the past in 5-10 years, and the following all require changing the way we write apps:

- Capability security (no ambient authority and little global data)
- Asynchronous runtime (to improve concurrency and work better in a network-connected world)
- Immutable data, e.g. OCaml (to improve concurrency)
- Software memory protection (no untrusted code, though compiled type-safe and memory-safe code is fine)

So I'm bundling it all into one, though note I'm cheating: I still allow a few C user apps/libs to ease the migration process. The above will create an OS that is far faster on multicore as well as far more reliable (including self-healing and sub-application failure, allowing the app to heal and continue) and more secure, though it will be "strange".
I wonder if Linus will accept your changes, perhaps in patch-2.6.329-no_tlb from the -benk tree ? :)

- gerryg400
If a trainstation is where trains stop, what is a workstation ?
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: Hot OS topic

Post by Owen »

Benk wrote: Agreed, software-managed TLBs are much better, but the question is why do it in hardware at all? When we run 1000 cores in 10-20 years' time, you could dedicate 50 cores to it and you would still be ahead. It also greatly increases OS portability.

TLB entry counts have gone up. Old SPARC chips had small TLBs, but virtualization and the desire to use 4K pages (for copy-on-write, shared IPC, etc.) mean more and more entries.

Look at SPARC64 (compared to the old one):

CPU die: 202 mm2
MMU die: 103 mm2 (though it includes some cache)
Cache die: 84 mm2

On Intel the MMU is small because of their bloated cores, but we don't notice because they have by far the best manufacturing: 10× the die size on a much smaller process. With the same transistor count you could fit a LOT of no-MMU ARM cores.
E.g. a 6-core i7 has 1,170,000,000 transistors and an ARMv6 core has 35,000; if you scaled it to a process similar to x86's it would run at 2 GHz, but you could have 33,000 cores. Having 33,000 MMUs makes no sense, or even 10+.

Why don't we use this? Because OSes don't let apps solve the concurrency problem, though with asynchronous, non-blocking design and much more immutable data it should be possible.

The "associative lookup" is a parallel search in hardware over all the TLB entries. Yes, a similar mechanism is used in the cache, but here it is doubled up.

Anyway, this is predicated on OSes helping apps solve the concurrency issues, which I think will happen.
You are seriously pulling things out of your arse. For a start:
  • Neither ARMv6 nor SPARC64 is a core. They are architecture revisions.
  • Anyone who knows the floor plan of a modern processor knows just how big the caches have gotten. Just a couple of generations back, look at the floor plan of the Pentium Ms with 2MB of L2 cache, where it takes up 60% of the die area.
And remember that caches have to balloon in size and area (or sharply decrease in performance) as more cores start hitting them.

If you want me to believe your points, please cite sources.
Benk
Member
Posts: 62
Joined: Thu Apr 30, 2009 6:08 am

Re: Hot OS topic

Post by Benk »

gerryg400 wrote:
Personally i think there will be a big break with the past in 5-10 years . And the following all require changing the way we write apps

- Capability security ( no ambient authority and little global data)
- Asyncronous runtime ( to improve concurrency & work better in a network connectected world)
- Immutable data eg Ocml ( to improve concurrency )
- Software memory protection ( no untrusted code , though compiled type safe and memory safe code is fine) .

So im bundling it in one , but note im cheating i still allow a few C user apps / libs to ease the migration process. The above will create an OS that is far faster on multicore as well as being far more reliable ( including self healing and sub application failure allowing the app to heal and continue) and secure though it will be "strange".
I wonder if Linus will accept your changes, perhaps in patch-2.6.329-no_tlb from the -benk tree ? :)

- gerryg400
uClinux already does this and it was accepted into mainline... so yes :-)

just comment out

CONFIG_MMU.
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: Hot OS topic

Post by Owen »

Benk wrote:
gerryg400 wrote:
Personally I think there will be a big break with the past in 5-10 years, and the following all require changing the way we write apps:

- Capability security (no ambient authority and little global data)
- Asynchronous runtime (to improve concurrency and work better in a network-connected world)
- Immutable data, e.g. OCaml (to improve concurrency)
- Software memory protection (no untrusted code, though compiled type-safe and memory-safe code is fine)

So I'm bundling it all into one, though note I'm cheating: I still allow a few C user apps/libs to ease the migration process. The above will create an OS that is far faster on multicore as well as far more reliable (including self-healing and sub-application failure, allowing the app to heal and continue) and more secure, though it will be "strange".
I wonder if Linus will accept your changes, perhaps in patch-2.6.329-no_tlb from the -benk tree ? :)

- gerryg400
uClinux already does this and it was accepted into mainline... so yes :-)

just comment out

CONFIG_MMU.
No-MMU provides no isolation; that's not what you are discussing...
Benk
Member
Posts: 62
Joined: Thu Apr 30, 2009 6:08 am

Re: Hot OS topic

Post by Benk »

Owen wrote:
Benk wrote: Agreed, software-managed TLBs are much better, but the question is why do it in hardware at all? When we run 1000 cores in 10-20 years' time, you could dedicate 50 cores to it and you would still be ahead. It also greatly increases OS portability.

TLB entry counts have gone up. Old SPARC chips had small TLBs, but virtualization and the desire to use 4K pages (for copy-on-write, shared IPC, etc.) mean more and more entries.

Look at SPARC64 (compared to the old one):

CPU die: 202 mm2
MMU die: 103 mm2 (though it includes some cache)
Cache die: 84 mm2

On Intel the MMU is small because of their bloated cores, but we don't notice because they have by far the best manufacturing: 10× the die size on a much smaller process. With the same transistor count you could fit a LOT of no-MMU ARM cores.
E.g. a 6-core i7 has 1,170,000,000 transistors and an ARMv6 core has 35,000; if you scaled it to a process similar to x86's it would run at 2 GHz, but you could have 33,000 cores. Having 33,000 MMUs makes no sense, or even 10+.

Why don't we use this? Because OSes don't let apps solve the concurrency problem, though with asynchronous, non-blocking design and much more immutable data it should be possible.

The "associative lookup" is a parallel search in hardware over all the TLB entries. Yes, a similar mechanism is used in the cache, but here it is doubled up.

Anyway, this is predicated on OSes helping apps solve the concurrency issues, which I think will happen.
You are seriously pulling things out of your arse. For a start:
  • Neither ARMv6 nor SPARC64 is a core. They are architecture revisions.
  • Anyone who knows the floor plan of a modern processor knows just how big the caches have gotten. Just a couple of generations back, look at the floor plan of the Pentium Ms with 2MB of L2 cache, where it takes up 60% of the die area.
And remember that caches have to balloon in size and area (or sharply decrease in performance) as more cores start hitting them.

If you want me to believe your points, please cite sources.
Which numbers don't you believe? The SPARC64 figures are straight from Wikipedia (and yes, there are slightly different dies based on the architecture, but that doesn't invalidate the numbers the manufacturers put out as indicative for that series); the ARM figures are from the ARM web site. If you are going to question someone, you'd better do some basic research (Wikipedia is not always right, but it's not hard to find this stuff).

Yes, for a while most of the die was L2 cache, but L2 is shared. I admit you're not going to get 33K ARM cores on an Intel-sized die, but you will get many thousands of 2 GHz cores (and to use them, yes, they have to use little cache, which puts a monolithic synchronous OS out of the picture).

The Pentium M was probably the peak for L2 cache (and it had no L3; that was some time ago). It is MUCH smaller now: even on the 4-core parts it's about 25%. L2 is now about 256K to 512K, and they employ a shared L3 (about 2 MB) between pairs of cores (which is a pain for OS designers).

Look at this:

http://www.3dnow.net/phpBB2/viewtopic.p ... w=previous

"We see right away that Nehalem devotes 85% more die area to core logic than to cache whereas Shanghai devotes about the same die area to core logic and cache"

And another poster (not me) posted that this CPU has 10% of the die as MMU (it could be per core), so that's 75% cores and other logic, 10% MMU, 15% L2 cache. (I have only commented on the MMU size on the ARM, which was on their web site for choosing between MMU and non-MMU versions.) Things have changed... the Pentium M was a long time ago.

Still not convinced? Here is a 6-core:

http://www.legitreviews.com/images/revi ... wn_die.jpg

Server Xeons have larger caches.

Anyway, I don't think any of this is disputable (OK, I exaggerate the 33K cores by not including support logic etc., but you will get a lot), but fundamentally it's sound and it supports my point that the future is hundreds or thousands of cores. The question is whether we can create an OS and runtime that helps devs remove the concurrency issues. Speaking of which, such a runtime would use far less L2 cache per core and achieve higher cache hit rates (think of a message-processing process (thread) on Minix: it just runs over the same code and data). If we can't build such an OS or runtime, then we are practically limited to 4-8 cores.
Benk
Member
Posts: 62
Joined: Thu Apr 30, 2009 6:08 am

Re: Hot OS topic

Post by Benk »


No-MMU provides no isolation; that's not what you are discussing...
It's exactly what I'm talking about: you turn off the MMU and you have no isolation AND flat memory (i.e. no TLB). See the MMU forum thread for some Linux benchmarks showing a drop in IPC times from 100 us to 20 ms with uClinux.
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: Hot OS topic

Post by Owen »

Benk wrote:
Owen wrote:
Benk wrote: Agreed, software-managed TLBs are much better, but the question is why do it in hardware at all? When we run 1000 cores in 10-20 years' time, you could dedicate 50 cores to it and you would still be ahead. It also greatly increases OS portability.

TLB entry counts have gone up. Old SPARC chips had small TLBs, but virtualization and the desire to use 4K pages (for copy-on-write, shared IPC, etc.) mean more and more entries.

Look at SPARC64 (compared to the old one):

CPU die: 202 mm2
MMU die: 103 mm2 (though it includes some cache)
Cache die: 84 mm2

On Intel the MMU is small because of their bloated cores, but we don't notice because they have by far the best manufacturing: 10× the die size on a much smaller process. With the same transistor count you could fit a LOT of no-MMU ARM cores.
E.g. a 6-core i7 has 1,170,000,000 transistors and an ARMv6 core has 35,000; if you scaled it to a process similar to x86's it would run at 2 GHz, but you could have 33,000 cores. Having 33,000 MMUs makes no sense, or even 10+.

Why don't we use this? Because OSes don't let apps solve the concurrency problem, though with asynchronous, non-blocking design and much more immutable data it should be possible.

The "associative lookup" is a parallel search in hardware over all the TLB entries. Yes, a similar mechanism is used in the cache, but here it is doubled up.

Anyway, this is predicated on OSes helping apps solve the concurrency issues, which I think will happen.
You are seriously pulling things out of your arse. For a start:
  • Neither ARMv6 nor SPARC64 is a core. They are architecture revisions.
  • Anyone who knows the floor plan of a modern processor knows just how big the caches have gotten. Just a couple of generations back, look at the floor plan of the Pentium Ms with 2MB of L2 cache, where it takes up 60% of the die area.
And remember that caches have to balloon in size and area (or sharply decrease in performance) as more cores start hitting them.

If you want me to believe your points, please cite sources.
Which numbers don't you believe? The SPARC64 figures are straight from Wikipedia (and yes, there are slightly different dies based on the architecture, but that doesn't invalidate the numbers the manufacturers put out as indicative for that series); the ARM figures are from the ARM web site. If you are going to question someone, you'd better do some basic research (Wikipedia is not always right, but it's not hard to find this stuff).
OK, according to the Wikipedia you quoted out of context:
"The MMU die contains the memory management unit, cache controller and the external interfaces. The SPARC64 has separate interfaces for memory and input/output (I/O). The bus used to access the memory is 128 bits wide. The system interface is the HAL I/O (HIO) bus, a 64-bit asynchronous bus. The MMU has a die area of 163 mm2."

So, in other words, that die has
  • The virtual memory translation (The actual MMU) logic
  • The IO/memory redirection logic
  • The cache control logic (Big and complex!)
  • IO bus drivers (Again big)
  • SDRAM controller (Complex, pipelined, big)
  • Lots of busses to connect to the other dies
  • Interlock logic
Oh, and SPARC64 is quite old. As other parts have grown, the MMU has comparatively shrunk.

Plus, you failed to mention that the Hitachi SPARC 64s had four L1 cache dies. And you failed to account for the fact that multi-die designs are always bigger and more bloated.
Yes, for a while most of the die was L2 cache, but L2 is shared. I admit you're not going to get 33K ARM cores on an Intel-sized die, but you will get many thousands of 2 GHz cores (and to use them, yes, they have to use little cache, which puts a monolithic synchronous OS out of the picture).
So, let's say you have 2000 cores on there. You probably have a 3.5 GHz memory bus, double-pumped, so 7 GT/s.

In other words, each core gets a 3.5 MT/s share of the memory bus. Abysmal. Useless, even assuming the best improvements to cache locality, which are never going to happen.
The Pentium M was probably the peak for L2 cache (and it had no L3; that was some time ago). It is MUCH smaller now: even on the 4-core parts it's about 25%. L2 is now about 256K to 512K, and they employ a shared L3 (about 2 MB) between pairs of cores (which is a pain for OS designers).

Look at this:

http://www.3dnow.net/phpBB2/viewtopic.p ... w=previous

"We see right away that Nehalem devotes 85% more die area to core logic than to cache whereas Shanghai devotes about the same die area to core logic and cache"

And another poster (not me) posted that this CPU has 10% of the die as MMU (it could be per core), so that's 75% cores and other logic, 10% MMU, 15% L2 cache. (I have only commented on the MMU size on the ARM, which was on their web site for choosing between MMU and non-MMU versions.) Things have changed... the Pentium M was a long time ago.

Still not convinced? Here is a 6-core:

http://www.legitreviews.com/images/revi ... wn_die.jpg

Server Xeons have larger caches.

Anyway, I don't think any of this is disputable (OK, I exaggerate the 33K cores by not including support logic etc., but you will get a lot), but fundamentally it's sound and it supports my point that the future is hundreds or thousands of cores. The question is whether we can create an OS and runtime that helps devs remove the concurrency issues. Speaking of which, such a runtime would use far less L2 cache per core and achieve higher cache hit rates (think of a message-processing process (thread) on Minix: it just runs over the same code and data). If we can't build such an OS or runtime, then we are practically limited to 4-8 cores.
And how does that image tell you how much die is devoted to cache? L3, sure, but what about L1 and L2? That's just lumped under "Core". It also says nothing about how much is taken by the TLB.

I can definitely spot a lot of blocks in those cores which look like SRAM, however.
Benk
Member
Posts: 62
Joined: Thu Apr 30, 2009 6:08 am

Re: Hot OS topic

Post by Benk »

Owen, you really do have a bee in your bonnet. This thread is not about who is right or wrong; it's about what MAY happen and what are good areas of research. You won't know these things until you build them. If you want to comment, be constructive and write mature posts.

- There is no need to quote huge slabs of Wikipedia, because anyone can look it up.
- To avoid my comments being misconstrued, for the MMU I specifically added the note that the MMU die contains some cache (which could be a large share of it).
- I did not say that on a SPARC64 almost 30% was MMU, but I do hold that it is not an insignificant amount.
Even if the MMU is 5-10% on the 100-1000 core chips of the future, that is still a massive waste. Intel has the most bloated cores, and on their new CPUs it's "only" 10% (not my figure); ARM state clearly that it's 25%, which makes sense since their cores are tiny.

I'm not trying to convince anyone, because the preconditions first have to happen. I'm just suggesting a path I think has a high probability of happening, but first you have to think about huge-core-count async systems (which the best researchers still struggle with for general-purpose systems) and how the runtime will work (e.g. they need little cache and suffer badly from TLB pollution in IPC).

And please research your answers and spend a bit more time before you comment. L2 and L3 are CLEARLY marked (L2 is the small boxes that say 256kB L2), and your Pentium M example is about 10 years out of date and completely wrong for modern CPUs. The Core 2 had a 6 MB cache shared between its cores but no L3. L1 is insignificant and not shown, e.g. L1 is less than 1% of the size of L3.


I will requote; this is not my opinion (though it's obvious): "We see right away that Nehalem devotes 85% more die area to core logic than to cache whereas Shanghai devotes about the same die area to core logic and cache."
It is interesting to note that what I was saying earlier about L2 TLBs is already happening, e.g. Nehalem has 64 L1 TLB entries and 512 L2 TLB entries per core.

Anyway, you are arguing the wrong point. The real question is whether we can go to high core counts for a general-purpose system, and if we do, IMHO there will be big changes. Obviously an async system allows threads to run over a much smaller domain, hence smaller caches are needed, and hence core counts can ramp up further.

It is very interesting that Intel and AMD CPUs are tuned to Windows and, to a lesser extent, Linux: monolithic systems which require large amounts of L2 cache. A fully async system would work better with, say, double the L1, but it could use a fraction of the L2 and L3.
Neolander
Member
Posts: 228
Joined: Tue Mar 23, 2010 3:01 pm
Location: Uppsala, Sweden

Re: Hot OS topic

Post by Neolander »

I agree with Owen about the 100-1000 core chips. It's not going to happen anytime soon, since with 8-core chips bus-related issues already start to show up.

Adding that many cores would require a full revamp of the bus and of the general CPU architecture to work. It probably involves getting rid of the unified memory model for good and having several "virtual computers" in your CPU control logic. Why not; after all, distributed operating systems are a major trend nowadays. But adding multicomputer-related issues in order to improve the computing power of computers which spend most of their time waiting for I/O sounds strange. On servers running highly scalable code, I see the logic there (getting compact multicomputers), but globally your vision of all OSes becoming async makes sense, in my opinion.
Benk
Member
Posts: 62
Joined: Thu Apr 30, 2009 6:08 am

Re: Hot OS topic

Post by Benk »

I agree you won't solve these in a traditional OS; you need to rethink the way we program and/or the way we write OSes, and IMHO that means fully asynchronous IPC.

There are a dozen high-core-count chips out there. Bus and cache issues are an issue because of the way we write apps on UNIX and Windows. On something like Minix it would be MUCH less of an issue (though amusingly it doesn't support multicore).

Example: the Intel Teraflops chip, 80 cores at 3, 5 and 7 GHz, 100M transistors (you could fit 10-16 of these chips in a modern Core 2). Power usage is also a fraction of current CPUs'.
You can see Intel and MS also see this, having donated 20M each to build a new parallel lib for these chips, though they think it can be solved by teaching developers multithreading with better libs.

I think over the next 10 years CPUs will hit the limit on GHz, and parallelism will be the ONLY way to go. An OS that can do that better will be dominant in the 10-20 year time frame, and I would add that an app that runs at 10% efficiency on a 100-core chip will have a significant advantage (100%) if a better OS lets it use 20%.
Selenic
Member
Posts: 123
Joined: Sat Jan 23, 2010 2:56 pm

Re: Hot OS topic

Post by Selenic »

Benk wrote: I think over the next 10 years CPUs will hit the limit on GHz, and parallelism will be the ONLY way to go.
Isn't this already pretty much true?

Either way, as mentioned before, more cores sharing the same memory bandwidth is problematic. That's why many-processor servers tend to be NUMA, with one block of memory coupled to each processor (especially the AMD ones, but also some Intel ones, IIRC).

One semi-crazy idea I see potentially working for many-core processors sharing one memory bus would be a reconfigurable per-core cache (as a size example, I think AMD Phenom IIs (their newest processors, I think) have 64K+64K L1 (separate instruction/data) and 512K L2 per core, with 6M shared L3) where some portion can be used as memory by the OS. You could then use this for small, short-lived, thread-local objects, with main memory being used for larger storage and things which will be shared a lot (to avoid stealing bandwidth from the other cores' "fast" memory). On the other hand, programming for this model sounds like it would be a bit annoying (particularly from an OS perspective, but also from a userland perspective).
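A rough sketch of how the OS-visible side of that might look (everything here is invented for illustration; the scratchpad window would really come from the OS or firmware): a per-core bump allocator serves small, short-lived, thread-local objects from the exposed cache region and falls back to the ordinary heap for anything big or shared.

[code]
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SCRATCH_SIZE (256 * 1024)    /* e.g. half of a 512K per-core L2     */

struct scratchpad {
    uint8_t *base;                   /* window the OS mapped over cache RAM */
    size_t   used;
};

static _Thread_local struct scratchpad sp;   /* one per core/thread         */

void *alloc_local(size_t n)
{
    n = (n + 15) & ~(size_t)15;      /* keep 16-byte alignment              */
    if (sp.base && sp.used + n <= SCRATCH_SIZE) {
        void *p = sp.base + sp.used; /* fast path: never touches the bus    */
        sp.used += n;
        return p;
    }
    return malloc(n);                /* large or shared: ordinary heap      */
}

void reset_local(void)               /* e.g. at the end of a request/frame  */
{
    sp.used = 0;
}
[/code]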
Benk
Member
Posts: 62
Joined: Thu Apr 30, 2009 6:08 am

Re: Hot OS topic

Post by Benk »

Selenic wrote:
Benk wrote: I think over the next 10 years CPUs will hit the limit on GHz, and parallelism will be the ONLY way to go.
Isn't this already pretty much true?

Either way, as mentioned before, more cores sharing the same memory bandwidth is problematic. That's why many-processor servers tend to be NUMA, with one block of memory coupled to each processor (especially the AMD ones, but also some Intel ones, IIRC).

One semi-crazy idea I see potentially working for many-core processors sharing one memory bus would be a reconfigurable per-core cache (as a size example, I think AMD Phenom IIs (their newest processors, I think) have 64K+64K L1 (separate instruction/data) and 512K L2 per core, with 6M shared L3) where some portion can be used as memory by the OS. You could then use this for small, short-lived, thread-local objects, with main memory being used for larger storage and things which will be shared a lot (to avoid stealing bandwidth from the other cores' "fast" memory). On the other hand, programming for this model sounds like it would be a bit annoying (particularly from an OS perspective, but also from a userland perspective).
Yes, it is almost true, but Intel will get a few more 25% jumps.

Re NUMA: if we can't avoid the bus with larger L1 and L2 caches then you are correct, it will be forced and we will have NUMA on chip.

Those things will help, but the real answer is to not have threads running all over the code base. An async message-based IPC system would use a fraction of the cache, since each service only runs over its own code, e.g. an mm service would just iterate over the mm data and code.
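Here is a rough sketch of what I mean by a service that only runs over its own code and data: a memory-manager service written as one event loop draining a mailbox. The message format and IPC calls are placeholders, not any real microkernel's API.

[code]
#include <stddef.h>
#include <stdint.h>

enum mm_op { MM_ALLOC, MM_FREE };

struct mm_msg {
    enum mm_op op;
    uintptr_t  addr;        /* for MM_FREE                                 */
    size_t     len;         /* for MM_ALLOC                                */
    int        reply_port;  /* where the asynchronous reply goes           */
};

/* Placeholder IPC and allocator primitives. */
int       mq_receive(int port, struct mm_msg *out);      /* blocks         */
void      mq_reply(int reply_port, uintptr_t result);
uintptr_t mm_do_alloc(size_t len);
void      mm_do_free(uintptr_t addr);

void mm_service_loop(int port)
{
    struct mm_msg m;
    /* One hot loop over one small working set: the mm's own code plus its
     * allocator structures. That locality is why its cache (and TLB)
     * behaviour is so much better than arbitrary threads calling into the
     * mm from all over the system.                                        */
    for (;;) {
        if (mq_receive(port, &m) < 0)
            continue;
        switch (m.op) {
        case MM_ALLOC:
            mq_reply(m.reply_port, mm_do_alloc(m.len));
            break;
        case MM_FREE:
            mm_do_free(m.addr);
            mq_reply(m.reply_port, 0);
            break;
        }
    }
}
[/code]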
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: Hot OS topic

Post by Owen »

Even at a 1 kHz tick rate, context switches cause only a rather small proportion of cache misses.

Additionally, the majority of such misses are for the data cache: while intensive apps tend to iterate over the same small portion of code repetitively (for me it's about ~48 KB), they tend to hit lots of data (for me, easily a ~4 MB working set which is iterated over ~60 times per second).

And, unfortunately, a lot of said intensive apps are games or other 3D-based apps, and there is no multithreaded 3D library yet.
Benk
Member
Posts: 62
Joined: Thu Apr 30, 2009 6:08 am

Re: Hot OS topic

Post by Benk »

Owen wrote: Even at a 1 kHz tick rate, context switches cause only a rather small proportion of cache misses.

Additionally, the majority of such misses are for the data cache: while intensive apps tend to iterate over the same small portion of code repetitively (for me it's about ~48 KB), they tend to hit lots of data (for me, easily a ~4 MB working set which is iterated over ~60 times per second).

And, unfortunately, a lot of said intensive apps are games or other 3D-based apps, and there is no multithreaded 3D library yet.
Yes, that is correct; note however that async systems have 10-100× more context switches.

What causes the cache misses is the thread wandering everywhere: it runs through the kernel, through a dozen libs, through the mm, and that's before it even gets to user code. This is what async systems do better, as each process services a small working set (i.e. your repetitive 48K of code becomes the common case).

And yes, 3D is difficult because it's a large object domain; however, it can still be broken down (just like on the video card), with the vertex buffers and shaders as a separate layer (pretty much how DirectX does it) from the objects and the physics. I believe DirectX now has it (no idea how well it works). http://www.rorydriscoll.com/2009/04/21/ ... ithreading.

I don't believe 3D rendering will work much better on high-core-count systems, but there you are limited to the video card and video bus anyway. All the other parts of a 3D app, however, can be improved.