Code: Select all
[cpu0] ---> [PRIVATE-RAM0]
   |
 (slow)
   |
[cpu1] ---> [PRIVATE-RAM1]
   |
(slower)
   |
[cpu2] ---> [PRIVATE-RAM2]
   |
 (slow)
   |
[cpu3] ---> [PRIVATE-RAM3]
   |
(slowest) [possible high latency]
   |
[cpu4] ---> [PRIVATE-RAM4]
   |
 (slow)
   |
[cpu5] ---> [PRIVATE-RAM5]
   |
(slower)
   |
[cpu6] ---> [PRIVATE-RAM6]
   |
 (slow)
   |
[cpu7] ---> [PRIVATE-RAM7]
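To make the tiers concrete, here is a sketch (in C, my choice of language) of how boot code might describe those latencies as a distance table, similar in spirit to ACPI's SLIT. The values 10/20/40/80 are made up and only encode the fast/slow/slower/slowest ordering.
Code: Select all
/* Sketch: the diagram's latency tiers as a SLIT-style distance table.
 * numa_distance[i][j] is the relative cost for cpu i to reach
 * PRIVATE-RAMj; only the ordering of the values matters. */
#include <stdint.h>

#define NCPU 8

static uint8_t numa_distance[NCPU][NCPU];

static uint8_t distance(int i, int j)
{
    if (i == j)         return 10;  /* own private RAM (fast)    */
    if (i / 2 == j / 2) return 20;  /* pair partner    (slow)    */
    if (i / 4 == j / 4) return 40;  /* same quad       (slower)  */
    return 80;                      /* other half      (slowest) */
}

void numa_distance_init(void)
{
    for (int i = 0; i < NCPU; i++)
        for (int j = 0; j < NCPU; j++)
            numa_distance[i][j] = distance(i, j);
}
With a topology like that, the CPUs could then be partitioned among kernels: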
Code: Select all
kernel0 <cpu0, cpu1>
kernel1 <cpu2, cpu3>
kernel2 <cpu4, cpu5>
kernel3 <cpu6, cpu7>
You could also do (may not be beneficial, but possible):
Code: Select all
kernel0 <cpu6, cpu7, cpu4>
kernel1 <cpu5, cpu3>
kernel2 <cpu2>
kernel3 <cpu1>
This should give us an A-SMP type system: each kernel works in SMP mode, but the whole system could be AMP, or plain SMP if you only had one kernel.
Now, the kernels could share threads of the same process, because all physical memory is directly addressable by instructions on the memory bus/interface. Of course, a process could also be locked to a specific kernel and CPU, or to a specific group of kernels.
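As a rough illustration (the names are mine, not an existing API), the lock could be nothing more than a pair of bitmasks on the process:
Code: Select all
/* Hypothetical sketch: per-process affinity as two bitmasks.  Each
 * kernel's scheduler would only run threads whose masks allow that
 * kernel and CPU. */
#include <stdint.h>

struct process {
    uint8_t kernel_mask;  /* bit k set => may run under kernel k */
    uint8_t cpu_mask;     /* bit c set => may run on cpu c       */
};

/* Example: lock a process to kernel1 and its CPUs (cpu2, cpu3). */
void lock_to_kernel1(struct process *p)
{
    p->kernel_mask = 1u << 1;
    p->cpu_mask    = (1u << 2) | (1u << 3);
}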
Memory Management (NUMA Memory Domains)
There will exist multiple physical page heaps and logical page heaps. A logical page heap is just a collection of references to physical page heaps. Your physical page heaps would therefore be:
Code: Select all
phypageheap<0>: PRIVATE-RAM0
phypageheap<1>: PRIVATE-RAM1
phypageheap<2>: PRIVATE-RAM2
phypageheap<3>: PRIVATE-RAM3
phypageheap<4>: PRIVATE-RAM4
phypageheap<5>: PRIVATE-RAM5
phypageheap<6>: PRIVATE-RAM6
phypageheap<7>: PRIVATE-RAM7
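One possible shape for such a heap descriptor, as a C sketch (the field names are my own guesses at what would be needed):
Code: Select all
/* Sketch of a physical page heap descriptor: one per PRIVATE-RAM
 * region, each with its own free list so allocations never cross a
 * region boundary by accident. */
#include <stdint.h>
#include <stddef.h>

struct phypageheap {
    uintptr_t base;      /* first physical address of the region */
    size_t    npages;    /* region size in pages                 */
    void     *freelist;  /* head of this region's free page list */
    uint8_t   owner;     /* kernel that created the heap         */
    int       sharable;  /* may other kernels reference it?      */
};

static struct phypageheap phypageheap[8];  /* one per PRIVATE-RAMx */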
The physical page heaps just delineate the different regions, either because of different access latency and throughput, or simply to separate the memory of two different kernels (not sure if that would be desirable, but it is a side effect).
Logical Heaps
Then kernel0 might have the following logical page heaps:
Code: Select all
logpageheap<0>: phypageheap<0> (fast)
logpageheap<1>: phypageheap<1> (slow)
logpageheap<2>: phypageheap<2>, phypageheap<3> (slower)
logpageheap<3>: phypageheap<4>, phypageheap<5>, phypageheap<6>, phypageheap<7> (slowest)
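In code, a logical heap could be as simple as an ordered list of references into the physical heaps, nearest first (a sketch, reusing the phypageheap type declared earlier):
Code: Select all
/* Sketch: a logical page heap is just an ordered collection of
 * references to physical page heaps. */
struct phypageheap;  /* assumed declared as in the earlier sketch */

struct logpageheap {
    struct phypageheap *phy[8];  /* referenced physical heaps, nearest first */
    int                 nphy;    /* how many references                      */
};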
Each kernel would have its own set of logical heaps which reference the actual physical heaps. Some heaps could also be deemed non-sharable, so kernel0 might create a phypageheap from part of its private RAM that is NOT used (shared) by any other kernel, and put the remainder into a phypageheap that IS used (shared) by all other kernels, or only some of them.
Basically, my point is that the possible configurations are many, so this scheme should support any type of system configuration. For example, one day we might see one primary CPU and two smaller, lower-power CPUs that are tightly coupled by cache. You might then get better performance by running the two lower-power CPUs in SMP mode under kernel1 and the larger CPU under kernel0, perhaps with a faster, more local region of memory for the two smaller CPUs.
This sort of thing would be decided at boot time by creating the phypageheaps and logpageheaps ahead of time. It could also be changed at run-time, but I am not sure there is a good reason to do that.
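For instance, the boot code could derive each kernel's logical heaps mechanically from the distance table sketched earlier, one logical heap per latency tier (again just a sketch; dist() repeats the made-up values from before):
Code: Select all
/* Sketch: build a kernel's logpageheaps at boot by grouping the
 * phypageheaps by distance from the kernel's home node. */
#include <stdint.h>

#define NHEAP 8

/* Made-up distances matching the topology above. */
static uint8_t dist(int home, int j)
{
    if (home == j)         return 10;
    if (home / 2 == j / 2) return 20;
    if (home / 4 == j / 4) return 40;
    return 80;
}

struct logpageheap {
    int phy[NHEAP];  /* indices of the referenced phypageheaps */
    int nphy;
};

/* One logical heap per latency tier, nearest tier first.  For home
 * node 0 this yields exactly the kernel0 layout listed above. */
void build_logheaps(int home, struct logpageheap out[4])
{
    static const uint8_t tier[4] = { 10, 20, 40, 80 };
    for (int t = 0; t < 4; t++) {
        out[t].nphy = 0;
        for (int j = 0; j < NHEAP; j++)
            if (dist(home, j) == tier[t])
                out[t].phy[out[t].nphy++] = j;
    }
}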
NUMA-Aware and Non-Aware Applications/Processes
Now, whether an application was designed NUMA-aware (takes advantage of the kernel's advanced memory allocation features) or NUMA-unaware would determine which logical heaps memory is allocated from. A NUMA-aware application might create threads, assign them to a specific kernel (or a specific kernel and CPU), and allocate memory only from specific logical heaps to maximize performance. Whereas a non-NUMA-aware application might just grab memory from the fastest heap first and end up with memory all over the place. Of course, there might be an algorithm that could help with this, such as automatically deciding whether to restrict the process's threads to a specific kernel and CPU, a specific kernel, a group of kernels, or all kernels.
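A minimal sketch of that unaware fallback, assuming a hypothetical per-heap allocator alloc_from() and a lookup logheaps_for() that returns the running CPU's logical heap for a given tier:
Code: Select all
/* Sketch: a non-NUMA-aware allocation just walks the current CPU's
 * logical heaps from fastest to slowest and takes the first free page. */
#include <stddef.h>

struct logpageheap;

void *alloc_from(struct logpageheap *h);             /* assumed helper */
struct logpageheap *logheaps_for(int cpu, int tier); /* assumed helper */

void *numa_unaware_alloc(int cpu)
{
    for (int tier = 0; tier < 4; tier++) {
        void *page = alloc_from(logheaps_for(cpu, tier));
        if (page)
            return page;  /* fastest tier with a free page wins */
    }
    return NULL;          /* no physical memory left anywhere   */
}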
Why
You may be wondering why such a complex hierarchy of memory allocation is needed. My feeling is that as core counts increase, keeping the performance benefits will require a less flat memory system, like my diagram above. You might not see anything with more layers than that, but I have a feeling we will see at least one more layer added. When I say layer, I do not mean the memory is actually layered, but rather that the access latencies are layered, for two reasons: distance (because of EM limitations) and contention.
As you add more CPUs, they and their memory banks naturally get further apart, so a solution is to give a few CPUs a more direct connection to their closer companions while still retaining a separate memory bus for accessing the memory in other groups. This is similar to an IP network, of course without the dynamic routing -- which does pique my curiosity about the benefits of something like that between memory, or hot-pluggable CPUs and memory banks, but that is not relevant here. Or maybe it is more similar to an Ethernet with switches. Anyway, my point is I do not care exactly how it is implemented, only that memory is directly addressable and that different regions of that addressable memory have different latencies from different CPUs.
Conclusion
I have a few doubts myself. The first is whether we will ever see a non-flat NUMA system like this, and whether any actual benefits could be reaped. Another fuzzy area is the handling of non-NUMA-aware applications/code/programs/processes. There has to be some algorithm (even if simple) to help these programs make the most of the system. It could be as simple as grabbing the fastest memory visible from whatever CPU the thread is running on, or something more complex such as automatically deciding to lock threads to specific CPUs, kernels, or groups of kernels. It is also an open question whether an algorithm could move memory around at run-time, transferring it from a lower-throughput region to a higher-throughput one while the process/thread is running. How these algorithms would work is beyond my understanding at the moment, as is whether they are even possible.
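For what it is worth, here is a rough sketch of what one such migration step might look like; every helper here (is_hot(), tier_of(), alloc_fast_page(), remap()) is a placeholder for machinery I have not designed:
Code: Select all
/* Sketch: if profiling says a page is hot and it lives in a slower
 * tier than necessary, copy it into a faster heap and remap the
 * virtual page.  All helpers are placeholders. */
#include <string.h>
#include <stddef.h>

#define PAGE_SIZE 4096

int   is_hot(void *page);            /* assumed: access profiling    */
int   tier_of(void *page, int cpu);  /* assumed: latency tier lookup */
void *alloc_fast_page(int cpu);      /* assumed: alloc from tier 0   */
void  remap(void *from, void *to);   /* assumed: page table update   */

void maybe_migrate(void *page, int cpu)
{
    if (!is_hot(page) || tier_of(page, cpu) == 0)
        return;                     /* cold, or already fastest      */
    void *fast = alloc_fast_page(cpu);
    if (!fast)
        return;                     /* no fast memory free           */
    memcpy(fast, page, PAGE_SIZE);  /* copy while the thread pauses  */
    remap(page, fast);              /* point the VA at the new page;
                                     * old page returned to its heap */
}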