
A-SMP Architecture With Non-Flat NUMA

Posted: Fri Apr 18, 2014 6:12 am
by Pancakes
This is just an illustration; the system is by no means limited to two cores per block of private RAM. There could be any number of CPUs connected in any manner, even an uneven number. I only made it balanced because I feel it is easier to look at.

Code: Select all

    
[cpu0] ---> [PRIVATE-RAM0]
                 |------------+
               (slow)         |
                 |            |
[cpu1] ---> [PRIVATE-RAM1] (slower)--------+
                              |            |
[cpu2] ---> [PRIVATE-RAM2]    |            |
                 |------------+            |
               (slow)                      |
                 |                       (slowest) [possible high latency]
[cpu3] ---> [PRIVATE-RAM3]                 |
                                           |
                                           |
                                           |
                                           |
[cpu4] ---> [PRIVATE-RAM4]                 |
                 |------------+            |
               (slow)         |            |            
                 |            |            |
[cpu5] ---> [PRIVATE-RAM5] (slower)--------+
                              |
[cpu6] ---> [PRIVATE-RAM6]    |
                 |------------+
               (slow)
                 |
[cpu7] ---> [PRIVATE-RAM7]
So you could create kernels like this:

Code: Select all

        kernel0 <cpu0, cpu1>
        kernel1 <cpu2, cpu3>
        kernel2 <cpu4, cpu5>
        kernel3 <cpu6, cpu7>

You could also do this (it may not be beneficial, but it is possible):

Code: Select all

        kernel0 <cpu6, cpu7, cpu4>
        kernel1 <cpu5, cpu3>
        kernel2 <cpu2>
        kernel3 <cpu1>

This should give us an A-SMP type system: each kernel works in SMP mode, but the whole system could be AMP, or plain SMP if you only had one kernel.

Now, the kernels could share threads of the same process, because all physical memory is directly addressable over the memory bus/interface. Of course, a process could also be locked to a specific kernel and a specific CPU, or to a specific group of kernels.
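
To make the idea concrete, here is a rough C sketch of what I mean; the struct and field names (kernel_instance, process_affinity, may_run) are just made up for illustration and are not part of any real API:

Code: Select all

#include <stdint.h>

#define MAX_KERNELS 8

/* One record per kernel instance: which CPUs it owns. */
struct kernel_instance {
    uint32_t id;
    uint32_t cpu_mask;      /* bit n set => cpuN belongs to this kernel */
};

/* Per-process placement: which kernels may run its threads, plus an
   optional pin to one CPU. */
struct process_affinity {
    uint32_t kernel_mask;   /* bit k set => kernelK may schedule us     */
    int      pinned_cpu;    /* -1 = any CPU within the allowed kernels  */
};

/* e.g. kernel0 <cpu0, cpu1>, kernel1 <cpu2, cpu3>, ... */
static struct kernel_instance kernels[MAX_KERNELS] = {
    { .id = 0, .cpu_mask = (1u << 0) | (1u << 1) },
    { .id = 1, .cpu_mask = (1u << 2) | (1u << 3) },
    { .id = 2, .cpu_mask = (1u << 4) | (1u << 5) },
    { .id = 3, .cpu_mask = (1u << 6) | (1u << 7) },
};

/* May kernel k run a thread of this process on the given CPU? */
static int may_run(const struct kernel_instance *k,
                   const struct process_affinity *a, int cpu)
{
    if (!(a->kernel_mask & (1u << k->id)))
        return 0;                       /* process not allowed on kernel */
    if (!(k->cpu_mask & (1u << cpu)))
        return 0;                       /* CPU not owned by this kernel  */
    return a->pinned_cpu < 0 || a->pinned_cpu == cpu;
}
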

Memory Management (NUMA Memory Domains)

There will be multiple physical page heaps and logical page heaps. A logical page heap is just a collection of references to physical page heaps. So your physical page heaps would be:

Code: Select all

        phypageheap<0>: PRIVATE-RAM0
        phypageheap<1>: PRIVATE-RAM1
        phypageheap<2>: PRIVATE-RAM2
        phypageheap<3>: PRIVATE-RAM3
        phypageheap<4>: PRIVATE-RAM4
        phypageheap<5>: PRIVATE-RAM5
        phypageheap<6>: PRIVATE-RAM6
        phypageheap<7>: PRIVATE-RAM7

The physical page heaps just define the different regions, either because they have different access latency and throughput, or simply to separate the memory of two different kernels (I am not sure that would be desirable, but it is a possible side effect).
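
A physical page heap could be described by something as simple as this (again just a sketch with made-up field names):

Code: Select all

#include <stdint.h>
#include <stddef.h>

/* One descriptor per physical page heap: a region of RAM plus enough
   information to tell how costly it is to reach and who owns it. */
struct phypageheap {
    uintptr_t base;          /* first physical address of the region    */
    size_t    length;        /* size of the region in bytes             */
    uint32_t  domain;        /* which PRIVATE-RAMn this region maps to  */
    uint32_t  owner_kernel;  /* kernel instance that manages this heap  */
    int       shareable;     /* 0 = private to the owner kernel         */
    /* a free-page stack or bitmap for the region would live here */
};
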

Logical Heaps

Then kernel0 might have the following logical page heaps:

Code: Select all

        logpageheap<0>: phypageheap<0> (fast)
        logpageheap<1>: phypageheap<1> (slow)
        logpageheap<2>: phypageheap<2>, phypageheap<3> (slower)
        logpageheap<3>: phypageheap<4>, phypageheap<5>, phypageheap<6>, phypageheap<7> (slowest)

Each kernel would have its own set of logical heaps, which reference the actual physical heaps. Some physical heaps could also be deemed non-shareable, so kernel0 might create a phypageheap from part of its private RAM that is NOT shared with any other kernel, and put the remainder into a phypageheap that IS shared with all other kernels, or only with some of them.
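
A logical page heap then becomes little more than an ordered list of references into those physical heaps. Continuing the sketch above (logpageheap_alloc and phypageheap_alloc are made-up names, and phypageheap_alloc is only declared, not implemented):

Code: Select all

#include <stdint.h>

#define MAX_REFS 8

uintptr_t phypageheap_alloc(struct phypageheap *ph);   /* assumed */

struct logpageheap {
    struct phypageheap *refs[MAX_REFS];   /* ordered fastest -> slowest */
    int                 nrefs;
};

/* Allocate one page, preferring the fastest physical heap that still
   has free pages. */
static uintptr_t logpageheap_alloc(struct logpageheap *lh)
{
    for (int i = 0; i < lh->nrefs; i++) {
        uintptr_t page = phypageheap_alloc(lh->refs[i]);
        if (page)
            return page;
    }
    return 0;   /* nothing left in any referenced heap */
}
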

Basically, my point is that the possible configurations are many, so this should support any type of system configuration. For example, one day we might see one primary CPU and two smaller, lower-power CPUs that are tightly coupled by cache. You might then get better performance by running the two lower-power CPUs as one SMP kernel (kernel1) and the larger CPU as another (kernel0), with the two smaller CPUs perhaps also having a faster, more local region of memory for some strange reason.

This kind of thing would be decided at boot time by creating the phypageheaps and logpageheaps ahead of time. It could also be changed at run time, but I am not sure there is a good reason to do that.

NUMA Aware Applications/Processes and Non-Aware

Now, whether the application was designed to be NUMA aware (taking advantage of the kernel's advanced memory allocation features) or NUMA unaware would determine which logical heap memory is allocated from. A NUMA-aware application might create threads, assign them to a specific kernel (or a specific kernel and CPU), and only allocate memory from specific logical heaps to maximize performance. A non-NUMA-aware application might just grab memory from the fastest heap first and end up with memory all over the place; but of course there might be an algorithm that could help with this, such as automatically deciding whether to restrict the process's threads to a specific kernel and CPU, a specific kernel, a group of kernels, or all kernels.
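
As a sketch of those two allocation paths, continuing the code above (current_thread_default_heap is a made-up helper that would return the fastest-first heap for whichever CPU the calling thread is on):

Code: Select all

struct logpageheap *current_thread_default_heap(void);   /* assumed */

enum alloc_policy {
    ALLOC_EXPLICIT_HEAP,    /* NUMA-aware: caller picked a heap       */
    ALLOC_LOCAL_FASTEST     /* default for NUMA-unaware processes     */
};

static uintptr_t alloc_page(enum alloc_policy policy,
                            struct logpageheap *requested)
{
    if (policy == ALLOC_EXPLICIT_HEAP && requested)
        return logpageheap_alloc(requested);

    /* NUMA-unaware path: fastest memory for the current CPU, which may
       scatter the process's pages if its threads wander across kernels */
    return logpageheap_alloc(current_thread_default_heap());
}
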

Why

You may be wondering why such a complex hierarchy of memory allocation is needed. My feeling is that, as the number of cores increases, keeping the performance benefits will eventually require a less flat memory system like the diagram above. You might not see anything with more layers than that, but I have a feeling we will see at least one more layer added. When I say "layer" I do not mean the memory is actually layered, but rather that the access latencies are layered, for two reasons: distance (because of electromagnetic limitations) and contention. As you add more CPUs, the CPUs and memory banks naturally get further apart, so one solution is to give a few CPUs a more direct connection to their closest companions while still retaining a separate memory bus for accessing the memory of other groups. This is similar to an IP network, of course without the dynamic routing -- which does pique my curiosity about the benefits of something like that between memory banks, or hot-pluggable CPUs and memory banks, but that is not relevant here. Or maybe it is more similar to an Ethernet with switches. Anyway, my point is that I do not care exactly how it is implemented, only that memory is directly addressable and that different regions of that addressable memory have different latencies from different CPUs.

Conclusion

I have a few doubts myself. The first is whether we will ever see a non-flat NUMA system, and whether any actual benefits could be reaped from one. Another fuzzy area is the handling of non-NUMA-aware applications/code/programs/processes. There has to be some algorithm (even if simple) to help these programs make the most of the system. It could be as simple as grabbing the fastest memory for whatever CPU the thread is running on, or something more complex such as automatically deciding to lock threads to specific CPUs, kernels, or groups of kernels. It may also be possible for an algorithm to migrate memory at run time, moving it from a region with lower throughput to one with higher throughput while the process/thread is running. How these algorithms would work, and whether they are even possible, is beyond my understanding at the moment.

Re: A-SMP Architecture With Non-Flat NUMA

Posted: Fri Apr 18, 2014 11:07 pm
by Brendan
Hi,
Pancakes wrote:I have a few doubts myself. The first is whether we will ever see a non-flat NUMA system, and whether any actual benefits could be reaped from one.
These already exist. For modern (larger) systems using QuickPath or HyperTransport links, there are typically 3 links per physical CPU. Because of this (and the need to use at least one link on one CPU for IO), for a 4-socket motherboard you often end up with something like this:

Code: Select all

        MEM0    MEM1
         |       |
  IO1---CPU0---CPU1
         |     / |
         |    /  |
         |   /   |
         |  /    |
         | /     |
        CPU2---CPU3---IO2
         |       |
        MEM2    MEM3
Of course, the diagram above shows physical CPUs. Each of those is likely to contain multiple cores. For example, CPU0.0 and CPU0.1 might share fast access to MEM0, have slower access to MEM1 and MEM2, and have even slower access to MEM3.
Pancakes wrote:Another fuzzy area is the handling of non-NUMA-aware applications/code/programs/processes. There has to be some algorithm (even if simple) to help these programs make the most of the system. It could be as simple as grabbing the fastest memory for whatever CPU the thread is running on, or something more complex such as automatically deciding to lock threads to specific CPUs, kernels, or groups of kernels. It may also be possible for an algorithm to migrate memory at run time, moving it from a region with lower throughput to one with higher throughput while the process/thread is running. How these algorithms would work, and whether they are even possible, is beyond my understanding at the moment.
The traditional approach for dealing with ccNUMA (regardless of how strange the layout is) is to figure out what is in each NUMA domain, and then have a grid showing the cost of accessing something in one NUMA domain from any other NUMA domain. For example, for the layout above:

Code: Select all

     0 1 2 3

  0  0 1 1 2
  1  1 0 1 1
  2  1 1 0 1
  3  2 1 1 0
In this case, you can quickly see that when something in NUMA domain 2 is accessing something in NUMA domain 3 the cost will be 1. For 80x86 (and, I assume, other platforms that use ACPI), this grid is given to you by ACPI's SLIT (System Locality Information Table).
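
As a rough C sketch (names are mine), the grid might simply be stored as a 2D array after parsing the SLIT. Note the real SLIT uses 10 for "local"; the 0/1/2 values here just mirror the table above:

Code: Select all

#define NUM_DOMAINS 4

static const unsigned char numa_cost[NUM_DOMAINS][NUM_DOMAINS] = {
    /*        to:   0  1  2  3 */
    /* from 0 */  { 0, 1, 1, 2 },
    /* from 1 */  { 1, 0, 1, 1 },
    /* from 2 */  { 1, 1, 0, 1 },
    /* from 3 */  { 2, 1, 1, 0 },
};

/* e.g. cost of domain 2 accessing domain 3: numa_cost[2][3] == 1 */
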

From this grid, it's trivial to determine a "memory search order" for each NUMA domain. For example, expand the table like this:

Code: Select all

  0  0,MEM0  1,MEM1  1,MEM2  2,MEM3
  1  1,MEM0  0,MEM1  1,MEM2  1,MEM3
  2  1,MEM0  1,MEM1  0,MEM2  1,MEM3
  3  2,MEM0  1,MEM1  1,MEM2  0,MEM3
Then sort each row (in order of cost) to get this:

Code: Select all

  0  0,MEM0  1,MEM1  1,MEM2  2,MEM3
  1  0,MEM1  1,MEM0  1,MEM2  1,MEM3
  2  0,MEM2  1,MEM0  1,MEM1  1,MEM3
  3  0,MEM3  1,MEM1  1,MEM2  2,MEM0
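In code, building one row of that sorted list could look like this (a sketch continuing the array above; a stable insertion sort keeps equal-cost domains in index order, which reproduces the rows shown):

Code: Select all

struct mem_choice {
    unsigned char cost;
    unsigned char domain;    /* which MEMn to try */
};

static void build_search_order(int from, struct mem_choice order[NUM_DOMAINS])
{
    /* pair each target domain with its cost from the 'from' domain */
    for (int i = 0; i < NUM_DOMAINS; i++) {
        order[i].cost   = numa_cost[from][i];
        order[i].domain = (unsigned char)i;
    }
    /* insertion sort by cost */
    for (int i = 1; i < NUM_DOMAINS; i++) {
        struct mem_choice key = order[i];
        int j = i - 1;
        while (j >= 0 && order[j].cost > key.cost) {
            order[j + 1] = order[j];
            j--;
        }
        order[j + 1] = key;
    }
}
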
In the same way, you'd determine a "CPU search order" for each NUMA domain. For example, assuming dual-core physical CPUs (where "CPU3.0" is the core 0 in physical chip 3):

Code: Select all

  0  0,CPU0.0  0,CPU0.1  1,CPU1.0  1,CPU1.1  1,CPU2.0  1,CPU2.1  2,CPU3.0  2,CPU3.1
  1  0,CPU1.0  0,CPU1.1  1,CPU0.0  1,CPU0.1  1,CPU2.0  1,CPU2.1  1,CPU3.0  1,CPU3.1
  2  0,CPU2.0  0,CPU2.1  1,CPU0.0  1,CPU0.1  1,CPU1.0  1,CPU1.1  1,CPU3.0  1,CPU3.1
  3  0,CPU3.0  0,CPU3.1  1,CPU1.0  1,CPU1.1  1,CPU2.0  1,CPU2.1  2,CPU0.0  2,CPU0.1
Typically each process has a "home NUMA domain" and this is used to determine which "memory search list" to use when allocating memory, and which "CPU search list" to use when scheduling. For example, if a process' home domain is 3 then you'd allocate memory from MEM3 if possible (and then try MEM1, MEM2, MEM0), and use logical CPUs in the specified order (CPU3.0, then CPU3.1, CPU1.0, CPU1.1, etc).

This means that each process is using memory and CPUs that are "fast" whenever possible (with a "least worst" fall back to slower options when there's no choice). You'll find that (for ccNUMA) this method always works regardless of how messed up things are (regardless of how many NUMA domains, the distances between them, etc); and will handle NUMA domains that have memory and no CPUs (or CPUs and no memory), and extremely rare situations where the cost is different in different directions (e.g. things in NUMA domain X can access things in NUMA domain Y quickly, but things in NUMA domain Y access things in NUMA domain X slowly).
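
Putting it together, allocation for a process then just walks the precomputed search list for its home domain (again only a sketch; alloc_page_from_domain is a made-up helper, not a real API):

Code: Select all

#include <stdint.h>

uintptr_t alloc_page_from_domain(int domain);   /* assumed */

static struct mem_choice mem_search[NUM_DOMAINS][NUM_DOMAINS];

static void init_search_lists(void)            /* run once at boot */
{
    for (int d = 0; d < NUM_DOMAINS; d++)
        build_search_order(d, mem_search[d]);
}

/* Allocate a page for a process whose home NUMA domain is 'home':
   try each domain in least-cost order, falling back as needed. */
static uintptr_t alloc_page_for(int home)
{
    for (int i = 0; i < NUM_DOMAINS; i++) {
        int domain = mem_search[home][i].domain;
        uintptr_t page = alloc_page_from_domain(domain);
        if (page)
            return page;
    }
    return 0;   /* no memory anywhere */
}
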


Cheers,

Brendan