I've recently started looking at adding NUMA support to my kernel. I've managed to simulate a NUMA system by passing fudged SRAT and SLIT tables to my kernel through my bootloader (the system is actually a normal UMA SMP system). Now I'm trying to figure out how to implement NUMA in a way that actually improves performance.
My first idea is to keep everything the way it is, except change my memory manager and scheduler so they keep most of a program's memory in the same domain and schedule that program on the CPUs in that domain. AFAIK that is how Linux does it.
My other idea has to do with the design of my operating system. It uses software protection (like Singularity), JIT compilation, and is designed to run on distributed systems. Each program consists of modules. From the programmer's perspective these modules are on the same system, and communication is done through normal function calls. However, since it's a distributed system, there are times when the modules might be running on other computers. In this case the calls between the modules are hooked by the kernel, which serializes the arguments and sends a network message.
I was thinking about using this existing message-passing framework on NUMA domains. Modules would make their normal IPC calls, but instead of a network message, the message would be sent using shared memory and IPIs between the domains. There would be separate copies of the kernel running on each domain with entirely separate data. Communication would only occur when necessary, reducing bus traffic to the bare minimum.
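As a concrete illustration, the cross-domain channel could be sketched as a ring buffer in memory visible to both domains. All names here are hypothetical, and the `pending` flag stands in for the IPI a real kernel would send to wake the receiving domain:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical cross-domain channel: a fixed-size ring buffer that would
 * live in memory reachable from both NUMA domains. In a real kernel,
 * enqueueing would be followed by an IPI to the receiving domain; here a
 * simple "pending" flag stands in for that notification. */
#define CHAN_SLOTS 16
#define MSG_BYTES  64

struct domain_channel {
    volatile uint32_t head;      /* next slot the receiver reads  */
    volatile uint32_t tail;      /* next slot the sender writes   */
    volatile int      pending;   /* stand-in for an IPI           */
    uint8_t           msg[CHAN_SLOTS][MSG_BYTES];
};

/* Sender side (domain A): returns 0 on success, -1 if the ring is full. */
int chan_send(struct domain_channel *c, const void *buf, size_t len)
{
    if (len > MSG_BYTES)
        return -1;
    uint32_t next = (c->tail + 1) % CHAN_SLOTS;
    if (next == c->head)         /* ring full */
        return -1;
    memcpy(c->msg[c->tail], buf, len);
    c->tail = next;
    c->pending = 1;              /* real kernel: send IPI to domain B here */
    return 0;
}

/* Receiver side (domain B): returns 0 on success, -1 if the ring is empty. */
int chan_recv(struct domain_channel *c, void *buf, size_t len)
{
    if (c->head == c->tail)      /* ring empty */
        return -1;
    memcpy(buf, c->msg[c->head], len > MSG_BYTES ? MSG_BYTES : len);
    c->head = (c->head + 1) % CHAN_SLOTS;
    return 0;
}
```

This is a single-producer/single-consumer sketch; a real implementation would also need the appropriate memory barriers for the target architecture.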
Which option is better?
Implementing NUMA
Re: Implementing NUMA
Hi,
JohnnyTheDon wrote:I've recently started looking at adding NUMA support to my kernel. I've managed to simulate a NUMA system by passing fudged SRAT and SLIT tables to my kernel through my bootloader (the system is actually a normal UMA SMP system). Now I'm trying to figure out how to implement NUMA in a way that actually improves performance.

Some ideas from my OS design:
- Processes are assigned a default NUMA domain, and initially they're restricted (via CPU affinity) to their default NUMA domain. The default domain for a process is either auto-selected (the least loaded NUMA domain) or explicitly requested by whoever is starting the process.
- Processes can ask to increase the number of domains they're using, and the kernel will find suitable domains for the process to expand into (based on current load, and distance from the default domain using information from the SLIT). In this case the CPU affinity for a process is increased. The CPU affinity for a process can not be decreased, and a process can't ask to decrease the number of domains it's using.
- When a new thread is started, the thread's CPU affinity will be the same as the current CPU affinity for the process. Threads can change their CPU affinity (increase or decrease the number of CPUs they're restricted to) but this is always limited by the current CPU affinity for the process.
- When a thread becomes "ready to run" the scheduler chooses a CPU for it to run on based on current CPU load, which CPU the thread used last time, and the thread's CPU affinity, without caring about NUMA.
- When memory is allocated by a thread, the memory manager finds pages that are "closest" to the NUMA domain that the thread happens to be running in. For example, a thread running in domain #3 would be given memory from domain #3, but if there's no free memory left in domain #3 it'd be given memory from the next closest domain (using information from the SLIT). This means that threads may end up allocating memory from any domain.
- There's a "memory checker" that optimizes page usage (e.g. replaces "distant" pages with "closer" pages if possible) and monitors page usage (keeping track of which pages are used most often, to determine which pages to send to swap space if necessary).
- Device drivers are (mostly) normal processes. When they're started their default NUMA domain is the closest NUMA domain to the device. For example, the OS tries to avoid having a device driver running in domain #3 when the device is connected to an I/O hub in domain #0.
- The OS tries to make sure that IRQs are sent to CPUs that belong to the closest NUMA domain to the device (which is hopefully where the device driver is running too). The idea here (for both IRQ routing and device driver default domains) is to minimize the amount of unnecessary bus traffic involved with handling devices.
- For kernel data that is normally only read and not modified (including kernel code), a different copy of the data is used in each domain (different physical pages mapped at the same virtual addresses). For example, if there's 4 NUMA domains then there's 4 copies of the kernel's code and read-only data. This means that the kernel's code and read-only data can always be accessed from the copy in "close" memory. Note: for an OS like Linux, there's one copy of the kernel where most of the kernel's code and data ends up in the same NUMA domain (the "golden domain"). This means that CPUs in one NUMA domain perform better than CPUs in all other NUMA domains, because CPUs in other NUMA domains always end up with high latency access to "distant" kernel code and data.
- For dynamically allocated pages used for kernel data, the memory manager finds pages that are "closest" to the NUMA domain that the thread happens to be running in. This means that if data is used from all NUMA domains you end up using pages from any NUMA domain (it's much more random), which helps to avoid the "golden domain" problem, while kernel data that's only used from a few domains (e.g. domains that contain I/O hubs) uses only memory that's close to those domains.
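The "closest domain with free memory" selection used in the allocation points above can be sketched roughly as follows. The names and the fixed-size SLIT matrix are illustrative, not from any real kernel; ACPI's SLIT encodes relative distances, with 10 meaning "local":

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical NUMA-aware domain selection: given the SLIT distance
 * matrix and per-domain free-page counts, pick the domain closest to
 * the one the thread is currently running in that still has free pages. */
#define MAX_DOMAINS 8

int pick_alloc_domain(const uint8_t slit[MAX_DOMAINS][MAX_DOMAINS],
                      const size_t free_pages[MAX_DOMAINS],
                      int ndomains, int current)
{
    int best = -1;
    uint8_t best_dist = UINT8_MAX;
    for (int d = 0; d < ndomains; d++) {
        if (free_pages[d] == 0)
            continue;                        /* nothing left in this domain */
        if (slit[current][d] < best_dist) {  /* prefer the closest domain   */
            best_dist = slit[current][d];
            best = d;
        }
    }
    return best;  /* -1 means the whole machine is out of memory */
}
```

With a two-domain SLIT of {{10, 20}, {20, 10}}, a thread in domain 0 gets local pages while domain 0 has any, and silently falls back to domain 1 once domain 0 is exhausted.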
JohnnyTheDon wrote:My other idea has to do with the design of my operating system. It uses software protection (like Singularity), JIT compilation, and is designed to run on distributed systems. Each program consists of modules. From the programmer's perspective these modules are on the same system, and communication is done through normal function calls. However, since it's a distributed system, there are times when the modules might be running on other computers. In this case the calls between the modules are hooked by the kernel, which serializes the arguments and sends a network message.

I was thinking about using this existing message-passing framework on NUMA domains. Modules would make their normal IPC calls, but instead of a network message, the message would be sent using shared memory and IPIs between the domains. There would be separate copies of the kernel running on each domain with entirely separate data. Communication would only occur when necessary, reducing bus traffic to the bare minimum.

That should work fine.
I'll refrain from providing my general opinion of synchronous IPC though, as it'd be off-topic...
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Implementing NUMA
Brendan wrote:For kernel data that is normally only read and not modified (including kernel code), a different copy of the data is used in each domain (different physical pages mapped at the same virtual addresses). For example, if there's 4 NUMA domains then there's 4 copies of the kernel's code and read-only data. This means that the kernel's code and read-only data can always be accessed from the copy in "close" memory. Note: for an OS like Linux, there's one copy of the kernel where most of the kernel's code and data ends up in the same NUMA domain (the "golden domain"). This means that CPUs in one NUMA domain perform better than CPUs in all other NUMA domains, because CPUs in other NUMA domains always end up with high latency access to "distant" kernel code and data.

Do you have any data (or research papers) on how having copies of read-only data in multiple domains affects performance? I would guess that this would vary a lot depending on the workload. If your workload doesn't access too much memory then the hot sections of the kernel will probably be in the L3 cache and you won't have much of a penalty regardless of the domain the memory is from. On the other hand, if it's memory intensive then when you enter the kernel you'll probably have to fetch the data from external domains.
Re: Implementing NUMA
Hi,
thooot wrote:Do you have any data (or research papers) on how having copies of read-only data in multiple domains affects performance? I would guess that this would vary a lot depending on the workload. If your workload doesn't access too much memory then the hot sections of the kernel will probably be in the L3 cache and you won't have much of a penalty regardless of the domain the memory is from. On the other hand, if it's memory intensive then when you enter the kernel you'll probably have to fetch the data from external domains.

I don't have any data, and I've never seen any research papers or any other OS that implements it (but this doesn't necessarily mean that nobody else has implemented it, or that research papers don't exist).
However, in the worst case (everything remains in the cache anyway) the only disadvantage is that it wastes some RAM. My OS is designed as a micro-kernel, and the amount of RAM used would be roughly 128 KiB per NUMA domain (excluding the first 128 KiB - the data has to be somewhere even if there are no copies). IMHO this is negligible, considering that NUMA systems tend to have at least 1 GiB of RAM per NUMA domain.
Best case would be an OS running processes that constantly touch several MiB of RAM (and continually push least recently used data out of the caches). In this case, in a large NUMA system (e.g. an 8-way system where some CPUs are "2 hops" away from others) fetching kernel code and data from RAM could be several times faster, and it might improve overall performance by 50%.
In practice, the performance would be somewhere between these extremes (and depend on a large number of things). As a very rough estimate I'd expect that for typical/average usage there'd be a small/modest improvement (e.g. maybe between 0.1% to 3% faster overall performance), but in specific cases (e.g. a computer being used as part of a render farm) I'd also expect much larger improvements.
Also note that modern CPUs have large caches, but these caches are shared by multiple cores and/or multiple logical CPUs. For example, current Core i7 CPUs have a huge 16 MiB L3 cache (the largest I could find), but that huge L3 cache is shared by 8 logical CPUs, so it's closer to 2 MiB per CPU. A process touching 2 MiB of data or more (and pushing kernel code and data out of the cache) isn't too unlikely, and there are some things my kernel does (e.g. "page cleaning") where one part of the kernel can push the rest of the kernel's code and data out of the caches.
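The reasoning above reduces to a back-of-envelope model (the figures are the ones quoted in this thread, not measurements, and "hot kernel size" is an illustrative parameter):

```c
#include <stdbool.h>

/* Effective cache share per logical CPU when the L3 is shared:
 * e.g. 16 MiB (16384 KiB) shared by 8 logical CPUs -> 2048 KiB each. */
static unsigned per_cpu_cache_kib(unsigned shared_l3_kib, unsigned logical_cpus)
{
    return shared_l3_kib / logical_cpus;
}

/* Rough model of the eviction argument: if a thread's working set plus
 * the kernel's hot code/data exceed the per-CPU cache share, the least
 * recently used kernel lines get pushed out to RAM. */
static bool kernel_likely_evicted(unsigned working_set_kib,
                                  unsigned kernel_hot_kib,
                                  unsigned cache_share_kib)
{
    return working_set_kib + kernel_hot_kib > cache_share_kib;
}
```

By this model a process touching 2 MiB of data on an 8-thread/16 MiB part already crowds out a 128 KiB kernel, which is the scenario where per-domain kernel copies would pay off.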
Cheers,
Brendan
Re: Implementing NUMA
Brendan wrote:I don't have any data, and I've never seen any research papers or any other OS that implements it (but this doesn't necessarily mean that nobody else has implemented it or that research papers don't exist).

I believe that Linux does this and, IIRC, has since the 2.4 days. There is no way it would be able to scale up to 1k+ nodes if it didn't. Is there a particular reason you think it doesn't?
Re: Implementing NUMA
Hi,
MooCow wrote:I believe that Linux does this and, IIRC, has since the 2.4 days. There is no way it would be able to scale up to 1k+ nodes if it didn't. Is there a particular reason you think it doesn't?

I couldn't find anything to suggest that Linux does this (e.g. multiple copies of the kernel's code). Is there a particular reason you think it does?
Note: Linux doesn't scale up to 1k+ NUMA nodes in the same computer. You're probably getting confused with 1k+ separate computers in a distributed cluster.
Cheers,
Brendan
Re: Implementing NUMA
Brendan wrote:I couldn't find anything to suggest that Linux does this (e.g. multiple copies of the kernel's code). Is there a particular reason you think it does?

Well, the first thing that made me think it did was the kernel documentation dating back to 1999 suggesting that it is a good idea. But after your reply I started looking into it more and couldn't find code that would prove it. So I turned to Google, and the first 5 hits for "linux kernel text replicate" enlightened me a bit. It seems kernel text replication is still not mainstream for x86/x64 (that I can see), even though there was a patch for it in 2003. How sad. Or maybe I'm not looking hard enough...
Brendan wrote:Note: Linux doesn't scale up to 1k+ NUMA nodes in the same computer. You're probably getting confused with 1k+ separate computers in a distributed cluster.

Yes, and no. I'm also thinking of SGI's Altix, which can scale up to 512 sockets in one address space, and can then distribute to up to 10k.
Regards