Hi,
gaf wrote:in my opinion the quotation you posted is somewhat misleading as it creates the impression that SMP and NUMA are two competing concepts that are in a way exclusive ("SMP for up to 12 CPUs, NUMA if you want more"). In fact they however both work on a totally different level and can be combined perfectly well.
SMP can be thought of as a subset of NUMA, in that any OS that supports NUMA also supports SMP. In general there's a measurement called the "NUMA ratio", which is the ratio between the time taken to access the furthest memory and the time taken to access the closest memory, and an SMP machine can be described as a NUMA machine where the NUMA ratio is 1.
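For example (with made-up numbers), if accessing the closest memory takes 100 ns and accessing the furthest memory takes 150 ns, the NUMA ratio would be 150 / 100 = 1.5.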
The reverse is also true for 80x86 - any OS that supports SMP also supports NUMA. However there's a difference between supporting NUMA and being designed for NUMA, which relates mostly to performance.
For example, imagine you're allocating memory for a process. An OS designed for NUMA would try to allocate memory that is "close" to the CPU that the process is using, while an OS designed for SMP would just allocate any memory that is free.
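To make that concrete, here's a rough user-space sketch of the same policy using Linux's libnuma (link with -lnuma). Obviously an OS would do this inside its own physical memory manager rather than calling a library, but the idea is the same - prefer memory on the node the thread is currently running on, and fall back to "any free memory" when NUMA isn't available:

#include <numa.h>      /* libnuma - numa_available(), numa_alloc_local(), numa_free() */
#include <stdio.h>
#include <stdlib.h>

/* Allocate memory "close" to the CPU the calling thread is running on,
   falling back to plain malloc() on non-NUMA systems */
void *allocate_near_current_cpu(size_t size)
{
    if (numa_available() != -1)
        return numa_alloc_local(size);   /* NUMA-aware: use the local node */
    return malloc(size);                 /* SMP-style: any free memory will do */
}

int main(void)
{
    size_t size = 4096;
    void *p = allocate_near_current_cpu(size);
    if (p == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    /* ... use the memory ... */
    if (numa_available() != -1)
        numa_free(p, size);
    else
        free(p);
    return 0;
}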
In general, to design an OS for NUMA you need to detect the relationships between CPUs, memory ranges and I/O controllers, then manage memory, CPU time and access to I/O devices accordingly.
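As a sketch of the sort of bookkeeping involved (all names here are hypothetical, not taken from any real kernel), the detection step boils down to filling in something like this for each NUMA domain at boot, then having the memory manager, scheduler and I/O layer consult it:

#include <stdint.h>

#define MAX_NODES            64
#define MAX_CPUS_PER_NODE    8
#define MAX_RANGES_PER_NODE  4

struct memory_range {
    uint64_t base;
    uint64_t length;
};

/* Per-domain information gathered during boot */
struct numa_node {
    uint32_t node_id;
    uint32_t cpu_count;
    uint32_t cpus[MAX_CPUS_PER_NODE];                  /* APIC IDs of CPUs in this domain */
    uint32_t range_count;
    struct memory_range ranges[MAX_RANGES_PER_NODE];   /* physical memory in this domain */
    uint8_t  distance[MAX_NODES];                      /* relative distance to every other domain */
};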
The scheduler can be one of the largest differences - determining which CPU/s a process should run on. For example, if each NUMA node has a pair of dual core CPUs that support hyper-threading, then there are several "penalties" to consider. Shifting a process from one logical CPU to another logical CPU within the same CPU core has no penalties. Shifting to a different core in the same NUMA domain involves minor penalties due to the CPU's caches (as the process's code & data will be in the wrong core's caches). Shifting to a different NUMA domain is the worst, as all previously allocated physical memory will no longer be "close".
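A scheduler might rank those penalties with something like the sketch below (the topology-lookup helpers are hypothetical - they'd be filled in by whatever CPU/node detection the OS did at boot):

/* Relative cost of migrating a process from one logical CPU to another */
enum migration_cost {
    COST_NONE,     /* another logical CPU in the same core (caches are shared) */
    COST_CACHES,   /* another core in the same NUMA domain (caches go cold) */
    COST_MEMORY    /* another NUMA domain (previously allocated memory is now remote) */
};

/* Hypothetical helpers provided by the OS's topology detection */
extern int core_of(int logical_cpu);
extern int node_of(int logical_cpu);

enum migration_cost migration_penalty(int from_cpu, int to_cpu)
{
    if (core_of(from_cpu) == core_of(to_cpu))
        return COST_NONE;
    if (node_of(from_cpu) == node_of(to_cpu))
        return COST_CACHES;
    return COST_MEMORY;
}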
There are also schemes to help with IPC (including shared memory). The general idea is to keep processes that frequently communicate on the same NUMA domain to minimise memory access penalties. For example, if one process stores data in Page A for a second process to read, then at least one of the processes will have access penalties (unless they are both in the same NUMA domain).
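As a toy sketch of that placement idea (again, every helper here is hypothetical): when a new process is created mainly to talk to an existing process, prefer a CPU in that process's home domain:

/* Hypothetical helpers - home_node_of() would track where a process's memory
   lives, the other two would come from the scheduler's per-CPU load data */
extern int home_node_of(int pid);
extern int least_loaded_cpu_in_node(int node);
extern int least_loaded_cpu(void);

/* Pick a CPU for a new process; pass -1 if it has no obvious IPC partner */
int choose_cpu_for_new_process(int partner_pid)
{
    if (partner_pid >= 0)
        return least_loaded_cpu_in_node(home_node_of(partner_pid));
    return least_loaded_cpu();
}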
I guess I should also point out that while Opteron is NUMA, the NUMA ratio is close to 1 (i.e. the penalty for accessing memory that is not "close" isn't very high). I'd guess that for Opteron, an OS designed for NUMA would run roughly 10% faster than one that isn't, but you'd lose half of that due to management overhead, resulting in only 5% extra performance.
gaf wrote:Btw: Does anybody know how to detect NUMA systems ? Do the ACPI tables actually suport it ?
There are two tables - the SRAT (System Resource Affinity Table) and the SLIT (System Locality Information Table). The first one describes which resources (CPUs and memory ranges) belong to which NUMA domain, and the second one describes the "relative distance" between NUMA domains. The SRAT is older than the SLIT, so you might find an SRAT without a SLIT (but shouldn't find a SLIT without an SRAT). See the latest ACPI specifications for details (they're not in earlier versions of the standard). I'm not sure how you're meant to figure out which NUMA domain I/O controllers are in, but I have a feeling there may be more NUMA-related stuff in the interpreted firmware/AML code.
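For reference, here's roughly what the relevant layouts look like in C, going from memory of the ACPI spec - double-check the field offsets against the spec before relying on them:

#include <stdint.h>

#pragma pack(push, 1)

/* Standard 36-byte header shared by all ACPI tables */
struct acpi_table_header {
    char     signature[4];          /* "SRAT" or "SLIT" */
    uint32_t length;
    uint8_t  revision;
    uint8_t  checksum;
    char     oem_id[6];
    char     oem_table_id[8];
    uint32_t oem_revision;
    uint32_t creator_id;
    uint32_t creator_revision;
};

/* SRAT: header, 12 reserved bytes, then a packed list of entries */
struct srat_table {
    struct acpi_table_header header;
    uint32_t reserved1;             /* must be 1 for backwards compatibility */
    uint64_t reserved2;
    /* affinity entries follow */
};

/* Every SRAT entry starts with a type byte and a length byte */
struct srat_entry_header {
    uint8_t type;                   /* 0 = processor affinity, 1 = memory affinity */
    uint8_t length;
};

/* Type 0: which NUMA domain a CPU (local APIC) belongs to */
struct srat_processor_affinity {
    struct srat_entry_header h;
    uint8_t  proximity_domain_low;        /* bits 7:0 of the domain */
    uint8_t  apic_id;
    uint32_t flags;                       /* bit 0 = entry enabled */
    uint8_t  sapic_eid;
    uint8_t  proximity_domain_high[3];    /* bits 31:8 of the domain */
    uint32_t clock_domain;
};

/* Type 1: which NUMA domain a physical memory range belongs to */
struct srat_memory_affinity {
    struct srat_entry_header h;
    uint32_t proximity_domain;
    uint16_t reserved1;
    uint64_t base_address;
    uint64_t length;
    uint32_t reserved2;
    uint32_t flags;                       /* bit 0 = enabled, bit 1 = hot-pluggable */
    uint64_t reserved3;
};

/* SLIT: an N x N matrix of relative distances, where 10 means "local" */
struct slit_table {
    struct acpi_table_header header;
    uint64_t locality_count;
    uint8_t  distance[];            /* distance[i * locality_count + j] */
};

#pragma pack(pop)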
Cheers,
Brendan