Hi,
gaf wrote:in my opinion the quotation you posted is somewhat misleading as it creates the impression that SMP and NUMA are two competing concepts that are in a way exclusive ("SMP for up to 12 CPUs, NUMA if you want more"). In fact they however both work on a totally different level and can be combined perfectly well.
SMP can be thought of as a subset of NUMA, in that any OS that supports NUMA also supports SMP. In general there's a measurement called the "NUMA ratio", which is the ratio between the time taken to access the furthest memory and the time taken to access the closest memory, and an SMP machine can be described as a NUMA machine where the NUMA ratio is 1.
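For example (with made-up numbers), if accessing the closest memory takes 100 ns and accessing the furthest memory takes 150 ns, the NUMA ratio would be 150 / 100 = 1.5.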
The reverse is also true for 80x86 - any OS that supports SMP also supports NUMA. However there's a difference between supporting NUMA and being designed for NUMA, which relates mostly to performance.
For example, imagine you're allocating memory for a process. An OS designed for NUMA would try to allocate memory that is "close" to the CPU that the process is using, while an OS designed for SMP would just allocate any memory that is free.
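To make that concrete, here's a rough user-space sketch of the same policy using Linux's libnuma (link with -lnuma). Obviously an OS would do this inside its own physical memory manager rather than calling a library, but the idea is the same - prefer memory on the node the thread is currently running on, and fall back to "any free memory" when NUMA isn't available:

#include <numa.h>      /* libnuma - numa_available(), numa_alloc_local(), numa_free() */
#include <stdio.h>
#include <stdlib.h>

/* Allocate memory "close" to the CPU the calling thread is running on,
   falling back to plain malloc() on non-NUMA systems */
void *allocate_near_current_cpu(size_t size)
{
    if (numa_available() != -1)
        return numa_alloc_local(size);   /* NUMA-aware: use the local node */
    return malloc(size);                 /* SMP-style: any free memory will do */
}

int main(void)
{
    size_t size = 4096;
    void *p = allocate_near_current_cpu(size);
    if (p == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    /* ... use the memory ... */
    if (numa_available() != -1)
        numa_free(p, size);
    else
        free(p);
    return 0;
}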
In general, to design an OS for NUMA you need to detect the relationships between CPUs, memory ranges and I/O controllers, then manage memory, CPU time and access to I/O devices accordingly.
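As a sketch of the sort of bookkeeping involved (all names here are hypothetical, not taken from any real kernel), the detection step boils down to filling in something like this for each NUMA domain at boot, then having the memory manager, scheduler and I/O layer consult it:

#include <stdint.h>

#define MAX_NODES            64
#define MAX_CPUS_PER_NODE    8
#define MAX_RANGES_PER_NODE  4

struct memory_range {
    uint64_t base;
    uint64_t length;
};

/* Per-domain information gathered during boot */
struct numa_node {
    uint32_t node_id;
    uint32_t cpu_count;
    uint32_t cpus[MAX_CPUS_PER_NODE];                  /* APIC IDs of CPUs in this domain */
    uint32_t range_count;
    struct memory_range ranges[MAX_RANGES_PER_NODE];   /* physical memory in this domain */
    uint8_t  distance[MAX_NODES];                      /* relative distance to every other domain */
};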
The scheduler can be one of the largest differences - determining which CPU/s a process should run on. For example, if each NUMA node has a pair of dual core CPUs that support hyper-threading, then there are several "penalties" to consider. Shifting a process from one logical CPU to another logical CPU within the same CPU core has no penalties. Shifting to a different core in the same NUMA domain involves minor penalties due to the CPU's caches (as the process's code & data will be in the wrong core's caches). Shifting to a different NUMA domain is the worst, as all previously allocated physical memory will no longer be "close".
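A scheduler might rank those penalties with something like the sketch below (the topology-lookup helpers are hypothetical - they'd be filled in by whatever CPU/node detection the OS did at boot):

/* Relative cost of migrating a process from one logical CPU to another */
enum migration_cost {
    COST_NONE,     /* another logical CPU in the same core (caches are shared) */
    COST_CACHES,   /* another core in the same NUMA domain (caches go cold) */
    COST_MEMORY    /* another NUMA domain (previously allocated memory is now remote) */
};

/* Hypothetical helpers provided by the OS's topology detection */
extern int core_of(int logical_cpu);
extern int node_of(int logical_cpu);

enum migration_cost migration_penalty(int from_cpu, int to_cpu)
{
    if (core_of(from_cpu) == core_of(to_cpu))
        return COST_NONE;
    if (node_of(from_cpu) == node_of(to_cpu))
        return COST_CACHES;
    return COST_MEMORY;
}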
There are also schemes to help with IPC (including shared memory). The general idea is to keep processes that frequently communicate on the same NUMA domain to minimise memory access penalties. For example, if one process stores data in Page A for a second process to read, then at least one of the processes will have access penalties (unless they are both in the same NUMA domain).
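As a toy sketch of that placement idea (again, every helper here is hypothetical): when a new process is created mainly to talk to an existing process, prefer a CPU in that process's home domain:

/* Hypothetical helpers - home_node_of() would track where a process's memory
   lives, the other two would come from the scheduler's per-CPU load data */
extern int home_node_of(int pid);
extern int least_loaded_cpu_in_node(int node);
extern int least_loaded_cpu(void);

/* Pick a CPU for a new process; pass -1 if it has no obvious IPC partner */
int choose_cpu_for_new_process(int partner_pid)
{
    if (partner_pid >= 0)
        return least_loaded_cpu_in_node(home_node_of(partner_pid));
    return least_loaded_cpu();
}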
I guess I should also point out that while Opteron is NUMA, the NUMA ratio is close to 1 (i.e. the penalty for accessing memory that is not "close" isn't very high). I'd guess that for Opteron, an OS designed for NUMA would run roughly 10% faster than one that isn't, but you'd lose half of that due to management overhead, resulting in only 5% extra performance.
gaf wrote:Btw: Does anybody know how to detect NUMA systems ? Do the ACPI tables actually suport it ?
There are two tables - the SRAT (System Resource Affinity Table) and the SLIT (System Locality Information Table). The first one describes which resources (CPUs and memory ranges) belong to which NUMA domain, and the second one describes the "relative distance" between NUMA domains. The SRAT is older than the SLIT, so you might find an SRAT without a SLIT (but shouldn't find a SLIT without an SRAT). See the latest ACPI specifications for details (they're not in earlier versions of the standard). I'm not sure how you're meant to figure out which NUMA domain I/O controllers are in, but I have a feeling there may be more NUMA-related stuff in the interpreted firmware/AML code.
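For reference, here's roughly what the relevant layouts look like in C, going from memory of the ACPI spec - double-check the field offsets against the spec before relying on them:

#include <stdint.h>

#pragma pack(push, 1)

/* Standard 36-byte header shared by all ACPI tables */
struct acpi_table_header {
    char     signature[4];          /* "SRAT" or "SLIT" */
    uint32_t length;
    uint8_t  revision;
    uint8_t  checksum;
    char     oem_id[6];
    char     oem_table_id[8];
    uint32_t oem_revision;
    uint32_t creator_id;
    uint32_t creator_revision;
};

/* SRAT: header, 12 reserved bytes, then a packed list of entries */
struct srat_table {
    struct acpi_table_header header;
    uint32_t reserved1;             /* must be 1 for backwards compatibility */
    uint64_t reserved2;
    /* affinity entries follow */
};

/* Every SRAT entry starts with a type byte and a length byte */
struct srat_entry_header {
    uint8_t type;                   /* 0 = processor affinity, 1 = memory affinity */
    uint8_t length;
};

/* Type 0: which NUMA domain a CPU (local APIC) belongs to */
struct srat_processor_affinity {
    struct srat_entry_header h;
    uint8_t  proximity_domain_low;        /* bits 7:0 of the domain */
    uint8_t  apic_id;
    uint32_t flags;                       /* bit 0 = entry enabled */
    uint8_t  sapic_eid;
    uint8_t  proximity_domain_high[3];    /* bits 31:8 of the domain */
    uint32_t clock_domain;
};

/* Type 1: which NUMA domain a physical memory range belongs to */
struct srat_memory_affinity {
    struct srat_entry_header h;
    uint32_t proximity_domain;
    uint16_t reserved1;
    uint64_t base_address;
    uint64_t length;
    uint32_t reserved2;
    uint32_t flags;                       /* bit 0 = enabled, bit 1 = hot-pluggable */
    uint64_t reserved3;
};

/* SLIT: an N x N matrix of relative distances, where 10 means "local" */
struct slit_table {
    struct acpi_table_header header;
    uint64_t locality_count;
    uint8_t  distance[];            /* distance[i * locality_count + j] */
};

#pragma pack(pop)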
Cheers,
Brendan