OSDev.org

Posted: **Thu Mar 19, 2009 3:06 pm**

Hi all,

I was wondering if anyone had a good starting point or reference for implementing NUMA on x86 h/w?
I've done a lot of reading up on the subject of SMP vs NUMA (I personally believe that SMP and the way that multicore has been implemented in PCs so far has been a complete failure).
My reasoning revolves around the fact that 9 out of 10 algorithms are going to be constrained by memory access and the bus long before even a single core is maxed out. All algorithms or code require data in some form or another to operate on, especially in cases where you truly want to divide and conquer with multiple cores. These cases would usually be operating on large to massive sets of data. I've tested my theory out many times using multi-threaded code with core affinity, and at best I've seen a 20% increase from adding a second thread. From there on the gain per extra thread drops off even more sharply.
In any event, my understanding is that implementing NUMA would require separate memory regions assigned to each core and some sort of mapping and interconnect between cores to access memory. Is this something that is present in ALL PCs now (à la Core i7), or would NUMA only be possible using a custom machine architecture built around an x86 chip? From what I've found so far I would presume the latter.

If it is possible to implement a NUMA model for any/all new multi-core x86 chips / PCs, where would one start (i.e. getting the memory ranges for each core, how memory is allocated to cores, distances etc.)?

Thanks!
John

Posted: **Thu Mar 19, 2009 5:12 pm**

Core i7 is both NUMA and pretty much the first available for desktop use (and with that, is Intel's first to provide the capability).
Multisocket Opteron machines are NUMA too and have been around for much longer (but you don't regularly come across one). The idea is that AMD came out with HyperTransport waaay earlier than Intel did with QuickPath, and with that very method, provided the necessary support for non-uniform memory models.

In other words:

Wikipedia wrote:Current ccNUMA systems are multiprocessor systems based on the AMD Opteron, which can be implemented without external logic, and Intel Itanium, which requires the chipset to support NUMA. Examples of ccNUMA enabled chipsets are the SGI Shub (Super hub), the Intel E8870, the HP sx2000 (used in the Integrity and Superdome servers), and those found in recent NEC Itanium-based systems. Earlier ccNUMA systems such as those from Silicon Graphics were based on MIPS processors and the DEC Alpha 21364 (EV7) processor.

Intel announced NUMA introduction to its x86 and Itanium servers in late 2007 with Nehalem and Tukwila CPUs. Both CPU families will share a common socket; the interconnection is called Intel Quick Path Interconnect (QPI).

As for the actual information, there's ACPI (yuck) which can tell you about it, or you could try and grab some MSRs, which tell you a bit (only what the processor knows).

Posted: **Thu Mar 19, 2009 5:23 pm**

First, there are some calculations that are "embarrassingly parallel" that really can exploit most of the power of an SMP architecture. They are often mathematical or physical calculations, and "algorithms" (especially typical computer system algorithms) usually do not fall into this category, as you say. On the other hand, it is very efficient to run multiple, completely unrelated processes on separate CPUs of an SMP machine.

Second, the fault really isn't in the SMP architecture, I'd say. It's in the entire concept of cache, and MESI protocol for cache control. MESI needs a 5th state to actually be useful and efficient. If the memory of a system consisted entirely of superfast SRAM (with no DRAM at all) you wouldn't need caches, and SMP would work great.

johnsa wrote: would NUMA only be possible using a custom machine architecture ... ?

Yes. NUMA requires very specific, non-PC-standard hardware. It is also theoretically not chip-specific. It is possible to create a 32-core NUMA ARM machine using external logic, but as Combuster quoted -- there are several current lines of chips with extra stuff in them that make building NUMA motherboards much simpler.

johnsa wrote: where would one start (IE: getting the memory ranges for each core, how memory is allocated to cores, distances etc).

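You start with the ACPI 3.0 tables that the BIOS loads into memory during boot. They contain all that info: the SRAT maps processors and memory ranges to proximity domains, and the SLIT gives the relative distances between those domains.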
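To make the ACPI route a bit more concrete, here is a minimal sketch in C of pulling the memory ranges per node out of an SRAT and looking up a node-to-node distance in a SLIT. It assumes you have already located both tables (via the RSDT/XSDT) and mapped them as byte buffers. The field offsets are my reading of the ACPI 3.0 layout and should be checked against the spec before you rely on them. The names `numa_mem_range`, `srat_memory_ranges` and `slit_distance` are made up for illustration, and checksum validation is left out.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Offsets below are assumptions based on the ACPI 3.0 table layouts. */
#define SRAT_HEADER_LEN  48u   /* 36-byte ACPI header + 12 reserved bytes */
#define SRAT_TYPE_MEMORY 1u    /* Memory Affinity Structure */
#define SRAT_MEM_LEN     40u   /* size of a Memory Affinity Structure */
#define SRAT_MEM_ENABLED 0x1u  /* flags bit 0: entry is in use */
#define SLIT_HEADER_LEN  44u   /* 36-byte ACPI header + 8-byte locality count */

/* Hypothetical result type: one memory range owned by one NUMA node. */
struct numa_mem_range {
    uint32_t domain;  /* proximity domain (node id) */
    uint64_t base;    /* physical base address */
    uint64_t length;  /* size in bytes */
};

/* Little-endian reads that do not depend on alignment. */
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static uint64_t rd64(const uint8_t *p)
{
    return (uint64_t)rd32(p) | (uint64_t)rd32(p + 4) << 32;
}

/* Collect enabled memory ranges from an SRAT; returns how many were stored. */
size_t srat_memory_ranges(const uint8_t *srat,
                          struct numa_mem_range *out, size_t max_out)
{
    if (memcmp(srat, "SRAT", 4) != 0)
        return 0;
    uint32_t total = rd32(srat + 4);   /* table length from the ACPI header */
    size_t off = SRAT_HEADER_LEN, n = 0;

    while (off + 2 <= total) {
        uint8_t type = srat[off];
        uint8_t len = srat[off + 1];
        if (len < 2 || off + len > total)
            break;                     /* malformed entry, stop walking */
        if (type == SRAT_TYPE_MEMORY && len >= SRAT_MEM_LEN &&
            (rd32(srat + off + 28) & SRAT_MEM_ENABLED) && n < max_out) {
            out[n].domain = rd32(srat + off + 2);
            out[n].base = rd64(srat + off + 8);
            out[n].length = rd64(srat + off + 16);
            n++;
        }
        off += len;                    /* other entry types are skipped */
    }
    return n;
}

/* Relative distance between two domains from a SLIT, or -1 on error. */
int slit_distance(const uint8_t *slit, uint64_t from, uint64_t to)
{
    if (memcmp(slit, "SLIT", 4) != 0)
        return -1;
    uint32_t total = rd32(slit + 4);
    if (total < SLIT_HEADER_LEN)
        return -1;
    uint64_t n = rd64(slit + 36);      /* number of system localities */
    if (n > UINT32_MAX || from >= n || to >= n)
        return -1;
    uint64_t idx = from * n + to;      /* row-major N x N byte matrix */
    if (idx >= total - SLIT_HEADER_LEN)
        return -1;
    return slit[SLIT_HEADER_LEN + idx];
}
```

A SLIT distance of 10 means "local", and larger values are relative to that, so a value of 20 reads as roughly twice the local cost. A physical memory allocator could keep one free list per `domain` returned by `srat_memory_ranges` and prefer the list whose distance to the allocating CPU's domain is smallest. Processor affinity entries (type 0) are skipped here, so mapping APIC IDs to domains would need the same walk with a second entry type.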

Posted: **Fri Mar 20, 2009 4:07 am**

Combuster wrote:ACPI (yuck)

Since ACPI is about the only modern, standard way of getting information on the PC, you should not let people (esp. newbies) think ACPI is somehow not ok, imho.

JAL

Posted: **Sat Mar 21, 2009 6:52 pm**

ACPI is somehow not ok

Something with 30% and broken support springs to mind... Like last time I put a soundcard into a dual-socket server box. Consequently Windows complained about ACPI and refused to start. Writing good ACPI support = having no life.

NUMA on x86
