NUMA on x86
Posted: Thu Mar 19, 2009 3:06 pm
Hi all,
I was wondering if anyone had a good starting point or reference for implementing NUMA on x86 h/w?
I've done a lot of reading up on the subject of SMP vs NUMA (I personally believe that SMP, and the way multicore has been implemented in PCs so far, have been a complete failure).
My reasoning revolves around the fact that nine out of ten algorithms are going to be constrained by memory access and bus bandwidth long before even a single core is maxed out. All algorithms or code require data in some form or another to operate on, especially in cases where you truly want to divide and conquer with multiple cores, and those cases usually involve operating on large to massive data sets. I've tested my theory out many times using multi-threaded code with core affinity, and at best I've seen a 20% increase from adding a second thread; beyond that the returns diminish even more sharply.
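For reference, the kind of test I mean is roughly the following (a simplified Linux/pthreads sketch, not my exact code; vary NTHREADS and compare wall-clock times):

Code:
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES (512UL * 1024 * 1024)   /* large enough to defeat the caches */
#define NTHREADS  2                       /* vary this: 1, 2, 4, ... */

static uint8_t *buf;
static volatile uint64_t sink;

static void *worker(void *arg)
{
    long id = (long)arg;

    /* pin this thread to core 'id' so the scheduler can't migrate it */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(id, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* stream through this thread's slice of the buffer */
    uint8_t *p = buf + id * (BUF_BYTES / NTHREADS);
    uint64_t sum = 0;
    for (size_t i = 0; i < BUF_BYTES / NTHREADS; i++)
        sum += p[i];
    sink = sum;                           /* keep the work from being optimized away */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    struct timespec a, b;

    buf = malloc(BUF_BYTES);
    if (!buf)
        return 1;

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);

    printf("%d threads: %.3f s\n", NTHREADS,
           (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9);
    free(buf);
    return 0;
}

That's the sort of run where I see the ~20% figure I mentioned.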
In any event, my understanding is that implementing NUMA would require separate memory regions assigned to each core and some sort of mapping and interconnect between cores for accessing memory. Is this something that is present in ALL PCs now (a la Core i7), or would NUMA only be possible using a custom machine architecture built around an x86 chip? From what I've found so far I would presume the latter.
If it is possible to implement a NUMA model on any/all new multi-core x86 chips/PCs, where would one start (i.e. getting the memory ranges for each core, how memory is allocated to cores, distances, etc.)?
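For what it's worth, from my digging so far it looks like the ACPI SRAT (System Resource Affinity Table) is what maps APIC IDs and physical memory ranges to proximity domains (NUMA nodes), and the SLIT table holds the node-to-node distance matrix. Below is a rough, untested sketch of walking the SRAT entries; it assumes you've already located the "SRAT" table through the RSDT/XSDT, and the field layouts are taken from the ACPI spec.

Code:
#include <stdint.h>

struct acpi_sdt_header {
    char     signature[4];          /* "SRAT" */
    uint32_t length;                /* total table length in bytes */
    uint8_t  revision;
    uint8_t  checksum;
    char     oem_id[6];
    char     oem_table_id[8];
    uint32_t oem_revision;
    uint32_t creator_id;
    uint32_t creator_revision;
} __attribute__((packed));

struct srat_cpu_affinity {          /* entry type 0 */
    uint8_t  type, length;
    uint8_t  proximity_domain_lo;   /* low 8 bits of the domain */
    uint8_t  apic_id;
    uint32_t flags;                 /* bit 0 = entry enabled */
    uint8_t  sapic_eid;
    uint8_t  proximity_domain_hi[3];
    uint32_t clock_domain;
} __attribute__((packed));

struct srat_mem_affinity {          /* entry type 1 */
    uint8_t  type, length;
    uint32_t proximity_domain;
    uint16_t reserved1;
    uint64_t base_address;          /* physical base of the range */
    uint64_t range_length;
    uint32_t reserved2;
    uint32_t flags;                 /* bit 0 = entry enabled */
    uint64_t reserved3;
} __attribute__((packed));

#define MAX_CPUS   64
#define MAX_RANGES 64

static uint32_t cpu_node[MAX_CPUS];                  /* APIC id -> domain */
static struct { uint64_t base, len; uint32_t node; } mem_range[MAX_RANGES];
static int nranges;

void parse_srat(struct acpi_sdt_header *srat)
{
    /* the 36-byte header is followed by 12 reserved bytes, then entries */
    uint8_t *p   = (uint8_t *)srat + sizeof(*srat) + 12;
    uint8_t *end = (uint8_t *)srat + srat->length;

    while (p < end) {
        if (p[0] == 0) {                             /* cpu -> domain */
            struct srat_cpu_affinity *c = (void *)p;
            if ((c->flags & 1) && c->apic_id < MAX_CPUS)
                cpu_node[c->apic_id] = c->proximity_domain_lo; /* ignores hi bits */
        } else if (p[0] == 1) {                      /* memory range -> domain */
            struct srat_mem_affinity *m = (void *)p;
            if ((m->flags & 1) && nranges < MAX_RANGES) {
                mem_range[nranges].base = m->base_address;
                mem_range[nranges].len  = m->range_length;
                mem_range[nranges].node = m->proximity_domain;
                nranges++;
            }
        }                                            /* other entry types skipped */
        p += p[1];                                   /* p[1] = entry length */
    }
}

The SLIT table (if present) would then give the relative distance between each pair of proximity domains, which I assume is where the "distances" part comes from. Does that sound like the right starting point, or am I missing a piece?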
Thanks!
John