Hi, I have four 1.90 GHz twelve-core processors: 48 cores @ 1.90 GHz.
It would appear that the CPUs are in groups of 8, and when SYSCALL is called at the same time within the same group the processor shares the load.
I am in 64 bit long mode.
I am testing SYSCALL and SYSRET performance.
I am calling SYSCALL from user code in a loop; the SYSCALL Kernel code increases a counter for the CPU and returns.
If I just have Core #0 running, it processes 10.6 million SYSCALLs per second.
If I have Core #0 and Core #8 running, each processes 10.6 million per second.
The same goes for Core #0, Core #8, Core #16, Core #24, Core #32 and Core #40: all 10.6 million per second, giving me around 60 million per second in total.
But if I run Core #0 and Core #1, both run at 5.1 million per second.
It appears as if the CPUs are grouped: 0..7, 8..15, 16..23, 24..31, 32..39, 40..47.
The performance drops as more cores are added. If Core #0 through Core #7 are all running, then each manages 700,000 per second.
This gives me around 10.2 million for the 0-7 group, very close to the single core at 10.6 million.
If Core #0 through Core #7 are running as above at 700K/sec, and Core #8 is the only core running in the Core #8 through Core #15 group, it runs at 10.6 million per second.
So it appears that the cores are grouped and that one group does not affect the others.
Any ideas on what the system is doing?
Many thanks. Alistair
SYSCALL Performance multicores
Re: SYSCALL Performance multicores
Hi,
tsdnz wrote: the SYSCALL Kernel code increases a counter for the CPU and returns.
Yes; the cache line that the counter is in is probably bouncing between cores. When a pair of logical CPUs share the same L1 cache (e.g. hyper-threading) the cache line doesn't need to move at all and it's fast. When a pair of logical CPUs aren't sharing caches the cache line has to be transferred and it's "less fast".
tsdnz wrote: Any ideas on what the system is doing?
For four 12-core CPUs you could actually get up to 4 or more different speeds for different pairs, corresponding to:
- All caches shared between both CPUs (hyper-threading)
- L2 shared between CPUs (different cores in same module)
- L3 shared between CPUs (different cores in different modules of same chip)
- No caches shared between CPUs (different chips, and possibly different penalties depending on how many "hops" between NUMA domains)
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: SYSCALL Performance multicores
Back from picking my daughter up from swimming.
I have a different counter for each CPU.
Thinking at swimming... 8 x sizeof(QWORD) = cache line size, so the counters for 8 consecutive CPUs sit in the same cache line.
I have assembly code that is called as the SYSCALL entry point and then passes control to the correct routine.
This code uses (QWORD*) + CPU->Index as an exception handler just for SYSCALL.
I have set up the structure wrong, just as you have outlined.
I will try this tonight and post the results.
Many thanks for your time Brendan
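The grouping described above can be checked with a little arithmetic. The following is a minimal sketch, assuming a 64-byte cache line, a counter array that starts on a line boundary, and one 8-byte counter per CPU packed back to back; the function and macro names are hypothetical and not taken from the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u  /* assumed line size; query the CPU on real hardware */
#define NUM_CPUS        48u  /* core count from the first post */

/* Index of the cache line (relative to the start of a line-aligned,
   packed array of 8-byte counters) that holds the counter for cpu_index. */
static size_t counter_cache_line(size_t cpu_index)
{
    assert(cpu_index < NUM_CPUS);
    return (cpu_index * sizeof(uint64_t)) / CACHE_LINE_SIZE;
}

/* Nonzero if the counters of two CPUs fall in the same cache line,
   which means every increment on one CPU invalidates the other's copy. */
static int counters_share_line(size_t cpu_a, size_t cpu_b)
{
    return counter_cache_line(cpu_a) == counter_cache_line(cpu_b);
}
```

Under these assumptions CPUs 0..7 map to line 0, CPUs 8..15 to line 1, and so on up to line 5 for CPUs 40..47, which matches the groups seen in the measurements.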
Simple solution (if you want to avoid this) is to have a different counter for each CPU. For example, you could use "inc ebx" and then afterwards do "lock add [counter],ebx" on each CPU (instead of doing "lock inc [counter]").
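As a rough illustration of that suggestion, here is a sketch in C using C11 atomics in place of the assembly. The `local_counter` struct stands in for the register (`ebx` in the quote) and `publish` stands in for the single `lock add`; all names here are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Shared total. Each atomic update pulls its cache line to the updating CPU,
   so it should be touched as rarely as possible. */
static _Atomic uint64_t total_count;

/* Private count for one CPU; stands in for a register such as ebx. */
typedef struct {
    uint64_t pending;
} local_counter;

/* Per-event work: touches only private state, no shared cache line. */
static void count_event(local_counter *lc)
{
    assert(lc != NULL);
    lc->pending++;
}

/* Occasional flush: one atomic add for many events, then reset. */
static void publish(local_counter *lc)
{
    assert(lc != NULL);
    atomic_fetch_add_explicit(&total_count, lc->pending, memory_order_relaxed);
    lc->pending = 0;
}

/* Read the shared total. */
static uint64_t read_total(void)
{
    return atomic_load_explicit(&total_count, memory_order_relaxed);
}
```

The point of the design is that the shared line is written once per batch rather than once per event, so it bounces between cores far less often.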
Re: SYSCALL Performance multicores
Could not wait until tonight, server on!!!!
That was it!!
Now to change the code and align the data correctly.
Brendan, many thanks. I will allow for a 128-byte cache line size.
To all who have helped, and who keep this site an amazing place to visit:
Thank you all.
Alistair
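For readers wanting to see what such a fix might look like, here is a minimal sketch in C11 of per-CPU counters padded to 128 bytes each, the figure mentioned in the post. The struct, array, and function names are hypothetical; the actual kernel code was not posted in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COUNTER_ALIGN 128u  /* padding chosen in the thread */
#define NUM_CPUS      48u   /* core count from the first post */

/* Hypothetical per-CPU slot: the alignment forces the struct size up to
   COUNTER_ALIGN, so no two CPUs' counters can share a cache line. */
typedef struct {
    _Alignas(COUNTER_ALIGN) uint64_t syscall_count;
} per_cpu_counter;

_Static_assert(sizeof(per_cpu_counter) == COUNTER_ALIGN,
               "each slot must occupy exactly one aligned block");

static per_cpu_counter per_cpu_counters[NUM_CPUS];

/* What the SYSCALL handler would do for the calling CPU. */
static void count_syscall(size_t cpu_index)
{
    assert(cpu_index < NUM_CPUS);
    per_cpu_counters[cpu_index].syscall_count++;
}

/* Byte offset of a CPU's slot from the start of the array. */
static size_t slot_offset(size_t cpu_index)
{
    assert(cpu_index < NUM_CPUS);
    return (size_t)((char *)&per_cpu_counters[cpu_index] -
                    (char *)&per_cpu_counters[0]);
}
```

The cost is memory: 48 slots take 6 KiB instead of 384 bytes, which is usually an easy trade for counters that are written on every SYSCALL.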