SYSCALL Performance multicores

tsdnz · Post by **tsdnz** » Wed Nov 18, 2015 6:45 pm

Hi, I have four 1.90 GHz twelve-core processors: 48 cores @ 1.90 GHz.

It would appear that the CPUs are in groups of 8, and when SYSCALL is called at the same time within the same group, the cores in that group share the throughput.

I am in 64 bit long mode.
I am testing SYSCALL and SYSRET performance.

I am calling SYSCALL from user code in a loop, the SYSCALL Kernel code increases a counter for the CPU and returns.

If I just have Core #0 running it processes 10.6 million per second.

If I have Core #0 and Core #8 running, each processes 10.6 million per second.

The same goes for Core #0, Core #8, Core #16, Core #24, Core #32 and Core #40: all run at 10.6 million per second,
giving me around 60 million per second in total.

But if I ask for Core #0 and Core #1, both run at 5.1 million per second.

It appears as if the CPUs are grouped: 0..7, 8..15, 16..23, 24..31, 32..39, 40..47.

The performance drops as more cores are added. If Core #0 through to Core #7 are all running, then each manages 700,000 per second.
This gives me for the 0-7 group around 10.2 million, very close to the single Core at 10.6 million.

If Core #0 through to Core #7 are running as above at 700K/sec, and Core #8 is the only core running in the Core #8 through to Core #15 group, Core #8 still runs at 10.6 million per second.

So it appears that the cores are grouped and that one group does not affect the others.

Any ideas on what the system is doing?

Many thanks. Alistair

Brendan · Post by **Brendan** » Wed Nov 18, 2015 8:17 pm

Hi,

tsdnz wrote: the SYSCALL Kernel code increases a counter for the CPU and returns.

tsdnz wrote: Any ideas on what the system is doing?

Yes; the cache line that the counter is in is probably bouncing between cores. When a pair of logical CPUs share the same L1 cache (e.g. hyper-threading) the cache line doesn't need to move at all and it's fast. When a pair of logical CPUs aren't sharing caches the cache line has to be transferred and it's "less fast".

For four 12-core CPUs you could actually get up to 4 or more different speeds for different pairs, corresponding to:

All caches shared between both CPUs (hyper-threading)
L2 shared between CPUs (different cores in same module)
L3 shared between CPUs (different cores in different modules of same chip)
No caches shared between CPUs (different chips, and possibly different penalties depending on how many "hops" between NUMA domains)

Simple solution (if you want to avoid this) is to have a different counter for each CPU. For example, you could use "inc ebx" and then afterwards do "lock add [counter],ebx" on each CPU (instead of doing "lock inc [counter]").
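As a rough illustration of that suggestion, here is a minimal C sketch, assuming C11 atomics stand in for the `lock add` instruction. The names `shared_counter`, `count_events_batched` and `read_shared_counter` are hypothetical and do not come from the thread. Each CPU counts in a local variable (the "inc ebx" part) and touches the shared cache line only once, at the end.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared total; the name is illustrative only. */
static _Atomic uint64_t shared_counter;

/* Count n events in a local variable, then publish them with a single
   atomic add instead of one atomic increment per event. */
static uint64_t count_events_batched(uint64_t n)
{
    uint64_t local = 0;
    for (uint64_t i = 0; i < n; i++) {
        local++;    /* local only: no shared cache line is written here */
    }
    /* One read-modify-write on the shared line per batch. */
    atomic_fetch_add_explicit(&shared_counter, local, memory_order_relaxed);
    return local;
}

/* Read the published total. */
static uint64_t read_shared_counter(void)
{
    return atomic_load_explicit(&shared_counter, memory_order_relaxed);
}
```

The shared line is then contended once per batch rather than once per event, so the cost of moving it between cores is amortized over the batch.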

Cheers,

Brendan

tsdnz · Post by **tsdnz** » Wed Nov 18, 2015 9:31 pm

Back from picking my daughter up from swimming.

Brendan wrote: Simple solution (if you want to avoid this) is to have a different counter for each CPU. For example, you could use "inc ebx" and then afterwards do "lock add [counter],ebx" on each CPU (instead of doing "lock inc [counter]").

I have a different Counter for each CPU.

While at swimming I was thinking: 8 x sizeof(QWORD) = 64 bytes = cache line size.
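That arithmetic can be checked with a small C sketch, assuming a 64-byte cache line and a tightly packed array of 8-byte counters that starts on a line boundary. `CACHE_LINE_BYTES` and `cache_line_of_counter` are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u   /* assumed line size: 8 x sizeof(QWORD) */

/* Index of the cache line (relative to the start of a line-aligned,
   tightly packed uint64_t array) that holds the counter of cpu_index. */
static size_t cache_line_of_counter(size_t cpu_index)
{
    return (cpu_index * sizeof(uint64_t)) / CACHE_LINE_BYTES;
}
```

Under these assumptions CPUs 0..7 map to line 0, CPUs 8..15 to line 1, and so on up to CPUs 40..47 on line 5, which matches the groups of 8 observed in the measurements.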

I have assembly code that is called as the SYSCALL entry point and then dispatches to the correct routine.
This code uses (QWORD*) + CPU->Index as an exception handler just for SYSCALL.

I have set up the structure wrong, just as you have outlined.

I will try this tonight and post the results.

Many thanks for your time Brendan

tsdnz · Post by **tsdnz** » Wed Nov 18, 2015 9:57 pm

Could not wait until tonight, server on!!!!

That was it!!

Now to change the code, and align the data correctly.

Brendan, many thanks. I will allow for a 128-byte cache line size.

To all who have helped, and who keep this site an amazing place to visit:
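One way such a layout might look in C11 is sketched below. This is not the code from the thread: `struct percpu_counter`, `syscall_counts`, `count_syscall`, `counter_offset`, `COUNTER_ALIGN` and `MAX_CPUS` are all hypothetical names, and the 48-entry array simply mirrors the 48 cores mentioned earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COUNTER_ALIGN 128   /* padding size mentioned in the post */
#define MAX_CPUS      48    /* assumed: one slot per core */

/* Each counter is aligned to 128 bytes, so the struct is padded to 128
   bytes and no two CPUs' counters can share a cache line. */
struct percpu_counter {
    _Alignas(COUNTER_ALIGN) uint64_t value;
};

static_assert(sizeof(struct percpu_counter) == COUNTER_ALIGN,
              "each per-CPU counter should occupy exactly one padded slot");

static struct percpu_counter syscall_counts[MAX_CPUS];

/* Increment the counter owned by cpu_index; only that CPU writes it. */
static void count_syscall(size_t cpu_index)
{
    syscall_counts[cpu_index].value++;
}

/* Byte offset of a CPU's counter from the start of the array. */
static size_t counter_offset(size_t cpu_index)
{
    return (size_t)((char *)&syscall_counts[cpu_index] -
                    (char *)&syscall_counts[0]);
}
```

The cost is memory: 48 counters take 6 KiB instead of 384 bytes, which is usually a cheap price for removing the cache line bouncing.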
Thank you all.

Alistair
