SYSCALL Performance multicores

tsdnz
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

SYSCALL Performance multicores

Post by tsdnz »

Hi, I have four 1.90 GHz twelve-core processors: 48 cores @ 1.90 GHz.

It would appear that the CPUs are in groups of 8, and when SYSCALL is called at the same time within the same group, the cores in that group share the load.

I am in 64 bit long mode.
I am testing SYSCALL and SYSRET performance.

I am calling SYSCALL from user code in a loop, the SYSCALL Kernel code increases a counter for the CPU and returns.

If I just have Core #0 running, it processes 10.6 million SYSCALLs per second.

If I have Core #0 and Core #8 running, each processes 10.6 million per second.

The same goes for Core #0, Core #8, Core #16, Core #24, Core #32 and Core #40: all 10.6 million per second,
giving me around 60 million per second in total.

But if I ask for Core #0 and Core #1, both run at 5.1 million per second.

It appears as if the CPUs are grouped: 0..7, 8..15, 16..23, 24..31, 32..39, 40..47.

Performance drops as more cores are added. If Core #0 through Core #7 are all running, then each has 700,000 per second.
This gives me for the 0-7 group around 10.2 million, very close to the single core at 10.6 million.

If Core #0 through Core #7 are running as above at 700K/sec, and Core #8 is the only core running in the Core #8 through Core #15 group, it still runs at 10.6 million per second.

So it appears that the cores are grouped and that one group does not affect the others.

Any ideas on what the system is doing?

Many thanks. Alistair
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: SYSCALL Performance multicores

Post by Brendan »

Hi,
tsdnz wrote:the SYSCALL Kernel code increases a counter for the CPU and returns.
tsdnz wrote:Any ideas on what the system is doing?
Yes; the cache line that the counter is in is probably bouncing between cores. When a pair of logical CPUs share the same L1 cache (e.g. hyper-threading) the cache line doesn't need to move at all and it's fast. When a pair of logical CPUs aren't sharing caches the cache line has to be transferred and it's "less fast".

For four 12-core CPUs you could actually get up to 4 or more different speeds for different pairs, corresponding to:
  • All caches shared between both CPUs (hyper-threading)
  • L2 shared between CPUs (different cores in same module)
  • L3 shared between CPUs (different cores in different modules of same chip)
  • No caches shared between CPUs (different chips, and possibly different penalties depending on how many "hops" between NUMA domains)
Simple solution (if you want to avoid this) is to have a different counter for each CPU. For example, you could use "inc ebx" and then afterwards do "lock add [counter],ebx" on each CPU (instead of doing "lock inc [counter]").
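The "inc ebx" then "lock add [counter],ebx" idea can be sketched in C11 as follows. This is only an illustration, not the kernel code from this thread: the function names count_shared and count_batched are hypothetical, and it assumes a compiler with <stdatomic.h>. On x86-64 the atomic add typically compiles to a locked add or xadd.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Contended pattern: every event is an atomic read-modify-write on the
   shared counter (the "lock inc [counter]" case), so the counter's cache
   line has to move to whichever core touches it next. */
static void count_shared(_Atomic uint64_t *shared, uint64_t events)
{
    assert(shared != NULL);
    for (uint64_t i = 0; i < events; i++)
        atomic_fetch_add_explicit(shared, 1, memory_order_relaxed);
}

/* Batched pattern: count in a local variable (the "inc ebx" step), then
   publish the whole batch with one atomic add (the "lock add [counter],ebx"
   step). The shared cache line is touched once per batch, not per event. */
static void count_batched(_Atomic uint64_t *shared, uint64_t events)
{
    assert(shared != NULL);
    uint64_t local = 0;
    for (uint64_t i = 0; i < events; i++)
        local++;
    atomic_fetch_add_explicit(shared, local, memory_order_relaxed);
}
```

Both functions leave the same total in the shared counter. The difference is how often the shared cache line is written, which is what matters when several cores run the loop at once.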


Cheers,

Brendan
tsdnz
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: SYSCALL Performance multicores

Post by tsdnz »

Back from picking my daughter up from swimming.

Brendan wrote:Simple solution (if you want to avoid this) is to have a different counter for each CPU. For example, you could use "inc ebx" and then afterwards do "lock add [counter],ebx" on each CPU (instead of doing "lock inc [counter]").
I have a different Counter for each CPU.

Thinking at swimming.... 8 x sizeof(QWORD) = cacheline size.

I have assembly code that is called as the SYSCALL entry point that then passes code to the correct routine.
This code uses (QWORD*) + CPU->Index as an exception handler just for SYSCALL.

I have set up the structure wrong, just as you have outlined.
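The arithmetic above (8 x sizeof(QWORD) = one cache line) can be checked with a small C11 sketch. It assumes a 64-byte cache line and an array that starts on a line boundary. The names packed_counters, padded_counters, packed_line_of and padded_line_of are hypothetical and not taken from the actual kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */
#define MAX_CPUS   48

/* Packed layout: one QWORD per CPU, indexed by CPU number. */
static _Alignas(CACHE_LINE) uint64_t packed_counters[MAX_CPUS];

/* Cache line (counted from the start of the array) that holds the packed
   counter of the given CPU. */
static size_t packed_line_of(size_t cpu)
{
    assert(cpu < MAX_CPUS);
    size_t offset = (size_t)((const char *)&packed_counters[cpu] -
                             (const char *)packed_counters);
    return offset / CACHE_LINE;
}

/* Padded layout: aligning the member rounds the struct size up to a full
   line, so each CPU's counter sits in its own cache line. */
struct padded_counter {
    _Alignas(CACHE_LINE) uint64_t value;
};
_Static_assert(sizeof(struct padded_counter) == CACHE_LINE,
               "padded counter must fill exactly one cache line");

static struct padded_counter padded_counters[MAX_CPUS];

/* Cache line (counted from the start of the array) that holds the padded
   counter of the given CPU. */
static size_t padded_line_of(size_t cpu)
{
    assert(cpu < MAX_CPUS);
    size_t offset = (size_t)((const char *)&padded_counters[cpu] -
                             (const char *)padded_counters);
    return offset / CACHE_LINE;
}
```

With the packed layout, CPUs 0..7 map to line 0, CPUs 8..15 to line 1, and so on, which matches the grouping seen in the measurements. With the padded layout every CPU gets its own line.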

I will try this tonight and post the results.

Many thanks for your time Brendan
tsdnz
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: SYSCALL Performance multicores

Post by tsdnz »

Could not wait until tonight, server on!!!!

That was it!!

Now to change the code, and align the data correctly.

Brendan, many thanks. I will allow for a 128-byte cache line size.

To all who have helped, and who keep this site an amazing place to visit:
thank you all.

Alistair