Multi-CPU & 32/64 bits


Re:Multi-CPU & 32/64 bits

Post by Brendan »

Hi,

Just thought I'd mention that because of this bus locking, using the LOCK prefix can affect the performance of other CPUs, even if those other CPUs aren't doing anything related to the lock. For example, if it takes a CPU 60 ns to read, 100 ns to do the operation and then another 60 ns to write, other CPUs won't be able to access the bus for any reason for 220 ns. Consider a simple spinlock:

Code:

get_lock:
   lock bts dword [the_lock],0
   jc get_lock

In this case, while the lock can't be acquired, the conditional jump ("jc get_lock") would be in the CPU's L1 instruction cache and would execute quickly, resulting in the bus being locked more than 50% of the time.

For this reason it's recommended to use "test, test & modify" locks (often called "test and test-and-set"). The idea is to test the lock first with a plain read that doesn't lock the bus; because the read doesn't modify anything, it doesn't need to be atomic. For example:

Code:

get_lock:
   test dword [the_lock],1
   jne get_lock

   lock bts dword [the_lock],0
   jc get_lock

In this case, the bus is only locked when you know that the lock can probably be acquired. This reduces the effect of the spinlock on other CPUs that are trying to access the bus for other reasons.

Then you've got the PAUSE instruction, which benefits CPUs with hyper-threading. The loop above would keep one logical CPU busy, which affects the speed of the other logical CPU. To improve this, the PAUSE instruction reduces the amount of CPU resources consumed by the spinlock, which increases the amount of CPU resources available to the other logical CPU (i.e. the waiting CPU waits slower while the working CPU works faster). This improves performance for hyper-threading:

Code:

get_lock:
   pause
   test dword [the_lock],1
   jne get_lock

   lock bts dword [the_lock],0
   jc get_lock

This can be improved further. If the lock is free anyway (which should normally be the case) then you can avoid the pause and pre-test on the first attempt:

Code:

get_lock:
   lock bts dword [the_lock],0
   jnc .acquired

.retry:
   pause
   test dword [the_lock],1
   jne .retry

   lock bts dword [the_lock],0
   jc .retry

.acquired:

That's about as good as the spinlock code can get (without using the newer MONITOR and MWAIT instructions, which are probably overkill anyway).
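
One thing the code above doesn't show is how to free the lock. On x86 that doesn't need the LOCK prefix at all - an aligned store is atomic, and stores aren't reordered with older stores, so clearing the flag with a plain write is enough. A minimal sketch, matching the bit 0 convention used above (the "free_lock" name is just for illustration):

Code:

free_lock:
   mov dword [the_lock],0   ; plain store - everything written inside
                            ; the critical section becomes visible
                            ; before the lock looks free
   ret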

[continued in next post]

Re:Multi-CPU & 32/64 bits

Post by Brendan »

[continued from previous post]

The final step is to choose where the lock lives in memory. Modern CPUs transfer memory in "cache lines", which (depending on the CPU) can currently be as large as 128 bytes. If you try to read a single byte and there's a cache miss, the full cache line will be loaded into the CPU. Consider the following:

Code:

   section .data
   align 64
lockA:   dd 0
lockB:   dd 0
lockC:   dd 0
lockD:   dd 0
lockE:   dd 0
lockF:   dd 0
   section .text

Any code that uses any of these locks is going to want the entire cache line. If 2 CPUs are trying to use any of these locks they're both going to want the same cache line. If one lock is contended (several CPUs trying to acquire it) then the cache line is going to be continually bounced from CPU to CPU, and any other CPU trying to access any of the other locks will also need to fight for the cache line - even if no-one else wants those other locks. This is "cache thrashing" (and because the other locks don't really share anything with the contended one, it's also called "false sharing").

In fact it's the same when a lock shares a cache line with ordinary data:

Code:

   section .data
   align 64
lockA:   dd 0
variableA:   dd 0
variableB:   dd 0
variableC:   dd 0
   section .text

Here, cache thrashing caused by the lock will reduce the performance of other CPUs trying to access the variables.

The solution is to make sure each lock is on a cache line by itself, so that any cache thrashing is isolated to the lock only, and anything else isn't affected:

Code:

   section .data
   align 128
lockA:   dd 0
   align 128
lockB:   dd 0
   align 128
lockC:   dd 0
   align 128
variableA:   dd 0
variableB:   dd 0
variableC:   dd 0
   section .text

After that, the only problem left is lock contention (several CPUs trying to acquire the same lock at the same time). The only ways to minimize lock contention are to use more locks that each protect less data, and to find algorithms that don't need locks at all. The more CPUs the computer has, the worse lock contention becomes (a single lock that protects everything in the kernel is an extremely bad idea).
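
As an example of "more locks that each protect less data", a hash table can have one lock per bucket instead of one lock for the whole table. A rough sketch, assuming 16 buckets with every lock padded onto its own 128 byte cache line (the names and layout are made up for illustration):

Code:

   section .data
   align 128
bucket_locks:   times 16*32 dd 0   ; 16 locks spaced 128 bytes apart
   section .text

; in: ebx = bucket number (0 to 15)
lock_bucket:
   shl ebx,7                       ; ebx = bucket number * 128
.retry:
   lock bts dword [bucket_locks+ebx],0
   jnc .acquired
.wait:
   pause
   test dword [bucket_locks+ebx],1
   jne .wait
   jmp .retry
.acquired:
   ret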

The reverse can also become true - each lock costs a little overhead even when there's no lock contention. If you need to acquire 20 locks to do something, then that overhead can matter more than the contention. This creates a compromise - with only 2 CPUs lock contention can be less important than lock overhead, but for the same code with 128 CPUs lock contention can be more important than lock overhead.

With all of this in mind, the best thing you can have is a lockless algorithm. Developing lockless algorithms is unfortunately very difficult - it's one of the rare areas in programming where specialists get paid good money to develop an algorithm and nothing else. In general the largest benefits are where one or more locks are under heavy contention (where the locks can be the largest performance bottleneck).
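
Lockless algorithms don't have to be exotic, though - for something as simple as a shared counter, a single LOCK XADD can do the whole read-modify-write atomically, with no spinlock at all. A minimal sketch (the "counter" and "increment_counter" names are just for illustration):

Code:

   section .data
   align 128
counter:   dd 0
   section .text

increment_counter:
   mov eax,1                 ; amount to add
   lock xadd [counter],eax   ; atomically add; eax = value before the add
   ret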


Cheers,

Brendan