Re:Multi-CPU & 32/64 bits
Posted: Tue Jul 19, 2005 11:42 pm
Hi,
Just thought I'd mention that because of this bus locking, using the LOCK prefix can affect the performance of other CPUs, even if those other CPUs aren't doing anything related to the lock. For example, if it takes a CPU 60 ns to read, 100 ns to do the operation and then another 60 ns to write, other CPUs won't be able to access the bus for any reason for 220 ns. Consider a simple spinlock:
Code:
get_lock:
    lock bts dword [the_lock],0     ; atomically set bit 0; CF = previous value of the bit
    jc get_lock                     ; bit was already set (lock held), so try again
In this case, while the lock can't be acquired the conditional jump ("jc get_lock") would be in the CPU's L1 instruction cache and would execute quickly. Since the jump takes far less time than the locked access, this results in the bus being locked more than 50% of the time.
For this reason it's recommended to use "test, test & modify" locks (more commonly called "test and test-and-set"). The idea is to test the lock first without locking the bus, using an instruction that doesn't modify the lock, so that you don't need to worry about it being atomic. For example:
Code:
get_lock:
    test dword [the_lock],1         ; plain read of bit 0, no bus lock
    jne get_lock                    ; still held, keep polling
    lock bts dword [the_lock],0     ; looks free, so try to take it atomically
    jc get_lock                     ; another CPU got there first, start over
In this case, the bus is only locked when the lock appears to be free. This reduces the effect of the spinlock on other CPUs that are trying to access the bus for other reasons.
Then you've got the PAUSE instruction, which benefits hyper-threading CPUs. The lock above would keep one logical CPU busy, which affects the speed of the other logical CPU. To improve this, the PAUSE instruction reduces the amount of CPU resources used for the spinlock, which increases the amount of CPU resources available to the other logical CPU (i.e. the waiting CPU waits slower while the working CPU works faster). This improves performance for hyper-threading:
Code:
get_lock:
    pause                           ; spin-wait hint, frees resources for the other logical CPU
    test dword [the_lock],1         ; plain read of bit 0, no bus lock
    jne get_lock                    ; still held, keep polling
    lock bts dword [the_lock],0     ; looks free, so try to take it atomically
    jc get_lock                     ; lost the race, start over
This can be improved. If the lock is free anyway (which is hopefully normally the case) then you can avoid the pause and pre-test on the first attempt:
Code:
get_lock:
    lock bts dword [the_lock],0     ; first attempt: go straight for the lock
    jnc .acquired                   ; bit 0 was clear, the lock is ours
.retry:
    pause                           ; contended: back off before polling
    test dword [the_lock],1         ; plain read of bit 0, no bus lock
    jne .retry                      ; still held, keep polling
    lock bts dword [the_lock],0     ; looks free, so try to take it atomically
    jc .retry                       ; lost the race, poll again
.acquired:
That's about as good as the spinlock code can get (without using the new MONITOR and MWAIT instructions, which are probably overkill anyway).
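For anyone working in C rather than assembly, the final version of the lock (first attempt without PAUSE, then PAUSE plus pre-test on every retry) might look roughly like the sketch below. This is only my rendering of the same idea with C11 atomics, not code from the original listings: the names spinlock_t, SPINLOCK_INIT, cpu_relax, spin_trylock, spin_lock and spin_unlock are made up for illustration, and the release function is an assumption (releasing the lock isn't covered above). The compiler decides which locked instruction to emit for the fetch-or, so don't expect it to match the assembly instruction for instruction.

```c
#include <stdatomic.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>               /* _mm_pause() emits the PAUSE instruction */
#define cpu_relax() _mm_pause()
#else
#define cpu_relax() ((void)0)        /* assumption: no spin-wait hint on other targets */
#endif

/* Hypothetical lock type: bit 0 of state set means "held". */
typedef struct { atomic_uint state; } spinlock_t;
#define SPINLOCK_INIT { 0 }

/* One atomic read-modify-write. Returns 1 if we took the lock, 0 if it was held. */
static int spin_trylock(spinlock_t *l)
{
    unsigned old = atomic_fetch_or_explicit(&l->state, 1u, memory_order_acquire);
    return (old & 1u) == 0;
}

static void spin_lock(spinlock_t *l)
{
    if (spin_trylock(l))             /* fast path: no pause, no pre-test */
        return;
    for (;;) {
        do {
            cpu_relax();             /* ease off while waiting */
        } while (atomic_load_explicit(&l->state, memory_order_relaxed) & 1u);
        if (spin_trylock(l))         /* looked free: now do the atomic attempt */
            return;
    }
}

/* Assumed release: a plain store with release ordering clears the lock. */
static void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->state, 0u, memory_order_release);
}
```

Usage would be something like declaring "static spinlock_t the_lock = SPINLOCK_INIT;" and wrapping the critical section in spin_lock(&the_lock) and spin_unlock(&the_lock). The inner polling loop only does ordinary loads, so the atomic operation is attempted only when the lock looks free.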
[continued in next post]