Hi,
mutex wrote:The theory on this is pretty clear in my mind, but I have problems with the "physical" since the behaviour of at least VirtualBox seems very strange on this. Bochs & Core i7 2600 work without any issues at all.
Some emulators aren't as realistic as others. For example, last time I looked at it (when emulating multiple CPUs) Qemu would emulate one CPU for 100 ms, then emulate the next CPU for 100 ms, etc. The end result (for spinlocks) is that one CPU might repeatedly acquire and release the spinlock for 100 ms, then the remaining CPUs might all find it locked and fail to acquire it (and spin for 100 ms each!), and eventually the first CPU gets another 100 ms and repeatedly acquires and releases the spinlock for another 100 ms.
I wouldn't be surprised if VirtualBox has similar problems, especially if you're not using hardware virtualisation or the number of emulated CPUs is greater than the number of actual/host CPUs.
Bochs tends to do "several" (it's configurable) instructions on one CPU, then "several" on the next CPU, etc. If I remember right, the default value is 5 instructions, which is much more realistic (compared to Qemu, where 100 ms works out to millions of instructions). Of course real CPUs all execute instructions at the same time and don't take turns, which is even more realistic.
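If you want to experiment with this in Bochs, the quantum can be set in bochsrc; something like the following should work, if I've remembered the option names correctly (check your Bochs version's documentation):
Code: Select all
# bochsrc fragment: 4 emulated CPUs, switching CPUs every 5 instructions
cpu: count=4, ips=50000000, quantum=5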
Also note that Intel recommend doing a "test (without LOCK prefix)" followed by a "test and set (with LOCK prefix)" to minimise bus locking. This is slightly pessimistic for a spinlock where lock contention should be very rare, so I prefer to try an initial "test and set" first. For example:
Code: Select all
acquireLock:
    mov eax,1

    ;Initial "test and set"

    xchg [lock],eax               ;Atomically swap 1 into the lock (LOCK is implied for XCHG)
    test eax,eax                  ;Old value of 0 means the lock was free and is now ours
    je .acquired

    ;Main "test" loop

.tryLock:
    pause
    cmp dword [lock],0            ;No LOCK here (which is a good thing)!
    jne .tryLock

    ;Main "test and set"
    ;(eax still holds 1 here - the nonzero value the failed XCHG read from the lock)

    xchg [lock],eax
    test eax,eax
    jne .tryLock

.acquired:
Note 1: for the XCHG instruction, the LOCK is implied. An explicit LOCK prefix does nothing (and only wastes space).
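For what it's worth, releasing this lock needs no LOCK prefix (and no atomic read-modify-write) at all - a plain store is enough, because aligned dword stores are atomic on x86. A minimal sketch, using the same [lock] variable as above:
Code: Select all
releaseLock:
    mov dword [lock],0    ;Plain store - aligned dword writes are atomic, no LOCK needed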
Note 2: the lock itself only uses 1 bit, so the remaining 31 bits are wasted. It's possible to use these remaining bits to detect bugs, like the same CPU trying to acquire the same lock multiple times (e.g. store a "CPU number" in those 31 bits), and attempts to free a lock that wasn't acquired. In the same way you can monitor lock contention (e.g. increment the value in those 31 bits, or in a dword at a different address, if the first attempt to acquire the lock fails). This sort of thing can be combined with "#ifdef DEBUGGING", and can make it a lot easier to detect bugs (including excessive lock contention).
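To make Note 2 more concrete, here's a rough sketch (the "(CPU number << 1) | 1" owner encoding, the getCpuNumber helper and the lockError* labels are inventions of mine for illustration, not anything standard):
Code: Select all
acquireLockChecked:
%ifdef DEBUGGING
    call getCpuNumber                ;Hypothetical helper: returns this CPU's number in ebx
    lea edx,[ebx*2+1]                ;Owner value: CPU number in the upper 31 bits, lock bit set
    cmp [lock],edx
    je lockErrorDoubleAcquire        ;Bug: this CPU already holds this lock (would deadlock)
%else
    mov edx,1
%endif

    ;Initial "test and set"

    mov eax,edx
    xchg [lock],eax
    test eax,eax
    je .acquired

%ifdef DEBUGGING
    lock inc dword [contention_counter]  ;First attempt failed - record the contention
%endif

    ;Main "test" loop, then "test and set" (same as before)

.tryLock:
    pause
    cmp dword [lock],0
    jne .tryLock
    mov eax,edx                      ;Restore our value (a failed XCHG clobbered eax)
    xchg [lock],eax
    test eax,eax
    jne .tryLock

.acquired:
    ret

releaseLockChecked:
%ifdef DEBUGGING
    call getCpuNumber
    lea edx,[ebx*2+1]
    cmp [lock],edx
    jne lockErrorNotOwner            ;Bug: freeing a lock this CPU doesn't hold
%endif
    mov dword [lock],0
    ret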
I'd also recommend re-reading gerryg400's post. He's perfectly correct: if a simple spinlock (like yours) is under high contention, then you've got a design flaw and/or should be using a type of lock that is "fair" (e.g. where CPUs acquire the lock in the order they arrive, and not in a potentially "unfair" or unpredictable order). The ticket lock gerryg400 mentioned is one example of a "fair" lock.
To understand how a "ticket lock" would work, start by reading Wikipedia's page on Lamport's Bakery Algorithm. If you think about it enough, you'll see that it can be implemented with 2 variables (one for the "next ticket value" and one for the "ticket value currently being served"); and with a "lock xadd" (to get the ticket), followed by a "while(current_value != ticket)" loop to spin, then an "inc" or "mov" (without a LOCK prefix) to release the lock. For example:
Code: Select all
acquireLock:
    ;Get ticket

    mov eax,1
    lock xadd [next_ticket],eax      ;eax = my ticket; next_ticket is atomically incremented

    ;Check if no waiting is needed

    cmp [current_ticket],eax
    je .acquired

%ifdef DEBUGGING
    lock inc dword [contention_counter]
%endif

    ;Wait for my turn

.retry:
    pause
    cmp [current_ticket],eax         ;Plain read - no LOCK prefix needed
    jne .retry

.acquired:
Note 3: Notice that this doesn't do the "test, then test and set" thing that Intel recommends. This is because the "wait for my turn" loop doesn't use a LOCK prefix to begin with, so there's no bus locking to avoid. I put an initial test in anyway (so that when there's no contention the PAUSE instruction is never executed, and to make it easy to add the contention counter debugging code to the example).
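Also for completeness, releasing the ticket lock is just a matter of advancing current_ticket; because only the current lock holder ever writes to it, no LOCK prefix is needed. A minimal sketch using the same variables as above:
Code: Select all
releaseLock:
    inc dword [current_ticket]    ;Only the lock holder writes this, so no LOCK prefix is needed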
Of course for an emulator like Qemu (where CPUs are emulated one at a time, each executing a huge number of instructions), a "fair" lock like this can still cause serious performance problems when there's lots of lock contention - e.g. every 100 ms the next CPU might get the lock once (because it's that CPU's turn) and then spin for 99.9 ms trying to get it a second time (and failing, because it's not that CPU's turn anymore). In general this is the emulator's fault and not your code's problem (and there's nothing you can do to fix it, other than avoiding lock contention).
Cheers,
Brendan