Hi,
Wow, OK...
My formula for "estimated cost of a spinlock" is exactly how Colonel Kernel explained:
T = Pa * Ta + Pb * Tb
Where:
- T = estimated average cost of the spinlock
- Pa = probability that the lock can be acquired without spinning
- Pb = probability that the lock can't be acquired without spinning
- Ta = average cost of acquiring the lock if no spinning is needed
- Tb = average cost of acquiring the lock if spinning is needed
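For example, with made-up numbers - Pa = 0.9, Ta = 3 cycles, Pb = 0.1 and Tb = 300 cycles - you'd get T = 0.9 * 3 + 0.1 * 300 = 32.7 cycles.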
It wasn't meant to be 100% accurate (it's just an estimate of the relative cost), just like my other statement about the estimated cost of a thread switch ("in the order of 5 to 500 times more CPU time") isn't 100% accurate either.
Still, take a look at some basic spinlock code (typically implemented as a macro):
Code:
lock bts dword [myLock], 0    ; try to grab bit 0 (CF = old value of the bit)
jnc .acquired                 ; bit was clear - we own the lock
.spin:
pause                         ; be nice to hyper-threading and save power while spinning
test dword [myLock], 1        ; read-only test while someone else owns the lock
jne .spin                     ; still set - keep spinning
lock bts dword [myLock], 0    ; looks free - try to grab it again
jc .spin                      ; someone else got it first - back to spinning
.acquired:
Then there's the code to free the lock, which on 80x86 can be as simple as a single store (stores aren't reordered with older stores, so no LOCK prefix is needed to release):
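Code:
mov dword [myLock], 0         ; clear bit 0 - the next "lock bts" will find the lock free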
In this case, the average cost of acquiring the lock if no spinning is needed (Ta) comes from the first 2 instructions and nothing else. On a modern CPU, if "[myLock]" is in the cache, the first instruction ("lock bts [myLock], 0") takes roughly 3 cycles and doesn't involve locking the bus. The second instruction (if it doesn't cause a branch mis-prediction) doesn't cost anything. If "[myLock]" is not in the cache, then you've got a "read for ownership" and bus-locking, and if the branch is mis-predicted there's also a branch mis-prediction penalty (roughly 25 cycles for Pentium 4/Netburst - less for CPUs with shorter pipelines).
For the average cost of acquiring the lock if spinning is needed (Tb), the cost of all the instructions that execute before the lock becomes free can be measured as the number of cycles until some other CPU frees the lock. When the lock does become free it'd take about 30 cycles (including a likely branch mis-prediction) plus the cost of a cache miss to exit the loop.
The cost of freeing the lock should also be included - a possible cache line miss (if someone else tried to acquire the lock while we were in our critical section) plus about 2 cycles.
This gives a more accurate estimate of:
T = Pa * (Ta + Pb * C + Pc * M) + Pd * (L + X + C + Pe * M) + Pf * C + F
Where:
- T = estimated average cost of the spinlock
- Pa = probability that the lock can be acquired without spinning
- Pb = probability that the cache line needs to be fetched when no spinning is needed
- Pc = probability that the branch is mis-predicted when no spinning is needed
- Pd = probability that the lock can't be acquired without spinning
- Ta = minimum cost of acquiring the lock if no spinning is needed
- C = cache line fetch cost
- M = branch mis-prediction penalty
- L = average number of cycles before the lock becomes free
- X = minimum cost of exit overhead (without cache line miss or branch mis-predict)
- Pe = probability of a branch mis-predict when exiting the spinloop
- Pf = probability that someone tries to acquire the lock before it's freed
- F = minimum cost of freeing the lock
Simplified - with guesses based on expected times for Pentium 4, an assumed 100% chance of a branch mis-predict when exiting the spinloop, an assumed 10% chance of a branch mis-predict when no spinning is needed, an assumed 25 cycle branch mis-prediction penalty, an assumed 100 cycle cache miss penalty, the realisation that Pf = Pd, and an assumed minimum cost of 2 cycles for freeing the lock - it becomes:
T = Pa * (Pb * 100 + 5.5) + (1 - Pa) * (L + 137 + C)
Where:
- T = estimated average cost of the spinlock
- Pa = probability that the lock can be acquired without spinning
- Pb = probability that the cache line needs to be fetched when no spinning is needed
- C = cache line fetch cost
- L = average number of cycles before the lock becomes free
Now, if (for a specific lock with specific expected usage) the probability that the lock can be acquired without spinning is 90%, the probability that the cache line needs to be fetched when no spinning is needed is 50%, the average time before a lock becomes free is 250 cycles (i.e. the average critical section protected by the lock is about 500 cycles), and the left-over "C" term is ignored, you get:
T = 0.9 * (0.5 * 100 + 5.5) + (1 - 0.9) * (250 + 137)
T = 0.9 * 55.5 + 0.1 * 387
T = 88.65 cycles
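Of course those numbers are still guesses - if you want real numbers for a specific CPU you can measure the uncontended path directly with RDTSC. A rough sketch (the RDTSC overhead would need to be calibrated out, and the result averaged over many runs, before it means much):
Code:
cpuid                         ; serialize, so earlier instructions have retired
rdtsc                         ; EDX:EAX = start time stamp
mov esi, eax                  ; keep the low 32 bits of the start time
lock bts dword [myLock], 0    ; the uncontended acquire path being measured
jnc .measured                 ; keep the branch in the measured path
.measured:
mov dword [myLock], 0         ; free the lock again
rdtsc                         ; EDX:EAX = end time stamp
sub eax, esi                  ; EAX = rough cycle count for one acquire + release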
In contrast, something like an "FXSAVE/FXRSTOR" pair (without any of the other costs of thread switches) is going to cost more than double this in terms of cycles used.
For my OS, thread switch costs include determining which thread to run next, adjusting the amount of time used by the thread, switching register state, and (in all cases) the cost of switching address spaces (TLB flush, etc). For me, the cost of a thread switch is around 2500 cycles. In this case a spinlock is going to be cheaper than a pair of thread switches unless "T" (from the formulas above) is > 5000 cycles (assuming that the thread isn't about to be pre-empted, which is something you can't predict when writing your kernel). That corresponds to insane amounts of lock contention and/or insanely long critical sections, and brings me back to the "3 rules of thumb for micro-kernel locking" I posted earlier, which can be summarized as "use spinlocks unless your code needs to be redesigned".
Of course locks in user-space and device drivers are still a completely different matter...
bewing wrote: In these calculations, I think you are all making a huge mistake by ignoring some hardware details -- that are in a sense hidden from you, yes -- but still exist.
Posit: you have a machine with 8 CPUs. The whole point of a spinlock is that there is a specific memory location that is updated/checked by each CPU in their own spinlock code, repeatedly. Either you work really hard to put this memory location into uncacheable memory -- or it will certainly fall into L1 and L2 cache on every CPU, from being accessed so often.
Under no circumstances should any lock be in uncacheable memory - access to RAM is slow, and the relative difference between CPU speed and RAM speed can be expected to continue growing.
bewing wrote: If the memory address is cacheable, on the other hand, then you just invalidated significant portions of the L1 and L2 caches of all 7 of the other CPUs on the machine. On a page that they are guaranteed to need again very very soon -- because they will be checking this spinlock themselves. Which means that all 7 of those CPUs will need to reload exactly the same page out of ram -- guaranteeing massive contention. Not to mention the tremendous load you place on the cache coherency mechanisms.
First, "page" in your statement above is misleading, as "page" typically means a unit of the virtual memory system (e.g. for 80x86 it's 4096 bytes, or possibly 2 MiB or 4 MiB) while for modern CPUs the cache coherency mechanism typically uses much smaller areas (e.g. 64-byte cache line sizes on modern 80x86).
Secondly, the cache coherency mechanism normally uses MESI (modified, exclusive, shared, invalid) states, with cache snooping. This means that if one CPU has modified a cache line and other CPU/s want that cache line, then the first CPU will send the modified cache line on the bus to memory and other CPU/s can snoop that transaction directly from the bus. I.e. the modified cache line goes from CPU to RAM and other CPU/s, not from CPU to RAM then from RAM to other CPU/s.
For a lock under extremely heavy contention, the same cache line would be in all CPUs' caches. When the current lock owner tries to free the lock it'd tell other CPUs to set the cache line to "invalid", then modify the cache line in its own cache. Soon after, the other spinning CPUs will request to read the cache line again, and the CPU that just freed the lock will send the modified cache line across the bus to the other CPUs (and RAM). Now the cache line is "shared" by all CPUs. The other CPUs will realise that the lock is now free and attempt to lock the bus and put the cache line in the "exclusive" state. The first CPU to lock the bus succeeds, all other CPUs change the cache line state to "invalid", and the winner sets the cache line to "exclusive" and does its "read/modify/write" atomic instruction. Soon after, the other spinning CPUs will request to read the cache line again, and the CPU that just acquired the lock will send the modified cache line across the bus to the other CPUs (and RAM). Now the cache line is "shared" by all CPUs again, and we're back where we started.
Basically, under extremely heavy contention, there's no bus traffic at all unless a CPU acquires the lock or frees the lock. When a CPU does acquire or free the lock it causes a request for exclusive access, one or more requests for cache line read, followed by a cache line write.
For a badly designed spinlock, it is much worse. For example, compare the spinlock code near the top of this post to something like this:
Code:
.spin:
lock bts dword [myLock], 0    ; locked read/modify/write on every iteration (needs the cache line exclusive every time)
jc .spin                      ; still taken - try the locked operation again immediately
In this case you would get a huge amount of bus traffic (but that's probably what you'd deserve for using such a badly designed spinlock to begin with).
Cheers,
Brendan