Why is there cache invalidation?
Imagine a simple two-CPU system, each CPU with its own cache: CPU1 - Cache1, CPU2 - Cache2. CPU1 runs code that accesses/modifies the value of the spinlock counter. The hardware loads the cache line containing the spinlock counter into Cache1, modifies the value, and marks that cache line "dirty". Perhaps CPU1 now releases the lock. The spinlock counter in Cache1 gets modified again, and is still dirty. Now CPU2 wants to access/modify the spinlock. When it tries, Cache1 announces that it owns a dirty copy and needs to write it back out to memory. It does so and clears its copy, and only then is Cache2 allowed to read that memory, modify the value, and mark its own copy as dirty. Notice that flushing Cache1 and loading Cache2 took a tremendous amount of time.
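To make that concrete, here is a minimal sketch of the kind of spinlock being described. It assumes the GCC/Clang __atomic builtins, and the names (spinlock_t, spin_lock, spin_unlock) are made up for illustration -- this is not the code from the threads linked below. The point to notice is that both acquiring and releasing the lock write the lock word, so the cache line holding it ends up dirty in whichever cache touched it last.
 /* Sketch only: 0 = free, 1 = held. */
 typedef int spinlock_t;
 
 static void spin_lock(spinlock_t *lock)
 {
     /* An atomic exchange is a WRITE, so this CPU's cache must take the
        lock's cache line in exclusive (dirty) ownership before the
        exchange can complete. While the lock is held elsewhere, every
        retry repeats that write. */
     while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE) != 0)
         ;  /* spin until the previous value was 0, i.e. the lock was free */
 }
 
 static void spin_unlock(spinlock_t *lock)
 {
     /* Releasing is also a write: the line gets dirtied again in the
        releasing CPU's cache, so the next CPU to touch the lock forces
        another write-back/transfer. */
     __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
 }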
Now, imagine the case where they try to access the spinlock simultaneously. CPU2 has to spin, constantly reading the value of the spinlock until CPU1 releases it. But every loop on CPU2 also modifies the value. So Cache1 initially held a dirty copy. Then it was written out, Cache2 read it in and dirtied it, and CPU2 spins on the cached value until CPU1 wants to release the lock. Then Cache2 has to write out the memory line, and Cache1 has to read it back in and modify it; now Cache2 wants it back to see what the new value is, so Cache1 writes it back out to RAM and Cache2 reads it back in. Each of those points where a cache line has to be passed from one cache to another is called an RFO (Read For Ownership).
Now, that was only a 2 CPU case. So imagine 3 CPUs, 2 of them spinning on the value of the lock. In the previous case, Cache2 owned the memory line the entire time that CPU2 was spinning on the lock. But now, CPU2 and CPU3 are both constantly modifying the value of the lock. So on every single spin loop there is an RFO, transferring the dirtied cache line from Cache2 to Cache3 and back again.
Now imagine an N CPU case. CPU1 owns the lock, but the hardware doesn't know that. You have N-1 CPUs spinning on the lock, and CPU1 wants to release it. Therefore you have N RFO requests, and (on average) half of them will get processed before CPU1 finally gets its own RFO request handled, gains control of that memory address, and can finally release the lock. On real hardware, every RFO takes hundreds of CPU cycles to flush one cache and reload the other -- during which time at least one, and usually both, of the CPUs involved are idle. If you are using hyperthreading, then you can recover some of these wasted CPU cycles.
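One common way to cut that RFO traffic down (again, only a hedged sketch using the same assumed spinlock_t and the GCC/Clang __atomic builtins, not code from the threads linked below) is the "test-and-test-and-set" spin: let the waiting CPUs spin on a plain read, which each of them can satisfy from a shared, clean copy in its own cache, and only attempt the write once the lock looks free. While the lock stays held, the spinners then generate no RFOs at all; the cache line only starts bouncing again around the moment of release.
 /* Sketch only: "test-and-test-and-set" acquire. */
 static void spin_lock_ttas(spinlock_t *lock)
 {
     for (;;) {
         /* First just READ the lock. A read lets every waiting CPU keep
            a shared, clean copy of the cache line, so spinning here
            causes no RFO traffic while the lock is held. */
         while (__atomic_load_n(lock, __ATOMIC_RELAXED) != 0)
             ;  /* spin locally in this CPU's cache */
 
         /* The lock looked free, so now try the WRITE. Only this attempt
            (and the scramble right after a release) moves the cache line
            between caches. */
         if (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE) == 0)
             return;  /* got it */
         /* Another CPU grabbed it first -- back to read-only spinning. */
     }
 }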
The best case happens when the lock is used so rarely that the cache line for one CPU gets written back before the next CPU ever gets around to asking for that same spinlock address. Then all you pay for is an access to uncached memory (which still takes several hundred CPU cycles), instead of an RFO (which takes even more).
See this thread for a detailed explanation of memory functions, caching, and cache coherency (make sure you have several hours available before reading it).
http://www.osdev.org/phpBB2/viewtopic.php?t=16012
This thread has links to a spinlock tutorial (see osdev64's post):
http://www.osdev.org/phpBB2/viewtopic.php?t=14261
Brendan has some spinlock code here:
http://www.osdev.org/phpBB2/viewtopic.php?t=9669
And the last half of this topic was a discussion on spinlocks, the time they take, and some more spinlock code from Brendan:
http://www.osdev.org/phpBB2/viewtopic.p ... 4&start=45