Spinlocks page wishlist

Teodor Väänänen · Post by **Teodor Väänänen** » Thu Jan 22, 2009 3:40 am

Noted the creation of the page on spinlocks. I was waiting for it to appear.

A couple of things strike me as something that could be elaborated on a little:

When I've seen spinlock code in the past, I've also seen the use of a special instruction to avoid penalties with the processor pipeline or cache, IIRC the PAUSE instruction. Maybe a separate page, or a subsection on why to use it?
I've seen other forms of spinlocks than the ones using BTx, IIRC Linux uses XCHG. A comparison on methods maybe?
I seem to recall that the preferred way is to use a plain BT to see if the approriate bit is set/clear, then use
a LOCK BTx to actually try to get the lock. Any thoughts on this?
Then one thing which might be a little off-topic, which is read/write spinlocks. I have in my possesion both a photo copy and a manually HTMLized copy of 'Concurrent Control with "Readers" and"Writers"' from Communications of the ACM, Vol #14, No 10, pp 668-669, which explores the more advanced flavors of locks and mutexes. Any interest of adding it to the Spinlocks page or into some page in the IPC category? The pages copyright notice clearly says "General permission to republish, but not for profit, all or part of this material is granted, provided that reference is made to this publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery".
Another potential off-topic issue (don't be afraid to direct me to the appropriate forums) is the issue of a simple timeout mechanism. Is a simple counting mechanism (either LOOP eCX or something else) which puts a maximum limit to the number of tries before giving up something which could be useful in kernel spinlock code?

Any thoughts? Other people with opinions on the wiki page or on what I said?

Love4Boobies · Post by **Love4Boobies** » Thu Jan 22, 2009 6:29 am

I'll try adding a wiki article about it. No one actually checks to see requests... sorry...

quok · Post by **quok** » Thu Jan 22, 2009 9:39 am

I was thinking perhaps the spinlocks page should give an overview of what they are, how they work, and a really general (non-optimized) easy to use sample implementation. After that, explain the problems with the above implementation, and include stubs (perhaps even eventually linking to full articles) on better solutions using things like exponential backoff, queued spinlocks, and lockless algorithms. Each has advantages and disadvantages of course, and those should be covered. Include links to forum topics, whitepapers, perhaps even open source implementations.

I don't have time to work on this now (getting ready for a well deserved 2 week vacation -- leaving tomorrow night), but I'll definitely include what I know.

Brendan · Post by **Brendan** » Fri Jan 23, 2009 3:39 am

Hi,

Some notes for more discussion, etc.

Teodor Väänänen wrote:When I've seen spinlock code in the past, I've also seen the use of a special instruction to avoid penalties with the processor pipeline or cache, IIRC the PAUSE instruction. Maybe a separate page, or a subsection on why to use it?

For a tight loop the CPU can pipeline many different iterations of the loop. For a spinloop this is bad because you can end up with the same instruction repeated in the CPUs pipeline several times, which is pointless and increases the time taken to exit the loop. It's even worse for hyper-threading (where the other logical CPU could be using the pipeline instead). The PAUSE instruction prevents this problem (AFAIK by waiting for the previous instruction to complete before shoving more instructions onto the pipeline) and prevents the loop exit penalty.

Teodor Väänänen wrote:I've seen other forms of spinlocks than the ones using BTx, IIRC Linux uses XCHG. A comparison on methods maybe?

The XCHG instruction is a little faster and probably consumes less bytes (no LOCK prefix is needed because it's implied). The BTS method only relies one one bit, which means that the remainder of the dword can be used for extra information, like recording which task acquired the lock (to detect deadlocks and/or when someone tries to free the lock when someone else acquired it), or having a "this lock was under contention" flag that's polled occasionally to see which locks are under contention and which locks aren't.

Note: There's also other things you can do to detect programming errors, like checking that the lock was acquired when you attempt to free it. It's always better to have a little extra code to find problems like this than to have random unexpected crashes that you can never find (especially for alpha/beta code).

Teodor Väänänen wrote:[*]I seem to recall that the preferred way is to use a plain BT to see if the approriate bit is set/clear, then use
a LOCK BTx to actually try to get the lock. Any thoughts on this?

The "test then test-and-set" method is Intel's recommended method. The idea is to avoid continually LOCKing when there's lock contention; which can be a serious problem if 2 or more CPUs are trying to acquire an already locked lock, because you end up with a cache line constantly bouncing from CPU to CPU (while the bus is constantly locked) rather than having the cache line in the shared state in all CPUs (and a lot less LOCKed bus traffic).

I'd also add something about the bakery algorithm. It has similar properties as a spinlock but also makes sure that tasks get the lock in the order they arrive (which prevents starvation). The disadvantage is that it costs twice as much data (e.g. either 2 cache lines instead of one, or 2 dwords instead of one).

The code is simple enough:

Code: Select all

    mov ebx,1
    lock xadd [ticket],ebx   ;ebx = my ticket number

    cmp [counter],ebx        ;Is it my turn?
    je .acquired             ; yes

.retry:
    pause
    cmp [counter],ebx        ;Is it my turn?
    je .retry                ; no, keep waiting
.acquired:

    < Critical Section Here >

    lock inc [counter]

I'd guess spinlocks are better if lock contention is unlikely, while a "bakery" is better if lock contention is expected. For a small amount of lock contention I'd use a bakery where the "ticket" and "counter" are on the same cache line (to reduce cache line fetches) while for moderate lock contention I'd put "ticket" and "counter" on different cache lines. For even heavier lock contention I'd find a way to reduce lock contention...

Also note that re-entrancy locks protect data, and the critical section/s are pieces of code that work on that data. Some people assume that re-entrancy locks protect critical sections, but that's just wrong.

It's probably a good idea to have something about avoiding deadlocks too - e.g. always acquire locks in the same order, and always free locks in the reverse order.

Doh - there's a mistake on the Wiki page too - the BTC instruction is "Bit Test and Compliment". You'd want BTR (Bit Test and Reset). This might have been my fault (I have a habit of thinking BTC is "Bit Test and Clear" when it's not

).

Cheers,

Brendan

Love4Boobies · Post by **Love4Boobies** » Fri Jan 23, 2009 6:49 am

Brendan wrote:For a tight loop the CPU can pipeline many different iterations of the loop. For a spinloop this is bad because you can end up with the same instruction repeated in the CPUs pipeline several times, which is pointless and increases the time taken to exit the loop. It's even worse for hyper-threading (where the other logical CPU could be using the pipeline instead). The PAUSE instruction prevents this problem (AFAIK by waiting for the previous instruction to complete before shoving more instructions onto the pipeline) and prevents the loop exit penalty.

Note that this is new only for newer Intel CPUs (don't know about AMD). I'm too lazy to check which but a CPU test could be done to ensure that the optimum spinlock implementation is used. I will try find to write the article today. To make my job easier, and for the sake of completeness, I'll copy & paste some of the things you said, if you don't mind, Brendan.

JohnnyTheDon · Post by **JohnnyTheDon** » Fri Jan 23, 2009 9:25 am

Love4Boobies wrote:
Brendan wrote:For a tight loop the CPU can pipeline many different iterations of the loop. For a spinloop this is bad because you can end up with the same instruction repeated in the CPUs pipeline several times, which is pointless and increases the time taken to exit the loop. It's even worse for hyper-threading (where the other logical CPU could be using the pipeline instead). The PAUSE instruction prevents this problem (AFAIK by waiting for the previous instruction to complete before shoving more instructions onto the pipeline) and prevents the loop exit penalty.
Note that this is new only for newer Intel CPUs (don't know about AMD). I'm too lazy to check which but a CPU test could be done to ensure that the optimum spinlock implementation is used. I will try find to write the article today. To make my job easier, and for the sake of completeness, I'll copy & paste some of the things you said, if you don't mind, Brendan.

On a computer that doesn't have the PAUSE instruction, it translates to rep nop which does nothing. At the worst, it causes an extra instruction decode before the lock is tested again.

OSDev.org

Spinlocks page wishlist

Spinlocks page wishlist

Re: Spinlocks page wishlist

Re: Spinlocks page wishlist

Re: Spinlocks page wishlist

Re: Spinlocks page wishlist

Re: Spinlocks page wishlist