Hi,
mutex wrote:The theory on this is pretty clear in my mind, but I have problems with the "physical" since the behaviour of at least VirtualBox seems very strange on this. Bochs & Core i7 2600 work without any issues at all.
Some emulators aren't as realistic as others. For example, last time I looked at it (when emulating multiple CPUs) Qemu would emulate one CPU for 100 ms, then emulate the next CPU for 100 ms, etc. The end result (for spinlocks) is that one CPU might repeatedly acquire and release the spinlock for 100 ms, then the remaining CPUs might all find it locked and fail to acquire it (and spin for 100 ms each!), and eventually the first CPU gets another 100 ms and repeatedly acquires and releases the spinlock for another 100 ms.
I wouldn't be surprised if VirtualBox has similar problems, especially if you're not using hardware virtualisation or the number of emulated CPUs is greater than the number of actual/host CPUs.
Bochs tends to do "several" (it's configurable) instructions on one CPU, then "several" on the next CPU, etc. If I remember right, the default value is 5 instructions, which is much more realistic (compared to Qemu, where 100 ms works out to millions of instructions). Of course real CPUs all execute instructions at the same time and don't take turns, which is even more realistic.
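If you want to experiment with this in Bochs, the quantum can be set in bochsrc; something like the following should work, if I've remembered the option names correctly (check your Bochs version's documentation):
Code: Select all
# bochsrc fragment: 4 emulated CPUs, switching CPUs every 5 instructions
cpu: count=4, ips=50000000, quantum=5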
Also note that Intel recommend doing a "test (without LOCK prefix)" followed by a "test and set (with LOCK prefix)" to minimise bus locking. This is slightly pessimistic for a spinlock where lock contention should be very rare, so I prefer to try an initial "test and set" first. For example:
Code: Select all
acquireLock:
    mov eax,1

    ;Initial "test and set"

    xchg [lock],eax               ;Atomically swap 1 into the lock (LOCK is implied for XCHG)
    test eax,eax                  ;Old value of 0 means the lock was free and is now ours
    je .acquired

    ;Main "test" loop

.tryLock:
    pause
    cmp dword [lock],0            ;No LOCK here (which is a good thing)!
    jne .tryLock

    ;Main "test and set"
    ;(eax still holds 1 here - the nonzero value the failed XCHG read from the lock)

    xchg [lock],eax
    test eax,eax
    jne .tryLock

.acquired:
Note 1: for the XCHG instruction, the LOCK is implied. An explicit LOCK prefix does nothing (and only wastes space).
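For what it's worth, releasing this lock needs no LOCK prefix (and no atomic read-modify-write) at all - a plain store is enough, because aligned dword stores are atomic on x86. A minimal sketch, using the same [lock] variable as above:
Code: Select all
releaseLock:
    mov dword [lock],0    ;Plain store - aligned dword writes are atomic, no LOCK needed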
Note 2: the lock itself only uses 1 bit, so the remaining 31 bits are wasted. It's possible to use these remaining bits to detect bugs, like the same CPU trying to acquire the same lock multiple times (e.g. store a "CPU number" in those 31 bits), and attempts to free a lock that wasn't acquired. In the same way you can monitor lock contention (e.g. increment the value in those 31 bits, or in a dword at a different address, if the first attempt to acquire the lock fails). This sort of thing can be combined with "#ifdef DEBUGGING", and can make it a lot easier to detect bugs (including excessive lock contention).
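To make Note 2 more concrete, here's a rough sketch (the "(CPU number << 1) | 1" owner encoding, the getCpuNumber helper and the lockError* labels are inventions of mine for illustration, not anything standard):
Code: Select all
acquireLockChecked:
%ifdef DEBUGGING
    call getCpuNumber                ;Hypothetical helper: returns this CPU's number in ebx
    lea edx,[ebx*2+1]                ;Owner value: CPU number in the upper 31 bits, lock bit set
    cmp [lock],edx
    je lockErrorDoubleAcquire        ;Bug: this CPU already holds this lock (would deadlock)
%else
    mov edx,1
%endif

    ;Initial "test and set"

    mov eax,edx
    xchg [lock],eax
    test eax,eax
    je .acquired

%ifdef DEBUGGING
    lock inc dword [contention_counter]  ;First attempt failed - record the contention
%endif

    ;Main "test" loop, then "test and set" (same as before)

.tryLock:
    pause
    cmp dword [lock],0
    jne .tryLock
    mov eax,edx                      ;Restore our value (a failed XCHG clobbered eax)
    xchg [lock],eax
    test eax,eax
    jne .tryLock

.acquired:
    ret

releaseLockChecked:
%ifdef DEBUGGING
    call getCpuNumber
    lea edx,[ebx*2+1]
    cmp [lock],edx
    jne lockErrorNotOwner            ;Bug: freeing a lock this CPU doesn't hold
%endif
    mov dword [lock],0
    ret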
I'd also recommend re-reading gerryg400's post. He's perfectly correct: if a simple spinlock (like yours) is under high contention, then you've got a design flaw and/or should be using a type of lock that is "fair" (e.g. where CPUs acquire the lock in the order they arrive, and not in a potentially "unfair" or unpredictable order). The ticket lock gerryg400 mentioned is one example of a "fair" lock.
To understand how a "ticket lock" would work, start by reading Wikipedia's page on Lamport's Bakery Algorithm. If you think about it enough, you'll see that it can be implemented with 2 variables (one for the "next ticket value" and one for the "ticket value currently being served"); and with a "lock xadd" (to get the ticket), followed by a "while(current_value != ticket)" loop to spin, then an "inc" or "mov" (without a LOCK prefix) to release the lock. For example:
Code: Select all
acquireLock:
    ;Get ticket

    mov eax,1
    lock xadd [next_ticket],eax      ;eax = my ticket; next_ticket is atomically incremented

    ;Check if no waiting is needed

    cmp [current_ticket],eax
    je .acquired

%ifdef DEBUGGING
    lock inc dword [contention_counter]
%endif

    ;Wait for my turn

.retry:
    pause
    cmp [current_ticket],eax         ;Plain read - no LOCK prefix needed
    jne .retry

.acquired:
Note 3: Notice that this doesn't do the "test, then test and set" thing that Intel recommends. This is because the "wait for my turn" loop doesn't use a LOCK prefix to begin with, so there's no bus locking to avoid. I put an initial test in anyway (so that when there's no contention the PAUSE instruction is never executed, and to make it easy to add the contention counter debugging code to the example).
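Also for completeness, releasing the ticket lock is just a matter of advancing current_ticket; because only the current lock holder ever writes to it, no LOCK prefix is needed. A minimal sketch using the same variables as above:
Code: Select all
releaseLock:
    inc dword [current_ticket]    ;Only the lock holder writes this, so no LOCK prefix is needed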
Of course for an emulator like Qemu (where CPUs are emulated one at a time, each executing a huge number of instructions), a "fair" lock like this can still cause serious performance problems when there's lots of lock contention - e.g. every 100 ms the next CPU might get the lock once (because it's that CPU's turn) and then spin for 99.9 ms trying to get it a second time (and failing, because it's not that CPU's turn anymore). In general this is the emulator's fault and not your code's problem (and there's nothing you can do to fix it, other than avoiding lock contention).
Cheers,
Brendan