Letting the kernel handle mutexes / semaphores with IPIs

Tobba · Post by **Tobba** » Sat Jun 04, 2011 5:59 pm

I was toying with the idea of letting the OS fully handle mutexes and semaphores, with a few syscalls

10 syscalls would be provided (5 for semaphores, 5 for mutexes)

One would just set up the values needed in memory, this would be provided for both semaphores and mutexes

For mutexes, there would also be:
Wait and lock
Lock
Is locked
Unlock

For semaphores:
Increase
Wait and decrease
Decrease
Value

With the kernel handeling this, you could make use of IPIs to send an interupt to every processor when; for example, a mutex is locked
Each IPI would put every core in a spinlock, effecivly stopping race conditions
All cores would resume normally after every core is finished processing the interrupt, which would take about 100 cycles (hopefully)
And any threads that was waiting and is now resumed will be queued for execution the next context switch

This came into my head after reading a portion of the APIC section of the Intel manuals, although, I was pretty tired while reading it
Any thoughs?

bluemoon · Post by **bluemoon** » Sat Jun 04, 2011 6:20 pm

Wait and lock - does it mean ordinary lock as per pthread?
Lock - is it trylock ?

If not, I will miss trylock in your environment.
and there is less point to provide an "Is locked" query, since the result will expire as soon as it return.

And I would not waste 100 cycle on every core just to provide quicker respond on mutex.
If anyone want quick respond, he use spinlock.

About the race condition, if 2 or more thread/core tries to lock the same mutex,
you may use a waiting queue to fairly distribute the lock chance,
or to some extend, there is actually no need to be fair;
it's the application programmer's responsibility to use mutex correctly - eg use with condition variable, etc

Tobba · Post by **Tobba** » Sat Jun 04, 2011 6:34 pm

bluemoon wrote:Wait and lock - does it mean ordinary lock as per pthread?
Lock - is it trylock ?

Pretty much

And, after thinking about it, I agree with you, using IPIs for this is not worth it

I'm still thinking about letting the OS handle it though, having a simple "Waiting" flag on the thread would be far more effective than switching to the thread just to have to switch to something else a few wasted cycles later

bluemoon · Post by **bluemoon** » Sat Jun 04, 2011 7:49 pm

Tobba wrote:having a simple "Waiting" flag on the thread

Sounds good.
A thread call to Lock and not gained instantly, a "Waiting flag" will be set and exclude from scheduling.
A call to Unlock will pass the lock to one of the thread and clear its wait state
(you may also apply round robin to unset each thread and produce a fair distribution).

Owen · Post by **Owen** » Sun Jun 05, 2011 4:44 am

The Linux guys invented Futexes. Futexes are awesome. Anyone designing a new OS today, really should consider using them

(And I realise it's been an age since I last updated my blog. oops.)

Tobba · Post by **Tobba** » Sun Jun 05, 2011 8:22 am

Owen wrote:The Linux guys invented Futexes. Futexes are awesome. Anyone designing a new OS today, really should consider using them

(And I realise it's been an age since I last updated my blog. oops.)

Futexes seems awsome, I'll look into it when i come to implement this (Still lots left to do on the kernel before getting into this)

rdos · Post by **rdos** » Sun Jun 05, 2011 10:10 am

Owen wrote:The Linux guys invented Futexes. Futexes are awesome. Anyone designing a new OS today, really should consider using them

They are incorrect. This has been used as the primary synchronization mechanism in RDOS for many years. I call them "critical sections", and I got this idea from some other OS that had a synchronization mechanism that didn't require kernel involvement as long as no blocking is needed. IOW, they are wrong when they claim Linux invented this. It clearly existed before Linux.

OTOH, I moved all the code for user critical sections to kernel some time ago, mostly because I didn't want the application to have access to the data-type, and potentially messing it up. It was safer to use a handle for critical sections that had the data in kernel-space so the user process couldn't mess it up. And since switches to kernel in RDOS goes directly through a callgate (and uses register context only), there isn't much overhead involved.

Owen · Post by **Owen** » Sun Jun 05, 2011 12:04 pm

Perhaps I shouldn't have used the word invented; certainly they don't claim that, and their introductory paper cites a 1995 IEEE paper on the subject; additionally, Futexes do a lot more than critical sections do.

Regardless, their paper demonstrates major performance improvements in real-world workloads from making the fast-path entirely occur in userspace - this on a kernel which takes advantage of SYSENTER/SYSCALL, and therefore has much lower system call overhead than any call gate based implementation can possibly have.

And if a user tramples the user-space part of their futex? Thats their own problem. They're probably trashing their own data also.

rdos · Post by **rdos** » Sun Jun 05, 2011 1:55 pm

I'll have to be skeptical to how their blocking algorithm, based on physical address, works and performs. There is a risk of reuse of physical addresses which could connect an unrelated futex to some old version that used the same physical address. Additionally, the process to connect a physical address to a blocking-list seems non-trivial, and has to be based on some kind of search algoritm. This means blocking scales poorly. I bet their futexes perform extremely bad when they block.

That is the primary problem. Either you have the block-list with the futex (and expose kernel-only data to user), or you have it separately and need to search for the block-list in kernel. A better compromise is problably to have a counter + a handle to the block-list. A handle contains no direct kernel data, and can be dereferenced without searching. Still, I don't think I trust user-space to do the correct locked modifications on the futex. Providing it as part of libc or something might be acceptable, but not to let the application do it for itself.

Owen · Post by **Owen** » Sun Jun 05, 2011 2:06 pm

rdos wrote:I'll have to be skeptical to how their blocking algorithm, based on physical address, works and performs. There is a risk of reuse of physical addresses which could connect an unrelated futex to some old version that used the same physical address. Additionally, the process to connect a physical address to a blocking-list seems non-trivial, and has to be based on some kind of search algoritm. This means blocking scales poorly. I bet their futexes perform extremely bad when they block.

That is the primary problem. Either you have the block-list with the futex (and expose kernel-only data to user), or you have it separately and need to search for the block-list in kernel. A better compromise is problably to have a counter + a handle to the block-list. A handle contains no direct kernel data, and can be dereferenced without searching. Still, I don't think I trust user-space to do the correct locked modifications on the futex. Providing it as part of libc or something might be acceptable, but not to let the application do it for itself.

The case to optimize for is the uncontested case. If you're blocking, you've already failed at scaling. Besides, if you can lock and unlock the futex in 1/5th of the time (And this is realistic with privilege level switch times on x86) then you significantly reduce the time in which there is an opportunity for blocking to occur (Assuming the application holds the lock for the shortest time possible; a hopefully reasonable assumption)

There are two reasons why there is no user space handle to the block list:

This would either require the handle to be global and therefore insecure, or local and therefore not work in shared memory
Holding a block list for a lock which is uncontested is a giant waste of memory - locked memory.

Of course most apps don't touch the futexes themselves; that would be silly

gerryg400 · Post by **gerryg400** » Sun Jun 05, 2011 5:13 pm

rdos wrote:There is a risk of reuse of physical addresses which could connect an unrelated futex to some old version that used the same physical address.

This shouldn't be a problem. Futex_create(or futex_init) needs to register the futex with the kernel which creates an associated kernel object. During futex_lock the process accesses its own virtual memory and if the kernel is entered it does a virt_to_phys using the current page tables and finds the associated kernel futex object. 'Old' physical memory and old objects will never be referenced.

rdos wrote:Additionally, the process to connect a physical address to a blocking-list seems non-trivial, and has to be based on some kind of search algoritm. This means blocking scales poorly. I bet their futexes perform extremely bad when they block.

This is a problem. Converting a physical address to a kernel futex object probably cannot be done at O(1). Caching recent lookups, hash-tables and balanced trees might help with scaling.

I've been thinking about storing a handle to the kernel object in the user-space futex structure. As long a the user programme was well-behaved lookup would be fast. If the user programme trashed the handle the penalty would be a new lookup. On the kernel side there would need to be a way to verify that the handle was correct.

OSwhatever · Post by **OSwhatever** » Sun Jun 05, 2011 6:10 pm

I see futexes as an implementation of user mode scheduling. Many kernels offer user space suspend and resume calls for threads and these can be used for implementing semaphores for example. User mode semaphores will of course only work for resources that are specific for the process in question.

Owen · Post by **Owen** » Sun Jun 05, 2011 6:26 pm

gerryg400 wrote:
rdos wrote:There is a risk of reuse of physical addresses which could connect an unrelated futex to some old version that used the same physical address.
This shouldn't be a problem. Futex_create(or futex_init) needs to register the futex with the kernel which creates an associated kernel object. During futex_lock the process accesses its own virtual memory and if the kernel is entered it does a virt_to_phys using the current page tables and finds the associated kernel futex object. 'Old' physical memory and old objects will never be referenced.

There is no futex(FT_CREATE, ...) syscall. Futexes have no in kernel state. this is important; it means that a Futex is only a few bytes of user-space memory, and in particular, it requires no bytes of locked memory.

The kernel just needs to lock the memory containing any futex that is in use. This is no real issue; it means you can consume 4kB of locked memory per thread. The implicit overhead of a thread is probably more.

rdos wrote:Additionally, the process to connect a physical address to a blocking-list seems non-trivial, and has to be based on some kind of search algoritm. This means blocking scales poorly. I bet their futexes perform extremely bad when they block.
This is a problem. Converting a physical address to a kernel futex object probably cannot be done at O(1). Caching recent lookups, hash-tables and balanced trees might help with scaling.

I've been thinking about storing a handle to the kernel object in the user-space futex structure. As long a the user programme was well-behaved lookup would be fast. If the user programme trashed the handle the penalty would be a new lookup. On the kernel side there would need to be a way to verify that the handle was correct.

Userspace object struct Futex { uint32_t m_futex, uint32_t m_osHandle; };

The kernel can then check that the handle is correct by doing a translation of the Futex' virtual address to physical address and comparing with the value stored in the kernel obejct. These handles would be system global (so that futexes can go in shared memory). The OS sets the handle whenever a wait queue is created, and unsets it when one expires. The whole structure can be kept consistent by use of double-wide compare and swap instructions.

Regardless, this is all unnecessary in practice. For all of Linux's problems, general system performance is not one of them. Remember: If you're taking a mutex lock, you've already lost

gerryg400 · Post by **gerryg400** » Sun Jun 05, 2011 8:24 pm

Owen wrote:There is no futex(FT_CREATE, ...) syscall. Futexes have no in kernel state. this is important; it means that a Futex is only a few bytes of user-space memory, and in particular, it requires no bytes of locked memory.

Ooops, I was assuming something about the Linux implementation without actually looking at it. I did understand that the futex lock itself is in user space but doesn't the kernel have a queue associated with each futex ? I guess that queue could be created when/if contention causes a kernel entry. In fact, since I am trying to implement pthread_mutexes this way I'll have to do that because pthread_mutexes can be initialised statically and locked without pthread_mutex_init ever being called.

Owen wrote:Userspace object struct Futex { uint32_t m_futex, uint32_t m_osHandle; };

The kernel can then check that the handle is correct by doing a translation of the Futex' virtual address to physical address and comparing with the value stored in the kernel obejct. These handles would be system global (so that futexes can go in shared memory). The OS sets the handle whenever a wait queue is created, and unsets it when one expires. The whole structure can be kept consistent by use of double-wide compare and swap instructions.

Does Linux do this ? Perhaps I should have a look at the Linux implementation. So far I've been resisting that.

rdos · Post by **rdos** » Mon Jun 06, 2011 12:31 am

Owen wrote: There is no futex(FT_CREATE, ...) syscall. Futexes have no in kernel state. this is important; it means that a Futex is only a few bytes of user-space memory, and in particular, it requires no bytes of locked memory.

Is there no deletion either? Or only a delete, which in that case seems like bad practise since object deletion should always be matched with object creation.

Owen wrote: Userspace object struct Futex { uint32_t m_futex, uint32_t m_osHandle; };

OK, so there is an handle as well.

OSDev.org

Letting the kernel handle mutexes / semaphores with IPIs

Letting the kernel handle mutexes / semaphores with IPIs

Re: Letting the kernel handle mutexes / semaphores with IPIs

Re: Letting the kernel handle mutexes / semaphores with IPIs

Re: Letting the kernel handle mutexes / semaphores with IPIs

Re: Letting the kernel handle mutexes / semaphores with IPIs

Re: Letting the kernel handle mutexes / semaphores with IPIs

Re: Letting the kernel handle mutexes / semaphores with IPIs

Re: Letting the kernel handle mutexes / semaphores with IPIs

Re: Letting the kernel handle mutexes / semaphores with IPIs

Re: Letting the kernel handle mutexes / semaphores with IPIs

Re: Letting the kernel handle mutexes / semaphores with IPIs

Re: Letting the kernel handle mutexes / semaphores with IPIs

Re: Letting the kernel handle mutexes / semaphores with IPIs

Re: Letting the kernel handle mutexes / semaphores with IPIs

Re: Letting the kernel handle mutexes / semaphores with IPIs