Userland only semaphores
Hi!
I am trying to improve my semaphores to avoid switching between user and kernel code.
For the x86 CPU I use the "xchg" instruction, which provides locking, and spinning...
Have you ever implemented such semaphores?
Most semaphore code I've read was locked using the "lock" prefix and disabled interrupts. I understand how it works, but I don't understand why interrupts are disabled... is that always required (even when using "xchg")? If so, my userland locking may be totally wrong.
Do you differentiate between kernel and user semaphores to improve performance?
Thank you for your help!
INeo
Re:Userland only semaphores
ineo wrote: For the x86 CPU I use the "xchg" instruction, which provides locking, and spinning...
For a mutex xchg is OK, but for a semaphore it's bad. Use inc or dec (or add/sub).
ineo wrote: Have you ever implemented such semaphores?
Only in kernel mode, and not yet stress-tested.
ineo wrote: Most semaphore code I've read was locked using the "lock" prefix and disabled interrupts. I understand how it works, but I don't understand why interrupts are disabled... is that always required (even when using "xchg")? If so, my userland locking may be totally wrong.
Note that you can change byte to anything you like; it functions in the lower half of the range (up to 0x7F for a byte).
Code: Select all
acquire_semaphore:
        mov eax, [esp+4]        ; pointer to the semaphore counter
        lock dec byte [eax]     ; atomically take one unit
        js .semaphore_too_low   ; result went negative: nothing was available
        ret
.semaphore_too_low:
        lock inc byte [eax]     ; give the unit back
        call wait_some_time
        jmp acquire_semaphore   ; and retry
Code: Select all
restore_semaphore:
        mov eax, [esp+4]        ; pointer to the semaphore counter
        lock inc byte [eax]     ; release one unit
        ret
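Since the two routines above read their argument from [esp+4] and return with ret, they can be called like ordinary cdecl functions. Here is a minimal usage sketch in C, assuming the routines are assembled and linked as shown and that wait_some_time is provided elsewhere; all names here are illustrative only:
Code: Select all
/* Hedged sketch: assumes the acquire_semaphore/restore_semaphore routines
 * above are linked in as cdecl-callable symbols. */
extern void acquire_semaphore(volatile unsigned char *sem);
extern void restore_semaphore(volatile unsigned char *sem);

static volatile unsigned char printer_slots = 4;  /* four units available */

void use_printer(void)
{
    acquire_semaphore(&printer_slots);  /* waits if no unit is free */
    /* ... critical work ... */
    restore_semaphore(&printer_slots);
}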
ineo wrote: Do you differentiate between kernel and user semaphores to improve performance?
Yes, there's no need to use the kernel-mode semaphores, and for userland it might be interesting not to suspend on SMP machines if the thread holding the semaphore is currently running. Also, you might want to (ab)use some method of telling the OS that interrupting the task is a bad idea (I intend to use cli/sti in the normal fashion, which generates a GPF that I capture. If it was a cli, the code won't be interrupted at the next clock interrupt, period. A good thing for mutex holders.)
Kernel semaphores do their own cli/sti, and they will never be interrupted while holding one, so they don't have to worry about other threads. And that speeds up your kernel code again.
- Pype.Clicker
Re:Userland only semaphores
The XCHG operation does not need a 'LOCK', by design (IIRC this is in the Intel manual, book 2).
Keep in mind that a 'spinlock' will lead to ultra-poor performance when the lock is taken by another thread/process, especially if lock-holding threads are allowed to get suspended, etc. (i.e. you'll be wasting CPU cycles just checking a variable that *cannot* change, since it's only expected to be modified by a currently sleeping/waiting thread!)
Re:Userland only semaphores
@Candy: Thank you for your detailed answer. In fact I was speaking of mutexes when talking about "xchg", and I use the lock prefix for the other operations (add, sub...).
@Pype: Yes, I know about xchg. My message is a bit confusing because I am dealing with semaphores and mutexes at the same time, and the problems encountered are much the same. Concerning the spinning, it's actually a "clever" one that suspends the process/thread until the lock is available... so there is no performance problem, but that's a good point.
So I think I'll keep some kind of userland locking primitives that do not require RPL0. I was just thinking about their design to improve performance.
I plan to support SMP; do you have pointers to locking-primitive code that works properly on SMP? Mine should work, but as I am not an SMP expert I may be missing something, and testing is quite difficult.
Thanks
INeo
Re:Userland only semaphores
ineo wrote: @Candy: Thank you for your detailed answer. In fact I was speaking of mutexes when talking about "xchg", and I use the lock prefix for the other operations (add, sub...).
Aah... that makes sense. For a mutex, xchg the mutex with a zero, and see if you got a one. That's the basic algorithm.
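To make that concrete, here is a minimal sketch of the "xchg with zero, check for one" mutex in C; the helper names and the use of GCC's __atomic builtins (which compile down to an xchg on x86) are my own assumptions, not code from this thread:
Code: Select all
/* Hedged sketch of the xchg-style mutex described above.
 * Convention from the post: 1 = free, 0 = taken. */
typedef volatile int user_mutex;

static void mutex_acquire(user_mutex *m)
{
    /* atomically store 0 and look at what was there before;
     * if we got a 1 back, the mutex was free and is now ours */
    while (__atomic_exchange_n(m, 0, __ATOMIC_ACQUIRE) != 1)
        ;   /* in real code: yield/sleep here instead of spinning */
}

static void mutex_release(user_mutex *m)
{
    __atomic_store_n(m, 1, __ATOMIC_RELEASE);   /* mark it free again */
}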
ineo wrote: I plan to support SMP; do you have pointers to locking-primitive code that works properly on SMP? Mine should work, but as I am not an SMP expert I may be missing something, and testing is quite difficult.
The entire point of lock is that other devices (including other processors) cannot write to the memory at the same time as your processor. So, if your code works decently with other busmasters, it works decently in an SMP environment.
BTW, you can only test against race conditions in your mind. Consider what /COULD/ happen, no matter how improbable.
I used js in the above example because jc (which works in most cases) can fail if two processors try to take the semaphore at exactly the same time (between the dec and the inc). In that case the second one will not see the semaphore as a negative value but as available, so it'll consider it taken. The other will then see the highest value (or negative, -1) and take another, considering that it did in fact succeed too. Which they both didn't, and that's why js. JS works for at least 127 processors.
- Pype.Clicker
Re:Userland only semaphores
I feel uncomfortable with Candy's code (decrementing and re-incrementing afterwards). I would rather have used the scheme below.
When some process is manipulating the semaphore, it gets a 'magic' value which tells the others they should wait for the updated value to come. Since the process cannot be interrupted (and yes, we can support that in user mode through Pending Virtual Interrupts on P2+), threads on other CPUs that are in the busy-loop shouldn't have to wait for too long...
Code: Select all
get_semaphore:
        mov eax, MAGIC_VALUE
.loop:
        cli                     ; we don't want to keep the MAGIC_VALUE for too long
        xchg [semaphore], eax
        cmp eax, MAGIC_VALUE    ; someone else is currently updating it?
        je .loop
        cmp eax, 0
        je .fall_asleep         ; no units left
        dec eax
        xchg [semaphore], eax   ; store the decremented count: we own the semaphore
        sti
        ret
.fall_asleep:
        xchg [semaphore], eax   ; put the 0 back
        int SLEEP_SYSCALL
        sti
        jmp get_semaphore       ; retry once we've been woken up
Re:Userland only semaphores
Pype.Clicker wrote: I feel uncomfortable with Candy's code (decrementing and re-incrementing afterwards). I would rather have used
...
When some process is manipulating the semaphore, it gets a 'magic' value which tells the others they should wait for the updated value to come. Since the process cannot be interrupted (and yes, we can support that in user mode through Pending Virtual Interrupts on P2+), threads on other CPUs that are in the busy-loop shouldn't have to wait for too long...
The idea behind the increment/decrement logic is that it is not possible for any CPU to ever get a too-high value, or to get access when it shouldn't. By using js instead of jc, you don't rely on the last one being taken, but on any being there. So Pype, the code does work, works reliably, and does not contain a race. The most important point is that it doesn't contain a spinlock, which, combined with caching, is what results in disastrous speed.
If you'd like a magic value for the semaphore, implement it as a counter that's mutexed. That way the stability of the semaphore depends on the stability of the mutex.
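In case it helps, here is a minimal sketch of that "counter that's mutexed" idea in C. It reuses the illustrative mutex_acquire/mutex_release helpers sketched earlier in the thread, and thread_yield() stands in for whatever wait/sleep primitive the OS offers; none of these names come from the original posts:
Code: Select all
/* Hedged sketch: a semaphore built as a mutex-protected counter. */
struct user_sem {
    user_mutex lock;    /* protects 'count' */
    int        count;   /* number of available units */
};

static void sem_down(struct user_sem *s)
{
    for (;;) {
        mutex_acquire(&s->lock);
        if (s->count > 0) {             /* a unit is free: take it */
            s->count--;
            mutex_release(&s->lock);
            return;
        }
        mutex_release(&s->lock);
        thread_yield();                 /* assumed yield/sleep primitive */
    }
}

static void sem_up(struct user_sem *s)
{
    mutex_acquire(&s->lock);
    s->count++;
    mutex_release(&s->lock);
}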
As for PVI, I don't see how you could regain control after the process has disabled virtual interrupts. How do you?
- Pype.Clicker
Re:Userland only semaphores
You mean a user process maliciously does 'cli', knowing that the system will not attempt to switch it away, and thus prevents the preemption system from working at all?
Simple: if IRQ0 occurs with VIF=0 and PVI=0, I set PVI.
If IRQ0 occurs again with VIF=0 and PVI=1, then the process is overusing the 'cli/sti' feature and it gets killed (or at least tagged so that the system doesn't listen to its VIF requests anymore...).
IIRC, the only thing that is automated with the Pending Virtual Interrupt is that some exception (GPF?) is raised when the user program does STI while PVI=1 (which means the system has deferred something and the user program now allows it to do what it kindly requested to be delayed).
Hope I'm not wrong on that point, since it's something I studied a long time ago.
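As a rough illustration of that policy (not Pype's actual code), the timer-interrupt handler could inspect the saved EFLAGS of the interrupted user thread. The frame layout and the kill_or_flag_task() helper below are assumptions made for the sketch; only the VIF/VIP bit positions come from the Intel manuals:
Code: Select all
/* Hedged sketch of the IRQ0 policy described above. */
#define EFLAGS_VIF (1u << 19)   /* Virtual Interrupt Flag */
#define EFLAGS_VIP (1u << 20)   /* Virtual Interrupt Pending */

struct intr_frame { unsigned int eip, cs, eflags, esp, ss; };  /* illustrative */

void on_timer_irq(struct intr_frame *f)
{
    if (!(f->eflags & EFLAGS_VIF)) {        /* task has virtually "cli"-ed */
        if (f->eflags & EFLAGS_VIP)
            kill_or_flag_task();            /* second time: it is overusing cli/sti */
        else
            f->eflags |= EFLAGS_VIP;        /* defer: the CPU faults on its next sti */
    } else {
        /* VIF is set: preempt/reschedule as usual */
    }
}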
Re:Userland only semaphores
I wanted to post to this thread this morning but I've had to wait until I got home from work.
Implementing user mode-only semaphores, aka critical sections or monitors, is non-trivial to do properly. I had a hard time getting a decent algorithm to reproduce Win32's CRITICAL_SECTION behaviour.
My goal here was to come up with a user-mode mutex-like object which would be almost zero cost when there is no contention for the lock, and the same cost as the equivalent kernel-mode object when there is contention. Obviously I could have copied my kernel spinlock implementation in user mode, but that wouldn't have been acceptable: you can't disable interrupts in user mode (at least not under Mobius), and spinning on the lock would be inefficient.
You will need:
-- 1 working kernel-mode semaphore implementation
-- 1 user-mode atomic IncrementAndReturnOldValue function
Here's my acquire algorithm:
Code: Select all
if (IncrementAndReturnOldValue(&locks) > 0)
{
    // some other thread is in the lock
    Down(semaphore);
    // now we own the lock
    locks--;
}
else
{
    // there is no contention
    // locks is already incremented
}
And release:
Code: Select all
if (locks > 1)
{
    // there is another thread blocked on the semaphore
    Up(semaphore);
}
else
    locks = 0;
Importantly, to initialise:
Code: Select all
semaphore = CreateSemaphoreWithInitialCount(0);
locks = 0;
See the attached file for the real code.
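For reference, the atomic IncrementAndReturnOldValue helper could be written along the lines below; this is a sketch assuming a GCC-style builtin (which compiles to a lock xadd on x86), not Mobius's actual implementation. With it, an uncontended acquire costs just one locked instruction, matching the "almost zero cost" goal above.
Code: Select all
/* Hedged sketch of the atomic increment-and-return-old-value helper. */
static int IncrementAndReturnOldValue(volatile int *value)
{
    return __sync_fetch_and_add(value, 1);   /* returns the pre-increment value */
}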
Re:Userland only semaphores
Semaphores in userland only... I can't imagine how one would get the necessary atomicity when task switches are occurring all the time.
So, in my opinion, it is best to have the semaphore mechanism work in kernel land, with interrupts disabled to avoid disturbing occurrences.
In BlueIllusion, semaphores exist only in kernel land, and there are three functions available which do the semaphore stuff: semget, semctl, semop. I'm working on an implementation of semget which permits the creation *and* initialization of a set of semaphores in one atomic operation (see the sketch below for the usual two-step usage this merges).
As for permitting user processes to issue cli/sti: nay, won't do that.
Stay Safe.
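Those three calls mirror the classic System V semaphore interface, so usage looks roughly like the sketch below. The signatures shown are the standard SysV ones, given only for illustration; BlueIllusion's own prototypes may differ. It also shows the separate create-then-initialise step that an atomic semget would merge:
Code: Select all
/* Hedged sketch of SysV-style semaphore usage (standard headers shown;
 * BlueIllusion's actual interface may differ). */
#include <sys/ipc.h>
#include <sys/sem.h>

union semun { int val; struct semid_ds *buf; unsigned short *array; };

int example(void)
{
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);  /* create a set of 1 */

    union semun arg;
    arg.val = 1;
    semctl(id, 0, SETVAL, arg);      /* ...and initialise it in a second step */

    struct sembuf down = { 0, -1, 0 };   /* P: wait for a unit and take it */
    struct sembuf up   = { 0, +1, 0 };   /* V: give a unit back */

    semop(id, &down, 1);   /* enter */
    /* ... critical work ... */
    semop(id, &up, 1);     /* leave */
    return 0;
}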
Re:Userland only semaphores
The previous poster wrote: As for permitting user processes to issue cli/sti: nay, won't do that.
No need to use cli/sti for userland semaphores/mutexes... that's the point. That's why it is possible to do it without switching context. I'll use that for synchronizing my threads.
For the system semaphore/mutexes, I plan to develop some kind of filesystem to reference them (for those shared among several processes).
INeo