Userland only semaphores
Hi!
I am trying to improve my semaphores to avoid switching between user and kernel code.
For the x86 CPU I use the "xchg" instruction, which provides locking, and spinning...
Have you ever implemented such semaphores?
Most semaphore code I've read was locked using the "lock" prefix and disabled interrupts. I understand how it works, but I don't understand why interrupts are disabled... is that always required (even when using "xchg")? If so, my userland locking may be totally wrong.
Do you differentiate between kernel and user semaphores to improve performance?
Thank you for your help!
INeo
Re:Userland only semaphores
ineo wrote: For the x86 CPU I use the "xchg" instruction, which provides locking, and spinning...
For a mutex xchg is OK, but for a semaphore it's bad. Use inc or dec (or add/sub).
ineo wrote: Have you ever implemented such semaphores?
Only in kernel mode, and not yet stress-tested.
ineo wrote: Most semaphore code I've read was locked using the "lock" prefix and disabled interrupts. I understand how it works, but I don't understand why interrupts are disabled... is that always required (even when using "xchg")? If so, my userland locking may be totally wrong.
Note that you can change byte to anything you like; it functions in the lower half of the range (up to 0x7F for a byte).
Code: Select all
acquire_semaphore:
        mov eax, [esp+4]        ; pointer to the semaphore counter
        lock dec byte [eax]     ; atomically take one unit
        js .semaphore_too_low   ; result went negative: nothing was available
        ret
.semaphore_too_low:
        lock inc byte [eax]     ; give the unit back
        call wait_some_time
        jmp acquire_semaphore   ; and retry
Code: Select all
restore_semaphore:
        mov eax, [esp+4]        ; pointer to the semaphore counter
        lock inc byte [eax]     ; release one unit
        ret
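Since the two routines above read their argument from [esp+4] and return with ret, they can be called like ordinary cdecl functions. Here is a minimal usage sketch in C, assuming the routines are assembled and linked as shown and that wait_some_time is provided elsewhere; all names here are illustrative only:
Code: Select all
/* Hedged sketch: assumes the acquire_semaphore/restore_semaphore routines
 * above are linked in as cdecl-callable symbols. */
extern void acquire_semaphore(volatile unsigned char *sem);
extern void restore_semaphore(volatile unsigned char *sem);

static volatile unsigned char printer_slots = 4;  /* four units available */

void use_printer(void)
{
    acquire_semaphore(&printer_slots);  /* waits if no unit is free */
    /* ... critical work ... */
    restore_semaphore(&printer_slots);
}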
ineo wrote: Do you differentiate between kernel and user semaphores to improve performance?
Yes, there's no need to use the kernel-mode semaphores, and for userland it might be interesting not to suspend on SMP machines if the thread holding the semaphore is currently running. Also, you might want to (ab)use some method of telling the OS that interrupting the task is a bad idea (I intend to use cli/sti in the normal fashion, which generates a GPF that I capture. If it was a cli, the code won't be interrupted at the next clock interrupt, period. A good thing for mutex holders.)
Kernel semaphores do their own cli/sti, and they will never be interrupted while holding one, so they don't have to worry about other threads. And that speeds up your kernel code again.
- Pype.Clicker
Re:Userland only semaphores
The XCHG operation does not need a 'LOCK', by design (IIRC this is in the Intel manual, book 2).
Keep in mind that a 'spinlock' will lead to ultra-poor performance when the lock is taken by another thread/process, especially if lock-holding threads are allowed to get suspended, etc. (i.e. you'll be wasting CPU cycles just checking a variable that *cannot* change, since it's only expected to be modified by a currently sleeping/waiting thread!)
Re:Userland only semaphores
@Candy: Thank you for your detailed answer. In fact I was speaking of mutexes when talking about "xchg", and I use the lock prefix for the other operations (add, sub...).
@Pype: Yes, I know about xchg. My message is a bit confusing because I am dealing with semaphores and mutexes at the same time, and the problems encountered are much the same. Concerning the spinning, it's actually a "clever" one that suspends the process/thread until the lock is available... so there is no performance problem, but that's a good point.
So I think I'll keep some kind of userland locking primitives that do not require RPL0. I was just thinking about their design to improve performance.
I plan to support SMP; do you have pointers to locking-primitive code that works properly on SMP? Mine should work, but as I am not an SMP expert I may be missing something, and testing is quite difficult.
Thanks
INeo
Re:Userland only semaphores
ineo wrote: @Candy: Thank you for your detailed answer. In fact I was speaking of mutexes when talking about "xchg", and I use the lock prefix for the other operations (add, sub...).
Aah... that makes sense. For a mutex, xchg the mutex with a zero, and see if you got a one. That's the basic algorithm.
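To make that concrete, here is a minimal sketch of the "xchg with zero, check for one" mutex in C; the helper names and the use of GCC's __atomic builtins (which compile down to an xchg on x86) are my own assumptions, not code from this thread:
Code: Select all
/* Hedged sketch of the xchg-style mutex described above.
 * Convention from the post: 1 = free, 0 = taken. */
typedef volatile int user_mutex;

static void mutex_acquire(user_mutex *m)
{
    /* atomically store 0 and look at what was there before;
     * if we got a 1 back, the mutex was free and is now ours */
    while (__atomic_exchange_n(m, 0, __ATOMIC_ACQUIRE) != 1)
        ;   /* in real code: yield/sleep here instead of spinning */
}

static void mutex_release(user_mutex *m)
{
    __atomic_store_n(m, 1, __ATOMIC_RELEASE);   /* mark it free again */
}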
ineo wrote: I plan to support SMP; do you have pointers to locking-primitive code that works properly on SMP? Mine should work, but as I am not an SMP expert I may be missing something, and testing is quite difficult.
The entire point of lock is that other devices (including other processors) cannot write to the memory at the same time as your processor. So, if your code works decently with other busmasters, it works decently in an SMP environment.
BTW, you can only test against race conditions in your mind. Consider what /COULD/ happen, no matter how improbable.
I used js in the above example because jc (which works in most cases) can fail if two processors try to take the semaphore at exactly the same time (between the dec and the inc). In that case the second one will not see the semaphore as a negative value but as available, so it'll consider it taken. The other will then see the highest value (or negative, -1) and take another, considering that it did in fact succeed too. Which they both didn't, and that's why js. JS works for at least 127 processors.
- Pype.Clicker
Re:Userland only semaphores
I feel uncomfortable with Candy's code (decrementing and re-incrementing afterwards). I would rather have used the scheme below.
When some process is manipulating the semaphore, it gets a 'magic' value which tells the others they should wait for the updated value to come. Since the process cannot be interrupted (and yes, we can support that in user mode through Pending Virtual Interrupts on P2+), threads on other CPUs that are in the busy-loop shouldn't have to wait for too long...
Code: Select all
get_semaphore:
        mov eax, MAGIC_VALUE
.loop:
        cli                     ; we don't want to keep the MAGIC_VALUE for too long
        xchg [semaphore], eax
        cmp eax, MAGIC_VALUE    ; someone else is currently updating it?
        je .loop
        cmp eax, 0
        je .fall_asleep         ; no units left
        dec eax
        xchg [semaphore], eax   ; store the decremented count: we own the semaphore
        sti
        ret
.fall_asleep:
        xchg [semaphore], eax   ; put the 0 back
        int SLEEP_SYSCALL
        sti
        jmp get_semaphore       ; retry once we've been woken up
Re:Userland only semaphores
Pype.Clicker wrote: I feel uncomfortable with Candy's code (decrementing and re-incrementing afterwards). I would rather have used
...
When some process is manipulating the semaphore, it gets a 'magic' value which tells the others they should wait for the updated value to come. Since the process cannot be interrupted (and yes, we can support that in user mode through Pending Virtual Interrupts on P2+), threads on other CPUs that are in the busy-loop shouldn't have to wait for too long...
The idea behind the increment/decrement logic is that it is not possible for any CPU to ever get a too-high value, or to get access when it shouldn't. By using js instead of jc, you don't rely on the last one being taken, but on any being there. So Pype, the code does work, works reliably, and does not contain a race. The most important point is that it doesn't contain a spinlock, which, combined with caching, is what results in disastrous speed.
If you'd like a magic value for the semaphore, implement it as a counter that's mutexed. That way the stability of the semaphore depends on the stability of the mutex.
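In case it helps, here is a minimal sketch of that "counter that's mutexed" idea in C. It reuses the illustrative mutex_acquire/mutex_release helpers sketched earlier in the thread, and thread_yield() stands in for whatever wait/sleep primitive the OS offers; none of these names come from the original posts:
Code: Select all
/* Hedged sketch: a semaphore built as a mutex-protected counter. */
struct user_sem {
    user_mutex lock;    /* protects 'count' */
    int        count;   /* number of available units */
};

static void sem_down(struct user_sem *s)
{
    for (;;) {
        mutex_acquire(&s->lock);
        if (s->count > 0) {             /* a unit is free: take it */
            s->count--;
            mutex_release(&s->lock);
            return;
        }
        mutex_release(&s->lock);
        thread_yield();                 /* assumed yield/sleep primitive */
    }
}

static void sem_up(struct user_sem *s)
{
    mutex_acquire(&s->lock);
    s->count++;
    mutex_release(&s->lock);
}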
As for PVI, I don't see how you could regain control after the process has disabled virtual interrupts. How do you?
- Pype.Clicker
Re:Userland only semaphores
You mean a user process maliciously does 'cli', knowing that the system will not attempt to switch it away, and thus prevents the preemption system from working at all?
Simple: if IRQ0 occurs with VIF=0 and PVI=0, I set PVI.
If IRQ0 occurs again with VIF=0 and PVI=1, then the process is overusing the 'cli/sti' feature and it gets killed (or at least tagged so that the system doesn't listen to its VIF requests anymore...).
IIRC, the only thing that is automated with the Pending Virtual Interrupt is that some exception (GPF?) is raised when the user program does STI while PVI=1 (which means the system has deferred something and the user program now allows it to do what it kindly requested to be delayed).
Hope I'm not wrong on that point, since it's something I studied a long time ago.
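As a rough illustration of that policy (not Pype's actual code), the timer-interrupt handler could inspect the saved EFLAGS of the interrupted user thread. The frame layout and the kill_or_flag_task() helper below are assumptions made for the sketch; only the VIF/VIP bit positions come from the Intel manuals:
Code: Select all
/* Hedged sketch of the IRQ0 policy described above. */
#define EFLAGS_VIF (1u << 19)   /* Virtual Interrupt Flag */
#define EFLAGS_VIP (1u << 20)   /* Virtual Interrupt Pending */

struct intr_frame { unsigned int eip, cs, eflags, esp, ss; };  /* illustrative */

void on_timer_irq(struct intr_frame *f)
{
    if (!(f->eflags & EFLAGS_VIF)) {        /* task has virtually "cli"-ed */
        if (f->eflags & EFLAGS_VIP)
            kill_or_flag_task();            /* second time: it is overusing cli/sti */
        else
            f->eflags |= EFLAGS_VIP;        /* defer: the CPU faults on its next sti */
    } else {
        /* VIF is set: preempt/reschedule as usual */
    }
}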
Re:Userland only semaphores
I wanted to post to this thread this morning but I've had to wait until I got home from work.
Implementing user mode-only semaphores, aka critical sections or monitors, is non-trivial to do properly. I had a hard time getting a decent algorithm to reproduce Win32's CRITICAL_SECTION behaviour.
My goal here was to come up with a user-mode mutex-like object which would be almost zero cost when there is no contention for the lock, and the same cost as the equivalent kernel-mode object when there is contention. Obviously I could have copied my kernel spinlock implementation in user mode, but that wouldn't have been acceptable: you can't disable interrupts in user mode (at least not under Mobius), and spinning on the lock would be inefficient.
You will need:
-- 1 working kernel-mode semaphore implementation
-- 1 user-mode atomic IncrementAndReturnOldValue function
Here's my acquire algorithm:
Code: Select all
if (IncrementAndReturnOldValue(&locks) > 0)
{
    // some other thread is in the lock
    Down(semaphore);
    // now we own the lock
    locks--;
}
else
{
    // there is no contention
    // locks is already incremented
}
And release:
Code: Select all
if (locks > 1)
{
    // there is another thread blocked on the semaphore
    Up(semaphore);
}
else
    locks = 0;
Importantly, to initialise:
Code: Select all
semaphore = CreateSemaphoreWithInitialCount(0);
locks = 0;
See the attached file for the real code.
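For reference, the atomic IncrementAndReturnOldValue helper could be written along the lines below; this is a sketch assuming a GCC-style builtin (which compiles to a lock xadd on x86), not Mobius's actual implementation. With it, an uncontended acquire costs just one locked instruction, matching the "almost zero cost" goal above.
Code: Select all
/* Hedged sketch of the atomic increment-and-return-old-value helper. */
static int IncrementAndReturnOldValue(volatile int *value)
{
    return __sync_fetch_and_add(value, 1);   /* returns the pre-increment value */
}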
Re:Userland only semaphores
Semaphores in userland only... I can't imagine how one would get the necessary atomicity when task switches are occurring all the time.
So, in my opinion, it is best to have the semaphore mechanism work in kernel land, with interrupts disabled to avoid disturbing occurrences.
In BlueIllusion, semaphores exist only in kernel land, and there are three functions available which do the semaphore stuff: semget, semctl, semop. I'm working on an implementation of semget which permits the creation *and* initialization of a set of semaphores in one atomic operation (see the sketch below for the usual two-step usage this merges).
As for permitting user processes to issue cli/sti: nay, won't do that.
Stay Safe.
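Those three calls mirror the classic System V semaphore interface, so usage looks roughly like the sketch below. The signatures shown are the standard SysV ones, given only for illustration; BlueIllusion's own prototypes may differ. It also shows the separate create-then-initialise step that an atomic semget would merge:
Code: Select all
/* Hedged sketch of SysV-style semaphore usage (standard headers shown;
 * BlueIllusion's actual interface may differ). */
#include <sys/ipc.h>
#include <sys/sem.h>

union semun { int val; struct semid_ds *buf; unsigned short *array; };

int example(void)
{
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);  /* create a set of 1 */

    union semun arg;
    arg.val = 1;
    semctl(id, 0, SETVAL, arg);      /* ...and initialise it in a second step */

    struct sembuf down = { 0, -1, 0 };   /* P: wait for a unit and take it */
    struct sembuf up   = { 0, +1, 0 };   /* V: give a unit back */

    semop(id, &down, 1);   /* enter */
    /* ... critical work ... */
    semop(id, &up, 1);     /* leave */
    return 0;
}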
Re:Userland only semaphores
The previous poster wrote: As for permitting user processes to issue cli/sti: nay, won't do that.
No need to use cli/sti for userland semaphores/mutexes... that's the point. That's why it is possible to do it without switching context. I'll use that for synchronizing my threads.
For the system semaphore/mutexes, I plan to develop some kind of filesystem to reference them (for those shared among several processes).
INeo