Critical sections in userspace without syscalls?

Nable · Post by **Nable** » Mon Jul 23, 2012 6:21 am

> 386, 486
> SMP
You've touched my heart, I just want to see such system.

gerryg400 · Post by **gerryg400** » Mon Jul 23, 2012 6:26 am

kernel_lock_futex:
    LockKernel    ; take an exclusive lock
    mov al,val
    cmp al,3
    jnz unlock_scheduler

    Block
    jmp done

unlock_scheduler:
   UnlockKernel

done:
    ret

rdos, my first implementation was also like this but it turned out that under reasonably heavy load, that threads would get stuck on the queue while 2 threads would lock/unlock the mutex in turn. That's the reason that, if the variable has been changed and a thread needs to return and retry, it blocks and lets the next thread from the queue do the retry. It made a huge difference to the fairness in my limited testing so far.

rdos · Post by **rdos** » Mon Jul 23, 2012 6:27 am

gerryg400 wrote:I don't bother to pass a value to the kernel either. I just check the highest bit. Actually my user space futex variable is different from the one in the article as well. It contains the following fields
Code: Select all
bit31: waiter bit
bit30: spare
bits29 to bit0: thread id
I have the tid in the lock to allow recursive locks and/or to check for deadlock. It also means that the kernel can check the owner for when I get around to implementing priority inheritance. I haven't put a whole lot of thought into that yet though.

I have one additional objection to the definition of the futex. Kernel is passed an address, which proposedly needs to be translated into a physical address, and then the kernel queue must be searched for by looking up the physical address. This really sucks, especially if there are many futexes in the system, which there will be in mine.

I would probably define it like this in userspace:

Code: Select all


class TSection
{
public:
    TSection()
    {
        handle = RdosCreateSection(&val);
        val = 0;
    }

    ...

private:
    int handle;
    char val;
};

The "futex" lock would be defined like:

void RdosEnterSection(int handle);

and the unlock:

void RdosLeaveSection(int handle);

In kernel, the usual dereference of the handle will get the address of the variable and the associated queue.

The structure associated with the handle will look like this:

Code: Select all


struct section_handle
{
    handle_base base;
    char *user_val_ptr;
    int *queue_head;
};

gerryg400 · Post by **gerryg400** » Mon Jul 23, 2012 6:32 am

I've thought about using a handle too, cached in the userspace structure but haven't bothered yet.

Honestly, by the time you enter the kernel, the lin2phys() and a hash table lookup aren't too expensive. Remember that this code only executes if the lock is missed. And that hardly ever happens, right ?

Once I've finished testing for correctness in a few weeks, I'll try some benchmarks for speed.

rdos · Post by **rdos** » Mon Jul 23, 2012 6:35 am

Nable wrote:> 386, 486
> SMP
You've touched my heart, I just want to see such system.

That's not the point. The point is that userlevel code written for my OS should run both on a 386 and AMD Phenom / Intel Core Duo, and every other available x86 CPU starting with 386. If I code 486+ only constructs into applications, this no longer works.

Besides, there are no drawbacks that I can see with replacing cmpxchg with bts and sub with rcr. The code should run with similar speed. The only possible drawback is that it cannot easily be coded in portable C.

rdos · Post by **rdos** » Mon Jul 23, 2012 7:25 am

gerryg400 wrote:I have the tid in the lock to allow recursive locks and/or to check for deadlock. It also means that the kernel can check the owner for when I get around to implementing priority inheritance. I haven't put a whole lot of thought into that yet though.

I forgot about that aspect. It is preferable if the same thread can lock the same mutex multiple times (in my case it is an requirement), without syscalls if possible.

Maybe the mutex could be defined like this instead:

Code: Select all


class mutex
    mutex()
    {
        handle = RdosCreateSection(&val);
        owner = 0;
        counter = 0;
    }

    lock()
    {
        if (owner == str())
            counter++;
        else
        {
              acquire (as before)

               owner = str();
               counter = 1;
         }
     }

     unlock()
     {
         counter--;

         if (counter == 0)
         {
            owner = 0;
            release (as before)
          }
     }

private:
     int handle;
     char val;
     int counter;
     short int owner;
}

str() above would be the x86 str instruction which can be executed at user-level, and at least in my OS, is usable as an unique thread-id. Another possibility is fs selector if PE executables are used. Other alternatives would require syscalls.

rdos · Post by **rdos** » Mon Jul 23, 2012 7:58 am

The final (userlevel) solution would thus look like this in pseudo-x86 assembly:

Code: Select all


struct mutex
{
    int handle;
    int val;
    int owner;
    int counter;
};

; es:edi contains mutex

init:
    push edi
    lea edi,[edi].val
    ebx = RdosCreateSection(es:edi)
    pop edi
    mov [edi].handle,ebx
    mov [edi].val,0
    mov [edi].counter,0
    mov [edi].owner,0
    ret

lock:
    str eax
    cmp eax,[edi].owner
    jne lock_do

    inc [edi].counter
    ret

lock_do:
    lock bts [edi].val,0
    jnc lock_take

lock_retry:
    mov eax,3
    xchg eax,[edi].val
    or eax,eax
    jz lock_take

    mov ebx,[edi].handle
    RdosEnterSection(ebx)

    jmp lock_retry

lock_take:
    str eax
    mov [edi].owner,eax
    mov [edi].counter,1
    ret

unlock:
    str eax
    cmp eax,[edi].owner    ; check owner for safety
    jne unlock_done

    sub [edi].counter,1
    jnz unlock_done

    xor eax,eax           
    mov [edi].owner,eax
    lock shr [edi].val,1         ; changed to shr as rcr doesn't affect ZF
    jz unlock_done              ; fixed this as it is when val becomes zero we can skip the syscall

    mov [edi].val,eax
    mov ebx,[edi].handle
    RdosLeaveSection(ebx)

unlock_done:
    ret

Edit: fixed some bugs.

rdos · Post by **rdos** » Mon Jul 23, 2012 8:25 am

gerryg400 wrote:rdos, my first implementation was also like this but it turned out that under reasonably heavy load, that threads would get stuck on the queue while 2 threads would lock/unlock the mutex in turn. That's the reason that, if the variable has been changed and a thread needs to return and retry, it blocks and lets the next thread from the queue do the retry. It made a huge difference to the fairness in my limited testing so far.

OK, that seems to be a reasonable fix.

rdos · Post by **rdos** » Mon Jul 23, 2012 9:17 am

Turns out that SHR cannot be prefixed with LOCK according to Intel, so I need some other solution if the code should execute on 386. The only instructions that can be prefixed with lock are ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG,
DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG.

Owen · Post by **Owen** » Mon Jul 23, 2012 9:47 am

rdos wrote:
gerryg400 wrote:I don't bother to pass a value to the kernel either. I just check the highest bit. Actually my user space futex variable is different from the one in the article as well. It contains the following fields
Code: Select all
bit31: waiter bit
bit30: spare
bits29 to bit0: thread id
I have the tid in the lock to allow recursive locks and/or to check for deadlock. It also means that the kernel can check the owner for when I get around to implementing priority inheritance. I haven't put a whole lot of thought into that yet though.
I have one additional objection to the definition of the futex. Kernel is passed an address, which proposedly needs to be translated into a physical address, and then the kernel queue must be searched for by looking up the physical address. This really sucks, especially if there are many futexes in the system, which there will be in mine.

The rationale for a physical address is twofold:

It allows the kernel to recycle the wait queues for any futex with no waiters
It allows the use of futexes in shared memory regions

A hash table isn't that expensive - especially when you consider that its about to be followed by the thread sleeping and a context switch. That said, you could easily have both forms, if you somehow keep the allocated handles as invalid futex addresses - a simple way would be to set the lowest bit of the handle to 1, for example.

rdos · Post by **rdos** » Mon Jul 23, 2012 1:35 pm

Owen wrote:The rationale for a physical address is twofold:

It allows the kernel to recycle the wait queues for any futex with no waiters

It allows the use of futexes in shared memory regions
A hash table isn't that expensive - especially when you consider that its about to be followed by the thread sleeping and a context switch. That said, you could easily have both forms, if you somehow keep the allocated handles as invalid futex addresses - a simple way would be to set the lowest bit of the handle to 1, for example.

I think I understand their reasons for this, and it is attractive to be able to use it in shared memory. However, when the idea is extended to nested accesses by the same thread, there is a minimum of 3 variables required, and then the simple int variable design doesn't hold. There is a need for a more complex constructor, and when that is needed anyway it is just as easy to also require a destructor, and allocate / free a handle as well so no searches are necesary.

What I find attractive is avoiding syscalls when they are not necesary, not a synchronization primitive with a simple clear constructor, and no destructor. I prefer to use my own IPC mechanism between processes / computers rather than shared memory.

rdos · Post by **rdos** » Mon Jul 23, 2012 1:46 pm

It doesn't seem like there is an SMP-safe solution that only uses 386 instructions, or at least not a simple one. Therefore, I think that the previous solution, without locks, would be used for single core machines, while the original version that uses cmpxchg and locks would be used on multicore machines. Since no viable 386 machine uses multicore, the usage of cmpxchg for multicore is no problem. That should be an optimal solution. The call to lock and unlock thus would need to call a procedure that would be constructed at runtime based on configuration. The initialization code, and data structure would be the same. The only difference is that the SMP version would use 0, 1 and 2 for states, while the 386 version would use 0, 1 and 3.

Owen · Post by **Owen** » Mon Jul 23, 2012 2:45 pm

Nested, in-thread accesses only require two variables:

Code: Select all

void lock(Mutex* mutex) 
{
    prev = compareAndSwap(&mutex->futex, 0, myThreadId);
    if(prev &~ FUTEX_WAIT_BIT == myThreadId) {
        // Recursively taking the lock
        mutex->lockCount++;
    } else if(prev == 0) {
        // We took the lock
    } else {
        // Somebody else has the lock
        futexWaitLock(&mutex->futex);
    }
}

void unlock(Mutex* mutex)
{
    assert(mutex->futex &~ FUTEX_WAIT_BIT == myThreadId);
    if(--mutex->lockCount == 0) {
        prev = compareAndSwap(&mutex->futex, myThreadId, 0);
        if(prev != myThreadId) {
            // Somebody else is waiting on the lock
            futexWaitUnlock(&mutex->futex);
        }
    }
}

Fundamentally, you're always going to wrap the Futex in some manner; it's nontrivial to use correctly. That said, I like the fact that a Futex can be statically initialized in the traditional C manner (i.e. to zero)

gerryg400 · Post by **gerryg400** » Mon Jul 23, 2012 3:45 pm

Owen wrote:Nested, in-thread accesses only require two variables:
Code: Select all
void lock(Mutex* mutex) 
{
    prev = compareAndSwap(&mutex->futex, 0, myThreadId);
    if(prev &~ FUTEX_WAIT_BIT == myThreadId) {
        // Recursively taking the lock
        mutex->lockCount++;
    } else if(prev == 0) {
        // We took the lock
    } else {
        // Somebody else has the lock
        futexWaitLock(&mutex->futex);
    }
}

void unlock(Mutex* mutex)
{
    assert(mutex->futex &~ FUTEX_WAIT_BIT == myThreadId);
    if(--mutex->lockCount == 0) {
        prev = compareAndSwap(&mutex->futex, myThreadId, 0);
        if(prev != myThreadId) {
            // Somebody else is waiting on the lock
            futexWaitUnlock(&mutex->futex);
        }
    }
}
Fundamentally, you're always going to wrap the Futex in some manner; it's nontrivial to use correctly. That said, I like the fact that a Futex can be statically initialized in the traditional C manner (i.e. to zero)

Owen, it looks like your futexWaitLock function much be capable of atomically taking the lock. Is that function a system call or is the lock retried in userspace before/after the actual system call is made ? I'm curious about whether it would be possible to take the physical lock safely from kernel space.

Owen · Post by **Owen** » Mon Jul 23, 2012 9:08 pm

gerryg400 wrote:Owen, it looks like your futexWaitLock function much be capable of atomically taking the lock. Is that function a system call or is the lock retried in userspace before/after the actual system call is made ? I'm curious about whether it would be possible to take the physical lock safely from kernel space.

futexWaitLock is the system call. The kernel essentially does this pseudo code:

Code: Select all

shared_ptr<FutexWaitList> waitList(getWaitListFor(futex))
with(lock(waitList)) {
    futexValue = __atomic_load(futex, __ATOMIC_RELAXED);
    for(;;) {
        if(futexValue & FUTEX_WAIT_BIT) {
            // somebody else is already waiting. Join the list.
            waitList.waitOn(); // waitOn adds us to the wait queue and then unlocks the spinlock
            return;
        } else if(futexValue == 0) {
            bool didSet = __atomic_compare_and_swap(futex, &futexValue, myThreadId, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
            if(didSet) return; // successfully locked; no need to wait
        } else {
            bool didSet = __atomic_compare_and_swap(futex, &futexValue, futexValue | FUTEX_WAIT_BIT, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
            if(didSet) {
                waitList.waitOn();
                return;
            }
        }
    }
}

The __atomic operations here are the GCC C11/C++11 support builtins

OSDev.org

Critical sections in userspace without syscalls?

Re: Critical sections in userspace without syscalls?

Re: Critical sections in userspace without syscalls?

Re: Critical sections in userspace without syscalls?

Re: Critical sections in userspace without syscalls?

Re: Critical sections in userspace without syscalls?

Re: Critical sections in userspace without syscalls?

Re: Critical sections in userspace without syscalls?

Re: Critical sections in userspace without syscalls?

Re: Critical sections in userspace without syscalls?

Re: Critical sections in userspace without syscalls?

Re: Critical sections in userspace without syscalls?

Re: Critical sections in userspace without syscalls?

Re: Critical sections in userspace without syscalls?

Re: Critical sections in userspace without syscalls?

Re: Critical sections in userspace without syscalls?