
Re: Critical sections in userspace without syscalls?

Posted: Tue Jul 24, 2012 12:40 am
by rdos
There actually is a 386 solution: cmpxchg can be changed to add. The results for cmpxchg are that 0->1, 1->1 and 2->2. However, the line after a failed cmpxchg sets 1->2, so in essence the function is 0->1, 1->2 and 2->2. An add would be very similar. The only difference is that 2->3, and that if many threads execute only the initial add, the value could go up considerably. However, if a 16-bit value is used, up to 65535 threads could try to access the mutex at the same time without the counter rolling over. For practical purposes, this will be fine. Owen's optimization to overlay the thread-id with the value will not work, but that is of minor importance.

In order to use add/sub, the value range must also change. The initial value should be -1, and then CF will be checked in order to detect the fast paths.

On the kernel side, there will be a check whether the value is -1, 0 or something else. Values -1 and 0 will exit the acquire syscall.
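To make the transitions concrete, here is a minimal C sketch of the same add/sub protocol, using GCC __atomic builtins as stand-ins for the lock add/xchg instructions (handle and the two Rdos calls are placeholders for the real interface):

Code: Select all


#include <stdint.h>

extern int handle;                     /* kernel futex handle, created at init */
extern void RdosAcquireFutex(int h);   /* block in kernel */
extern void RdosReleaseFutex(int h);   /* wake one blocked thread */

static int16_t val = -1;               /* -1 = free, 0/1 = taken, >1 = contended */

void lock(void)
{
    /* "lock add val,1": carry set means the old value was -1 (free) */
    if (__atomic_fetch_add(&val, 1, __ATOMIC_ACQUIRE) == -1)
        return;

    for (;;)
    {
        /* force the contended state and re-check in one step ("xchg") */
        if (__atomic_exchange_n(&val, (int16_t)1, __ATOMIC_ACQUIRE) == -1)
            return;
        RdosAcquireFutex(handle);      /* block until woken */
    }
}

void unlock(void)
{
    /* "lock sub val,1": borrow means the old value was 0 (not contended) */
    if (__atomic_sub_fetch(&val, 1, __ATOMIC_RELEASE) == -1)
        return;                        /* val is now -1 = free */

    __atomic_store_n(&val, (int16_t)-1, __ATOMIC_RELEASE);  /* set free */
    RdosReleaseFutex(handle);          /* wake one waiter */
}

The assembly version, with the recursion counter and owner field added, looks like this: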

Code: Select all


struct mutex
{
    int handle;
    int counter;
    short int val;
    short int owner;
};

; es:edi contains mutex

init:
    push edi
    lea edi,[edi].val
    ebx = RdosCreateFutex(es:edi)     ; create the kernel handle
    pop edi
    mov [edi].handle,ebx
    mov [edi].val,-1                  ; -1 = free
    mov [edi].counter,0
    mov [edi].owner,0
    ret

lock:
    str eax                       ; get own task selector
    cmp ax,[edi].owner
    jne lock_do                   ; not the owner, try to take it

    inc [edi].counter             ; recursive enter
    ret

lock_do:
    lock add [edi].val,1
    jc lock_take                  ; carry means it was -1 (free)

lock_retry:
    mov ax,1
    xchg ax,[edi].val             ; mark as contended, check if it was just freed
    cmp ax,-1
    je lock_take

    mov ebx,[edi].handle
    RdosAcquireFutex(ebx)         ; block in kernel

    jmp lock_retry

lock_take:
    str eax
    mov [edi].owner,ax            ; set self as owner
    mov [edi].counter,1
    ret

unlock:
    str eax
    cmp ax,[edi].owner            ; check owner for safety
    jne unlock_done

    sub [edi].counter,1           ; recursive leave?
    jnz unlock_done

    mov [edi].owner,0
    lock sub [edi].val,1
    jc unlock_done                ; carry means not contended (val is now -1 = free)

    mov [edi].val,-1              ; set free
    mov ebx,[edi].handle
    RdosReleaseFutex(ebx)         ; wake one blocked thread

unlock_done:
    ret


Re: Critical sections in userspace without syscalls?

Posted: Wed Jul 25, 2012 3:26 pm
by rdos
I plan to provide the more efficient userspace mutexes (which I call critical sections) transparently. Today, the code patcher patches in a syscall when it encounters these instructions. However, I've provided new hooks so the loader can provide a different patch that is a near call to the fast mutexes that are based on futexes. This part of the patching procedure now works for flat applications using the PE format. Now I only need to implement the kernel part (the futexes) before I can make some serious tests.

Re: Critical sections in userspace without syscalls?

Posted: Thu Jul 26, 2012 3:42 pm
by rdos
Some more redesigns.

In order to provide a "safe" (and compatible) interface, I dropped the idea of returning an address to the mutex data structure. Instead I predefine a maximum number of mutexes that a process can create (currently 1024), and allocate linear space for an index array + the actual mutex data (which then is 16 * 1024 bytes). I then initialize the index array with zeros. The interface works with indexes into the array (1..1024), which is safer.
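In rough C terms, the layout is something like this (the names handle_tab and mutex_data are made up for illustration):

Code: Select all


#define MAX_SECTIONS 1024

struct mutex
{
    int handle;        /* kernel contention handle, 0 = not yet allocated */
    int counter;       /* recursion counter */
    short int val;     /* -1 = free, 0/1 = taken, >1 = contended */
    short int owner;   /* owning thread (task selector) */
};

/* one linear allocation: a 4-byte index entry per mutex followed by the
   12-byte mutex records, i.e. 16 * 1024 bytes in total */
static struct mutex *handle_tab[MAX_SECTIONS];  /* zero-initialized index */
static struct mutex  mutex_data[MAX_SECTIONS];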

Then I used the original futex idea, so now I don't allocate the handle at initialization time, but only when a thread is about to block. Since I don't want to expose kernel data (and especially not scheduler lists) to applications, I still use the handle concept. When the first thread is about to block, it acquires a kernel lock, allocates the handle (and thus the block-list) and fills in the handle in the user-level mutex data structure. The next time something blocks, the stored handle will be dereferenced, which is much faster than looking up physical addresses. The application cannot forge the handle to manipulate scheduler lists. When a mutex is no longer used, the application should delete it, and then also free the handle if it was allocated.
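A minimal sketch of the lazy allocation, assuming hypothetical kernel helpers (enter_section, leave_section, allocate_handle) around a global futex_section lock:

Code: Select all


static void ensure_handle(struct mutex *mtx)
{
    if (mtx->handle == 0)                      /* quick check without the lock */
    {
        enter_section(&futex_section);         /* kernel-side lock */
        if (mtx->handle == 0)                  /* re-check under the lock */
            mtx->handle = allocate_handle();   /* also creates the block-list */
        leave_section(&futex_section);
    }
}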

The nice thing is that the whole sequence of create, acquire, release and delete a mutex can be done without syscalls if there is no contention.
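With hypothetical wrapper names for the four operations, the uncontended sequence is:

Code: Select all


int h = section_create();    /* scans the index array, no syscall */
section_enter(h);            /* lock add fast path, no syscall */
/* ... protected code ... */
section_leave(h);            /* lock sub fast path, no syscall */
section_free(h);             /* only enters the kernel if a handle was allocated */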

Re: Critical sections in userspace without syscalls?

Posted: Tue Jul 31, 2012 1:49 pm
by rdos
OK, so now I've implemented the kernel-side also, and made some tests.

The test code looks like this:

Code: Select all


void test_thread(char mych)
{
    for (;;)
    {
        Section.Enter();
        WriteChar(mych);
        Section.Leave();
    }
}

It runs fine with 2 threads on a dual-core PC, as well as with 3 threads on a dual-core and a 6-core PC.

However, there is a problem with fairness. Since the code retakes the mutex almost immediately after releasing it, most of the time the woken-up threads will have no time to activate and compete for the mutex, so frequently the same thread will run multiple rounds before it accidentally loses the race. It is worse on the 6-core AMD Athlon, probably because it is much faster.

A possible solution is to force a reschedule every time the futex_release syscall is called, but that would also slow down the code in cases less extreme than this one.

Edit:
A more elaborate test:

Code: Select all


char someval;

void test_thread(char mych)
{
    for (;;)
    {
        Section.Enter();
        someval = mych;
        WriteChar(mych);
        if (someval != mych)
            for (;;)
                 ;
        Section.Leave();
    }
}

The code never gets to the infinite loop, which should be an indication that no more than one thread can be in the protected section at a time.

Re: Critical sections in userspace without syscalls?

Posted: Tue Jul 31, 2012 2:28 pm
by Nable
What about calling (with some condition checking) something like sched_yield from the application if you want fairness? That would be a mixture of preemption and cooperative multitasking.
Otherwise it's the scheduler's task to provide fairness by keeping trace counters for threads and preempting those that have the highest values of, for example, a section_acquire counter.

Re: Critical sections in userspace without syscalls?

Posted: Tue Jul 31, 2012 2:38 pm
by rdos
Nable wrote:What about calling (with some condition checking) something like sched_yield from the application if you want fairness? That would be a mixture of preemption and cooperative multitasking.
Otherwise it's the scheduler's task to provide fairness by keeping trace counters for threads and preempting those that have the highest values of, for example, a section_acquire counter.
The example is quite extreme because executing the protected code takes perhaps 100 times as long as retaking the mutex. That will hardly ever happen in real situations.

I tried to do a yield in the release_futex call, but it doesn't do much good. About the only effect is that it consumes a lot of cycles, which in itself increases the chances for another core / thread to win the race for the mutex.

I think I'll just let it be. There is no sense in optimizing for unlikely cases.

Re: Critical sections in userspace without syscalls?

Posted: Thu Aug 02, 2012 7:18 pm
by Owen
If the futex was contended, then the unlocker knows this and has to call into the kernel. If you're already in the kernel, surely it can't be that expensive to
  • Pick the first task off the waiting list
  • Rewrite the futex (which is still in the locked state) to be locked by this task
  • Unblock the task
No need to (necessarily) immediately go for a full reschedule - but in doing this you've (inexpensively, one hopes) guaranteed fairness
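In C-like terms, the proposed handoff would be roughly this (struct thread, waitlist_remove_first and unblock are assumed names, not an actual API):

Code: Select all


struct thread { short int id; };                     /* placeholder */
extern struct thread *waitlist_remove_first(int h);  /* assumed helper */
extern void unblock(struct thread *t);               /* assumed helper */

void release_with_handoff(struct mutex *mtx)
{
    struct thread *t = waitlist_remove_first(mtx->handle);
    if (t != NULL)
    {
        mtx->owner   = t->id;   /* futex stays locked, now owned by the waiter */
        mtx->counter = 1;
        unblock(t);             /* no full reschedule required */
    }
    else
        mtx->val = -1;          /* nobody waiting: mark free */
}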

Re: Critical sections in userspace without syscalls?

Posted: Fri Aug 03, 2012 1:12 am
by rdos
Owen wrote:If the futex was contended, then the unlocker knows this and has to call into the kernel. If you're already in the kernel, surely it can't be that expensive to
  • Pick the first task off the waiting list
  • Rewrite the futex (which is still in the locked state) to be locked by this task
  • Unblock the task
No need to (necessarily) immediately go for a full reschedule - but in doing this you've (inexpensively, one hopes) guaranteed fairness
What happens is this:
  • The owner releases the futex (setting it to "free"), and then unblocks one thread
  • The owner then exits the release code, and directly enters the lock code again
  • If the woken-up thread hasn't been activated yet, the previous owner will take the futex again (it's free)
  • The woken-up thread is activated and tries to take the futex, but it can't because the futex is already taken, and thus it will be blocked again
What is needed to avoid this scenario is that the thread that unblocks another thread cannot be allowed to continue until the unblocked thread is running. This means it must be preempted. It is a race condition.

I might look into Owen's solution above, where the thread that releases the futex sets the unblocked thread as the owner of the futex. That could work, especially since the owner is the task selector, which is easily available when unblocking. The retry code must also be changed so a retrying thread first checks if it is the owner, and if it is, just exits the acquire procedure. Additionally, the thread that unblocks another thread probably shouldn't set the mutex to "free" before it does the unblock syscall. To ensure fairness, it should set it to "contended".
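A rough sketch of that revised retry path (try_take and my_thread_id are placeholders for the lock add/xchg sequence and str):

Code: Select all


extern int try_take(struct mutex *mtx);          /* lock add / xchg fast path */
extern short int my_thread_id(void);             /* str in 32-bit mode */
extern void RdosAcquireFutex(struct mutex *mtx); /* block in kernel */

void lock_slow(struct mutex *mtx)
{
    for (;;)
    {
        if (try_take(mtx))
            return;
        RdosAcquireFutex(mtx);            /* block */
        if (mtx->owner == my_thread_id())
            return;                       /* the releaser handed us the futex */
    }
}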

Re: Critical sections in userspace without syscalls?

Posted: Fri Aug 03, 2012 2:13 am
by rdos
OK, updated code according to Owen's idea (and with the user-handle concept):

Code: Select all


struct mutex
{
    int handle;
    int counter;
    short int val;
    short int owner;
};

; create new mutex
; returns handle in ebx

create:
    push eax
    push ecx
    push edx
    push edi

    mov edx,OFFSET handle_tab
    mov edi,edx
    mov ecx,MAX_SECTIONS
    xor eax,eax                              ; a free entry contains zero
    repnz scasd                              ; search for free handle

    stc
    jnz create_done                               ; no free handles

    mov ebx,MAX_SECTIONS
    sub ebx,ecx
    push ebx                                         ; ebx now is the handle (1-based index)

    dec ebx
    sub edi,4                                        ; edi is the handle table index
    mov eax,12
    mul ebx
    add eax,4 * MAX_SECTIONS           
    mov edx,eax
    add edx,OFFSET handle_tab              ; edx is the mutex entry itself
    mov [edi],edx                                 ; save address to mutex in index array

    mov [edx].handle,0                          ; initialize mutex
    mov [edx].val,-1
    mov [edx].counter,0
    mov [edx].owner,0

    pop ebx
    clc

create_done:
    pop edi
    pop edx
    pop ecx
    pop eax
    ret

; free mutex
; ebx is handle

free:
    push eax

    sub ebx,1
    jc free_done

    cmp ebx,MAX_SECTIONS            ; check that handle is in range
    jae free_done

    shl ebx,2
    add ebx,OFFSET handle_tab
    xor eax,eax
    xchg eax,[ebx]                         ; mark handle as free
    or eax,eax
    jz free_done

    mov ebx,eax
    mov eax,[ebx].handle                ; check if a kernel contention handle has been allocated
    or eax,eax
    jz free_done

    FreeFutex(ebx)                        ; free kernel side handle                      

free_done:
    xor ebx,ebx
    pop eax
    ret

; lock mutex
; ebx is handle

lock:
    push eax
    push ebx

    sub ebx,1
    jc lock_done                             ; invalid zero handle

    cmp ebx,MAX_SECTIONS
    jae lock_done                          ; index out of range

    shl ebx,2
    add ebx,OFFSET handle_tab
    mov ebx,[ebx]
    or ebx,ebx
    jz lock_done                            ; not allocated

    str ax
    cmp ax,[ebx].owner
    jne lock_do                            ; not the owner, try to take it

    inc [ebx].counter
    jmp lock_done

lock_do:
    lock add [ebx].val,1
    jc lock_take

    mov eax,1
    xchg ax,[ebx].val                   ; re-check: it may just have been freed; also resets val to avoid roll-over
    cmp ax,-1
    jne lock_block

lock_take:
    str ax                                 ; set as owner
    mov [ebx].owner,ax
    mov [ebx].counter,1
    jmp lock_done

lock_block:
    RdosAcquireFutex(ebx)          ; block

lock_done:
    pop ebx
    pop eax
    ret

; unlock mutex
; ebx is handle

unlock:
    push eax
    push ebx

    sub ebx,1
    jc unlock_done              ; invalid zero handle

    cmp ebx,MAX_SECTIONS
    jae unlock_done               ; index out of range

    shl ebx,2
    add ebx,OFFSET handle_tab
    mov ebx,[ebx]
    or ebx,ebx
    jz unlock_done                ; not allocated

    str ax
    cmp ax,[ebx].owner 
    jne unlock_done              ; check that owner is correct

    sub [ebx].counter,1
    jnz unlock_done             

    mov [ebx].owner,0
    lock sub [ebx].val,1       
    jc unlock_done                  ; go if not contended

    mov [ebx].val,-1                 ; temporarily set to free to handle race condition with wakeup
    RdosReleaseFutex(ebx)

unlock_done:
    pop ebx
    pop eax
    ret

Edit: When the acquire call returns from the kernel, the kernel can ensure that the mutex is always owned, and thus there is no need to check after the blocking call returns.

Edit2: In order to handle race conditions it is necessary to set the mutex to free before doing the release futex syscall.

Re: Critical sections in userspace without syscalls?

Posted: Fri Aug 03, 2012 3:06 am
by rdos
Here is the kernel-side, with some pseudo-code:

Code: Select all


; acquire futex
; es:ebx is address of mutex data

acquire_futex:
    push ds
    push fs
    push eax
    push ebx
    push cx
    push esi
;    
    mov esi,ebx          
;
    mov ebx,es:[esi].fs_handle
    mov ax,FUTEX_HANDLE
    DerefHandle                                ; try to deref kernel handle
    jnc acquire_no_sect                
;    
    mov ax,task_sel
    mov ds,ax
    EnterSection ds:futex_section       ; take a kernel mutex to ensure only one kernel handle is allocated
;
    mov ebx,es:[esi].fs_handle
    mov ax,FUTEX_HANDLE
    DerefHandle                                ; retry with lock
    jnc acquire_handle_ok
;
    mov cx,SIZE futex_handle_seg
    AllocateHandle                             ; allocate kernel handle
    mov ds:[ebx].fh_list,0                   ; clear block-list
    mov ds:[ebx].fh_lock,0                  ; initialize spinlock
    mov [ebx].hh_sign,FUTEX_HANDLE   ; set to this handle-type
;
    movzx eax,[ebx].hh_handle
    mov es:[esi].fs_handle,eax             ; save handle in user-level mutex

acquire_handle_ok:
    push ds
    mov ax,task_sel
    mov ds,ax
    LeaveSection ds:futex_section        ; release handle mutex
    pop ds

acquire_no_sect:
    call LockCore                               ; lock core so no scheduling can happen
    call cs:lock_futex_proc                  ; take mutex spinlock
    mov ax,1
    xchg ax,es:[esi].val                  ; make sure nobody just left the section
    cmp ax,-1
    je acquire_take
;
    mov ax,ds
    push OFFSET acquire_done            ; save resume-point on stack
    call SaveLockedThread                  ; save thread state
;
    mov ds,ax
    lea edi,[ebx].fh_list     
    mov es,fs:ps_curr_thread
    mov fs:ps_curr_thread,0
;
    call InsertBlock32                        ; block into block list
    call cs:unlock_futex_proc             ; release spinlock
;
    mov es:p_sleep_sel,ds
    mov es:p_sleep_offset,edi    
    jmp LoadThread                         ; schedule

acquire_take:
    str ax
    mov es:[esi].owner,ax                ; set self to owner
    mov es:[esi].counter,1                 
    call cs:unlock_futex_proc            ; free mutex spinlock
    call UnlockCore                          ; unlock scheduler
    
acquire_done:
    pop esi
    pop cx
    pop ebx
    pop eax
    pop fs
    pop ds    
    ret

; release futex
; es:ebx address of user-level mutex data

release_futex:
    push ds
    push fs
    push eax
    push ebx
    push cx
    push esi
;
    mov esi,ebx                            ; save user-level base in esi
    mov ebx,es:[esi].fs_handle
    mov ax,FUTEX_HANDLE
    DerefHandle                           ; dereference handle
    jc release_done
;
    call LockCore                         ; lock scheduler
    call cs:lock_futex_proc            ; take futex spinlock
;
    mov ax,ds:[ebx].fh_list
    or ax,ax
    jz release_unlock                    ; check if somebody is blocked, go if not
;    
    mov ax,1
    xchg ax,es:[esi].val                ; try to acquire the mutex
    cmp ax,-1
    jne release_unlock                  ; if somebody already has it, go
;
    push di    
;
    push es
    push esi    
    lea esi,ds:[ebx].fh_list
    call RemoveBlock32                 ; remove a waiting thread
    mov es:p_data,0
    mov cx,es                             ; save thread block
    mov di,es:p_tss_sel               ; get task selector of waiting thread
    pop esi
    pop es
;
    mov es:[esi].owner,di             ; set owner
    mov es:[esi].counter,1            ; set count
;
    call cs:unlock_futex_proc         ; release mutex spinlock
;
    push es
    mov es,cx                              ; restore thread block
    cli
    mov di,OFFSET ps_wakeup_list
    call InsertCoreBlock                 ; insert unblocked thread into wakeup list
    sti
    pop es
;    
    pop di
    call UnlockCore                       ; unlock scheduler
    jmp release_done

release_unlock:
    call cs:unlock_futex_proc        ; release mutex spinlock
    call UnlockCore                      ; unlock scheduler
    
release_done:
    pop esi
    pop cx
    pop ebx
    pop eax
    pop fs
    pop ds    
    ret

Edit: Made some more modifications. In essence, what the release_futex syscall does is try to acquire the futex in the kernel if somebody is blocked, and then hand it over to the blocked thread. That means the functionality is very similar. It is necessary to set the futex to free before starting to do this, because otherwise a concurrent acquire can be on its way on another core without yet having led to a blocked thread.

Edit2: Changed registers as the address of the kernel handle data was destroyed.

Re: Critical sections in userspace without syscalls?

Posted: Sun Jan 06, 2013 1:13 pm
by rdos
In order to implement this design in long mode, a slight modification is needed. The owner can no longer be determined by using str (the task register), since TR is loaded with a per-core value. There is a need for a different thread indicator in long mode. I think a good one is bits 30-39 of the application RSP, because thread stacks are allocated with 1G alignment from a specific address range. The identification can be provided as a new parameter to the acquire and release calls, or be determined through the task structure. The latter is probably better since it can be used transparently. When the syscalls determine the thread is running in long mode, they would use the identification in the task structure instead of TR as owner.

Re: Critical sections in userspace without syscalls?

Posted: Sun Jan 06, 2013 3:19 pm
by Combuster
rdos wrote:I think a good one is bits 30-39 of application RSP.
I see an exploit coming soon.

Re: Critical sections in userspace without syscalls?

Posted: Mon Jan 07, 2013 12:35 am
by rdos
Combuster wrote:
rdos wrote:I think a good one is bits 30-39 of application RSP.
I see an exploit coming soon.
Anything that is in the user part of the structure can be abused by user mode, which includes the owner field. Thus, you don't need an "exploit"; you can already do it in 32-bit by setting owner to anything you like. The owner field is not used for anything in the kernel, other than as an indicator of who user mode set as owner. If all the exploit does is mark the section as occupied, that is easier to do with an acquire call :mrgreen:

Re: Critical sections in userspace without syscalls?

Posted: Mon Jan 07, 2013 1:56 am
by rdos
Update:

There is a new field in the thread structure called p_futex_id, which the kernel side uses to match the owner. In 32-bit mode, this field is set to TR on thread creation. In 64-bit mode, the long mode monitor will use bits 30-46 of the newly created thread's RSP as the id, and call a register function in the scheduler to set p_futex_id to this value. On the user-mode side, 32-bit code will use str for the owner field, while 64-bit code will use bits 30-46 of the current RSP.
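For illustration, extracting the id might look like this in user mode (GCC-style inline assembly; not the actual RDOS code):

Code: Select all


/* derive the owner id from RSP bits 30-46; valid because thread stacks
   are allocated with 1G alignment in a dedicated address range */
static inline unsigned int futex_owner_id(void)
{
    unsigned long rsp;
    __asm__ __volatile__("mov %%rsp, %0" : "=r"(rsp));
    return (unsigned int)((rsp >> 30) & 0x1FFFF);   /* 17 bits: 30-46 */
}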

Re: Critical sections in userspace without syscalls?

Posted: Mon Jan 07, 2013 5:08 am
by FallenAvatar
If it wasn't an issue, why did you suddenly update us with how you "fixed" it?

- Monk