Priority Inversions... The curse of multitasking.

Discussions on more advanced topics such as monolithic vs micro-kernels, transactional memory models, and paging vs segmentation should go here. Use this forum to expand and improve the wiki!
User avatar
bloodline
Member
Member
Posts: 264
Joined: Tue Sep 15, 2020 8:07 am
Location: London, UK

Priority Inversions... The curse of multitasking.

Post by bloodline »

I have just completed major rewrite/clean up of my OS code, to allow me to implement proper message passing.

My tasks are scheduled to run preemptively in a simple Round Robin depending upon a priority sorted list, higher priority tasks will always be scheduled before lower one, any task can request to be removed from the ready list and moved to the waiting list where they will remain until the waiting tasks receive a "signal" to run. Each task has 64 signal bits (dynamically assigned), and this signalling system has been sufficient for early tests, but now I have added a higher level message passing system build on the signal system so task can communicate with each other. This was needed as the input task (which monitors the keyboard and mouse) must send more complex events to tasks now, rather than just flag that "something has changed" (which is easy with signals).

A consequence of this was some graphical user interface house keeping code which was temporarily running in a timed interrupt, has now been moved to its proper place in the input task.

This is when I stumbled into a problem I've been expecting for a while, the code running in the interrupt was using data structures protected (to some degree) from concurrent access by other tasks due to the fact the interrupt would never be interrupted by a task. But now I needed to protect these data structures with a lock. MY first attempt was to have simple spin lock as such:

Code: Select all


bool lock;

void Lock(bool* lock){
    while(*lock){
        Reschedule();    // This causes the task to give up its time slice, and wait to be scheduled back in during the round robin.  
    }
}

This is were I hit the priority inversion. The input task runs at a much high priority than the normal tasks on the system... So when the input task encountered a lock which was set by lower priority task, it would just sit and never obtain the lock, since the lower priority task would never be scheduled in to complete it's work and release the lock.

Here is the solution I have implemented, it might be be useful to others facing similar issues... And more experience members here might be able to critique/spot issues.

Code: Select all


typedef struct{
    bool isLocked;
    int lockingPriority;
}lock_t;


void Lock(lock_t* lock){
    if(lock->isLocked){

        int taskPri = thisTask->priority;                  //Temporarily save the original task priority for restoration when the lock is released

        if(lock->lockingPriority < taskPri){              //Only change the priority if the locking task is lower priority
            thisTask->priority = lock->lockingPriority;
        }

        while(lock->isLocked){
            Reschedule();
        }
       
        thisTask->priority = taskPri;
    }

    lock->isLocked = true;
    lock->lockingPriority = thisTask->priority;
}


Now any task waiting for a lock to be freed will wait at the same priority as the locking task, allowing all tasks to get some CPU time. This works for now, and I can't foresee any issues... yet :)
CuriOS: A single address space GUI based operating system built upon a fairly pure Microkernel/Nanokernel. Download latest bootable x86 Disk Image: https://github.com/h5n1xp/CuriOS/blob/main/disk.img.zip
Discord:https://discord.gg/zn2vV2Su
rdos
Member
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Priority Inversions... The curse of multitasking.

Post by rdos »

I think another solution is to not allow tasks at different priority levels to compete for the same locks.
User avatar
bloodline
Member
Member
Posts: 264
Joined: Tue Sep 15, 2020 8:07 am
Location: London, UK

Re: Priority Inversions... The curse of multitasking.

Post by bloodline »

rdos wrote:I think another solution is to not allow tasks at different priority levels to compete for the same locks.
That’s a vaild point, but my design is to be as fast as possible, so using a third party to arbitrate shared resources would go against the design goal.
CuriOS: A single address space GUI based operating system built upon a fairly pure Microkernel/Nanokernel. Download latest bootable x86 Disk Image: https://github.com/h5n1xp/CuriOS/blob/main/disk.img.zip
Discord:https://discord.gg/zn2vV2Su
nullplan
Member
Member
Posts: 1767
Joined: Wed Aug 30, 2017 8:24 am

Re: Priority Inversions... The curse of multitasking.

Post by nullplan »

I do hope that was just pseudo-code, and that the actual code contains the necessary assembly to make these transactions atomic. Otherwise there are some race conditions here. And data races, too.

You have stumbled upon quite a few problems in your system here. The first is your scheduler: It is possible for processes to starve. That is basically the one thing a scheduler must guarantee that it never happens. In this case a low priority process is starved when a high priority process is continuously runnable. It doesn't take a lot of imagination to see why that is a problem. imagine someone running an infinite loop at high priority. If all lower-priority processes starve, there is no way to end this process.

One of the simplest algorithms for a prioritized round-robin scheduler is to put all runnable tasks into a priority queue, yes, but to tell between static and dynamic priority. Static priority is the value you configure outside of the scheduler, and dynamic priority starts at the same value when a process comes into the queue (or is scheduled to run), but whenever a process is runnable but not picked to be run, its dynamic priority is increased. And you always schedule the task with the highest dynamic priority. That way, starvation is impossible as long as static priority is limited way below the limits of dynamic priority. Sooner or later, any process will come to the top and be scheduled. It will just take longer for a low-priority process.

There is a lot more you can do with scheduling, like giving a bonus to interactive processes and a malus to batch-processes (chances are, batch-processes don't have to be all that responsive, anyway), but that is for later.

Next problem: That lock algorithm. The problem here is that a task waiting on a contended lock is still considered runnable by the scheduler. Therefore the scheduler is tricked into scheduling the task that will not make forward progress. It would be better to mark a task as sleeping during that time and to wake it up once the lock becomes available. The system I would suggest for that is called a futex, and is a Linux idea I really, really like. Basically, a futex is a pointer to an int, and there are two principal operations you can perform with it: You can wait on it, which makes the kernel suspend the calling thread if the int has the value given in the syscall, and you can wake it, which makes the kernel wake up previously suspended processes on that futex.

This mechanism is so versatile, you can implement any synchronization method with it that comes to mind. But for your case, a simple non-recursive lock could be as simple as:

Code: Select all

struct lock {
   volatile int val; /* values: 0 - lock free, >0 - lock is free but has waiting tasks, INT_MIN + x - lock is locked with contention count x. */
};

void lock(struct lock *lck)
{
  while (__sync_lock_test_and_set(&lck->val, 1))
    futex_wait(&lck->val, 1);
}

void unlock(struct lock *lck)
{
  lck->val = 0;
  futex_wake(&lck->val, 1);
}
Now your processes are no longer runnable while waiting on a lock. Yes, this design can be improved, it is merely illustrative.
Carpe diem!
User avatar
bloodline
Member
Member
Posts: 264
Joined: Tue Sep 15, 2020 8:07 am
Location: London, UK

Re: Priority Inversions... The curse of multitasking.

Post by bloodline »

nullplan wrote:I do hope that was just pseudo-code, and that the actual code contains the necessary assembly to make these transactions atomic. Otherwise there are some race conditions here. And data races, too.
I haven’t read about the x86’s atomic instructions yet (I only hit the problem and worked out a solution this morning before work), on the 68k I would use a TAS instruction, I’m sure the X86 works similarly. I’ll definitely need to go atomic if I ever get as far as SMP.

You have stumbled upon quite a few problems in your system here. The first is your scheduler: It is possible for processes to starve.
The ability for a high priority task to inhibit the execution of all lower priority tasks is a known aspect of the design (I’ve used this scheduler in embedded systems), and I have designed the system with this in mind. User tasks are not allowed a priority above 0, (they are welcome to use lower priorities), prioriy levels above 0 are reserved for operating system tasks, which have a very strict hierarchy of execution.

My system is not Unix. All software on this system is trusted, tasks are expected not to hog CPU and memory is unprotected, tasks are expected not to trample over other tasks’ memory.

That is basically the one thing a scheduler must guarantee that it never happens. In this case a low priority process is starved when a high priority process is continuously runnable. It doesn't take a lot of imagination to see why that is a problem. imagine someone running an infinite loop at high priority. If all lower-priority processes starve, there is no way to end this process.

One of the simplest algorithms for a prioritized round-robin scheduler is to put all runnable tasks into a priority queue, yes, but to tell between static and dynamic priority. Static priority is the value you configure outside of the scheduler, and dynamic priority starts at the same value when a process comes into the queue (or is scheduled to run), but whenever a process is runnable but not picked to be run, its dynamic priority is increased. And you always schedule the task with the highest dynamic priority. That way, starvation is impossible as long as static priority is limited way below the limits of dynamic priority. Sooner or later, any process will come to the top and be scheduled. It will just take longer for a low-priority process.

There is a lot more you can do with scheduling, like giving a bonus to interactive processes and a malus to batch-processes (chances are, batch-processes don't have to be all that responsive, anyway), but that is for later.
Some really nice design ideas here, I do want to pay with scheduling algorithms, and have made it relatively easy for me to have the scheduling algorithm altered at run time!
Next problem: That lock algorithm. The problem here is that a task waiting on a contended lock is still considered runnable by the scheduler. Therefore the scheduler is tricked into scheduling the task that will not make forward progress. It would be better to mark a task as sleeping during that time and to wake it up once the lock becomes available. The system I would suggest for that is called a futex, and is a Linux idea I really, really like. Basically, a futex is a pointer to an int, and there are two principal operations you can perform with it: You can wait on it, which makes the kernel suspend the calling thread if the int has the value given in the syscall, and you can wake it, which makes the kernel wake up previously suspended processes on that futex.

This mechanism is so versatile, you can implement any synchronization method with it that comes to mind. But for your case, a simple non-recursive lock could be as simple as:

Code: Select all

struct lock {
   volatile int val; /* values: 0 - lock free, >0 - lock is free but has waiting tasks, INT_MIN + x - lock is locked with contention count x. */
};

void lock(struct lock *lck)
{
  while (__sync_lock_test_and_set(&lck->val, 1))
    futex_wait(&lck->val, 1);
}

void unlock(struct lock *lck)
{
  lck->val = 0;
  futex_wake(&lck->val, 1);
}
Now your processes are no longer runnable while waiting on a lock. Yes, this design can be improved, it is merely illustrative.
So I did ponder a semaphore system, which would put the task back into the waiting list, only to be scheduled back into when it received the signal that the lock was free. But since a task switch in my system is just a simple stack swap, keeping the task in the ready list has less overhead than the bookkeeping needed to keep track of which task is waiting for which lock, and then alter the task state depending upon the conditions. But this is definitely something I would like to add in future for higher level resources!
CuriOS: A single address space GUI based operating system built upon a fairly pure Microkernel/Nanokernel. Download latest bootable x86 Disk Image: https://github.com/h5n1xp/CuriOS/blob/main/disk.img.zip
Discord:https://discord.gg/zn2vV2Su
rdos
Member
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Priority Inversions... The curse of multitasking.

Post by rdos »

nullplan wrote: One of the simplest algorithms for a prioritized round-robin scheduler is to put all runnable tasks into a priority queue, yes, but to tell between static and dynamic priority. Static priority is the value you configure outside of the scheduler, and dynamic priority starts at the same value when a process comes into the queue (or is scheduled to run), but whenever a process is runnable but not picked to be run, its dynamic priority is increased. And you always schedule the task with the highest dynamic priority. That way, starvation is impossible as long as static priority is limited way below the limits of dynamic priority. Sooner or later, any process will come to the top and be scheduled. It will just take longer for a low-priority process.

There is a lot more you can do with scheduling, like giving a bonus to interactive processes and a malus to batch-processes (chances are, batch-processes don't have to be all that responsive, anyway), but that is for later.
I typically let userlevel threads run at the same priority-level (the next lowest) and only use higher priority in kernel threads. Although there is a way to create userlevel threads with higher priority, it's only part of the raw API, and it's not really used. I think letting threads decide their own priority leads to priority inflation and so is not really a good idea.
nullplan wrote: Next problem: That lock algorithm. The problem here is that a task waiting on a contended lock is still considered runnable by the scheduler. Therefore the scheduler is tricked into scheduling the task that will not make forward progress. It would be better to mark a task as sleeping during that time and to wake it up once the lock becomes available. The system I would suggest for that is called a futex, and is a Linux idea I really, really like. Basically, a futex is a pointer to an int, and there are two principal operations you can perform with it: You can wait on it, which makes the kernel suspend the calling thread if the int has the value given in the syscall, and you can wake it, which makes the kernel wake up previously suspended processes on that futex.

This mechanism is so versatile, you can implement any synchronization method with it that comes to mind. But for your case, a simple non-recursive lock could be as simple as:

Code: Select all

struct lock {
   volatile int val; /* values: 0 - lock free, >0 - lock is free but has waiting tasks, INT_MIN + x - lock is locked with contention count x. */
};

void lock(struct lock *lck)
{
  while (__sync_lock_test_and_set(&lck->val, 1))
    futex_wait(&lck->val, 1);
}

void unlock(struct lock *lck)
{
  lck->val = 0;
  futex_wake(&lck->val, 1);
}
Now your processes are no longer runnable while waiting on a lock. Yes, this design can be improved, it is merely illustrative.
Futexes are great, but not that easy to get right with SMP. The primary advantage is that if they are not highly contended, then they will normally execute without syscalls. I use futexes for userlevel synchronization, while in kernel I use a more direct approach which calls the scheduler.
nullplan
Member
Member
Posts: 1767
Joined: Wed Aug 30, 2017 8:24 am

Re: Priority Inversions... The curse of multitasking.

Post by nullplan »

bloodline wrote:I haven’t read about the x86’s atomic instructions yet (I only hit the problem and worked out a solution this morning before work), on the 68k I would use a TAS instruction, I’m sure the X86 works similarly. I’ll definitely need to go atomic if I ever get as far as SMP.
Actually, you need those even on UP, since the processor can get interrupted in between your task seeing lock->isLocked == false and actually setting lock->isLocked = true. However, GCC makes it easy with the sync extensions. Those also work cross-platform. In your case, with a really simple spin lock, you probably want to use __sync_lock_test_and_set(&lock->isLocked, 1). If that returns 0, you got the lock. The only problem is that bool happens to be the only integer type not suitable for use with the sync extensions, so you'd have to change the type of "isLocked".
bloodline wrote:Some really nice design ideas here,
Well, I shan't take credit, this is mostly what I overheard about an OS we use at work (OS-9). They use that scheduler to justify calling themselves a soft real time operating system.
bloodline wrote:The ability for a high priority task to inhibit the execution of all lower priority tasks is a known aspect of the design (I’ve used this scheduler in embedded systems), and I have designed the system with this in mind. User tasks are not allowed a priority above 0, (they are welcome to use lower priorities), prioriy levels above 0 are reserved for operating system tasks, which have a very strict hierarchy of execution.

My system is not Unix. All software on this system is trusted, tasks are expected not to hog CPU and memory is unprotected, tasks are expected not to trample over other tasks’ memory.
All well and good, but in that case, the design of a lock such that a contended lock eats all available CPU power is probably counter-productive.
rdos wrote:I typically let userlevel threads run at the same priority-level (the next lowest) and only use higher priority in kernel threads. Although there is a way to create userlevel threads with higher priority, it's only part of the raw API, and it's not really used. I think letting threads decide their own priority leads to priority inflation and so is not really a good idea.
Yeah, I have, so far, failed to see a reason for user-defined priorities (scheduler-defined priorities are another matter). I mean, when you get right down to it, how would they even know? A good scheduler doesn't starve any task, anyway, and so a priority becomes a number that a task can set and read out again, and what significance it has is murky at best. Did your task not run for 10ms because the scheduler didn't care or because the machine is loaded? Who knows?
rdos wrote:Futexes are great, but not that easy to get right with SMP.
Do you mean the futex implementation, or their use in synchronization primitives? Because the former is rather easy to get right:

Code: Select all

struct futex_waiter {
  spinlock_t waiter_spinlock; /* protects concurrent access to woken and timed_out. */
  phys_addr_t futex_address;
  struct task *tsk;
  int woken;
  int timed_out;
  struct futex_waiter *next, *prev;
};

static struct futex_waiter *global_futex_queue;
static spinlock_t global_futex_spinlock = SPINLOCK_INIT; /* protects entire list structure */

static void futex_timeout(void *arg)
{
  struct futex_waiter *w = arg;
  unsigned flags = spinlock_irqsave(&w->spinlock);
  w->timed_out = 1;
  task_set_runnable(w->tsk);
  spinunlock_irqrestore(&w->spinlock, flags);
}

int futex_wait(const int *addr, int val, const struct timespec *timeout)
{
  int ret = 0;
  phys_addr_t futex_addr = memlock_internal(current_task, addr);
  struct futex_waiter waiter = {SPINLOCK_INIT, futex_addr, current_task};
  struct timer *timer = 0;
  if (timeout)
    timer = add_timer(timeout, futex_timeout, &waiter);
  unsigned flags = spinlock_irqsave(&global_futex_spinlock);
  ret = -EAGAIN;
  if (get_user(addr) == val)
  {
    waiter.next = global_futex_queue;
    if (global_futex_queue)
      global_futex_queue->prev = &waiter;
    global_futex_queue = &waiter;
    spinunlock_irqrestore(&global_futex_spinlock, flags);
    flags = spinlock_irqsave(&waiter.spinlock);
    while (!waiter.woken && !waiter.timed_out && !sigpending(current_task))
    {
      task_set_state(current_task, TIF_SLEEPING);
      spinunlock_irqrestore(&waiter.spinlock, flags);
      schedule();
      flags = spinlock_irqsave(&waiter.spinlock);
    }
    if (sigpending(current_task))
      ret = -EINTR;
    else if (!waiter.woken)
      ret = -ETIMEDOUT;
    else
      ret = 0;
    spinunlock_irqrestore(&waiter.spinlock, flags);
    if (timer)
      cancel_timer(timer);
    flags = spinlock_irqsave(&global_futex_spinlock);
    if (waiter.next)
      waiter.next->prev = waiter.prev;
    if (waiter.prev)
      waiter.prev->next = waiter.next;
  }
  spinunlock_irqrestore(&global_futex_spinlock, flags); 
  memunlock_internal(current_task, addr);
  return ret;
}

int futex_wake(int *addr, int num) {
  unsigned flags = spinlock_irqsave(&global_futex_spinlock);
  int num_woken = 0;
  phys_addr_t futex_addr = memlock_internal(current_task, addr);
  for (struct futex_waiter *w = global_futex_queue; num && w; w = w->next)
    if (w>futex_addr == futex_addr)
    {
      spinlock_internal(&w->spinlock);
      w->woken = 1;
      task_set_runnable(w->tsk);
      spinunlock_internal(&w->spinlock);
      num--; num_woken++;
    }
  memunlock_internal(current_task, addr);
  spinunlock_irqrestore(&global_futex_spinlock);
  return num_woken;
}
OK, it is a lot of code, but it all follows naturally from everything else.
Carpe diem!
moonchild
Member
Member
Posts: 73
Joined: Wed Apr 01, 2020 4:59 pm
Libera.chat IRC: moon-child

Re: Priority Inversions... The curse of multitasking.

Post by moonchild »

nullplan wrote:. However, GCC makes it easy with the sync extensions. Those also work cross-platform. In your case, with a really simple spin lock, you probably want to use __sync_lock_test_and_set(&lock->isLocked, 1). If that returns 0, you got the lock. The only problem is that bool happens to be the only integer type not suitable for use with the sync extensions, so you'd have to change the type of "isLocked".
Don't use the __sync* family of functions, use the __atomic* family. The former is deprecated, less flexible, and doesn't fit with the new memory models of c11/c++11. Oh, and you can use bools with __atomic* :)

Documentation.
User avatar
bloodline
Member
Member
Posts: 264
Joined: Tue Sep 15, 2020 8:07 am
Location: London, UK

Re: Priority Inversions... The curse of multitasking.

Post by bloodline »

moonchild wrote:
nullplan wrote:. However, GCC makes it easy with the sync extensions. Those also work cross-platform. In your case, with a really simple spin lock, you probably want to use __sync_lock_test_and_set(&lock->isLocked, 1). If that returns 0, you got the lock. The only problem is that bool happens to be the only integer type not suitable for use with the sync extensions, so you'd have to change the type of "isLocked".
Don't use the __sync* family of functions, use the __atomic* family. The former is deprecated, less flexible, and doesn't fit with the new memory models of c11/c++11. Oh, and you can use bools with __atomic* :)

Documentation.
Thanks for this!! Very helpful.
CuriOS: A single address space GUI based operating system built upon a fairly pure Microkernel/Nanokernel. Download latest bootable x86 Disk Image: https://github.com/h5n1xp/CuriOS/blob/main/disk.img.zip
Discord:https://discord.gg/zn2vV2Su
rdos
Member
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Priority Inversions... The curse of multitasking.

Post by rdos »

nullplan wrote:
rdos wrote:Futexes are great, but not that easy to get right with SMP.
Do you mean the futex implementation, or their use in synchronization primitives? Because the former is rather easy to get right:
The primary problem is that data associated with the futex is shared between the application and kernel. For instance, you cannot keep task lists associated with a futex accessible to the application since then corruption of memory by the applications could lead to a crashed scheduler. You definitely don't want spinlocks used by kernel to be in the application memory space either. Therefore, the manipulation of the futex counter use locked instructions that are atomic which means spinlocks are not needed on the application side. You also don't want resources for unused or non-contended futexes in the application to be associated with kernel resources.

I solve this by having a separate futex area where data for futexes are allocated which normally won't get corrupted by the application even if it misuses the heap. This area will keep the counter, the name (if used) and a handle to kernel. The handle is initially set to zero and will only be allocated if locking the futex fails. In kernel, the handle is associated with a list of threads that are blocked on the futex, and a pointer to the userspace structure. This way, the application can sabotage the counter, and make the futex fail to work as expected, but it cannot manipulate the list of blocked threads directly.

Another problem is that I want to use generic API calls for criticial sections in the application and the use of futuxes requires allocation of the futex area in the application address space and patching these syscalls to use futexes instead of doing direct API calls to kernel. I suppose another solution to this is to have this code in libc, but then if you use pure assembly or another compiler & linker then futexes won't be used. Therefore, I let the application loader handle it by patching code in a different manner than normal. Another potential problem of having these structures compiled with the application is that you cannot change it if you decide you need to modify the implementation. With patching this is a non-issue since the implementation is not linked with the application, rather constructed by the application loader.

The application loaders implementation of futexes looks like this (p? labels are changed to the address of the futex area):

Code: Select all

create_us_section   Proc near
    push eax
    push ecx
    push edx
    push edi
    
p1:
    mov edx,12345678h
    mov edi,edx
    mov ecx,MAX_SECTIONS
    xor eax,eax
    repnz scasd
    stc
    jnz csDone
;
    mov ebx,MAX_SECTIONS
    sub ebx,ecx
    push ebx
;
    dec ebx
    sub edi,4
    mov eax,16
    mul ebx
    add eax,4 * MAX_SECTIONS
    mov edx,eax

p2:    
    add edx,12345678h    
    mov [edi],edx
;    
    mov [edx].fs_handle,0
    mov [edx].fs_val,-1
    mov [edx].fs_counter,0
    mov [edx].fs_owner,0
    mov [edx].fs_sect_name,0
;
    pop ebx
    clc

csDone:
    pop edi
    pop edx
    pop ecx
    pop eax    
    ret
create_us_section   Endp

create_named_us_section   Proc near
    push eax
    push ecx
    push edx
    push edi
    push ebp
;
    mov ebp,edi
    
p1n:
    mov edx,12345678h
    mov edi,edx
    mov ecx,MAX_SECTIONS
    xor eax,eax
    repnz scasd
    stc
    jnz cnsDone
;
    mov ebx,MAX_SECTIONS
    sub ebx,ecx
    push ebx
;
    dec ebx
    sub edi,4
    mov eax,16
    mul ebx
    add eax,4 * MAX_SECTIONS
    mov edx,eax

p2n:    
    add edx,12345678h    
    mov [edi],edx
;    
    mov [edx].fs_handle,0
    mov [edx].fs_val,-1
    mov [edx].fs_counter,0
    mov [edx].fs_owner,0
    mov [edx].fs_sect_name,ebp
;
    pop ebx
    clc

cnsDone:
    pop ebp
    pop edi
    pop edx
    pop ecx
    pop eax    
    ret
create_named_us_section   Endp

free_us_section   Proc near
    push edx
;    
    sub ebx,1
    jc fusDone
;
    cmp ebx,MAX_SECTIONS
    jae fusDone
;
    shl ebx,2
    
p3:
    add ebx,12345678h
    xor eax,eax
    xchg eax,[ebx]
    or eax,eax
    jz fusDone
;    
    mov ebx,eax
    mov eax,[ebx].fs_handle
    or eax,eax
    jz fusDone
;    
    UserGateApp cleanup_futex_nr

fusDone:
    xor ebx,ebx
;
    pop edx
    ret
free_us_section   Endp

enter_us_section    Proc near
    push eax
    push ebx
;    
    sub ebx,1
    jc eusDone
;
    cmp ebx,MAX_SECTIONS
    jae eusDone
;
    shl ebx,2
    
p4:
    add ebx,12345678h
    mov ebx,[ebx]
    or ebx,ebx
    jz eusDone
;            
    str ax
    cmp ax,[ebx].fs_owner
    jne eusLock
;
    inc [ebx].fs_counter
    jmp eusDone

eusLock:
    lock add [ebx].fs_val,1
    jc eusTake
;
    mov eax,1
    xchg ax,[ebx].fs_val
    cmp ax,-1
    jne eusBlock

eusTake:
    str ax
    mov [ebx].fs_owner,ax
    mov [ebx].fs_counter,1
    jmp eusDone

eusBlock:
    push edi
    mov edi,[ebx].fs_sect_name
    or edi,edi
    jnz eusNamedBlock
;    
    UserGateApp acquire_futex_nr
    jmp eusBlockPop

eusNamedBlock:
    UserGateApp acquire_named_futex_nr

eusBlockPop:
    pop edi

eusDone:
    pop ebx
    pop eax    
    ret
enter_us_section   Endp

leave_us_section    Proc near
    push eax
    push ebx
;    
    sub ebx,1
    jc lusDone
;
    cmp ebx,MAX_SECTIONS
    jae lusDone
;
    shl ebx,2
    
p5:
    add ebx,12345678h
    mov ebx,[ebx]
    or ebx,ebx
    jz lusDone
;
    str ax
    cmp ax,[ebx].fs_owner
    jne lusDone
;
    sub [ebx].fs_counter,1
    jnz lusDone
;
    mov [ebx].fs_owner,0
    lock sub [ebx].fs_val,1
    jc lusDone
;
    mov [ebx].fs_val,-1
    UserGateApp release_futex_nr

lusDone:
    pop ebx
    pop eax
    ret
leave_us_section    Endp
At the kernel side it looks like this:

Code: Select all

    
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;       
;
;           NAME:           acquire_futex
;
;           DESCRIPTION:    Acquire futex
;
;           PARAMS:         ES:(E)BX    address to futex struct
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

acquire_futex_name    DB 'Acquire Futex',0

acquire_futex   Proc near
    push ds
    push fs
    push eax
    push ebx
    push cx
    push esi
;    
    mov esi,ebx
    mov ebx,es:[esi].fs_handle
    mov ax,FUTEX_HANDLE
    DerefHandle
    jnc acquire_no_sect
;    
    mov ax,SEG data
    mov ds,ax
    EnterSection ds:futex_section
;
    mov ebx,es:[esi].fs_handle
    mov ax,FUTEX_HANDLE
    DerefHandle
    jnc acquire_handle_ok
;
    mov cx,SIZE futex_handle_seg
    AllocateHandle
    mov ds:[ebx].fh_list,0
    mov ds:[ebx].fh_lock,0
    mov [ebx].hh_sign,FUTEX_HANDLE
;
    movzx eax,[ebx].hh_handle
    mov es:[esi].fs_handle,eax

acquire_handle_ok:
    push ds
    mov ax,SEG data
    mov ds,ax
    LeaveSection ds:futex_section
    pop ds

acquire_no_sect:
    call LockCore
    sti
    call cs:lock_futex_proc    
    mov ax,1
    xchg ax,es:[esi].fs_val
    cmp ax,-1
    je acquire_take
;
    mov ax,ds
    push OFFSET acquire_done
    call SaveLockedThread
    mov ds,ax
;
    lea edi,[ebx].fh_list     
    mov es,fs:cs_curr_thread
    mov fs:cs_curr_thread,0
;
    call InsertBlock32
    call cs:unlock_futex_proc
;
    mov es:p_sleep_type,SLEEP_TYPE_FUTEX
    mov es:p_sleep_sel,ds
    mov es:p_sleep_offset,esi
    jmp LoadThread

acquire_take:
    push es
    mov es,fs:cs_curr_thread
    mov ax,es:p_futex_id
    pop es
    mov es:[esi].fs_owner,ax
    mov es:[esi].fs_counter,1
;    
    call cs:unlock_futex_proc
    call UnlockCore
    
acquire_done:
    pop esi
    pop cx
    pop ebx
    pop eax
    pop fs
    pop ds    
    ret
acquire_futex   Endp

acquire_futex16 Proc far
    push ebx
    movzx ebx,bx
    call acquire_futex
    pop ebx
    retf32
acquire_futex16 Endp

acquire_futex32 Proc far
    call acquire_futex
    retf32
acquire_futex32 Endp
    
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;       
;
;           NAME:           acquire_named_futex
;
;           DESCRIPTION:    Acquire named futex
;
;           PARAMS:         ES:(E)BX    address to futex struct
;                           ES:(E)DI    name of section
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

acquire_named_futex_name    DB 'Acquire Named Futex',0

acquire_named_futex   Proc near
    push ds
    push fs
    push eax
    push ebx
    push ecx
    push esi
    push edi
;
    GetThread
    mov fs,ax
    mov esi,OFFSET p_list_name
    mov ecx,31

acquire_named_copy:    
    mov al,es:[edi]
    mov fs:[esi],al
    inc esi
    inc edi
    or al,al
    jz acquire_named_copied
;
    loop acquire_named_copy

acquire_named_copied:
    xor al,al
    mov fs:[esi],al
;    
    mov esi,ebx
    mov ebx,es:[esi].fs_handle
    mov ax,FUTEX_HANDLE
    DerefHandle
    jnc acquire_named_no_sect
;    
    mov ax,SEG data
    mov ds,ax
    EnterSection ds:futex_section
;
    mov ebx,es:[esi].fs_handle
    mov ax,FUTEX_HANDLE
    DerefHandle
    jnc acquire_named_handle_ok
;
    mov cx,SIZE futex_handle_seg
    AllocateHandle
    mov ds:[ebx].fh_list,0
    mov ds:[ebx].fh_lock,0
    mov [ebx].hh_sign,FUTEX_HANDLE
;
    movzx eax,[ebx].hh_handle
    mov es:[esi].fs_handle,eax

acquire_named_handle_ok:
    push ds
    mov ax,SEG data
    mov ds,ax
    LeaveSection ds:futex_section
    pop ds

acquire_named_no_sect:
    call LockCore
    sti
    call cs:lock_futex_proc    
    mov ax,1
    xchg ax,es:[esi].fs_val
    cmp ax,-1
    je acquire_named_take
;
    mov ax,ds
    push OFFSET acquire_named_done
    call SaveLockedThread
    mov ds,ax
;
    lea edi,[ebx].fh_list     
    mov es,fs:cs_curr_thread
    mov fs:cs_curr_thread,0
;
    call InsertBlock32
    call cs:unlock_futex_proc
;
    mov es:p_sleep_type,SLEEP_TYPE_FUTEX
    mov es:p_sleep_sel,ds
    mov es:p_sleep_offset,esi
    jmp LoadThread

acquire_named_take:
    push es
    mov es,fs:cs_curr_thread
    mov ax,es:p_futex_id
    pop es
    mov es:[esi].fs_owner,ax
    mov es:[esi].fs_counter,1
;    
    call cs:unlock_futex_proc
    call UnlockCore
    
acquire_named_done:
    pop edi
    pop esi
    pop ecx
    pop ebx
    pop eax
    pop fs
    pop ds    
    ret
acquire_named_futex   Endp

acquire_named_futex16 Proc far
    push ebx
    push edi
    movzx edi,di
    movzx ebx,bx
    call acquire_named_futex
    pop edi
    pop ebx
    retf32
acquire_named_futex16 Endp

acquire_named_futex32 Proc far
    call acquire_named_futex
    retf32
acquire_named_futex32 Endp
    
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;       
;
;           NAME:           release_futex
;
;           DESCRIPTION:    Release futex
;
;           PARAMS:         ES:(E)BX    address to futex struct
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

release_futex_name    DB 'Release Futex',0

release_futex   Proc near
    push ds
    push fs
    push eax
    push ebx
    push cx
    push esi
;
    mov esi,ebx
    mov ebx,es:[esi].fs_handle
    mov ax,FUTEX_HANDLE
    DerefHandle
    jc release_done
;
    call LockCore
    sti
    call cs:lock_futex_proc    
;    
    mov ax,ds:[ebx].fh_list
    or ax,ax
    jz release_unlock
;    
    mov ax,1
    xchg ax,es:[esi].fs_val
    cmp ax,-1
    jne release_unlock
;    
    push di
;    
    push es
    push esi
    lea esi,ds:[ebx].fh_list
    call RemoveBlock32
    mov es:p_data,0
    mov cx,es
    mov di,es:p_futex_id
    pop esi
    pop es
;
    mov es:[esi].fs_owner,di
    mov es:[esi].fs_counter,1
    call cs:unlock_futex_proc    
;
    push es    
    mov es,cx
    call InsertWakeup
    pop es
;    
    pop di
;    
    call UnlockCore
    jmp release_done

release_unlock:
    call cs:unlock_futex_proc
    call UnlockCore
    
release_done:
    pop esi
    pop cx
    pop ebx
    pop eax
    pop fs
    pop ds    
    ret
release_futex   Endp

release_futex16 Proc far
    push ebx
    movzx ebx,bx
    call release_futex
    pop ebx
    retf32
release_futex16 Endp

release_futex32 Proc far
    call release_futex
    retf32
release_futex32 Endp
    
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;       
;
;           NAME:           cleanup_futex
;
;           DESCRIPTION:    Cleanup futex
;
;           PARAMS:         ES:(E)BX    address to futex struct
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

cleanup_futex_name    DB 'Cleanup Futex',0

cleanup_futex   Proc near
    push ds
    push eax
    push ebx
;
    mov ebx,es:[ebx].fs_handle
    mov ax,FUTEX_HANDLE
    DerefHandle
    jc cleanup_done
;
    FreeHandle

cleanup_done:
    pop ebx
    pop eax
    pop ds
    ret
cleanup_futex   Endp

cleanup_futex16 Proc far
    push ebx
    movzx ebx,bx
    call cleanup_futex
    pop ebx
    retf32
cleanup_futex16 Endp

cleanup_futex32 Proc far
    call cleanup_futex
    retf32
cleanup_futex32 Endp

rdos
Member
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Priority Inversions... The curse of multitasking.

Post by rdos »

Looking at the code I realize that creating sections is not thread-safe. :(

I think a solution might be to use xchg and inspect the value and retry if it is non-zero:

Code: Select all

    xchg edx,[edi]
    or edx,edx
    jnz Retry

Ok:
    mov edx,[edi]
There is also a spinlock associated with the kernel state of the futex if running with SMP. This is what call cs:lock_futex_proc and call cs:unlock_futex_proc links to. For SMP it looks like this (for single core it just returns):

Code: Select all


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
;
;           NAME:           LockFutexMultiple
;
;           DESCRIPTION:    Lock futex, multiple processor version
;
;           PARAMETERS:     DS:EBX      Futex handle data
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

LockFutexMultiple  Proc near
    push ax

lfSpinLock:    
    mov ax,ds:[ebx].fh_lock
    or ax,ax
    je lfGet
;
    pause
    jmp lfSpinLock

lfGet:
    inc ax
    xchg ax,ds:[ebx].fh_lock
    or ax,ax
    je lfDone
;
    jmp lfSpinLock

lfDone:
    pop ax    
    ret
LockFutexMultiple  Endp

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;
;
;           NAME:           UnlockFutexMultiple
;
;           DESCRIPTION:    Unlock futex, multiple processor version
;
;           PARAMETERS:     DS:EBX      Futex handle data
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

UnlockFutexMultiple    Proc near
    mov ds:[ebx].fh_lock,0
    sti
    ret
UnlockFutexMultiple    Endp
I think it is important that the spinlock data is in kernel-space only since otherwise an application could cause a core to lock-up forever.

An advantage of associating the kernel-side data with application handles is that if the application doesn't cleanup it's futexes when it terminates, the house-keeping of handles by the loader will free the kernel-side data.
User avatar
bloodline
Member
Member
Posts: 264
Joined: Tue Sep 15, 2020 8:07 am
Location: London, UK

Re: Priority Inversions... The curse of multitasking.

Post by bloodline »

I've never used the gcc atomic builtins before, can anyone confirm I'm using them correctly? The internet seems strangely scant on details regarding their usage.

Code: Select all


inline void Lock(bool* lock){
  
    if( !__atomic_test_and_set( lock, __ATOMIC_ACQ_REL ) ){
        
        //Do some housekeeping, then wait for the lock to free

        while ( !__atomic_test_and_set( lock, __ATOMIC_ACQ_REL ) ) {
            Reschedule();
        }
    
    }

}



inline void FreeLock(bool* lock){
    __atomic_clear( lock, __ATOMIC_RELEASE );
}

CuriOS: A single address space GUI based operating system built upon a fairly pure Microkernel/Nanokernel. Download latest bootable x86 Disk Image: https://github.com/h5n1xp/CuriOS/blob/main/disk.img.zip
Discord:https://discord.gg/zn2vV2Su
Octocontrabass
Member
Member
Posts: 5513
Joined: Mon Mar 25, 2013 7:01 pm

Re: Priority Inversions... The curse of multitasking.

Post by Octocontrabass »

bloodline wrote:I've never used the gcc atomic builtins before, can anyone confirm I'm using them correctly?
The __atomic_test_and_set() builtin returns false when you acquire the lock, and true when another thread currently holds the lock.

For acquiring a lock using a test-and-set operation, you only need __ATOMIC_ACQUIRE to prevent later loads and stores from happening before the lock is acquired. Using __ATOMIC_ACQ_REL also forces previous loads and stores to happen before attempting to acquire the lock, which is probably more restrictive than you need.
bloodline wrote:The internet seems strangely scant on details regarding their usage.
The memory order stuff is modeled directly after std::memory_order from C++, so that might help you find more references for how it's meant to be used.

The individual operations are based on operations that can be made atomic on most CPU architectures. For x86, look at every instruction that allows a LOCK prefix. You might find some more information by looking for how individual instructions would be used, since compiler-level atomic operations are relatively recent.
User avatar
bloodline
Member
Member
Posts: 264
Joined: Tue Sep 15, 2020 8:07 am
Location: London, UK

Re: Priority Inversions... The curse of multitasking.

Post by bloodline »

Octocontrabass wrote:
bloodline wrote:I've never used the gcc atomic builtins before, can anyone confirm I'm using them correctly?
The __atomic_test_and_set() builtin returns false when you acquire the lock, and true when another thread currently holds the lock.
Right you are! misread this line:
the return value is true if and only if the previous contents were “set”.
For acquiring a lock using a test-and-set operation, you only need __ATOMIC_ACQUIRE to prevent later loads and stores from happening before the lock is acquired. Using __ATOMIC_ACQ_REL also forces previous loads and stores to happen before attempting to acquire the lock, which is probably more restrictive than you need.
I see, thanks for the advice.
bloodline wrote:The internet seems strangely scant on details regarding their usage.
The memory order stuff is modeled directly after std::memory_order from C++, so that might help you find more references for how it's meant to be used.
Cheers.
CuriOS: A single address space GUI based operating system built upon a fairly pure Microkernel/Nanokernel. Download latest bootable x86 Disk Image: https://github.com/h5n1xp/CuriOS/blob/main/disk.img.zip
Discord:https://discord.gg/zn2vV2Su
linguofreak
Member
Member
Posts: 510
Joined: Wed Mar 09, 2011 3:55 am

Re: Priority Inversions... The curse of multitasking.

Post by linguofreak »

bloodline wrote: So I did ponder a semaphore system, which would put the task back into the waiting list, only to be scheduled back into when it received the signal that the lock was free. But since a task switch in my system is just a simple stack swap, keeping the task in the ready list has less overhead than the bookkeeping needed to keep track of which task is waiting for which lock, and then alter the task state depending upon the conditions. But this is definitely something I would like to add in future for higher level resources!
You could do something like giving each task a function pointer for a callback "isWaiting()".

For the scheduler, the isWaiting() pointer is used as follows: When a new task is created, the pointer is initialized as a null pointer. When the scheduler selects a task to run, it checks to see if the isWaiting() pointer is null. If it is, it schedules the task. If it is not, it calls the function on the other end of the pointer. If the pointer is non-null when the function returns, the scheduler rejects the task as a candidate to run and moves on down the list of ready tasks. If it is now null, the scheduler runs the task.

For the task, the isWaiting() pointer is used like this: When the task needs to wait on a lock, it sets the isWaiting() pointer to point to a function that tests that lock. When the function is called, it performs the test, and if another task still holds the lock, it does nothing and returns. If the lock is free, the function nulls the isWaiting() pointer and returns.
Post Reply