I appreciate you taking the time to work through the issues of something as complex as this. I hope this thread will be useful for others too.
thewrongchristian wrote:
Consider this scenario:
1. Task A locks L1
2. Task B locks L2
3. Task C locks L3
4. Task A wants L2
5. Task B wants L3
6. Task C wants L1
What is a deadlock in a wait queue system, one that you can actually detect using dependency tracing, becomes a livelock in your system. The wait queue can help you detect this situation if you were so inclined.
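(Concretely, the dependency tracing is just a pointer walk. A minimal sketch, assuming hypothetical fields where each lock records its owning task and each task records the lock it is currently blocked on:)

    struct lock;
    struct task {
        struct lock *blocked_on;  /* lock this task is waiting for, or NULL */
    };
    struct lock {
        struct task *owner;       /* task currently holding the lock */
    };

    /* Would 'waiter' blocking on 'wanted' close a cycle (A->B->C->A)?
       Follow owner -> blocked_on -> owner ... and see if we return to
       the start. Assumes no pre-existing cycle elsewhere; a production
       version would bound the walk. */
    static int would_deadlock(struct task *waiter, struct lock *wanted)
    {
        for (struct task *t = wanted->owner; t != NULL;
             t = t->blocked_on ? t->blocked_on->owner : NULL) {
            if (t == waiter)
                return 1;
        }
        return 0;
    }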
That is a very contrived example, but I get your point. In my system, sharing of resources is normally (i.e. at the user-space level) done via a class of messages which have very strict (and executive-enforced) rules about access, and this is all handled within the normal waiting system architecture. But the GUI task currently works differently (this is my first desktop GUI effort, so it's easily the weakest part of my project). Other than the GUI task, locks are only used at the executive level, so they are not as arbitrary as the example you have given. I will probably (definitely) need to rewrite the GUI task at some point to make better use of the messaging system.
Separately, if you do your "hlt"-based idle at low priority, you have a race condition if you're relying on interrupts to clear any locks. For example:
1. Task A starts an I/O operation
2. I/O completes quicker than expected, signals the completion event
3. Task A waits until the I/O is complete, which presumably involves polling an event
4. With no other tasks to schedule (Task A is waiting for an event that won't happen again) you "hlt" waiting for an interrupt that won't arrive.
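(For reference, the usual cure on x86 is to make the "anything runnable?" check and the hlt atomic with respect to interrupts: sti only takes effect after the instruction that follows it, so nothing can slip in between it and the hlt. A rough sketch, where runnable_tasks() and schedule() are hypothetical scheduler hooks:)

    extern int  runnable_tasks(void);
    extern void schedule(void);

    void idle_task(void)
    {
        for (;;) {
            asm volatile("cli");        /* close the race window */
            if (runnable_tasks()) {
                asm volatile("sti");
                schedule();             /* work arrived after all */
                continue;
            }
            /* sti enables interrupts only after the NEXT instruction,
               so a wakeup cannot be lost between the check and the hlt. */
            asm volatile("sti; hlt");
        }
    }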
From this example, I think we might be talking at cross purposes, which would explain our disagreement. My system is a microkernel; tasks normally can't lock a resource that doesn't belong to them, and I'm using locks purely to ensure synchronisation. (I also have a Forbid()/Permit() pair of functions in the kernel which tells the scheduler not to reschedule when the task's quantum has expired, for a similar purpose, but I don't intend to expose that to user space, for obvious reasons! I mostly used them for temporary locking while working on other system components; I don't think they are used any more and they can probably be removed.)
So all hardware access is handled by a dedicated device class of task (or multiple tasks, depending upon the needs of the device, but the user's task has no idea how the device works). If a user's task wants to perform an I/O operation, it raises an IORequest class of message with the required device (task). There is a formal interface for this, and the task can choose either to wait for the I/O to complete (in which case it will sleep and be woken via the normal signalling system), or to do other processing and periodically check on the status of the IORequest, dealing with it as it sees fit once it is returned.
Interrupts do not access system resources. They only fill buffers and then signal their associated device task that an interrupt has happened. If that task is waiting on an interrupt, it will be woken and will deal with the newly filled buffer; if that task is already running, it will see the signal (i.e. the Wait(signalMask) function will return immediately rather than putting the task to sleep) and again handle the buffer. User tasks never wait on an interrupt; they always deal with the associated device task via the messaging system.
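(In code, that interrupt-to-task handshake looks roughly like the following. Signal() and Wait() mirror the names used in this thread; everything else is a hypothetical stand-in:)

    #include <stdint.h>

    #define RX_SIZE 256
    static volatile uint8_t  rx_buffer[RX_SIZE];
    static volatile unsigned rx_head;

    /* Interrupt side: touch no system resources, just fill the buffer
       and signal the device task. */
    void uart_interrupt(void)
    {
        rx_buffer[rx_head++ % RX_SIZE] = read_uart_data();
        Signal(uart_task, SIG_RX);      /* wake the device task */
    }

    /* Device task side: sleeps until signalled; if a signal arrived
       while it was still running, Wait() returns immediately instead. */
    void uart_device_task(void)
    {
        for (;;) {
            uint32_t sigs = Wait(SIG_RX | SIG_IOREQ);
            if (sigs & SIG_RX)
                drain_rx_buffer();      /* consume what the ISR queued */
            if (sigs & SIG_IOREQ)
                service_io_requests(); /* messages from user tasks */
        }
    }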
In the best case, you have an unrelated interrupt (like a timer) that runs your scheduler again, and task A now notices the I/O has completed, but you've turned an unexpectedly quick operation (so quick you missed it) into a long, potentially indeterminate, wait.
Good example. I have a timer device task. The timer interrupt has a simple counter and can signal the timer task when a certain value is reached. The timer task is not scheduled until it receives a signal from the timer interrupt that the interrupt event has occurred; it then checks the counter, from which it knows the current "system time", and can then service any IORequests that it may be holding. If a user task needs a timer, it sets up an IORequest with the timer task, specifying the period of time after which it expects the IORequest to be returned. The user task can then wait (it won't be scheduled while it is waiting for a message) for the IORequest to be returned by the timer task. It is also free to do other things and use CheckIO(IORequest*) to see if the timer has returned.
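(From the user task's side that boils down to something like this; CheckIO() is the real call mentioned above, the other names are hypothetical equivalents:)

    /* Ask the timer device task to return this request in 500 ms. */
    IORequest *req = CreateIORequest(timer_device);
    req->period_ms = 500;               /* hypothetical field */
    SendIO(req);                        /* hand it to the timer task */

    do_other_work();                    /* the task stays runnable */

    if (!CheckIO(req))                  /* not returned yet? */
        WaitIO(req);                    /* sleep until it comes back */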
A task waiting on a lock queue will use literally no CPU time.
If you have to poll for locks now being free, you're wasting CPU time, but there are bigger problems with this approach (see above.)
So, I think I didn't really explain the scope of how locks are used in my system. The examples you are citing are all handled by the messaging system (which is built on a system of signals), and any task waiting for a signal sleeps until one of the signals it is waiting for arrives.
All events in my GUI, for example, are built using proper messages; if they weren't, the idle task would never run, since every task would be busy-waiting for the GUI to signal it!
The particular issue here is that I don't have a reverse mechanism for a task sending (graphical) data back to the GUI task, so the GUI task just reads the data from the task directly. If the task is making changes to the description of the data it wishes to display, it needs to complete that update before the GUI task tries to read it. This really needs a more formal mechanism... and during the writing of this post, I have thought about how that might work... But the concept of atomic data structure updates remains an important one in the OS design.
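(In the meantime it is just the usual critical-section discipline, sketched here with made-up structure names around the Lock()/Free() pair discussed below:)

    /* Client task: complete the whole update under the lock, so the
       GUI task can never observe a half-written description. */
    Lock(&win->desc_lock);
    win->desc.width  = new_width;
    win->desc.height = new_height;
    Free(&win->desc_lock);

    /* GUI task: takes the same lock before reading the description. */
    Lock(&win->desc_lock);
    render(&win->desc);
    Free(&win->desc_lock);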
That would require the lock to keep a record of all the tasks which want the resource; then, upon freeing of the lock, one of the tasks in the list would need to be signalled to run. This doesn't feel very robust to me, and would be quite messy to debug... I think...
You need a simple linked list, which needs to be robust in the face of multi-threading. You'll need that anyway, really, in a robust OS.
You build the lock and its waiting list on simpler, easier to debug primitives.
At the bottom, you have an uninterruptible spin lock. It needs to be uninterruptible so you can safely use it in interrupt handlers.
The spinlock protects the higher level mutex, which encapsulates the locking state as well as the task waiting list. Once you have the spinlock protection, you can poll and modify the wait list with impunity.
The waiting list itself can be either a singly or doubly linked list. I use a doubly linked circular list, with a single head pointer forming the wait list, to enable arbitrary removal from the list without having to walk the list.
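(A minimal sketch of that list; because every node carries both links, removal is O(1) and never walks the list:)

    /* Circular doubly linked list; the list itself is just a head
       pointer, NULL when empty. */
    struct node { struct node *next, *prev; };

    static void list_append(struct node **head, struct node *n)
    {
        if (*head == NULL) {
            n->next = n->prev = n;      /* first node points at itself */
            *head = n;
        } else {
            struct node *h = *head;
            n->next = h;                /* insert just before the head, */
            n->prev = h->prev;          /* i.e. at the tail of the ring */
            h->prev->next = n;
            h->prev = n;
        }
    }

    static void list_remove(struct node **head, struct node *n)
    {
        if (n->next == n) {             /* sole element */
            *head = NULL;
        } else {
            n->prev->next = n->next;
            n->next->prev = n->prev;
            if (*head == n)
                *head = n->next;
        }
    }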
That's the foundation every OS is built on. It can be written and debugged at user level, so making it robust can be done in a nice easy environment.
The trickiest part is getting the spinlock reliable in the face of interrupts and SMP (for future use). But once that's done, you can build more complex primitives on top. My own personal sync library, disregarding the processor-dependent spinlocks and the linked list macros, is less than 450 lines of C for:
- Spinlock based mutex with wait list
- Mutex based read/write lock
- Mutex based monitor (enter/leave/wait/signal/broadcast)
- Spinlock based monitor (as above, but usable for interrupt locking and signalling)
Given how critical synchronization is to implementing an OS robustly, you need to invest time to do this properly.
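(Putting those layers together, a rough sketch: C11 atomics for the spinlock, with interrupt masking elided, and hypothetical scheduler hooks block_current()/make_runnable() plus wait-list helpers:)

    #include <stdatomic.h>

    /* Layer 1: spinlock. A real kernel version also saves, disables
       and restores the interrupt flag around the critical section. */
    typedef struct { atomic_flag f; } spinlock_t;

    static void spin_lock(spinlock_t *s)
    {
        while (atomic_flag_test_and_set_explicit(&s->f, memory_order_acquire))
            ;                           /* spin */
    }

    static void spin_unlock(spinlock_t *s)
    {
        atomic_flag_clear_explicit(&s->f, memory_order_release);
    }

    /* Layer 2: mutex = spinlock guarding a locked flag + wait list. */
    struct task;
    typedef struct {
        spinlock_t   guard;
        int          locked;
        struct node *waiters;           /* circular list as sketched above */
    } mutex_t;

    void mutex_lock(mutex_t *m)
    {
        spin_lock(&m->guard);
        while (m->locked) {             /* re-check after every wakeup */
            enqueue_current(&m->waiters);  /* hypothetical: park the caller */
            spin_unlock(&m->guard);
            /* A real version marks the task as sleeping before dropping
               the guard, so a wakeup between these two lines isn't lost. */
            block_current();            /* sleep: zero CPU while queued */
            spin_lock(&m->guard);
        }
        m->locked = 1;
        spin_unlock(&m->guard);
    }

    void mutex_unlock(mutex_t *m)
    {
        spin_lock(&m->guard);
        m->locked = 0;
        struct task *t = dequeue_waiter(&m->waiters);  /* may be NULL */
        spin_unlock(&m->guard);
        if (t)
            make_runnable(t);           /* wake exactly one waiter */
    }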
Yes, I think that my architecture already covers this, but I haven't really communicated it very well (my fault, not yours).
I have used my kernel architecture in embedded designs for many years, but I realise the desktop is a much more complex use case (users suck!), and that's why I'm doing this!
-edit- It would be possible to implement the lock waiting system using the normal task signalling system; I just wanted a more lightweight mechanism, as I only use the Lock()/Free() function pairs to ensure that the update of particular data structures is an atomic operation.
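(A sketch of what that might look like, reusing Forbid()/Permit() as the short critical-section guard and the normal signalling machinery; the lock structure and helper names are hypothetical:)

    struct lock { int held; /* plus a list of waiting tasks */ };

    void Lock(struct lock *l)
    {
        Forbid();                       /* brief, non-preemptible section */
        while (l->held) {
            add_waiter(l, current_task());  /* remember who is sleeping */
            Permit();
            Wait(SIG_LOCK);             /* normal signalling: no CPU used */
            Forbid();                   /* woken: re-check before taking it */
        }
        l->held = 1;
        Permit();
    }

    void Free(struct lock *l)
    {
        Forbid();
        l->held = 0;
        struct task *t = pop_waiter(l); /* may be NULL */
        Permit();
        if (t)
            Signal(t, SIG_LOCK);        /* wake exactly one waiter */
    }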