
Re: Multitasking, SMP & Timers

Posted: Wed May 16, 2012 3:26 pm
by gerryg400
AlfaOmega08 wrote:Using sleep_until, I don't even need a sleep queue, right? Just keep all the threads in one queue, and if now() > wakeup_time, then the thread can be run. What about using the HPET one-shot timer to move a thread from the sleep queue to the "ready to run" queue? I still don't get the point... sorry.
You don't really need a sleep queue. Your OS will probably have timer support, and those timers will be, when armed, placed in some sort of 'queue' that allows them to expire at the correct time. The simplest way of doing this is a simple list arranged in expiry order, where each timer has a field in its structure with the expiry time. You can check the list from time to time to see whether the current time is later than the expiry time of the head of the list, and you can 'expire' the timers at the front of the list.

Once that mechanism is in place you can put timers in the list that represent sleeping tasks (i.e. they contain a pointer to the sleeping task). In some implementations you can actually put the task objects in the list. There is no need to have a separate sleep queue.
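
For example, a rough sketch of that sorted list might look like this (the struct layout and the now()/make_task_ready() helpers are just placeholders here):

Code: Select all

struct timer {
    uint64_t      expiry_time;   /* absolute time, same units as now() */
    struct task  *task;          /* task to wake when the timer expires */
    struct timer *next;
};

struct timer *timer_list_head = NULL;

/* Insert keeping the list sorted by expiry time (O(n) walk) */
void timer_insert(struct timer *t) {
    struct timer **p = &timer_list_head;
    while (*p != NULL && (*p)->expiry_time <= t->expiry_time) {
        p = &(*p)->next;
    }
    t->next = *p;
    *p = t;
}

/* Check "from time to time" (e.g. from a tick handler) and expire the head */
void timer_check(void) {
    while (timer_list_head != NULL && timer_list_head->expiry_time <= now()) {
        struct timer *t = timer_list_head;
        timer_list_head = t->next;
        make_task_ready(t->task);
    }
}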

BTW, a linked list can be a bit unwieldy at insert time, and there are better ways of queueing timers.

[edit] Just a clarification, by "You don't really need a sleep queue." I mean your scheduler doesn't need a sleep queue, just a bunch of ready queues.

Re: Multitasking, SMP & Timers

Posted: Wed May 16, 2012 4:54 pm
by bluemoon
gerryg400 wrote:Just a clarification, by "You don't really need a sleep queue." I mean your scheduler doesn't need a sleep queue, just a bunch of ready queues.
For the very same reason you don't need a queue for blocked threads: sleeping is just another word for "blocked on a timer".

Re: Multitasking, SMP & Timers

Posted: Wed May 16, 2012 11:37 pm
by Brendan
Hi,
AlfaOmega08 wrote:
Brendan wrote:If one CPU is running at 2 GHz and another CPU is running at 500 MHz (because it got too hot and needs to cool down, for example); then if 2 tasks have the same priority and one gets 10 ms on the first CPU, then the other should get 40 ms on the other (slower) CPU, so that they both get the same fair share of CPU time.
Just realized that, if my knowledge isn't totally wrong, the APIC timer runs at the same frequency as the CPU. So if I calculate the CPU frequency at boot, at full speed, and keep that as the frequency of the LAPIC Timer, then if the CPU slows down, the timer will follow it, and threads will have more time.
No. Normally the local APIC timer runs at the speed of the bus, not at the speed of the CPU. The speed of the bus is constant and isn't affected by CPU speed changes. Newer local APICs do have a "TSC deadline mode" which runs at the speed of the CPU's time stamp counter; but for all newer CPUs the time stamp counter also runs at a fixed frequency and isn't affected by CPU speed changes.
AlfaOmega08 wrote:
Brendan wrote:For this reason, your kernel needs "sleep_until( when );". You could implement a "sleep( duration )" function like this:

Code: Select all

void sleep( uint64_t duration ) {
     sleep_until( now() + duration );
}
This also means that tasks sleep until "now() > task->wake_time"; and you don't have to decrease a "time remaining" value for every sleeping task every time the timer IRQ occurs.
Using sleep_until, I don't even need a sleep queue, right? Just keep all the threads in one queue, and if now() > wakeup_time, then the thread can be run. What about using the HPET one-shot timer to move a thread from the sleep queue to the "ready to run" queue? I still don't get the point... sorry.
Start with a simple linked list that is sorted according to each task's "wake time". For a one shot timer, you'd work out how long until the next task should wake up ("delay = sleep_queue_head->wake_time - now()") and set the one shot timer to generate an IRQ when that delay expires. When the IRQ occurs you'd do something like:

Code: Select all

IRQhandler {
    acquire_spinlock(sleep_queue_lock);
    while( (sleep_queue_head != NULL) && ( sleep_queue_head->wake_time <= now() ) ) {
        temp_task = sleep_queue_head;
        sleep_queue_head = temp_task->next;        // Remove task from sleep queue
        make_task_ready_to_run(temp_task);         // Make the task "ready to run"
    }
    if(sleep_queue_head != NULL) {
        delay = sleep_queue_head->wake_time - now();
        set_one_shot_timer(delay);
    }
    release_spinlock(sleep_queue_lock);
}
This seems relatively simple, but there are three problems. The first problem is that inserting a task into the sleep queue can be quite slow when there are thousands of tasks on the sleep queue (you need to search through thousands of entries to find the right place to insert). The second problem is that many CPUs may be fighting for the "sleep_queue_lock" (especially when CPUs are inserting lots of tasks and insertion is slow) - you may have to wait for many other CPUs to search thousands of entries each before you can even start searching yourself. The third problem is interrupt latency - you have to disable IRQs when you're inserting to make sure that the HPET IRQ doesn't occur and try to acquire the lock on the same CPU that already has the lock (and cause deadlock), and you really don't want to have interrupts disabled while you're searching a list of thousands of entries if you can avoid it.

The easiest way to fix the first problem is to have more "sorted linked lists"/queues (e.g. if there's 1000 queues then each queue will only have 1/1000th of the entries and you can insert tasks onto a queue 1000 times faster). The easiest way to reduce the second problem is to have more locks (e.g. with 1000 queues and 1000 locks, the chance of 2 or more CPUs fighting for the same lock is 1000 times smaller). Increasing the number of lists/queues improves the insertion time, which also helps with the interrupt latency problem.

So, if you're going to increase the number of queues, what is the smartest way of doing that?

To begin with, I really like "per-CPU" structures. If something is "per CPU" then only one CPU can access it and you can just disable IRQs and have no lock at all. HPET normally doesn't have enough timers capable of one-shot though, so you'd have to use local APIC timers for this; and if you do use local APIC timers for "per CPU" sleep queues then you have to do something about the "local APIC turned off" problem. To solve the "local APIC turned off" problem you can have per-CPU sleep queues that use the local APIC, and an extra global sleep queue that uses HPET that is used to keep things going when a local APIC is turned off. Basically, just before a CPU goes to sleep you'd shift everything from that CPU's sleep queue onto the global sleep queue.
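
A rough sketch of that "shift everything onto the global queue" step might look like this (the queue layout, the lock and all the helper names are only placeholders):

Code: Select all

/* Called just before this CPU's local APIC (and its timer) gets turned off */
void migrate_sleep_queue_to_global(int cpu) {
    disable_interrupts();                         /* per-CPU queue needs no lock */
    acquire_spinlock(global_sleep_queue_lock);
    while (per_cpu_sleep_queue[cpu] != NULL) {
        struct task *t = per_cpu_sleep_queue[cpu];
        per_cpu_sleep_queue[cpu] = t->next;
        global_sleep_queue_insert(t);             /* sorted insert, driven by HPET */
    }
    reprogram_hpet_for_new_head();                /* so the new head still wakes on time */
    release_spinlock(global_sleep_queue_lock);
    enable_interrupts();
}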

That is a good start, but it still might not be enough. To reduce insertion times more you can have multiple sleep queues per CPU and multiple global sleep queues.

One way to increase the number of queues more would be to do something like "queue_number = task_wake_time % number_of_queues". This is easy and would spread things out well; but it means that your CPU's timer's IRQ handler would have to worry about all of the CPU's queues and not just one of them.

A slightly different way to increase the number of queues more would be to do something like "queue_number = (task_wake_time - time_for_current_first_queue) / time_per_queue" (where "time_per_queue" might be something like 128 ms). In this case, the CPU's timer's IRQ handler only needs to find the first queue and doesn't need to look at all of the CPU's queues. The problem with this is extremely long delays - you probably don't want to have millions of queues per CPU, just in case a task wants to sleep for 12 days.

Most delays are short (e.g. less than 5 seconds). If "time_per_queue" is 128 ms then you'd only need about 40 queues to handle delays that are up to 5 seconds long (40 * 128 ms = 5.12 seconds). If any task wants a delay that is longer than this, you can put the task onto an extra "delay too big" queue. Every 128 ms you'd check the "delay too big" queue and see if there's anything that could/should be shifted to the last "128 ms" queue. However, at the same time the "current first queue" would become empty, and you'd rotate the queues - the old "current first queue" becomes the new "current last queue", and then you check the "delay too big" queue and see if there's anything that should be shifted to the new "current last queue".
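
As a rough sketch (using the "128 ms" and "about 40 queues" numbers from above; all the names are placeholders, and the caller is assumed to have IRQs disabled because these are per-CPU queues):

Code: Select all

#define TIME_PER_QUEUE  128        /* ms covered by each queue */
#define NUM_QUEUES      40         /* 40 * 128 ms = 5.12 seconds */

void sleep_queue_insert(struct task *task) {
    uint64_t n = (task->wake_time - time_for_current_first_queue) / TIME_PER_QUEUE;

    if (n >= NUM_QUEUES) {
        sorted_insert(&delay_too_big_queue, task);    /* the "delay too big" queue */
    } else {
        /* the queues are used as a ring, offset from the current first queue */
        sorted_insert(&queue[(current_first_queue + n) % NUM_QUEUES], task);
    }
}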

Think of it like this (with only 3 queues plus the "delay too big queue"):

Code: Select all

Queue1: ABCDE
Queue2: FGH
Queue3: IJ
QueueX: KLM
You'd wake up task A, then task B, then task C, etc. Eventually it ends up looking like this:

Code: Select all

Queue1:
Queue2: FGH
Queue3: IJ
QueueX: KLMN
Then (possibly a little later) you're finished with Queue1 and have to rotate the queues:

Code: Select all

Queue1: FGH
Queue2: IJ
Queue3:
QueueX: KLMN
And then see if there's any tasks on "QueueX" that need to be shifted:

Code: Select all

Queue1: FGH
Queue2: IJ
Queue3: KL
QueueX: MN
After that you'd wake up task F, task G, etc.; then rotate the queues again.

Of course with linked lists, if you know that "QueueX" is sorted, you only have to find where to split the list and split the list at that point. You don't need to remove tasks from "QueueX" and insert them onto "Queue3" one at a time.
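
A rough sketch of that rotation (continuing the placeholder names from the insertion sketch above):

Code: Select all

void rotate_queues(void) {
    /* the old (now empty) first queue becomes the new last queue */
    int new_last = current_first_queue;
    current_first_queue = (current_first_queue + 1) % NUM_QUEUES;
    time_for_current_first_queue += TIME_PER_QUEUE;

    /* anything on "QueueX" that now fits gets split off in one go */
    uint64_t limit = time_for_current_first_queue
                   + (uint64_t)NUM_QUEUES * TIME_PER_QUEUE;

    struct task *head = delay_too_big_queue;
    struct task *prev = NULL;
    struct task *cur  = head;
    while (cur != NULL && cur->wake_time < limit) {
        prev = cur;
        cur  = cur->next;
    }

    if (prev != NULL) {                  /* at least one task fits now */
        prev->next = NULL;               /* split "QueueX" at this point */
        queue[new_last] = head;          /* the whole front chunk moves at once */
        delay_too_big_queue = cur;
    }
}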

The end result of all this is that you might have 64 queues per CPU (using the local APIC timer); plus another 64 global queues (using HPET). You only ever insert tasks onto the per CPU queues. With 16 CPUs you've got 1024 queues, and if there's 10000 tasks on those queues then you've only got an average of about 9.7 tasks on each queue. You can search a list of 10 things fairly quickly (far more quickly than you can search a list of 10000 things), and (because there's no need for locks for the per CPU queues) most of the lock contention problems are gone. The global queues still need locks, but (except for the HPET IRQ handler) you only need to acquire those locks when you're merging per CPU queues with the global queues before turning a CPU off, which isn't going to be happening too often anyway.


Cheers,

Brendan

Re: Multitasking, SMP & Timers

Posted: Thu May 17, 2012 12:47 am
by AlfaOmega08
Thanks Brendan, I think I got the queue thing. Except for the memory overhead of having thousands of queues all over the place, I like the idea of splitting up the sleep queues between CPUs and dividing them by wake-up time distance.
At this point it's unclear to me why Intel & MSFT bothered to design a new High Precision Event Timer, when they already had the LAPIC Timer which, at this point, seems the only truly needed timer.

However there's one last question: the LAPIC Timer can be either in Periodic mode or in One Shot mode (or TSC Deadline, obviously), but not both at the same time like I can do with the HPET. So if I have to replace lower priority threads with higher priority ones before the lower priority one ends its time slice, as gerryg400 suggested, I will need One Shot mode programmed to the high priority wakeup time. Otherwise I would be fine with Periodic.

Does this mean that I only have to use One Shot mode, programmed either to the end of the time slice, or to the wake time of the next higher priority task?

Thanks again for your help.

Edit: Brendan, you got 4444 posts 8) =D>

Re: Multitasking, SMP & Timers

Posted: Thu May 17, 2012 3:43 am
by Brendan
Hi,
AlfaOmega08 wrote:At this point it's unclear to me why Intel & MSFT bothered to design a new High Precision Event Timer, when they already had the LAPIC Timer which, at this point, seems the only truly needed timer.
Microsoft started their "multimedia timers" thing before SMP became common (and before you could assume a local APIC timer exists), and encouraged HPET to get higher precision than the PIT could provide (with less overhead). The local APIC timer existed beforehand, but became common later.

The local APIC timer is better until/unless a CPU goes to sleep and the local APIC timer stops counting. Because of this problem you can't use the local APIC timer alone for most things (everything except working out when a running task has run for enough time, where you know the CPU isn't asleep) and you need some other timer, like HPET, anyway.
AlfaOmega08 wrote:However there's one last question: the LAPIC Timer can be either in Periodic mode or in One Shot mode (or TSC Deadline, obviously), but not both at the same time like I can do with the HPET. So if I have to replace lower priority threads with higher priority ones before the lower priority one ends its time slice, as gerryg400 suggested, I will need One Shot mode programmed to the high priority wakeup time. Otherwise I would be fine with Periodic.
For keeping track of "wall clock time" with a timer (e.g. PIT), periodic is probably the best way because you don't need to worry about drift/inaccuracy caused by reprogramming the count. Of course when you can use HPET's main counter, or when you can use "fixed frequency TSC", you don't need any IRQs (or a periodic timer) to keep track of wall clock time to begin with.
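
For example, a rough sketch of reading wall clock time straight from HPET's main counter (the register offsets are from the HPET spec; hpet_base and the boot-time values are assumed to have been set up elsewhere, and the 128-bit intermediate is GCC/Clang-specific):

Code: Select all

#define HPET_GENERAL_CAPS   0x000    /* bits 63:32 = counter period in femtoseconds */
#define HPET_MAIN_COUNTER   0x0F0

static volatile uint64_t *hpet_base;    /* memory mapped HPET, found via ACPI */
static uint64_t hpet_period_fs;         /* read once from HPET_GENERAL_CAPS >> 32 */
static uint64_t boot_count;             /* main counter value recorded at boot */

uint64_t wall_clock_ns(void) {
    uint64_t ticks = hpet_base[HPET_MAIN_COUNTER / 8] - boot_count;
    /* 128-bit intermediate avoids overflow after a few hours of uptime */
    return (uint64_t)(((unsigned __int128)ticks * hpet_period_fs) / 1000000);
}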

For everything else one-shot is better, as you get far higher precision from the same timer and it's easy to avoid unwanted IRQs.

For example, if you've got a periodic IRQ that occurs every 10 ms and you want a 123.456 ms delay, then the first twelve IRQs are a waste of time and the thirteenth is going to be 6.544 ms too late. To reduce the number of "waste of time" IRQs you could reduce the frequency and end up with less precision (e.g. with an IRQ every 100 ms the 123.456 ms delay will involve one "waste of time" IRQ and will be 76.544 ms too late). To improve precision you could increase the frequency and end up with a lot more "waste of time" IRQs (e.g. with an IRQ every 1 ms the 123.456 ms delay will involve 123 "waste of time" IRQs and will be 0.544 ms too late).

For "one-shot" you can have no "waste of time" IRQs at all and (for a crusty old PIT chip) only be 0.000838 ms too late.
AlfaOmega08 wrote:Does this mean that I only have to use One Shot mode, programmed either to the end of the time slice, or to the wake time of the next higher priority task?
When using one-shot mode for scheduling, you'd set the new count at the start of every time slice (if the scheduling policy needs an IRQ at all - e.g. if you switch to a real time task that uses "earliest deadline first" scheduling then you wouldn't set the timer for the new task at all).
AlfaOmega08 wrote:Edit: Brendan, you got 4444 posts 8) =D>
D'oh - now I've broken it! :)


Cheers,

Brendan

Re: Multitasking, SMP & Timers

Posted: Sat May 19, 2012 4:45 pm
by AlfaOmega08
Just two minutes before going to sleep a horrible thought came to my mind. I have worked out a rather simple multilevel feedback queue scheduler, and I'm now using the LAPIC Timer in one shot mode to preempt a thread. This can happen for two reasons: the thread consumes its time slice, or another thread with higher priority has to be woken up and executed.

My concern is about cli & sti. Obviously they're used a lot in kernel mode for spinlocks, mutexes and who knows what. What if the timer shot happens with interrupts disabled? With a periodic timer you will sooner or later be preempted, but if the shot is not received, the thread will have unlimited time to run.
Is this related in some way to edge/level mode? I thought this could be solved by using the timer in level mode since, AFAICS, the pin is kept high until EOI. Would this suffice, so that as soon as interrupts are re-enabled the IRQ fires?

Anyway, I looked at the manuals and while you can set edge/level trigger mode for LINTs, Performance, etc., there's no such flag for the timer. What mode should I assume?

Thanks again.

Re: Multitasking, SMP & Timers

Posted: Sat May 19, 2012 7:00 pm
by Brendan
Hi,
AlfaOmega08 wrote:Just two minutes before going to sleep a horrible thought came to my mind. I have worked out a rather simple multilevel feedback queue scheduler, and I'm now using the LAPIC Timer in one shot mode to preempt a thread. This can happen for two reasons: the thread consumes its time slice, or another thread with higher priority has to be woken up and executed.
If a higher priority thread has woken up and needs to preempt, then the code that woke up the thread should ask the scheduler to switch tasks/preempt immediately and the scheduler should set a new timer count. There is an unavoidable race condition involved here - it happens when the timer expires while you're in the middle of switching tasks (before you set a new timer count). As long as you've got interrupts disabled until after you set the count, the timer IRQ handler can check the timer's count after interrupts are enabled and see a non-zero timer count.

This means that when the timer IRQ occurs, either the currently running thread has used all of its time slice (timer's count == 0 in the IRQ handler), or the timer IRQ was caused by the race condition above and can be ignored (timer's count != 0 in the IRQ handler).
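
In other words, the IRQ handler ends up looking something like this (0x390 is the local APIC "Current Count" register, as an offset from the local APIC's MMIO base; the lapic_* and scheduler helpers here are just placeholders):

Code: Select all

#define LAPIC_TIMER_CURRENT_COUNT  0x390

void lapic_timer_irq_handler(void) {
    if (lapic_read(LAPIC_TIMER_CURRENT_COUNT) != 0) {
        /* A new count was set after this IRQ was generated (the race above),
           so the current task hasn't used its time slice - just ignore it */
        lapic_send_eoi();
        return;
    }
    /* Count really is zero: the running task used all of its time slice */
    lapic_send_eoi();
    scheduler_end_of_time_slice();
}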
AlfaOmega08 wrote:My concern is about cli & sti. Obviously they're used a lot in kernel mode for spinlocks, mutexes and who knows what. What if the timer shot happens with interrupts disabled? With a periodic timer you will sooner or later be preempted, but if the shot is not received, the thread will have unlimited time to run.
No. The "CLI" instruction only causes interrupts to be postponed (and not ignored). If the timer IRQ happens when interrupts are "disabled" then you will receive the IRQ when you do STI.
AlfaOmega08 wrote:Is this related in some way to edge/level mode? I thought this could be solved by using the timer in level mode since, AFAICS, the pin is kept high until EOI. Would this suffice, so that as soon as interrupts are re-enabled the IRQ fires?

Anyway, I looked at the manuals and while you can set edge/level trigger mode for LINTs, Performance, etc., there's no such flag for the timer. What mode should I assume?
It's edge triggered.


Cheers,

Brendan

Re: Multitasking, SMP & Timers

Posted: Sun May 20, 2012 2:24 am
by AlfaOmega08
Brendan wrote:If a higher priority thread has woken up and needs to preempt, then the code that woke up the thread should ask the scheduler to switch tasks/preempt immediately and the scheduler should set a new timer count. There is an unavoidable race condition involved here - it happens when the timer expires while you're in the middle of switching tasks (before you set a new timer count). As long as you've got interrupts disabled until after you set the count, the timer IRQ handler can check the timer's count after interrupts are enabled and see a non-zero timer count.

This means that when the timer IRQ occurs, either the currently running thread has used all of its time slice (timer's count == 0 in the IRQ handler), or the timer IRQ was caused by the race condition above and can be ignored (timer's count != 0 in the IRQ handler).
I don't understand how the race condition can happen. What I do is:

Code: Select all

if (isThereAHigherPriorityThreadToBeWokenUpShortly())
   SetupOneShot(HigherPriorityTimeout());
else
   SetupOneShot(5_MS);
The race condition you described would happen if I had a periodic timer + a one shot timer. The one shot timer signals that I have to wake up a higher priority thread, and while the scheduler is working, the periodic timer fires, right?

At this point the race condition might exist if HigherPriorityTimeout() is so short that the timer expires before I can get to IRET, but after I've set the new timer count.

Thanks.

Re: Multitasking, SMP & Timers

Posted: Sun May 20, 2012 5:17 am
by Brendan
Hi,
AlfaOmega08 wrote:
Brendan wrote:If a higher priority thread has woken up and needs to preempt, then the code that woke up the thread should ask the scheduler to switch tasks/preempt immediately and the scheduler should set a new timer count. There is an unavoidable race condition involved here - it happens when the timer expires while you're in the middle of switching tasks (before you set a new timer count). As long as you've got interrupts disabled until after you set the count, the timer IRQ handler can check the timer's count after interrupts are enabled and see a non-zero timer count.

This means that when the timer IRQ occurs, either the currently running thread has used all of its time slice (timer's count == 0 in the IRQ handler), or the timer IRQ was caused by the race condition above and can be ignored (timer's count != 0 in the IRQ handler).
I don't understand how the race condition can happen. What I do is:

Code: Select all

if (isThereAHigherPriorityThreadToBeWokenUpShortly())
   SetupOneShot(HigherPriorityTimeout());
else
   SetupOneShot(5_MS);
Think about tasks that block.

You might do something like this:
  • Task is running, timer is counting down
  • Task decides to "sleep()" or block waiting for IO or something
  • Kernel puts task into some sort of "waiting" state
  • Kernel tries to find another task to switch to
  • Kernel disables IRQs
  • Kernel switches to the new task
  • Kernel sets new timer count
  • Kernel enables IRQs
  • New task is running
And you might get something like this:
  • Task is running, timer is counting down
  • Task decides to "sleep()" or block waiting for IO or something
  • Kernel puts task into some sort of "waiting" state
  • Kernel tries to find another task to switch to
  • Kernel disables IRQs
  • Timer expires here (but IRQ isn't delivered to CPU yet because IRQs are disabled)
  • Kernel switches to the new task
  • Kernel sets new timer count
  • Kernel enables IRQs
  • Postponed timer IRQ is received here
  • New task should be running (but maybe the scheduler thought it used all of its time slice because of the timer IRQ, even though it actually used none of its time slice)
To avoid problems, the timer IRQ handler has to detect when the timer IRQ is an "unwanted but unavoidable" IRQ that should be ignored.

The only other alternative is to atomically disable the local APIC timer while interrupts are enabled, but that just makes things worse:
  • Task decides to "sleep()" or block waiting for IO or something
  • Kernel disables the timer
  • Kernel is interrupted by an unrelated IRQ (from hard disk, ethernet or something)
  • IRQ handler (for hard disk, ethernet or something) causes a task switch
  • Eventually, kernel switches back to the task that decided to "sleep()" or block waiting for IO or something; but this means the timer has been set up again and is running
  • Now we've got the same problem as before, plus a new "task may be waiting for something that already happened" problem

Cheers,

Brendan

Re: Multitasking, SMP & Timers

Posted: Sun May 20, 2012 11:11 am
by AlfaOmega08
It took me 10 minutes to understand the race condition thing. I thought "how can the timer fire if count != 0?".
I finally saw the light, and I'm just astonished...

Well Brendan, thank you for the 100th time. I guess my OS would have had serious issues without all your suggestions :D

Re: Multitasking, SMP & Timers

Posted: Mon May 21, 2012 2:08 am
by rdos
Brendan wrote:This is what I call "variable time slice scheduling". It's bad because high priority tasks have to wait for all medium/low priority tasks to have their turn before they get any CPU time. Here's what happens when there's 10 low priority threads (A to J), 10 medium priority threads (K to T) and one extremely high priority thread (X), where each character represents 5 ms of time:

Code: Select all

AAABBBCCCDDDEEEFFFGGGHHHIIIJJJKLMNOPQRSTXXXXXXXXXX
It is possible to let higher priority threads always interrupt lower priority threads. The scheduler will keep per-priority queues of ready threads (and a pointer to the current highest priority queue), and will only select threads from the highest priority queue. Whether the kernel programs different timeouts depending on priority is a different issue. It doesn't have to be related to whether lower priority threads are preempted or not.
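
For example, a rough sketch of those per-priority ready queues (with a bitmap instead of the "pointer to the current highest priority queue"; the names are placeholders and __builtin_clzll is GCC/Clang-specific):

Code: Select all

#define NUM_PRIORITIES 64

static struct task *ready_queue[NUM_PRIORITIES];  /* 0 = lowest, 63 = highest */
static uint64_t     ready_bitmap;                 /* bit n set => queue n is non-empty */

void make_ready(struct task *t) {
    queue_append(&ready_queue[t->priority], t);
    ready_bitmap |= 1ULL << t->priority;
}

struct task *pick_next_task(void) {
    if (ready_bitmap == 0)
        return idle_task;
    int top = 63 - __builtin_clzll(ready_bitmap);   /* highest non-empty queue */
    struct task *t = queue_pop(&ready_queue[top]);
    if (ready_queue[top] == NULL)
        ready_bitmap &= ~(1ULL << top);
    return t;
}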
Brendan wrote: The next step is to recognise that the difference between task priorities can be so large that lower priority tasks shouldn't get CPU time at all if higher priority tasks are able to run. For example, you could have a "priority infinity" task which preempts any lower priority task and runs immediately, where no lower priority task gets any CPU time when a "priority infinity" task is able to run. If there are several "priority infinity" tasks they could share the CPU. You could also have "priority infinity + n" tasks, where different "priority infinity + n" tasks share the CPU somehow (in the same way that "priority n" tasks do).

In a similar way, you could have "priority negative infinity" tasks which are preempted by anything and never run if anything else is able to run. You could also have "priority negative infinity - n" tasks, where different "priority negative infinity - n" tasks share the CPU somehow.

Effectively, what we've done is create 3 separate "priority groups". You could have any number of "priority groups". Within each priority group you could use a different policy for scheduling. One group might use "variable frequency scheduling", one group might use "variable time slice scheduling", and other groups might use something completely different.

You could create different groups for different purposes. This allows you to choose the best method of scheduling for the specific type of tasks being scheduled. You might have soft real time tasks that use "earliest deadline first" scheduling and call that "scheduling policy 1". The second group (scheduling policy 2) might use "variable frequency scheduling" and be intended for tasks that handle user interaction (key presses, video, etc). The third group (scheduling policy 3) might use "variable time slice scheduling" and be intended for normal tasks. A fourth group might be for things that should only be done if there's nothing else to do (idle tasks) and might use "variable time slice scheduling" too.
That seems awfully complex. Why not simply decide that higher priority threads are always run before lower priority threads, and that higher priority threads that become ready preempt lower priority threads? If this doesn't work because of uncooperative higher priority threads, a mechanism for automatically decreasing priority could be added. A good indication of an uncooperative high priority thread is that it becomes preempted, rather than blocking itself.
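
A rough sketch of that automatic demotion, reusing make_ready() from the sketch above (the field names and MIN_DYNAMIC_PRIORITY are placeholders):

Code: Select all

/* the thread ran until the timer preempted it - treat that as uncooperative */
void on_time_slice_expired(struct task *t) {
    if (t->effective_priority > MIN_DYNAMIC_PRIORITY)
        t->effective_priority--;
    make_ready(t);
}

/* the thread blocked voluntarily - restore its full priority */
void on_task_blocked(struct task *t) {
    t->effective_priority = t->base_priority;
}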
Brendan wrote: For SMP, you don't want a global "scheduler lock", as many CPUs might be fighting for this lock at the same time (and wasting CPU time). The more CPUs there are the worse it would get - lock contention would cause poor scalability. If your scheduler has a lock for each CPU then the chance of CPUs fighting for the same lock is a lot less, and performance/scalability would improve. If your scheduler has 6 different locks per CPU (one for each scheduling policy) then there'd be even less lock contention and you'd improve performance/scalability even more.
You still need global locks to be able to move threads between CPUs.

Re: Multitasking, SMP & Timers

Posted: Mon May 21, 2012 2:15 am
by Combuster
rdos wrote:You still need global locks to be able to move threads between CPUs.
Nonsense. That's only because you can't design anything better.

Re: Multitasking, SMP & Timers

Posted: Mon May 21, 2012 2:19 am
by rdos
Brendan wrote:If one CPU is running at 2 GHz and another CPU is running at 500 MHz (because it got too hot and needs to cool down, for example); then if 2 tasks have the same priority and one gets 10 ms on the first CPU, then the other should get 40 ms on the other (slower) CPU, so that they both get the same fair share of CPU time.

The best way to do this is probably to use a performance monitoring counter setup to count "instructions retired". The scheduler doesn't let a task run for a certain amount of time, but instead the scheduler lets a task run for a certain number of instructions. If one task runs for 1 million instructions on one CPU and another task runs for 1 million instructions on another CPU, then the tasks get exactly the same share of CPU time regardless of how much "wall clock" time the tasks actually consumed.

If performance monitoring counters can't be used, then you can try to keep track of how fast a CPU can currently execute instructions and use the local APIC timer to approximate "instructions retired". This isn't going to be as accurate (especially when things like hyper-threading and TurboBoost are involved), but it's still going to be better than scheduling "raw time".
I disagree, but it depends on whether these threads need more than 10 ms on the fast CPU (or 40 ms on the slow one). If the threads need more than this amount, load balancing should kick in and move them both to the faster CPU, as they will execute faster there even when they compete for the same CPU.

I also think that the preemption timeout should never be as high as 40 ms, as then it would take much too long before load balancing kicks in. Rather, the thread should go through 40 scheduling decisions on the slower CPU, at each of which it might be moved because the CPU appears to be overloaded.

Using instructions per second instead of time is also problematic on some CPUs where it is more or less impossible to know what this measure currently might be (for instance in the presence of hyperthreading on an Intel CPU, or in the presence of automatic power management on other CPUs).

Re: Multitasking, SMP & Timers

Posted: Mon May 21, 2012 3:38 am
by Solar
rdos wrote:It is possible to let higher priority threads always interrupt lower priority threads. The scheduler will keep per-priority queues of ready threads (and a pointer to the current highest priority queue), and will only select threads from the highest priority queue.
If you have a task that is high-priority but doesn't make system calls that often, that thread will almost always be "ready", stopping your lower-priority household tasks cold and starving the system.

(Desktop analogy: Yes, you want that prime number calculation done ASAP, but you still want to see that incoming IM message and have the system react to your input.)

Re: Multitasking, SMP & Timers

Posted: Mon May 21, 2012 4:03 am
by Owen
Solar wrote:
rdos wrote:It is possible to let higher priority threads always interrupt lower priority threads. The scheduler will keep per-priority queues of ready threads (and a pointer to the current highest priority queue), and will only select threads from the highest priority queue.
If you have a task that is high-priority but doesn't make system calls that often, that thread will almost always be "ready", stopping your lower-priority household tasks cold and starving the system.

(Desktop analogy: Yes, you want that prime number calculation done ASAP, but you still want to see that incoming IM message and have the system react to your input.)
Then such a task should not be allowed to run at high priority. For example, you might cap user applications at "kPriorityNormal", except for those run by administrators, which are capped at "kPriorityAboveNormal", while the session manager might run at "kPriorityAboveNormal"/"kPriorityHigh" respectively.

Perhaps the thing we can take from this is that "Urgency" and "Priority" should be separate variables - urgency controls the processing order (i.e. more urgent tasks absolutely preempt less urgent ones), while priority controls the time slice size.