Hello There,
No introductions this time; straight to the point. I have gone past the usual stage and made new things in my kernel, and now I have to decide on my scheduler design. I don't want to settle for just any decent scheduler out there; I want the best. Okay, that sounds far-fetched, so let me make it "almost" the best.
Now to my design plans. I want to build a regular OS for desktop users, fairly well suited for programming, compiling, and multimedia. I want to have priorities, but I can't get the hang of all the scheduler algorithms for this type of design. I know some of you, like Brendan, have a very good scheduling algorithm. Would you all mind explaining them to me, and may I use a somewhat similar design?
Regards,
Shikhin
The best scheduler
Re: The best scheduler
Hi,
The first part of designing anything is to figure out what the design should achieve. If the goals are "extremely easy to understand, for educational purposes" then you'll want a completely different design to "high performance".
So, for the first step, here's a list of design goals (that may or may not apply to any OS other than mine):
- Scheduler should have several policies, where each policy also has task priorities, and each policy may have an entirely different algorithm to determine which task within the policy is the best task to run. For example, it could have a "real time" policy for tasks that have strict requirements (that uses "earliest deadline first" scheduling), a "normal" policy for normal tasks (that uses a different type of scheduling) and a "background task" policy (that uses "highest priority task gets the CPU" scheduling).
- The scheduling policy and priority used by tasks should be explicit. This means tasks control their own policy and priority (up to some maximum for the task, which is set by whatever created the task); and the scheduler does not attempt to dynamically adjust task priorities using some stupid guesswork that is guaranteed to make the wrong decisions in certain cases (like the borked beyond belief Linux scheduler).
- The OS should measure the performance of CPUs, and calculate important scheduling parameters (e.g. how long each task gets the CPU for) based on actual CPU performance. For an oversimplified example, assume one of the scheduling policies uses "round robin" - on a fast Nehalem giving each task 1 ms of CPU time might be a good idea (but that's far too fast for an old 80486); and on an old 80486 giving each task 100 ms of CPU time might be a good idea (but that's far too slow for a fast Nehalem).
- Scheduling should be aware of CPU topology and take it into account when deciding which CPU/s a task should run on. This can be broken down into 4 separate issues:
- CPU topology includes NUMA information. For NUMA, the memory manager will allocate physical pages for a task that are "close" (fast) for a certain group of CPUs to access, and running the task on a different CPU (that isn't in that group) will involve performance penalties.
- CPU topology includes information about which CPUs share which caches. The idea is that if a task was running on a CPU recently, then its data may still be in that CPU's caches, and it'd be more efficient to run the task on that CPU again (or another CPU that shares those caches) to avoid unnecessary cache misses. Of course this depends on "recently" (if the task hasn't run on any CPU for a while, then it's likely that the task's data won't be in any CPU's caches). I guess this includes keeping track of how busy each CPU has been too (e.g. if a task was last run on CPU #1 a long time ago, but CPU #1 has only been doing "HLT" in that time, then it's still likely that the task's data will be in that CPU's caches anyway).
- CPU topology includes information about logical CPUs that share a core's resources, for both hyper-threading (where just about all of the core's resources are shared by logical CPUs) and AMD's new "bulldozer" thing (where only the floating point pipelines are shared by logical CPUs). The idea is to allow performance to be improved by avoiding resource sharing when it's an advantage to do so.
- CPU topology includes information about which CPUs are affected by which dynamic speed controls (e.g. "TurboBoost"). The idea is to allow performance to be improved by taking advantage of the CPU's dynamic speed controls (e.g. schedule tasks on separate chips so they get the full boost, rather than having them on different cores in the same chip and not getting any boost).
- Scheduling should attempt to balance load across all CPUs. There's no point having 1 CPU going flat out and lots of tasks waiting to use that CPU when there's other CPUs doing nothing.
- Scheduling should be aware of power management. If a CPU is getting hot and other CPUs are still cool, then the scheduler should try to shift load to the cooler CPUs. If a CPU is in a sleep state because it got too hot, then the scheduler (and IRQ balancing) should try very hard to avoid taking this CPU out of the sleep state (until after it cools down). This could/should take into account the user's preferences and fan speeds. For example, maybe the user wants "quiet mode" and not "max. performance mode", and the scheduler needs to make sure CPUs don't get hot enough to require fast/noisy fan speeds (at the expense of performance). The scheduler should also handle software controlled throttling. For example, if the computer is running from battery (or even if it's not), and there are no "high priority" tasks to run (only "idle tasks"), then drop the CPU frequency down a bit to save power.
- The scheduler should be scalable. Something like a "global scheduler lock" must be avoided. In the same way, a "reschedule()" function that determines which tasks all CPUs should run next is undesirable. Each CPU should be independent (with its own timer, etc), such that it's unlikely that 2 CPUs are doing task switches at the same time.
- The scheduler should dynamically manage timers. If there are no tasks for a CPU to run, or if there's only one task for a CPU to run, then the timer used by that CPU should be disabled to avoid the overhead of unnecessary timer IRQs. In a similar way, if the scheduler wants a timer IRQ in 123 ms' time, then there should be one timer IRQ 123 ms later (and not an IRQ every 1 ms, with 123 times as many timer IRQs).
- The scheduler should probably be fast (e.g. "O(1)" where possible). To be honest this is probably the least important thing on the list.
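To make one of the goals above concrete (deriving scheduling parameters, like the length of a time slice, from measured CPU performance), here's a toy sketch. The reference score and the slice bounds are invented numbers purely for illustration, not taken from any real OS:

```python
# Toy sketch: scale a round-robin time slice from a measured CPU benchmark
# score. REF_SCORE and the bounds are made-up values for illustration only.

REF_SCORE = 1000.0    # benchmark score of a hypothetical "fast" reference CPU
FAST_SLICE_MS = 1.0   # slice length chosen for that reference CPU
MIN_SLICE_MS = 1.0
MAX_SLICE_MS = 100.0

def time_slice_ms(measured_score: float) -> float:
    """Slower CPUs get proportionally longer slices, clamped to sane bounds."""
    slice_ms = FAST_SLICE_MS * (REF_SCORE / measured_score)
    return max(MIN_SLICE_MS, min(MAX_SLICE_MS, slice_ms))

print(time_slice_ms(1000.0))  # fast CPU -> 1.0 ms
print(time_slice_ms(10.0))    # 100x slower CPU -> 100.0 ms (clamped at the max)
```

The point is only the shape of the idea: measure once at boot, then scale the tuning parameters, rather than hard-coding values that only suit one generation of hardware.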
It also looks very complex to actually implement. Fortunately, it's actually not as complicated as it looks.
First, I'd find efficient data structures (mostly FIFO queues in my case, implemented as linked lists) to keep track of which tasks (threads) are ready to run. I'd use a different FIFO queue for each priority, for each policy, for each CPU. This means with 4 scheduling policies and 256 thread priorities for each policy, I'd have 1024 FIFOs per CPU (or a total of 16384 FIFOs for 16 CPUs). Because there's a separate reentrancy lock for each of these FIFO queues the chance of lock contention is extremely small (which mostly takes care of the "scalable" goal), and because FIFO queue operations are O(1) the "fast" goal is mostly satisfied too.
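A user-space sketch of that layout (in Python rather than kernel C, with a dict standing in for the real per-CPU arrays and a mutex standing in for the reentrancy lock) might look like this:

```python
from collections import deque
from threading import Lock

NUM_CPUS, NUM_POLICIES, NUM_PRIORITIES = 16, 4, 256

class ReadyQueues:
    """One FIFO queue, with its own lock, per priority, per policy, per CPU."""
    def __init__(self):
        self.queues = {}
        self.locks = {}
        for cpu in range(NUM_CPUS):
            for pol in range(NUM_POLICIES):
                for pri in range(NUM_PRIORITIES):
                    self.queues[(cpu, pol, pri)] = deque()
                    self.locks[(cpu, pol, pri)] = Lock()

    def make_ready(self, task, cpu, policy, priority):
        # O(1) append; only this one queue's lock is taken, so two CPUs
        # rarely contend unless they touch the same (cpu, policy, priority).
        key = (cpu, policy, priority)
        with self.locks[key]:
            self.queues[key].append(task)

    def pick_next(self, cpu, policy, priority):
        # O(1) pop of the task at the head of one specific queue.
        key = (cpu, policy, priority)
        with self.locks[key]:
            return self.queues[key].popleft() if self.queues[key] else None

rq = ReadyQueues()
rq.make_ready("task_a", cpu=0, policy=1, priority=200)
print(rq.pick_next(0, 1, 200))  # -> task_a
```

A real kernel would use intrusive linked lists and per-queue spinlocks instead of Python objects, but the locking granularity (one lock per queue, never a global lock) is the part that matters for scalability.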
When a task is given CPU time it changes from the "ready to run" state to the "running" state, and the scheduler removes the task from whatever FIFO queue it was on. When a running task is preempted it changes from the "running" state back to the "ready to run" state. If a running task blocks (e.g. calls "sleep()" or something) then it changes to some other state (e.g. the "sleeping" state) and sooner or later whatever it was waiting for occurs and the task is then changed from some other state (e.g. the "sleeping" state) back to the "ready to run" state. The important thing here is that whenever a task's state changes from anything to the "ready to run" state (for any reason), the scheduler has to determine which "ready to run" FIFO queue the task needs to be put on.
When the scheduler has to determine which "ready to run" FIFO queue the task needs to be put on, the task's scheduling policy and the task's priority determine everything *except* which CPU. To determine which CPU, the basic idea is to use some sort of formula to calculate a score for each CPU, and then choose the CPU with the highest score. So, where does "some sort of formula" come from? It's derived from just about everything I put in the "list of design goals" above....
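Brendan doesn't give the formula itself, so the weights and inputs below are purely hypothetical; but the overall shape ("score every CPU against the goals, pick the best") could look like:

```python
# Hypothetical scoring formula: the weights and the exact inputs are invented
# for illustration; the post only says the formula is derived from the goals.

def cpu_score(cpu, task):
    score = 0.0
    if cpu["numa_domain"] == task["numa_domain"]:
        score += 100.0                      # NUMA: task's pages are close to this CPU
    if cpu["id"] == task["last_cpu"] and task["ran_recently"]:
        score += 50.0                       # cache warmth: data may still be cached
    score -= 10.0 * cpu["load"]             # load balancing (load in 0..1)
    score -= 5.0 * max(0.0, cpu["temperature"] - 70.0)  # avoid hot CPUs
    return score

def pick_cpu(cpus, task):
    return max(cpus, key=lambda c: cpu_score(c, task))["id"]

cpus = [
    {"id": 0, "numa_domain": 0, "load": 0.9, "temperature": 60.0},
    {"id": 1, "numa_domain": 0, "load": 0.1, "temperature": 60.0},
]
task = {"numa_domain": 0, "last_cpu": 0, "ran_recently": False}
print(pick_cpu(cpus, task))  # -> 1 (same NUMA domain, far less loaded)
```

Each design goal (NUMA, cache sharing, load balancing, power/thermal management) becomes one term in the score, which is why the formula falls out of the goals list rather than being chosen up front.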

Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
- Combuster
Re: The best scheduler
Recalling Brendan's scheduler design as he once posted it (because Brendan only posted the considerations, not all of the details, and because you mentioned it), it very well suits the task it was set to do: efficiently manage a distributed system. It theoretically falls a tad short when it comes to soft real-time things like audio and responsive UIs, because it has no true notion of "real-time".
Point is, there is no silver bullet. Just about all the scheduling systems out there have their ups and downs, and you can only cater so much for anything, since there really are conflicting goals: you can't truly unite hard real-time with perceived performance and fairness, ease of use/implementation, O(1) complexity, and whatever else you might think of.
Consider what round-robin-with-priorities (actually used in desktop systems) does for you, and at which points it fails. Once you've got that down, you can consider what other algorithms might or might not do better, and whether they are better suited to your task.
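As a concrete illustration of the classic failure point (this is a toy simulation, not any real scheduler): with strict priorities, CPU-bound high-priority tasks starve everything below them.

```python
def priority_round_robin(tasks, slices):
    """Toy round-robin-with-priorities: always run the highest-priority
    runnable task, round-robin among equals. tasks: {name: priority}.
    Returns how many time slices each task received."""
    ran = {name: 0 for name in tasks}
    top = max(tasks.values())
    rr = [n for n, p in tasks.items() if p == top]  # the round-robin set
    for i in range(slices):
        ran[rr[i % len(rr)]] += 1   # lower-priority tasks never get a slice
    return ran

# Two CPU-bound high-priority tasks starve the low-priority one entirely:
print(priority_round_robin({"A": 10, "B": 10, "C": 1}, slices=100))
# -> {'A': 50, 'B': 50, 'C': 0}
```

Fixing that starvation (priority ageing, weighted sharing, or policy limits) is exactly where the interesting design trade-offs start.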
Now to answer your question: IMO the best scheduler system is the one where you can easily swap out the scheduling algorithm for another one.

Re: The best scheduler
Hi Brendan,
Your response is quite insightful, and if I ask some more questions, it's because I am a little slow, not because you left anything out of your explanation.
Still, I want to make a "high performance" scheduler, so that I can learn the most from it. Maybe that's a slightly confused design goal, but then I myself am a confused guy...
Brendan wrote: The first part of designing anything is to figure out what the design should achieve. If the goals are "extremely easy to understand, for educational purposes" then you'll want a completely different design to "high performance".
I think it's quite possible to have a design plan where you want something for "educational purposes" and yet "high performance". For example, my OS is for "educational purposes"; I don't think I would be able to make it commercially successful at my age.

Brendan wrote: So, for the first step, here's a list of design goals (that may or may not apply to any OS other than mine):
I am grateful that you shared the list of design goals, and would like to comment on some of them.
Brendan wrote: Scheduler should have several policies, where each policy also has task priorities, and each policy may have an entirely different algorithm to determine which task within the policy is the best task to run. For example, it could have a "real time" policy for tasks that have strict requirements (that uses "earliest deadline first" scheduling), a "normal" policy for normal tasks (that uses a different type of scheduling) and a "background task" policy (that uses "highest priority task gets the CPU" scheduling).
If I get this right, you have different scheduling algorithms, each used according to the task. So one task could be scheduled one way while another is scheduled differently, which means we first have to decide which policy gets scheduled this time, so that needs yet another algorithm. Wouldn't this become a little more confusing than the normal design?
Brendan wrote: The scheduling policy and priority used by tasks should be explicit. This means tasks control their own policy and priority (up to some maximum for the task, which is set by whatever created the task); and the scheduler does not attempt to dynamically adjust task priorities using some stupid guesswork that is guaranteed to make the wrong decisions in certain cases (like the borked beyond belief Linux scheduler).
So wouldn't this mean that a foolish programmer could just set their task's policy and priority very high so that it becomes highly responsive? Well, I doubt any foolish programmer would do the work to use my OS, but still....
Brendan wrote: The OS should measure the performance of CPUs, and calculate important scheduling parameters (e.g. how long each task gets the CPU for) based on actual CPU performance. For an oversimplified example, assume one of the scheduling policies uses "round robin" - on a fast Nehalem giving each task 1 ms of CPU time might be a good idea (but that's far too fast for an old 80486); and on an old 80486 giving each task 100 ms of CPU time might be a good idea (but that's far too slow for a fast Nehalem).
Hmm, would this be like: calculate the CPU performance, or get the CPU model and look it up in something like a performance table? Moreover, this would be like setting dynamic time slices for tasks, right?
Well, thanks for all the good points; I have got some ideas noted down in my notebook.
Regards,
Shikhin
Re: The best scheduler
Combuster wrote: Recalling Brendan's scheduler design as he once posted it (because Brendan only posted the considerations, not all of the details, and because you mentioned it), it very well suits the task it was set to do: efficiently manage a distributed system. It theoretically falls a tad short when it comes to soft real-time things like audio and responsive UIs, because it has no true notion of "real-time".
Point is, there is no silver bullet. Just about all the scheduling systems out there have their ups and downs, and you can only cater so much for anything, since there really are conflicting goals: you can't truly unite hard real-time with perceived performance and fairness, ease of use/implementation, O(1) complexity, and whatever else you might think of.
Consider what round-robin-with-priorities (actually used in desktop systems) does for you, and at which points it fails. Once you've got that down, you can consider what other algorithms might or might not do better, and whether they are better suited to your task.
Now to answer your question: IMO the best scheduler system is the one where you can easily swap out the scheduling algorithm for another one.
Well, as a side note, I wanted to say that I didn't make this post just to get the "best" algorithm out there and use it. I wanted to see the unique designs made by people, how they implemented them, and how they could be the next revolution if implemented correctly.
Regards,
Shikhin
Re: The best scheduler
Hi,
Shikhin wrote: If I get this right, you have different scheduling algorithms, each used according to the task. So one task could be scheduled one way while another is scheduled differently, which means we first have to decide which policy gets scheduled this time, so that needs yet another algorithm. Wouldn't this become a little more confusing than the normal design?
This is actually normal design. Most Unix clones have 3 scheduling policies ("SCHED_FIFO", "SCHED_RR" and "SCHED_OTHER"). The Windows NT kernel only has cooperative scheduling where threads have a priority from 0 to 31; but priorities 0 to 15 are considered "normal" and priorities 16 to 31 are "soft real time". This is equivalent to having 2 scheduling policies ("normal" and "soft real time") with 16 thread priorities each. OS X is a bit like Windows NT (cooperative scheduling only), but they split it into 4 priority bands (normal, system high priority, kernel mode only and real-time) rather than just 2, which is equivalent to having 4 scheduling policies.
My last scheduler had 4 policies. It used cooperative scheduling for the first policy (intended for device drivers, but potentially useful for soft real time too). The next 2 policies (for high priority normal threads and low priority normal threads) used something I call variable frequency scheduling, where threads are given a fixed amount of CPU time, and a thread's priority determines how often it gets that fixed amount of CPU time. The last policy was for idle/background threads, and used variable time slice scheduling (which is like round robin, except a thread's priority determines the amount of CPU time the thread is given). The scheduler policies have precedence - e.g. if there are any threads that can run in the first policy, then none of the threads in any other policy can get any CPU time; and threads in the last policy only get CPU time if there's nothing running in the more important policies. The policies also affected how pre-emption worked - except for the first policy, a thread must be in a more important policy to pre-empt the currently running thread (and the thread priorities aren't taken into account). This helped to avoid frequent task switches (and improved performance), and didn't seem to noticeably affect latency. Of course for my next scheduler it'll all be different...
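The "policies have precedence" rule is simple to sketch. Assuming (as an illustration) that the queues are ordered from most to least important, picking the next thread is just "first non-empty policy wins":

```python
from collections import deque

def pick_next_task(policies):
    """policies: list of run queues ordered most to least important.
    Strict precedence: a thread only runs if every more important
    policy has nothing runnable at all."""
    for queue in policies:
        if queue:
            return queue.popleft()
    return None  # nothing runnable anywhere: the CPU can idle/HLT

# Policy 0 empty (no driver work pending), so policy 1 wins:
policies = [deque(), deque(["gui_thread"]),
            deque(["number_cruncher"]), deque(["indexer"])]
print(pick_next_task(policies))  # -> gui_thread
print(pick_next_task(policies))  # -> number_cruncher (policy 1 is now empty)
```

The thread names here are made up; the point is that the per-policy algorithms (EDF, variable frequency, variable time slice, ...) only ever compete within a policy, never across policies.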

Shikhin wrote: So wouldn't this mean that a foolish programmer could just set their task's policy and priority very high so that it becomes highly responsive?
Yes, but there's a maximum. For example, if the OS starts a GUI and sets its maximum to "policy 1, medium priority", then if that GUI starts an application, the highest maximum the application can be given by the GUI is "policy 1, medium priority"; but the GUI might set the application's maximum to "policy 1, low priority" instead. Neither the GUI nor the application would be able to use "policy 0" at all; and the application can't boost its priority above "policy 1, low priority" (but if the application happens to be running at "policy 2, medium priority" then it can boost itself up to "policy 2, high priority" or even "policy 1, low priority").
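That clamping could be sketched as follows. The numbering convention here (policy 0 is most important; within a policy, a higher number means higher priority) is an assumption for illustration, as are the concrete values:

```python
# Sketch: policy 0 is the most important policy; within a policy, a higher
# priority number is more important. A task's maximum is the most important
# (policy, priority) pair it is ever allowed to request.

def clamp_request(requested, maximum):
    req_pol, req_pri = requested
    max_pol, max_pri = maximum
    if req_pol < max_pol:
        return maximum        # asked for a more important policy than allowed
    if req_pol == max_pol and req_pri > max_pri:
        return maximum        # too high a priority within the maximum policy
    return requested          # within limits: granted as asked

MAX = (1, 128)                # e.g. "policy 1, medium priority"
print(clamp_request((0, 5), MAX))    # -> (1, 128): policy 0 denied outright
print(clamp_request((1, 200), MAX))  # -> (1, 128): clamped within policy 1
print(clamp_request((2, 255), MAX))  # -> (2, 255): less important, so allowed
```

Because the maximum is inherited (whatever creates a task sets it, and can only hand down what it holds itself), a foolish or malicious program can never climb above what its parent was given.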
Shikhin wrote: Hmm, would this be like: calculate the CPU performance, or get the CPU model and look it up in something like a performance table?
A CPU's model doesn't mean anything - if the CPU model is "Opteron model 0x123" then it could be a 1 GHz Opteron or a 3.6 GHz Opteron. Worse, a CPU combined with fast RAM is going to be faster than the exact same CPU combined with slow RAM.
Usually I do something like a mini benchmark; consisting of a wide variety of integer instructions that (occasionally) access memory, that's intended to have a similar mix of instructions as the kernel; and see how often this code can be run in a loop in a fixed length of time. It's definitely not a precise/accurate/thorough benchmark (but it doesn't need to be - it only needs to be good enough for a tuning hint).
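A rough user-space equivalent of such a calibration loop might look like this (wall-clock based, and only a tuning hint, as the post says; the workload mix here is invented):

```python
import time

def mini_benchmark(duration_s=0.05):
    """Count how many iterations of a mixed integer/memory workload fit in a
    fixed wall-clock window; a bigger score means a faster CPU. This is a
    tuning hint, not a precise benchmark."""
    scratch = list(range(256))          # small array so we occasionally hit memory
    iterations = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        x = iterations
        x = (x * 2654435761) & 0xFFFFFFFF   # integer multiply and mask
        x ^= x >> 13                        # shift and xor
        scratch[x & 255] = x                # occasional memory store
        iterations += 1
    return iterations
```

A kernel would run this once per CPU during boot (with a fixed cycle or tick budget instead of `time.monotonic()`), then feed the score into things like time-slice lengths.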
Cheers,
Brendan
Re: The best scheduler
Brendan wrote: Hi...
Well, that gave me ideas, which I noted down in my notebook. Now I'll wait for some more replies to provide insight into what others do....

EDIT:
Brendan wrote: The scheduler policies have precedence - e.g. if there are any threads that can run in the first policy, then none of the threads in any other policy can get any CPU time; and threads in the last policy only get CPU time if there's nothing running in the more important policies.
I am a little confused. If I understand you right, whenever any first-policy task is present, it gets run. Now suppose there is some second-policy task used for heavy computational work; it would basically starve out the lower-policy tasks, wouldn't it? Moreover, if there is one task in policy one, two in policy two and three in policy three, wouldn't that basically starve the policy-four tasks? To be frank, I don't like the concept of letting any thread, in any policy, starve...
Moreover, can you give a little more insight into variable frequency scheduling?
Re: The best scheduler
Hi,
Policy 2 is for things like an application's heavy processing. Policy 1 is for things like an application's user interface. If the user presses a key, then you want to stop doing the heavy processing immediately and handle that keypress. It might take 10 ms to handle that keypress, but after that there's no more policy 1 threads to run so the OS would return to the heavy processing in policy 2 threads (or to the unimportant stuff the idle/background threads do if there's no heavy processing going on). You wouldn't do heavy processing in policy 1 (or policy 0) because that'd be silly, just like you wouldn't do unimportant idle/background stuff in a higher policy.
Now consider device drivers. For me, device drivers run like normal processes, and IRQs are sent via messaging to device drivers. When a device driver receives one of these "IRQ messages" you want the device driver to respond immediately - even if there's applications processing user input or something. This is what policy 0 is for. It might take 1 ms for a policy 0 thread to handle something like an IRQ, but after that there's no more policy 0 threads to run, so the OS would return to policy 1 threads (or something less important).
Now consider a server with 64 CPUs and 100 Gigabit ethernet that's going flat out. The device driver for the 100 Gigabit ethernet might have a policy 0 thread for handling the IRQs, and that thread might consume 95% of an entire CPU. That's good too. If you try to force that thread to use less CPU time (e.g. by forcing it to use policy 1 or something) then you'll end up with lots of dropped packets, etc; and the web servers or email servers or FTP servers or whatever that are running fine on those other 63 CPUs will end up screwed. Even though there's a policy 0 thread consuming 95% of an entire CPU, that's exactly what you want. The scheduler's load balancing will make sure other threads run on other CPUs and don't starve.
What you don't want is (for example) threads that lock up. For example, if a policy 0 thread is buggy and goes into an infinite loop then you actually will have a problem. In this case probably the best thing you can do is to determine if the thread is still receiving (and handling/sending) messages. If it's using 100% of CPU time and processing lots and lots of messages then it's probably under heavy load and needs more CPU time to cope. If it's using 100% of CPU time and not processing any messages then it's probably broken. So, what I'd do is have some sort of limit - if a policy 0 task doesn't check for messages in 10 ms of CPU time then assume it's broken and terminate it. The same would apply to threads in other policies - for example, maybe have a 10 ms limit for policy 0, a 100 ms limit for policy 1, a 1000 ms limit for policy 2 and a 10000 ms limit for policy 3.
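Those per-policy limits could be applied by a watchdog along these lines. The limits are the ones from the paragraph above; the bookkeeping field names are invented for illustration:

```python
# Sketch of the "terminate if it stops checking messages" idea. The limit
# table is from the post; the thread bookkeeping fields are hypothetical.

LIMIT_MS = {0: 10, 1: 100, 2: 1000, 3: 10000}

def watchdog_check(thread):
    """thread: dict with its policy and the CPU time (ms) it has consumed
    since it last checked for messages. Returns 'ok' or 'terminate'."""
    limit = LIMIT_MS[thread["policy"]]
    if thread["cpu_ms_since_last_msg_check"] > limit:
        return "terminate"   # burning CPU without servicing messages: broken
    return "ok"              # busy but still processing messages: needs the CPU

print(watchdog_check({"policy": 0, "cpu_ms_since_last_msg_check": 5}))   # -> ok
print(watchdog_check({"policy": 0, "cpu_ms_since_last_msg_check": 50}))  # -> terminate
```

The key design point is that the watchdog measures CPU time between message checks, not total CPU usage, so a legitimately busy driver (like the 100 gigabit ethernet example) is never punished.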
Now, for variable frequency all tasks get the same amount of CPU time, but higher priority tasks get it more often. For example, with tasks A (a high priority task), B (a medium priority task) and C (a low priority task), you end up with something like "ABACABAABACABA". Notice that A still gets twice as much CPU time as B, and A still gets 4 times as much CPU time as C. Also note that if a new high priority task D wants to start then it doesn't need to wait very long at all before it can have a turn. The problem here is that the OS is doing more task switches. Basically it's good for things that need to respond quickly (like user interface threads) because they don't need to wait as long to get CPU time, but bad for things that don't need to respond quickly (like idle/background threads) because of the extra task switching overhead.
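Brendan doesn't say how the interleaving is produced; one well-known way to get "same fixed slice, more often for higher priority" is stride scheduling, where each task's "pass" value advances by 1/weight per slice and the smallest pass runs next. With weights 4/2/1 it happens to reproduce the "ABACABAABACABA" pattern above:

```python
import heapq

def variable_frequency(weights, slots):
    """Stride-scheduling sketch: every run is one fixed-length slice, and a
    task with twice the weight gets a slice twice as often.
    weights: {task_name: weight}."""
    # Each task's pass value advances by 1/weight per slice it runs; starting
    # at half a stride interleaves the tasks evenly from the beginning.
    heap = [(0.5 / w, name) for name, w in weights.items()]
    heapq.heapify(heap)
    order = []
    for _ in range(slots):
        pass_value, name = heapq.heappop(heap)   # smallest pass runs next
        order.append(name)
        heapq.heappush(heap, (pass_value + 1.0 / weights[name], name))
    return "".join(order)

# A: high priority (weight 4), B: medium (2), C: low (1)
print(variable_frequency({"A": 4, "B": 2, "C": 1}, 14))  # -> ABACABAABACABA
```

Note how A gets 8 of the 14 slices, B gets 4 and C gets 2, matching the 2x and 4x ratios in the post, while no task ever waits long for its next turn.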
Cheers,
Brendan
Consider "idle/background" threads in policy 3 - you only want them to run when there's nothing important to do. If they don't get any CPU time for 6 days, then the OS must have been doing more important things for 6 days, and the OS has successfully prevented the idle/background threads from wasting CPU time for no reason.

Shikhin wrote:
EDIT: I am a little confused. Now, if I get you right, if any first-policy task is present, it gets run. Now suppose there is some second-policy task, used for heavy computational work. It would basically starve out lower-policy tasks, wouldn't it? Moreover, if there is one task in policy one, two in policy two and three in policy three, wouldn't it basically starve the policy four tasks? To be frank, I don't like the concept of letting any thread, of any policy/priority, starve...

Brendan wrote:
This is actually normal design. Most Unix clones have 3 scheduling policies ("SCHED_FIFO", "SCHED_RR" and "SCHED_OTHER"). The Windows NT kernel has a single scheduling policy where threads have a priority from 0 to 31; but priorities 0 to 15 are considered "normal" and priorities 16 to 31 are "soft real time". This is equivalent to having 2 scheduling policies ("normal" and "soft real time") with 16 thread priorities each. OS X is a bit like Windows NT (a single policy with thread priorities), but splits it into 4 priority bands (normal, system high priority, kernel mode only and real-time) rather than just 2, which is equivalent to having 4 scheduling policies.
My last scheduler had 4 policies. It used cooperative scheduling for the first policy (intended for device drivers, but potentially useful for soft real time too). The next 2 policies (for high priority normal threads and low priority normal threads) used something I call "variable frequency" scheduling, where threads are given a fixed amount of CPU time and a thread's priority determines how often it gets that fixed amount of CPU time. The last policy was for idle/background threads, and used "variable time slice" scheduling (which is like round robin, except that a thread's priority determines the amount of CPU time the thread is given). The scheduler's policies have precedence - e.g. if there are any threads that can run in the first policy, then none of the threads in any other policy get any CPU time; and threads in the last policy only get CPU time if there's nothing running in the more important policies. The policies also affected how pre-emption worked - except for the first policy, a thread must be in a more important policy to pre-empt the currently running thread (thread priorities aren't taken into account). This helped to avoid frequent task switches (and improved performance), and didn't seem to noticeably affect latency. Of course for my next scheduler it'll all be different...
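A minimal sketch of that precedence rule, assuming one FIFO ready queue per policy (the names `pick_next` and `should_preempt` are illustrative, not from Brendan's actual code; per-priority queues within each policy are omitted for brevity):

```python
from collections import deque

NUM_POLICIES = 4  # 0 = device drivers, 1 = high, 2 = low, 3 = idle/background

# one FIFO ready queue per policy
ready = [deque() for _ in range(NUM_POLICIES)]

def pick_next():
    """Return the next thread to run, honouring strict policy precedence:
    always pick from the most important non-empty policy's queue."""
    for policy in range(NUM_POLICIES):
        if ready[policy]:
            return ready[policy].popleft()
    return None  # nothing ready: run the idle loop

def should_preempt(current_policy, new_policy):
    """A newly-ready thread pre-empts only if its policy is strictly
    more important (lower number) than the running thread's policy."""
    return new_policy < current_policy

# example: a driver thread and a background thread both become ready
ready[0].append("driver-irq-thread")
ready[3].append("background-indexer")
assert pick_next() == "driver-irq-thread"        # policy 0 always wins
assert should_preempt(current_policy=3, new_policy=0) is True
assert should_preempt(current_policy=1, new_policy=1) is False  # same policy: no pre-emption
```

Because each queue is a plain FIFO, both `pick_next` and enqueueing a newly-ready thread stay O(1) in the number of threads.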
Policy 2 is for things like an application's heavy processing. Policy 1 is for things like an application's user interface. If the user presses a key, then you want to stop doing the heavy processing immediately and handle that keypress. It might take 10 ms to handle that keypress, but after that there's no more policy 1 threads to run so the OS would return to the heavy processing in policy 2 threads (or to the unimportant stuff the idle/background threads do if there's no heavy processing going on). You wouldn't do heavy processing in policy 1 (or policy 0) because that'd be silly, just like you wouldn't do unimportant idle/background stuff in a higher policy.
Now consider device drivers. For me, device drivers run like normal processes, and IRQs are sent via messaging to device drivers. When a device driver receives one of these "IRQ messages" you want the device driver to respond immediately - even if there are applications processing user input or something. This is what policy 0 is for. It might take 1 ms for a policy 0 thread to handle something like an IRQ, but after that there are no more policy 0 threads to run, so the OS would return to policy 1 threads (or something less important).
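A toy sketch of that flow, assuming IRQs are turned into messages on a queue (every name here is hypothetical, invented for illustration):

```python
from collections import deque

IRQ_MESSAGE_QUEUE = deque()          # messages waiting for the driver thread
running = ("app-ui-thread", 1)       # (name, policy) currently on the CPU

def deliver_irq(irq_number):
    """Kernel side: turn a hardware IRQ into a message for the driver's
    policy 0 thread, and report whether that thread should pre-empt
    whatever is currently running (policy 0 outranks everything)."""
    IRQ_MESSAGE_QUEUE.append(("IRQ", irq_number))
    driver_policy = 0
    return driver_policy < running[1]  # pre-empt if strictly more important

# an IRQ arrives while a policy 1 UI thread is running
assert deliver_irq(11) is True                       # driver pre-empts the UI thread
assert IRQ_MESSAGE_QUEUE.popleft() == ("IRQ", 11)    # driver then consumes the message
```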
Now consider a server with 64 CPUs and 100 Gigabit ethernet that's going flat out. The device driver for the 100 Gigabit ethernet might have a policy 0 thread for handling the IRQs, and that thread might consume 95% of an entire CPU. That's good too. If you try to force that thread to use less CPU time (e.g. by forcing it to use policy 1 or something) then you'll end up with lots of dropped packets, etc; and the web servers or email servers or FTP servers or whatever that are running fine on those other 63 CPUs will end up screwed. Even though there's a policy 0 thread consuming 95% of an entire CPU, that's exactly what you want. The scheduler's load balancing will make sure other threads run on other CPUs and don't starve.
What you don't want is threads that lock up. For example, if a policy 0 thread is buggy and goes into an infinite loop then you actually will have a problem. In this case, probably the best thing you can do is determine whether the thread is still receiving (and handling/sending) messages. If it's using 100% of CPU time and processing lots and lots of messages, then it's probably under heavy load and needs more CPU time to cope. If it's using 100% of CPU time and not processing any messages, then it's probably broken. So, what I'd do is have some sort of limit - if a policy 0 task doesn't check for messages within 10 ms of CPU time, then assume it's broken and terminate it. The same would apply to threads in other policies - for example, maybe a 10 ms limit for policy 0, a 100 ms limit for policy 1, a 1000 ms limit for policy 2 and a 10000 ms limit for policy 3.
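As a sketch, that "stuck thread" heuristic is just a per-policy budget check (the function name and exact mechanism are illustrative; the limits follow the example figures above):

```python
# CPU-time budget (ms) a thread may burn without checking for messages,
# per policy - the example figures from the text, not fixed constants.
MSG_CHECK_LIMIT_MS = {0: 10, 1: 100, 2: 1000, 3: 10000}

def is_probably_broken(policy, cpu_ms_since_last_msg_check):
    """True if the thread exceeded its policy's message-check budget,
    i.e. it is burning CPU without doing any message work."""
    return cpu_ms_since_last_msg_check > MSG_CHECK_LIMIT_MS[policy]

# a policy 0 driver thread that hasn't checked messages for 12 ms of CPU time
assert is_probably_broken(0, 12) is True
# a busy but healthy policy 0 thread that checks messages every few ms
assert is_probably_broken(0, 3) is False
# a policy 3 background thread gets a far more generous budget
assert is_probably_broken(3, 9000) is False
```

Note the budget is measured in CPU time consumed by the thread, not wall-clock time, so a thread that simply isn't being scheduled is never punished.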
Imagine you've got three tasks to run, called A, B and C. For round robin the tasks take equal turns at using the CPU, so you get something like "AAABBBCCCAAABBBCCC". For "variable time slice", tasks at different priorities get different amounts of CPU time. In this case, if A is a high priority task, B is a medium priority task and C is a low priority task, then you get something like "AAAABBCAAAABBC" (where A gets twice as much CPU time as B, and A gets 4 times as much CPU time as C). The problem with this is that if a new high priority task D wants to start, it may need to wait for A to finish before it gets any CPU time, which can take a relatively long time.

Shikhin wrote:
Moreover, can you give a little insight into variable frequency?
Now, for variable frequency all tasks get the same amount of CPU time, but higher priority tasks get it more often. For the same tasks (A is a high priority task, B is a medium priority task and C is a low priority task) you end up with "ABACABAABACABA". Notice that A still gets twice as much CPU time as B, and A still gets 4 times as much CPU time as C. Also note that if a new high priority task D wants to start, it doesn't need to wait very long at all before it can have a turn. The problem here is that the OS is doing more task switches. Basically it's good for things that need to respond quickly (like user interface threads) because they don't need to wait as long to get CPU time, but bad for things that don't need to respond quickly (like idle/background threads) because of the extra task switching overhead.
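Both interleavings above can be reproduced with a small simulation. This is only a sketch: the weights (A=4, B=2, C=1) come from the example, and the "ruler sequence" frame is just one way to get the "ABACABA" pattern, not necessarily how Brendan's scheduler picks slots:

```python
def variable_time_slice(tasks, rounds):
    """Each task runs once per round; its weight sets the slice length."""
    out = []
    for _ in range(rounds):
        for name, weight in tasks:
            out.append(name * weight)
    return "".join(out)

def variable_frequency(tasks, total_slots):
    """Every task gets equal-length slices; priority sets how OFTEN.
    Task i runs in the frame slots whose lowest set bit is i, which
    yields the 'ABACABA' interleaving for three priority levels."""
    frame = 2 ** len(tasks) - 1                # 7 slots for 3 levels
    out = []
    for n in range(total_slots):
        pos = n % frame + 1                    # 1-based position in frame
        level = (pos & -pos).bit_length() - 1  # index of lowest set bit
        out.append(tasks[level])
    return "".join(out)

# Matches the sequences in the text: same totals (A=8, B=4, C=2 slices),
# but variable frequency spreads A's time across many short turns.
assert variable_time_slice([("A", 4), ("B", 2), ("C", 1)], 2) == "AAAABBCAAAABBC"
assert variable_frequency("ABC", 14) == "ABACABAABACABA"
```

Counting characters in either string shows A gets twice B's CPU time and four times C's; the difference is purely in how often each task gets a turn, which is exactly the latency-versus-task-switch-overhead trade-off described above.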
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: The best scheduler
Hi Brendan,
I have read your post, and it gave me some quite useful insights into how I should design my scheduler. I am very thankful for it; I have made some notes, and have probably improved my design...
Waiting for more responses,
Shikhin