Hi,
FlashBurn wrote:
Brendan wrote:
For each purpose (in your list, not mine) find out which ones need very similar characteristics. For my list, the desired characteristics for "determining when a task has used enough CPU time" and "determining when a sleeping task should wake up" are mostly the same, so combining these and using the same timer/s for both purposes might make sense.
This is the system I have now (but only on single-CPU systems), but there is a problem with accuracy. The running thread could yield to the scheduler, but if I increment my kernel timer IRQ counter every time the scheduler gets called, it will become very inaccurate. If there are 2 threads which do a lot of IPC and wait for each other, then it will become inaccurate very fast; so I disable interrupts, set a flag and call the scheduler code. Every time this flag is set, the counter won't be incremented. So I hope that it won't become very inaccurate.
For schedulers, I usually start with a "switch_directly_to_task(task_ID task)" function. This is helpful - if a low priority task is running when a high priority task becomes ready to run, then you can call the "switch_directly_to_task()" function without needing any logic to decide which task to switch to. When a task blocks (e.g. it calls "wait_for_message()" or something) then you need to find a new task to run. For that I write a "switch_to_any_task(void)" function (which decides which task should be run next, and then calls the "switch_directly_to_task()" function from earlier). After writing these functions I've got enough for a fairly decent co-operative scheduler.
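A rough sketch of what I mean, in C (the task structure, the "get_task_info()", "find_highest_priority_ready_task()" and "switch_context()" helpers, and the queues behind them are all placeholders - every kernel has its own versions of these):

    /* Rough sketch of a co-operative scheduler core. */
    typedef int task_ID;

    void switch_directly_to_task(task_ID task) {
        task_info *next = get_task_info(task);      /* placeholder lookup */
        task_info *prev = current_task;
        if (next == prev) {
            return;                                 /* already running the right task */
        }
        current_task = next;
        switch_context(prev, next);                 /* arch-specific register/stack swap */
    }

    void switch_to_any_task(void) {
        /* all of the "which task should run next?" policy lives here */
        task_ID next = find_highest_priority_ready_task();
        switch_directly_to_task(next);
    }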
Notice how nothing I said above has anything to do with any timer?
Of course usually I want something more than just co-operative scheduling, and I do want to limit the amount of time a task can use the CPU for (sometimes). Fortunately this is simple - the timer IRQ just calls the existing "switch_to_any_task(void)".
It's not quite that simple though - there are actually 2 ways to do it. With a "one-shot" timer, during the task switch you tell the timer how long the next task is meant to run for; and if the timer generates an IRQ before the next task switch, you know the task used all the time it was given. In this case the timer IRQ only needs to send an EOI and then call the "switch_to_any_task(void)" function. If there aren't any tasks that are ready to run, or if only one task is "ready to be run" (e.g. all the other tasks are blocked/waiting for something), you can disable the timer completely (and run the idle loop, or run the same task, until another task becomes ready to run). This also gives you a lot more control over how long it'll be until the IRQ occurs. For example, if you know another task is meant to wake up and preempt the current task in 70 us time, then you can set the PIT count to 83 and the time before the IRQ occurs will be between 68.716 us (82 * 838 ns) and 69.554 us (83 * 838 ns); which means a whole lot less "while(not_yet)" CPU time wasted.
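For the PIT, that reprogramming is only a few port writes. Here's a sketch for x86 (mode 0 is "interrupt on terminal count"; "outb()" is the usual port I/O helper, included so the sketch is self-contained):

    /* Set the PIT (channel 0) to one-shot mode during a task switch.
       One PIT tick is about 838.095 ns (1193182 Hz input clock). */
    #include <stdint.h>

    static inline void outb(uint16_t port, uint8_t val) {
        __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
    }

    void pit_set_one_shot(uint16_t count) {
        outb(0x43, 0x30);                   /* channel 0, lo/hi byte, mode 0, binary */
        outb(0x40, count & 0xFF);           /* count low byte */
        outb(0x40, (count >> 8) & 0xFF);    /* count high byte */
    }

    /* For the "70 us" example above: 70000 ns / 838.095 ns = 83.5,
       so "pit_set_one_shot(83)" gives an IRQ in at most 69.554 us. */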
The other way is to use a periodic timer. In this case, during the task switch you set a "time to live" variable; the timer IRQ decreases this variable, and if the variable reaches zero you send an EOI and then call the "switch_to_any_task(void)" function. This method is more common (especially for ancient "single-CPU only" systems), but it means you end up with a lot of wasted IRQs. For example, if the task is meant to run for a maximum of 100 ms, then instead of setting the timer for 100 ms and having one IRQ, you might have an IRQ every 10 ms (and have 9 IRQs that are just a waste of time before the tenth IRQ causes the task switch). Then there's the "while(not_yet)" problem again - if you know another task is meant to wake up and preempt the current task in 70 us time, the timer's granularity is far too coarse for that, and you have to waste (almost) 70 us of CPU time doing the "while(not_yet)" thing.
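As a sketch, the periodic version's IRQ handler is just a countdown (with "send_EOI()" assumed from the interrupt handling code, and "switch_to_any_task()" from above):

    /* Periodic-timer variant: the task switch code sets "time_to_live"
       to the number of ticks the task may run for; the IRQ counts down. */
    volatile unsigned int time_to_live;

    void timer_irq_handler(void) {
        if (time_to_live != 0) {
            time_to_live--;
        }
        send_EOI();
        if (time_to_live == 0) {
            switch_to_any_task();
        }
    }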
Ok, now notice how nothing I said anywhere above has anything to do with keeping track of the current "wall clock" time?
Keeping track of real time (or "wall clock" time) has nothing to do with scheduling. For convenience, some people try to use the same timer to keep track of real time and for scheduling. This idea sucks badly for multi-CPU (it's much better to use a different timer to keep track of real time). For single CPU it sucks slightly less, but it's still a bad idea.
The reason it sucks on multi-CPU is that you've typically got a different timer for each CPU (and no idea which one to use for keeping track of real time); and you don't want the "per CPU" timers to be synchronised because that increases overhead (it costs something to keep them synchronised) and increases the chance that multiple CPUs will be fighting for the same reentrancy locks, same cache lines, etc. Mostly you want to treat it as a completely different scheduler for each CPU (with its own "per CPU" scheduling queues, and its own "per CPU" timer).
FlashBurn wrote:On SMP systems I use the APIC (because you need an APIC for SMP) for the scheduler and the PIT (because there will always be a PIT <- this could be wrong!) for the sleeping queue. This has the advantage that I can disable the IRQ for the PIT if there is no thread in the sleeping queue, and I could also put it in one-shot mode. So this means less work for the scheduler.
So I thought it would be a good idea to use the RTC for the sleeping queue (so I could also offload the sleeping from the scheduler on single-CPU systems). If the PC has an APIC I will use the APIC for the scheduler and the PIT for the sleeping queue; if the PC has an HPET (I will assume, but only here, that it also has an APIC) then I will use the APIC for the scheduler and the HPET for the sleeping queue.
I'd have a "coarse" set of sleeping queues to get rid of the majority of the time delay; and then (when a task is about to wake up) shift the task to the scheduler's "fine" sleeping queue. Not sure if that's a good enough explanation though.
Imagine a task wants to sleep for 1234 ms. The current time happens to be 9870000 "ticks", where each tick is 1 ms (or something like that). You do "thread_data->wake_time = current_time + delay = 9870000 + 1234 = 9871234". You've got 512 "coarse sleeping queues" (for e.g.) and an IRQ that occurs every 100 ms, so you do "queue_number = int(thread_data->wake_time / 100) % 512 = int(9871234 / 100) % 512 = int(98712.34) % 512 = 98712 % 512 = 408" and put the task onto "coarse sleeping queue" number 408.
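In C that insertion might look like this (the "thread_data" layout and the "list_append()"/"block_current_thread()" helpers are placeholders; locking is omitted):

    /* Put a thread onto the right coarse sleeping queue. The numbers
       in the comments match the example above. */
    #include <stdint.h>

    #define COARSE_QUEUES    512
    #define COARSE_PERIOD_MS 100

    typedef struct thread_data {
        uint64_t wake_time;
        struct thread_data *next;
    } thread_data;

    uint64_t current_time;                       /* in 1 ms ticks */
    thread_data *coarse_queue[COARSE_QUEUES];    /* list heads */

    void sleep_ms(thread_data *thread, uint64_t delay) {
        thread->wake_time = current_time + delay;   /* 9870000 + 1234 = 9871234 */
        unsigned int q = (thread->wake_time / COARSE_PERIOD_MS) % COARSE_QUEUES;
        /* (9871234 / 100) % 512 = 98712 % 512 = 408 */
        list_append(&coarse_queue[q], thread);      /* placeholder */
        block_current_thread();                     /* placeholder */
    }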
For each IRQ (every 100 ms) you add 100 to the current time, and then process the next "coarse sleeping queue". When you process a "coarse sleeping queue" you check if each thread's "wake time" is less than "current_time + 100". If the thread needs to sleep longer you leave it on the same "coarse sleeping queue" for next time that queue is processed. If the thread is close enough to being woken up you remove it from the "coarse sleeping queue", decide which CPU it should be run on, and insert it into the scheduler's "fine sleeping queue" for that CPU.
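Continuing the sketch from above, the coarse timer's IRQ handler would do something like this ("remove_from_list()", "choose_cpu_for()" and "fine_queue_insert_sorted()" are placeholders):

    /* Called from the coarse timer's IRQ every 100 ms; uses the types
       and variables from the previous sketch. */
    void process_coarse_queues(void) {
        static unsigned int queue = 0;

        current_time += COARSE_PERIOD_MS;
        thread_data *t = coarse_queue[queue];
        while (t != NULL) {
            thread_data *next = t->next;
            if (t->wake_time < current_time + COARSE_PERIOD_MS) {
                /* close enough to waking - move it to a CPU's fine queue */
                remove_from_list(&coarse_queue[queue], t);
                fine_queue_insert_sorted(choose_cpu_for(t), t);  /* sorted by wake_time */
            }
            t = next;
        }
        queue = (queue + 1) % COARSE_QUEUES;
    }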
The scheduler has a "fine sleeping queue" for each CPU; which is kept in order of which thread needs to wake up first. This means that when the scheduler is deciding how long a thread should run for it can look at the first entry in that CPU's "fine sleeping queue" and use that as the maximum time slice. For example, if the scheduler is about to give a task 123 ms of CPU time but the first task on that CPU's "fine sleeping queue" is meant to wake up in 88 ms then the scheduler could give the next task 88 ms of CPU time instead of 123 ms to make sure that the first task on the "fine sleeping queue" is woken up at exactly the right time. This is where the "one-shot" timer is really nice.
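For example (a sketch, with "fine_queue_peek()" as a placeholder that returns the earliest waker on that CPU's fine sleeping queue, or NULL if the queue is empty):

    /* Cap the next task's time slice so the earliest sleeper on this
       CPU's fine sleeping queue wakes up at exactly the right time. */
    uint64_t choose_time_slice(int cpu, uint64_t wanted_slice) {
        thread_data *first = fine_queue_peek(cpu);       /* placeholder */
        if (first != NULL) {
            uint64_t until_wake = first->wake_time - current_time;
            if (until_wake < wanted_slice) {
                return until_wake;    /* e.g. 88 ms instead of 123 ms */
            }
        }
        return wanted_slice;          /* then program the one-shot timer with this */
    }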
When the first task on the CPU's "fine sleeping queue" is woken up, the scheduler decides if it should get CPU time immediately or if there are other, more important tasks that should be run. In any case, when the task does get CPU time it can do the "high precision busy-wait" thing (if necessary). This only really makes sense if the "high precision busy-wait" timer is more precise than the timer the scheduler is using (in general, this will only happen if you're using the TSC to do the high precision busy-wait).
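The busy-wait itself is tiny. Here's a sketch using the TSC on x86 (this assumes a constant-rate TSC and a "tsc_ticks_per_ns" value you've calibrated elsewhere):

    /* Finish off the last part of a delay by spinning on the TSC. */
    #include <stdint.h>

    static inline uint64_t rdtsc(void) {
        uint32_t lo, hi;
        __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    void busy_wait_ns(uint64_t ns, uint64_t tsc_ticks_per_ns) {
        uint64_t end = rdtsc() + ns * tsc_ticks_per_ns;   /* e.g. 2 ticks/ns at 2 GHz */
        while (rdtsc() < end) {
            /* the "while(not_yet)" thing - only for the last few ns */
        }
    }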
Now, imagine this with the RTC being used for the "coarse sleeping queues" (and keeping track of real time), and the PIT being used (in "one shot" mode) for the scheduler (and for the "fine sleeping queue"). If a task asks to sleep for 12345678 ns then you'd wake it up after between 12345678 and 12346516 ns - the difference is a single 838 ns PIT tick, so the delay would have better than 1 us precision, where 1 us precision is 10000 times better than 10 ms precision. You've got to admit that's not bad for a "worst case" on ancient hardware.
For modern hardware you'd probably be using HPET at 1000 Hz for the "coarse sleeping queues" (and keeping track of real time), and the local APIC timer/s (in "one shot" mode) for the scheduler (and for the "fine sleeping queue"), and then you'd finish off the time delay by using the TSC for a very high precision busy-wait. If the local APIC timer is running at 500 MHz (actually fairly slow for modern hardware); then if a task asks to sleep for 12345678.9 ns you'd wake it up after between 12345676 and 12345678 ns, then busy-wait for the remainder. With a 2 GHz CPU (still fairly slow for modern hardware) the TSC would be incremented every 0.5 ns, so you'd be busy-waiting for between 1 ns and 3 ns. The final delay would have better than 1 ns precision (where 1 ns precision is 10000000 times better than 10 ms precision). I'd be happy with that.
Cheers,
Brendan