Scheduling Optimisation
When scheduling threads of different priorities... is it more efficient to change the timer's clock speed so that it interrupts sooner for more important threads and then change it back... or would it be more efficient to set the smallest interrupt interval and store a quantum count that waits until a thread has had its share?
- Happy Egghead
- Member
- Posts: 32
- Joined: Thu Oct 19, 2006 5:25 am
- Location: Adelaide, Australia
- Contact:
Hi,
I don't think changing the clock speed of the timer on the fly would be a good idea. Take for example a floppy driver. It needs to know how much time to wait between one thing and the next. To do this it would ask for the clock speed and say - "ahh - dividing the BOGOmips by the timeslice, I need to set my wait timer to X." If the timer begins altering itself dynamically all hell could break loose! Ditto screen updates etc.
I think most people go with the priority idea, either a "quantum counter" or different priority lists (each with its own counter/s.)
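For illustration, the "quantum counter" version can be as simple as a per-task countdown in the timer interrupt handler. A minimal sketch in C (current_task, schedule() and the task fields are invented for illustration, not anyone's actual code):

struct task {
    int priority;        /* higher priority => more ticks before switching */
    int ticks_left;      /* the quantum counter, decremented on each timer IRQ */
};

extern struct task *current_task;
extern void schedule(void);          /* picks the next ready task */

void timer_irq_handler(void)
{
    if (--current_task->ticks_left <= 0) {
        /* quantum used up: refill it and hand the CPU to someone else */
        current_task->ticks_left = current_task->priority;
        schedule();
    }
    /* ...send EOI to the interrupt controller here... */
}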
Quoting myself from another thread (The ego... The ego... The eggo... )
My code has a task list consisting of tasks ready to run (process_ready_list), and processes can be 'readied' multiple times on the list by a yieldto(process) function.
So let's say your database app needs to load data: it can set up the transaction and then yieldto(file_manager). This ends the time-slice for the database and gives the file_manager a higher priority (it's on the ready task list twice).
If another app also wants file access it can yieldto(file_manager), putting it on the list a third time and giving the file_manager even more CPU time, not necessarily running consecutively.
As a result everything is still "round-robin" but with variable timeslices (nice processes yielding unused CPU time to others) and prioritization (services are prioritized based on application demands). If apps want more graphics they get more graphics, if they want more network they get more network and so forth. The other services are there, they just aren't as important in the eyes of the apps. Of course the OS can run a serialize_run_list() to remove multiple instances of services when things get out of hand. A service that realizes it has nothing to do can serialize itself thus giving more time to apps.
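A rough sketch of that ready-list-with-duplicates idea (the fixed-size array and the names here are invented for illustration; the real code may look quite different):

#define MAX_READY 64

static int ready_list[MAX_READY];   /* task IDs; a task may appear more than once */
static int ready_count = 0;
static int next_index  = 0;

void ready_task(int task_id)        /* add one "share" of CPU time for task_id */
{
    if (ready_count < MAX_READY)
        ready_list[ready_count++] = task_id;
}

void yieldto(int task_id)           /* donate the rest of this time slice */
{
    ready_task(task_id);            /* the target gains an extra entry in the list */
    /* ...then end the caller's time slice and call the scheduler... */
}

int pick_next_task(void)            /* plain round-robin over the (weighted) list */
{
    if (ready_count == 0)
        return -1;                  /* nothing ready - run the idle task */
    int id = ready_list[next_index];
    next_index = (next_index + 1) % ready_count;
    return id;
}

void serialize_run_list(void)       /* strip duplicate entries when things get out of hand */
{
    int seen[MAX_READY];
    int n = 0;
    for (int i = 0; i < ready_count; i++) {
        int dup = 0;
        for (int j = 0; j < n; j++)
            if (seen[j] == ready_list[i]) { dup = 1; break; }
        if (!dup)
            seen[n++] = ready_list[i];
    }
    for (int i = 0; i < n; i++)
        ready_list[i] = seen[i];
    ready_count = n;
    next_index  = 0;
}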
Threads are just Apps that share a common memory space. So with this setup an App can boost one of its threads temporarily, or a thread can boost another thread's priority, possibly in a different App's address space.
There are whole other ways of doing scheduling but this works for me at the moment.
However, I've learnt to reserve judgement on my code until it has appeared to be 'working' for a few months.
Hi,
I also use "one-shot" mode for the PIT or local APIC timer/s, and set the maximum length of the time slice during the task switch. This costs a little overhead during the task switch (especially for the PIT), but means there's no unecessary IRQs while the task is in the middle of it's time slice and makes the scheduling much more accurate.
For example, if you set the PIT to 1 MHz and gave a thread 50 ms of CPU time, then that thread would be interrupted 49 times for no real reason (which causes 49 pipeline flushes, 49 CPL3 -> CPL0 -> CPL3 switches, etc). Also, the time slices would be accurate to within 1 ms. For e.g. if you give a task 1 ms but the task switch occurs just before the timer IRQ occurs, then it might not get any CPU time. A slower timer frequency would decrease your overhead but also decrease your accuracy, while a faster timer frequency would increase your overhead but also increase your accuracy.
With dynamically setting the time delay (i.e. using "one-shot" mode), if you want to give a task 700 ns then you can set the timer IRQ to occur 700 ns from now, and if you want to give a task 500 ms you can set the IRQ to occur 500 ms from now. There is no compromise between overhead and accuracy, and the scheduler can be as accurate as the timer itself (i.e. within 838 ns for the PIT, within 40 ns for a local APIC on a 25 MHz front-side bus, and within 1 ns for a local APIC on a 1 GHz front-side bus).
Now if you understand all this, take a look at this article about Windows Vista internals and have a good laugh. They had a situation where a task given 1 timer interval (about 10 ms) might get much less (between 0 ms and 10 ms), and "improved" it so that a task given 1 timer interval might get much more instead (between 10 ms and 20 ms). On hardware capable of running Vista, my last kernel's scheduler was roughly 1000000 times more accurate...
BTW - nice to see another South Australian in the forums..
Cheers,
Brendan
Happy Egghead wrote: I don't think changing the clock speed of the timer on the fly would be a good idea. Take for example a floppy driver. It needs to know how much time to wait between one thing and the next. To do this it would ask for the clock speed and say - "ahh - dividing the BOGOmips by the timeslice, I need to set my wait timer to X." If the timer begins altering itself dynamically all hell could break loose! Ditto screen updates etc.
I normally use the RTC (IRQ8) for measuring real time, and then use the PIT or the local APIC timer/s for scheduling.
I also use "one-shot" mode for the PIT or local APIC timer/s, and set the maximum length of the time slice during the task switch. This costs a little overhead during the task switch (especially for the PIT), but means there's no unecessary IRQs while the task is in the middle of it's time slice and makes the scheduling much more accurate.
For example, if you set the PIT to 1 MHz and gave a thread 50 ms of CPU time, then that thread would be interrupted 49 times for no real reason (which causes 49 pipeline flushes, 49 CPL3 -> CPL0 -> CPL3 switches, etc). Also, the time slices would be accurate to within 1 ms. For e.g. if you give a task 1 ms but the task switch occurs just before the timer IRQ occurs, then it might not get any CPU time. A slower timer frequency would decrease your overhead but also decrease your accuracy, while a faster timer frequency would increase your overhead but also increase your accuracy.
With dynamically setting the time delay (i.e. using "one-shot" mode), if you want to give a task 700 ns then you can set the timer IRQ to occur 700 ns from now, and if you want to give a task 500 ms you can set the IRQ to occur 500 ms from now. There is no compromise between overhead and accuracy, and the scheduler can be as accurate as the timer itself (i.e. within 838 ns for the PIT, within 40 ns for a local APIC on a 25 MHz front-side bus, and within 1 ns for a local APIC on a 1 GHz front-side bus).
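For reference, re-arming the PIT in one-shot mode at each task switch might look something like this minimal sketch (mode 0, lobyte/hibyte access; the outb() helper and the clamping are assumptions for illustration):

#include <stdint.h>

static inline void outb(uint16_t port, uint8_t val)
{
    __asm__ __volatile__("outb %0, %1" : : "a"(val), "Nd"(port));
}

/* Arm PIT channel 0 in one-shot mode (mode 0, "interrupt on terminal count").
 * One PIT tick is roughly 838 ns (1193182 Hz input clock). */
void pit_set_oneshot_ns(uint64_t ns)
{
    uint64_t ticks = (ns * 1193182ULL) / 1000000000ULL;   /* ns -> PIT ticks */
    if (ticks < 1)      ticks = 1;
    if (ticks > 0xFFFF) ticks = 0xFFFF;                   /* ~55 ms maximum */

    outb(0x43, 0x30);                           /* channel 0, lobyte/hibyte, mode 0 */
    outb(0x40, (uint8_t)(ticks & 0xFF));        /* low byte of the count */
    outb(0x40, (uint8_t)((ticks >> 8) & 0xFF)); /* high byte of the count */
}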
Now if you understand all this, take a look at this article about Windows Vista internals and have a good laugh. They had a situation where a task given 1 timer interval (about 10 ms) might get much less (between 0 ms and 10 ms), and "improved" it so that a task given 1 timer interval might get much more instead (between 10 ms and 20 ms). On hardware capable of running Vista, my last kernel's scheduler was roughly 1000000 times more accurate...
BTW - nice to see another South Australian in the forums.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Brendan wrote: I normally use the RTC (IRQ8) for measuring real time, and then use the PIT or the local APIC timer/s for scheduling.
How do you configure the RTC timer then? Do you only use the 'time update' interrupt to update the time, or do you also use the 'periodic interrupt', which would do the CPL3 -> CPL0 -> CPL3 transition more often than 49 times per second?
Author of COBOS
Brendan wrote: With dynamically setting the time delay (i.e. using "one-shot" mode), if you want to give a task 700 ns then you can set the timer IRQ to occur 700 ns from now, and if you want to give a task 500 ms you can set the IRQ to occur 500 ms from now. There is no compromise between overhead and accuracy, and the scheduler can be as accurate as the timer itself (i.e. within 838 ns for the PIT, within 40 ns for a local APIC on a 25 MHz front-side bus, and within 1 ns for a local APIC on a 1 GHz front-side bus).
Wouldn't this increase overhead, because one must change the PIT settings (I/O overhead) on every call of the scheduler? Also, this would cause the scheduler code to run while no timer is counting, meaning the program would get the entire length of the timer to itself and measuring scheduler overhead would be harder?
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
some maths:
reconfiguring channel 0 of the PIT (3 outs) and sending EOI (1 out)
or
sending EOI each interrupt (1 out), for N interrupts
means that the IO overhead equals out when the timeslice equals 4x the minimal slice, with recalibrating taking less time when timeslices get larger. If we include privilege switches in the computation, which favors the first method when tasks switch less often than each interrupt, the equilibrium lies somewhere between 1 and 4 timer interrupts per timeslice. (which is probably processor/mobo dependent)
as for scheduler overhead, you can compute it relatively easily by running a thread that executes a tight checking loop which detects discontinuities in time progress (which could theoretically use any time counting device that does not get modified outside the loop - tried reading out PIT channel 2?).
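For example, a sketch of such a checking loop using the TSC (the threshold and the names are arbitrary, and any counter that isn't modified outside the loop would do):

#include <stdint.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Run this in a low-priority thread: a large jump between two consecutive
 * readings means the thread was preempted (scheduler plus other tasks ran). */
void watch_for_gaps(uint64_t threshold_cycles, volatile uint64_t *stolen_cycles)
{
    uint64_t prev = rdtsc();
    *stolen_cycles = 0;

    for (;;) {
        uint64_t now = rdtsc();
        if (now - prev > threshold_cycles)
            *stolen_cycles += now - prev;
        prev = now;
    }
}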
As for retrieving the time, both timers can be read out to give an offset into the current timeslice, which can then be added to the time at the start of the timeslice.
But then again, my own scheduler is like a drunk monkey compared to this load of theory
@Brendan:
Wow. That's pretty good. I would not have thought of doing it like that, but your method looks great. The fact that it saves so many privilege-level switches must be a definite advantage, which (I presume) might go quite a way towards recovering the time spent programming the PIT. Nice shot at Vista too.
- Happy Egghead
- Member
- Posts: 32
- Joined: Thu Oct 19, 2006 5:25 am
- Location: Adelaide, Australia
- Contact:
Hi,
At the moment the PIT just runs. The timeslice is whatever the frequency gives (currently 120 Hz). This works out to about 8.3 ms per switch regardless of processing power. So a task given two timeslices has twice the priority of a task given one, and so on, until the runlist has been fully or partially serialised.
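For completeness, programming the PIT for a fixed ~120 Hz rate would look roughly like this sketch (mode 2, divisor rounded; the outb() helper is an assumption for illustration):

#include <stdint.h>

static inline void outb(uint16_t port, uint8_t val)
{
    __asm__ __volatile__("outb %0, %1" : : "a"(val), "Nd"(port));
}

void pit_set_periodic_120hz(void)
{
    uint16_t divisor = 1193182 / 120;           /* ~9943, giving ~8.3 ms per tick */
    outb(0x43, 0x34);                           /* channel 0, lobyte/hibyte, mode 2 (rate generator) */
    outb(0x40, (uint8_t)(divisor & 0xFF));
    outb(0x40, (uint8_t)(divisor >> 8));
}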
None of the tasks I have implemented are scheduled till they are needed so there's only the init task and the builtin drivers (keyb, fdisk) that actually get any time at all. They're never readied unless there's something for them to do and they 'unready' themselves when done.
I can see where using a 'one shot' would reduce the number of interrupts that occur and give better timing control - that would be great for a realtime system. I still have nightmares of OS/2 and its horrible serialised event queue via user-space messaging dragging everything to a halt. It's probably the only part of OS/2 that made it into Windows (all versions).
I've occasionally solved a Windows fault (using 9x more than XP) by opening the DVD drive and closing it again. The interrupt from the drive sets off the event queues somehow and everything starts working again!
The ISA standard used to have a watchdog timer to spit out interrupts when an application overran its timeslice, much like the 'one shot'. This left the timer interrupt free to be used for whatever. It's the one downside of Intel-based PCs - lack of timers, lack of interrupts.
And while I'm on the subject, all video cards have vertical and horizontal retrace interrupts. It should be quite possible to use that timing for scheduling (DirectX supports vertical retrace interrupts, for example). The problem is getting at the interrupt (much easier via PCI than ISA). Having to poll the I/O ports defeats the purpose entirely (CGA snowstorm effect).
This of course has nothing to do with optimising scheduling so I'll stop typing.
I'm just happy to get silly "Task 1, Task 2, Task 1" messages at the moment.
@Brendan:
Pity I can only code when the weather's less than 40 degrees. Had 4 blackouts in the last week, only a few seconds each, but it definitely helps to save, save, save... Must invest in a UPS one of these days to iron out the 'bumps'.
Oh for a nuclear power station!
Hi,
Brendan wrote: With dynamically setting the time delay (i.e. using "one-shot" mode), if you want to give a task 700 ns then you can set the timer IRQ to occur 700 ns from now, and if you want to give a task 500 ms you can set the IRQ to occur 500 ms from now. There is no compromise between overhead and accuracy, and the scheduler can be as accurate as the timer itself (i.e. within 838 ns for the PIT, within 40 ns for a local APIC on a 25 MHz front-side bus, and within 1 ns for a local APIC on a 1 GHz front-side bus).
Tyler wrote: Wouldn't this increase overhead, because one must change the PIT settings (I/O overhead) on every call of the scheduler?
Yes. For the PIT, if you're sneaky (i.e. if you use the "low byte only" access mode instead of the "low byte then high byte" access mode when setting the new PIT count) it can be done with a single OUT instruction. For my last kernel I didn't do this (it didn't give a wide enough range), so I used 2 OUTs.
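To illustrate the "sneaky" single-OUT variant (a sketch only, not the actual kernel code): the access mode is set to "low byte only" once during boot, after which each re-arm is one OUT, at the cost of limiting the count to 255 PIT ticks (roughly 214 us) - which is the narrow range mentioned above. The outb() helper is an assumption.

#include <stdint.h>

static inline void outb(uint16_t port, uint8_t val)
{
    __asm__ __volatile__("outb %0, %1" : : "a"(val), "Nd"(port));
}

void pit_oneshot_init(void)            /* done once, e.g. during boot */
{
    outb(0x43, 0x10);                  /* channel 0, "low byte only" access, mode 0 */
}

void pit_oneshot_rearm(uint8_t ticks)  /* one OUT per task switch; 1 tick ~ 838 ns */
{
    outb(0x40, ticks);                 /* max 255 ticks, i.e. roughly 214 us */
}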
If local APIC/s are present then it's much faster, because the local APIC is built into the CPU itself (it's probably faster to access the local APIC timer's count than it is to access RAM).
Tyler wrote: Also, this would cause the scheduler code to run while no timer is counting, meaning the program would get the entire length of the timer to itself and measuring scheduler overhead would be harder?
For measuring scheduler overhead, use RDTSC or the performance monitoring counters (much more accurate anyway).
For old CPUs (80486 or older) it's possible to use a PIT timer count (e.g. set up PIT channel 2 as a continuous count-down timer, read the timer count, do what you need to, then read the timer count again and calculate the difference). As long as the time you're measuring is less than 55 ms (or you're using PIT channel 0 and use the IRQ to keep track of overflow) this works, and is accurate to within roughly 838 ns.
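A sketch of that PIT channel 2 approach (it assumes the channel 2 gate bit in port 0x61 may be set freely and that the measured interval stays well under 55 ms; the outb()/inb() helpers are assumptions):

#include <stdint.h>

static inline void outb(uint16_t port, uint8_t val)
{
    __asm__ __volatile__("outb %0, %1" : : "a"(val), "Nd"(port));
}
static inline uint8_t inb(uint16_t port)
{
    uint8_t val;
    __asm__ __volatile__("inb %1, %0" : "=a"(val) : "Nd"(port));
    return val;
}

void pit2_start(void)                  /* free-running down-counter on channel 2 */
{
    outb(0x61, (uint8_t)((inb(0x61) & ~0x02) | 0x01));  /* gate on, speaker off */
    outb(0x43, 0xB4);                  /* channel 2, lobyte/hibyte, mode 2 */
    outb(0x42, 0xFF);                  /* reload value 0xFFFF */
    outb(0x42, 0xFF);
}

uint16_t pit2_read(void)
{
    outb(0x43, 0x80);                  /* latch channel 2's current count */
    uint8_t lo = inb(0x42);
    uint8_t hi = inb(0x42);
    return (uint16_t)((hi << 8) | lo);
}

/* elapsed ticks = (start_count - end_count) & 0xFFFF; multiply by ~838 ns per tick */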
For a fixed frequency timer running at 1 kHz (relatively fast), it's likely that it'll be mostly useless for measuring scheduler overhead (unless your scheduler is so broken that it takes more than 1 ms to do its work).
Combuster wrote: means that the IO overhead equals out when the timeslice equals 4x the minimal slice, with recalibrating taking less time when timeslices get larger. If we include privilege switches in the computation, which favors the first method when tasks switch less often than each interrupt, the equilibrium lies somewhere between 1 and 4 timer interrupts per timeslice. (which is probably processor/mobo dependent)
I don't think the differences can be calculated so easily. For example, if most tasks don't run for their full time slice you could have 1000 task switches in 10 ms, and if there's only one task that's ready to run (e.g. the system is idle, or doing a long calculation and nothing else) you might have no task switches for 10 minutes.
Then there's keeping track of real time. For a fixed frequency scheduler IRQ you could use the same IRQ for keeping track of real time, while for dynamic scheduler IRQs you can't. This means for fixed frequency you could have N IRQs per CPU per second, while for a dynamic scheduler IRQ you could have N IRQs per second for measuring real time (as only one CPU is needed to keep track real time) plus an average of M IRQs per CPU per second for scheduling. This is "N * CPUs" compared to "N + M * CPUs".
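To put some made-up numbers on that comparison: with N = 1000 IRQs per second for real time, M = 200 task switches per CPU per second and 4 CPUs, the fixed frequency approach costs 1000 * 4 = 4000 IRQs per second, while the dynamic approach costs 1000 + 200 * 4 = 1800 IRQs per second (M depends entirely on the workload, of course).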
Then there's other performance problems. If a task wants a 1 ms delay, can the scheduler run other tasks during that time or not? If the scheduler's timing isn't accurate enough you'd need to use a busy loop.
What if a task wants an extremely accurate 1 second delay? In this case you can run other tasks until just before the delay expires, and then do a busy loop (using RDTSC) until the delay expires. This should work regardless of how your scheduler is implemented and could give delays that are accurate to within several CPU cycles - the difference is how much time is spent running other tasks and how much time is spent in the busy loop.
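A rough sketch of that "sleep for most of it, spin out the rest" pattern (sleep_ns(), tsc_hz and the 2 ms margin are assumptions for illustration, not a real API):

#include <stdint.h>

extern void     sleep_ns(uint64_t ns);   /* assumed: scheduler runs other tasks meanwhile */
extern uint64_t tsc_hz;                  /* assumed: calibrated TSC frequency */

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

void precise_delay_ns(uint64_t ns)
{
    uint64_t deadline = rdtsc() + (ns * tsc_hz) / 1000000000ULL;

    if (ns > 2000000)                    /* sleep for all but the last ~2 ms */
        sleep_ns(ns - 2000000);

    while (rdtsc() < deadline)           /* spin out the remainder */
        __asm__ __volatile__("pause");
}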
Lastly, in multi-CPU systems it's possible to have a situation where different CPUs are trying to choose which task to run next at the same time. In this case, if there's any re-entrancy locks in the scheduler's code (likely) you end up with CPU/s waiting for another CPU to release the lock (lock contention). Now imagine what happens if (for example) you're using the PIT as a fixed frequency timer and broadcasting the same timer IRQ to all CPUs. In this case all CPUs get the scheduler IRQ at the same time and you're guaranteed to get very heavy lock contention in the scheduler (i.e. "worst case"). A better idea (still using fixed frequency timers) would be to use the local APIC timers and to stagger the timers so that (for example) if there's 2 CPUs and the timers are running at 1 kHz then every 500 us a different CPU gets an IRQ. Of course this is a difficult thing to set up (especially for an arbitrary number of CPUs) but you could end up with very little lock contention (i.e. "best case").
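One hypothetical way to set up that staggering: offset each CPU's first timer IRQ by its share of the period, then switch to periodic mode in the handler (the lapic_* helpers are assumptions, not a real API):

#include <stdint.h>

extern void lapic_timer_oneshot(uint32_t ticks);    /* assumed: fire one IRQ after 'ticks' */
extern void lapic_timer_periodic(uint32_t ticks);   /* assumed: then fire an IRQ every 'ticks' */

/* Offset each CPU's first timer IRQ by (cpu_index / cpu_count) of the period;
 * the first IRQ's handler then switches that CPU's timer to periodic mode. */
void start_staggered_timer(uint32_t period_ticks, uint32_t cpu_index, uint32_t cpu_count)
{
    uint32_t offset = (uint32_t)(((uint64_t)period_ticks * cpu_index) / cpu_count);

    if (offset == 0)
        lapic_timer_periodic(period_ticks);
    else
        lapic_timer_oneshot(offset);
}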
For dynamic scheduler IRQs there is no direct relationship between IRQs on different CPUs, so any lock contention caused by IRQs happening at the same time is "self balancing". Basically, the new time delay is determined from when the task switch occurs, and any lock contention would delay when the task switch occurs. For example, imagine you're unlucky and 2 CPUs get a scheduler IRQ at the same time. One gets the re-entrancy lock, chooses a task to run, does a task switch and sets its timer for 1 ms later. The second CPU waits for the first then gets the lock, chooses a task to run, does a task switch and sets its timer for 1 ms later. After 1 ms the first CPU gets its timer IRQ again, but the second CPU's 1 ms isn't up yet. In general (with variable time slices) you end up with "almost best case" - lock contention isn't quite as good as the "best case" above, but it's never as bad as the "worst case" above and there aren't any complicated setup requirements for staggering the timers.
Happy Egghead wrote: @Brendan:
Pity I can only code when the weather's less than 40 degrees. Had 4 blackouts in the last week, only a few seconds each, but it definitely helps to save, save, save... Must invest in a UPS one of these days to iron out the 'bumps'.
Oh for a nuclear power station!
Just wait until The Snowy runs dry and the hydro-electric power stations drop off the national grid - ETSA will sell our power at inflated prices to the eastern states to make more cash, and shaft us as quickly as they can (three cheers for privatisation). Don't tell anyone, but I've been collecting old smoke detectors - I figure it won't take long before I'll be able to refine the small amounts of low-grade uranium and build my own personal reactor under my computer desk...
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.