high precision sleep function
high precision sleep function
I have a syscall for letting a thread sleep for some time, but the problem here is that the smallest time is 1ms (because I use the pit/apic for it and my handler gets called every 1ms).
So how can I wait for some time which is smaller than 1ms? I can´t use rdtsc for it because of the problems with it (different values on different cpus, different cpu frequency).
There is a function in windows (QueryPerformanceCounter) which can you give a precision at µs level. So how can I implement such a function? Using the timer interrupt is not a good option, because this would give me too much overhead (and I use the pit for letting threads sleep, so I´m limited for the precision).
So how can I wait for some time which is smaller than 1ms? I can´t use rdtsc for it because of the problems with it (different values on different cpus, different cpu frequency).
There is a function in windows (QueryPerformanceCounter) which can you give a precision at µs level. So how can I implement such a function? Using the timer interrupt is not a good option, because this would give me too much overhead (and I use the pit for letting threads sleep, so I´m limited for the precision).
Re: high precision sleep function
Important part bolded. High-frequency timers don't do sub-ms waits but instead give you the current time to a high precision.FlashBurn wrote:There is a function in windows (QueryPerformanceCounter) which can you give a precision at µs level. So how can I implement such a function?.
So, getting a high-precision time is done relatively simply: by asking the timer what its current counter value is and converting it to seconds, remembering that this is in ticks (so you need a fairly accurate idea of the timer frequency to divide by), most likely ticks down from the maximum (so you have to subtract the value from the ticks-per-interrupt to get the number which have passed) and resets to the maximum on every interrupt (so you have to use your kernel-maintained time to get the higher-order part)
Notice, though, that this has overhead depending on what the timer is. The LAPIC is both the highest-precision and lowest-latency, so you'll probably want to start using that first.
Re: high precision sleep function
The problem here is that the apic is not activated (or the cpu has no) on every computer my os supports. So I can´t assume that there is one and use it.Selenic wrote: The LAPIC is both the highest-precision and lowest-latency, so you'll probably want to start using that first.
If I understand you right, I "only" need to take the counter of the timer I´m using and adding my kernTimerCount (which is incremented every time a timer interrupt happens) and give back "(kernTimerCounter * maxValOfTimerCounter) + currValOfTimerCounter".Selenic wrote: So, getting a high-precision time is done relatively simply: by asking the timer what its current counter value is and converting it to seconds, remembering that this is in ticks (so you need a fairly accurate idea of the timer frequency to divide by), most likely ticks down from the maximum (so you have to subtract the value from the ticks-per-interrupt to get the number which have passed) and resets to the maximum on every interrupt (so you have to use your kernel-maintained time to get the higher-order part)
I will look how I can implement this. So thanks!
Re: high precision sleep function
Hi,
Worst case accuracy would be about 2 us (caused by delays accessing slow PIT I/O ports, rather than the PIT counter's 838 ns precision), but people using more modern hardware will never see that. Best case accuracy would be better than 1 ns (for people with a recent CPU).
Note: The TSC frequency is constant (when the CPU is able to run code isn't in a deep sleep state) if CPUID has the new "TSC invariant" feature flag set.
For example, for the PIT, the "count" is decreased once every 838 ns. To wait for 20 us you'd wait for the PIT count to be decreased by 24 "ticks" (20 / 0.838 = 23.86, rounded up to 24). The tricky part is handling counter roll-over correctly. For an example:
Of course this is just an example. The example code would be messed up if a task switch occurred while you're doing the "busy wait" (or if an IRQ handler takes an insanely long time to return); and you'd probably want to run other tasks until the delay is small (to reduce the amount of CPU time wasted).
Of course if you setup the PIT (in "periodic mode") to generate an IRQ every 10 ms then you'd be able to run other tasks until there's less than 10 ms of the delay remaining. However, if you setup the PIT in "one shot mode" then you may be able to run other tasks until there's only a few us of the delay remaining.
For the local APIC timer and HPET the theory is identical (just a different timer count, and a lot more precise).
Cheers,
Brendan
You can detect if TSC is usable and use that if it is. If TSC isn't usable you can detect if the local APIC is present/usable and use that. If TSC and the the local APIC aren't usable, then you can detect if HPET is present/usable and use that. If TSC, the local APIC and HPET aren't usable, then you could assume the PIT is present and use that.FlashBurn wrote:The problem here is that the apic is not activated (or the cpu has no) on every computer my os supports. So I can´t assume that there is one and use it.Selenic wrote: The LAPIC is both the highest-precision and lowest-latency, so you'll probably want to start using that first.
Worst case accuracy would be about 2 us (caused by delays accessing slow PIT I/O ports, rather than the PIT counter's 838 ns precision), but people using more modern hardware will never see that. Best case accuracy would be better than 1 ns (for people with a recent CPU).
Note: The TSC frequency is constant (when the CPU is able to run code isn't in a deep sleep state) if CPUID has the new "TSC invariant" feature flag set.
I'd assume Selenic is suggesting that you read the timer's count (e.g. how many "ticks" until an interrupt would occur), and isn't suggesting you wait for any interrupt.FlashBurn wrote:If I understand you right, I "only" need to take the counter of the timer I´m using and adding my kernTimerCount (which is incremented every time a timer interrupt happens) and give back "(kernTimerCounter * maxValOfTimerCounter) + currValOfTimerCounter".Selenic wrote: So, getting a high-precision time is done relatively simply: by asking the timer what its current counter value is and converting it to seconds, remembering that this is in ticks (so you need a fairly accurate idea of the timer frequency to divide by), most likely ticks down from the maximum (so you have to subtract the value from the ticks-per-interrupt to get the number which have passed) and resets to the maximum on every interrupt (so you have to use your kernel-maintained time to get the higher-order part)
For example, for the PIT, the "count" is decreased once every 838 ns. To wait for 20 us you'd wait for the PIT count to be decreased by 24 "ticks" (20 / 0.838 = 23.86, rounded up to 24). The tricky part is handling counter roll-over correctly. For an example:
Code: Select all
void nano_delay(delay_in_ns) {
delay_count = delay_in_ns / 838 + 1;
start_count = get_PIT_count();
while(start_count >= delay_count) {
// Need to wait for roll-over
next_count = start_count
do {
last_count = next_count;
// Could do HLT here if roll-over causes an IRQ
next_count = get_PIT_count();
} while(next_count < last_count);
delay_count -= start_count;
start_count = roll_over_count;
}
end_count = start_count - delay_ticks;
if(end_count > 0) {
// Wait for end_count or roll-over
next_count = start_count
do {
last_count = next_count;
next_count = get_PIT_count();
} while( (next_count > end_count) && (next_count < last_count) );
}
}
Of course if you setup the PIT (in "periodic mode") to generate an IRQ every 10 ms then you'd be able to run other tasks until there's less than 10 ms of the delay remaining. However, if you setup the PIT in "one shot mode" then you may be able to run other tasks until there's only a few us of the delay remaining.
For the local APIC timer and HPET the theory is identical (just a different timer count, and a lot more precise).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: high precision sleep function
I think I haven´t said clear enough what I mean.Brendan wrote: I'd assume Selenic is suggesting that you read the timer's count (e.g. how many "ticks" until an interrupt would occur), and isn't suggesting you wait for any interrupt.
In every timer interrupt handler there is a line like "timerCount++" (I think so) and this counter is used for threads to sleep. So I have a resolution of 10ms (I changed my frequency to 100Hz not 1000Hz anymore) and I would multiply it by the max value of the counter of the timer I´m using (e.g. when using the pit it should be 0x10000!?) and then I add the current value of the counter of the timer I´m using (e.g. reading the current value of the pit counter) and this value is returned to the user. As in windows I would have another function (I forget the name) which returns how much "ticks" the current timer has in 1s, so that the user can use this information to wait for little times.
I would write a function for my os user library which would do this all (I mean sleeping if the timer is greater than the 10ms and then busy waiting). The only problem I see here is that it maybe would be better to write this function as a system service, because when the user needs to call the kernel every time to check if the time he needs to wait is over, it will get much more imprecise, but I don´t want so much system services. I wanted to keep this as small as possible. Maybe I will test it on a real pc how the resolution would be if I use usercode for this.Brendan wrote: Of course if you setup the PIT (in "periodic mode") to generate an IRQ every 10 ms then you'd be able to run other tasks until there's less than 10 ms of the delay remaining. However, if you setup the PIT in "one shot mode" then you may be able to run other tasks until there's only a few us of the delay remaining.
Re: high precision sleep function
If I understand you right, I "only" need to take the counter of the timer I´m using and adding my kernTimerCount (which is incremented every time a timer interrupt happens) and give back "(kernTimerCounter * maxValOfTimerCounter) + currValOfTimerCounter".[/quote]Selenic wrote:remembering that (the counter) most likely ticks down from the maximum (so you have to subtract the value from the ticks-per-interrupt to get the number which have passed)
I've quoted the little bit of my post which you've missed. So the raw number of ticks is "(time * ticks_max) + (ticks_max - ticks_cur)".
But that's raw ticks, which need normalising to some consistent unit for the user. For example, let's do the Unix-style high-precision function which returns seconds and ns separately. Assuming seconds and ms are maintained separately by the kernel, seconds are trivial and ns is (ms*1000000) + ((tick_max - tick_cur) * ns_per_tick)
Also, as Brendan points out, if you have a timer which can do one-shot (preferably without interrupting your main timer, so either a timer which can do both (I don't know if there are any on PCs) or using a different timer) then you can set it to interrupt you after the appropriate length of time, but whether it's worth switching processes depends on the cost of a context switch.
Re: high precision sleep function
Oops, you are right!Selenic wrote: I've quoted the little bit of my post which you've missed. So the raw number of ticks is "(time * ticks_max) + (ticks_max - ticks_cur)".
For this I have another function so that the user knows how many ticks will have passed in 1s.Selenic wrote: But that's raw ticks, which need normalising to some consistent unit for the user. For example, let's do the Unix-style high-precision function which returns seconds and ns separately. Assuming seconds and ms are maintained separately by the kernel, seconds are trivial and ns is (ms*1000000) + ((tick_max - tick_cur) * ns_per_tick)
As I have some sort of abstraction layer I can´t really use 2 different timers (only when I have >= 2cpus so that I use the apic timer for scheduling and the pit for the sleeping queue). I will use the "ticks" method. Maybe I will later do some thinking how I can support HPET, but this timer is only available for very new systems and so it wont be of big use for me.Selenic wrote: Also, as Brendan points out, if you have a timer which can do one-shot (preferably without interrupting your main timer, so either a timer which can do both (I don't know if there are any on PCs) or using a different timer) then you can set it to interrupt you after the appropriate length of time, but whether it's worth switching processes depends on the cost of a context switch.
Re: high precision sleep function
Hi,
If there's no local APIC and no HPET, then there's still the PIT and the RTC (where the PIT could be used for scheduling and the RTC could be used for real time). Don't forget that the RTC is capable of generating a "periodic" IRQ (e.g. at 512 Hz, 1024 Hz, 2048 Hz, 4096 Hz, etc).
My advice would be to find all available timers (PIT, RTC, local APIC timers, HPET, TSC, ACPI timer and anything else) during boot/initialisation, and create a linked list of "timer data structures"; where each "timer data structure" contains information about the timer like this:
That way the scheduler doesn't need to care what the underlying timer/s actually are (but could/would always select the "best" timers for each role), and you could easily add support for more/newer timers whenever you like.
Cheers,
Brendan
When there's multiple CPUs it's best for the scheduler to use multiple timers (one timer per CPU), but in this case there's always one local APIC timer per CPU anyway.FlashBurn wrote:As I have some sort of abstraction layer I can´t really use 2 different timers (only when I have >= 2cpus so that I use the apic timer for scheduling and the pit for the sleeping queue). I will use the "ticks" method. Maybe I will later do some thinking how I can support HPET, but this timer is only available for very new systems and so it wont be of big use for me.
If there's no local APIC and no HPET, then there's still the PIT and the RTC (where the PIT could be used for scheduling and the RTC could be used for real time). Don't forget that the RTC is capable of generating a "periodic" IRQ (e.g. at 512 Hz, 1024 Hz, 2048 Hz, 4096 Hz, etc).
My advice would be to find all available timers (PIT, RTC, local APIC timers, HPET, TSC, ACPI timer and anything else) during boot/initialisation, and create a linked list of "timer data structures"; where each "timer data structure" contains information about the timer like this:
- precision (number of nanoseconds between "ticks"?)
- accuracy (estimated worst case drift per day?)
- how expensive it is to access (slow I/O ports, fast memory mapped, etc)
- capabilities (if it can generate a periodic and/or one-shot IRQ, if it keeps ticking in sleep states, if there's one per CPU, etc)
- the address of the function to initialise/set the timer
- the address of the function to get the timer's current tick/time
- the address of the function to enable/disable periodic or one-shot IRQ
- whatever else you might need
That way the scheduler doesn't need to care what the underlying timer/s actually are (but could/would always select the "best" timers for each role), and you could easily add support for more/newer timers whenever you like.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: high precision sleep function
As I read something about HPET I always asked me, how does it work with many CPUs? I mean if you aren´t using the APIC timer for the scheduler, but the HPET you would need 1 interrupt per CPU and now imagine you´ve got 256 CPUs (224 are also too much). Now you need 256 comperators for every CPU and we would need a bigger IDT. So at the moment and as long as we have "only" 256 interrupts on the x86 architecture it´s better to use the apic for scheduling, but you can use the HPET for sleeping/waiting.Brendan wrote: When there's multiple CPUs it's best for the scheduler to use multiple timers (one timer per CPU), but in this case there's always one local APIC timer per CPU anyway.
This is somehow tricky with my os But I think I´ve come up with a good solution. I will have a timer for scheduling and an event timer. So I can still use my abstraction layer and my modules, but I can also use the best available timer for my sleeping queue (and more than 1 timer, but not more than 2 ).Brendan wrote: Then, after you've detected which timers are present (and created the "timer data structures") you'd initialise your scheduler, and the scheduler's initialisation code would choose which timer/s to use for task switches, which timer to use to keep track of real time, which timer/s to use for high precision busy-waiting, etc.
I haven´t read much about the RTC. If I could decide between PIT and RTC for my sleep queue, what timer is the better one? I mean I don´t know if the RTC works like the pit so if I init it so that it fires interrupts with 512Hz, does it have a counter (like the PIT) which I could use for high precision busy waiting or are you initing it to 512Hz an the RTC does the rest? And how about one-shot mode? As the RTC uses i/o-ports (I think so) it would be also "slow" to setup one-shot mode, wouldn´t it?
Re: high precision sleep function
Hi,
If HPET had one IRQ for each CPU then you could use the same interrupt vector for each IRQ, but use "fixed delivery" so that each IRQ is only delivered to one CPU. For example, timer 1 might send IRQ 99 to CPU 1, timer 2 might send IRQ 99 to CPU 2, timer 3 might send IRQ 99 to CPU 3, etc.
Also, I should probably warn you that in some systems that use HPET, there is no PIT and no RTC "periodic IRQ". In this case HPET (and firmware SMM code) is used to emulate the ("legacy") PIT and RTC periodic IRQ; and you can't use a few of the HPET timers (the ones that are capable of generating IRQs) while also using the PIT and/or RTC "periodic IRQ".
The "periodic IRQ" can be be set to any "power of 2" frequency between 2 Hz and 8192 Hz (e.g. 2 Hz, 4 Hz, 8 Hz, 16 Hz, ..., 4096 Hz, 8192 Hz); although frequencies above 4096 Hz can cause problems (too fast for the chipset to handle correctly). Unlike the PIT, it's impossible to set an arbitrary frequency (for example, if you wanted 1500 Hz then you'd have to choose 1024 Hz or 2048 Hz). However, the PIT can be a little annoying if you want to measure real time because there's rounding errors caused by the "strange" frequency it runs at (you can't divide 1.193181666666... MHz by any integer to get an integer "number of IRQs per second" so you might end up with 1,000.1522771723945236 interrupts per second instead of 1000, for e.g.).
The RTC doesn't have a sub-second counter you can read; so it's not useful for something like "high precision busy-wait". For something like "high precision busy-wait" you want a high precision counter but don't need an IRQ (which makes HPET counters or the TSC attractive options).
For overhead, for the RTC you must read one CMOS register to determine the cause of the interrupt (and to allow it to generate subsequent interrupts) and then send an EOI. This means it has higher overhead than the PIT (where you only need to send the EOI and don't need any extra I/O port writes).
For an older single-CPU computer (with PIT and RTC, but no HPET or local APICs) I'd use the PIT in "one-shot" mode for precise IRQs and the PIT's counter for high-precision "busy-wait". This minimises the "while(not_yet)" problem. However, using the PIT like this makes it very hard to also use the PIT to keep track of real time accurately, so in this case I'd use the RTC's "periodic IRQ" to keep track of real time.
For any computer that has a "fixed frequency" TSC I'd use the TSC for the high-precision "busy-wait". For computers where TSC is present but may change frequency, it may be possible to detect when the TSC changes frequency (e.g. the local APIC "thermal status" IRQ) and recalibrate, and still use the TSC for high-precision "busy-wait". If TSC isn't an option I'd use (in order of preference, depending on what's present) local APIC, HPET or PIT.
For any computer that has local APICs, I'd use the local APICs in "one-shot" mode for precise IRQs; and the "next best" timer for keeping track of real time (in order of preference, depending on what's present; HPET, PIT or RTC).
Of course this is a confusing mess (especially if other timers are introduced, or if the kernel uses timers in ways that weren't originally considered), and hard-coding this logic would probably be a mistake for any well designed kernel. Also, everything I've said only applies to 80x86 "PC compatible" computers. If the kernel is meant to be portable, then hard-coding the "which timer for what" logic for all potential architectures/platforms would quickly become a complete disaster...
Cheers,
Brendan
Unfortunately, for HPET most of the timers aren't able to generate an IRQ (usually only one or two of them can).FlashBurn wrote:As I read something about HPET I always asked me, how does it work with many CPUs? I mean if you aren´t using the APIC timer for the scheduler, but the HPET you would need 1 interrupt per CPU and now imagine you´ve got 256 CPUs (224 are also too much). Now you need 256 comperators for every CPU and we would need a bigger IDT. So at the moment and as long as we have "only" 256 interrupts on the x86 architecture it´s better to use the apic for scheduling, but you can use the HPET for sleeping/waiting.Brendan wrote: When there's multiple CPUs it's best for the scheduler to use multiple timers (one timer per CPU), but in this case there's always one local APIC timer per CPU anyway.
If HPET had one IRQ for each CPU then you could use the same interrupt vector for each IRQ, but use "fixed delivery" so that each IRQ is only delivered to one CPU. For example, timer 1 might send IRQ 99 to CPU 1, timer 2 might send IRQ 99 to CPU 2, timer 3 might send IRQ 99 to CPU 3, etc.
Also, I should probably warn you that in some systems that use HPET, there is no PIT and no RTC "periodic IRQ". In this case HPET (and firmware SMM code) is used to emulate the ("legacy") PIT and RTC periodic IRQ; and you can't use a few of the HPET timers (the ones that are capable of generating IRQs) while also using the PIT and/or RTC "periodic IRQ".
The RTC has several sources of IRQs. There's the "update IRQ" (which generates an IRQ once per second if it's enabled), there's the "alarm IRQ" (which will occur when the current month/day/hour/second matches the alarm's month/day/hour/second) and there's the "periodic IRQ".FlashBurn wrote:I haven´t read much about the RTC. If I could decide between PIT and RTC for my sleep queue, what timer is the better one? I mean I don´t know if the RTC works like the pit so if I init it so that it fires interrupts with 512Hz, does it have a counter (like the PIT) which I could use for high precision busy waiting or are you initing it to 512Hz an the RTC does the rest? And how about one-shot mode? As the RTC uses i/o-ports (I think so) it would be also "slow" to setup one-shot mode, wouldn´t it?
The "periodic IRQ" can be be set to any "power of 2" frequency between 2 Hz and 8192 Hz (e.g. 2 Hz, 4 Hz, 8 Hz, 16 Hz, ..., 4096 Hz, 8192 Hz); although frequencies above 4096 Hz can cause problems (too fast for the chipset to handle correctly). Unlike the PIT, it's impossible to set an arbitrary frequency (for example, if you wanted 1500 Hz then you'd have to choose 1024 Hz or 2048 Hz). However, the PIT can be a little annoying if you want to measure real time because there's rounding errors caused by the "strange" frequency it runs at (you can't divide 1.193181666666... MHz by any integer to get an integer "number of IRQs per second" so you might end up with 1,000.1522771723945236 interrupts per second instead of 1000, for e.g.).
The RTC doesn't have a sub-second counter you can read; so it's not useful for something like "high precision busy-wait". For something like "high precision busy-wait" you want a high precision counter but don't need an IRQ (which makes HPET counters or the TSC attractive options).
For overhead, for the RTC you must read one CMOS register to determine the cause of the interrupt (and to allow it to generate subsequent interrupts) and then send an EOI. This means it has higher overhead than the PIT (where you only need to send the EOI and don't need any extra I/O port writes).
For an older single-CPU computer (with PIT and RTC, but no HPET or local APICs) I'd use the PIT in "one-shot" mode for precise IRQs and the PIT's counter for high-precision "busy-wait". This minimises the "while(not_yet)" problem. However, using the PIT like this makes it very hard to also use the PIT to keep track of real time accurately, so in this case I'd use the RTC's "periodic IRQ" to keep track of real time.
For any computer that has a "fixed frequency" TSC I'd use the TSC for the high-precision "busy-wait". For computers where TSC is present but may change frequency, it may be possible to detect when the TSC changes frequency (e.g. the local APIC "thermal status" IRQ) and recalibrate, and still use the TSC for high-precision "busy-wait". If TSC isn't an option I'd use (in order of preference, depending on what's present) local APIC, HPET or PIT.
For any computer that has local APICs, I'd use the local APICs in "one-shot" mode for precise IRQs; and the "next best" timer for keeping track of real time (in order of preference, depending on what's present; HPET, PIT or RTC).
Of course this is a confusing mess (especially if other timers are introduced, or if the kernel uses timers in ways that weren't originally considered), and hard-coding this logic would probably be a mistake for any well designed kernel. Also, everything I've said only applies to 80x86 "PC compatible" computers. If the kernel is meant to be portable, then hard-coding the "which timer for what" logic for all potential architectures/platforms would quickly become a complete disaster...
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: high precision sleep function
This is interesting, how does it work? The only option I see here is that you could say send the int to all cpus (this way you could also use the pit for smp systems, but how to send the eoi?). Sends the HPET the interrupt w/o the io-apic, because if I remember right, you can specify one 1 cpu or a group of cpus which shall get the interrupt.Brendan wrote: If HPET had one IRQ for each CPU then you could use the same interrupt vector for each IRQ, but use "fixed delivery" so that each IRQ is only delivered to one CPU. For example, timer 1 might send IRQ 99 to CPU 1, timer 2 might send IRQ 99 to CPU 2, timer 3 might send IRQ 99 to CPU 3, etc.
So I´m back at the beginning The RTC isn´t the right timer for doing sleep queue and high precise busy waiting and it also isn´t the right timer for the scheduler (to inflexible).
What I miss here is which timer should I use for the scheduler? Also the one-shot mode gives me problems when it comes to busy-waiting, because there I can´t use a kernel variable which counts the timer irqs.Brendan wrote: For an older single-CPU computer (with PIT and RTC, but no HPET or local APICs) I'd use the PIT in "one-shot" mode for precise IRQs and the PIT's counter for high-precision "busy-wait". This minimises the "while(not_yet)" problem. However, using the PIT like this makes it very hard to also use the PIT to keep track of real time accurately, so in this case I'd use the RTC's "periodic IRQ" to keep track of real time.
For my luck, my kernel needn´t to be portable But I also have the opinion that if you use a mikrokernel, write a new kernel for every new architecture you like to support, so you can use the best the architecture has to offer and because you use servers for the most things, they needn´t to be rewritten, you only have to use the same interface on all architecture, how they are implemented is architecture specific.Brendan wrote: Of course this is a confusing mess (especially if other timers are introduced, or if the kernel uses timers in ways that weren't originally considered), and hard-coding this logic would probably be a mistake for any well designed kernel. Also, everything I've said only applies to 80x86 "PC compatible" computers. If the kernel is meant to be portable, then hard-coding the "which timer for what" logic for all potential architectures/platforms would quickly become a complete disaster...
I also use this logic only for the bare minimum I need in my kernel and a timer for the scheduler and a timer for the sleeping queue are my bare minimum
Re: high precision sleep function
Hi,
For example:
Now, also write a second "list of timers" that shows the characteristics for each timer that happens to be present (like I mentioned previously). Both of these lists could be written on paper during OS design (but one of the lists, or both lists, could be generated during boot/initialisation too).
Finally, for each entry in the first "things I need a timer for" list choose something from the second "available timers" list, so that the "desired characteristics" for each thing best match the "actual characteristics" for available timers. It'd probably be best to start from the most important "things I need a timer for", so you don't end up with too many things using the same timer (or using the "best" timer for something less important and a "not so good" timer for something more important).
Cheers,
Brendan
Interrupt handling and PIC vs. I/O APIC and different delivery modes, etc is an entirely different discussion...FlashBurn wrote:This is interesting, how does it work? The only option I see here is that you could say send the int to all cpus (this way you could also use the pit for smp systems, but how to send the eoi?). Sends the HPET the interrupt w/o the io-apic, because if I remember right, you can specify one 1 cpu or a group of cpus which shall get the interrupt.Brendan wrote:If HPET had one IRQ for each CPU then you could use the same interrupt vector for each IRQ, but use "fixed delivery" so that each IRQ is only delivered to one CPU. For example, timer 1 might send IRQ 99 to CPU 1, timer 2 might send IRQ 99 to CPU 2, timer 3 might send IRQ 99 to CPU 3, etc.
Let's start from the start. Write a list of "things that need some sort of timer", including which characteristics each "thing" needs from a timer.FlashBurn wrote:So I´m back at the beginning The RTC isn´t the right timer for doing sleep queue and high precise busy waiting and it also isn´t the right timer for the scheduler (to inflexible).
For example:
- High precision busy-wait: Needs some sort of counter that is increased or decreased very regularly at a know frequency. Doesn't need any IRQ. Overhead doesn't matter too much because it's a "busy-wait".
- Determining when a task has used "enough" CPU time: Needs some sort of IRQ. More precise is better, but accuracy (e.g. things like drift) doesn't matter much. Under high loads it's very important to have a separate timer per CPU. Low overhead would be nice because it's accessed every task switch.
- Determining when a sleeping task should wake up: Needs some sort of IRQ. Accuracy (e.g. things like drift) probably doesn't matter too much. More precise is better. Under high loads it'd be nice to have more than one of these to help distribute the load across more than one CPU. Low overhead would be nice.
- Keeping track of real time: Doesn't really need an IRQ (but an IRQ would make things easier), and doesn't really need to be precise (one "tick" per second may be good enough). Accuracy (e.g. things like drift) is very important. Only one of these is needed because using more than one would be a nightmare (Segal's Law: "A man with a watch knows what time it is. A man with two watches is never sure."). Overhead doesn't matter much because it's probably not running at a high frequency anyway. Can be used to synchronise and/or calibrate timers used for other purposes.
- Watchdog timer: This timer is entirely optional (I'm not even too sure if I want it or not, and if there's no timers left to use then the OS can live without it). The idea is to generate an NMI occasionally and see if the NMI keeps interrupting the same piece of kernel code each time (and report an error and/or reboot or something if the kernel has locked up). Would have to be able to generate an NMI (IRQ) that is broadcast to all ("online") CPUs, and can't be set to the same frequency as any other timer (otherwise it could interrupt the other timer's IRQ handler all the time and assume the other timer's IRQ is continually looping when it's not). Doesn't need to be precise at all and doesn't need to be accurate either (even a random IRQ could work). Overhead doesn't really matter because it shouldn't be running at a high frequency and might only be used for debugging purposes anyway.
- Profiling timer: This timer is optional too (if there's no more timers then the OS can go without it). The idea is to generate an IRQ at regular intervals and record what code was interrupted, and use the results to determine where the CPU is spending most of it's time. Would have to be able to generate an IRQ, and can't be set to the same frequency as any other timer (otherwise it could give biased results). Doesn't need to be precise at all and doesn't need to be accurate either (even a random IRQ could work). Overhead doesn't really matter because it shouldn't be running at a high frequency and might only be used by developers anyway.
Now, also write a second "list of timers" that shows the characteristics for each timer that happens to be present (like I mentioned previously). Both of these lists could be written on paper during OS design (but one of the lists, or both lists, could be generated during boot/initialisation too).
Finally, for each entry in the first "things I need a timer for" list choose something from the second "available timers" list, so that the "desired characteristics" for each thing best match the "actual characteristics" for available timers. It'd probably be best to start from the most important "things I need a timer for", so you don't end up with too many things using the same timer (or using the "best" timer for something less important and a "not so good" timer for something more important).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: high precision sleep function
This is the system I have now (but only on 1 cpu systems), but there is a problem with accuracy. The running thread could give up to the scheduler, but if I increment my kernel timer irq counter every time the scheduler get called, it will get very inaccurate. If there are 2 threads which do much ipc and wait for each other then it will get inaccurate very fast, so I disable interrupts, set a flag and call the scheduler code. Every time this flag is set, the counter wont be incremented. So I hope that it wont become very inaccurate.Brendan wrote: For each purpose (in your list, not mine) find out which ones need very similar characteristics. For my list, the desired characteristics for "determining when a task has used enough CPU time" and "determining when a sleeping task should wake up" are mostly the same, so combining these and using the same timer/s for both purposes might make sense.
As I write there comes another problem to my mind. On smp systems I have a kernel timer irq counter for every cpu and as the APs are running the scheduler before the BSP is running the scheduler the values will be different on different cpus and this could be a problem (the same with the tsc on different cpus). So I could adjust they once (as I do for the tsc) or from time to time, but I don´t know how I should implement this.
On smp systems I use the apic (because you need an apic for smp) for the scheduler and the pit (because there always will be a pit <- this could be wrong!) for the sleeping queue. This has the advantage that I can disable the irq for the pit if there is no thread in the sleeping queue and I could also put it in one-shot mode. So this means less work for the scheduler.
So I thought it would be a good idea to use the RTC for the sleeping queue (so I could also offload the sleeping from the scheduler on 1 cpu systems), if the pc has an apic I will use the apic for the scheduler and the pit for the sleeping queue, if the pc has an HPET (I will assume, but only here, that is also has an apic) then I will use the apic for the scheduler and the HPET for the sleeping queue. Now comes the tricky part, I will always use the timer of the scheduler for the precise busy waiting (makes things easier, because only pit and apic and I know that they have a counter I can use), but the HPET would be better to use here. So I would have a dirty solution for that, writing a module which uses the apic for the scheduler, but the HPET for the precise busy waiting. This would be the system which would be best fit for my kernel.
I only need to search for documents for the RTC if it will work the way I need it to.
Re: high precision sleep function
Hi,
Notice how nothing I said above has anything to do with any timer?
Of course usually I want something more than just co-operative scheduling, and I do want to limit the amount of time a task can use the CPU for (sometimes). Fortunately this is simple - the timer IRQ just calls the existing "switch_to_any_task(void)".
It's not that simple though. There's actually 2 ways to do it. With a "one-shot" timer, during the task switch you'd tell the timer how long the next task is meant to run for; and if the timer generates an IRQ before the next task switch you know the task used all the time it was given. In this case the timer IRQ only needs to send an EOI and then call the "switch_to_any_task(void)" function. If there aren't any tasks that are ready to run, or if only one task is "ready to be run" (e.g. all the other tasks a blocked/waiting for something), you can disable the timer completely (and run the idle loop, or run the same task, until another task becomes ready to run). To make this even better it means you get a lot more control over how long it'll be until the IRQ occurs. For example, if you know another task is meant to wake up and preempt the current task in 70 us time, then you can set the PIT count to 83 and the time before the IRQ occurs will be between 68.716 us (82 * 838 ns) and 69.554 us (83 * 838 ns); which means a whole lot less "while(not_yet)" CPU time wasted.
The other way is to use a periodic timer. In this case, during the task switch you set a "time to live" variable, and the timer IRQ decreases this variable and if the variable reaches zero you send an EOI and then call the "switch_to_any_task(void)" function. This method is more common (especially for ancient "single-CPU only" systems), but it means you end up with a lot of wasted IRQs. For example, if the task is meant to run for a maximum of 100 ms then instead of setting the timer for 100 ms and having one IRQ, you might have an IRQ every 10 ms (and have 9 IRQs that just waste of time before the tenth IRQ causes the task switch). Then there's the "while(not_yet)" problem again - if you know another task is meant to wake up and preempt the current task in 70 us time, then you can't run the another task for (almost) 70 us and you have to waste 70 us of CPU time doing the "while(not_yet)" thing.
Ok, now notice how nothing I said anywhere above has anything to do with keeping track of the current "wall clock" time?
Keeping track of real time (or "wall clock" time) has nothing to do with scheduling. For convenience, some people try to use the same timer to keep track of real time and for scheduling. This idea sucks badly for multi-CPU (much better to use a different timer to keep track of real time). For single CPU it sucks slightly less, but it's a still a bad idea.
The reason it sucks on multi-CPU is that you've typically got a different timer for each CPU (and no idea which one to use for keeping track of real time); and you don't want the "per CPU" timers to be synchronised because that increases overhead (it costs something to keep them synchronised) and increases the chance that multiple CPUs will be fighting for the same reentrancy locks, same cache lines, etc. Mostly you want to treat it as a completely different scheduler for each CPU (with it's own "per CPU" scheduling queues, and it's own "per CPU" timer).
Imagine a task wants to sleep for 1234 ms. The current time happens to be 9870000 "ticks" where each tick is 1 ms (or something like that). You do "thread_data->wake_time = current_time + delay = 987650000 + 1234 = 9870000 + 1234 = 9871234". You've got 512 "coarse sleeping queues" (for e.g.) and an IRQ that occurs every 100 ms, so you do "queue_number = int(thread_data->wake_time / 100) % 512 = int(9871234 / 100) % 512 = int(98712.34) % 512 = 98712 % 512 = 408" and put the task onto "coarse sleeping queue" number 408.
For each IRQ (every 100 ms) you add 100 to the current time, and then process the next "coarse sleeping queue". When you process a "coarse sleeping queue" you check if each thread's "wake time" is less than "current_time + 100". If the thread needs to sleep longer you leave it on the same "coarse sleeping queue" for next time that queue is processed. If the thread is close enough to being woken up you remove it from the "coarse sleeping queue", decide which CPU it should be run on, and insert it into the scheduler's "fine sleeping queue" for that CPU.
The scheduler has a "fine sleeping queue" for each CPU; which is kept in order of which thread needs to wake up first. This means that when the scheduler is deciding how long a thread should run for it can look at the first entry in that CPU's "fine sleeping queue" and use that as the maximum time slice. For example, if the scheduler is about to give a task 123 ms of CPU time but the first task on that CPU's "fine sleeping queue" is meant to wake up in 88 ms then the scheduler could give the next task 88 ms of CPU time instead of 123 ms to make sure that the first task on the "fine sleeping queue" is woken up at exactly the right time. This is where the "one-shot" timer is really nice.
When the first task on the CPU's "fine sleeping queue" is woken up, the scheduler decides if it should get CPU time immediately or if there's other more important tasks that should be run. In any case, when the task does get CPU time it can do the "high precision busy-wait" thing (if necessary). This only really makes sense if the "high precision busy-wait" timer is more precise than the timer the scheduler is using (in general, this will only happen if you're using the TSC to do the high precision busy-wait).
Now, imagine this with the RTC being used for the "coarse sleeping queues" (and keeping track of real time), and the PIT being used (in "one shot" mode) for the scheduler (and for the "fine sleeping queue"). If a task asks to sleep for 12345678 ns then you'd wake it up after between 12345678 and 12346516 ns (the delay would have better than 1 us precision, where 1 us precision is 10000 times better than 10 ms precision). You've got to admit that's not bad for a "worst case" on ancient hardware.
For modern hardware you'd probably be using HPET at 1000 Hz for the "coarse sleeping queues" (and keeping track of real time), and the local APIC timer/s (in "one shot" mode) for the scheduler (and for the "fine sleeping queue"), and then you'd finish off the time delay by using the TSC for a very high precision busy-wait. If the local APIC timer is running at 500 MHz (actually fairly slow for modern hardware); then if a task asks to sleep for 12345678.9 ns you'd wake it up after between 12345676 and 12345678 ns, then busy-wait for the remainder. With a 2 GHz CPU (still fairly slow for modern hardware) the TSC would be incremented every 0.5 ns, so you'd be busy-waiting for between 1 ns and 3 ns. The final delay would have better than 1 ns precision (where 1 ns precision is 10000000 times better than 10 ms precision). I'd be happy with that..
Cheers,
Brendan
For schedulers, I usually start with a "switch_directly_to_task(task_ID task)" function. This is helpful - if a low priority task is running when a high priority task becomes ready to run then you can call the "switch_directly_to_task()" function without doing anything to decide which task to switch to or anything. When a task blocks (e.g. it calls "wait_for_message()" or something) then you need to find a new task to run. For that I write a "switch_to_any_task(void)" function (which decides which task should be run next, and then calls the "switch_directly_to_task()" function from earlier. After writing these functions I've got enough to do a fairly decent co-operative scheduler.FlashBurn wrote:This is the system I have now (but only on 1 cpu systems), but there is a problem with accuracy. The running thread could give up to the scheduler, but if I increment my kernel timer irq counter every time the scheduler get called, it will get very inaccurate. If there are 2 threads which do much ipc and wait for each other then it will get inaccurate very fast, so I disable interrupts, set a flag and call the scheduler code. Every time this flag is set, the counter wont be incremented. So I hope that it wont become very inaccurate.Brendan wrote: For each purpose (in your list, not mine) find out which ones need very similar characteristics. For my list, the desired characteristics for "determining when a task has used enough CPU time" and "determining when a sleeping task should wake up" are mostly the same, so combining these and using the same timer/s for both purposes might make sense.
Notice how nothing I said above has anything to do with any timer?
Of course usually I want something more than just co-operative scheduling, and I do want to limit the amount of time a task can use the CPU for (sometimes). Fortunately this is simple - the timer IRQ just calls the existing "switch_to_any_task(void)".
It's not that simple though. There's actually 2 ways to do it. With a "one-shot" timer, during the task switch you'd tell the timer how long the next task is meant to run for; and if the timer generates an IRQ before the next task switch you know the task used all the time it was given. In this case the timer IRQ only needs to send an EOI and then call the "switch_to_any_task(void)" function. If there aren't any tasks that are ready to run, or if only one task is "ready to be run" (e.g. all the other tasks a blocked/waiting for something), you can disable the timer completely (and run the idle loop, or run the same task, until another task becomes ready to run). To make this even better it means you get a lot more control over how long it'll be until the IRQ occurs. For example, if you know another task is meant to wake up and preempt the current task in 70 us time, then you can set the PIT count to 83 and the time before the IRQ occurs will be between 68.716 us (82 * 838 ns) and 69.554 us (83 * 838 ns); which means a whole lot less "while(not_yet)" CPU time wasted.
The other way is to use a periodic timer. In this case, during the task switch you set a "time to live" variable, and the timer IRQ decreases this variable and if the variable reaches zero you send an EOI and then call the "switch_to_any_task(void)" function. This method is more common (especially for ancient "single-CPU only" systems), but it means you end up with a lot of wasted IRQs. For example, if the task is meant to run for a maximum of 100 ms then instead of setting the timer for 100 ms and having one IRQ, you might have an IRQ every 10 ms (and have 9 IRQs that just waste of time before the tenth IRQ causes the task switch). Then there's the "while(not_yet)" problem again - if you know another task is meant to wake up and preempt the current task in 70 us time, then you can't run the another task for (almost) 70 us and you have to waste 70 us of CPU time doing the "while(not_yet)" thing.
Ok, now notice how nothing I said anywhere above has anything to do with keeping track of the current "wall clock" time?
Keeping track of real time (or "wall clock" time) has nothing to do with scheduling. For convenience, some people try to use the same timer to keep track of real time and for scheduling. This idea sucks badly for multi-CPU (much better to use a different timer to keep track of real time). For single CPU it sucks slightly less, but it's a still a bad idea.
The reason it sucks on multi-CPU is that you've typically got a different timer for each CPU (and no idea which one to use for keeping track of real time); and you don't want the "per CPU" timers to be synchronised because that increases overhead (it costs something to keep them synchronised) and increases the chance that multiple CPUs will be fighting for the same reentrancy locks, same cache lines, etc. Mostly you want to treat it as a completely different scheduler for each CPU (with it's own "per CPU" scheduling queues, and it's own "per CPU" timer).
I'd have a "coarse" set of sleeping queues to get rid of the majority of the time delay; and then (when a task is about to wake up) shift the task to the scheduler's "fine" sleeping queue. Not sure if that's a good enough explanation though.FlashBurn wrote:On smp systems I use the apic (because you need an apic for smp) for the scheduler and the pit (because there always will be a pit <- this could be wrong!) for the sleeping queue. This has the advantage that I can disable the irq for the pit if there is no thread in the sleeping queue and I could also put it in one-shot mode. So this means less work for the scheduler.
So I thought it would be a good idea to use the RTC for the sleeping queue (so I could also offload the sleeping from the scheduler on 1 cpu systems), if the pc has an apic I will use the apic for the scheduler and the pit for the sleeping queue, if the pc has an HPET (I will assume, but only here, that is also has an apic) then I will use the apic for the scheduler and the HPET for the sleeping queue.
Imagine a task wants to sleep for 1234 ms. The current time happens to be 9870000 "ticks" where each tick is 1 ms (or something like that). You do "thread_data->wake_time = current_time + delay = 987650000 + 1234 = 9870000 + 1234 = 9871234". You've got 512 "coarse sleeping queues" (for e.g.) and an IRQ that occurs every 100 ms, so you do "queue_number = int(thread_data->wake_time / 100) % 512 = int(9871234 / 100) % 512 = int(98712.34) % 512 = 98712 % 512 = 408" and put the task onto "coarse sleeping queue" number 408.
For each IRQ (every 100 ms) you add 100 to the current time, and then process the next "coarse sleeping queue". When you process a "coarse sleeping queue" you check if each thread's "wake time" is less than "current_time + 100". If the thread needs to sleep longer you leave it on the same "coarse sleeping queue" for next time that queue is processed. If the thread is close enough to being woken up you remove it from the "coarse sleeping queue", decide which CPU it should be run on, and insert it into the scheduler's "fine sleeping queue" for that CPU.
The scheduler has a "fine sleeping queue" for each CPU; which is kept in order of which thread needs to wake up first. This means that when the scheduler is deciding how long a thread should run for it can look at the first entry in that CPU's "fine sleeping queue" and use that as the maximum time slice. For example, if the scheduler is about to give a task 123 ms of CPU time but the first task on that CPU's "fine sleeping queue" is meant to wake up in 88 ms then the scheduler could give the next task 88 ms of CPU time instead of 123 ms to make sure that the first task on the "fine sleeping queue" is woken up at exactly the right time. This is where the "one-shot" timer is really nice.
When the first task on the CPU's "fine sleeping queue" is woken up, the scheduler decides if it should get CPU time immediately or if there's other more important tasks that should be run. In any case, when the task does get CPU time it can do the "high precision busy-wait" thing (if necessary). This only really makes sense if the "high precision busy-wait" timer is more precise than the timer the scheduler is using (in general, this will only happen if you're using the TSC to do the high precision busy-wait).
Now, imagine this with the RTC being used for the "coarse sleeping queues" (and keeping track of real time), and the PIT being used (in "one shot" mode) for the scheduler (and for the "fine sleeping queue"). If a task asks to sleep for 12345678 ns then you'd wake it up after between 12345678 and 12346516 ns (the delay would have better than 1 us precision, where 1 us precision is 10000 times better than 10 ms precision). You've got to admit that's not bad for a "worst case" on ancient hardware.
For modern hardware you'd probably be using HPET at 1000 Hz for the "coarse sleeping queues" (and keeping track of real time), and the local APIC timer/s (in "one shot" mode) for the scheduler (and for the "fine sleeping queue"), and then you'd finish off the time delay by using the TSC for a very high precision busy-wait. If the local APIC timer is running at 500 MHz (actually fairly slow for modern hardware); then if a task asks to sleep for 12345678.9 ns you'd wake it up after between 12345676 and 12345678 ns, then busy-wait for the remainder. With a 2 GHz CPU (still fairly slow for modern hardware) the TSC would be incremented every 0.5 ns, so you'd be busy-waiting for between 1 ns and 3 ns. The final delay would have better than 1 ns precision (where 1 ns precision is 10000000 times better than 10 ms precision). I'd be happy with that..
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: high precision sleep function
Ok, I like your idea with 2 different sleeping queues. I will try implement it. On 1 cpu systems it will be easy, but I also got an idea how I can solve some problems on smp systems (this wont be as optimal as your suggestion, but it will be good enough).
edit::
Found one more problem (but only on multi cpu systems and not for one cpu systems)
Ok, I´m using the rtc for the coarse sleeping queue and have a frequency of 8Hz (125ms). So the periodic interrupt will happen in (e.g.) 5ms and a threads wants to go to sleep for say 130ms. So what happens is that it gets en-queued into the sleeping queue and will sleep for (e.g.) 5ms and then the periodic interrupt will occur and the sleep time is decreased by 125ms and the thread will be put onto the sleeping queue of the scheduler and will sleep for 5ms. So the problem here is that the thread wanted to sleep for 130ms ,but has slept only for 10ms. So either I would increase the frequency of the coarse sleeping queue or I would need a high precision timer to look when it was inserted and save that time (so I have a hen egg problem ). As always its easy on one cpu systems, but tricky on multi cpu systems.
edit2::
I just had a look at the Haiku source and they also use the one-shot modus of the timers and they implement this high precision sleeping queue on a per cpu basis and I think I will use that too. It makes things easier and I have less code to change and rewrite
I don´t really get why should I have 512 (for your example) "coarse sleeping queues"?! I would use a delta-queue and every item in the delta-queue would be a list of threads which need to wait x ticks (where x would be "time to wait in ms" / 100, for your example of 100ms for every tick) and every time the irq get called I will decrement the 1st item of the delta-queue and if it is 0 I will add all threads in the list of the item to the "fine sleeping queue" of the scheduler on the cpu the irq code is running on. Then I would call the scheduler code and the scheduler will look how long the current thread has ran and will do the needed calculations for the next thread or the current thread.Brendan wrote: Imagine a task wants to sleep for 1234 ms. The current time happens to be 9870000 "ticks" where each tick is 1 ms (or something like that). You do "thread_data->wake_time = current_time + delay = 987650000 + 1234 = 9870000 + 1234 = 9871234". You've got 512 "coarse sleeping queues" (for e.g.) and an IRQ that occurs every 100 ms, so you do "queue_number = int(thread_data->wake_time / 100) % 512 = int(9871234 / 100) % 512 = int(98712.34) % 512 = 98712 % 512 = 408" and put the task onto "coarse sleeping queue" number 408.
edit::
Found one more problem (but only on multi cpu systems and not for one cpu systems)
Ok, I´m using the rtc for the coarse sleeping queue and have a frequency of 8Hz (125ms). So the periodic interrupt will happen in (e.g.) 5ms and a threads wants to go to sleep for say 130ms. So what happens is that it gets en-queued into the sleeping queue and will sleep for (e.g.) 5ms and then the periodic interrupt will occur and the sleep time is decreased by 125ms and the thread will be put onto the sleeping queue of the scheduler and will sleep for 5ms. So the problem here is that the thread wanted to sleep for 130ms ,but has slept only for 10ms. So either I would increase the frequency of the coarse sleeping queue or I would need a high precision timer to look when it was inserted and save that time (so I have a hen egg problem ). As always its easy on one cpu systems, but tricky on multi cpu systems.
edit2::
I just had a look at the Haiku source and they also use the one-shot modus of the timers and they implement this high precision sleeping queue on a per cpu basis and I think I will use that too. It makes things easier and I have less code to change and rewrite
Last edited by FlashBurn on Wed Mar 24, 2010 5:31 am, edited 2 times in total.