I haven't finished my implementation yet, but there is one problem on my mind. Imagine you have 2 CPUs: one is running a thread with priority 28, and the other is in the scheduler and picks the next thread (priority 30). It looks into its sleeping queue and sees that a thread (priority 29) will be woken up in 20 ms, so the priority-30 thread will only run for 20 ms before the scheduler is called and the priority-29 thread wakes up. But since there is still a priority-30 thread that hasn't used up its time, the priority-29 thread can't run here, although it could run on the other CPU, where a priority-28 thread is running. The worst case I see is that the priority-28 thread has only just started to run and the scheduler decided it may use its whole timeslice before the scheduler is called again, so the wait time for the priority-29 thread can be longer than it would be with a periodic timer.
I hope you could follow my example. My solution would be to send a lowest-priority IPI to all CPUs except myself (before I insert the thread into my global ready queue); the CPU with the lowest priority then checks whether its current thread may run out the rest of its timeslice or whether the just-woken thread has a higher priority. The problem is that the Intel manuals say I shouldn't use lowest-priority IPIs, and I don't know whether this would be a good idea performance-wise.
So what would you say (I have one global thread queue, not one per CPU)?
high precision sleep function
Re: high precision sleep function
I just finished my rewrite to one-shot mode and it seems to work. I have 2 problems left (what would OS dev be without leftover problems?).
I want my idle thread to work so that, if no thread wants to run and no thread is waiting, I disable the timer, and the only thing that can get the PC out of halting is an interrupt that inserts a thread into the ready queue.
This is easy on single-CPU systems, but I have no idea how to do it on SMP systems.
I thought it would be good to send a lowest-priority IPI to all CPUs except the sender, so that if all CPUs are halting one of them is woken up. The problem is that this IPI could also go to the only CPU that is running (because the idle CPUs have a higher TPR value than the running CPUs; that is meant to steer IRQs to running CPUs instead of waking a CPU just to handle an interrupt), so it could happen that all the other CPUs stay halted while only one keeps running. Another approach would be to send an IPI to all CPUs except the sender, but that is bad for performance, because then all CPUs fight for the same lock, and if there is only one thread to run I would wake up all CPUs, the first would take the thread and all the others would go back to sleep (maybe this is still the best option).
I also thought it could be improved by first checking whether the new thread (the one about to be inserted into the ready queue) has a higher priority than the one I'm currently running; in that case I wouldn't need to send an IPI at all, I would just reschedule locally.
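Roughly what I have in mind (just a sketch; all of these names are placeholders, not my real kernel API):

```c
#include <stdbool.h>

typedef struct thread {
    int priority;                 /* higher number = higher priority */
} thread_t;

/* placeholder stubs so the sketch is self-contained */
static thread_t *current_thread(void)      { static thread_t t = { 28 }; return &t; }
static void request_local_reschedule(void) { /* set a need-resched flag */ }
static void notify_other_cpu(void)         { /* send an IPI to some other CPU */ }

/* called whenever a thread is about to be inserted into the ready queue */
void on_thread_ready(thread_t *woken)
{
    if (woken->priority > current_thread()->priority) {
        /* the woken thread outranks what this CPU is running: no IPI needed */
        request_local_reschedule();
    } else {
        /* otherwise let another CPU have a look at it */
        notify_other_cpu();
    }
}
```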
Now for the 2nd problem. I now have a high precision sleeping queue, but what I don't have is a high precision counter. I know that in games you read a high precision counter, do all the work to be done, then read the counter again and check whether you need to wait/sleep or can start the next frame.
On a single-CPU system I would use my scheduler and the same timer count value I'm using for the high precision sleeping queue, but on SMP systems this won't work, because every CPU has its own timer (with a different value).
What makes this method even trickier is that while my idle thread is running, that high precision counter won't be updated.
The solution I would use now is the TSC, because it only becomes inaccurate if the CPU changes its frequency, and that won't happen while the CPU is running e.g. a game's main loop thread, will it? Or are there programs which use such a counter for other things, like reading the value, waiting some time, reading the current value again and wanting to know how long they have waited?
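For reference, this is roughly how I would read the TSC and turn it into an elapsed time (only a sketch; it assumes a constant-rate TSC and a tsc_hz value calibrated elsewhere):

```c
#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    const uint64_t tsc_hz = 2000000000ULL;   /* assumed calibration result: 2 GHz */

    uint64_t start = rdtsc();
    /* ... do one frame's worth of work here ... */
    uint64_t end = rdtsc();

    /* fine for frame-sized deltas; longer intervals would need overflow care */
    uint64_t elapsed_ns = (end - start) * 1000000000ULL / tsc_hz;
    printf("frame took %llu ns\n", (unsigned long long)elapsed_ns);
    return 0;
}
```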
Re: high precision sleep function
Hi,
FlashBurn wrote:I want my idle thread to work so that, if no thread wants to run and no thread is waiting, I disable the timer, and the only thing that can get the PC out of halting is an interrupt that inserts a thread into the ready queue.
This is easy on single-CPU systems, but I have no idea how to do it on SMP systems.
I thought it would be good to send a lowest-priority IPI to all CPUs except the sender, so that if all CPUs are halting one of them is woken up. The problem is that this IPI could also go to the only CPU that is running (because the idle CPUs have a higher TPR value than the running CPUs; that is meant to steer IRQs to running CPUs instead of waking a CPU just to handle an interrupt), so it could happen that all the other CPUs stay halted while only one keeps running. Another approach would be to send an IPI to all CPUs except the sender, but that is bad for performance, because then all CPUs fight for the same lock, and if there is only one thread to run I would wake up all CPUs, the first would take the thread and all the others would go back to sleep (maybe this is still the best option).

You probably shouldn't use "send to lowest priority" for any interrupt that comes from the local APIC (which includes IPIs), because it's not supported on most CPUs.
When a task becomes ready to run (for any reason), you decide which CPU it can/should be run on and put the task on (one of?) that CPU's "ready to run" scheduler queue/s. If you decided that the new "ready to run" task should be run on the current CPU then you either preempt the current task or keep running the current task. If you decided that the new "ready to run" task should be run on a different CPU; then, if the other CPU is idle or if it's running a lower priority task (that should be preempted by the task that just became "ready to run"), you send an IPI (using "fixed delivery") directly to that other CPU (and only that CPU); and if the other CPU is currently doing something more important you don't need to do anything.
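As a rough sketch of that decision (all of the names here are hypothetical, not from any particular kernel):

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_CPUS          16
#define RESCHEDULE_VECTOR 0xF0          /* assumed interrupt vector */

struct cpu_info {
    bool     idle;
    int      running_priority;          /* priority of the task it is currently running */
    uint32_t apic_id;
};

static struct cpu_info cpus[MAX_CPUS];
static int num_cpus = 4;
static int this_cpu = 0;

/* placeholder stubs */
static void apic_send_fixed_ipi(uint32_t apic_id, uint8_t vector) { (void)apic_id; (void)vector; }
static void preempt_or_continue(int new_priority)                 { (void)new_priority; }

void task_became_ready(int new_priority)
{
    int target = -1;
    int lowest = INT_MAX;

    /* prefer an idle CPU; otherwise the CPU running the lowest-priority task */
    for (int i = 0; i < num_cpus; i++) {
        if (cpus[i].idle) { target = i; break; }
        if (cpus[i].running_priority < lowest) {
            lowest = cpus[i].running_priority;
            target = i;
        }
    }

    if (target == this_cpu) {
        preempt_or_continue(new_priority);
    } else if (cpus[target].idle || cpus[target].running_priority < new_priority) {
        /* "fixed delivery" IPI to that one CPU only */
        apic_send_fixed_ipi(cpus[target].apic_id, RESCHEDULE_VECTOR);
    }
    /* else: every other CPU is doing something more important, so do nothing */
}
```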
Also, (for CPUs that support it) when there's no tasks to run you could use MONITOR/MWAIT (instead of HLT). In this case, when a CPU puts a task onto an idle CPU's "ready to run" scheduler queue there's no need to send any IPI (because the idle CPU is monitoring it's "ready to run" queue and wakes up when the queue is modified).
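A minimal sketch of such an idle loop (it assumes MONITOR/MWAIT support was already checked via CPUID, and uses a single variable as a stand-in for the real ready-queue head):

```c
#include <stdint.h>

/* 0 means "nothing queued"; another CPU stores a nonzero value here when it
 * puts a task on this CPU's ready queue (a stand-in for the real queue head) */
static volatile uintptr_t queue_head;

static inline void cpu_monitor(const volatile void *addr)
{
    __asm__ __volatile__("monitor" :: "a"(addr), "c"(0), "d"(0));
}

static inline void cpu_mwait(void)
{
    __asm__ __volatile__("mwait" :: "a"(0), "c"(0));
}

void idle_loop(void)
{
    while (queue_head == 0) {
        cpu_monitor(&queue_head);   /* arm the monitor on the queue head's cache line */
        if (queue_head != 0)        /* re-check after arming to avoid a lost wakeup */
            break;
        cpu_mwait();                /* doze until that cache line is written */
    }
    /* fall through to the scheduler here */
}
```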
Of course if you're using "global ready to run queues" (rather than "per CPU ready to run queues") then you may have to "broadcast to all but self". IPIs (and IRQs) are relatively expensive because the CPU needs to do IDT lookups (then GDT lookups and protection checks), and can't do branch prediction to avoid a complete pipeline flush; so broadcasting an IPI (or an IRQ) is something you want to avoid. You'd also need some sort of reentrancy lock protecting the global queue; and because all CPUs would receive the "broadcast to all" IPI at the same time they'll all be fighting for the same "global queue reentrancy lock" at the same time; so you get a (relatively expensive) IPI followed by guaranteed lock contention.
For 2 CPUs this might happen 20 times per second and you'll only have 1 CPU getting the IPI overhead (a total of 20 IPIs per second). For 4 CPUs this might happen 40 times per second and you'll have 3 CPUs getting the IPI overhead (a total of 120 IPIs per second). For 8 CPUs this might happen 80 times per second and you'll have 7 CPUs getting the IPI overhead (a total of 560 IPIs per second). For 16 CPUs this might happen 160 times per second and you'll have 15 CPUs getting the IPI overhead (a total of 2400 IPIs per second). On top of this you've got more and more CPUs fighting for the same lock each time an IPI occurs (and more and more time before all CPUs have been able to acquire that lock); and at some point (with enough CPUs) the next IPI will be sent before all CPUs have finished handling the last IPI. Basically the overhead grows quadratically (it's a complete disaster for scalability).
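(To put those example numbers into one formula: roughly 10 * N wakeups per second, each broadcast to N - 1 other CPUs, gives about 10 * N * (N - 1) IPIs per second, which is what the figures above work out to.)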
FlashBurn wrote:Now for the 2nd problem. I now have a high precision sleeping queue, but what I don't have is a high precision counter. I know that in games you read a high precision counter, do all the work to be done, then read the counter again and check whether you need to wait/sleep or can start the next frame.
On a single-CPU system I would use my scheduler and the same timer count value I'm using for the high precision sleeping queue, but on SMP systems this won't work, because every CPU has its own timer (with a different value).
What makes this method even trickier is that while my idle thread is running, that high precision counter won't be updated.

If a task is running on a CPU and doing a "high precision busy-wait", then that CPU isn't idle, and the CPU's local APIC timer count is counting.
When the task is removed from the initial sleeping queue you'd do "local_APIC_counter_wake_time = (previous_timer_wake_time - previous_timer_current_time) + local_APIC_counter_current_time"; or (for TSC) you'd do "TSC_wake_time = (previous_timer_wake_time - previous_timer_current_time) + TSC_current_time". Basically you convert from one timescale to the new timescale when you shift it from one timer to another.
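As a rough sketch of that conversion (assuming both timers' values are already kept in a common unit such as nanoseconds; the helper names are made up):

```c
#include <stdint.h>

/* hypothetical helpers; real code would read the old queue's timer and the
 * local APIC timer (or TSC) and convert both to nanoseconds */
static uint64_t previous_timer_now_ns(void)     { return 0; }   /* stub */
static uint64_t local_apic_counter_now_ns(void) { return 0; }   /* stub */

/* called when the task is moved from the old sleeping queue to the new one */
uint64_t convert_wake_time(uint64_t previous_timer_wake_time_ns)
{
    /* how long the task still has to sleep, measured on the old timer */
    uint64_t remaining_ns = previous_timer_wake_time_ns - previous_timer_now_ns();

    /* the same remaining time expressed as a deadline on the new timer */
    return remaining_ns + local_apic_counter_now_ns();
}
```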
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: high precision sleep function
Brendan wrote:Of course if you're using "global ready to run queues" (rather than "per CPU ready to run queues") then you may have to "broadcast to all but self". IPIs (and IRQs) are relatively expensive because the CPU needs to do IDT lookups (then GDT lookups and protection checks), and can't do branch prediction to avoid a complete pipeline flush; so broadcasting an IPI (or an IRQ) is something you want to avoid. You'd also need some sort of reentrancy lock protecting the global queue; and because all CPUs would receive the "broadcast to all" IPI at the same time they'll all be fighting for the same "global queue reentrancy lock" at the same time; so you get a (relatively expensive) IPI followed by guaranteed lock contention.

If the other CPUs are running threads with a priority >= the just-inserted thread's priority, then no CPU will try to take the lock, and if a CPU is idle it isn't that important that it tries to take the lock. For the situation where CPU2 gets the IPI while running a thread whose priority is lower than the highest-priority ready thread and takes the lock, while another CPU also wants the lock, gets it some time later and finds no higher-priority thread: that's not optimal, but it works for the moment.
Brendan wrote:When the task is removed from the initial sleeping queue you'd do "local_APIC_counter_wake_time = (previous_timer_wake_time - previous_timer_current_time) + local_APIC_counter_current_time"; or (for TSC) you'd do "TSC_wake_time = (previous_timer_wake_time - previous_timer_current_time) + TSC_current_time". Basically you convert from one timescale to the new timescale when you shift it from one timer to another.

I don't really get what you want to say here.
The problem with the high precision counter on SMP systems is the same problem as with the TSC on SMP systems: they won't be equal across CPUs. So if a thread running on CPU1 reads the current TSC value, does some work, is preempted, then finishes its work on CPU2 and reads the current TSC value there, the two values aren't comparable, and that's a problem.
On the other problem of having only one ready queue for all CPUs: I don't really know how to implement per-CPU ready queues. What I don't see is how to synchronize the per-CPU queues. It could be that CPU1 is running a thread with priority 30 and has another thread with priority 29 on its ready queue, while CPU2 is running a thread with priority 28; that thread just finished, so CPU2 looks at its own ready queue, finds a thread with priority 20 and runs it, but there is a thread with priority 29 on CPU1's ready queue which should really run next. It won't, because CPU2 doesn't know that there is a higher-priority thread on CPU1's ready queue.
Re: high precision sleep function
FlashBurn wrote:
Brendan wrote:When the task is removed from the initial sleeping queue you'd do "local_APIC_counter_wake_time = (previous_timer_wake_time - previous_timer_current_time) + local_APIC_counter_current_time"; or (for TSC) you'd do "TSC_wake_time = (previous_timer_wake_time - previous_timer_current_time) + TSC_current_time". Basically you convert from one timescale to the new timescale when you shift it from one timer to another.
I don't really get what you want to say here. The problem with the high precision counter on SMP systems is the same problem as with the TSC on SMP systems: they won't be equal across CPUs, so if a thread running on CPU1 reads the current TSC value, does some work, is preempted, then finishes on CPU2 and reads the TSC there, the two values aren't comparable.

I think the idea is that you don't give programs the raw counter value; you normalise it immediately, taking into account things like timing and frequency differences. Then you can subtract one value from the other and get a meaningful delta, even if different counters are used.
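For example (just a sketch, assuming the kernel calibrates a per-CPU offset and TSC frequency at boot; none of these names come from a real kernel):

```c
#include <stdint.h>

struct percpu_time {
    int64_t  tsc_offset;    /* raw TSC on this CPU -> shared timeline, in ticks */
    uint64_t tsc_hz;        /* this CPU's calibrated TSC frequency */
};

/* assumed to be filled in during boot-time calibration */
static struct percpu_time this_cpu_time = { 0, 1000000000ULL };

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* what a program gets back instead of the raw per-CPU counter value */
uint64_t monotonic_time_ns(void)
{
    uint64_t ticks = (uint64_t)((int64_t)rdtsc() + this_cpu_time.tsc_offset);

    /* split the conversion so it doesn't overflow 64 bits */
    uint64_t secs = ticks / this_cpu_time.tsc_hz;
    uint64_t rem  = ticks % this_cpu_time.tsc_hz;
    return secs * 1000000000ULL + rem * 1000000000ULL / this_cpu_time.tsc_hz;
}
```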
Re: high precision sleep function
Ok, I think I found a solution for the high precision counter thing. On single-CPU systems I will use the scheduler's timer for the high precision counter, and on SMP systems I will use the PIT, because it isn't used for anything else and I can configure it to fire an interrupt every 55 ms; if a thread wants to know the current count I can just read it from the PIT.
The question now is: is a resolution of 838 ns precise enough? I don't know how much precision things like video, audio and gaming need.
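(For reference, the 838 ns figure is just the period of the PIT's 1,193,182 Hz input clock: 1 / 1,193,182 Hz ≈ 838 ns.)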
Re: high precision sleep function
Hi,
FlashBurn wrote:Ok, I think I found a solution for the high precision counter thing. On single-CPU systems I will use the scheduler's timer for the high precision counter, and on SMP systems I will use the PIT, because it isn't used for anything else and I can configure it to fire an interrupt every 55 ms; if a thread wants to know the current count I can just read it from the PIT.

For legacy/ISA devices it typically takes 1 us to access an I/O port, so you won't be able to achieve 838 ns precision.
Normally the PIT is in "low byte then high byte" mode, and the counter can change after you've read the low byte but before you've read the high byte. To avoid that you need to issue a "latch" command. That adds up to about 3 I/O port accesses (at 1 us each); so the PIT counter can be updated 4 times between complete reads, and the precision is 4*838 ns = 3.352 us. Worse, if another CPU is trying to access the PIT count at the same time the whole thing gets out-of-sync. To avoid that you need a lock (e.g. CPU1 gets the lock, sends the "latch" command, reads the low byte and high byte correctly, then frees the lock). In this case the precision is going to depend on lock contention (and there's no way to predict the "worst case").
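A rough sketch of that read sequence (outb/inb and the lock here are simplified stand-ins for whatever the kernel actually provides):

```c
#include <stdint.h>

static inline void outb(uint16_t port, uint8_t val)
{
    __asm__ __volatile__("outb %0, %1" :: "a"(val), "Nd"(port));
}

static inline uint8_t inb(uint16_t port)
{
    uint8_t val;
    __asm__ __volatile__("inb %1, %0" : "=a"(val) : "Nd"(port));
    return val;
}

static volatile int pit_lock;                 /* stand-in for a real spinlock */
static void spin_lock(volatile int *l)   { while (__sync_lock_test_and_set(l, 1)) ; }
static void spin_unlock(volatile int *l) { __sync_lock_release(l); }

uint16_t pit_read_count(void)
{
    spin_lock(&pit_lock);

    outb(0x43, 0x00);                /* latch command for channel 0 */
    uint8_t lo = inb(0x40);          /* low byte of the latched count */
    uint8_t hi = inb(0x40);          /* high byte of the latched count */

    spin_unlock(&pit_lock);
    return (uint16_t)((hi << 8) | lo);
}
```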
Alternatively it's possible to put the counter in "low byte only mode". This solves most of the problems and means you can simply read the low byte (with no "latch" command and no lock). However you still can't access the PIT count faster than 1 us; so when there's 5 CPUs trying to access it at the same time the first CPU might get the count after 1 us, the second CPU might get the count after 2 us, and the fifth CPU might get the count after 5 us. Also, using "low byte only mode" means the count is effectively only 8-bit, which means you'd need a "PIT counter rollover IRQ" every 214.528 us (256 * 838 ns), or a frequency of about 4.661 KHz. At those speeds legacy/ISA hardware can choke; and you'd need to be very careful that you don't miss IRQs because of having interrupts disabled (e.g. CLI) for too long elsewhere. There's also going to be race conditions to worry about (e.g. the counter has rolled over but the IRQ hasn't been handled yet).
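And a sketch of combining the rollover IRQ count with the 8-bit count into one growing tick value (made-up names; note that it only guards against an IRQ arriving between the two reads, not against the "counter wrapped but the IRQ is still pending" case mentioned above):

```c
#include <stdint.h>

static volatile uint64_t pit_rollovers;       /* incremented by the rollover IRQ handler */

/* stand-in for a single inb(0x40) in "low byte only" mode */
static uint8_t pit_read_low_byte(void) { return 0; }

uint64_t pit_ticks(void)
{
    uint64_t rolls;
    uint8_t  count;

    do {
        rolls = pit_rollovers;
        count = pit_read_low_byte();
    } while (rolls != pit_rollovers);         /* retry if a rollover IRQ hit in between */

    /* the PIT counts down, so convert to an up-counting value */
    return rolls * 256 + (255 - count);
}
```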
"Need" is a strange word. Mostly you do the best you can and hope it's good enough for whatever software people write (unless it's a "hard real time" OS).FlashBurn wrote: I don´t know how much precision things like video, audio and gaming need.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Owen
Re: high precision sleep function
In general, games tend to need to be able to time themselves to about 60-120 Hz. Audio tends to be timed by the audio hardware itself; video runs at 24-60 Hz, and provided the jitter is kept small it will be unnoticeable. (For an added bonus: most of the time the player will sync the video to the audio rather than the other way around.)
Re: high precision sleep function
I just did some benchmarking and these are the results (the benchmarks were done on an AMD K5-PR120):
Setting or reading the PIT: ~3350ns
Executing my scheduler (with 1 PIT reading and 1 PIT setting): ~15000ns
I will also try to benchmark on a PC with the APIC enabled (and then maybe on the same machine with the PIT too) to see how good or bad those results are in comparison.
So the first thing I can say (without having much experience of how long a scheduler usually takes): wow, this takes longer than I thought.
But even though it's quite long, I'd still say it's better than not using the timer's one-shot mode. If I take out the PIT code (~7000 ns), there are ~8000 ns left, so the scheduler alone (without reading and setting the timer) costs more than the timer code. Now say I had a scheduler using a periodic timer at 500 Hz (2 ms) and a thread which wants to run for 30 ms (= 30,000,000 ns): the scheduler would be called 14 times without enqueuing the thread (just some timer calculation, maybe 1600 ns each), which is an overhead of 14 * 1600 ns = 22,400 ns! So in the case where the thread uses all its time, the one-shot method is faster. The only problem I see is that a thread which needs to sleep very long (>1 s) will be imprecise by ~750,000 ns (750 us) per second it needs to wait (this depends on how often the scheduler has to run and how often interrupts are disabled while the timer fires an interrupt).
Maybe I should start a new thread where we could discuss which scheduler algorithms are better for single-CPU and multi-CPU systems, and how long a scheduler should take at most?!