I know. My last scheduler used this for "policy 0" (device driver IRQ handlers, etc). It's bad for normal, medium and low priority threads where the intention is to create the illusion of tasks running at the same time. Consider something like a game, where more CPU time means smoother animation. Now imagine 2 of these "as much CPU time as I can get" tasks trying to share the same CPU.rdos wrote:It is possible to let higher priority threads always interrupt lower priority threads.Brendan wrote:This is what I call "variable time slice scheduling". It's bad because high priority tasks have to wait for all medium/low priority tasks to have their turn before they get any CPU time.
Because that doesn't allow you to choose the best method of scheduling for the specific type of tasks being scheduled. If your OS is intentionally designed to be bad for everything except one specific use, then a scheduler that is bad for everything except one specific type of tasks might be fine; but most people aren't stupid enough to intentionally limit the potential uses of their OS in that way.rdos wrote:That seems awfully complex. Why not simply decide that higher priority threads are always run before lower priority threads, and that higher priority threads that become ready preempt lower priority threads?Brendan wrote:You could create different groups for different purposes. This allows you to choose the best method of scheduling for the specific type of tasks being scheduled.
Not sure where you think I claimed otherwise - I was only suggesting "many global locks" is better than "one global lock". However, it is possible to use lock-free queues (even though I didn't say so; you may not need global locks).rdos wrote:You still need global locks to be able to move threads between CPUs.Brendan wrote:For SMP, you don't want a global "scheduler lock", as many CPUs might be fighting for this lock at the same time (and wasting CPU time). The more CPUs there are the worse it would get - lock contention would cause poor scalability. If your scheduler has a lock for each CPU then the chance of CPUs fighting for the same lock is a lot less, and performance/scalability would improves. If your scheduler has 6 different locks per CPU (one for each scheduling policy) then there'd be even less lock contention and you'd improve performance/scalability even more.
You're only thinking about your own "scheduler that is bad for everything except one specific type of tasks" system. You're not taking into account tasks that should never block. For example, imagine 2 different users who are both trying to run something like Bochs at the same time - should one user get worse performance than the other user simply because their task got the slow CPU and the scheduler is unfair?rdos wrote:I disagree, but it depends on if these threads need more than 10ms on the fast CPU (or 40ms on the slow one). If the threads need more than this amount, load balancing should kick in and move them both to the faster CPU as they will execute faster there even when they compete for the same CPU.Brendan wrote:If one CPU is running at 2 GHz and another CPU is running at 500 MHz (because it got too hot and needs to cool down, for example); then if 2 tasks have the same priority and one gets 10 ms on the first CPU, then the other should get 40 ms on the other (slower) CPU, so that they both get the same fair share of CPU time.
The best way to do this is probably to use a performance monitoring counter setup to count "instructions retired". The scheduler doesn't let a task run for a certain amount of time, but instead the scheduler lets a task run for a certain number of instructions. If one task runs for 1 million instructions on one CPU and another task runs for 1 million instructions on another CPU, then the tasks get exactly the same share of CPU time regardless of how much "wall clock" time the tasks actually consumed.
If performance monitoring counters can't be used, then you can try to keep track of how fast a CPU can currently execute instructions and use the local APIC timer to approximate "instructions retired". This isn't going to be as accurate (especially when things like hyper-threading and TurboBoost are involved), but it's still going to be better than scheduling " raw time".
I agree - if your scheduler isn't very good (e.g. uses a timer in periodic mode) and your "sleep()" has poor precision for no reason, then your crappy piece of crud will probably be simpler than something that's several orders of magnitude better.rdos wrote:That also seems awfully complicated for little to no gain. If you chose a sufficiently short preemption timeout in the scheduler (in your case perhaps 50ms, although I use 1ms), then it is easier to just update a list of low precision timers every time the scheduler is invoked. This way, there would be no reentrancy issues. Additionally, you don't need any extra IRQs to update lists when there might be no entries and thus nothing to do.Brendan wrote:This also means that tasks sleep until "now() > task->wake_time"; and you don't have to decrease a "time remaining" value for every sleeping task every time the timer IRQ occurs.
This also makes the delta queue idea a bit redundant - you just use a list/queue that is sorted in order of wake time.
To reduce the "O(n)" insertion time, you can have buckets - one list for tasks that should wake up in the next 100 ms, one list for tasks that should wake up between 100 and 200 ms, one list for tasks that should wake up in 200 to 300 ms, etc. If you've got 128 lists/queues (representing 100 ms of time each) then it only covers 12.8 seconds. For delays larger than that you'd need a "longer than 12.8 seconds" list/queue too. Every 100 ms your timer IRQ handler would switch to the next list and discard the old (empty) list, then create a new list by splitting the "longer than 12.8 seconds" list.
Why stop there? If you're going to have bad precision anyway (e.g. rather than "better than micro-second" precision for "nano_sleep()"), then you can make it a lot simpler if you only update when the OS is shut down.rdos wrote:Less lock contention? Updating when the scheduler runs means no locks are needed.Brendan wrote:This means faster insertion, more queues/lists, more locks, less lock contention, etc. In theory, this can also be "per CPU" (e.g. using local APIC timer for the timer IRQ) to further reduce insertion times; with some other timer (e.g. HPET) used as a backup for when the local APIC is turned off to save power (e.g. where you merge a CPU's queues with the HPET's global queue before turning the local APIC timer off).
I also wonder if you're doing a "stop the world and reschedule all CPUs at the same time" thing that completely sucks.
And under heavy load, high priority tasks that should get CPU time immediately just sit in the global queue forever because CPUs are too busy to bother checking the global queue? You can see how this could be funny if the task sitting in the queue forever is a shell that the administrator is trying to use to kill a fork bomb.rdos wrote:That means the local CPU's thread queues must use spinlocks, somewhat defeating the purpose of having per-CPU thread queues. A much better approach is to place excess load in a global queue, with a global spinlock. Then as long as the CPU has threads to run it never needs to access any spinlocks to schedule work.Brendan wrote:The scheduler decides which CPU a task should use when a task needs to be put back onto a "ready to run" queue. This is how load balancing works (e.g. one CPU might look at it's own queue and say "I'm too busy!" and put the task on a different CPU's queue).
The purpose of the local queues was to drastically reduce the chance of lock contention. A very small chance of lock contention (even when there's hundreds of CPU under heavy load) does not defeat the purpose of local queues. Creating severe design flaws just to avoid an irrelevant "non-problem" isn't an improvement.rdos wrote:Right, which kind of defeats the purpose of local queues.Brendan wrote:A CPU can only take tasks off of its own "ready to run" queue; but because any CPU can insert a task into any other CPU's run queue you'd still need a lock for each queue.
It is a fact that nobody has ever been bitten by a shark when a they've got a pineapple jammed up their rectum; therefore (even though sharks aren't a problem on dry land and my solution is far worse than the problem it tries to solve); you should insert a pineapple up your butt to avoid being bitten by sharks.
Only if the scheduler is incredibly stupid (e.g. shifts tasks from a CPU under load to a CPU that is under even more load).rdos wrote:Another problem is that CPUs can start to crosspost threads in each others queues all the time, eventually getting stuck on the spinlocks used, and getting no work done.
Thanks for agreeing with me.rdos wrote:That's even worse since IPIs consume a massive amount of CPU cycles, primarily on the target CPU, and these cycles does nothing useful and thus contribute to more IPIs and more overload.Brendan wrote:Alternatively you could use IPIs. For example, if one CPU wants to put a task onto another CPU's queue, it could send an IPI to the other CPU to ask it to put the task on it's own queue. However, assuming low lock contention, an IPI is more expensive than locks; and because you need to tell the other CPU which task to put on its queue you'd have to store a "thread ID" somewhere and you'd probably need a lock to protect that (so a third CPU doesn't trash your "thread ID" value while you're sending the IPI).
This was only to illustrate a race condition, wasn't intended for any other purpose, and wasn't a suggested method of doing task switches.rdos wrote:By turning of interrupts for this long you have effectively ruined each and every possibility of having decent interrupt performance. I bet this is why you need minimum delays with timer IRQs.Brendan wrote:Think about tasks that block.AlfaOmega08 wrote:I don't understand how can the race condition happen.
You might do something like this:
- Task is running, timer is counting down
- Task decides to "sleep()" or block waiting for IO or something
- Kernel puts task into some sort of "waiting" state
- Kernel tries to find another task to switch to
- Kernel disables IRQs
- Kernel switches to the new task
- Kernel sets new timer count
- Kernel enables IRQs
- New task is running
By turning of interrupts for this long (during your step 9) you have "effectively ruined each and every possibility of having decent interrupt performance".rdos wrote:A more proper way to do it:
1. Task decides to "sleep()" or block waiting for IO or something
2. Kernel puts task into some sort of "waiting" state
3. Kernel acquires scheduler lock
4. Kernel processes delayed IRQs
5. Kernel tries to find another task to switch to
6. Kernel loads new timeout (if expired go to 4)
7. Kernel setup up LDT and CR3 environment
8. Kernel releases scheduler lock and disables IRQs (if waiting IRQ events, goto 3)
9. Kernel loads new register context and switches to the new thread (iretd enables interrupts again)
A more proper way to do it:
- <a new list of steps that is essentially identical to rdos's list with a few different/irrelevant details, that was essentially identical to my original list with a few different/irrelevant details>
Cheers,
Brendan