How to organize threads in an SMP scheduler
Re: How to organize threads in an SMP scheduler
Nice that you document what you are doing, as I'm very interested in this topic.
I can't imagine that a per-CPU/core queue is better than a global queue. I know that the global queue approach has a lot of lock contention, but what if one could make it a lock-free queue (or one per priority)?
With this approach one would honor priorities, and it should also be better for multithreaded programs, because I can imagine that it is possible for 2 or more threads of the same program to end up on the same CPU/core. I know that this can also happen with a global queue, but it is less likely(?).
Re: How to organize threads in an SMP scheduler
I'm wondering where your IPIs come from? I see no point in using IPIs in the scheduler.
What I do is this:
- 32 priority queues per CPU, each protected by a spinlock (only runnable threads are kept in the list)
- A single queue for just-woken threads which aren't assigned to a CPU yet, protected by a single spinlock
- Check what the highest-priority thread (1) is in the CPU queue
- Check what the highest-priority thread (2) is in the just-woken queue
- If (1) has higher priority than (2), just schedule it; else pull (2) over and schedule it
- If a CPU's queues are overloaded compared to the other CPUs, push threads to the least-loaded CPU
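For what it's worth, here's a rough C sketch of that pick-next logic (all names, types and the "0 = highest priority" convention are my own assumptions, not actual code from this thread):

#include <stdatomic.h>
#include <stddef.h>

#define MAX_CPUS        8   /* illustrative */
#define NUM_PRIORITIES 32   /* 0 = highest priority, 31 = lowest */

typedef struct thread {
    struct thread *next;
    int            priority;
} thread_t;

typedef struct {
    atomic_int  lock;                       /* per-CPU spinlock */
    thread_t   *queue[NUM_PRIORITIES];      /* runnable threads only */
} cpu_runqueue_t;

static cpu_runqueue_t cpu_rq[MAX_CPUS];
static atomic_int     woken_lock;           /* protects the just-woken queue */
static thread_t      *woken_queue;          /* woken, not yet assigned; assumed
                                               kept sorted by priority */

static void spin_lock(atomic_int *l)   { while (atomic_exchange(l, 1)) { } }
static void spin_unlock(atomic_int *l) { atomic_store(l, 0); }

/* Compare the best local thread with the best just-woken thread and
 * run whichever has the higher priority (lower number). */
thread_t *pick_next(int cpu)
{
    cpu_runqueue_t *rq = &cpu_rq[cpu];
    thread_t *chosen = NULL;
    int best_local = NUM_PRIORITIES;        /* "nothing runnable locally" */

    spin_lock(&rq->lock);
    for (int p = 0; p < NUM_PRIORITIES; p++)
        if (rq->queue[p]) { best_local = p; break; }
    spin_unlock(&rq->lock);

    spin_lock(&woken_lock);
    if (woken_queue && woken_queue->priority < best_local) {
        chosen = woken_queue;               /* pull the woken thread over */
        woken_queue = chosen->next;
    }
    spin_unlock(&woken_lock);
    if (chosen)
        return chosen;

    spin_lock(&rq->lock);                   /* otherwise take the local one */
    for (int p = 0; p < NUM_PRIORITIES; p++)
        if (rq->queue[p]) {
            chosen = rq->queue[p];
            rq->queue[p] = chosen->next;
            break;
        }
    spin_unlock(&rq->lock);
    return chosen;
}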
FlashBurn wrote: I know that the global queue approach has a lot of lock contention, but what if one could make it a lock-free queue (or one per priority)?

With lock-free, you're thinking of the "CAS way"? CPUs will still compete at this point. The only advantage I see here is that it would be interruptible.
Re: How to organize threads in an SMP scheduler
Hi,
"However, the problem is when a thread is <blocked or spinning?> on a critical section. It needs to keep the spinlock until it has been queued in <the global wait list or one of the many different wait-lists?> , but with the current scheduler-design this is not possible."
For threads that are blocked on a critical section; "blocked" implies some sort of mutex where the thread is given no CPU time at all until the mutex is released. In this case you have (for e.g.) one queue for each mutex, where all tasks that are blocked/waiting to acquire a mutex are put onto the queue for that mutex. When a thread releases the mutex it checks the mutex's queue and unblocks the first thread that is waiting for the mutex (if necessary). The mutex's queue needs to be protected by a spinlock; so for a system with 123 mutexes you have 123 queues and 123 splinlocks.
If a thread is spinning on a critical section; then the scheduler doesn't necessarily need to know or care. The main problem is avoiding deadlocks where 2 (or more) things on the same CPU try to use the same spinlock at the same time. A simple example is a spinlock that is acquired by an IRQ handler - if the lock is held by something else on the same CPU when the IRQ occurs, then the IRQ handler can't acquire the lock until the IRQ handler returns, so the IRQ handler spins forever trying to acquire the spinlock. The same deadlock problem can occur for locks that the scheduler requires to perform task switches - something else running on the CPU acquires the lock, then the scheduler tries to do a task switch but can't because something has a lock it requires, so the scheduler spins forever.
There is another problem though. If a spinlock is never used by an IRQ handler or by the scheduler; one thread can acquire the spinlock, then the scheduler can switch to another thread, then all other threads will be unable to acquire that lock until the first thread gets CPU time and releases it. This will work correctly, but it will also cause major performance problems (threads could waste lots of CPU time spinning). There's 2 ways to avoid this problem - use mutexes instead (which are relatively expensive and should only be used where the critical sections are relatively long), or disable/postpone task switches (for that CPU) while a thread has acquired a spinlock (on that CPU).
This means that to avoid deadlocks and performance problems, the kernel needs 2 different types of spinlocks:
If you think about this, you'll notice that code running in a critical section inside an IRQ handler can ask the scheduler to do a task switch without any problem (the task switch will be delayed until it's safe).
For threads that block there's 2 different situations. The first is when the thread has been put on some sort of queue or data structure corresponding to some sort of resource, and something else will put the thread onto a "ready to run" queue when whatever it's waiting for occurs (e.g. a timer IRQ will unblock the thread eventually, another thread releasing a mutex will unblock the thread eventually, etc).
The other situation is that the thread is blocked for some reason where no queue or data structure is involved. In this case all you need is a set of flags for each thread that says what reason/s it's blocked. An example would be "waiting for IPC", where the kernel clears the thread's "waiting for IPC" flag when the thread receives some IPC and then puts the thread onto a "ready to run" queue if none of the flags are set.
For the "set of flags that say if/why a thread is blocked", I'd include one flag for all of the "thread is put on some sort of queue or data structure" cases. Then the scheduler can provide 2 generic functions - one to block the current task (e.g. "block_thread(int flags);") that sets the "why a thread is blocked" flag/s and causes a task switch; and another function (e.g. "clearFlag(threadID thread, int flags);") that clears flags and takes care of unblocking threads and pre-emption.
If you combine all of the above; you get the following primitives:
Cheers,
Brendan
rdos wrote: OK, so the load balancer seems to work OK, but some synchronization primitives seem to fail. Now that the scheduler-locks are per core (and not per system), some things that could previously be handled by locking the scheduler no longer can be handled like this.

That is good - some things that should never have been handled by taking a global scheduler lock need to be found and fixed so they don't rely on the existence of any single global lock.
rdos wrote: Mostly it seems to be critical sections that are broken. These need to use the global list spinlock. However, the problem is when a thread is blocked on a critical section. It needs to keep the spinlock until it has been queued in the wait-list, but with the current scheduler-design this is not possible. IOW, I need a new blocking function that presumes the global list spinlock is already taken, puts the thread in the target list, and releases the spinlock. Ideally, it should first put the thread in the section's blocked list, release the spinlock, save the thread context (registers) and reschedule. Alternatively, save the thread context, put the thread in the section's blocked list, release the spinlock, and reschedule. The latter scenario would create more lock-contention, but it would be faster when there is no lock-contention. Maybe critical sections should even contain a spinlock field so each section could have a private spinlock instead. This would increase the size of critical section data for non-SMP systems, but it would scale better on SMP systems.

Not sure I understand this properly. For example:
"However, the problem is when a thread is <blocked or spinning?> on a critical section. It needs to keep the spinlock until it has been queued in <the global wait list or one of the many different wait-lists?> , but with the current scheduler-design this is not possible."
For threads that are blocked on a critical section; "blocked" implies some sort of mutex where the thread is given no CPU time at all until the mutex is released. In this case you have (for example) one queue for each mutex, where all tasks that are blocked/waiting to acquire a mutex are put onto the queue for that mutex. When a thread releases the mutex it checks the mutex's queue and unblocks the first thread that is waiting for the mutex (if necessary). The mutex's queue needs to be protected by a spinlock; so for a system with 123 mutexes you have 123 queues and 123 spinlocks.
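A minimal sketch of such a blocking mutex, assuming hypothetical scheduler helpers (current_thread(), block_current_thread(), unblock_thread()); note it glosses over making "enqueue + block" atomic with respect to release, which is the very problem discussed above:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct thread {
    struct thread *next;
} thread_t;

/* Assumed scheduler helpers - not from this post. */
extern thread_t *current_thread(void);
extern void      block_current_thread(void);   /* no CPU time until unblocked */
extern void      unblock_thread(thread_t *t);

/* One wait queue and one spinlock per mutex, as described above. */
typedef struct {
    atomic_int lock;          /* protects 'owned' and the wait queue */
    bool       owned;
    thread_t  *wait_head;
    thread_t  *wait_tail;
} mutex_t;

static void spin_lock(atomic_int *l)   { while (atomic_exchange(l, 1)) { } }
static void spin_unlock(atomic_int *l) { atomic_store(l, 0); }

void mutex_acquire(mutex_t *m)
{
    spin_lock(&m->lock);
    if (!m->owned) {
        m->owned = true;                        /* uncontended: just take it */
        spin_unlock(&m->lock);
        return;
    }
    /* Busy: put ourselves on this mutex's queue and block.  (A real kernel
     * must make "enqueue + block" atomic w.r.t. mutex_release.) */
    thread_t *self = current_thread();
    self->next = NULL;
    if (m->wait_tail) m->wait_tail->next = self; else m->wait_head = self;
    m->wait_tail = self;
    spin_unlock(&m->lock);
    block_current_thread();
}

void mutex_release(mutex_t *m)
{
    spin_lock(&m->lock);
    thread_t *next = m->wait_head;              /* first waiter, if any */
    if (next) {
        m->wait_head = next->next;
        if (!m->wait_head) m->wait_tail = NULL;
        /* ownership passes directly to the woken thread ('owned' stays true) */
    } else {
        m->owned = false;
    }
    spin_unlock(&m->lock);
    if (next)
        unblock_thread(next);
}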
If a thread is spinning on a critical section; then the scheduler doesn't necessarily need to know or care. The main problem is avoiding deadlocks where 2 (or more) things on the same CPU try to use the same spinlock at the same time. A simple example is a spinlock that is acquired by an IRQ handler - if the lock is held by something else on the same CPU when the IRQ occurs, then the lock can't be released until the IRQ handler returns, so the IRQ handler spins forever trying to acquire it. The same deadlock can occur for locks that the scheduler needs to perform task switches - something else running on the CPU acquires the lock, then the scheduler tries to do a task switch but can't because that something still holds the lock it needs, so the scheduler spins forever.
There is another problem though. If a spinlock is never used by an IRQ handler or by the scheduler; one thread can acquire the spinlock, then the scheduler can switch to another thread, then all other threads will be unable to acquire that lock until the first thread gets CPU time and releases it. This will work correctly, but it will also cause major performance problems (threads could waste lots of CPU time spinning). There's 2 ways to avoid this problem - use mutexes instead (which are relatively expensive and should only be used where the critical sections are relatively long), or disable/postpone task switches (for that CPU) while a thread has acquired a spinlock (on that CPU).
This means that to avoid deadlocks and performance problems, the kernel needs 2 different types of spinlocks:
- Spinlocks that aren't acquired by IRQ handlers, where task switches (for that CPU) are postponed while the spinlock is held
- Spinlocks that are acquired by IRQ handlers, where task switches (for that CPU) are postponed and IRQs are disabled (for that CPU) while the spinlock is held
Acquiring one of these spinlocks goes something like:
- disable IRQs for the CPU
- attempt to acquire the spinlock
- if the lock was not acquired, re-enable IRQs and do a NOP or PAUSE (so it doesn't mess up IRQ latency), then disable IRQs and try to acquire the lock again
- increase the "number of spinlocks held" counter for the CPU
- re-enable IRQs
- if the "number of spinlocks held" counter" is non-zero then:
- set the "task switch was postponed" flag for this CPU
- return without doing the task switch
- else:
- do the task switch
Releasing the spinlock goes something like:
- disable IRQs for the CPU
- release the actual spinlock
- decrease the "number of spinlocks held" counter for the CPU
- if the "number of spinlocks held" was decreased to zero; and if the "task switch was postponed" flag is set:
- clear the "task switch was postponed" flag
- tell the scheduler to do a task switch
- re-enable IRQs
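As a C sketch of those three sequences, shown for a single CPU and with cli()/sti() and the other helpers standing in as placeholders (my names, not code from this post):

#include <stdatomic.h>
#include <stdbool.h>

/* Assumed helpers - placeholders for whatever the kernel provides. */
extern void cli(void);        /* disable IRQs on this CPU */
extern void sti(void);        /* enable IRQs on this CPU */
extern void cpu_pause(void);  /* PAUSE/NOP hint while spinning */
extern void do_task_switch(void);

/* Per-CPU state (one CPU shown for brevity). */
static int  spinlocks_held;           /* "number of spinlocks held" counter */
static bool task_switch_postponed;

/* Acquire (first type: not used by IRQ handlers).  For the second type the
 * final sti() is skipped, so IRQs stay disabled while the lock is held. */
void spinlock_acquire(atomic_int *lock)
{
    for (;;) {
        cli();
        if (!atomic_exchange(lock, 1))
            break;
        sti();                        /* don't hold IRQs off while spinning */
        cpu_pause();
    }
    spinlocks_held++;
    sti();
}

/* The scheduler's "do a task switch" entry point. */
void request_task_switch(void)
{
    if (spinlocks_held != 0) {
        task_switch_postponed = true; /* not safe now - do it at release time */
        return;
    }
    do_task_switch();
}

/* Release. */
void spinlock_release(atomic_int *lock)
{
    cli();
    atomic_store(lock, 0);
    if (--spinlocks_held == 0 && task_switch_postponed) {
        task_switch_postponed = false;
        do_task_switch();             /* the postponed switch happens here */
    }
    sti();
}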
If you think about this, you'll notice that code running in a critical section inside an IRQ handler can ask the scheduler to do a task switch without any problem (the task switch will be delayed until it's safe).
For threads that block there's 2 different situations. The first is when the thread has been put on some sort of queue or data structure corresponding to some sort of resource, and something else will put the thread onto a "ready to run" queue when whatever it's waiting for occurs (e.g. a timer IRQ will unblock the thread eventually, another thread releasing a mutex will unblock the thread eventually, etc).
The other situation is that the thread is blocked for some reason where no queue or data structure is involved. In this case all you need is a set of flags for each thread that says what reason/s it's blocked. An example would be "waiting for IPC", where the kernel clears the thread's "waiting for IPC" flag when the thread receives some IPC and then puts the thread onto a "ready to run" queue if none of the flags are set.
For the "set of flags that say if/why a thread is blocked", I'd include one flag for all of the "thread is put on some sort of queue or data structure" cases. Then the scheduler can provide 2 generic functions - one to block the current task (e.g. "block_thread(int flags);") that sets the "why a thread is blocked" flag/s and causes a task switch; and another function (e.g. "clearFlag(threadID thread, int flags);") that clears flags and takes care of unblocking threads and pre-emption.
If you combine all of the above; you get the following primitives:
- 2 slightly different types of spinlocks
- 1 function to block threads
- 1 function to unblock threads
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: How to organize threads in an SMP scheduler
Hi,
FlashBurn wrote: I know that the global queue approach has a lot of lock contention, but what if one could make it a lock-free queue (or one per priority)?

Lock-free sucks worse than spinlocks.
With a spinlock, the CPU spins until the lock is free, which wastes CPU time but doesn't necessarily waste as much of the CPU's resources if it's done properly (e.g. with PAUSE, etc).
With lock-free, the CPU repeatedly attempts some sort of calculation or operation until it can successfully commit the result, which wastes a similar amount of CPU time but uses more CPU resources (which is worse for things like hyper-threading where CPU resources are shared).
In addition, simple spinlocks and lock-free code both lack "fairness"; which means that some CPUs can be lucky and make progress while other CPUs can be unlucky and repeatedly fail to make any progress. Without fairness, the worst case time until progress is made by any CPU is effectively infinite (as there's no guarantee that a specific CPU won't always be "unlucky"). For spinlocks it's not too hard to fix the problem (e.g. replace the spinlocks with some form of ticket lock). For lock-free you're screwed.
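For reference, the ticket lock mentioned above can be as small as this C11 sketch:

#include <stdatomic.h>

/* Every CPU takes a ticket; the lock is granted strictly in ticket order,
 * so no CPU can stay "unlucky" forever - that is the fairness guarantee. */
typedef struct {
    atomic_uint next_ticket;
    atomic_uint now_serving;
} ticket_lock_t;

void ticket_lock(ticket_lock_t *l)
{
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != my_ticket)
        ;                             /* a PAUSE hint would go here */
}

void ticket_unlock(ticket_lock_t *l)
{
    atomic_fetch_add(&l->now_serving, 1);
}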
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: How to organize threads in an SMP scheduler
cyr1x wrote: I'm wondering where your IPIs come from? I see no point in using IPIs in the scheduler.

In general; if a thread becomes unblocked (for any reason, including messaging/IPC), and if the thread is a higher priority than the currently running thread; then the higher priority thread that you unblocked could/should preempt the lower priority thread. If the unblocked thread is meant to run on the same CPU you do an immediate task switch, and if the unblocked thread is meant to run on a different CPU you send an IPI to that CPU so that the scheduler can do a task switch on that CPU.

rdos wrote: The load-balancer approach is a complete disaster for the message passing application. It organizes the threads so that the common receiver-thread is on one core, and the sender threads on the remaining cores. This means each message needs two IPIs, and performance becomes really bad.
I'd assume RDOS is using synchronous messaging. For example, when one thread sends a message to another thread, the sending thread is blocked and an IPI may be sent to the CPU that the receiving thread is run on; then the receiving thread handles the message and sends a reply back to the first thread (which may involve a second IPI).
If the sender and receiver are running on different CPUs and are the highest priority tasks for their CPUs; then this means there will be 2 IPIs for pre-emption for every "send and reply".
There's only 3 ways I can think of to reduce the problem - simply don't send the IPI at all (which would seriously affect latency and make the entire OS feel sluggish), only send the IPI when the unblocked task is significantly higher priority (which would be a compromise between IPI overhead and latency), or switch to asynchronous messaging.
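The middle option could be sketched roughly like this in C (all names, fields and the threshold value are illustrative assumptions, not anyone's actual code):

#include <stdbool.h>

typedef struct thread {
    int priority;                 /* higher number = more important, here */
    int cpu;                      /* CPU the thread is assigned to */
} thread_t;

/* Assumed helpers. */
extern int       this_cpu(void);
extern thread_t *running_on(int cpu);
extern void      ready_queue_insert(thread_t *t);
extern void      do_task_switch(void);
extern void      send_reschedule_ipi(int cpu);

#define PREEMPT_THRESHOLD 2       /* only preempt for "big" priority gaps */

void unblock_thread(thread_t *t)
{
    ready_queue_insert(t);

    thread_t *cur = running_on(t->cpu);
    if (t->priority >= cur->priority + PREEMPT_THRESHOLD) {
        if (t->cpu == this_cpu())
            do_task_switch();                 /* immediate local preemption */
        else
            send_reschedule_ipi(t->cpu);      /* IPI only when it's worth it */
    }
    /* Otherwise the thread waits for the next natural task switch,
     * trading some latency for fewer IPIs. */
}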
I'd also assume some of the overhead RDOS is seeing could be attributed to thread switching overheads (and possibly remaining lock contention in either the IPC code or the scheduler); and isn't all directly (rather than indirectly) caused by the IPIs alone.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: How to organize threads in an SMP scheduler
Brendan wrote: Lock-free sucks worse than spinlocks.

This could be, but only if interrupts are disabled. Because if a thread can be preempted, then lock-free algorithms are better (you don't need to spin and you can still make progress, even though another thread had already tried to do something and was preempted).
What I don't get: what if there is a CPU (A) with 3 threads in its queue and 2 CPUs (B, C) with 1 thread each in their queues, and the threads on B and C are waiting on something - what do B and C do now? I think the load-balancer will be run by both, and both will try to get a thread from the other CPU (A), so they need to take the spinlock for that CPU's queue - and now imagine that at this moment CPU A also wants to take that spinlock. This would be the same situation as if I had a global queue, wouldn't it?
Another thing I don't understand: does the load-balancer need to go through all the CPUs' queues (and a global, maybe unblocking, queue) to find a new thread? Where does the load-balancer get the information about which CPU would be a good one to take a thread from?
Re: How to organize threads in an SMP scheduler
FlashBurn wrote: Another thing I don't understand: does the load-balancer need to go through all the CPUs' queues (and a global, maybe unblocking, queue) to find a new thread? Where does the load-balancer get the information about which CPU would be a good one to take a thread from?

In my case it doesn't need to lock the threads, and it doesn't need to go through CPU queues. It only reads the number of clock-ticks a thread has consumed, including how many clock-ticks a CPU's null-thread has consumed. This information is available without locking. At most it needs to block out the thread deletion handler, so it doesn't use deleted threads. The move is also done without locking, by modifying the core the thread is linked to.
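So the balancing decision can be made from plain counter reads; a sketch (counter name and pick_movable_thread() are invented, and a real balancer would compare deltas over the last interval rather than raw totals):

#include <stdint.h>
#include <stddef.h>

#define MAX_CPUS 8   /* illustrative */

typedef struct thread {
    int cpu;                      /* core the thread is linked to */
} thread_t;

/* Ticks consumed by each CPU's null (idle) thread.  Each counter is only
 * written by its own CPU, so the balancer can read them without any lock -
 * a slightly stale value is harmless. */
extern volatile uint64_t idle_ticks[MAX_CPUS];

extern thread_t *pick_movable_thread(int cpu);   /* assumed helper */

void balance_once(void)
{
    int busiest = 0, idlest = 0;

    for (int i = 1; i < MAX_CPUS; i++) {
        if (idle_ticks[i] < idle_ticks[busiest]) busiest = i;
        if (idle_ticks[i] > idle_ticks[idlest])  idlest  = i;
    }
    if (busiest == idlest)
        return;                                   /* load already even enough */

    thread_t *t = pick_movable_thread(busiest);
    if (t)
        t->cpu = idlest;          /* "moving" is just relinking to another core */
}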
Re: How to organize threads in an SMP scheduler
Brendan wrote: I'd assume RDOS is using synchronous messaging. For example, when one thread sends a message to another thread, the sending thread is blocked and an IPI may be sent to the CPU that the receiving thread is run on; then the receiving thread handles the message and sends a reply back to the first thread (which may involve a second IPI). If the sender and receiver are running on different CPUs and are the highest priority tasks for their CPUs; then this means there will be 2 IPIs for pre-emption for every "send and reply".

Yes, that is correct. That is why there are two IPIs if the threads are on different cores.
Brendan wrote: There's only 3 ways I can think of to reduce the problem - simply don't send the IPI at all (which would seriously affect latency and make the entire OS feel sluggish), only send the IPI when the unblocked task is significantly higher priority (which would be a compromise between IPI overhead and latency), or switch to asynchronous messaging.

Yes, it can be solved in various ways, but there is a point in optimizing performance for this very scenario, because IPC is based on one of the fundamental primitives in the scheduler, the Signal/WaitForSignal functions. So if the message passing application performs well, so will many other things.
Re: How to organize threads in an SMP scheduler
Brendan wrote: That is good - some things that should never have been handled by taking a global scheduler lock need to be found and fixed so they don't rely on the existence of any single global lock.

Yes, and even if I go back to using a global list for threads ready to run, the locked sections will be much shorter in the new design, only locking the code that directly manipulates these lists. The advantage of a global list of threads ready to run is that a core that wants something to do can directly check the global list for the highest priority thread, and run it. The need for IPIs to notify that a new thread is available would no longer exist, and the message passing application would run at full speed at all times.
Brendan wrote: For threads that are blocked on a critical section; "blocked" implies some sort of mutex where the thread is given no CPU time at all until the mutex is released. In this case you have (for example) one queue for each mutex, where all tasks that are blocked/waiting to acquire a mutex are put onto the queue for that mutex. When a thread releases the mutex it checks the mutex's queue and unblocks the first thread that is waiting for the mutex (if necessary). The mutex's queue needs to be protected by a spinlock; so for a system with 123 mutexes you have 123 queues and 123 spinlocks.

Yes, something like that. A critical section is a simple mutex where the thread is blocked if the section is busy. The problem was that the whole action of blocking the thread was not atomic (which I've solved now).
Brendan wrote: If a thread is spinning on a critical section; then the scheduler doesn't necessarily need to know or care. The main problem is avoiding deadlocks where 2 (or more) things on the same CPU try to use the same spinlock at the same time. A simple example is a spinlock that is acquired by an IRQ handler - if the lock is held by something else on the same CPU when the IRQ occurs, then the lock can't be released until the IRQ handler returns, so the IRQ handler spins forever trying to acquire it.

Yes, which is why Signal and WakeUp need to use spinlocks that disable interrupts, while spinlocks on critical sections can leave interrupts enabled, since critical sections are not allowed in IRQs.
Brendan wrote: There is another problem though. If a spinlock is never used by an IRQ handler or by the scheduler; one thread can acquire the spinlock, then the scheduler can switch to another thread, then all other threads will be unable to acquire that lock until the first thread gets CPU time and releases it.

Yes, which is why I have the scheduler locks, which are not spinlocks, but a way to tell the scheduler that it cannot switch tasks. These are mostly useful for IRQs, which can interrupt both scheduler code and non-scheduler code. When scheduler code is interrupted, any wakeups or signals will be deferred and handled after the scheduler is back to "nesting-level" 0 again.
Re: How to organize threads in an SMP scheduler
cyr1x wrote: I'm wondering where your IPIs come from? I see no point in using IPIs in the scheduler.
What I do is this:
- 32 priority queues per CPU, each protected by a spinlock (only runnable threads are kept in the list)
- A single queue for just-woken threads which aren't assigned yet, protected by a single spinlock

Right now I'm planning to implement the reverse.
I will have a global queue of threads ready to run, because this makes it possible to always run the highest priority threads, and it eliminates IPIs when signalling between threads on different cores. OTOH, I think I will have a list of just woken threads per core, and have one spinlock per core to protect these lists. The only thing I've not decided on is how to keep threads waiting for signals. I think the best would be to keep those in a shared global list as well, protected with a global spinlock, and when they are woken up by a core, move them to this core's wakeup-list.
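As a rough data-layout sketch of that plan (all names invented, not actual RDOS structures):

#include <stdatomic.h>

#define MAX_CPUS 8   /* illustrative */

typedef struct thread {
    struct thread *next;
} thread_t;

/* Global queue of threads ready to run - one global spinlock. */
static atomic_int ready_lock;
static thread_t  *global_ready_queue;

/* Per-core "just woken" lists - one spinlock per core. */
static atomic_int wakeup_lock[MAX_CPUS];
static thread_t  *wakeup_list[MAX_CPUS];

/* Shared list of threads waiting for signals - one global spinlock. */
static atomic_int wait_lock;
static thread_t  *signal_wait_list;

static void spin_lock(atomic_int *l)   { while (atomic_exchange(l, 1)) { } }
static void spin_unlock(atomic_int *l) { atomic_store(l, 0); }

extern void list_remove(thread_t **head, thread_t *t);   /* assumed helper */

/* Waking a waiter moves it from the shared wait list to the waking
 * core's own wakeup list. */
void signal_thread(thread_t *t, int waking_cpu)
{
    spin_lock(&wait_lock);
    list_remove(&signal_wait_list, t);
    spin_unlock(&wait_lock);

    spin_lock(&wakeup_lock[waking_cpu]);
    t->next = wakeup_list[waking_cpu];
    wakeup_list[waking_cpu] = t;
    spin_unlock(&wakeup_lock[waking_cpu]);
}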
Re: How to organize threads in an SMP scheduler
Hi,
rdos wrote: I will have a global queue of threads ready to run, because this makes it possible to always run the highest priority threads, and it eliminates IPIs when signalling between threads on different cores.

Imagine a system with 8 CPUs. Seven of the CPUs have just switched to low priority tasks, except for one CPU which is running an extremely high priority task. An IRQ occurs on the CPU running the extremely high priority task, which causes a very high priority task to be unblocked. The very high priority task doesn't/shouldn't preempt the extremely high priority task on the CPU that received the IRQ. The other seven CPUs don't know the very high priority task exists until they check the scheduler's queue, and that won't happen until the low priority tasks use the CPU time they were given. Because of this it takes up to 100 ms (or something like that) before the very high priority task gets any CPU time, because the scheduler failed to preempt the low priority tasks.
In addition; because the scheduler doesn't optimise for anything (e.g. NUMA, cache sharing, hyper-threading), the end user configured CPU affinity masks for most tasks to try to avoid performance penalties. When the scheduler running on a CPU needs to decide which task to run next, it acquires the global lock, then searches the "ready to run" queue looking for the highest priority task that the CPU is able to run. While one CPU is doing this expensive search, all the other CPUs can't acquire the global lock and can't do any useful work.
Because the OS makes heavy use of synchronous messages (lots and lots of task switches, etc), under heavy load, for a system with 2 CPUs each CPU spends 65% of its time doing useful work, 25% of its time trying/failing to acquire the global lock, 5% of its time searching the "ready to run" queue and 5% of its time doing task switches; and is only slightly faster than a single-CPU machine. Under heavy load, for a system with 64 CPUs each CPU spends 1% of its time doing useful work, 96% of its time trying/failing to acquire the global lock, 2% of its time searching the "ready to run" queue and 1% of its time doing task switches; and is 50 times slower than a single-CPU machine.
Per-CPU locks and per-CPU queues are barely adequate. Every time you say "global", an innocent puppy dies.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: How to organize threads in an SMP scheduler
Brendan wrote: Imagine a system with 8 CPUs. Seven of the CPUs have just switched to low priority tasks, except for one CPU which is running an extremely high priority task. An IRQ occurs on the CPU running the extremely high priority task, which causes a very high priority task to be unblocked. The very high priority task doesn't/shouldn't preempt the extremely high priority task on the CPU that received the IRQ. The other seven CPUs don't know the very high priority task exists until they check the scheduler's queue, and that won't happen until the low priority tasks use the CPU time they were given. Because of this it takes up to 100 ms (or something like that) before the very high priority task gets any CPU time, because the scheduler failed to preempt the low priority tasks.

OK. I suppose I could keep a current priority-level per CPU, and send an IPI to one CPU when this happens. But it is a trade-off. It takes time in the scheduler to check this, and the task will run within 1 ms (the maximum timeslice) anyway.
Brendan wrote: In addition; because the scheduler doesn't optimise for anything (e.g. NUMA, cache sharing, hyper-threading), the end user configured CPU affinity masks for most tasks to try to avoid performance penalties. When the scheduler running on a CPU needs to decide which task to run next, it acquires the global lock, then searches the "ready to run" queue looking for the highest priority task that the CPU is able to run. While one CPU is doing this expensive search, all the other CPUs can't acquire the global lock and can't do any useful work.

Finding (and dequeueing) the highest priority thread is not an expensive operation. It typically takes 10-20 instructions, because the thread to run is simply the head element of the list for the current priority-level.
Brendan wrote: Because the OS makes heavy use of synchronous messages (lots and lots of task switches, etc), under heavy load, for a system with 2 CPUs each CPU spends 65% of its time doing useful work, 25% of its time trying/failing to acquire the global lock, 5% of its time searching the "ready to run" queue and 5% of its time doing task switches; and is only slightly faster than a single-CPU machine. Under heavy load, for a system with 64 CPUs each CPU spends 1% of its time doing useful work, 96% of its time trying/failing to acquire the global lock, 2% of its time searching the "ready to run" queue and 1% of its time doing task switches; and is 50 times slower than a single-CPU machine.

No, RDOS does not use messaging internally. That is for applications only. It makes frequent use of Signal/WaitForSignal in the kernel, typically when IRQs signal server-threads, and when timers abort requests.
Brendan wrote: Per-CPU locks and per-CPU queues are barely adequate. Every time you say "global", an innocent puppy dies.

Re: How to organize threads in an SMP scheduler
OK, so here is a revised plan of how to implement SMP in my kernel:
* The scheduler lock (which locks out thread switches on the core) is per core, and is identical between SMP and non-SMP systems. It doesn't use a spinlock and it doesn't disable interrupts (other than briefly). The lock comes in three variants: from ordinary code (LockCore/UnlockCore), from ISRs (TryLockCore/TryUnlockCore), and when the scheduler is about to load a new thread (LoadUnlock, which disables interrupts). All the locks described below must be taken inside a scheduler-locked section of code.
* The global ready-queue. It is protected by a global spinlock on SMP, and doesn't need to disable interrupts. It is a null operation on non-SMP. This queue cannot be accessed from IRQs, which is why it doesn't need to disable interrupts. On SMP it might nevertheless disable interrupts anyway in order to minimize lock-contention.
* Wakeup queues per core. They don't need spinlocks on SMP, but need to disable interrupts on both SMP and non-SMP. This queue can be accessed from IRQs, but not from other cores.
* Timed waits would use a local timer on the core. When the timeout fires, the thread would be placed in the wakeup queue.
* Signals would use a per-thread spinlock, and would need to disable interrupts. Threads waiting for signals would not be queued (similar to timed waits), but would indicate their state in the thread-control block. When the thread is signalled, it would be placed in the wakeup-queue.
* Critical section (mutex). A local spinlock variable is added to the data-type. Since critical sections cannot be acquired / released from IRQs, there is no need to disable interrupts.
Since both wakeups and signals can be sent from IRQs, and the global ready-queue should not be accessed from IRQs, these events are recorded as entries in the local wakeup queue instead of being directly posted to the ready-queue. The scheduler might decide to use these entries directly if the ready-queue is empty or has insufficient priority-entries (this would improve messaging). In fact, it might be a very good idea to favor messaging code over polling code, which would be the case if entries in the local wakeup queue were processed before entries in the global ready-queue (when they have the same priority). This would also considerably reduce lock-contention for the global ready-queue (there is no need to take the lock just to find the highest pending priority). The only problem would be possible starvation of entries in the global ready-queue. Maybe a mixed algorithm would be useful: sometimes the local wakeups would be posted to the ready-queue, but mostly they would execute first.
Some additional thoughts. The wakeups need to be processed when the scheduler lock goes down to inactive, so the entries would need to be queued in a local ready-queue if they were to be executed first. So maybe this is a good compromise between a global ready-queue design and a local ready-queue design? When an entry in the local ready-queue runs for too long (CPU-bound), it would be preempted and put in the global ready-queue, while if it blocks on something, it would end up in some core's local ready-queue. There would also be a mechanism that forces entries from the wakeup-queue to be posted to the global ready-queue under situations of starvation (perhaps every 8th or 16th entry could be posted to the global ready-queue?). This seems like a solution that would scale well for short bursts of execution, while still keeping the important aspect of priorities for CPU-bound threads.
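A sketch of that compromise (helper names and the interval of 8 are placeholders, not a final design):

#include <stddef.h>

typedef struct thread thread_t;

/* Assumed helpers. */
extern thread_t *wakeup_dequeue(int cpu);             /* NULL when empty */
extern void      local_ready_enqueue(int cpu, thread_t *t);
extern void      global_ready_enqueue(thread_t *t);

#define GLOBAL_POST_INTERVAL 8   /* every Nth wakeup goes to the global queue */

/* Called when the scheduler lock on this core drops back to "inactive":
 * most wakeups go to the cheap local ready-queue, but every Nth one is
 * posted globally so the global queue can't be starved. */
void drain_wakeup_queue(int cpu)
{
    static unsigned posted;      /* would be per-CPU in a real kernel */
    thread_t *t;

    while ((t = wakeup_dequeue(cpu)) != NULL) {
        if (++posted % GLOBAL_POST_INTERVAL == 0)
            global_ready_enqueue(t);
        else
            local_ready_enqueue(cpu, t);
    }
}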
Another thought on this that has to do with load-balancing. If one core becomes overloaded with threads in its local ready-queue (which can be measured by comparing the number of thread-loads from the local ready-queue with the number of thread-loads from the global ready-queue), the scheduler could decide to post more of the entries in the wakeup queue to the global ready-queue in order to balance load.
Determining the ratio of posts to local vs global ready-queue for cores could be the job of a new, modified load balancer. This figure would directly affect load balancing. A high percentage of posts to the global ready-queue would affect performance negatively, but would improve load-balancing, while a low percentage would increase performance, but potentially lead to unbalanced load.
The scheduler's policy when loading a new thread is simple. If the global ready-queue contains threads with higher priority than the local ready-queue, lock the global ready-queue, dequeue the thread, release the lock and run that thread. Otherwise, dequeue an entry from the local ready-queue and run that thread.
The scheduler's policy when a thread stops executing is also simple. If the thread was blocked, it is put in a blocked queue of some sort. If not, it is posted to the global ready-queue.
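Put together, the two policies might look like this sketch (helper names invented; the unlocked peek at the global queue's best priority is what avoids most of the lock contention):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct thread thread_t;

/* Assumed helpers. */
extern int       local_best_priority(int cpu);        /* best priority queued locally */
extern int       global_best_priority(void);          /* unlocked peek */
extern thread_t *local_dequeue_highest(int cpu);
extern thread_t *global_dequeue_highest(void);         /* caller holds the lock */
extern void      wait_list_insert(thread_t *t);        /* mutex/signal/timer list */
extern void      global_ready_enqueue(thread_t *t);

static atomic_int global_ready_lock;
static void spin_lock(atomic_int *l)   { while (atomic_exchange(l, 1)) { } }
static void spin_unlock(atomic_int *l) { atomic_store(l, 0); }

/* Loading a new thread: only touch the global queue (and its lock) when it
 * actually holds something better than the local ready-queue. */
thread_t *load_next_thread(int cpu)
{
    if (global_best_priority() > local_best_priority(cpu)) {
        spin_lock(&global_ready_lock);
        thread_t *t = global_dequeue_highest();
        spin_unlock(&global_ready_lock);
        if (t)
            return t;
    }
    return local_dequeue_highest(cpu);
}

/* When a thread stops running: blocked threads go to a wait structure,
 * preempted (still runnable) threads go back to the global queue. */
void thread_stopped(thread_t *t, bool was_blocked)
{
    if (was_blocked)
        wait_list_insert(t);
    else
        global_ready_enqueue(t);
}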
EDIT: Updated the signal queue with the new idea presented before.
Re: How to organize threads in an SMP scheduler
Hi,
Brendan wrote: An IRQ occurs on the CPU running the extremely high priority task

And this is a bug - an IRQ should always go to the CPU running the task with the lowest priority.
The penalty for a transfer of the Cache-Content from one CPU to the other CPU is really smaller than your 100 ms.
Greetings,
Erik
Re: How to organize threads in an SMP scheduler
ErikVikinger wrote: Hi,
Brendan wrote: An IRQ occurs on the CPU running the extremely high priority task
And this is a bug - an IRQ should always go to the CPU running the task with the lowest priority.

How would you enforce that?