OSDev.org

Posted: **Tue May 22, 2012 2:28 pm**

rdos wrote:OK. That seems reasonable, but what is "other tasks"? All user level tasks, or some particular user level task?

"Other tasks" means all other tasks. And that time is distributed between them according to their priority.

Posted: **Tue May 22, 2012 2:28 pm**

Hi,

rdos wrote:
Brendan wrote:You only need to use "lock add" (when a task is added to the queue) and "lock sub" (when task is removed from the queue/s) to keep track of how much work is on a CPU's queue/s. Any/all CPUs can read each other's "current work on queue" values with plain old "mov" (no spinlock, no lock prefix).
How scalable is it to let each CPU scheduler loop over n CPUs in order to gather load information? That's and O(n) algorithm. I can really imagine your 1,000 cores looping 999 times each time a new task should be scheduled in order to gather load information from other CPUs, and not getting anything useful done.

For NUMA systems; for performance/scalability reasons, a process is "pinned" to a specific NUMA domain. For example, this means you can make sure that all the pages the process uses can be accessed by that process without penalties (and without excessive Quickpath/hyper-transport traffic). Shifting a process from one NUMA domain to another means that (not necessarily immediately) all threads need to be shifted to different CPUs and all physical pages need to be replaced. It's a very expensive operation. Basically, there's 2 types of load balancing - balancing load of processes across NUMA domains; and balancing load of threads across CPUs within the same NUMA domain.

Note: Without NUMA optimisations like this, your OS will never scale beyond (approximately) 32 CPUs anyway, and therefore you don't need to care about systems with more CPUs.

The design of Intel's x2APIC has effectively limited the hardware to a maximum of 16 CPUs per NUMA domain. This means that, for balancing load of threads across CPUs within the same NUMA domain, you only have to worry about a maximum of 16 CPUs.

With 1000 cores you don't loop 999 times each time a new task should be scheduled - you only loop 16 times because there's 63 NUMA domains with 16 CPUs in each NUMA domain.

You also don't really need to do it each time a task should be scheduled. For example, if the CPU is idle (or if the CPU's load is less than some sort of minimum threshold), then you can skip it.

rdos wrote:
Brendan wrote:So you're trying to say that for per-CPU queues (where other CPUs might occasionally add a thread) the lock contention is a problem and it can't be done locklessly; but as soon as you've got many CPUs constantly checking, adding and removing threads from the same global queue/s there is no problems and it can be lockless?
I think I told you before that CPUs cannot directly add new tasks into another CPUs local queue, and thus the local queues need no locks. The way moving of tasks between CPUs work is by letting CPUs add entries to the global queue (or to an intermediate level if there are many CPUs).

You're wrong. CPUs can directly add new tasks into another CPUs queue - I've been doing it like this for years.

You may have meant that CPUs cannot directly add new tasks into another CPUs local queue for the current implementation of RDOS. In that case, nobody cares too much what RDOS does - I think I told you before that (for a large variety of different reasons) RDOS is hideous mess that needs to be redesigned from scratch.

Cheers,

Brendan

Posted: **Tue May 22, 2012 2:47 pm**

Brendan wrote:You may have meant that CPUs cannot directly add new tasks into another CPUs local queue for the current implementation of RDOS. In that case, nobody cares too much what RDOS does - I think I told you before that (for a large variety of different reasons) RDOS is hideous mess that needs to be redesigned from scratch.

I think I meant exactly what I wrote: The design of the global queue is the replacement for directly adding threads to other CPUs and makes it unnecesary to measure CPU load or know anything about other CPUs in the system. It is this design that makes it unnecesary to provide locks for local queues. This has nothing to do with any RDOS implementation. It is part of the alternative scheduler design.

Posted: **Tue May 22, 2012 3:04 pm**

Hi,

AlfaOmega08 wrote:
WIP: when a thread is found starved on a low priority queue for a long time (2, 3 seconds??) it's priority is boosted to 15, it's given double time slice, and then gets back to it's original priority. (suggested from Win XP scheduler)

Is starvation necessarily bad in this case? I like the idea of allowing threads who's priority is so low that they never get any CPU time unless there's nothing else to do, that are used for things don't need to be done at all (e.g. BOINC, superoptimization, etc).

AlfaOmega08 wrote:I also read that linux CPU schedulers GRAB threads from other CPU's queues (when they're idle) instead of pushing to other CPUs (when they're busy), or using a global queue. This also implies that the idle CPU checks other CPUs. What about this method?

If CPUs only grab tasks from other CPUs when they're idle, then it won't work. For example, one CPU might be extremely over-loaded and 7 CPUs might be running extremely low priority tasks; and because none of the CPUs are idle no load balancing happens at all.

If CPUs only grab tasks from other CPUs when the only tasks they can run have a priority that is less than <threshold>, then it still won't work for the same reason. For example, one CPU might be extremely over-loaded and 7 CPUs might be running tasks that are a tiny bit higher priority than <threshold>; and no load balancing happens again.

If CPUs grab tasks from other CPUs when they're idle or when all the tasks they could run have a priority lower than <threshold>, but also grab tasks from other CPUs using some other mechanism; then you've essentially got 2 things doing the same job and I'd have to wonder why the "some other mechanism" isn't enough on its own.

Cheers,

Brendan

Posted: **Tue May 22, 2012 3:15 pm**

Hi,

rdos wrote:
Brendan wrote:You may have meant that CPUs cannot directly add new tasks into another CPUs local queue for the current implementation of RDOS. In that case, nobody cares too much what RDOS does - I think I told you before that (for a large variety of different reasons) RDOS is hideous mess that needs to be redesigned from scratch.
I think I meant exactly what I wrote: The design of the global queue is the replacement for directly adding threads to other CPUs and makes it unnecesary to measure CPU load or know anything about other CPUs in the system. It is this design that makes it unnecesary to provide locks for local queues. This has nothing to do with any RDOS implementation. It is part of the alternative scheduler design.

Ok, in that case you can optimise it by getting rid of the need for all CPUs to constantly check global queues (by allowing CPUs to insert tasks on other CPU's local queues, or grab tasks from other CPU's queues).

Don't forget that you need to measure CPU load somehow regardless of how you do load balancing (and probably want it for power management and providing statistics to the user/administrator anyway); so "avoiding" the need for measuring CPU load is a stupid idea.

Cheers,

Brendan

Posted: **Wed May 23, 2012 1:10 am**

Brendan wrote:Ok, in that case you can optimise it by getting rid of the need for all CPUs to constantly check global queues (by allowing CPUs to insert tasks on other CPU's local queues, or grab tasks from other CPU's queues).

One read of highest global priority is much faster than checking load-status of up to 16 cores. Additionally, acquiring a spinlock every time you access the local queue because some other CPU might insert something into it is not free.

So, no, that is not an optimization.

Brendan wrote:Don't forget that you need to measure CPU load somehow regardless of how you do load balancing (and probably want it for power management and providing statistics to the user/administrator anyway); so "avoiding" the need for measuring CPU load is a stupid idea.

Not with such a high precision. Power managment only need load statistics per second or something like that. I use amount of time executed by null-thread for that meassure, not how many tasks are queued up and ready to run. This same meassure is used by the load logger.

Posted: **Wed May 23, 2012 2:06 am**

Brendan wrote:If CPUs only grab tasks from other CPUs when they're idle, then it won't work. For example, one CPU might be extremely over-loaded and 7 CPUs might be running extremely low priority tasks; and because none of the CPUs are idle no load balancing happens at all.

It works very well when combined with shuffling some percentage of executed tasks to the global queue (or if you like to keep your direct moves, to any random CPU). The global queue acts as "surplus" load that any idle CPU can get work from. This is why the global queue also solves the load balancing problem without any load information at all.

So, 7 CPUs cannot continously execute low priority tasks as these tasks will get moved to the global queue, where they are only reclaimed by CPUs that are idle and when no higher priority tasks are ready.

Brendan wrote:If CPUs only grab tasks from other CPUs when the only tasks they can run have a priority that is less than <threshold>, then it still won't work for the same reason. For example, one CPU might be extremely over-loaded and 7 CPUs might be running tasks that are a tiny bit higher priority than <threshold>; and no load balancing happens again.

If CPUs grab tasks from other CPUs when they're idle or when all the tasks they could run have a priority lower than <threshold>, but also grab tasks from other CPUs using some other mechanism; then you've essentially got 2 things doing the same job and I'd have to wonder why the "some other mechanism" isn't enough on its own.

This just shows that your proposed solution is too complex, and that the complexity still doesn't allow it to solve basic load balancing problems adequately. I think you need to do a major redesign. Perhaps start from scratch?

Posted: **Wed May 23, 2012 3:50 am**

rdos wrote:The design of the global queue is the replacement for directly adding threads to other CPUs and makes it unnecesary to measure CPU load or know anything about other CPUs in the system.

With global queue you'll have a great performance lag caused by filling cache lines. For example, if your task was executed on core#0 then on next time slice core#1 selected that task from global queue, the cache filled in core#0 is wasted and core#1 will spend much time feeding required code and data from memory. The same applies for NUMA domains. Transferring task from one domain to another may require transferring large amount of data between domains. The same reason (cache fill) is considered when system optimized for performance (large time slice provided for the task as a whole) opposite to interactivity (time slice is divided between tasks to reduce responce time).

Brendan wrote:Is starvation necessarily bad in this case? I like the idea of allowing threads who's priority is so low that they never get any CPU time unless there's nothing else to do, that are used for things don't need to be done at all (e.g. BOINC, superoptimization, etc).

Yes, there is a special kind of tasks which I consider as true background processes (you provided a good samples!), but that class of tasks needs special design (full isolation) and planning to avoid resource starvation. No other task is allowed to depend on that background process! For example, synchronous messages to that tasks should be prohibited.

Posted: **Wed May 23, 2012 4:19 am**

Yoda wrote:
rdos wrote:The design of the global queue is the replacement for directly adding threads to other CPUs and makes it unnecesary to measure CPU load or know anything about other CPUs in the system.
With global queue you'll have a great performance lag caused by filling cache lines. For example, if your task was executed on core#0 then on next time slice core#1 selected that task from global queue, the cache filled in core#0 is wasted and core#1 will spend much time feeding required code and data from memory. The same applies for NUMA domains. Transferring task from one domain to another may require transferring large amount of data between domains. The same reason (cache fill) is considered when system optimized for performance (large time slice provided for the task as a whole) opposite to interactivity (time slice is divided between tasks to reduce responce time).

Yes, but the global queue is only used for exchange of tasks in load balancing. In RDOS, currently 3% of all tasks that are switched from /activated are placed in the global queue, while 97% is placed in the local queue. That keeps the moves between cores at a reasonable rate so the cache problem becomes reduced. By selecting a lower percentage, less cache misses are expected, but it will also take longer for load to even out. It might even be possible to dynamically adapt the percentage. With a maximum time-slice of 1ms, 3% means that at least 30 tasks per second are posted to the global queue under load, which appears to provide functional load balancing under various conditions.

For NUMA domains, things are more complicated. As Brendan correctly wrote, there is a need to transfer whole processes between domains, which will need some additional logic. I can even imagine that common parts of the kernel should be duplicated in all domains in order to avoid large data transfers. If I designed such a solution, I'd also have some shared areas between domains that could be used for fast IPC.

Posted: **Wed May 23, 2012 7:21 pm**

Hi,

rdos wrote:
Brendan wrote:If CPUs only grab tasks from other CPUs when they're idle, then it won't work. For example, one CPU might be extremely over-loaded and 7 CPUs might be running extremely low priority tasks; and because none of the CPUs are idle no load balancing happens at all.
It works very well when combined with shuffling some percentage of executed tasks to the global queue (or if you like to keep your direct moves, to any random CPU). The global queue acts as "surplus" load that any idle CPU can get work from. This is why the global queue also solves the load balancing problem without any load information at all.

So, 7 CPUs cannot continously execute low priority tasks as these tasks will get moved to the global queue, where they are only reclaimed by CPUs that are idle and when no higher priority tasks are ready.

So, a CPU thinks it only has one low priority task to run so the scheduler doesn't bother setting a timer count (to avoid wasting time on pointless/unnecessary IRQs); and after the idle task has run for several "infinite" time slices you put the only task the CPU can run onto a global queue queue for no reason, then realise you've got no tasks to run, and grab a task (probably the exact same task) off of the global queue?

Then a high priority task unblocks, so you put that on the global queue where it sits ignored for several "infinite" time slices, until eventually one day the user reboots the computer and the problem goes away.

Can I guess that your "solution" to this is to fail to avoid wasting time on pointless/unnecessary IRQs, so that if/when the high priority task does unblock it only sits in the global queue for several time slices? E.g. if you shift the only task the CPU can run to the global queue every tenth time the pointless/unnecessary IRQ occurs, and (for low priority tasks) the task is given 100 ms time slices; then the high priority task sits in the global queue for up to 10*100 ms?

rdos wrote:
Brendan wrote:If CPUs only grab tasks from other CPUs when the only tasks they can run have a priority that is less than <threshold>, then it still won't work for the same reason. For example, one CPU might be extremely over-loaded and 7 CPUs might be running tasks that are a tiny bit higher priority than <threshold>; and no load balancing happens again.

If CPUs grab tasks from other CPUs when they're idle or when all the tasks they could run have a priority lower than <threshold>, but also grab tasks from other CPUs using some other mechanism; then you've essentially got 2 things doing the same job and I'd have to wonder why the "some other mechanism" isn't enough on its own.
This just shows that your proposed solution is too complex, and that the complexity still doesn't allow it to solve basic load balancing problems adequately. I think you need to do a major redesign. Perhaps start from scratch?

That's not my solution; that's me (subtly) pointing out that grabbing tasks from a global queue when a CPU is idle doesn't work very well.

My solution is to determine which CPU a task should use when it becomes "ready to run". What I've described so far is: when a task becomes "ready to run", find the CPU with the least load (within the task's "home" NUMA domain) and put the task on that CPU's "ready to run" queue. In reality my scheduler isn't that simple.

In reality it's more like: for each CPU allowed by the thread's CPU affinity mask; calculate a score for the CPU based on several factors (the task's type/scheduling policy, CPU load, CPU speed, CPU temperature, if the task may have data in the CPU's caches, if the CPU is in the task's preferred NUMA domain and NUMA penalties, etc), and choose the CPU with the "best" score. This may mean that a task is not given to the CPU with the least load because (for example) another CPU with slightly more load happens to share an L2 cache with the CPU that the task used last time it was given CPU time.

Of course I can use different code to calculate CPU scores in different situations. For a system with 4 separate single core Pentium III chips I might use a simpler calculation that ignores shared caches (as there are none), and for a modern "multi-core with hyperthreading and TurboBoost" system I might have a more complex calculation that takes into account how the load on one CPU effects the performance of other CPUs. For something like AMD's Bulldozer (where 2 cores share the same floating point pipeline) the calculation could try to put tasks that use floating point on one core and tasks that don't use floating point on another core.

Now, how would you add optimisations like these to a system that grabs tasks from a global queue when the CPU is idle?

Cheers,

Brendan

Posted: **Wed May 23, 2012 8:42 pm**

Hi,

Brendan wrote:Now, how would you add optimisations like these to a system that grabs tasks from a global queue when the CPU is idle?

I thought I'd post an example scenario for people (rdos) to think about.

Imagine the CPU is a quad-core i7, with hyper-threading and TurboBoost (8 logical CPUs). CPU#0 is currently running a very high priority task and all the rest are idle. CPU#6 and CPU#7 have been idle for a long time and are currently in a deep sleep state. The temperatures for all CPUs/cores is roughly the same (as they're all in the same physical chip) but CPU#0 is a little bit warmer than the rest (as it's the only CPU that isn't idle). A very low priority task was recently running on CPU#1 and blocked waiting for data to be read from disk. That data has arrived from disk and unblocked the very low priority task.

If the scheduler is perfect, what would it do?

Shifting the currently running very high priority task to a different CPU makes no sense at all, so it remains running on CPU#0.

For the very low priority task, you don't want to wake up CPU#6 or CPU#7 because they're helping to keep the temperature of the physical chip (and the other cores) down; and you don't really want to use CPU#2, CPU#3, CPU#4, CPU#5, CPU#6 or CPU#7 because in that case TurboBoost will reduce the speed that CPU#0 is running at, and the performance of the very high priority task would be reduced.

If you put the very low priority task on CPU#1 then you avoid the TurboBoost problem, and it might run faster (as some of the task's data may still be in that CPU's caches); but hyper-threading would mean that the performance of the very high priority task would badly effected, and you don't want that because the performance of the very low priority isn't important at all and the performance of the very high priority task is very important.

The optimal solution is to schedule the very low priority task on the same CPU as the very high priority task; even though this CPU has the most load and is running at the warmest temperature, and even though all other CPUs are idle.

Cheers,

Brendan

Posted: **Wed May 23, 2012 8:47 pm**

I should thanks RDOS for his provoke on Brendan's valuable information & idea, really.

Posted: **Thu May 24, 2012 1:02 am**

Brendan wrote:So, a CPU thinks it only has one low priority task to run so the scheduler doesn't bother setting a timer count (to avoid wasting time on pointless/unnecessary IRQs); and after the idle task has run for several "infinite" time slices you put the only task the CPU can run onto a global queue queue for no reason, then realise you've got no tasks to run, and grab a task (probably the exact same task) off of the global queue?

AFAIK, all tasks in RDOS run with 1ms time slices regardless of priority. That also is true for the idle thread. As far as the design idea goes, it doesn't say much about time slice sizes. Putting a task onto the global queue and grabbing it again is not abnormal. It is perfectly legal and has no problems (apart from taking a spinlock, which you even need with the local queue).

Brendan wrote:Then a high priority task unblocks, so you put that on the global queue where it sits ignored for several "infinite" time slices, until eventually one day the user reboots the computer and the problem goes away.

Not so. Whenever a task unblocks, the scheduler is invoked on the CPU that unblocks it. Even if the scheduler on that CPU decides to put
the thread on the global queue, it would take it back from there when it schedules the next task, so there is no delay.

Brendan wrote:My solution is to determine which CPU a task should use when it becomes "ready to run". What I've described so far is: when a task becomes "ready to run", find the CPU with the least load (within the task's "home" NUMA domain) and put the task on that CPU's "ready to run" queue. In reality my scheduler isn't that simple.

In reality it's more like: for each CPU allowed by the thread's CPU affinity mask; calculate a score for the CPU based on several factors (the task's type/scheduling policy, CPU load, CPU speed, CPU temperature, if the task may have data in the CPU's caches, if the CPU is in the task's preferred NUMA domain and NUMA penalties, etc), and choose the CPU with the "best" score. This may mean that a task is not given to the CPU with the least load because (for example) another CPU with slightly more load happens to share an L2 cache with the CPU that the task used last time it was given CPU time.

Of course I can use different code to calculate CPU scores in different situations. For a system with 4 separate single core Pentium III chips I might use a simpler calculation that ignores shared caches (as there are none), and for a modern "multi-core with hyperthreading and TurboBoost" system I might have a more complex calculation that takes into account how the load on one CPU effects the performance of other CPUs. For something like AMD's Bulldozer (where 2 cores share the same floating point pipeline) the calculation could try to put tasks that use floating point on one core and tasks that don't use floating point on another core.

Ohh, now you have added 5-10 new variables per CPU you need to enter into your scheduler calculations, and slowed down the scheduler even more. Do you think the scheduler has time for anything else than these calculations? Maybe you need to setup a maximum time-slice for the scheduler as well so you can abort the calculations when they use all available CPU time?

Besides, several of these parameters are CPU-specific and motherboard specific, and I'm even sure that a fully-featured ACPI driver will not give you all the answers.

Brendan wrote:Now, how would you add optimisations like these to a system that grabs tasks from a global queue when the CPU is idle?

Simple answer: I don't. It is a huge overkill to enter all of these factors into the scheduler. Some of the factors are irrelevant (CPU load and temperature), others can be avoided (you don't need to run CPUs on different speeds, and most chipsets don't support this anyway) while some are impossible to calculate (you don't know what the L2 cache contains, so it is pure guess-work).

Posted: **Thu May 24, 2012 1:07 am**

Dude, do work on your soft skills. That's all I will say at this point.

Posted: **Thu May 24, 2012 1:48 am**

Brendan wrote:I thought I'd post an example scenario for people (rdos) to think about.

OK, thanks.

Brendan wrote:Imagine the CPU is a quad-core i7, with hyper-threading and TurboBoost (8 logical CPUs). CPU#0 is currently running a very high priority task and all the rest are idle. CPU#6 and CPU#7 have been idle for a long time and are currently in a deep sleep state. The temperatures for all CPUs/cores is roughly the same (as they're all in the same physical chip) but CPU#0 is a little bit warmer than the rest (as it's the only CPU that isn't idle). A very low priority task was recently running on CPU#1 and blocked waiting for data to be read from disk. That data has arrived from disk and unblocked the very low priority task.

If the scheduler is perfect, what would it do?

Shifting the currently running very high priority task to a different CPU makes no sense at all, so it remains running on CPU#0.

For the very low priority task, you don't want to wake up CPU#6 or CPU#7 because they're helping to keep the temperature of the physical chip (and the other cores) down; and you don't really want to use CPU#2, CPU#3, CPU#4, CPU#5, CPU#6 or CPU#7 because in that case TurboBoost will reduce the speed that CPU#0 is running at, and the performance of the very high priority task would be reduced.

If you put the very low priority task on CPU#1 then you avoid the TurboBoost problem, and it might run faster (as some of the task's data may still be in that CPU's caches); but hyper-threading would mean that the performance of the very high priority task would badly effected, and you don't want that because the performance of the very low priority isn't important at all and the performance of the very high priority task is very important.

The optimal solution is to schedule the very low priority task on the same CPU as the very high priority task; even though this CPU has the most load and is running at the warmest temperature, and even though all other CPUs are idle.

Some reflexions:

1. Waking up CPU-cores (or putting them to sleep) should not be done in the scheduler. This is the job of the power management driver. Therefore, we can exclude the option that the scheduler would wake up CPU #6 or CPU #7.

2. With proper load balancing all cores that are not in deep sleep have equal load over time, and thus similar temperature. That makes the temperature parameter irrelevant. Additionally, if load is sufficiently high to cause significant temperature rise, then all cores should be active. So, just exclude that from the discussion.

Next, you might want to look at the load distribution I posted for dual core Intel Atom (with hyperthreading) in the "what do your OS look like" thread. The load distribution is quite interesting in that CPU #0 and CPU #2 has the highest load, with CPU #1 and CPU #3 having lower load. CPU #0 and CPU #1 have common resources as do CPU #2 and CPU #3. I think this can be explained by the "global queue" algorithm used. The algorithm ensures that load is evenly spread, and when CPU #0 and CPU #1 have common resources, a thread executing on CPU #0 makes it less likely that CPU #1 will grab a task from the global queue (especially a task posted by CPU #0), which is seen in the load diagrams I posted. This is not an automatic benefit of your algorithm.

So, I think I conclude by proposing that avoiding the issues you want to use in your complex scheduler calculations is far better. For instance:

1. When load balancing works, and load over time on different cores are similar, temperature becomes similar, and thus temperature can be excluded as an issue

2. The shared resource problem can be avoided to some extent with global queue scheduling (at least the hyperthreading related problem), and thus is not an issue.

3. Locality can be handled by decreasing post frequency to the global queue, possibly using a kernel-thread in the scheduler to adapt to different conditions on a longer time-scale (seconds).

4. The power manager should ensure that all cores run at the same frequency, and that all cores are started when load is sufficiently high. This eliminates the differential frequency problem and the temperature problem due to deep sleep.

There are still two issues that I have not solved properly. One has to do with the load on the Intel Core Duo, which is on CPU #0 all the time. It doesn't seem like the global queue algorithm works properly on that CPU. The other (probably related) issue is that load over time on CPUs is not identical on my dual core AMD (or probably on any other CPU). On that machine, CPU #0 idle thread currently has 17 hours and CPU #1 idle thread has 11 hours. I need a way to even-out long term load in order to achieve even temperature. One solution might be to have post frequency to global queue per CPU, and increase it for CPUs with more load and decrease it for CPUs with less load. That probably would solve the long-term load issue, but I'm unsure if it would solve the issue on Intel Core Duo.

OSDev.org

Multitasking, SMP & Timers

Re: Multitasking, SMP & Timers

Re: Multitasking, SMP & Timers

Re: Multitasking, SMP & Timers

Re: Multitasking, SMP & Timers

Re: Multitasking, SMP & Timers

Re: Multitasking, SMP & Timers

Re: Multitasking, SMP & Timers

Re: Multitasking, SMP & Timers

Re: Multitasking, SMP & Timers

Re: Multitasking, SMP & Timers

Re: Multitasking, SMP & Timers

Re: Multitasking, SMP & Timers

Re: Multitasking, SMP & Timers

Re: Multitasking, SMP & Timers

Re: Multitasking, SMP & Timers