Hi,
OSwhatever wrote:What structure do you use in order to find the CPU to run on? Typically you want to find the CPU that runs a thread with lowest priority. Is it a bitmap of running CPUs for each priority?
That's where things get complicated.
To find the best CPU to run a task on, you want to give each CPU a score and find the CPU with the best score.
You may want "CPU affinity". In this case, each task might have a bitfield where each bit represents a CPU; and when you're trying to find the best CPU to run a task on you skip/ignore CPUs that aren't allowed in the task's CPU affinity bitfield.
This gives you a loop that might look something like this:
Code: Select all
int find_best_CPU(TASK_INFO *task) {
    int CPU_number;
    int score;
    int best_score = INT_MAX;
    int best_CPU = -1;          /* -1 means "no allowed CPU was found" */

    for(CPU_number = 0; CPU_number < Max_CPU_number; CPU_number++) {
        if( (task->CPU_affinity & (1 << CPU_number)) != 0) {
            score = get_CPU_score(CPU_number, task);
            if(score < best_score) {
                best_score = score;
                best_CPU = CPU_number;
            }
        }
    }
    return best_CPU;
}
Of course the "if( (task->CPU_affinity & (1 << CPU_number)) != 0) {" part would need to be less lame if you want to support more than 32 CPUs - you'd store the affinity as an array of words instead of a single integer (in assembly I'd use the "BT" instruction to test bits in massive bitfields, but doing it in C is messier).
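For what it's worth, here's a sketch of the "array of words" approach in plain C (the names and sizes here are just assumptions for illustration, not part of any real kernel):

```c
#include <limits.h>

/* Affinity stored as an array of words, one bit per CPU, so any number
   of CPUs can be supported (not just 32) */
#define MAX_CPUS        256
#define BITS_PER_WORD   (sizeof(unsigned int) * CHAR_BIT)
#define AFFINITY_WORDS  ((MAX_CPUS + BITS_PER_WORD - 1) / BITS_PER_WORD)

typedef struct {
    unsigned int CPU_affinity[AFFINITY_WORDS];
} TASK_INFO;

/* Returns 1 if the task is allowed to run on this CPU, 0 if not */
static int CPU_allowed(TASK_INFO *task, int CPU_number) {
    return (task->CPU_affinity[CPU_number / BITS_PER_WORD]
            >> (CPU_number % BITS_PER_WORD)) & 1;
}
```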
Anyway, this gives you a single function ("get_CPU_score(CPU_number, task)") that controls all load balancing. This is useful because you're going to be playing with load balancing a lot (tuning the scheduler) and don't want it spread all over the place.
To begin with, "get_CPU_score(CPU_number, task)" could just find out how many tasks are on the CPU's "ready to run" queues and return a score based on that. However, this is far from ideal.
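For example, the simplest version might look like this (the per-CPU array is just a stand-in for whatever per-CPU structures your scheduler really has):

```c
#define MAX_CPUS 4

/* Stand-in for real per-CPU data: how many tasks are on each CPU's
   "ready to run" queues */
static int ready_task_count[MAX_CPUS];

/* Simplest possible scoring - lower score is better, so the CPU with
   the fewest ready tasks wins */
static int get_CPU_score(int CPU_number) {
    return ready_task_count[CPU_number];
}
```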
You don't want all the low priority tasks on one CPU and all the high priority tasks on another CPU, because high priority tasks tend to block quickly (you'd probably end up with one CPU idle and one CPU with lots of low priority tasks), and it would mean high priority tasks don't get CPU time as quickly because of other high priority tasks. For this reason you want to balance high priority load across all CPUs without caring about low priority load, you want to balance low priority load across all CPUs without caring about high priority load, etc. Basically, you want a separate load rating for each priority group.
This means "get_CPU_score(CPU_number, task)" would find out which priority group the task is in, count how many tasks in that priority group are on the CPU's "ready to run" queues, and return a score based on that. However, this isn't perfect either.
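As a sketch (again with made-up structures and a made-up number of priority groups), the per-priority-group version might look like:

```c
#define MAX_CPUS        4
#define PRIORITY_GROUPS 4   /* e.g. idle, low, normal, high */

/* Stand-in for real per-CPU data: ready task counts split up by
   priority group */
static int ready_count[MAX_CPUS][PRIORITY_GROUPS];

static int get_CPU_score(int CPU_number, int priority_group) {
    /* Only load in the task's own priority group influences the score -
       low priority load doesn't matter when placing a high priority
       task, and vice versa */
    return ready_count[CPU_number][priority_group];
}
```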
If your "get_CPU_score(CPU_number, task)" finds an initial score based on CPU load, then you can modify that initial score under certain conditions.
If a task was running on a CPU recently, then you might want to run the task on the same CPU again in the hope that the task's data is still in that CPU's cache. This could be as simple as "if( task->lastCPU == CPU_number) score *= 0.9;" to improve a CPU's score a little if it was the CPU that the task used last time. It could be a lot more complex though. For modern computers, different CPUs share different caches - you might want to try to run the task on the same CPU as last time, or on any other CPU that shares caches with that CPU. Also, if the CPU/s have done lots of other work since the task last ran, then it's likely that the task's data won't be in the CPU's cache anymore - you might want to estimate the chance that the task's data is still in the cache/s and modify the CPU's score based on that "chance data is still in the cache".
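As a minimal sketch of the simple version (using integer maths instead of "score *= 0.9"; the 10% bonus is a guess that would need tuning):

```c
/* Improve (lower) a CPU's score a little when the task last ran on
   that CPU, hoping the task's data is still in that CPU's cache */
static int adjust_for_cache(int score, int last_CPU, int CPU_number) {
    if(last_CPU == CPU_number) {
        score = (score * 9) / 10;   /* integer version of "score *= 0.9" */
    }
    return score;
}
```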
Next, if one CPU is about to melt and another CPU is cool, then you'd prefer to schedule load on the CPU that is cool. Modern CPUs have temperature sensors you can use, so you can adjust a CPU's score based on the CPU's temperature. This can help to improve performance by avoiding thermal throttling. You might think that load balancing alone is enough to balance CPU temperatures, and you'd be right if cooling were identical for all CPUs. CPU cooling isn't identical though, especially for "multi-socket" motherboards (where different physical chips are in different locations and get different air-flow and different layers of dust).
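A sketch of what the temperature adjustment might look like (the 70 degree threshold and the penalty per degree are made-up numbers you'd have to tune against real throttling behaviour):

```c
/* Worsen (raise) a CPU's score as it gets hotter, so cooler CPUs are
   preferred and thermal throttling is less likely */
static int adjust_for_temperature(int score, int degrees_C) {
    if(degrees_C > 70) {
        score += (degrees_C - 70) * 4;   /* arbitrary penalty per degree */
    }
    return score;
}
```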
Then there's things like hyper-threading and "turbo-boost", where the load on one CPU affects the performance of other CPUs (e.g. making a low priority task run on an idle CPU might make high priority tasks running on other CPUs slower). If all CPUs are idle, then for performance (for high priority tasks) you want to schedule tasks on different cores (and not on different logical CPUs in the same core); but for power management (for low priority tasks) you want to schedule tasks on different logical CPUs in the same core (and not on different cores) so that other cores can stay in a low power state. For things like turbo-boost it's sort of similar. You can modify the CPU's score to take this into account too.
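For hyper-threading, a sketch might look like this (hypothetical helper with arbitrary penalty values; "sibling" means the other logical CPU in the same core):

```c
/* For high priority tasks, penalise logical CPUs whose sibling is busy,
   so high priority work spreads across different cores. For low priority
   tasks, penalise CPUs whose sibling is idle instead, so other cores can
   stay in a low power state. */
static int adjust_for_SMT(int score, int sibling_busy, int high_priority) {
    if(high_priority) {
        if(sibling_busy) score += 50;    /* avoid sharing a busy core */
    } else {
        if(!sibling_busy) score += 50;   /* avoid waking an idle core */
    }
    return score;
}
```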
Another idea would be NUMA optimisations. Each process might have a "home NUMA domain", where you want to run the process' threads on a CPU that is in the process' home NUMA domain to avoid memory access penalties. You wouldn't want to use CPU affinity to prevent the threads from running on CPUs outside the process' home NUMA domain, as that's not flexible enough (maybe the CPUs in the process' home NUMA domain are severely overloaded and the memory access penalties are the lesser of two evils). You can modify the CPU's score to take into account how bad those memory access penalties are.
Note: these penalties are different for different CPUs - e.g. for a CPU in the right NUMA domain there wouldn't be any penalty, a CPU in a "nearby" NUMA domain might have a small penalty, and a CPU in a "distant" NUMA domain might have a high penalty.
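A sketch of what the NUMA adjustment might look like (the penalty table values are illustrative only - in practice you might derive them from the ACPI SLIT table's relative distances):

```c
#define NUMA_DOMAINS 3

/* Penalty for running in a given NUMA domain when the process' home
   domain is another: 0 for the home domain itself, small for "nearby"
   domains, large for "distant" domains */
static const int NUMA_penalty[NUMA_DOMAINS][NUMA_DOMAINS] = {
    {  0, 10, 40 },
    { 10,  0, 10 },
    { 40, 10,  0 },
};

static int adjust_for_NUMA(int score, int home_domain, int CPU_domain) {
    return score + NUMA_penalty[home_domain][CPU_domain];
}
```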
Hopefully by now you've got the idea - the "get_CPU_score(CPU_number, task)" code can take into account a very wide variety of different things. I'd recommend starting with a very simple "get_CPU_score(CPU_number, task)" though - it's not worth worrying about the more complicated stuff until your OS is working well and you can benchmark your changes. The main idea is having a central function that controls all load balancing, so that you can improve/change it later without much hassle.
Cheers,
Brendan