Hi,
OSwhatever wrote:What structure do you use in order to find the CPU to run on? Typically you want to find the CPU that runs a thread with lowest priority. Is it a bitmap of running CPUs for each priority?
That's where things get complicated.
To find the best CPU to run a task on, you want to give each CPU a score and find the CPU with the best score.
You may want "CPU affinity". In this case, each task might have a bitfield where each bit represents a CPU; and when you're trying to find the best CPU to run a task on you skip/ignore CPUs that aren't allowed in the task's CPU affinity bitfield.
This gives you a loop that might look something like this:
Code: Select all
int find_best_CPU(TASK_INFO *task) {
    int CPU_number;
    int score;
    int best_score = INT_MAX;
    int best_CPU = -1;          /* -1 means "no allowed CPU was found" */

    for(CPU_number = 0; CPU_number < Max_CPU_number; CPU_number++) {
        if( (task->CPU_affinity & (1 << CPU_number)) != 0) {
            score = get_CPU_score(CPU_number, task);
            if(score < best_score) {
                best_score = score;
                best_CPU = CPU_number;
            }
        }
    }
    return best_CPU;
}
Of course the "if( (task->CPU_affinity & (1 << CPU_number)) != 0) {" part would need to be less lame if you want to support more than 32 CPUs - you'd store the affinity as an array of words instead of a single integer (in assembly I'd use the "BT" instruction to test bits in massive bitfields, but doing it in C is messier).
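For what it's worth, here's a sketch of the "array of words" approach in plain C (the names and sizes here are just assumptions for illustration, not part of any real kernel):

```c
#include <limits.h>

/* Affinity stored as an array of words, one bit per CPU, so any number
   of CPUs can be supported (not just 32) */
#define MAX_CPUS        256
#define BITS_PER_WORD   (sizeof(unsigned int) * CHAR_BIT)
#define AFFINITY_WORDS  ((MAX_CPUS + BITS_PER_WORD - 1) / BITS_PER_WORD)

typedef struct {
    unsigned int CPU_affinity[AFFINITY_WORDS];
} TASK_INFO;

/* Returns 1 if the task is allowed to run on this CPU, 0 if not */
static int CPU_allowed(TASK_INFO *task, int CPU_number) {
    return (task->CPU_affinity[CPU_number / BITS_PER_WORD]
            >> (CPU_number % BITS_PER_WORD)) & 1;
}
```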
Anyway, this gives you a single function ("get_CPU_score(CPU_number, task)") that controls all load balancing. This is useful because you're going to be playing with load balancing a lot (tuning the scheduler) and don't want it spread all over the place.
To begin with, "get_CPU_score(CPU_number, task)" could just find out how many tasks are on the CPU's "ready to run" queues and return a score based on that. However, this is far from ideal.
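For example, the simplest version might look like this (the per-CPU array is just a stand-in for whatever per-CPU structures your scheduler really has):

```c
#define MAX_CPUS 4

/* Stand-in for real per-CPU data: how many tasks are on each CPU's
   "ready to run" queues */
static int ready_task_count[MAX_CPUS];

/* Simplest possible scoring - lower score is better, so the CPU with
   the fewest ready tasks wins */
static int get_CPU_score(int CPU_number) {
    return ready_task_count[CPU_number];
}
```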
You don't want all the low priority tasks on one CPU and all the high priority tasks on another CPU, because high priority tasks tend to block quickly (you'd probably end up with one CPU idle and one CPU with lots of low priority tasks), and it would mean high priority tasks don't get CPU time as quickly because of other high priority tasks. For this reason you want to balance high priority load across all CPUs without caring about low priority load, you want to balance low priority load across all CPUs without caring about high priority load, etc. Basically, you want a separate load rating for each priority group.
This means "get_CPU_score(CPU_number, task)" would find out which priority group the task is in, count how many tasks in that priority group are on the CPU's "ready to run" queues, and return a score based on that. However, this isn't perfect either.
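As a sketch (again with made-up structures and a made-up number of priority groups), the per-priority-group version might look like:

```c
#define MAX_CPUS        4
#define PRIORITY_GROUPS 4   /* e.g. idle, low, normal, high */

/* Stand-in for real per-CPU data: ready task counts split up by
   priority group */
static int ready_count[MAX_CPUS][PRIORITY_GROUPS];

static int get_CPU_score(int CPU_number, int priority_group) {
    /* Only load in the task's own priority group influences the score -
       low priority load doesn't matter when placing a high priority
       task, and vice versa */
    return ready_count[CPU_number][priority_group];
}
```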
If your "get_CPU_score(CPU_number, task)" finds an initial score based on CPU load, then you can modify that initial score under certain conditions.
If a task was running on a CPU recently, then you might want to run the task on the same CPU again in the hope that the task's data is still in that CPU's cache. This could be as simple as "if( task->lastCPU == CPU_number) score *= 0.9;" to improve a CPU's score a little if it was the CPU that the task used last time. It could be a lot more complex though. For modern computers, different CPUs share different caches - you might want to try to run the task on the same CPU as last time, or on any other CPU that shares caches with that CPU. Also, if the CPU/s have done lots of other work since the task last ran, then it's likely that the task's data won't be in the CPU's cache anymore - you might want to estimate the chance that the task's data is still in the cache/s and modify the CPU's score based on that "chance data is still in the cache".
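As a minimal sketch of the simple version (using integer maths instead of "score *= 0.9"; the 10% bonus is a guess that would need tuning):

```c
/* Improve (lower) a CPU's score a little when the task last ran on
   that CPU, hoping the task's data is still in that CPU's cache */
static int adjust_for_cache(int score, int last_CPU, int CPU_number) {
    if(last_CPU == CPU_number) {
        score = (score * 9) / 10;   /* integer version of "score *= 0.9" */
    }
    return score;
}
```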
Next, if one CPU is about to melt and another CPU is cool, then you'd prefer to schedule load on the CPU that is cool. Modern CPUs have temperature sensors you can use, so you can adjust a CPU's score based on the CPU's temperature. This can help to improve performance by avoiding thermal throttling. You might think that load balancing alone is enough to balance CPU temperatures, and you'd be right if cooling were identical for all CPUs. CPU cooling isn't identical though, especially for "multi-socket" motherboards (where different physical chips are in different locations and get different air-flow and different layers of dust).
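A sketch of what the temperature adjustment might look like (the 70 degree threshold and the penalty per degree are made-up numbers you'd have to tune against real throttling behaviour):

```c
/* Worsen (raise) a CPU's score as it gets hotter, so cooler CPUs are
   preferred and thermal throttling is less likely */
static int adjust_for_temperature(int score, int degrees_C) {
    if(degrees_C > 70) {
        score += (degrees_C - 70) * 4;   /* arbitrary penalty per degree */
    }
    return score;
}
```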
Then there's things like hyper-threading and "turbo-boost", where the load on one CPU affects the performance of other CPUs (e.g. making a low priority task run on an idle CPU might make high priority tasks running on other CPUs slower). If all CPUs are idle, then for performance (for high priority tasks) you want to schedule tasks on different cores (and not on different logical CPUs in the same core); but for power management (for low priority tasks) you want to schedule tasks on different logical CPUs in the same core (and not on different cores) so that other cores can stay in a low power state. For things like turbo-boost it's sort of similar. You can modify the CPU's score to take this into account too.
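For hyper-threading, a sketch might look like this (hypothetical helper with arbitrary penalty values; "sibling" means the other logical CPU in the same core):

```c
/* For high priority tasks, penalise logical CPUs whose sibling is busy,
   so high priority work spreads across different cores. For low priority
   tasks, penalise CPUs whose sibling is idle instead, so other cores can
   stay in a low power state. */
static int adjust_for_SMT(int score, int sibling_busy, int high_priority) {
    if(high_priority) {
        if(sibling_busy) score += 50;    /* avoid sharing a busy core */
    } else {
        if(!sibling_busy) score += 50;   /* avoid waking an idle core */
    }
    return score;
}
```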
Another idea would be NUMA optimisations. Each process might have a "home NUMA domain", where you want to run the process' threads on a CPU that is in the process' home NUMA domain to avoid memory access penalties. You wouldn't want to use CPU affinity to prevent the threads from running on CPUs outside the process' home NUMA domain, as that's not flexible enough (maybe the CPUs in the process' home NUMA domain are severely overloaded and the memory access penalties are the lesser of two evils). You can modify the CPU's score to take into account how bad those memory access penalties are.
Note: these penalties are different for different CPUs - e.g. for a CPU in the right NUMA domain there wouldn't be any penalty, a CPU in a "nearby" NUMA domain might have a small penalty, and a CPU in a "distant" NUMA domain might have a high penalty.
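A sketch of what the NUMA adjustment might look like (the penalty table values are illustrative only - in practice you might derive them from the ACPI SLIT table's relative distances):

```c
#define NUMA_DOMAINS 3

/* Penalty for running in a given NUMA domain when the process' home
   domain is another: 0 for the home domain itself, small for "nearby"
   domains, large for "distant" domains */
static const int NUMA_penalty[NUMA_DOMAINS][NUMA_DOMAINS] = {
    {  0, 10, 40 },
    { 10,  0, 10 },
    { 40, 10,  0 },
};

static int adjust_for_NUMA(int score, int home_domain, int CPU_domain) {
    return score + NUMA_penalty[home_domain][CPU_domain];
}
```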
Hopefully by now you've got the idea - the "get_CPU_score(CPU_number, task)" code can take into account a very wide variety of different things. I'd recommend starting with a very simple "get_CPU_score(CPU_number, task)" though - it's not worth worrying about the more complicated stuff until your OS is working well and you can benchmark your changes. The main idea is having a central function that controls all load balancing, so that you can improve/change it later without much hassle.
Cheers,
Brendan