Multitasking, SMP & Timers
Re: Multitasking, SMP & Timers
Apparently, the load difference seen with a main animation thread that does a timed wait plus a burst of video updates is caused by the timer IRQ always arriving on a specific CPU, which then grabs the unblocked thread. It is not obvious how to fix this and achieve similar average load on all CPUs when this thread contributes the majority of the load.
Some possibilities:
1. Change the post frequency depending on relative load between CPUs, and temporarily prevent the favored CPU from taking the thread from the global queue. This works, but in the end the post frequency to the global queue ends up at 100% on the favored (IRQ) CPU and at 0% on the non-favored ones; only once these parameters are reached does load become even.
2. Post unblocked threads to a random (or lowest-loaded) CPU's queue; load would also become even. However, unless an IPI is also sent, the target CPU keeps executing until its current time-slice expires and only then picks up the new thread. That means the target CPU might first execute the idle thread for up to 1 ms before it goes on to do something useful.
3. Reprogram the HPET (or PIT) regularly to interrupt the CPU with the lowest load. That should really solve the issue, but if the lowest-loaded CPU then decides to hibernate, the delivery of the HPET MSIs must be changed to another CPU.
Edit: The reason for the problem is simple, and #3 is the way to fix it. It is once more "lowest priority" delivery mode for MSIs that doesn't work properly. Apparently, in lowest priority mode, the IRQ is always delivered to the same core when all cores are idle at the same priority. This means it cannot be used for load balancing; it is better to use fixed delivery and always deliver to the CPU that has the lowest load. Then you let the power management driver calculate which CPU has the lowest load, and redirect the timer IRQs to it. A simple solution that adds no overhead to the scheduler.
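For illustration, a minimal sketch of what that fix could look like: the power management driver points a HPET comparator's MSI at a single CPU using fixed delivery. The register offsets follow the public HPET specification, but hpet_base and the function names are assumptions, not anyone's actual code.

Code:
#include <stdint.h>

/* Per the HPET spec: Tn_FSB_INT_ROUTE at 0x110 + 0x20*n,
   low dword = MSI data, high dword = MSI address. */
#define HPET_TIMER_FSB_ROUTE(n)  (0x110 + 0x20 * (n))

extern volatile uint8_t *hpet_base;   /* assumed: MMIO mapping of the HPET */

static void hpet_write64(unsigned off, uint64_t val)
{
    *(volatile uint64_t *)(hpet_base + off) = val;
}

/* Route timer n's MSI to one CPU using fixed delivery (delivery mode 000
   in the MSI data) instead of lowest priority delivery. */
void hpet_route_timer_to_cpu(unsigned timer, uint8_t apic_id, uint8_t vector)
{
    uint32_t addr = 0xFEE00000u | ((uint32_t)apic_id << 12); /* dest ID    */
    uint32_t data = vector;                                  /* fixed mode */

    hpet_write64(HPET_TIMER_FSB_ROUTE(timer), ((uint64_t)addr << 32) | data);
}

The power management thread would call this with the APIC ID of the lowest-loaded CPU whenever its load measurements say the target should change.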
Re: Multitasking, SMP & Timers
Hi,
Brendan wrote: My solution is to determine which CPU a task should use when it becomes "ready to run". What I've described so far is: when a task becomes "ready to run", find the CPU with the least load (within the task's "home" NUMA domain) and put the task on that CPU's "ready to run" queue. In reality my scheduler isn't that simple. In reality it's more like: for each CPU allowed by the thread's CPU affinity mask, calculate a score for the CPU based on several factors (the task's type/scheduling policy, CPU load, CPU speed, CPU temperature, if the task may have data in the CPU's caches, if the CPU is in the task's preferred NUMA domain and NUMA penalties, etc), and choose the CPU with the "best" score. This may mean that a task is not given to the CPU with the least load because (for example) another CPU with slightly more load happens to share an L2 cache with the CPU that the task used last time it was given CPU time. Of course I can use different code to calculate CPU scores in different situations. For a system with 4 separate single-core Pentium III chips I might use a simpler calculation that ignores shared caches (as there are none), and for a modern "multi-core with hyper-threading and TurboBoost" system I might have a more complex calculation that takes into account how the load on one CPU affects the performance of other CPUs. For something like AMD's Bulldozer (where 2 cores share the same floating point pipeline) the calculation could try to put tasks that use floating point on one core and tasks that don't use floating point on another core.

rdos wrote: Ohh, now you have added 5-10 new variables per CPU you need to enter into your scheduler calculations, and slowed down the scheduler even more. Do you think the scheduler has time for anything else than these calculations? Maybe you need to setup a maximum time-slice for the scheduler as well so you can abort the calculations when they use all available CPU time?

I explained how easy it is to track CPU load earlier (and that it's needed anyway). Figuring out if the task uses (or more correctly, has used) FPU/MMX/SSE is easy because the scheduler needs this information anyway (for avoiding unnecessary FPU/MMX/SSE state saving/loading during task switches). Optimising for hyper-threading and TurboBoost is just a more complex way of using the CPU load you already have. Most modern CPUs have temperature sensors built in so you don't really need to do anything to read that (but it would involve different code for different CPUs). Detecting which CPUs share which caches is a bit harder, but that only happens once during boot - testing a set of flags (e.g. to see if cache/s are shared by other CPUs within the core, or within the chip) is fast/easy.
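To make the idea concrete, here is a hypothetical sketch of that kind of per-CPU scoring. The factors, weights and names are all invented for illustration; this is not anyone's actual scheduler.

Code:
#include <stdint.h>
#include <limits.h>

#define MAX_CPUS 64

struct cpu_info {
    uint32_t load;                     /* 0..1000, measured recent load    */
    uint32_t temperature;              /* from the core's thermal sensor   */
    uint32_t numa_domain;
    uint8_t  shares_l2_with[MAX_CPUS]; /* cache topology detected at boot  */
};

struct task {
    uint64_t affinity_mask;
    uint32_t last_cpu;
    uint32_t home_numa_domain;
};

extern struct cpu_info cpus[MAX_CPUS];
extern unsigned num_cpus;

/* Called when a task becomes "ready to run": score every allowed CPU and
   return the best one. */
unsigned pick_cpu(const struct task *t)
{
    unsigned best = t->last_cpu;
    int best_score = INT_MIN;

    for (unsigned i = 0; i < num_cpus; i++) {
        if (!(t->affinity_mask & (1ull << i)))
            continue;

        int score = 0;
        score -= (int)cpus[i].load;                 /* prefer idle CPUs   */
        score -= (int)cpus[i].temperature * 2;      /* prefer cool CPUs   */
        if (i == t->last_cpu)
            score += 300;                           /* warm L1/L2 caches  */
        else if (cpus[i].shares_l2_with[t->last_cpu])
            score += 150;                           /* warm shared L2     */
        if (cpus[i].numa_domain == t->home_numa_domain)
            score += 200;                           /* avoid NUMA penalty */

        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}

Different machines could plug in different scoring functions (e.g. drop the shared-L2 term on CPUs without shared caches), which is exactly the point being made here.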
Notice how there's nothing really hard (or really expensive) in any of that?
Now let's look at what it's avoiding. Cache misses are expensive (maybe 150 cycles each), and if you can avoid 200 cache misses that adds up to about 30000 cycles. Is it worth spending about 20 cycles for this?
Hyper-threading (if you get it wrong) costs you about 40% of performance. Is it worth spending about 20 cycles to avoid this?
TurboBoost varies, but you can typically get up to about 50% performance boost by getting it right. Is it worth spending about 20 cycles to get this right?
Then there's temperatures. This is mostly (but not entirely) for multi-socket systems. One of the CPU fans dies (or is closer to the air intakes and gets better cooling, or gets choked with dust and worse cooling) and you're going to get different temperatures at the same load. On a NUMA system like mine, the IO hub is going to be attached to one NUMA domain and you want that NUMA domain to be running drivers (and processes that use the drivers, and processes that use processes that use drivers, etc) to avoid extra latency and quickpath/hyper-transport link traffic (that slows everything else down). You want that NUMA domain to have higher load, but (assuming even cooling) that chip is going to get hotter. If you screw this up and a chip overheats, you can expect the performance of each CPU in the chip (e.g. 8 CPUs out of 16) to drop down to maybe 12.5% of normal performance. More likely is that the user gets annoyed because they set the OS to "quiet" but you screwed up and now there's a CPU fan going at max. speed/noise. Is it worth spending about 20 cycles to avoid this?
Brendan wrote: Now, how would you add optimisations like these to a system that grabs tasks from a global queue when the CPU is idle?

rdos wrote: Simple answer: I don't. It is a huge overkill to enter all of these factors into the scheduler. Some of the factors are irrelevant (CPU load and temperature), others can be avoided (you don't need to run CPUs on different speeds, and most chipsets don't support this anyway) while some are impossible to calculate (you don't know what the L2 cache contains, so it is pure guess-work).

Um, what? It's very easy to end up with different CPUs running at different speeds (turbo-boost, hyper-threading, power management, thermal throttling). It's also easy to guess when a task's data will definitely still be in caches (e.g. if nothing else used that CPU since the task used it last), and even something very simple (assume the CPU a task used last time might have something left in its cache regardless of how much that CPU has done since the task last ran) is going to avoid a lot more cache misses than failing to take it into account at all.
Simple answer: "simple is better" only applies for things like code maintenance, and doesn't work for performance.
Brendan wrote: Imagine the CPU is a quad-core i7, with hyper-threading and TurboBoost (8 logical CPUs). CPU#0 is currently running a very high priority task and all the rest are idle. CPU#6 and CPU#7 have been idle for a long time and are currently in a deep sleep state. The temperatures for all CPUs/cores are roughly the same (as they're all in the same physical chip) but CPU#0 is a little bit warmer than the rest (as it's the only CPU that isn't idle). A very low priority task was recently running on CPU#1 and blocked waiting for data to be read from disk. That data has arrived from disk and unblocked the very low priority task. If the scheduler is perfect, what would it do?

rdos wrote: Some reflexions:

I didn't ask for reflections. Mostly, I hoped you'd understand that sometimes the optimal solution is to run tasks on busy CPUs instead of idle CPUs.
rdos wrote: 1. Waking up CPU-cores (or putting them to sleep) should not be done in the scheduler. This is the job of the power management driver. Therefore, we can exclude the option that the scheduler would wake up CPU #6 or CPU #7.

The scheduler needs to know all about "CPU power management" to make effective scheduling decisions, and the "CPU power management" needs to know all about scheduling to make effective power management decisions. Trying to separate these things is pointless and just makes it harder to make good decisions.
rdos wrote: 2. With proper load balancing all cores that are not in deep sleep have equal load over time, and thus similar temperature. That makes the temperature parameter irrelevant. Additionally, if load is sufficiently high to cause significant temperature rise, then all cores should be active. So, just exclude that from the discussion.

Wrong (far too simplistic).
rdos wrote: Next, you might want to look at the load distribution I posted for dual core Intel Atom (with hyperthreading) in the "what do your OS look like" thread. The load distribution is quite interesting in that CPU #0 and CPU #2 have the highest load, with CPU #1 and CPU #3 having lower load.

Wrong. This is most likely because your scheduler is crap and not because load is intentionally different for a good reason.
rdos wrote: So, I think I conclude by proposing that avoiding the issues you want to use in your complex scheduler calculations is far better. For instance:

And I think I can conclude that 90% of the time you are wrong (and that's fine). The problem is that you can't see when you're wrong or learn from your mistakes (and that's not fine - it's a perpetual cycle of "fail").
rdos wrote: 1. When load balancing works, and load over time on different cores is similar, temperature becomes similar, and thus temperature can be excluded as an issue

Wrong. I think you're forgetting multi-socket systems.
rdos wrote: 2. The shared resource problem can be avoided to some extent with global queue scheduling (at least the hyperthreading related problem), and thus is not an issue.

Wrong (the entire idea of "shared resource problem can be avoided with a shared resource" is completely retarded).
rdos wrote: 3. Locality can be handled by decreasing post frequency to the global queue, possibly using a kernel-thread in the scheduler to adapt to different conditions on a longer time-scale (seconds).

Wrong - it's the "shared resource problem can be avoided with a shared resource" idiocy extended to locality.
rdos wrote: 4. The power manager should ensure that all cores run at the same frequency, and that all cores are started when load is sufficiently high. This eliminates the differential frequency problem and the temperature problem due to deep sleep.

Wrong. If there's 2 chips with 2 CPU fans and one fan dies, you want to ensure all cores run at extremely slow speeds? I don't think it's possible to be more stupid.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Multitasking, SMP & Timers
Brendan wrote: Now let's look at what it's avoiding. Cache misses are expensive (maybe 150 cycles each), and if you can avoid 200 cache misses that adds up to about 30000 cycles. Is it worth spending about 20 cycles for this?

OK, assume that you can save 150 cycles on avoiding the global posts. These might be done about 3 out of 100 times the scheduler is run. So multiply your 20 cycles by 100/3 and you have 667 cycles, which is over 4 times more.
Brendan wrote: Hyper-threading (if you get it wrong) costs you about 40% of performance. Is it worth spending about 20 cycles to avoid this?

This is harder to quantify. I'm not sure that you will gain much if any performance by trying to optimize this over the simple algorithm to try to even-out load over all available cores.
Brendan wrote: TurboBoost varies, but you can typically get up to about 50% performance boost by getting it right. Is it worth spending about 20 cycles to get this right?

Not sure about this either as I currently don't use TurboBoost. I also decided to ignore Intel Atom's cycle modulation as it doesn't work properly. About the only useful algorithm there is the P-states, which can save power by lowering voltages.
Brendan wrote: Then there's temperatures. This is mostly (but not entirely) for multi-socket systems. One of the CPU fans dies (or is closer to the air intakes and gets better cooling, or gets choked with dust and worse cooling) and you're going to get different temperatures at the same load. On a NUMA system like mine, the IO hub is going to be attached to one NUMA domain and you want that NUMA domain to be running drivers (and processes that use the drivers, and processes that use processes that use drivers, etc) to avoid extra latency and quickpath/hyper-transport link traffic (that slows everything else down). You want that NUMA domain to have higher load, but (assuming even cooling) that chip is going to get hotter. If you screw this up and a chip overheats, you can expect the performance of each CPU in the chip (e.g. 8 CPUs out of 16) to drop down to maybe 12.5% of normal performance. More likely is that the user gets annoyed because they set the OS to "quiet" but you screwed up and now there's a CPU fan going at max. speed/noise. Is it worth spending about 20 cycles to avoid this?

On 99% of today's CPUs (which are single socket, non-NUMA), you would waste the cycles for no benefit. For a NUMA or multi-socket machine, I'd have per-NUMA scheduler parameters that would allow one NUMA domain to have different load than another, while keeping the algorithm intact within the same NUMA domain. Then we both need some smart algorithms to decide where to run programs and drivers. I really don't see those as related to the realtime actions of the scheduler.
Brendan wrote: Um, what? It's very easy to end up with different CPUs running at different speeds (turbo-boost, hyper-threading, power management, thermal throttling).

Most of those are regulated by the OS. For instance, power management can make sure that all CPUs run at the same frequencies. Same for turbo-boost. Thermal throttling can be avoided by keeping temperatures even (at least on the majority of single-socket machines I target).
Brendan wrote: It's also easy to guess when a task's data will definitely still be in caches (e.g. if nothing else used that CPU since the task used it last), and even something very simple (assume the CPU a task used last time might have something left in its cache regardless of how much that CPU has done since the task last ran) is going to avoid a lot more cache misses than failing to take it into account at all.

By keeping the task on the same core you ensure this.
Brendan wrote: The scheduler needs to know all about "CPU power management" to make effective scheduling decisions, and the "CPU power management" needs to know all about scheduling to make effective power management decisions. Trying to separate these things is pointless and just makes it harder to make good decisions.

I disagree. The isolation principle of object oriented programming applies. If you don't absolutely need to know something about another piece of software, you shouldn't care about its internal design or decisions.
In fact, the scheduler doesn't need to know anything about power management. The CPUs that are active are scheduled, while the ones that aren't are not. Because of the global queue design, one CPU's scheduler doesn't need to know anything about what other CPUs do. In fact, the same scheduler design works both for 1 CPU and for 16 CPUs.
The only thing that the power management driver needs to know about is which CPUs are in the system, and their load (which it gathers from timings of idle tasks). I've implemented the power management driver in the ACPI-module, as it needs quite a few things that are part of ACPI. It is the power management driver that starts and stops CPUs, and that modulates frequencies and voltages on CPUs. Of course, a single CPU system doesn't need ACPI or a power management driver to operate. IOW, the power management driver is an optional component.
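A rough sketch of how that load could be derived from idle-thread timings as described; the timer source and all names are assumptions for illustration.

Code:
#include <stdint.h>

struct cpu_load_state {
    uint64_t last_sample;   /* timestamp of the previous sample           */
    uint64_t idle_ticks;    /* accumulated by the idle thread since then  */
};

extern uint64_t read_timestamp(void);   /* e.g. TSC or the HPET counter */

/* Called periodically by the power management thread for each CPU.
   Returns load in 0..1000 (0 = fully idle). A real version would need to
   read/reset idle_ticks atomically versus the idle thread. */
uint32_t sample_cpu_load(struct cpu_load_state *s)
{
    uint64_t now     = read_timestamp();
    uint64_t elapsed = now - s->last_sample;
    uint64_t idle    = s->idle_ticks;

    s->last_sample = now;
    s->idle_ticks  = 0;

    if (elapsed == 0 || idle > elapsed)
        return 0;
    return (uint32_t)(1000 - (idle * 1000) / elapsed);
}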
Re: Multitasking, SMP & Timers
I just checked load balancing with multiple IPC over IP senders. It turns out that the scheduler cannot load balance these either as the tasks are unblocked from network IRQs. OTOH, there are two different scenarios here. On the Intel Core Duo, all the load is on CPU #0. This is because it uses IO-APIC, and all IO-APIC IRQs are delivered to BSP (because of the lowest priority delivery bug in Core Duo). On the 6-core AMD, which has a network card that supports MSI-delivery, lowest priority delivery is still used, but similar to HPET on the portable AMD, the same CPU is always targeted when all CPUs are idle, meaning that load will not be balanced (it seems the last active CPU always gets the IRQ).
I think the proper thing to do about this is to let the load balancer (the power management thread) redirect delivery for both HPET, MSI interrupts and IO-APIC IRQs to the CPU with the lowest load. It probably should switch one IRQ at a time, cycling through the available IRQs. With this logic, lowest priority delivery could be kicked-out once and for all.
I suppose Brendan would have preferred to solve this in the scheduler instead, but such a solution would either increase latency of IRQ-handlers, or require an extra IPI to activate the task on another CPU. By delivering the IRQ to the CPU that later will run the server portion, and unblock the user program, everything can be run on one CPU only, with no IPIs.
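As a sketch of the balancing loop described above: each pass moves one IRQ's IO-APIC redirection entry to the lowest-loaded CPU, using fixed delivery and physical destination mode. The register layout follows the IO-APIC datasheet; the MMIO mapping and helper names are assumptions, and lowest_loaded_apic_id() is assumed to come from the load sampler.

Code:
#include <stdint.h>

extern volatile uint32_t *ioapic;  /* assumed MMIO mapping: IOREGSEL/IOWIN */
extern uint8_t lowest_loaded_apic_id(void);  /* from the load measurements */

static uint32_t ioapic_read(uint8_t reg)
{
    ioapic[0] = reg;     /* IOREGSEL at offset 0x00 */
    return ioapic[4];    /* IOWIN at offset 0x10    */
}

static void ioapic_write(uint8_t reg, uint32_t val)
{
    ioapic[0] = reg;
    ioapic[4] = val;
}

/* Retarget one redirection entry: entries start at register 0x10, two
   32-bit registers each; the destination APIC ID is in bits 63:56. */
static void ioapic_set_irq_cpu(unsigned irq, uint8_t apic_id)
{
    uint8_t  reg = (uint8_t)(0x10 + irq * 2);
    uint32_t lo  = ioapic_read(reg);

    lo &= ~(1u << 11);   /* destination mode: physical */
    lo &= ~(7u << 8);    /* delivery mode: fixed       */
    ioapic_write(reg + 1, (uint32_t)apic_id << 24);
    ioapic_write(reg, lo);
}

/* One step of the balancer: cycle through the IRQs, one per invocation. */
void rebalance_one_irq(void)
{
    static unsigned next_irq;
    ioapic_set_irq_cpu(next_irq, lowest_loaded_apic_id());
    next_irq = (next_irq + 1) % 24;  /* a typical IO-APIC has 24 inputs */
}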
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
Re: Multitasking, SMP & Timers
I'll take my turn to point out your deficiencies. Maybe with repeated drilling you may once get the message.
Brendan wrote: Now let's look at what it's avoiding. Cache misses are expensive (maybe 150 cycles each), and if you can avoid 200 cache misses that adds up to about 30000 cycles. Is it worth spending about 20 cycles for this?

rdos wrote: OK, assume that you can save 150 cycles on avoiding the global posts. These might be done about 3 out of 100 times the scheduler is run. So multiply your 20 cycles by 100/3 and you have 667 cycles, which is over 4 times more.

You only proved that not shifting tasks is more efficient for just one cpu than checking if they should be. The world minus you understands that not shifting tasks is bad for performance at about all the other corners you have.
Brendan wrote: Hyper-threading (if you get it wrong) costs you about 40% of performance. Is it worth spending about 20 cycles to avoid this?

rdos wrote: This is harder to quantify. I'm not sure that you will gain much if any performance by trying to optimize this over the simple algorithm to try to even-out load over all available cores.

Of course, hacking in a feature into an already performance-crappy implementation won't get you the improvements you want.
Brendan wrote: TurboBoost varies, but you can typically get up to about 50% performance boost by getting it right. Is it worth spending about 20 cycles to get this right?

rdos wrote: Not sure about this either as I currently don't use TurboBoost.

RTFM on what it actually does. It's always there, and even you'll have it if you have fewer active threads than cores.
rdos wrote: I also decided to ignore Intel Atom's cycle modulation as it doesn't work properly.

And we all know you can't accept that you're wrong, so you blame Intel, which has run millions of tests to know that it works.
rdos wrote: About the only useful algorithm there is the P-states, which can save power by lowering voltages.

And apparently, not even a simple HLT comes to mind.
Brendan wrote: It's also easy to guess when a task's data will definitely still be in caches (e.g. if nothing else used that CPU since the task used it last), and even something very simple (assume the CPU a task used last time might have something left in its cache regardless of how much that CPU has done since the task last ran) is going to avoid a lot more cache misses than failing to take it into account at all.

rdos wrote: By keeping the task on the same core you ensure this.

And yet you randomly drop threads to other cores.
Brendan wrote: The scheduler needs to know all about "CPU power management" to make effective scheduling decisions, and the "CPU power management" needs to know all about scheduling to make effective power management decisions. Trying to separate these things is pointless and just makes it harder to make good decisions.

rdos wrote: I disagree. The isolation principle of object oriented programming applies. If you don't absolutely need to know something about another piece of software, you shouldn't care about its internal design or decisions. In fact, the scheduler doesn't need to know anything about power management. The CPUs that are active are scheduled, while the ones that aren't are not. Because of the global queue design, one CPU's scheduler doesn't need to know anything about what other CPUs do. In fact, the same scheduler design works both for 1 CPU and for 16 CPUs.

Oh someone lent you a crowbar and now you use it on everything. What's more efficient, having a project manager that puts people on projects not knowing their holidays, and a second manager granting holidays even when there's deadlines that week, or one person that does both? You lose a lot of profit by not having those tasks cooperate at all.

rdos wrote: The only thing that the power management driver needs to know about is which CPUs are in the system, and their load (which it gathers from timings of idle tasks). I've implemented the power management driver in the ACPI-module, as it needs quite a few things that are part of ACPI. It is the power management driver that starts and stops CPUs, and that modulates frequencies and voltages on CPUs.

PM is not entirely in charge, see above.
rdos wrote: Of course, a single CPU system doesn't need ACPI or a power management driver to operate.

So the multicore laptop has a longer battery life because you tune it down when it's idle, while your singlecore laptop always runs at full speed?
Re: Multitasking, SMP & Timers
Combuster wrote: The world minus you understands...

@rdos:
I think the problem here is that, frequently, two completely different discussions mesh with each other in threads like this one.
No-one tells you that you must do things differently in your OS. Do it whatever way you like, it's your OS. Perhaps your algorithms work well enough for you, or perhaps you don't care; it doesn't matter either way.
The problem is that you keep popping up in generic discussions on architecture - i.e., discussions that concern other people's designs, not yours - and field a combination of "my design is the best" and "no-one else knows what he's talking about".
The originally generic discussion rapidly degrades from that point onward, because there are people here that do know what they're talking about, and they don't think your design is so great. And because you utterly, completely, 110% fail to ever acknowledge that your design is anything but perfect - to a degree that I haven't seen in over 20 years of discussing tech on the internet - everyone involved gets irritated and annoyed, and the original discussion gets drowned in another case of "rdos vs. everybody else".
The impression, quite clearly, is that you are not here to learn anything. For you, this forum seems to be a one-way street, a place where you can show others how great every single design decision you ever made truly is. It's the utter lack of self-reflection that's chafing.
Take note that quite some people here other than you have studied these subjects, are working as professionals in the field, and talk from a position of experience and knowledge. They might even (gasp!) know things that you don't. It would behoove you to show the ability to state "in RDOS I do it this way, because of that", and leave it at that. Your constant patronizing of anyone not seeing things your way is really, really, really getting people worked up.
Every good solution is obvious once you've found it.
Re: Multitasking, SMP & Timers
Brendan wrote: Now let's look at what it's avoiding. Cache misses are expensive (maybe 150 cycles each), and if you can avoid 200 cache misses that adds up to about 30000 cycles. Is it worth spending about 20 cycles for this?

rdos wrote: OK, assume that you can save 150 cycles on avoiding the global posts. These might be done about 3 out of 100 times the scheduler is run. So multiply your 20 cycles by 100/3 and you have 667 cycles, which is over 4 times more.

Combuster wrote: I'll take my turn to point out your deficiencies. Maybe with repeated drilling you may once get the message. You only proved that not shifting tasks is more efficient for just one cpu than checking if they should be. The world minus you understands that not shifting tasks is bad for performance at about all the other corners you have.

I think I proved that doing this only once in a while is more efficient than doing it every time.
Combuster wrote: Of course, hacking in a feature into an already performance-crappy implementation won't get you the improvements you want.

Who said I need any improvements? My primary issue is stability, and performance comes after that. If I can achieve better performance with no stability issues, I'll do it, but complex scheduling algorithms are hard to test for stability. That's partly why I chose a simple one that in fact seems to work just as well as the complex ones in typical cases.
Combuster wrote: And apparently, not even a simple HLT comes to mind.

The idle thread naturally uses HLT. P-states are used to achieve more power savings than HLT alone will do, by reducing clock frequency and core voltage.
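For the curious, a minimal sketch of the combination being described: a HLT-based idle loop plus P-state requests via IA32_PERF_CTL (MSR 0x199). The control value must come from the ACPI _PSS tables, and the function names here are invented.

Code:
#include <stdint.h>

#define IA32_PERF_CTL 0x199

static inline void wrmsr(uint32_t msr, uint64_t val)
{
    __asm__ volatile ("wrmsr" : : "c"(msr), "a"((uint32_t)val),
                                  "d"((uint32_t)(val >> 32)));
}

/* Request a P-state; the control value is taken from the ACPI _PSS table. */
void set_pstate(uint64_t pss_control_value)
{
    wrmsr(IA32_PERF_CTL, pss_control_value);
}

/* The idle thread: halt until the next interrupt arrives. */
void idle_thread(void)
{
    for (;;)
        __asm__ volatile ("sti; hlt");
}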
Combuster wrote: And yet you randomly drop threads to other cores.

I do? That's more than I am aware of myself. I randomly drop them to the global queue, where idle or low-priority cores can pick them up.
Combuster wrote: So the multicore laptop has a longer battery life because you tune it down when it's idle while your singlecore laptop always runs at full speed?

I don't own a singlecore laptop, but it is naturally possible to use ACPI and the PM on that one too, provided the processor "drivers" for doing power management are present. I was referring to the AMD Geode, which doesn't have any power-management features (other than HLT), and more or less is an old-type computer that doesn't support many of the modern multicore features. I run that one with the same scheduler as the multicore ones. The only difference is that the spinlocks become nops.
Re: Multitasking, SMP & Timers
Solar wrote: I think the problem here is that, frequently, two completely different discussions mesh with each other in threads like this one.

Moderators could delete those (if they are nonsense), or create new threads if they are interesting.
Solar wrote: The problem is that you keep popping up in generic discussions on architecture - i.e., discussions that concern other people's designs, not yours - and field a combination of "my design is the best" and "no-one else knows what he's talking about".

If you read the first post, the OP is asking for ideas.
Solar wrote: The originally generic discussion rapidly degrades from that point onward, because there are people here that do know what they're talking about, and they don't think your design is so great. And because you utterly, completely, 110% fail to ever acknowledge that your design is anything but perfect - to a degree that I haven't seen in over 20 years of discussing tech on the internet - everyone involved gets irritated and annoyed, and the original discussion gets drowned in another case of "rdos vs. everybody else". The impression, quite clearly, is that you are not here to learn anything. For you, this forum seems to be a one-way street, a place where you can show others how great every single design decision you ever made truly is. It's the utter lack of self-reflection that's chafing. Take note that quite some people here other than you have studied these subjects, are working as professionals in the field, and talk from a position of experience and knowledge. They might even (gasp!) know things that you don't. It would behoove you to show the ability to state "in RDOS I do it this way, because of that", and leave it at that. Your constant patronizing of anyone not seeing things your way is really, really, really getting people worked up.

To me it seems more like some professionals have no ability to go beyond what they have learnt from university courses and other designs, and to acknowledge that things don't need to be done the way they always have been, or like they have been done in popular OSes like Windows and Linux.
The discussion of alternative scheduling algorithms for SMP is highly ontopic, and not related to RDOS at all. It is generic.
And if you note, it was Brendan who first described his ideas about SMP scheduling (which it appears he has tried to implement in his OS in the past). I don't understand why other people are allowed to describe their ideas about SMP scheduling while I'm not. Can you explain that logic to me?
Re: Multitasking, SMP & Timers
Hi,
Brendan wrote: Now let's look at what it's avoiding. Cache misses are expensive (maybe 150 cycles each), and if you can avoid 200 cache misses that adds up to about 30000 cycles. Is it worth spending about 20 cycles for this?

rdos wrote: OK, assume that you can save 150 cycles on avoiding the global posts. These might be done about 3 out of 100 times the scheduler is run. So multiply your 20 cycles by 100/3 and you have 667 cycles, which is over 4 times more.

After every task switch you return to CPL=3 and the CPU tries to fetch the instruction at the task's EIP. This begins with a potential TLB miss which includes 2 potential cache misses (when the CPU tries to fetch the relevant part of the page directory, then the relevant part of the page table), followed by a third potential cache miss when the CPU tries to fetch the cache line at EIP. That's 3 potential cache misses (costing about 450 cycles) before the CPU manages to execute the task's first instruction after a task switch. After that, it's likely that the task will need to fetch more instructions, likely that it'll touch some sort of data, likely that it'll touch its stack, etc.
Basically, after a task switch, there's at least 3 potential cache misses, and possibly hundreds (and maybe even thousands) of potential cache misses. These are the cache misses that we're trying to avoid.
Now, if you always schedule a task on the same CPU it used last time, then you've effectively got no load balancing at all and your OS will be crap. If you always schedule a task on the CPU with the least load, then you'd be paying the "potential cache miss" penalties more often (with 2 CPUs you'd guess wrong 50% of the time, with 4 CPUs you'd guess wrong 75% of the time, with 8 CPUs you'd guess wrong 87.5% of the time - in general (N-1)/N of the time for N equally likely CPUs). The idea is to compromise - determine when increasing the chance of avoiding the cache miss penalties is more or less important than load balancing.
It's more than this though - the compromise between CPU load balancing and optimising cache is just one example. The idea is to determine which CPU is the best CPU for a task to use, based on any information that can help (for performance and for power management). If optimising cache helps on one system and doesn't help on another, then that's fine. If tracking which tasks communicate with each other (and trying to put tasks that communicate a lot on CPUs that are "close") helps on one system and doesn't help on another then that's fine too. If dancing naked in the rain helps on one system and doesn't help on another then that's also fine.
The point is that it's possible/easy to have different code for each specific case that is able to make the best/most intelligent decisions for that specific case.
Brendan wrote: Hyper-threading (if you get it wrong) costs you about 40% of performance. Is it worth spending about 20 cycles to avoid this?

rdos wrote: This is harder to quantify. I'm not sure that you will gain much if any performance by trying to optimize this over the simple algorithm to try to even-out load over all available cores.

Since hyper-threading was introduced, Intel have been saying "for best performance, spread tasks across available cores". Your lame scheduler probably doesn't do this, but it's a good start.
What Intel haven't said is that the opposite also applies. For best power consumption, *don't* spread tasks across available cores. To improve things like battery life in laptops; it's better to have both CPUs in one core busy with another core idle/sleeping than it is to have both cores busy.
Obviously there's some compromise between performance and power consumption. For example, for very high priority tasks you'd want to maximise performance and spread tasks across available cores; and for very low priority tasks you'd want to minimise power consumption and avoid spreading tasks across available cores. Of course this should also be effected by the OS's global power management policy - for example, for a large server you might want to optimise performance for everything except very low priority tasks, but during a power failure (when the same large server is running off of a UPS) you'd might want to minimise power consumption for everything except very high priority tasks.
Then there's the interaction between high priority tasks and low priority tasks - e.g. trying to avoid putting a low priority task on idle CPU that is in the same core as a CPU being used by a high priority task.
Notice how power management influences the decision of which CPU a task should use, and the decision of which CPU a task should use influences power consumption?
Brendan wrote: TurboBoost varies, but you can typically get up to about 50% performance boost by getting it right. Is it worth spending about 20 cycles to get this right?

rdos wrote: Not sure about this either as I currently don't use TurboBoost. I also decided to ignore Intel Atom's cycle modulation as it doesn't work properly. About the only useful algorithm there is the P-states, which can save power by lowering voltages.

If the CPU supports TurboBoost, then you can't avoid using TurboBoost (unless you disable it in the BIOS and never get any extra performance by never having any boost).
For TurboBoost, in general, the speed of "in use" cores depends on the number of idle cores. For example, for a quad-core CPU running at "2 GHz", a core might run at 2 GHz when there are no idle cores, 2.1 GHz when one core is idle, 2.3 GHz when 2 cores are idle and 2.6 GHz when 3 cores are idle.
To optimise it's a little bit like hyper-threading. For a "multi-socket" system; for performance it's better to spread tasks across separate chips, and for power consumption it's better to avoid spreading tasks across separate chips. For both "single-socket" and "multi-socket" there's also interaction between high priority tasks and low priority tasks (e.g. better for the performance of very high priority tasks to avoid putting very low priority tasks on idle cores).
Of course most CPUs that support TurboBoost also support hyper-threading, and you can't optimise for TurboBoost and hyper-threading separately. For example, for a "quad-core with hyper-threading" CPU, if there's 2 tasks then there's 3 possibilities:
- Put the tasks on different CPUs in different cores. This is best for the performance of both tasks, and worst for power consumption. Probably the best option if both tasks are high priority tasks.
- Put the tasks on different CPUs in the same core. This is worst for the performance of both tasks, and good for power consumption. Probably the best option if both tasks are low priority tasks.
- Put both tasks on the same CPU. This is best for the performance of one task and worst for the performance of the other task, and good for power consumption. Probably the best option if one task is high priority and the other is low priority.
Notice how power management influences the decision of which CPU a task should use, and the decision of which CPU a task should use influences power consumption?
Brendan wrote: It's also easy to guess when a task's data will definitely still be in caches (e.g. if nothing else used that CPU since the task used it last), and even something very simple (assume the CPU a task used last time might have something left in its cache regardless of how much that CPU has done since the task last ran) is going to avoid a lot more cache misses than failing to take it into account at all.

rdos wrote: By keeping the task on the same core you ensure this.

By keeping the task on the same core, you ensure your load balancing sucks dog balls. It's a compromise. Why choose between "always pizza and nothing else" and "always ice-cream and nothing else" when you could have ice-cream when you want it more than pizza, or pizza when you want that more than ice-cream?
Brendan wrote: The scheduler needs to know all about "CPU power management" to make effective scheduling decisions, and the "CPU power management" needs to know all about scheduling to make effective power management decisions. Trying to separate these things is pointless and just makes it harder to make good decisions.

rdos wrote: I disagree.

You're not allowed to disagree until you make at least one good decision yourself.
rdos wrote: The isolation principle of object oriented programming applies. If you don't absolutely need to know something about another piece of software, you shouldn't care about its internal design or decisions.

Object oriented programming says that there's a compromise between coupling and cohesion; excessive coupling (too many inter-dependencies between different objects) is bad, excessive cohesion (objects full of "not very related" things) is bad, and the best approach is a balance of both.
This leads to the right way of doing it - the scheduler makes the decisions (without needing to care how to change CPU speeds, sleep states, etc), and the power management code does what it's told (without needing to care about CPU load, etc).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
- DavidCooper
- Member
- Posts: 1150
- Joined: Wed Oct 27, 2010 4:53 pm
- Location: Scotland
Re: Multitasking, SMP & Timers
Is there any way of doing a performance test on RDOS to compare it with other operating systems to see how efficient it is? Ultimately, it's only the performance that's going to tell you whether it's a reasonable design. Of course, even if it compares badly it may still be that he's managing to wring the best possible performance out of a pile of obsolete pants, but unless what he's doing applies universally it isn't likely to be of any use to anyone else. What would really help to settle all these interminable arguments would be if an operating system existed which could be modified easily to conform with RDOS's ideas (or anyone else's) so that it could be compared directly with other versions which conform with other people's ideas, but with everything else the same. If a suitable OS exists, it could be used to prove the point and win the argument outright every time, and when you think about the enormous amount of time being wasted on these neverending discussions, it could be more time-effective for people to put their time into demonstrating directly that they are right rather than just asserting that they are.
Help the people of Laos by liking - https://www.facebook.com/TheSBInitiative/?ref=py_c
MSB-OS: http://www.magicschoolbook.com/computing/os-project - direct machine code programming
Re: Multitasking, SMP & Timers
Brendan wrote: Now, if you always schedule a task on the same CPU it used last time, then you've effectively got no load balancing at all and your OS will be crap. If you always schedule a task on the CPU with the least load, then you'd be paying the "potential cache miss" penalties more often (with 2 CPUs you'd guess wrong 50% of the time, with 4 CPUs you'd guess wrong 75% of the time, with 8 CPUs you'd guess wrong 87.5% of the time, etc). The idea is to compromise - determine when increasing the chance of avoiding the cache miss penalties is more or less important than load balancing.

Just a remark that sometimes CPUs share caches (e.g. on i7 two hyper-threads on a core share the L1/L2 cache, while different cores share the L3 cache but not L1/L2 - at least not directly). See e.g. http://www.tomshardware.com/reviews/Int ... 41-10.html. It gets really complicated to predict cache hit rates...
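As a side note, the sharing topology can be discovered at boot from CPUID leaf 4 on Intel CPUs, where EAX[25:14] holds (number of logical processors sharing the cache) - 1. A small sketch of that enumeration follows; the hosted printf is used only for illustration.

Code:
#include <stdint.h>
#include <stdio.h>

static void cpuid_count(uint32_t leaf, uint32_t subleaf,
                        uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    __asm__ volatile ("cpuid"
                      : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
                      : "a"(leaf), "c"(subleaf));
}

/* Walk CPUID leaf 4: one sub-leaf per cache, until type == 0 (no more). */
void print_cache_sharing(void)
{
    for (uint32_t i = 0; ; i++) {
        uint32_t a, b, c, d;
        cpuid_count(4, i, &a, &b, &c, &d);

        uint32_t type = a & 0x1F;
        if (type == 0)
            break;

        uint32_t level   = (a >> 5) & 0x7;
        uint32_t sharing = ((a >> 14) & 0xFFF) + 1;
        printf("L%u cache: shared by up to %u logical CPU(s)\n",
               level, sharing);
    }
}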