Hi,
rdos wrote:Brendan wrote:Now let's look at what it's avoiding. Cache misses are expensive (maybe 150 cycles each), and if you can avoid 200 cache misses that adds up to about 30000 cycles. Is it worth spending about 20 cycles for this?
OK, assume that you can save 150 cycles on avoiding the global posts. These might be done about 3 out of 100 times the scheduler is run. So multiply your 20 cycles by 100 / 3 and you have 667 cycles, which is over 4 times more.
After every task switch you return to CPL=3 and the CPU tries to fetch the instruction at the task's EIP. This begins with a potential TLB miss which includes 2 potential cache misses (when the CPU tries to fetch the relevant part of the page directory, then the relevant part of the page table), followed by a third potential cache miss when the CPU tries to fetch the cache line at EIP. That's 3 potential cache misses (costing about 450 cycles) before the CPU manages to execute the task's first instruction after a task switch. After that, it's likely that the task will need to fetch more instructions, likely that it'll touch some sort of data, likely that it'll touch its stack, etc.
Basically, after a task switch, there's at least 3 potential cache misses, and possibly hundreds (and maybe even thousands) of potential cache misses. These are the cache misses that we're trying to avoid.
Now, if you always schedule a task on the same CPU it used last time, then you've effectively got no load balancing at all and your OS will be crap. If you always schedule a task on the CPU with the least load, then you'd be paying the "potential cache miss" penalties more often (with 2 CPUs you'd guess wrong 50% of the time, with 4 CPUs you'd guess wrong 75% of the time, with 8 CPUs you'd guess wrong 87.5% of the time, etc - in general, the least loaded CPU is only the CPU the task used last time about 1 time in N). The idea is to compromise - determine when increasing the chance of avoiding the cache miss penalties is more or less important than load balancing.
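For example (a minimal sketch - every structure, field and constant here is made up for illustration, not taken from any real scheduler), the compromise can be as crude as a weighted score where load normally wins, but the CPU the task used last time gets a "warm cache" bonus that decays with every task switch that CPU has done since:

    /* Sketch only: hypothetical per-CPU and per-task bookkeeping. */
    #include <limits.h>

    #define LOAD_WEIGHT   1000   /* arbitrary tuning knobs */
    #define WARMTH_MAX    1000
    #define WARMTH_DECAY  100    /* bonus lost per task switch on that CPU */

    struct cpu_info {
        int load;                      /* e.g. number of ready tasks */
        unsigned long long switches;   /* task switches done on this CPU */
    };

    struct task_info {
        int last_cpu;                      /* CPU the task ran on last time */
        unsigned long long last_switches;  /* that CPU's switch count back then */
    };

    extern struct cpu_info cpu[];
    extern int num_cpus;

    int find_best_cpu(struct task_info *task)
    {
        int best = 0;
        long best_score = LONG_MAX;

        for (int i = 0; i < num_cpus; i++) {
            long score = (long)cpu[i].load * LOAD_WEIGHT;

            if (i == task->last_cpu) {
                /* Guess how warm the caches might still be: the fewer task
                 * switches this CPU has done since the task left, the bigger
                 * the bonus. */
                long since = (long)(cpu[i].switches - task->last_switches);
                long warmth = WARMTH_MAX - since * WARMTH_DECAY;
                if (warmth > 0)
                    score -= warmth;
            }

            if (score < best_score) {
                best_score = score;
                best = i;
            }
        }
        return best;
    }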
It's more than this though - the compromise between CPU load balancing and optimising cache is just one example. The idea is to determine which CPU is the best CPU for a task to use, based on any information that can help (for performance and for power management). If optimising cache helps on one system and doesn't help on another, then that's fine. If tracking which tasks communicate with each other (and trying to put tasks that communicate a lot on CPUs that are "close") helps on one system and doesn't help on another then that's fine too. If dancing naked in the rain helps on one system and doesn't help on another then that's also fine.
The point is that it's possible/easy to have different code for each specific case that is able to make the best/most intelligent decisions for that specific case.
rdos wrote:Brendan wrote:Hyper-threading (if you get it wrong) costs you about 40% of performance. Is it worth spending about 20 cycles to avoid this?
This is harder to quantify. I'm not sure that you will gain much if any performance by trying to optimize this over the simple algorithm to try to even-out load over all available cores.
Since hyper-threading was introduced, Intel have been saying "for best performance, spread tasks across available cores". Your lame scheduler probably doesn't do this deliberately, but getting it right is a good start.
What Intel haven't said is that the opposite also applies. For best power consumption, *don't* spread tasks across available cores. To improve things like battery life in laptops, it's better to have both logical CPUs in one core busy (with the other core idle/sleeping) than it is to have one logical CPU busy in each of two cores.
Obviously there's some compromise between performance and power consumption. For example, for very high priority tasks you'd want to maximise performance and spread tasks across available cores; and for very low priority tasks you'd want to minimise power consumption and avoid spreading tasks across available cores. Of course this should also be affected by the OS's global power management policy - for example, for a large server you might want to optimise performance for everything except very low priority tasks, but during a power failure (when the same large server is running off of a UPS) you might want to minimise power consumption for everything except very high priority tasks.
Then there's the interaction between high priority tasks and low priority tasks - e.g. trying to avoid putting a low priority task on an idle CPU that is in the same core as a CPU being used by a high priority task.
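As a rough sketch of how those two things combine (all names, fields and thresholds below are invented for illustration), the same placement test can give a different answer depending on the task's priority and the global power management policy:

    /* Sketch only: decide whether a logical CPU is an acceptable home for a
     * task, given what its hyper-threading sibling is doing, the task's
     * priority, and the OS's global power management policy. */

    enum power_policy { POLICY_MAX_PERFORMANCE, POLICY_BALANCED, POLICY_MIN_POWER };

    #define LOW_PRIO    32    /* below this, power matters more */
    #define HIGH_PRIO   192   /* above this, performance matters more */

    extern enum power_policy global_policy;   /* e.g. flips to MIN_POWER on a UPS */

    struct logical_cpu {
        int sibling;        /* index of the other logical CPU in the same core */
        int running_prio;   /* priority of what it's running, -1 if idle */
    };

    extern struct logical_cpu lcpu[];

    /* Nonzero if putting a task of priority 'prio' on lcpu[i] looks sane. */
    int placement_ok(int i, int prio)
    {
        int sibling_prio = lcpu[lcpu[i].sibling].running_prio;

        if (global_policy == POLICY_MIN_POWER || prio <= LOW_PRIO)
            return 1;                 /* consolidate: sharing a core is fine */

        if (global_policy == POLICY_MAX_PERFORMANCE || prio >= HIGH_PRIO)
            return sibling_prio < 0;  /* spread: insist on a whole core */

        /* Balanced policy, medium priority: don't steal half a core from a
         * high priority task, otherwise take what's free. */
        return sibling_prio < HIGH_PRIO;
    }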
Notice how power management influences the decision of which CPU a task should use, and the decision of which CPU a task should use influences power consumption?
rdos wrote:Brendan wrote:TurboBoost varies, but you can typically get up to about 50% performance boost by getting it right. Is it worth spending about 20 cycles get this right?
Not sure about this either as I currently don't use TurboBoost. I also decided to ignore Intel Atom's cycle modulation as it doesn't work properly. About the only useful algorithm there is the P-states, which can save power by lowering voltages.
If the CPU supports TurboBoost, then you can't avoid using TurboBoost (unless you disable it in the BIOS and never get any extra performance by never having any boost).
For TurboBoost, in general, the speed of "in use" cores depends on the number of idle cores. For example, for a quad-core CPU running at "2 GHz", a core might run at 2 GHz when there are no idle cores, 2.1 GHz when one core is idle, 2.3 GHz when 2 cores are idle and 2.6 GHz when 3 cores are idle.
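Those figures are the sort of thing a scheduler can keep in a small table and consult when deciding whether spreading is worth it. A sketch using the made-up numbers above (real ratios come from the CPU and differ per model):

    /* Expected speed (MHz) of a busy core on the example quad-core part,
     * indexed by how many of the other three cores are idle. Illustrative
     * numbers only. */
    static const int busy_core_mhz[4] = { 2000, 2100, 2300, 2600 };

    /* Rough percentage speed-up the already-busy cores get if the scheduler
     * leaves one more core idle (i.e. consolidates instead of spreading). */
    int gain_from_one_more_idle_core(int idle_cores)
    {
        if (idle_cores >= 3)
            return 0;
        return (busy_core_mhz[idle_cores + 1] - busy_core_mhz[idle_cores]) * 100
               / busy_core_mhz[idle_cores];
    }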
Optimising for it is a little bit like hyper-threading. For a "multi-socket" system; for performance it's better to spread tasks across separate chips, and for power consumption it's better to avoid spreading tasks across separate chips. For both "single-socket" and "multi-socket" there's also interaction between high priority tasks and low priority tasks (e.g. better for the performance of very high priority tasks to avoid putting very low priority tasks on idle cores).
Of course most CPUs that support TurboBoost also support hyper-threading, and you can't optimise for TurboBoost and hyper-threading separately. For example, for a "quad-core with hyper-threading" CPU, if there's 2 tasks then there's 3 possibilities:
- Put the tasks on different CPUs in different cores. This is best for the performance of both tasks, and worst for power consumption. Probably the best option if both tasks are high priority tasks.
- Put the tasks on different CPUs in the same core. This is worst for the performance of both tasks, and good for power consumption. Probably the best option if both tasks are low priority tasks.
- Put both tasks on the same CPU. This is best for the performance of one task and worst for the performance of the other task, and good for power consumption. Probably the best option if one task is high priority and the other is low priority.
Just like hyper-threading, there's some compromise between performance and power consumption. E.g. for a medium priority task, does the scheduler optimise for performance or minimise power consumption?
Notice how power management influences the decision of which CPU a task should use, and the decision of which CPU a task should use influences power consumption?
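Reduced to code, the choice between those three placements might look something like this (a sketch; collapsing priorities to a single "high/low" threshold is a simplification for illustration):

    /* The three placements above for two tasks on a "quad-core with
     * hyper-threading" part, chosen from the tasks' priorities. */

    #define HIGH_PRIO 192   /* hypothetical threshold */

    enum placement {
        SEPARATE_CORES,   /* best performance for both, worst power */
        SAME_CORE,        /* worst performance for both, good power */
        SAME_CPU          /* one task fast, the other starved, good power */
    };

    enum placement place_two_tasks(int prio_a, int prio_b)
    {
        int a_high = (prio_a >= HIGH_PRIO);
        int b_high = (prio_b >= HIGH_PRIO);

        if (a_high && b_high)
            return SEPARATE_CORES;   /* both matter: pay for the power */
        if (!a_high && !b_high)
            return SAME_CORE;        /* neither matters much: save power */
        return SAME_CPU;             /* let the low priority task scavenge
                                        whatever the high priority one leaves */
    }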
rdos wrote:Brendan wrote:It's also easy to guess when a task's data will definitely still be in caches (e.g. if nothing else used that CPU since the task used it last), and even something very simple (assume the CPU a task used last time might have something left in its cache regardless of how much that CPU has done since the task run) is going to avoid a lot more cache misses than failing to take it into account at all.
By keeping the task on the same core you ensure this.
By keeping the task on the same core, you ensure your load balancing sucks dog balls. It's a compromise. Why choose between "always pizza and nothing else" and "always ice-cream and nothing else" when you could have ice-cream when you want it more than pizza, or pizza when you want that more than ice-cream?
rdos wrote:Brendan wrote:The scheduler needs to know all about "CPU power management" to make effective scheduling descisions, and the "CPU power management" needs to know all about scheduling to make effective power management decisions. Trying to separate these things is pointless and just makes it harder to make good decisions.
I disagree.
You're not allowed to disagree until you make at least one good decision yourself.
rdos wrote:The isolation principle of object oriented programming applies. If you don't absolutely need to know something about another piece of software, you shouldn't care about its internal design or decisions.
Object oriented programming says that there's a compromise between coupling and cohesion; excessive coupling (too many inter-dependencies between different objects) is bad, excessive cohesion (objects full of "not very related" things) is bad, and the best approach is a balance of both.
This leads to the right way of doing it - the scheduler makes the decisions (without needing to care how to change CPU speeds, sleep states, etc), and the power management code does what it's told (without needing to care about CPU load, etc).
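As a sketch of where that line could be drawn (a hypothetical interface, not from any real kernel):

    /* Implemented by the power management code. The scheduler never touches
     * P-state tables, HLT/MWAIT, or vendor-specific MSRs directly. */
    void power_set_performance(int cpu, int percent);      /* 0..100 hint    */
    void power_cpu_idle(int cpu, int expected_idle_us);    /* pick a C-state */

    /* Implemented by the scheduler. The power management code never looks at
     * run queues or priorities; it just reports what the hardware offers. */
    void sched_report_topology(int cpu, int core, int package);
    void sched_report_speed_change(int cpu, int new_mhz);

The scheduler decides what it wants; the power management code only knows how to deliver it on this particular hardware.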
Cheers,
Brendan