
Re: Processor P-states and power management

Posted: Fri Jan 06, 2012 8:12 pm
by Brendan
Hi,
rdos wrote:
Brendan wrote:You mean that you aren't going to think about using P-states to control CPU temperature at the moment, and you're going to implement something that breaks as soon as you do implement CPU temperature management properly?
That's correct. I'm not going to implement temperature control right now, and especially not with p-states. The BIOS already sets up the cores in P0, and SMM handles catastrophic over-temperatures (as do some processors themselves), so what I want to do is to put cores into suitable power-consumption states that are lower power than P0.
Most (Intel) CPUs will suddenly drop down to about 12.5% or 25% of their performance when the CPU gets hot, and stop completely for catastrophic over-temperature. Firmware only sets a few settings during boot (shouldn't be hard to find the relevant MSRs in Intel's manual), and everything else is done automatically by the CPU (with no SMM involved). These things are not power management - they're thermal protection to (try to) make sure the CPU doesn't melt.

For thermal management (rather than thermal protection), I'd want a scheduler that is able to shift load to cooler CPUs as part of its load balancing, plus fan speed control and P-state control. I wouldn't want an OS where a CPU unexpectedly drops from 100% to 12.5% when you're running a high priority or real-time thread.
rdos wrote:Additionally, I'll implement the functions to start/stop cores depending on load. I don't think I will bother with throttling at all, as this is typically most useful for games on desktop computers, and that is not my "thing".
Um? "Throttling" is using any P-state (except P0) for any reason; where the reason might be to help manage temperature and/or acoustics (e.g. fan noise), or to help reduce power consumption (e.g. increase battery life).

Load is load. The CPU doesn't know the difference between "load that is a computer game" and "load that is a complex scientific calculation" and "load that is a compiler". Desktop computers are typically large enough to have adequate cooling and can typically handle high load for extended periods of time without over-heating; and they rarely run from battery (e.g. a UPS). On the other hand, laptops often don't have adequate cooling (due to size/space restrictions) and often do run from battery.

When you said "I don't think I will bother with throttling at all, as this is typically most useful for games on desktop computers, and that is not my "thing"." did you really mean "I don't think I will bother with thermal throttling at all, as this is typically most useful for scientific calculations and compiling on laptops, and that is not my "thing"."?

What exactly is your "thing"? Are you deliberately trying to write an OS that is only useful for situations where the ambient temperature is controlled (e.g. air-conditioned server rooms), which happens to be crap for things like ATMs that are often installed in exterior walls (where around here, the ambient temperature can reach 45 degrees Celsius in the afternoon in summer)?
rdos wrote:So if I decide to change my mind, and build in throttling and temperature management, that is perfectly possible to do on top of the p-state interface. That is basically only a sub-mode of the P0 state.
As soon as you start looking at thermal throttling, you start getting feedback loops.
rdos wrote:
Brendan wrote:The time that P-state changes take depends on which CPU. For the rough formula I used as an example in my last post, you'd be running at 50% within 1 second, at 75% within 2 seconds and at 87.5% within 3 seconds. The main flaw in my example is that you'd never actually reach 100% (but you'd round to the nearest supported P-state, so that doesn't matter much).
I don't find that satisfactory. You should be able to get from 10% to 100% in 100ms, worst case 250ms, otherwise it is too slow. The other direction could take seconds, that's no problem, but decreasing p-state should be fast.
I'm sorry to hear that my rough example wasn't exactly what you wanted. I will refund all of the Research and Development money you've paid me as soon as I can.

My only objective was to get you to think about how P-states could be used properly (including things like acoustic/noise management and thermal management), so that you're at least aware of ways to improve your "only half a solution" solution.

rdos wrote:
Brendan wrote:Of course - nobody would be silly enough to use boring old algebra with variables that range from 0 to 1 when there's a way to ignore most of the common operators and use the power of hype instead.... :roll:
I've used fuzzy logic before, and it is efficient for certain types of problems that are hard to describe with deterministic rules. In this case, you want sudden increases in load to affect p-states fast, while also keeping the system at optimal load (around 50%). That is a problem that isn't easy to formulate with algebra.

Example of rules:

1. If load is very high, quickly go to P0
2. If load is high, decrease p-state one mark
3. If load is intermediate, your system is fine, and you do nothing
4. If load is low, increase p-state one mark

Conditions: (just a preliminary example)

A. Very high ramps from 0.0 at 75% load to 1.0 at 90%, and stays at 1.0 above that
B. High ramps from 0.0 at 50% to 1.0 at 65%, stays at 1.0 from 65% to 80%, then ramps down to 0.0 at 90%
C. Intermediate ramps from 0.0 at 25% to 1.0 at 40%, stays at 1.0 from 40% to 60%, then ramps down to 0.0 at 75%
D. Low is 1.0 up to 25%, then ramps down to 0.0 at 40%

How would you formulate this with algebra?
I wouldn't - I'd design something better, because I wouldn't limit my design to a small subset of available operators.

Maybe something like:

Code: Select all

    CPU_load_rating = high_priority_load * 0.7 + medium_priority_load * 0.3;
    if( system_is_on_battery) {
        power_multiplier = performance_or_efficiency_user_preference_for_battery;
    } else {
        power_multiplier = performance_or_efficiency_user_preference_for_mains_power;
    }
    desired_CPU_speed = CPU_load_rating * power_multiplier;

    if(CPU_speed < desired_CPU_speed) {
       CPU_speed = desired_CPU_speed * 0.9 +  CPU_speed * 0.1;     // Fast speed up
    } else {
       CPU_speed = desired_CPU_speed * 0.4 +  CPU_speed * 0.6;     // Slow speed down
    }
    Pstate = find_Pstate( CPU_speed );    // Search for the closest available P-state for the CPU speed,
                                          //   based on information for the specific CPU we're running on
    set_Pstate(Pstate);
NOTICE: this example pseudo-code for an incomplete solution is only an example, is only pseudo-code, and is only an incomplete solution.

For any "fuzzy logic" code, just change the comments to convert it into "non-fuzzy logic" code.


Cheers,

Brendan

Re: Processor P-states and power management

Posted: Sat Jan 07, 2012 7:41 am
by rdos
Brendan wrote:Most (Intel) CPUs will suddenly drop down to about 12.5% or 25% of their performance when the CPU gets hot, and stop completely for catastrophic over-temperature. Firmware only sets a few settings during boot (shouldn't be hard to find the relevant MSRs in Intel's manual), and everything else is done automatically by the CPU (with no SMM involved). These things are not power management - they're thermal protection to (try to) make sure the CPU doesn't melt.
OK, but it provides the basis for a functional system without functional thermal management. :mrgreen:
Brendan wrote:For thermal management (rather than thermal protection), I'd want a scheduler that is able to shift load to cooler CPUs as part of its load balancing, plus fan speed control and P-state control. I wouldn't want an OS where a CPU unexpectedly drops from 100% to 12.5% when you're running a high priority or real-time thread.
I've already decided against having different load on different cores. The fact that many CPUs also do not support per-core P-states makes this decision even more compelling. It is the scheduler that is responsible for ensuring that load is balanced between cores, and thus that temperatures do not differ between cores. This must be the best method, as having a uniform temperature is much better than having some cores that are overheating and some that are unused. The only reason why I would shut down cores is because there is no load for them. If the system is so highly loaded that cores are about to overheat, then all cores should be started, and they should be running at similar load.

However, looking at the previous load curves I posted, it is obvious that this is not yet working properly, because if it did, the animation thread would have switched between core 0 and core 1 at least a few times per minute, which it doesn't seem to do.

OTOH, regarding real-time threads, you might have a point about thermal management there, but since real-time threads are pegged to a core, it would be the scheduler that would force higher P-states on the other cores when temperature becomes too hot.
Brendan wrote:Load is load. The CPU doesn't know the difference between "load that is a computer game" and "load that is a complex scientific calculation" and "load that is a compiler". Desktop computers are typically large enough to have adequate cooling and can typically handle high load for extended periods of time without over-heating; and they rarely run from battery (e.g. a UPS). On the other hand, laptops often don't have adequate cooling (due to size/space restrictions) and often do run from battery.

When you said "I don't think I will bother with throttling at all, as this is typically most useful for games on desktop computers, and that is not my "thing"." did you really mean "I don't think I will bother with thermal throttling at all, as this is typically most useful for scientific calculations and compiling on laptops, and that is not my "thing"."?

What exactly is your "thing"? Are you deliberately trying to write an OS that is only useful for situations where the ambient temperature is controlled (e.g. air-conditioned server rooms), which happens to be crap for things like ATMs that are often installed in exterior walls (where around here, the ambient temperature can reach 45 degrees Celsius in the afternoon in summer)?
There is no valid excuse for installing ATMs without adequate climate control! Absolutely none. Not all PCs have advanced power-management. For instance, the PC we use has no ACPI thermal zone, no temperature reading, no fan, and no P-states. If you install that PC as an ATM without adequate climate control you are in for big trouble!

We just cannot expect software to compensate for lousy design decisions. If you place a PC outdoors in Scandinavia, you can be pretty sure that it will be cooked in summer and freezing cold in winter. There is no valid excuse for not adding a cooling system and a heater to such an installation.
Brendan wrote:I wouldn't - I'd design something better, because I wouldn't limit my design to a small subset of available operators.

Maybe something like:

Code: Select all

    CPU_load_rating = high_priority_load * 0.7 + medium_priority_load * 0.3;
    if( system_is_on_battery) {
        power_multiplier = performance_or_efficiency_user_preference_for_battery;
    } else {
        power_multiplier = performance_or_efficiency_user_preference_for_mains_power;
    }
    desired_CPU_speed = CPU_load_rating * power_multiplier;

    if(CPU_speed < desired_CPU_speed) {
       CPU_speed = desired_CPU_speed * 0.9 +  CPU_speed * 0.1;     // Fast speed up
    } else {
       CPU_speed = desired_CPU_speed * 0.4 +  CPU_speed * 0.6;     // Slow speed down
    }
    Pstate = find_Pstate( CPU_speed );    // Search for the closest available P-state for the CPU speed,
                                          //   based on information for the specific CPU we're running on
    set_Pstate(Pstate);
NOTICE: this example pseudo-code for an incomplete solution is only an example, is only pseudo-code, and is only an incomplete solution.
Not too bad, but it would need some tuning. :lol:
Brendan wrote:For any "fuzzy logic" code, just change the comments to convert it into "non-fuzzy logic" code.
Your example clearly doesn't have the same properties as the one I designed. My system basically added one to P-state when load was below ca 35%, subtracted one from P-state when load was above ca 65%, and put the system in P0 when load was above ca 85%. If I wanted to add temperature to the equation (like you seem to imply that I should), this just expands the rule-matrix with low and high temperature. For instance, if temperature is high and increasing, I'd add one to P-state. If temperature is high and stable, I would never decrease P-state, only increase it if the load was below ca 35%.
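
In rough C, the rule set above looks something like this (the thresholds, the "0 is fastest" P-state numbering and the helper names are placeholders, not the actual RDOS code):

Code: Select all

/* Sketch of the rule set described above; all names are placeholders. */
extern void set_pstate(int pstate);   /* however the core is actually switched */

static int max_pstate = 7;            /* slowest P-state the CPU supports      */
static int cur_pstate = 0;            /* P-state we are currently running in   */

void pstate_tick(int load)            /* load = measured load in percent       */
{
    if (load > 85)
        cur_pstate = 0;                                 /* very high: go straight to P0 */
    else if (load > 65 && cur_pstate > 0)
        cur_pstate--;                                   /* high: one step faster        */
    else if (load < 35 && cur_pstate < max_pstate)
        cur_pstate++;                                   /* low: one step slower         */
    /* between ca 35% and ca 65% the system is fine: do nothing */

    set_pstate(cur_pstate);
}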

Re: Processor P-states and power management

Posted: Sat Jan 07, 2012 10:00 am
by Brendan
Hi,
rdos wrote:It is the scheduler that is responsible for ensuring that load is balanced between cores, and thus that temperatures do not differ between cores. This must be the best method, as having a uniform temperature is much better than having some cores that are overheating and some that are unused.
Imagine a server with a pair of separate Pentium III ("Coppermine") single-core chips, running a single-threaded process that consumes 100% of one CPU's time and very little else. If your scheduler's load balancing ignores temperatures then it runs the process on the same CPU all the time (to reduce cache misses and task switches). On a hot day (high ambient temperature, where increasing fan speed doesn't help much) how do you end up with the same temperature for both chips? When the over-used chip starts overheating and the CPU drops down to 12.5% of its normal speed in self defence, does your scheduler notice or does it keep running the process on the slow CPU while the cool/fast CPU is still idle?

Now think of a system with a pair of separate dual-core chips running two single-threaded processes. Do you run both processes on different cores of the same chip?

How about a system with four separate dual-core Opterons running 7 single threaded processes - do you have 3 hot CPUs and one warm?
rdos wrote:Your example clearly doesn't have the same properties as the one I designed. My system basically added one to P-state when load was below ca 35%, subtracted one from P-state when load was above ca 65%, and put the system in P0 when load was above ca 85%. If I wanted to add temperature to the equation (like you seem to imply that I should), this just expands the rule-matrix with low and high temperature. For instance, if temperature is high and increasing, I'd add one to P-state.
My example didn't have the same properties as the one you designed. Under a constant load of 33% you repeatedly add one to the P-state until the CPU is running as slow as it possibly can be, and for a constant load of 66% you subtract one from the P-state until the CPU is running at P0. You don't take into account if the system is running on battery power (UPS or laptop battery) but I'd want to automatically adjust if mains power is plugged in or removed; and you have no way of allowing the user/administrator to adjust the behaviour (in the Winter, I like setting my systems to "max. power" - if I'm going to pay for heating the room anyway, then I see no reason not to get better performance too). Finally you assume that there's a linear relationship between P-states when there may not be. For example, there might be 4 P-states, where CPU speed is 100% for P0, 95% for P1, 92% for P2 and 10% for P3 (where a large change in load might cause a switch from P0 to P2 and only a small difference in CPU speed, or a small change in load might cause a switch from P3 to P2 and a huge change in CPU speed).

Also note that the power consumed in each P-state isn't proportional to the CPU speed in that P-state. For example, the P-states could look like this:
  • P0 = 100% CPU speed, 100% power consumption
  • P1 = 95% CPU speed, 40% power consumption
  • P2 = 92% CPU speed, 37% power consumption
  • P3 = 10% CPU speed, 33% power consumption
Because of this, you might want to use the P-state's "power consumption" field to find the best "fastest P-state allowed by thermal management"; and then you use the "Core Frequency" field to determine the final Pstate (between Pn and "fastest P-state allowed by thermal management") to actually use.

Finally, the "Latency" field and "Bus Master Latency" field should probably also be taken into account (and both these fields may be different for each different Pstate). I'd be tempted to use the latency fields to determine how long after a P-state change until another P-state change is allowed (e.g. during a P-state change, multiply the latency for that P-state by 100 and set a timer, and refuse to do another P-state change until the timer expires). I'd also be tempted to use the latency fields to determine if a P-state change is worth bothering with (e.g. if you're currently using P1 but P2 is slightly closer to what you want, maybe switching to P2 isn't worth the latency cost and you decide to keep using P1 for now anyway).
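
As a very rough sketch of that two-step selection (the struct fields mirror an ACPI _PSS entry; where the power budget comes from is left open, and none of this is tested):

Code: Select all

/* Sketch only. One entry per P-state, ordered fastest (P0) first. */
struct pstate_info {
    unsigned core_freq_mhz;            /* "Core Frequency" field      */
    unsigned power_mw;                 /* "Power" field               */
    unsigned latency_us;               /* "Latency" field             */
    unsigned bus_master_latency_us;    /* "Bus Master Latency" field  */
};

/* Fastest P-state whose power consumption fits the thermal budget. */
int fastest_pstate_for_power(const struct pstate_info *table, int count,
                             unsigned power_budget_mw)
{
    for (int i = 0; i < count; i++)
        if (table[i].power_mw <= power_budget_mw)
            return i;
    return count - 1;                  /* nothing fits: use the slowest */
}

/* Between that limit and Pn, pick the entry whose core frequency is
   closest to the speed we actually want.                              */
int closest_pstate_for_speed(const struct pstate_info *table, int count,
                             int limit, unsigned wanted_mhz)
{
    int best = limit;
    unsigned best_diff = (unsigned)-1;
    for (int i = limit; i < count; i++) {
        unsigned f = table[i].core_freq_mhz;
        unsigned diff = (f > wanted_mhz) ? f - wanted_mhz : wanted_mhz - f;
        if (diff < best_diff) {
            best_diff = diff;
            best = i;
        }
    }
    return best;
}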


Cheers,

Brendan

Re: Processor P-states and power management

Posted: Sat Jan 07, 2012 11:19 am
by rdos
Brendan wrote:Imagine a server with a pair of separate Pentium III ("Coppermine") single-core chips, running a single-threaded process that consumes 100% of one CPU's time and very little else. If your scheduler's load balancing ignores temperatures then it runs the process on the same CPU all the time (to reduce cache misses and task switches). On a hot day (high ambient temperature, where increasing fan speed doesn't help much) how do you end up with the same temperature for both chips? When the over-used chip starts overheating and the CPU drops down to 12.5% of its normal speed in self defence, does your scheduler notice or does it keep running the process on the slow CPU while the cool/fast CPU is still idle?

Now think of a system with a pair of separate dual-core chips running two single-threaded processes. Do you run both processes on different cores of the same chip?

How about a system with four separate dual-core Opterons running 7 single threaded processes - do you have 3 hot CPUs and one warm?
But if a single-threaded process consumes 100% of CPU time it won't be running on the same CPU all the time. It will get switched between CPUs. This worked for the JPEG coder, which got switched between cores several times. It didn't work for the more cooperative animation thread, probably because it uses a timed wait between redraws.

What I will do to ensure equal load, is to move threads away from cores that have below-normal execution time for their null thread.
Brendan wrote:My example didn't have the same properties as the one you designed. Under a constant load of 33% you repeatedly add one to the P-state until the CPU is running as slow as it possibly can be, and for a constant load of 66% you subtract one from the P-state until the CPU is running at P0.
Not true. You forget that load increases as P-state increases. If you have 33% load in P0, you might get 45% load in P1, and thus in the long run you end up in P1, as 45% is in the normal zone. Similarly, if you have 75% load in P3, you might have 50% load in P2, and thus would stabilize in P2.
Brendan wrote:You don't take into account if the system is running on battery power (UPS or laptop battery) but I'd want to automatically adjust if mains power is plugged in or removed; and you have no way of allowing the user/administrator to adjust the behaviour (in the Winter, I like setting my systems to "max. power" - if I'm going to pay for heating the room anyway, then I see no reason not to get better performance too).
You have a point with winter-settings, but I don't like configuration settings. I like everything to be automatic. :mrgreen:
Brendan wrote:Finally you assume that there's a linear relationship between P-states when there may not be. For example, there might be 4 P-states, where CPU speed is 100% for P0, 95% for P1, 92% for P2 and 10% for P3 (where a large change in load might cause a switch from P0 to P2 and only a small difference in CPU speed, or a small change in load might cause a switch from P3 to P2 and a huge change in CPU speed).

Also note that the power consumed in each P-state isn't proportional to the CPU speed in that P-state. For example, the P-states could look like this:
  • P0 = 100% CPU speed, 100% power consumption
  • P1 = 95% CPU speed, 40% power consumption
  • P2 = 92% CPU speed, 37% power consumption
  • P3 = 10% CPU speed, 33% power consumption
Because of this, you might want to use the P-state's "power consumption" field to find the best "fastest P-state allowed by thermal management"; and then you use the "Core Frequency" field to determine the final Pstate (between Pn and "fastest P-state allowed by thermal management") to actually use.
If the relationship was linear, I wouldn't bother with P-states, as entering HLT (C1) would work just as well.

Here are the settings in my AMD Athlon:

Code: Select all

P0: 3000MHz, 125W
P1: 2800MHz, 108W
P2: 2600MHz, 93W
P3: 2400MHz, 79W
P4: 2200MHz, 67W
P5: 2000MHz, 60W
P6: 1800MHz, 54W
P7: 1000MHz, 27W
Brendan wrote:Finally, the "Latency" field and "Bus Master Latency" field should probably also be taken into account (and both these fields may be different for each different Pstate). I'd be tempted to use the latency fields to determine how long after a P-state change until another P-state change is allowed (e.g. during a P-state change, multiply the latency for that P-state by 100 and set a timer, and refuse to do another P-state change until the timer expires). I'd also be tempted to use the latency fields to determine if a P-state change is worth bothering with (e.g. if you're currently using P1 but P2 is slightly closer to what you want, maybe switching to P2 isn't worth the latency cost and you decide to keep using P1 for now anyway).
I haven't seen any PC where they differ between P-states.

Re: Processor P-states and power management

Posted: Sat Jan 07, 2012 1:22 pm
by rdos
There is a need to change the algorithm slightly. Instead of calculating load as an average of the load of all cores, it is necessary to use the highest load on any core. Otherwise, when P7 is entered, and some single-threaded application that uses 100% CPU is started, average load is around 50%, and no P-state transition is performed. In this case one core is loaded 0% and the other 100%. Thus, it is necessary to use maximum load.
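
In sketch form (core_load[] is assumed to come from the scheduler's per-core accounting):

Code: Select all

/* Sketch: feed the P-state rules with the busiest core, not the average. */
int system_load_percent(const int *core_load, int core_count)
{
    int max_load = 0;
    for (int i = 0; i < core_count; i++)
        if (core_load[i] > max_load)
            max_load = core_load[i];
    return max_load;
}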

Edit: This algorithm now works perfectly well on an AMD K8 (which has a really terrible P-state algorithm).

Re: Processor P-states and power management

Posted: Sat Jan 07, 2012 7:45 pm
by Brendan
Hi,
rdos wrote:What I will do to ensure equal load, is to move threads away from cores that have below-normal execution time for their null thread.
What if one chip has an efficient CPU fan, and another chip doesn't (full of dust, air-flow blocked, sitting next to a graphics card so the air it sucks in is already hot, or just plain faulty fan)? In that case, equal load means unequal temperatures.
rdos wrote:
Brendan wrote:My example didn't have the same properties as the one you designed. Under a constant load of 33% you repeatedly add one to the P-state until the CPU is running as slow as it possibly can be, and for a constant load of 66% you subtract one from the P-state until the CPU is running at P0.
Not true. You forget that load increases as P-state increases. If you have 33% load in P0, you might get 45% load in P1, and thus in the long run you end up in P1, as 45% is in the normal zone. Similarly, if you have 75% load in P3, you might have 50% load in P2, and thus would stabilize in P2.
I think we're calculating load differently. You're using percentage of time not idle; while I'm taking task priorities into account. For example, with "CPU_load_rating = high_priority_load * 0.7 + medium_priority_load * 0.3;" if the CPU is constantly running medium priority tasks then the CPU_load_rating would be 30% regardless of which P-state you're using.
rdos wrote:
Brendan wrote:You don't take into account if the system is running on battery power (UPS or laptop battery) but I'd want to automatically adjust if mains power is plugged in or removed; and you have no way of allowing the user/administrator to adjust the behaviour (in the Winter, I like setting my systems to "max. power" - if I'm going to pay for heating the room anyway, then I see no reason not to get better performance too).
You have a point with winter-settings, but I don't like configuration settings. I like everything to be automatic. :mrgreen:
I like both (e.g. good default behaviour so that people don't need to care, with configuration settings hidden in some "advanced" menu or something for people that do care). It probably isn't worthwhile to (for example) use the time zone (or GPS if the system has one) to determine roughly where the computer is and implement automated winter adjustments (although it might be a nice feature). Automatically changing the power management when mains power is plugged in or removed is very common though.
rdos wrote:If the relationship was linear, I wouldn't bother with P-states, as entering HLT (C1) would work just as well.
That's probably because you're calculating CPU load in a "simple" way. For example, if the CPUs are spending all their time running extremely high priority tasks or if the CPUs are spending all their time running extremely low priority tasks, then you treat both of these very different cases as "100% load" just because the CPUs aren't spending time idle. In my case, one would be "100% load" and the other might be "20% load" because extremely low priority tasks don't matter so much.
rdos wrote:
Brendan wrote:Finally, the "Latency" field and "Bus Master Latency" field should probably also be taken into account (and both these fields may be different for each different Pstate). I'd be tempted to use the latency fields to determine how long after a P-state change until another P-state change is allowed (e.g. during a P-state change, multiply the latency for that P-state by 100 and set a timer, and refuse to do another P-state change until the timer expires). I'd also be tempted to use the latency fields to determine if a P-state change is worth bothering with (e.g. if you're currently using P1 but P2 is slightly closer to what you want, maybe switching to P2 isn't worth the latency cost and you decide to keep using P1 for now anyway).
I haven't seen any PC where they differ between P-states.
I've never actually seen you (I don't even know what you look like). Given the fact that I haven't seen you, you can't ever exist; and if I do happen see you in future then I'll have to assume that I was mistaken because I know you can't exist. 8)
rdos wrote:There is a need to change the algorithm slightly. Instead of calculating load as an average of the load of all cores, it is necessary to use the highest load on any core. Otherwise, when P7 is entered, and some single-threaded application that uses 100% CPU is started, average load is around 50%, and no P-state transition is performed. In this case one core is loaded 0% and the other 100%. Thus, it is necessary to use maximum load.
I'd assume there's at least 2 different cases - one where P-states affect all logical CPUs in the chip (but not CPUs in other chips), and one where each core has its own independent P-states that only affect logical CPUs within that core. There might even be intermediate cases (e.g. a quad core where each pair of cores have independent P-states). Of course you should probably use ACPI (the "_PSD" object) to determine what relationships there are between P-state controls and logical CPUs (although to be honest I haven't figured out what tells you which CPUs are in which "dependency domain", and the more time I spend looking into ACPI the more I want to avoid ACPI).
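
For what it's worth, a parsed _PSD entry boils down to five integers per package; if I'm reading the spec right, it's something like this (a sketch only - the AML evaluation and error handling are omitted):

Code: Select all

/* Sketch of one parsed _PSD package (5 integers per entry). */
struct pstate_dependency {
    unsigned num_entries;     /* 5 for this revision                              */
    unsigned revision;        /* 0                                                */
    unsigned domain;          /* CPUs reporting the same number seem to share
                                 P-state control (the "dependency domain")        */
    unsigned coord_type;      /* 0xFC = SW_ALL, 0xFD = SW_ANY, 0xFE = HW_ALL      */
    unsigned num_processors;  /* how many logical CPUs are in this domain         */
};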

Doh. It also looks like the set of "currently available" P-states can change (see the "_PPC" object).


Cheers,

Brendan

Re: Processor P-states and power management

Posted: Sun Jan 08, 2012 4:33 am
by rdos
Brendan wrote:What if one chip has an efficient CPU fan, and another chip doesn't (full of dust, air-flow blocked, sitting next to a graphics card so the air it sucks in is already hot, or just plain faulty fan)? In that case, equal load means unequal temperatures.
Then it must be a computer with more than one physical processor module. Those are quite uncommon. Standard PCs are only equipped with a single physical processor module, so that's what I will primarily target. Additionally, I wouldn't be able to debug such a setup because I have no PC with more than one physical processor module.
Brendan wrote:I think we're calculating load differently. You're using percentage of time not idle; while I'm taking task priorities into account. For example, with "CPU_load_rating = high_priority_load * 0.7 + medium_priority_load * 0.3;" if the CPU is constantly running medium priority tasks then the CPU_load_rating would be 30% regardless of which P-state you're using.
I'm aware of that, but I still fail to see how your load could be independent of P-state. Does that mean it is some kind of assumption? Because if load is measured in some way, it must differ between P-states and processor frequencies. It really doesn't matter if you calculate it per priority-class or not.
Brendan wrote:That's probably because you're calculating CPU load in a "simple" way. For example, if the CPUs are spending all their time running extremely high priority tasks or if the CPUs are spending all their time running extremely low priority tasks, then you treat both of these very different cases as "100% load" just because the CPUs aren't spending time idle. In my case, one would be "100% load" and the other might be "20% load" because extremely low priority tasks don't matter so much.
I'm not using priority-classes a lot; rather, all applications have the same priority. Only some system threads are above normal priority, because they act as IRQ-servers or have some other time-critical function. Threads that run with above-normal priority should not use a lot of CPU-time, and if they do, they should not be above normal priority. That leaves us with application programs running at the next-lowest priority (the null thread is lowest). It is usually the application programs that load the processor, and so they should determine P-state, not short above-normal-priority threads.
Brendan wrote:I'd assume there's at least 2 different cases - one where P-states affect all logical CPUs in the chip (but not CPUs in other chips), and one where each core has its own independent P-states that only affect logical CPUs within that core. There might even be intermediate cases (e.g. a quad core where each pair of cores have independent P-states). Of course you should probably use ACPI (the "_PSD" object) to determine what relationships there are between P-state controls and logical CPUs (although to be honest I haven't figured out what tells you which CPUs are in which "dependency domain", and the more time I spend looking into ACPI the more I want to avoid ACPI).
The trick is to discard things that are not important. The ACPI specification only gives you a long list of objects that might be supported. It doesn't tell you which objects are usually supported. You can spend many years in handling all the available ACPI-objects without adding any relevant functionality because in the end nobody supports them. It is better to look at typical configurations and then decide what to implement based on what those support.

Re: Processor P-states and power management

Posted: Sun Jan 08, 2012 4:39 am
by rdos
Now I can also start a new core at any time (previously all cores were started at boot). This means that the system now will only be booted with BSP, and that the P-state manager will start additional cores when load is high enough to warrant this. Now I just need to implement this as well, and then I need to implement C3/C4 or whatever method might be good for shutting down a core when load is too low.

Looking at the P-state table, it is obvious that it is not linear. What that should mean is that additional cores should be started when load increases, instead of decreasing the P-state.

The P-state logic is also kind of complex, since some processors have logic that really cannot be run on multiple cores at the same time, but instead must run on one core only in a P-state thread (AMD K8), while others have better-thought-out logic (AMD K10+) which needs to run on all the cores at the same time. I'm not sure if the latter logic is best implemented with IPIs, or by adding hooks in the scheduler. I think scheduler hooks might be the best, since then it is possible to check if the core is in the correct P-state, and if not, change P-state. Maybe it would be enough to just "hook" the idle procedure (hlt instruction) and move that from the scheduler to the P-state manager? Perhaps also a hook when a thread's time-slice is used up. That should cover all scenarios.
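
In sketch form, the two hooks might look something like this (all of the helper names are placeholders, not the real RDOS interfaces):

Code: Select all

/* Sketch of the two scheduler hooks; every name here is a placeholder. */
extern int  pstate_wanted(void);          /* target decided by the P-state manager */
extern int  pstate_current(int core);     /* what this core is running at now      */
extern void write_pstate_msr(int pstate); /* per-core P-state change (K10+ style)  */

/* Called instead of the scheduler's own HLT when a core has nothing to run. */
void hook_idle(int core)
{
    if (pstate_current(core) != pstate_wanted())
        write_pstate_msr(pstate_wanted());
    __asm__ volatile("sti; hlt");         /* enable interrupts and wait */
}

/* Called when a thread's time-slice is used up. */
void hook_timeslice_expired(int core)
{
    if (pstate_current(core) != pstate_wanted())
        write_pstate_msr(pstate_wanted());
}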

Re: Processor P-states and power management

Posted: Sun Jan 08, 2012 5:33 am
by Owen
Brendan wrote:Doh. It also looks like the set of "currently available" P-states can change (see the "_PPC" object).
Presumably this relates to A) Thermal throttling, and B) TDP-based performance boosting (e.g. AMD TurboCore/Intel TurboBoost)

Re: Processor P-states and power management

Posted: Sun Jan 08, 2012 8:31 am
by rdos
Owen wrote:
Brendan wrote:Doh. It also looks like the set of "currently available" P-states can change (see the "_PPC" object).
Presumably this relates to A) Thermal throttling, and B) TDP-based performance boosting (e.g. AMD TurboCore/Intel TurboBoost)
More likely it relates to processor overheating. If the processor is too hot, it won't allow some of the lowest P-states.

Re: Processor P-states and power management

Posted: Sun Jan 08, 2012 8:30 pm
by Brendan
Hi,
rdos wrote:
Brendan wrote:What if one chip has an efficient CPU fan, and another chip doesn't (full of dust, air-flow blocked, sitting next to a graphics card so the air it sucks in is already hot, or just plain faulty fan)? In that case, equal load means unequal temperatures.
Then it must be a computer with more than one physical processor module. Those are quite uncommon. Standard PCs are only equipped with a single physical processor module, so that's what I will primarily target. Additionally, I wouldn't be able to debug such a setup because I have no PC with more than one physical processor module.
I've got 5 "dual chip" computers here. It's not that uncommon, especially when someone wants more processing power than a single chip can deliver.
rdos wrote:
Brendan wrote:I think we're calculating load differently. You're using percentage of time not idle; while I'm taking task priorities into account. For example, with "CPU_load_rating = high_priority_load * 0.7 + medium_priority_load * 0.3;" if the CPU is constantly running medium priority tasks then the CPU_load_rating would be 30% regardless of which P-state you're using.
I'm aware of that, but I still fail to see how your load could be independent of P-state. Does that mean it is some kind of assumption? Because if load is measured in some way, it must differ between P-states and processor frequencies. It really doesn't matter if you calculate it per priority-class or not.
If you're running a medium priority task that would consume 100% of CPU time for 6 hours at P0, then changing the P-state makes no difference to CPU load (and only affects how long until the task completes its work). If you're running 100 different medium priority tasks that consume 1% of CPU time each (for 6 hours at P0) then it's the same thing (P-state affects how long until all 100 tasks have completed, and has no effect on load).

The way you calculate load ("time spent not idle") you'd call it "100% load" and you'd end up using P0 for 6 hours. The way I calculate load (taking task priorities into account) I'd call it "30% load" and could end up using P3 for 12 hours instead. Because of the way you calculate load, only "100% load" and "0% load" can be constant and unaffected by P-state. Because of the way I'd calculate load (taking task priorities into account), "30%" load can be a constant load that isn't affected by P-state.

Note: I've also noticed that the weightings for my load calculation example were fairly bad - something like "CPU_load_rating = high_priority_load * 1 + medium_priority_load * 0.85 + low_priority_load * 0.6;" would be better. Of course I tend to use 256 different thread priorities so there's probably a more appropriate way for me (e.g. "load_sum += time_running_thread * (255 - thread_priority) / 255; load_time += time_running_thread;" during task switches and "CPU_load_rating = load_sum / load_time; load_sum = 0; load_time = 0;" when adjusting P-state).
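
As a rough sketch of that bookkeeping (fixed-point, ticks instead of seconds, and assuming the idle thread is accounted at priority 255 so that it adds time but no weight):

Code: Select all

static unsigned long load_sum;    /* priority-weighted busy time        */
static unsigned long load_time;   /* total accounted time               */

/* Call at each task switch with how long the outgoing thread ran.
   Priority 0 carries full weight, priority 255 carries none.           */
void account_thread_time(unsigned ticks_running, unsigned thread_priority)
{
    load_sum  += (unsigned long)ticks_running * (255 - thread_priority) / 255;
    load_time += ticks_running;
}

/* Call when it's time to adjust the P-state; returns load in 0..1000. */
unsigned cpu_load_rating(void)
{
    unsigned rating = load_time ? (unsigned)(load_sum * 1000 / load_time) : 0;
    load_sum  = 0;
    load_time = 0;
    return rating;
}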
rdos wrote:The trick is to discard things that are not important. The ACPI specification only gives you a long list of objects that might be supported. It doesn't tell you which objects are usually supported. You can spend many years in handling all the available ACPI-objects without adding any relevant functionality because in the end nobody supports them. It is better to look at typical configurations and then decide what to implement based on what those support.
I'm more likely to go a completely different way - CPU drivers and motherboard drivers, that are used instead of ACPI if possible. :)
rdos wrote:Now I can also start a new core at any time (previously all cores were started at boot). This means that the system now will only be booted with BSP, and that the P-state manager will start additional cores when load is high enough to warrant this. Now I just need to implement this as well, and then I need to implement C3/C4 or whatever method might be good for shutting down a core when load is too low.
The only way to shut down a core is using "CLI;HLT" to put it back into the "wait for SIPI" state (although you should also disable caches and flush them, clear the local APIC's logical destination register, etc). If you can start CPUs whenever you like and shut them down whenever you like; then you've got most of what you need for hot-plug CPU support.
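
In outline, the shutdown sequence might look like this (apic_write() is a placeholder for however the kernel pokes the local APIC, and the exact cache/APIC cleanup will vary - treat it as a sketch, not a tested routine):

Code: Select all

#define APIC_LDR 0xD0                           /* Logical Destination Register */
extern void apic_write(unsigned reg, unsigned value);   /* placeholder */

/* Park this core until it is restarted with INIT-SIPI-SIPI. */
void park_this_core(void)
{
    unsigned long cr0;

    __asm__ volatile("cli");                    /* no more interrupts          */
    apic_write(APIC_LDR, 0);                    /* drop out of logical groups  */

    __asm__ volatile("wbinvd");                 /* flush dirty cache lines     */
    __asm__ volatile("mov %%cr0, %0" : "=r"(cr0));
    cr0 |= 1UL << 30;                           /* CR0.CD = 1: disable caching */
    __asm__ volatile("mov %0, %%cr0" : : "r"(cr0));
    __asm__ volatile("wbinvd");

    for (;;)
        __asm__ volatile("hlt");                /* wait here until restarted   */
}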
Owen wrote:
Brendan wrote:Doh. It also looks like the set of "currently available" P-states can change (see the "_PPC" object).
Presumably this relates to A) Thermal throttling, and B) TDP-based performance boosting (e.g. AMD TurboCore/Intel TurboBoost)
rdos wrote:More likely it relates to processor overheating. If the processor is too hot, it won't allow some of the lowest P-states.
I'm wondering if it's something even uglier. For example, imagine if different P-states use different frequencies and different voltages, where frequency is "per core" and voltage is "per physical chip". Different P-states might only be available if the voltage would be within a range that suits all cores (e.g. if one core is using P6, then P0 isn't available on other cores because there's no voltage that is acceptable for both P6 and P0).


Cheers,

Brendan

Re: Processor P-states and power management

Posted: Mon Jan 09, 2012 2:24 am
by rdos
Brendan wrote:I've got 5 "dual chip" computers here. It's not that uncommon, especially when someone wants more processing power than a single chip can deliver.
OK, but it is not my primary target, so it has low priority, meaning I won't implement it right now.
Brendan wrote:If you're running a medium priority task that would consume 100% of CPU time for 6 hours at P0, then changing the P-state makes no difference to CPU load (and only affects how long until the task completes its work). If you're running 100 different medium priority tasks that consume 1% of CPU time each (for 6 hours at P0) then it's the same thing (P-state affects how long until all 100 tasks have completed, and has no effect on load).

The way you calculate load ("time spent not idle") you'd call it "100% load" and you'd end up using P0 for 6 hours. The way I calculate load (taking task priorities into account) I'd call it "30% load" and could end up using P3 for 12 hours instead. Because of the way you calculate load, only "100% load" and "0% load" can be constant and unaffected by P-state. Because of the way I'd calculate load (taking task priorities into account), "30%" load can be a constant load that isn't affected by P-state.
The way I see your first example is one of two possible scenarios:
1. Somebody is busy-polling hardware and wasting CPU-time. If this is the case, your solution seems relevant
2. Somebody is running a huge calculation. If this is the case, they want it to complete as fast as possible, and it makes the most sense to run it at P0 regardless of priority

The way I see your second example is 100 cooperative threads that run 1% of the time and are blocked 99% of the time. In this case it is clear that they should run at a P-state that lets them keep their preferred cycles, as they might have been designed that way and expect to be run like that. If they, for instance, are designed to wait 100ms and then do some job in a loop, and you select a P-state where they might have to wait 200ms to get the processor because of overload, you have interfered with the programs in unacceptable ways when correct execution could have been sustained at a lower P-state.

The goal of P-states should be to minimize power while not affecting performance and how the user perceives the system. If you select a high P-state in the first scenario, the user will think your system is crap because it takes twice as long to run his calculation on your system compared to on another system.
Brendan wrote:I'm more likely to go a completely different way - CPU drivers and motherboard drivers, that are used instead of ACPI if possible. :)
You won't be able to ignore ACPI for PCI interrupt routings. That's the minimum.
Brendan wrote:The only way to shut down a core is using "CLI;HLT" to put it back into the "wait for SIPI" state (although you should also disable caches and flush them, clear the local APIC's logical destination register, etc). If you can start CPUs whenever you like and shut them down whenever you like; then you've got most of what you need for hot-plug CPU support.
Will CLI;HLT make the core enter C2 or C3? Seems like ACPI defines IO-ports to enter these modes.
Brendan wrote:I'm wondering if it's something even uglier. For example, imagine if different P-states use different frequencies and different voltages, where frequency is "per core" and voltage is "per physical chip". Different P-states might only be available if the voltage would be within a range that suits all cores (e.g. if one core is using P6, then P0 isn't available on other cores because there's no voltage that is acceptable for both P6 and P0).
Could be. That's another reason to use the same P-states. At least within the same physical processor module.

Re: Processor P-states and power management

Posted: Mon Jan 09, 2012 3:53 am
by Brendan
Hi,
rdos wrote:
Brendan wrote:I've got 5 "dual chip" computers here. It's not that uncommon, especially when someone wants more processing power than a single chip can deliver.
OK, but it is not my primary target, so it has low priority, meaning I won't implement it right now.
Your highest priority is half-implementing things, so that they need to be redesigned and rewritten later? ;)
rdos wrote:
Brendan wrote:If you're running a medium priority task that would consume 100% of CPU time for 6 hours at P0, then changing the P-state makes no difference to CPU load (and only affects how long until the task completes its work). If you're running 100 different medium priority tasks that consume 1% of CPU time each (for 6 hours at P0) then it's the same thing (P-state affects how long until all 100 tasks have completed, and has no effect on load).

The way you calculate load ("time spent not idle") you'd call it "100% load" and you'd end up using P0 for 6 hours. The way I calculate load (taking task priorities into account) I'd call it "30% load" and could end up using P3 for 12 hours instead. Because of the way you calculate load, only "100% load" and "0% load" can be constant and unaffected by P-state. Because of the way I'd calculate load (taking task priorities into account), "30%" load can be a constant load that isn't affected by P-state.
The way I see your first example is one of two possible scenarios:
1. Somebody is busy-polling hardware and wasting CPU-time. If this is the case, your solution seems relevant
2. Somebody is running a huge calculation. If this is the case, they want it to complete as fast as possible, and it makes the most sense to run it at P0 regardless of priority
It could just be something like SETI@home. If someone wanted it to complete as fast as possible then they would've asked for high priority, not something half-way between highest and lowest priority.

The way I see both of these examples is "they're examples". A kernel developer's job is to (try to) make sure the OS behaves well regardless of what load is present. As soon as you start trying to predict the load of unknown processes at unknown times on unknown hardware, you've stopped trying to make sure the OS behaves well regardless of what load is present, and have therefore failed to be an effective kernel developer. :)
rdos wrote:The goal of P-states should be to minimize power while not affecting performance and how the user perceives the system.
If that's your goal, just leave it at P0 permanently (and only use C-states). A good OS would care about things like battery life, temperature, acoustics/noise, etc, and wouldn't ignore everything except performance.
rdos wrote:If you select a high P-state in the first scenario, the user will think your system is crap because it takes twice as long to run his calculation on your system compared to on another system.
If you select P0 in the first scenario, the user will think your system is crap because their laptop's battery will be dead in half an hour, even though it's not doing anything too important.
rdos wrote:
Brendan wrote:I'm more likely to go a completely different way - CPU drivers and motherboard drivers, that are used instead of ACPI if possible. :)
You won't be able to ignore ACPI for PCI interrupt routings. That's the minimum.
I will be able to ignore ACPI for PCI interrupts; either because the motherboard driver tells me what they are, or because the device uses MSI, or because I'm able to auto-detect without being told (e.g. start by assuming the device is connected to all possible IRQs, then reduce the number of IRQs it could be connected to as IRQs occur).
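
One way to sketch the elimination idea (purely illustrative - the "device had work pending" test and ignoring shared lines are simplifying assumptions):

Code: Select all

/* Sketch: narrow down which IRQ line one PCI device is using. */
static unsigned short possible_irqs = 0xFFFF;   /* start: could be any of IRQ 0..15 */

/* Call from every IRQ handler, saying whether this device actually had
   something pending (e.g. its status register says so).                 */
void irq_observed(int irq, int device_had_work)
{
    if (!device_had_work)
        possible_irqs &= ~(1 << irq);   /* IRQ fired, device idle: rule it out */
}

/* Returns the IRQ once only one candidate is left, or -1 while ambiguous. */
int device_irq(void)
{
    if (possible_irqs && !(possible_irqs & (possible_irqs - 1)))
        return __builtin_ctz(possible_irqs);
    return -1;
}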
rdos wrote:
Brendan wrote:The only way to shut down a core is using "CLI;HLT" to put it back into the "wait for SIPI" state (although you should also disable caches and flush them, clear the local APIC's logical destination register, etc). If you can start CPUs whenever you like and shut them down whenever you like; then you've got most of what you need for hot-plug CPU support.
Will CLI;HLT make the core enter C2 or C3? Seems like ACPI defines IO-ports to enter these modes.
Taking CPU/s offline isn't something that ACPI currently covers (as far as I know). You'd want to put the CPU into its lowest power C-state and then do "CLI;HLT", but for some reason Linux doesn't do this (I think their highest priority is half-implementing things that will need redesigning later too), so for Linux an offline CPU can consume more power than an online (but idle) CPU.


Cheers,

Brendan

Re: Processor P-states and power management

Posted: Mon Jan 09, 2012 4:18 am
by rdos
Brendan wrote:Your highest priority is half-implementing things, so that they need to be redesigned and rewritten later? ;)
I want to complete what I start out on. If I have too high ambitions, and especially untestable ambitions like multi-processor boards that I don't own, I might never get done with anything, or ship untested code, or both.
Brendan wrote:It could just be something like SETI@home. If someone wanted it to complete as fast as possible then they would've asked for high priority, not something half-way between highest and lowest priority.
If they want to run something like this in the background, they could do it in a cooperative way.
Brendan wrote:The way I see both of these examples is "they're examples". A kernel developer's job is to (try to) make sure the OS behaves well regardless of what load is present. As soon as you start trying to predict the load of unknown processes at unknown times on unknown hardware, you've stopped trying to make sure the OS behaves well regardless of what load is present, and have therefore failed to be an effective kernel developer. :)
I don't predict load, I measure it. :mrgreen:
Brendan wrote:
rdos wrote:The goal of P-states should be to minimize power while not affecting performance and how the user perceives the system.
If that's your goal, just leave it at P0 permanently (and only use C-states). A good OS would care about things like battery life, temperature, acoustics/noise, etc, and wouldn't ignore everything except performance.
No, because you could do the same job at the same performance with lower power, less temperature and noise just by selecting optimal P-states. If you want longer battery-life or less noise you could just tweak the parameters and select lower P-states that don't generate the same performance. It's not a major redesign, just some parameter changes. However, the default would be to keep performance at the lowest possible power.

I still don't see how priority enters the discussion. You would need users to select priorities for all their tasks in order for this to make sense, which, again, is not something I want to have.
Brendan wrote:
rdos wrote:If you select a high P-state in the first scenario, the user will think your system is crap because it takes twice as long to run his calculation on your system compared to on another system.
If you select P0 in the first scenario, the user will think your system is crap because their laptop's battery will be dead in half an hour, even though it's not doing anything too important.
:lol:

Maybe. Depends on what the program does, and if it runs on a laptop at all.

Re: Processor P-states and power management

Posted: Mon Jan 09, 2012 5:05 am
by Brendan
Hi,
rdos wrote:
Brendan wrote:It could just be something like SETI@home. If someone wanted it to complete as fast as possible then they would've asked for high priority, not something half-way between highest and lowest priority.
If they want to run something like this in the background, they could do it in a cooperative way.
If they want to run something like this in the background, can they do it without your OS taking them to P0 permanently?
rdos wrote:
Brendan wrote:The way I see both of these examples is "they're examples". A kernel developer's job is to (try to) make sure the OS behaves well regardless of what load is present. As soon as you start trying to predict the load of unknown processes at unknown times on unknown hardware, you've stopped trying to make sure the OS behaves well regardless of what load is present, and have therefore failed to be an effective kernel developer. :)
I don't predict load, I measure it. :mrgreen:
Wow - I wish I could predict the load of unknown processes at unknown times on unknown hardware.
rdos wrote:
Brendan wrote:
rdos wrote:The goal of P-states should be to minimize power while not affecting performance and how the user perceives the system.
If that's your goal, just leave it at P0 permanently (and only use C-states). A good OS would care about things like battery life, temperature, acoustics/noise, etc, and wouldn't ignore everything except performance.
No, because you could do the same job at the same performance with lower power, less temperature and noise just by selecting optimal P-states. If you want longer battery-life or less noise you could just tweak the parameters and select lower P-states that don't generate the same performance. It's not a major redesign, just some parameter changes. However, the default would be to keep performance at the lowest possible power.
You can't do the same job at the same performance with lower power. You can only make a compromise between performance and power. For high priority tasks you want high performance, for low priority tasks you want lower power. Surely you can see that for medium priority tasks you want something in between?
rdos wrote:I still don't see how priority enters the discussion. You would need users to select priorities for all their tasks in order for this to make sense, which, again, is not something I want to have.
Users don't need to select priorities (although it'd be nice if they could if/when they want to). Software should tell the scheduler what it wants. A thread that's responsible for updating the user interface should be relatively high priority, a thread that does spell checking while the user types could be medium priority, a thread that regenerates search indexes could be low priority. Whoever wrote the code can use reasonable defaults.
rdos wrote:
Brendan wrote:
rdos wrote:If you select a high P-state in the first scenario, the user will think your system is crap because it takes twice as long to run his calculation on your system compared to on another system.
If you select P0 in the first scenario, the user will think your system is crap because their laptop's battery will be dead in half an hour, even though it's not doing anything too important.
:lol:

Maybe. Depends on what the program does, and if it runs on a laptop at all.
For the system I use for most things (a workstation and not a laptop), I'd have to say that fan noise and heat are my biggest problems. I don't think I've ever seen it go above 20% CPU load. It's one of the reasons I think Linux could be much better.


Cheers,

Brendan