Processor P-states and power management

Posted: Tue Jan 03, 2012 9:00 am
by rdos
OK, so I have managed to decode the ACPI _PCT and _PSS processor objects on three different AMD PCs. The _PCT object says to use a hardware-specific register both for control and for reading the current performance status. The _PSS object gives the control and status values for each state, along with core frequency and power consumption. The problem is that the status port address on all three machines is 0, which is pretty useless. The control port is more informative on two of the PCs (it contains the value 0xC0010062, which according to the AMD processor manual is the PERF_CTL MSR). The third AMD PC, which is older, contains the value 0 for the control port as well.

ACPICA doesn't contain any method to handle register descriptors with the 0x7F (functional fixed hardware) type, but since the address value on two PCs seems to be an MSR, it might be a fair guess that the address is an MSR, and that controlling the P-state can be done by writing to this MSR. But that still leaves the case when the address is 0, which means there is no indication at all of how to change the P-state, or what the control values in _PSS actually refer to. Maybe it is reasonable to use the Intel standard MSR (IA32_PERF_CTL) when the address is zero?

Another problem is that one of my machines generates a protection fault when MSR 0xC0010063 (AMD's P-state status MSR) is read, while on another one the read works and gives reasonable values.

Does anybody know how this is supposed to work? Why does ACPI provide more or less useless information regarding this, when they could have defined an MSR access type and provided the real addresses for the hardware?

Re: Processor P-states and power management

Posted: Tue Jan 03, 2012 11:32 am
by Brynet-Inc
I added support for this to OpenBSD on AMD K10+, and yes, those are MSR addresses.

OpenBSD's ACPI implementation evaluates the _PSS object already, so all I had to do was create a state table matching FID with frequency, and write the desired FID to the CONTROL MSR.. you have to delay a bit (..100ms?) before reading the STATUS MSR again, it's in the K10 BKDG (..BIOS and Kernel Developer's Guide).

The STATUS register being 0 is actually correct, and it should mean it's running at full speed.

This method is processor (..and family, actually) specific though, you'll need to add support for AMD K8-style MSRs.. and Intel has a few different ways of doing it. I think there is an ACPI standard way, but I don't think it's widely supported (_PCT?).

I'm curious what CPU faulted when you read the STATUS MSR? Just recently there was an issue brought up about my implementation doing the same with Linux KVM/QEMU.. but that was an emulation error, they were pretending to be K10 but not implementing K10-specific MSRs.

My proposed fix was to avoid the MSR read if the ACPI tables indicated no _PSS/p-states, but apparently AMD also recommends a CPUID check before touching the MSRs at all.

Re: Processor P-states and power management

Posted: Tue Jan 03, 2012 1:44 pm
by rdos
Brynet-Inc wrote:OpenBSD's ACPI implementation evaluates the _PSS object already, so all I had to do was create a state table matching FID with frequency, and write the desired FID to the CONTROL MSR.. you have to delay a bit (..100ms?) before reading the STATUS MSR again, it's in the K10 BKDG (..BIOS and Kernel Developer's Guide).
To me, it seems like it would be best to avoid reading the status MSR at all. Also, it seems like the only thing that is really safe is to keep the performance P-state per system, and not per core, as many processors don't allow individual frequencies or core voltages anyway. Then you just shut down (probably with a higher C-state) or restart cores when the P-state control gets out of bounds (too high or too low load). I actually think I will initially boot up only the BSP, and let the power manager start up the AP cores if/when it is required.
Brynet-Inc wrote:The STATUS register being 0 is actually correct, and it should mean it's running at full speed.
It is not the value that is 0, but the address in _PCT.
Brynet-Inc wrote:This method is processor (..and family, actually) specific though, you'll need to add support for AMD K8-style MSRs.. and Intel has a few different ways of doing it. I think there is an ACPI standard way, but I don't think it's widely supported (_PCT?).
If only _PCT had the model-specific value, I wouldn't have to worry about this, but in one case _PCT has zero for both control and status, and all MSR accesses seem to fail (protection fault) on the processor.
Brynet-Inc wrote:I'm curious what CPU faulted when you read the STATUS MSR? Just recently there was an issue brought up about my implementation doing the same with Linux KVM/QEMU.. but that was an emulation error, they were pretending to be K10 but not implementing K10-specific MSRs.
It was an AMD Athlon 64 X2 Dual Core processor 6000+. Windows lists it as Family 15, model 67. I think it is one of the very first AMD Athlon processors.
Brynet-Inc wrote:My proposed fix was to avoid the MSR read if the ACPI tables indicated no _PSS/p-states, but apparently AMD also recommends a CPUID check before touching the MSRs at all.
If the control address of _PCT contains 0, that is certainly the case. This processor probably has some other (non-MSR) way of doing this.

Re: Processor P-states and power management

Posted: Tue Jan 03, 2012 2:30 pm
by rdos
Now I have also inspected the _CST object on 2 machines. On the AMD 64 X2, this object is not present. On the 6-core AMD, it is present, but the content does not seem to conform to the ACPI specification. Instead of listing one package per C-state, it seems to list only the C-states themselves (and it lists C1 and C2). That probably means that I will discard this information, and not bother to read it. C1 is entered with hlt anyway, so it doesn't need ACPI.

I have also managed to find the P_BLK mentioned in the ACPI specification. It requires some really nasty type-casting in ACPICA to get this information. There really should be an AcpiHandleToObject procedure declared somewhere. :evil:

Anyway, the AMD 64 X2 has no P_BLK (it has length 0). The 6-core AMD does have a P_BLK with the correct length (6 bytes), and thus P_LVL2 and P_LVL3 are accessible for C2 and C3 state.

Re: Processor P-states and power management

Posted: Tue Jan 03, 2012 3:08 pm
by Cognition
ACPICA allows you to register your own bus handlers for certain objects via the AcpiInstallAddressSpaceHandler function. There's probably a specification out there by AMD telling you exactly how to implement a vendor-specific hardware range. I know Intel has a specification describing how to implement its power state control. Essentially, for the Intel systems you're just writing some code that allows ACPI to interact with the power management and performance control registers; there's also an _OSC method you need to interact with to actually indicate what features the OS supports. Personally I haven't poked around in it much, but it sounds like AMD uses a very similar system.

Re: Processor P-states and power management

Posted: Tue Jan 03, 2012 3:18 pm
by Brynet-Inc
rdos wrote:To me, it seems like it would be best to avoid reading the status MSR at all. Also, it seems like the only thing that is really safe is to keep the performance P-state per system, and not per core, as many processors don't allow individual frequencies or core voltages anyway. Then you just shut down (probably with a higher C-state) or restart cores when the P-state control gets out of bounds (too high or too low load). I actually think I will initially boot up only the BSP, and let the power manager start up the AP cores if/when it is required.
Reading the STATUS register is the only way to know if the p-state changed. At least on newer AMD systems, the p-state is synchronized on each core.. but it doesn't hurt to make the change on all cores.
rdos wrote:It is not the value that is 0, but the address in _PCT.
The MSR addresses are hardcoded for AMD K10 processors, the driver I wrote is specifically for family 10 and higher AMD processors.. which only relies on the ACPI _PSS object.
rdos wrote:If only _PCT had the model-specific value, I wouldn't have to worry about this, but in one case _PCT has zero for both control and status, and all MSR accesses seem to fail (protection fault) on the processor.
I have no idea, I don't rely on _PCT existing or being valid.
rdos wrote:It was an AMD Athlon 64 X2 Dual Core processor 6000+. Windows lists it as Family 15, model 67. I think it is one of the very first AMD Athlon processors.
The first Athlon 64's are AMD's K8, not AMD's K10+.

The MSRs are at a totally different address (..CONTROL is 0xc0010041, STATUS is 0xc0010042). There is a K8 BKDG from AMD that explains how to use them.
rdos wrote:If the control address of _PCT contains 0, that is certainly the case. This processor probably has some other (non-MSR) way of doing this.
I don't know how that works, I'm not aware of any systems that have _PSS objects but don't support the appropriate vendor/model specific MSRs.
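
As a rough sketch of the family-based selection discussed above, the following C picks the CONTROL/STATUS MSR pair from the CPUID leaf 1 signature and refuses to return anything unless the matching capability bits are set. The function and struct names are made up for illustration. It assumes the caller has already confirmed the vendor string is AuthenticAMD, and it assumes the capability bits come from EDX of CPUID leaf 0x80000007 (bit 1 FID control, bit 2 VID control, bit 7 hardware P-states); check those bit positions against the BKDG for your family before relying on them. Note that the "Family 15" Windows reports is decimal, i.e. 0xF, which is K8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* MSR addresses as quoted in this thread. */
#define K8_FIDVID_CTL     0xC0010041u
#define K8_FIDVID_STATUS  0xC0010042u
#define K10_PSTATE_CTL    0xC0010062u
#define K10_PSTATE_STATUS 0xC0010063u

/* Assumed capability bits in CPUID 0x80000007 EDX (verify in the BKDG). */
#define APM_FID       (1u << 1)
#define APM_VID       (1u << 2)
#define APM_HW_PSTATE (1u << 7)

struct pstate_msrs {          /* hypothetical helper type */
    uint32_t control;
    uint32_t status;
};

/* Display family: base family (bits 11:8), plus the extended family
   (bits 27:20) when the base family is 0xF. */
unsigned cpu_family(uint32_t leaf1_eax)
{
    unsigned base = (leaf1_eax >> 8) & 0xFu;
    unsigned ext = (leaf1_eax >> 20) & 0xFFu;
    return base == 0xFu ? base + ext : base;
}

/* Display model: extended model (bits 19:16) is prepended to the base
   model (bits 7:4) when the base family is 0xF. */
unsigned cpu_model(uint32_t leaf1_eax)
{
    unsigned base_family = (leaf1_eax >> 8) & 0xFu;
    unsigned base_model = (leaf1_eax >> 4) & 0xFu;
    unsigned ext_model = (leaf1_eax >> 16) & 0xFu;
    return base_family == 0xFu ? (ext_model << 4) | base_model : base_model;
}

/* Returns 1 and fills *out when a usable MSR pair is known, else 0.
   No MSR is touched here; this only decides which ones are safe to try. */
int select_pstate_msrs(uint32_t leaf1_eax, uint32_t apm_edx,
                       struct pstate_msrs *out)
{
    assert(out != NULL);
    unsigned family = cpu_family(leaf1_eax);

    if (family == 0xFu && (apm_edx & APM_FID) && (apm_edx & APM_VID)) {
        out->control = K8_FIDVID_CTL;      /* K8: FID and VID set together */
        out->status = K8_FIDVID_STATUS;
        return 1;
    }
    if (family >= 0x10u && (apm_edx & APM_HW_PSTATE)) {
        out->control = K10_PSTATE_CTL;     /* K10+: write a P-state number */
        out->status = K10_PSTATE_STATUS;
        return 1;
    }
    return 0;                              /* unknown: do not read any MSR */
}
```

With this kind of gate, a family 0xF part would never be asked for 0xC0010063, which is the read that faulted on the Athlon 64 X2.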

Re: Processor P-states and power management

Posted: Tue Jan 03, 2012 3:57 pm
by rdos
Brynet-Inc wrote:The MSRs are at a totally different address (..CONTROL is 0xc0010041, STATUS is 0xc0010042). There is a K8 BKDG from AMD that explains how to use them.
Yes, that works better. When reading the STATUS MSR, it returns the status value specified for the P0 state.

Re: Processor P-states and power management

Posted: Tue Jan 03, 2012 4:29 pm
by rdos
Cognition wrote:ACPICA allows you to register your own bus handlers for certain objects via the AcpiInstallAddressSpaceHandler function.
No, I wouldn't do that as I find the ACPICA environment terrible. Instead, I'm creating the information I need by calling the appropriate ACPICA name-space walks and evaluating the objects I need to evaluate.
Cognition wrote:I know Intel has a specification describing how to implement its power state control.
OK, that is probably why so many machines have nonsensical information in _PCT. They basically tell BIOS developers to put zero in those fields, and fall back on processor-specific drivers. It seems that only AMD K10+ at least encodes the MSR address in the control field.

Re: Processor P-states and power management

Posted: Tue Jan 03, 2012 4:43 pm
by Brynet-Inc
rdos wrote:It is only AMD K10+ that seems to code the MSR value in the control field at least.
K8 is a bit harder, I believe you have to set the VID (..voltage) along with the FID.. and I don't think it's smart enough to figure that out itself, unlike K10+.

But yes, you can set the FID/VID with the control MSR. It's just a little more work. Again, stop guessing and read the K8 BKDG.

Re: Processor P-states and power management

Posted: Wed Jan 04, 2012 3:05 pm
by rdos
I think I need to implement the statistics functions required for processor power management first. The primary information needed is percentage of time spent running threads per core (or percentage of time spent doing nothing, null thread). This information could also be interesting to present to end-users as diagrams per core. Additionally, there is a need to know percentage of time a core is in active state (as opposed to shut-down).

Today, I track the number of tics (one tic is 1/1.19 us) each thread has spent executing code. This information is also available for each core's null thread. IOW, I know the number of tics each core has spent doing nothing. What I don't know is the total time each core has been active, but that might not be needed either.

Possible algorithm:
1. Record system time (in tics)
2. Read out the number of tics each core's null-thread has used
3. Wait for some time (for instance, 100ms)
4. Record system time and calculate elapsed time since last read
5. Read out each core's null-thread tics again, and subtract the previous value
6. Load per core can now be calculated as 100 - diff (core null-thread tics) / diff (system time) * 100

This algorithm cannot handle stopped cores directly, since this information would not be available.
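
A minimal sketch of step 6 in C, assuming both the null-thread counter and the system time are 64-bit tic counts sampled at the start and end of the interval. The function name and parameters are invented for illustration and are not part of any existing API. The clamp is an extra safety measure, added here because the two counters are not read atomically.

```c
#include <assert.h>
#include <stdint.h>

/* Load of one core in percent over a single sampling interval.
   null_prev/null_now: the core's null-thread tics at both sample points.
   time_prev/time_now: system time in tics at both sample points. */
double core_load_percent(uint64_t null_prev, uint64_t null_now,
                         uint64_t time_prev, uint64_t time_now)
{
    assert(time_now > time_prev);      /* an empty interval has no load */

    uint64_t idle = null_now - null_prev;
    uint64_t elapsed = time_now - time_prev;

    /* The counters are read at slightly different moments, so the idle
       difference can exceed the elapsed time by a few tics. */
    if (idle > elapsed)
        idle = elapsed;

    return 100.0 - (double)idle / (double)elapsed * 100.0;
}
```

For example, with 119000 elapsed tics (roughly 100 ms) and 29750 idle tics, the core was idle a quarter of the time and the load comes out as 75%.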

Another algorithm might be:
1. When adding used tics to a thread, also add these tics to a counter for the current core
2. Load can now be calculated directly as 100 - core null-thread tics / total core tics * 100, or as a difference when these values are sampled as above (elapsed time is not needed in this scenario).

In order to handle a stopped core, it would simply be prohibited to add to total core tics when it is in stopped state. The duty cycle (active state) can now be calculated as total core tics / total elapsed tics * 100.

Additionally, in the presentation, a stopped core is now indicated when the same value for total core tics is read twice in a row. In the diagram, this can be presented by not drawing a line. This would be sensible as a stopped core would yield insufficient data for calculating load (divide by 0).
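
The second algorithm, including the stopped-core case, could look roughly like this in C. The struct and function names are hypothetical. The sketch assumes that core tics include the null-thread tics (the null thread is a thread like any other) and that all three counters are sampled together.

```c
#include <assert.h>
#include <stdint.h>

/* One sample of the per-core counters (hypothetical layout). */
struct core_sample {
    uint64_t null_tics;   /* tics used by the core's null thread */
    uint64_t core_tics;   /* tics used by all threads on the core */
    uint64_t total_tics;  /* elapsed system tics */
};

/* Stores the load in percent and returns 1. Returns 0 when the core
   accumulated no tics between the samples, i.e. it was stopped, so the
   caller can skip drawing instead of dividing by zero. */
int core_load(const struct core_sample *prev, const struct core_sample *now,
              double *load_percent)
{
    uint64_t core = now->core_tics - prev->core_tics;
    uint64_t idle = now->null_tics - prev->null_tics;

    if (core == 0)
        return 0;
    if (idle > core)
        idle = core;      /* guard against skew between the two counters */

    *load_percent = 100.0 - (double)idle / (double)core * 100.0;
    return 1;
}

/* Duty cycle in percent: share of elapsed time the core was running. */
double core_duty(const struct core_sample *prev, const struct core_sample *now)
{
    uint64_t core = now->core_tics - prev->core_tics;
    uint64_t total = now->total_tics - prev->total_tics;

    assert(total > 0);
    return (double)core / (double)total * 100.0;
}
```

A core that ran for 1000 of 2000 elapsed tics and idled for 300 of them reports 70% load at a 50% duty cycle.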

Possible API:
RdosGetCoreLoad(int core, long long *null_tics, long long *core_tics, long long *total_tics)

Edit: More suitable interface:
C: RdosGetCoreLoad(int core, long long *null_tics, long long *core_tics);
Asm:
IN: AX = Core
OUT: EDX:EAX core tics
OUT: ECX:EBX null tics

and
C: RdosGetCoreDuty(int core, long long *core_tics, long long *total_tics);
Asm:
IN: AX = Core
OUT: EDX:EAX core tics
OUT: ECX:EBX total tics

Both a graphical user-mode presentation application and the p-state manager could use this information.

The actual operating frequency (per system, not per core) could also be retrieved. This would overload the normal CPU frequency API measured at boot time. This parameter could also be plotted in the presentation of CPU performance / load.

Re: Processor P-states and power management

Posted: Thu Jan 05, 2012 8:27 am
by rdos
OK, so the performance monitor user-app is done, and it seems to work. When I use a one second interval between samples the load curves look quite smooth, even if load varies. A 250ms sampling interval also works OK, at least if the graph isn't redrawn every time, because then the monitor app itself loads the system too much. The 250ms data is pretty smooth as well. When a 25ms sampling interval is used, there is a lot of noise in the data (several tens of percent), which makes such fast sampling unusable. 100ms is intermediate, with quite some noise. I would probably use a 250ms interval for adjusting p-states, possibly 100ms if the adjustments are made small (for instance, max one state at a time).

Re: Processor P-states and power management

Posted: Thu Jan 05, 2012 6:56 pm
by Brendan
Hi,

Imagine there's 2 different controls that both range from 0.0 to 1.0. These controls might be set by the user or administrator, or whatever - doesn't matter.

The first control is for how noisy the OS is allowed to be. It controls things like the maximum fan speed the OS will use (but would also affect disk drives). For server rooms people would set it to "no noise limit" (1.0), for things like bedrooms people would set it to "minimum noisiness" (0.0) and for offices people might want it at "average noisiness" (0.5). As the CPU gets hotter you increase the fan speed to compensate, until you reach the maximum fan speed that the "noisiness" control allows. When you're at the maximum allowed fan speed, if the CPU gets hotter you compensate by adjusting a (per CPU) "P-state limit" variable. Of course you'd do the reverse if the CPU is cooling down (change the "P-state limit" variable closer to P0, and decrease fan speed if the "P-state limit" variable is already at P0).

The second control is how much power to consume, where 0.0 is "minimum power usage" and 1.0 is "maximum performance". Every task switch you determine the priority of the new task (e.g. 1.0 for highest priority task, 0.0 for idle task) and multiply that by the "power/performance control" to get a desired speed and determine the "desired P-state".

Once you've got all this, you calculate a "target P-state", which is whatever is slowest. For example, if "desired P-state" is P3 and the "P-state limit" variable is P5, then use P5. If the "desired P-state" is P2 and the "P-state limit" variable is P0, then use P2.
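
In C, the "whatever is slowest" rule is just a max over P-state numbers, since a higher number means a slower state. The sketch below also maps the desired speed to a P-state; that linear mapping is an assumption made for the example (the post does not say how speed maps to states), and all names are invented.

```c
#include <assert.h>

/* Desired P-state from task priority and the power/performance control,
   both in the range 0.0 to 1.0. Assumes P-states are evenly spaced in
   speed, with P0 fastest and P(num_pstates - 1) slowest. */
int desired_pstate(double priority, double power_control, int num_pstates)
{
    assert(num_pstates > 0);

    double speed = priority * power_control;        /* 0.0 .. 1.0 */
    int slowest = num_pstates - 1;
    int pstate = (int)((1.0 - speed) * slowest + 0.5);

    if (pstate < 0)
        pstate = 0;
    if (pstate > slowest)
        pstate = slowest;
    return pstate;
}

/* Target P-state: the slower of the desired state and the thermal limit. */
int target_pstate(int desired, int limit)
{
    return desired > limit ? desired : limit;
}
```

This reproduces the two cases from the text: desired P3 with limit P5 gives P5, and desired P2 with limit P0 gives P2.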

The target P-state isn't the actual P-state though - you want to smooth it out a bit so that you're not changing the P-state too often, and limit the number of P-state changes per second.

For "smoothing it out", you do not want to (for e.g.) calculate the "average P-state" each second and use that alone. You want to do something more like "new_Pstate = average_Pstate_for_this_second * 0.5 + old_Pstate * 0.5". This prevents the P-state from fluctuating too much too quickly, which may be necessary to avoid oscillations. For example, imagine if you use "average P-state" alone and the CPU is under constant load, and the P-state is changed from "lowest speed" to "flat out". The CPU temperature will rise quickly, causing the "P-state limit" variable to change quickly, causing the P-state to be changed back from "flat out" back to "lowest speed", causing the CPU temperature to drop quickly, causing the "P-state limit" variable to change quickly, causing the P-state to be changed from "lowest speed" back to "flat out" again. It's a feedback loop that causes the CPU speed, temperature and fan speed to oscillate, even though load is constant.
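
The smoothing formula is a simple exponential moving average with a weight of 0.5. A small C sketch, with invented function names, shows how quickly it follows a step change when it is applied once per second. The values are treated as plain numbers here; rounding to the nearest supported P-state is left out.

```c
#include <assert.h>

/* One smoothing step: new = average * 0.5 + old * 0.5. */
double smooth_step(double old_value, double average)
{
    return average * 0.5 + old_value * 0.5;
}

/* Value after applying the smoothing step `steps` times while the
   measured average stays constant at `target`. */
double smooth_after(double start, double target, int steps)
{
    assert(steps >= 0);

    double value = start;
    for (int i = 0; i < steps; i++)
        value = smooth_step(value, target);
    return value;
}
```

Starting at 0.0 with a constant target of 1.0, the value is 0.5 after one step, 0.75 after two and 0.875 after three, so each step halves the remaining distance to the target.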


Cheers,

Brendan

Re: Processor P-states and power management

Posted: Fri Jan 06, 2012 5:19 am
by rdos
Well, the p-state itself only decreases performance from a state that should be sustainable, so I don't think p-state changes are prone to oscillations. I think you might mean boosting (throttling) states, which are separate from p-states. When throttling is used, it is not sustainable, and so it needs to be done more smartly. For the moment, I'm not going to implement throttling, only p-states.

The primary problem with p-state changes is that they are relatively slow. It is quite likely not possible to do this in the scheduler, as it can potentially change this much too fast. OTOH, you would not want to use seconds between adjustments either, because that leads to users perceiving the system as slow. You don't want to have a situation where the user starts some heavy application, and it is run at perhaps 10% of the normal speed for several seconds. In order for users to perceive the system as responsive, p-state changes to lower states (higher performance) must be done within perhaps a few hundred ms, not within seconds. The other way is less problematic.

I think I will use a fuzzy regulator for this.

Re: Processor P-states and power management

Posted: Fri Jan 06, 2012 6:11 am
by Brendan
Hi,
rdos wrote:Well, the p-state itself only decreases performance from a state that should be sustainable, so I don't think p-state changes are prone to oscillations. I think you might mean boosting (throttling) states, which are separate from p-states. When throttling is used, it is not sustainable, and so it needs to be done more smartly. For the moment, I'm not going to implement throttling, only p-states.
You mean that you aren't going to think about using P-states to control CPU temperature at the moment, and you're going to implement something that breaks as soon as you do implement CPU temperature management properly?
rdos wrote:The primary problem with p-state changes is that they are relatively slow. It is quite likely not possible to do this in the scheduler, as it can potentially change this much too fast. OTOH, you would not want to use seconds between adjustments either, because that leads to users perceiving the system as slow. You don't want to have a situation where the user starts some heavy application, and it is run at perhaps 10% of the normal speed for several seconds. In order for users to perceive the system as responsive, p-state changes to lower states (higher performance) must be done within perhaps a few hundred ms, not within seconds. The other way is less problematic.
The time that P-state changes take depends on which CPU. For the rough formula I used as an example in my last post, you'd be running at 50% within 1 second, at 75% within 2 seconds and at 87.5% within 3 seconds. The main flaw in my example is that you'd never actually reach 100% (but you'd round to the nearest supported P-state, so that doesn't matter much).
rdos wrote:I think I will use a fuzzy regulator for this.
Of course - nobody would be silly enough to use boring old algebra with variables that range from 0 to 1 when there's a way to ignore most of the common operators and use the power of hype instead.... :roll:


Cheers,

Brendan

Re: Processor P-states and power management

Posted: Fri Jan 06, 2012 9:38 am
by rdos
Brendan wrote:You mean that you aren't going to think about using P-states to control CPU temperature at the moment, and you're going to implement something that breaks as soon as you do implement CPU temperature management properly?
That's correct. I'm not going to implement temperature control right now, and especially not with p-states. The BIOS already sets up the cores in P0, and SMM handles catastrophic overtemperatures (along with some processors as well), so what I want to do is to put cores at suitable power-consumption states that are lower power than P0. Additionally, I'll implement the functions to start/stop cores depending on load. I don't think I will bother with throttling at all, as this is typically most useful for games on desktop computers, and that is not my "thing".

So if I decide to change my mind, and build-in throttling and temperature management, that is perfectly possible to do on top of the p-state interface. That is basically only a sub-mode of P0 state.
Brendan wrote:The time that P-state changes take depends on which CPU. For the rough formula I used as an example in my last post, you'd be running at 50% within 1 second, at 75% within 2 seconds and at 87.5% within 3 seconds. The main flaw in my example is that you'd never actually reach 100% (but you'd round to the nearest supported P-state, so that doesn't matter much).
I don't find that satisfactory. You should be able to get from 10% to 100% in 100ms, worst case 250ms, otherwise it is too slow. The other direction could take seconds, that's no problem, but decreasing p-state should be fast.
Brendan wrote:Of course - nobody would be silly enough to use boring old algebra with variables that range from 0 to 1 when there's a way to ignore most of the common operators and use the power of hype instead.... :roll:
I've used fuzzy logic before, and it is efficient for certain types of problems that are hard to describe with deterministic rules. In this case, you want sudden increases in load to affect p-states fast, while also keeping the system at optimal load (around 50%). That is a problem that isn't easy to formulate with algebra.

Example of rules:

1. If load is very high, quickly go to P0
2. If load is high, decrease p-state one mark
3. If load is intermediate, your system is fine, and you do nothing
4. If load is low, increase p-state one mark

Conditions: (just a preliminary example)

A. Very high is a slope from 75% to 90%, and then 1.0
B. High has a slope from 50% to 65%, is 1.0 from 65% to 80% and then has a slope to 90% where it is 0.0
C. Intermediate has a slope from 25% to 40%, is 1.0 from 40% to 60%, and then has a slope to 75% where it is 0.0
D. Low is 1.0 up to 25%, then has a slope to 40% where it is 0.0

How would you formulate this with algebra?
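
For reference, conditions A to D can be written as piecewise-linear membership functions, and rules 1 to 4 as a choice between them. The C sketch below is only one possible reading: the post does not say how the rules are combined, so picking the rule with the highest membership (with ties going to the more aggressive rule) is an assumption, and all function names are invented.

```c
#include <assert.h>

/* Rising edge: 0.0 at or below lo, 1.0 at or above hi, linear between. */
double ramp_up(double x, double lo, double hi)
{
    if (x <= lo)
        return 0.0;
    if (x >= hi)
        return 1.0;
    return (x - lo) / (hi - lo);
}

/* Falling edge: 1.0 at or below lo, 0.0 at or above hi. */
double ramp_down(double x, double lo, double hi)
{
    return 1.0 - ramp_up(x, lo, hi);
}

double min2(double a, double b)
{
    return a < b ? a : b;
}

/* Conditions A to D; load is in percent. */
double mu_very_high(double load)    { return ramp_up(load, 75.0, 90.0); }
double mu_high(double load)         { return min2(ramp_up(load, 50.0, 65.0), ramp_down(load, 80.0, 90.0)); }
double mu_intermediate(double load) { return min2(ramp_up(load, 25.0, 40.0), ramp_down(load, 60.0, 75.0)); }
double mu_low(double load)          { return ramp_down(load, 25.0, 40.0); }

/* Rules 1 to 4, resolved by the strongest membership (an assumption).
   P0 is the fastest state and max_pstate the slowest. */
int next_pstate(double load, int current, int max_pstate)
{
    assert(current >= 0 && current <= max_pstate);

    double vh = mu_very_high(load);
    double h = mu_high(load);
    double m = mu_intermediate(load);
    double l = mu_low(load);

    if (vh >= h && vh >= m && vh >= l)
        return 0;                                          /* rule 1 */
    if (h >= m && h >= l)
        return current > 0 ? current - 1 : 0;              /* rule 2 */
    if (m >= l)
        return current;                                    /* rule 3 */
    return current < max_pstate ? current + 1 : max_pstate; /* rule 4 */
}
```

At 82.5% load, "very high" is 0.5 and "high" is 0.75, so this reading steps one state toward P0 instead of jumping straight to it. A weighted combination of the rule outputs would be another valid way to resolve the overlap.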