Determining if a thermal event occurred

Post by limp »

Dear all,

Brendan mentioned in this post http://forum.osdev.org/viewtopic.php?f= ... hilit=hpet that:
Brendan wrote: for some CPUs the TSC doesn't run at a fixed frequency, but you can use the "thermal sensor" IRQ (in the local APIC) to determine when the speed of the RDTSC changes; and therefore you could still use the TSC for precise timing by doing something like "virtual_ticks += (current_TSC_count - last_TSC_count) * scaling_factor" (where the scaling factor is changed whenever the RDTSC speed changes).
I have an Intel Atom 330, which unfortunately doesn’t have an invariant TSC. So what I want to do is check, before and after taking some timing measurements, whether a thermal event occurred during them (so that I know if the CPU frequency changed). If none occurred, I’ll assume my measurements are accurate; otherwise I’ll re-take them.

Brendan suggested using the "thermal sensor" IRQ (in the local APIC) to determine when the speed of the RDTSC changes, but I don’t really understand how this could be done in practice.

Probably what he means is to check the “Thermal Status Log” flag (bit 1) of the IA32_THERM_STATUS MSR before and after the measurements. The problem I’ve got is that the Intel manuals say this MSR was introduced in family 0x0F, model 0x0 (i.e. Intel Xeon processors).
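
The before/after check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the MSR address (0x19C) and the bit positions are taken from the Intel SDM, the helper names are made up for this example, and whether a given CPU implements the MSR at all is exactly the open question here. The helpers work on a value that has already been read from the MSR, because the read itself (RDMSR) only works in ring 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* MSR address and bit positions as documented in the Intel SDM. */
#define IA32_THERM_STATUS     0x19Cu
#define THERM_STATUS_BIT      (UINT64_C(1) << 0)  /* thermal monitor active right now */
#define THERM_STATUS_LOG_BIT  (UINT64_C(1) << 1)  /* sticky: set since the last clear */

/* Hypothetical helper: true if the sticky log flag is set in a value
 * that was read from IA32_THERM_STATUS. */
static bool thermal_event_logged(uint64_t therm_status) {
    return (therm_status & THERM_STATUS_LOG_BIT) != 0;
}

/* Hypothetical helper: the value to write back to IA32_THERM_STATUS
 * in order to clear the sticky log flag before a measurement. */
static uint64_t thermal_log_cleared(uint64_t therm_status) {
    return therm_status & ~THERM_STATUS_LOG_BIT;
}

/* Hypothetical helper: a measurement is trusted only if, afterwards,
 * the CPU is not throttling and nothing was logged in the meantime. */
static bool measurement_is_clean(uint64_t status_after) {
    return (status_after & (THERM_STATUS_BIT | THERM_STATUS_LOG_BIT)) == 0;
}
```

In a kernel the sequence would be: read the MSR, write back thermal_log_cleared(value), run the measurement, read the MSR again and keep the result only if measurement_is_clean(value) holds.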

Does that mean that my Intel Atom processor (family=0x6, model=0x1C) doesn’t support it? If it doesn’t, how can I determine whether a thermal event occurred in my case?

Thanks a lot for your time.

Regards,
limp.
Last edited by limp on Wed Aug 10, 2011 4:40 am, edited 1 time in total.

Post by Karlosoft »

http://ark.intel.com/products/35641
According to Intel, the Atom 330 doesn't support that feature.

Post by Brendan »

Hi,
limp wrote:Brendan mentioned to use the "thermal sensor" IRQ (in the local APIC) to determine when the speed of the RDTSC changes but I don’t really understand how this could be done in practice.
The idea is to use the (variable frequency) TSC to obtain a (fixed rate) counter, by adjusting for frequency changes.

Imagine if you've got a function like this:

Code: Select all

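#include <stdint.h>
#include <x86intrin.h>   // __rdtsc() on GCC/Clang

uint64_t virtual_ticks = 0;
uint64_t last_TSC_count = 0;
double scaling_factor = 1.0;

uint64_t get_virtual_ticks(void) {
    uint64_t current_TSC_count;

    current_TSC_count = __rdtsc();
    // Scale the elapsed TSC ticks, then convert back to an integer before adding,
    // so the 64-bit total is never rounded through a double
    virtual_ticks += (uint64_t)((current_TSC_count - last_TSC_count) * scaling_factor);
    last_TSC_count = current_TSC_count;
    return virtual_ticks;
}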
Now imagine a "thermal status" IRQ handler, sort of like this:

Code: Select all

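// Example value only: if the TSC runs at half speed while throttled,
// each TSC tick has to count as two "normal speed" ticks
double scaling_factor_when_throttled = 2.0;

void thermal_status_IRQ(void) {

    // Update the virtual ticks counter first to account for time passed before the speed change

    get_virtual_ticks();

    // Set new scaling factor (for new speed)
    // Bit 0 of IA32_THERM_STATUS is the "thermal status" flag (set while the thermal monitor is active)

    if( (getMSR(IA32_THERM_STATUS_MSR) & 1) == 0) {
        // Speed changed from "slow" to "normal"
        scaling_factor = 1.0;
    } else {
        // Speed changed from "normal" to "slow"
        scaling_factor = scaling_factor_when_throttled;
    }
}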
In this case, "get_virtual_ticks()" would return a fixed-rate counter value that isn't affected by automatic thermal throttling, and you could use it to measure how much time has passed. For example:

Code: Select all

    uint64_t start_time;
    uint64_t time_taken;

    start_time = get_virtual_ticks();
    do_something();
    time_taken = get_virtual_ticks() - start_time;
However, it's not that simple, and the code above is just a rough outline to describe the idea better.

The global variables would have to be "per CPU" variables (and be "volatile"), and you'd need some re-entrancy protection (although disabling IRQs should work, because if it's all "per CPU" you don't need to worry about other CPUs). It also won't handle software controlled throttling (you'd have to keep track of "scaling_factor_for_last_speed_requested_by_software" instead of assuming that "normal speed" means 100%).

You'd also have to find a way to determine the right value for "scaling_factor_when_throttled" (which depends on the CPU and how the firmware set it up). At a minimum you'd want some way of auto-detecting it to use as a fallback (e.g. the first time the CPU is throttled, use the PIT or some other timer to measure the frequency of the TSC during thermal throttling). For some CPUs there are MSRs that you might be able to use to determine the frequency of the TSC during thermal throttling, so (depending on CPU, available information, etc.) in some cases you might be able to avoid the "fallback" option.
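
As a rough sketch of that fallback, the arithmetic could look like this. It assumes the scaling factor is defined as "normal TSC frequency divided by throttled TSC frequency" (so that virtual ticks keep advancing at the normal rate), and that the reference timer is the PIT with its standard 1.193182 MHz input clock. The function names are made up for illustration, and actually sampling the TSC and the PIT is left out.

```c
#include <stdint.h>

/* Input clock of the PIT in Hz (the standard PC value). */
#define PIT_HZ 1193182.0

/* TSC frequency in Hz, given how many TSC ticks elapsed while the
 * PIT counted pit_ticks ticks. */
static double tsc_hz_from_pit(uint64_t tsc_delta, uint32_t pit_ticks) {
    return (double)tsc_delta * PIT_HZ / (double)pit_ticks;
}

/* Factor that converts TSC ticks counted while throttled into
 * "normal speed" ticks (assumed definition, see the text above). */
static double throttled_scaling_factor(double normal_tsc_hz,
                                       double throttled_tsc_hz) {
    return normal_tsc_hz / throttled_tsc_hz;
}
```

For example, a TSC that normally runs at 1.6 GHz and is measured at 0.8 GHz during throttling would give a factor of 2.0. Note that the PIT counter is only 16 bits wide, so a single measurement window is limited to roughly 55 ms unless wrap-arounds are counted.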
limp wrote:The problem I’ve got is that Intel manuals say that the aforementioned MSR has been introduced as an MSR in the 0x0F family, version 0x0 (i.e. Intel Xeon Processors).

Does that mean that my Intel Atom processor (family=0x6, model=0x1C) doesn’t support it? If not, how can I determine if a thermal event occurred in my case?
Some CPUs with "family = 6" are newer and do support it, and some CPUs with "family = 6" are older and don't support it. You have to look at the model number to determine which, um, "sub-family"(?) the CPU is.

Chronological order (for Intel) is "..., Pentium III (family 6), Pentium 4 (family F), Pentium M (family 6), ..., Sandy Bridge (family 6)".
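
To make the family/model check concrete, here's a minimal sketch of decoding the signature that CPUID leaf 1 returns in EAX, following the "display family/model" rules from Intel's CPUID documentation. The struct and function names are made up for this example, and it only decodes a value; executing CPUID itself is left out.

```c
#include <stdint.h>

/* Hypothetical container for the decoded signature. */
typedef struct {
    uint32_t family;
    uint32_t model;
    uint32_t stepping;
} cpu_signature;

/* Decode the value returned in EAX by CPUID leaf 1. */
static cpu_signature decode_cpu_signature(uint32_t eax) {
    uint32_t base_family = (eax >> 8) & 0xF;
    uint32_t base_model  = (eax >> 4) & 0xF;
    uint32_t ext_family  = (eax >> 20) & 0xFF;
    uint32_t ext_model   = (eax >> 16) & 0xF;
    cpu_signature sig;

    sig.stepping = eax & 0xF;
    /* The extended family is only added when the base family is 0xF. */
    sig.family = (base_family == 0xF) ? base_family + ext_family
                                      : base_family;
    /* The extended model supplies the high nibble for families 6 and 0xF. */
    sig.model = (base_family == 0x6 || base_family == 0xF)
                    ? (ext_model << 4) | base_model
                    : base_model;
    return sig;
}
```

With this, a signature of 0x000106C2 decodes to family 0x6, model 0x1C, which matches the values limp reported. The model field is what separates the older and newer "family 6" CPUs.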


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.

Post by limp »

Thanks Brendan,

I see that it’s quite complex to implement what you’re saying so I’ll try to avoid doing that at least for now.

According to my understanding, the only reason the CPU frequency (and hence the TSC rate) should change is that a thermal monitoring event forces it to change. So, if I ensure that no thermal events happened during my measurements, I can assume that the CPU frequency (and hence the TSC rate) was constant and the measurements are accurate (no need to mess with scaling factors etc.). Do you see anything wrong with my assumption?
Brendan wrote: Some CPUs with "family = 6" are newer and do support it, and some CPUs with "family = 6" are older and don't support it. You have to look at the model number to determine which, um, "sub-family"(?) the CPU is.

Chronological order (for Intel) is "..., Pentium III (family 6), Pentium 4 (family F), Pentium M (family 6), ..., Sandy Bridge (family 6)".
That’s a really weird chronological ordering Intel is using! As I said, my Intel Atom processor is family=0x6 and model=0x1C, so I guess it’s newer than 0x0F family, version 0x0, which is where the IA32_THERM_STATUS MSR was originally introduced. Is there an Intel document or something similar that lists Intel’s CPU families/models in chronological order?

One last thing that looked weird: the processor’s spec page says it doesn’t support "Thermal Monitoring Technologies", but the processor’s datasheet says it supports TM1. What does Intel mean by “Thermal Monitoring Technologies”? Is this something other than TM1 and TM2? By the way, using CPUID I found that both TM1 and TM2 are available on my processor.
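
For reference, a CPUID check of this kind could be sketched like this. The bit positions are the ones the Intel SDM documents for CPUID leaf 1 (EDX bit 22 for the thermal monitor MSRs and software-controlled clock modulation, EDX bit 29 for TM1, ECX bit 8 for TM2). The struct and function names are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1 feature bits, as documented in the Intel SDM. */
#define CPUID1_EDX_ACPI (UINT32_C(1) << 22) /* thermal MSRs + clock modulation */
#define CPUID1_EDX_TM   (UINT32_C(1) << 29) /* Thermal Monitor (TM1) */
#define CPUID1_ECX_TM2  (UINT32_C(1) << 8)  /* Thermal Monitor 2 (TM2) */

/* Hypothetical container for the decoded feature flags. */
typedef struct {
    bool thermal_msrs;
    bool tm1;
    bool tm2;
} thermal_features;

/* Decode the ECX and EDX values returned by CPUID leaf 1. */
static thermal_features decode_thermal_features(uint32_t ecx, uint32_t edx) {
    thermal_features f;

    f.thermal_msrs = (edx & CPUID1_EDX_ACPI) != 0;
    f.tm1 = (edx & CPUID1_EDX_TM) != 0;
    f.tm2 = (ecx & CPUID1_ECX_TM2) != 0;
    return f;
}
```

With GCC or Clang, the ECX and EDX values can be obtained with __get_cpuid(1, &eax, &ebx, &ecx, &edx) from <cpuid.h>. If thermal_msrs comes back true, the SDM lists IA32_THERM_STATUS as available.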

Regards,
limp.

Post by Brendan »

Hi,
limp wrote:According to my understanding, the only reason why the CPU frequency (and hence the TSC rate) should change is in the case that a thermal monitoring event forces it to change. So, if I ensure that no thermal events have happened during my measurements, I’ll assume that the CPU frequency (hence the TSC rate) was constant so the measurements accurate (no need to do mess with scaling factors etc.). Do you find something wrong to my assumption?
It really depends what you're trying to measure (and if it's possible and practical to restart/re-measure if speed changed part way through). My stuff was intended to measure "wall clock time" (e.g. keeping track of "number of nanoseconds since 2000"); and what you're doing sounds more like profiling (e.g. determining how fast a specific piece of code is).

The normal approach for using the TSC for profiling is to do a few hundred tests, then discard statistical anomalies (e.g. results from individual tests that are much higher than all the others), then average the remaining tests to get the final result (e.g. out of 100 tests you might discard 5 of them, then do "average = sum(kept_tests[]) / (number_of_kept_tests)"). Because this sort of testing is fairly useless, most people don't care about the (hopefully small) chance of thermal throttling affecting the results.
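
A minimal sketch of that averaging step, assuming the simplest possible rule for "anomalies" (just drop a fixed number of the largest samples); the function names are made up for illustration:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for uint64_t, written to avoid overflow from subtraction. */
static int compare_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sort the samples in place, drop the `discard` largest ones and
 * return the mean of the rest (0.0 if nothing would be left). */
static double trimmed_average(uint64_t *samples, size_t count, size_t discard) {
    if (discard >= count)
        return 0.0;

    qsort(samples, count, sizeof samples[0], compare_u64);

    size_t kept = count - discard;
    double sum = 0.0;
    for (size_t i = 0; i < kept; i++)
        sum += (double)samples[i];
    return sum / (double)kept;
}
```

Each sample would be the TSC delta for one run of the code being profiled, so for 100 runs something like trimmed_average(samples, 100, 5) matches the numbers in the example above. A run that happened to be throttled would normally show up as one of the large values that gets dropped.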
limp wrote:
Brendan wrote: Some CPUs with "family = 6" are newer and do support it, and some CPUs with "family = 6" are older and don't support it. You have to look at the model number to determine which, um, "sub-family"(?) the CPU is.

Chronological order (for Intel) is "..., Pentium III (family 6), Pentium 4 (family F), Pentium M (family 6), ..., Sandy Bridge (family 6)".
That’s really weird chronological order system Intel is using! As I said, my Intel Atom processor is family=0x6 and model=0x1C so I guess it’s newer than 0x0F family, version 0x0 which is where the IA32_THERM_STATUS MSR originally introduced. Is there an Intel document or something having a list of Intel’s CPUs family/models in chronological order?
Not sure - the list of Intel CPUs on Wikipedia is a logical ordering, but they've added "chronological entries" so it can also be used to get a good idea of the chronological order.
limp wrote:One last thing that looked weird is that on the processor’s spec page, it is mentioned that it doesn’t support "Thermal Monitoring Technologies" however; the processor’s datasheet mentions that it supports TM1. What does Intel mean by “Thermal Monitoring Technologies”? Is this something else than TM1, TM2. By the way, I found using CPUID that both TM1 and TM2 are available on my processor.
I have no idea what Intel's marketing people think "Thermal Monitoring Technologies" might be (and I suspect nobody outside Intel's marketing department really knows either). It might mean there's a digital thermometer built in that software can use to determine CPU temperature, but that's just a random guess.


Cheers,

Brendan