Re: Performance Monitoring Counters on Intel Celeron M
Posted: Fri Jun 17, 2011 4:46 am
Hi,
limp wrote:
    Brendan wrote:
        It would seem strange to use performance monitoring counters (which are intended for very thorough and accurate profiling) to implement "poor man's profiling", so that only leaves the watchdog timer. The slow speed you're looking for (some fixed time, e.g. 1000 ms) would also seem consistent with the watchdog timer idea.
    Well, I am trying to investigate this as an option. I can't see anything that would stop us from using performance monitoring counters (PMCs) for profiling. For example, if I wanted to measure very small intervals (microsecond precision) with a performance monitoring counter, in theory I can get even better accuracy than if I had used the LAPIC timer, provided I ensure that any CPU feature like "SpeedStep" is disabled (please correct me if I am missing something on that).

For the purpose of performance tuning, you'd want to use the performance monitoring counters to measure things like branch mispredictions, cache misses, number of pipeline flushes, instructions retired, etc.
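For a concrete idea of what that involves on Intel CPUs, here's a minimal sketch (the MSR addresses are the standard ones from Intel's manuals, but event encodings differ between CPU families - the "mispredicted branches retired" example in the comment is only an illustration, so check the manual for your exact CPU):

Code:
    #include <stdint.h>

    /* MSRs for the first programmable counter on Intel CPUs
       (addresses from Intel's manuals; AMD and older CPUs differ) */
    #define MSR_PERFEVTSEL0  0x186
    #define MSR_PMC0         0x0C1

    #define EVTSEL_USR  (1u << 16)   /* count while CPL = 3 */
    #define EVTSEL_OS   (1u << 17)   /* count while CPL = 0 */
    #define EVTSEL_EN   (1u << 22)   /* enable the counter */

    static inline void wrmsr(uint32_t msr, uint64_t val) {
        __asm__ __volatile__("wrmsr" : : "c"(msr),
                             "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
    }

    static inline uint64_t rdmsr(uint32_t msr) {
        uint32_t lo, hi;
        __asm__ __volatile__("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Start counting one event; the event/umask encoding is CPU-family
       specific (e.g. 0xC5 is "mispredicted branches retired" on some
       Intel cores - check the manual for yours) */
    void pmc_start(uint8_t event, uint8_t umask) {
        wrmsr(MSR_PMC0, 0);
        wrmsr(MSR_PERFEVTSEL0,
              EVTSEL_EN | EVTSEL_OS | EVTSEL_USR
              | ((uint32_t)umask << 8) | event);
    }

    /* Stop counting and return how many events occurred */
    uint64_t pmc_stop(void) {
        uint64_t count = rdmsr(MSR_PMC0);
        wrmsr(MSR_PERFEVTSEL0, 0);
        return count;
    }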
For accurately working out where the CPU is spending the majority of its time, I'd use the "single stepping on branches, exceptions and interrupts" feature. Basically, you'd enable it; then every time the CPU does a control transfer (branch, call, ret, iret, or starting an interrupt handler) the CPU generates a debug exception, and in the debug exception handler you record the source EIP, the target EIP and the time stamp (TSC). From that you can determine exactly where every single cycle was spent.
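On Intel CPUs this feature is the BTF flag (bit 1) of the DEBUGCTL MSR combined with EFLAGS.TF. A minimal sketch of enabling it (32-bit, GCC inline assembly; the function name is mine):

Code:
    #include <stdint.h>

    #define MSR_DEBUGCTL  0x1D9
    #define DEBUGCTL_BTF  (1u << 1)   /* trap on branches, not every instruction */

    static inline void wrmsr(uint32_t msr, uint64_t val) {
        __asm__ __volatile__("wrmsr" : : "c"(msr),
                             "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
    }

    /* With BTF set, setting EFLAGS.TF makes the CPU raise a debug
       exception (#DB, vector 1) on the next control transfer instead
       of after every instruction */
    void enable_branch_single_stepping(void) {
        wrmsr(MSR_DEBUGCTL, DEBUGCTL_BTF);
        __asm__ __volatile__("pushfl; orl $0x100, (%esp); popfl");
    }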
If you combine "single stepping on branches, exceptions and interrupts" with performance monitoring counters (recording the performance monitoring counter values in the debug exception handler, in addition to the source EIP, the target EIP and the time stamp), then not only would you be able to determine exactly where every single cycle was spent, you'd also be able to determine which pieces of code are causing things like branch mispredictions, cache misses, pipeline flushes, etc. This is probably the best profiling support you could dream of, and it would also work fine with interrupts disabled (no need for NMI).
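A rough sketch of the recording side (the buffer layout and the handler hookup are hypothetical; on CPUs that have last branch record MSRs you'd read the source EIP from those too, which this sketch leaves out):

Code:
    #include <stdint.h>

    /* One record per control transfer (hypothetical layout) */
    struct trace_record {
        uint32_t target_eip;   /* EIP pushed by the debug exception */
        uint64_t tsc;          /* time stamp at the moment of the trap */
        uint64_t pmc0;         /* running total of the monitored event */
    };

    #define TRACE_MAX 65536
    static struct trace_record trace_buf[TRACE_MAX];
    static unsigned trace_pos = 0;

    static inline uint64_t rdtsc(void) {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    static inline uint64_t rdpmc(uint32_t counter) {
        uint32_t lo, hi;
        __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Called from the #DB (vector 1) assembly stub with the EIP the
       CPU pushed; with BTF enabled this fires once per control transfer */
    void debug_exception_handler(uint32_t pushed_eip) {
        if (trace_pos < TRACE_MAX) {
            trace_buf[trace_pos].target_eip = pushed_eip;
            trace_buf[trace_pos].tsc = rdtsc();
            trace_buf[trace_pos].pmc0 = rdpmc(0);
            trace_pos++;
        }
    }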
What you're talking about ("poor man's profiling") gives you far less information, and is a lot less accurate. Worst case, it can be extremely misleading - for example, it might determine that the function "foo" is consuming 100% of CPU time when in reality "foo" is only consuming 1% of CPU time and just happens to be called at the same frequency as the profiler's timer.
In my opinion, performance monitoring counters are complex and messy (different code for each different CPU), so if you're going to accept the large amount of hassle it takes to implement support for them (and test it), then you should at least do a little extra work and do profiling properly. If you want to avoid the large amount of hassle and just want "quick and dirty", then do "poor man's profiling" with the PIT or local APIC timer or something (and also support older CPUs without performance monitoring counters). In this case don't bother using NMI (it's not worth doing when the results are going to be inaccurate/misleading anyway).
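If you do go the "quick and dirty" route, the entire profiler boils down to something like this sketch (the kernel base address and bucket size are made-up example values):

Code:
    #include <stdint.h>

    #define KERNEL_BASE   0xC0000000u   /* hypothetical kernel load address */
    #define BUCKET_SHIFT  12            /* one bucket per 4 KiB of code */
    #define BUCKETS       4096

    static uint32_t samples[BUCKETS];

    /* Called from the PIT or local APIC timer IRQ handler, with the EIP
       taken from the stack frame the CPU pushed when the IRQ arrived */
    void profiler_sample(uint32_t interrupted_eip) {
        uint32_t bucket = (interrupted_eip - KERNEL_BASE) >> BUCKET_SHIFT;
        if (bucket < BUCKETS)
            samples[bucket]++;
    }

Picking a sampling period that isn't a neat multiple of anything the profiled code does periodically reduces the "foo looks like 100%" problem described above.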
limp wrote:
    Wait a minute. I thought that both the LAPIC timer and the PMCs are driven by the processor bus frequency. How could the LAPIC timer run at a fixed frequency while the PMCs run at a variable frequency? It would be really helpful if you could clarify this.
    Edit: What I mean is that when a PMC is monitoring the "CPU_CLK_UNHALTED" event, the rate at which the PMC increments should be equal to the rate at which the LAPIC timer decreases (in that case, both the PMC and the LAPIC timer would be driven by the bus frequency).

The performance monitoring counters run at the CPU's frequency (e.g. 1.87 GHz), not the bus frequency. If the CPU is running at 1234 MHz and the bus is running at 123.4 MHz, then the "CPU_CLK_UNHALTED" event would occur 10 times more often than the local APIC timer's count is decreased.
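You can see this for yourself with a sketch like the one below (it assumes PMC0 was already set up to count "CPU_CLK_UNHALTED", that the local APIC timer is running, and that the timer's divide configuration is "divide by 1" - otherwise the divider skews the ratio):

Code:
    #include <stdint.h>

    /* Local APIC current count register, at the usual default base */
    #define LAPIC_BASE    0xFEE00000u
    #define LAPIC_TMRCUR  (*(volatile uint32_t *)(LAPIC_BASE + 0x390))

    static inline uint64_t rdpmc(uint32_t counter) {
        uint32_t lo, hi;
        __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
        return ((uint64_t)hi << 32) | lo;
    }

    /* The PMC delta divided by the local APIC timer delta approximates
       the core-to-bus frequency ratio (e.g. ~10 for 1234 MHz / 123.4 MHz) */
    uint32_t measure_core_bus_ratio(void) {
        uint64_t pmc_before  = rdpmc(0);
        uint32_t apic_before = LAPIC_TMRCUR;
        for (volatile int i = 0; i < 1000000; i++) { }    /* burn cycles */
        uint64_t pmc_delta  = rdpmc(0) - pmc_before;
        uint32_t apic_delta = apic_before - LAPIC_TMRCUR; /* counts down */
        return (uint32_t)(pmc_delta / apic_delta);
    }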
limp wrote:
    Something from a previous post of yours:
    Brendan wrote:
        The performance monitoring counters (and the local APIC timer) tell the CPU's local APIC to send an IRQ to the processor. It wouldn't be visible on the bus, and there's no way to (for example) make a performance monitoring counter (or local APIC timer) send an IRQ to a different CPU.
    So that means that PMCs are located inside the CPU core, right? Are they using a dedicated line for connecting to the LAPIC's PMC pins? I am a bit confused about what is actually connected to the processor bus (any hint on that would be quite useful).

A CPU's core includes that CPU's performance monitoring stuff and the local APIC.
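To make that concrete: the local APIC has a dedicated local vector table entry for the performance counters (offset 0x340 in the local APIC's register space), so routing a counter overflow interrupt is just a local register write - nothing ever appears on the bus. A sketch (the vector number is an arbitrary example):

Code:
    #include <stdint.h>

    #define LAPIC_BASE     0xFEE00000u
    #define LAPIC_LVT_PMC  (*(volatile uint32_t *)(LAPIC_BASE + 0x340))

    #define PMC_VECTOR     0xF0u   /* hypothetical interrupt vector */

    /* Unmasked, fixed delivery mode; the interrupt can only ever be
       delivered to the CPU that owns the counter, because the PMCs and
       the local APIC live inside the same core */
    void route_pmc_overflow_irq(void) {
        LAPIC_LVT_PMC = PMC_VECTOR;
    }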
For what's connected to the bus, here's some pretty pictures (courtesy of www.realworldtech.com's article about Intel's "core" architecture):
Cheers,
Brendan