Re: Performance Monitoring Counters on Intel Celeron M
Posted: Fri Jun 17, 2011 4:46 am
Hi,
limp wrote:
    Brendan wrote:
        It would seem strange to use performance monitoring counters (which are intended for very thorough and accurate profiling) to implement "poor man's profiling", so that only leaves the watchdog timer. The slow speed you're looking for (some fixed time, e.g. 1000 ms) would also seem consistent with the watchdog timer idea.
    Well, I am trying to investigate this as an option. I can't see anything that would stop us from using performance monitoring counters (PMCs) for profiling. For example, if I wanted to measure very small intervals (microsecond precision) with a performance monitoring counter, in theory I can get even better accuracy than if I had used the LAPIC timer, provided I ensure that any CPU feature like "SpeedStep" is disabled (please correct me if I am missing something on that).

For the purpose of performance tuning, you'd want to use the performance monitoring counters to measure things like branch mispredictions, cache misses, number of pipeline flushes, instructions retired, etc.
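For a concrete idea of what that involves on Intel CPUs, here's a minimal sketch (the MSR addresses are the standard ones from Intel's manuals, but event encodings differ between CPU families - the "mispredicted branches retired" example in the comment is only an illustration, so check the manual for your exact CPU):

Code:
    #include <stdint.h>

    /* MSRs for the first programmable counter on Intel CPUs
       (addresses from Intel's manuals; AMD and older CPUs differ) */
    #define MSR_PERFEVTSEL0  0x186
    #define MSR_PMC0         0x0C1

    #define EVTSEL_USR  (1u << 16)   /* count while CPL = 3 */
    #define EVTSEL_OS   (1u << 17)   /* count while CPL = 0 */
    #define EVTSEL_EN   (1u << 22)   /* enable the counter */

    static inline void wrmsr(uint32_t msr, uint64_t val) {
        __asm__ __volatile__("wrmsr" : : "c"(msr),
                             "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
    }

    static inline uint64_t rdmsr(uint32_t msr) {
        uint32_t lo, hi;
        __asm__ __volatile__("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Start counting one event; the event/umask encoding is CPU-family
       specific (e.g. 0xC5 is "mispredicted branches retired" on some
       Intel cores - check the manual for yours) */
    void pmc_start(uint8_t event, uint8_t umask) {
        wrmsr(MSR_PMC0, 0);
        wrmsr(MSR_PERFEVTSEL0,
              EVTSEL_EN | EVTSEL_OS | EVTSEL_USR
              | ((uint32_t)umask << 8) | event);
    }

    /* Stop counting and return how many events occurred */
    uint64_t pmc_stop(void) {
        uint64_t count = rdmsr(MSR_PMC0);
        wrmsr(MSR_PERFEVTSEL0, 0);
        return count;
    }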
For accurately working out where the CPU is spending the majority of its time, I'd use the "single stepping on branches, exceptions and interrupts" feature. Basically, you'd enable it; then every time the CPU does a control transfer (branch, call, ret, iret, or starting an interrupt handler) the CPU generates a debug exception, and in the debug exception handler you record the source EIP, the target EIP and the time stamp (TSC). From that you can determine exactly where every single cycle was spent.
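On Intel CPUs this feature is the BTF flag (bit 1) of the DEBUGCTL MSR combined with EFLAGS.TF. A minimal sketch of enabling it (32-bit, GCC inline assembly; the function name is mine):

Code:
    #include <stdint.h>

    #define MSR_DEBUGCTL  0x1D9
    #define DEBUGCTL_BTF  (1u << 1)   /* trap on branches, not every instruction */

    static inline void wrmsr(uint32_t msr, uint64_t val) {
        __asm__ __volatile__("wrmsr" : : "c"(msr),
                             "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
    }

    /* With BTF set, setting EFLAGS.TF makes the CPU raise a debug
       exception (#DB, vector 1) on the next control transfer instead
       of after every instruction */
    void enable_branch_single_stepping(void) {
        wrmsr(MSR_DEBUGCTL, DEBUGCTL_BTF);
        __asm__ __volatile__("pushfl; orl $0x100, (%esp); popfl");
    }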
If you combine "single stepping on branches, exceptions and interrupts" with performance monitoring counters (recording the performance monitoring counter values in the debug exception handler, in addition to the source EIP, the target EIP and the time stamp), then not only would you be able to determine exactly where every single cycle was spent, you'd also be able to determine which pieces of code are causing things like branch mispredictions, cache misses, pipeline flushes, etc. This is probably the best profiling support you could dream of, and it would also work fine with interrupts disabled (no need for NMI).
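A rough sketch of the recording side (the buffer layout and the handler hookup are hypothetical; on CPUs that have last branch record MSRs you'd read the source EIP from those too, which this sketch leaves out):

Code:
    #include <stdint.h>

    /* One record per control transfer (hypothetical layout) */
    struct trace_record {
        uint32_t target_eip;   /* EIP pushed by the debug exception */
        uint64_t tsc;          /* time stamp at the moment of the trap */
        uint64_t pmc0;         /* running total of the monitored event */
    };

    #define TRACE_MAX 65536
    static struct trace_record trace_buf[TRACE_MAX];
    static unsigned trace_pos = 0;

    static inline uint64_t rdtsc(void) {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    static inline uint64_t rdpmc(uint32_t counter) {
        uint32_t lo, hi;
        __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Called from the #DB (vector 1) assembly stub with the EIP the
       CPU pushed; with BTF enabled this fires once per control transfer */
    void debug_exception_handler(uint32_t pushed_eip) {
        if (trace_pos < TRACE_MAX) {
            trace_buf[trace_pos].target_eip = pushed_eip;
            trace_buf[trace_pos].tsc = rdtsc();
            trace_buf[trace_pos].pmc0 = rdpmc(0);
            trace_pos++;
        }
    }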
What you're talking about ("poor man's profiling") gives you far less information, and is a lot less accurate. Worst case, it can be extremely misleading - for example, it might determine that the function "foo" is consuming 100% of CPU time when in reality "foo" is only consuming 1% of CPU time and just happens to be called at the same frequency as the profiler's timer.
In my opinion, performance monitoring counters are complex and messy (different code for each different CPU), so if you're going to accept the large amount of hassle it takes to implement support for them (and test it), then you should at least do a little extra work and do profiling properly. If you want to avoid the large amount of hassle and just want "quick and dirty", then do "poor man's profiling" with the PIT or local APIC timer or something (and also support older CPUs without performance monitoring counters). In this case don't bother using NMI (it's not worth doing when the results are going to be inaccurate/misleading anyway).
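If you do go the "quick and dirty" route, the entire profiler boils down to something like this sketch (the kernel base address and bucket size are made-up example values):

Code:
    #include <stdint.h>

    #define KERNEL_BASE   0xC0000000u   /* hypothetical kernel load address */
    #define BUCKET_SHIFT  12            /* one bucket per 4 KiB of code */
    #define BUCKETS       4096

    static uint32_t samples[BUCKETS];

    /* Called from the PIT or local APIC timer IRQ handler, with the EIP
       taken from the stack frame the CPU pushed when the IRQ arrived */
    void profiler_sample(uint32_t interrupted_eip) {
        uint32_t bucket = (interrupted_eip - KERNEL_BASE) >> BUCKET_SHIFT;
        if (bucket < BUCKETS)
            samples[bucket]++;
    }

Picking a sampling period that isn't a neat multiple of anything the profiled code does periodically reduces the "foo looks like 100%" problem described above.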
limp wrote:
    Wait a minute. I thought that both the LAPIC timer and the PMCs are driven by the processor bus frequency. How could the LAPIC timer run at a fixed frequency while the PMCs run at a variable frequency? It would be really helpful if you could clarify this.
    Edit: What I mean is that when a PMC is monitoring the "CPU_CLK_UNHALTED" event, the rate at which the PMC increments should be equal to the rate at which the LAPIC timer decreases (in that case, both the PMC and the LAPIC timer would be driven by the bus frequency).

The performance monitoring counters run at the CPU's frequency (e.g. 1.87 GHz), not the bus frequency. If the CPU is running at 1234 MHz and the bus is running at 123.4 MHz, then the "CPU_CLK_UNHALTED" event would occur 10 times more often than the local APIC timer's count is decreased.
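You can see this for yourself with a sketch like the one below (it assumes PMC0 was already set up to count "CPU_CLK_UNHALTED", that the local APIC timer is running, and that the timer's divide configuration is "divide by 1" - otherwise the divider skews the ratio):

Code:
    #include <stdint.h>

    /* Local APIC current count register, at the usual default base */
    #define LAPIC_BASE    0xFEE00000u
    #define LAPIC_TMRCUR  (*(volatile uint32_t *)(LAPIC_BASE + 0x390))

    static inline uint64_t rdpmc(uint32_t counter) {
        uint32_t lo, hi;
        __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
        return ((uint64_t)hi << 32) | lo;
    }

    /* The PMC delta divided by the local APIC timer delta approximates
       the core-to-bus frequency ratio (e.g. ~10 for 1234 MHz / 123.4 MHz) */
    uint32_t measure_core_bus_ratio(void) {
        uint64_t pmc_before  = rdpmc(0);
        uint32_t apic_before = LAPIC_TMRCUR;
        for (volatile int i = 0; i < 1000000; i++) { }    /* burn cycles */
        uint64_t pmc_delta  = rdpmc(0) - pmc_before;
        uint32_t apic_delta = apic_before - LAPIC_TMRCUR; /* counts down */
        return (uint32_t)(pmc_delta / apic_delta);
    }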
limp wrote:
    Something from a previous post of yours:
    Brendan wrote:
        The performance monitoring counters (and the local APIC timer) tell the CPU's local APIC to send an IRQ to the processor. It wouldn't be visible on the bus, and there's no way to (for example) make a performance monitoring counter (or local APIC timer) send an IRQ to a different CPU.
    So that means that PMCs are located inside the CPU core, right? Are they using a dedicated line for connecting to the LAPIC's PMC pins? I am a bit confused about what is actually connected to the processor bus (any hint on that would be quite useful).

A CPU's core includes that CPU's performance monitoring stuff and the local APIC.
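To make that concrete: the local APIC has a dedicated local vector table entry for the performance counters (offset 0x340 in the local APIC's register space), so routing a counter overflow interrupt is just a local register write - nothing ever appears on the bus. A sketch (the vector number is an arbitrary example):

Code:
    #include <stdint.h>

    #define LAPIC_BASE     0xFEE00000u
    #define LAPIC_LVT_PMC  (*(volatile uint32_t *)(LAPIC_BASE + 0x340))

    #define PMC_VECTOR     0xF0u   /* hypothetical interrupt vector */

    /* Unmasked, fixed delivery mode; the interrupt can only ever be
       delivered to the CPU that owns the counter, because the PMCs and
       the local APIC live inside the same core */
    void route_pmc_overflow_irq(void) {
        LAPIC_LVT_PMC = PMC_VECTOR;
    }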
For what's connected to the bus, here's some pretty pictures (courtesy of www.realworldtech.com's article about Intel's "core" architecture):
Cheers,
Brendan