TSC rate

limp
Member
Posts: 90
Joined: Fri Jun 12, 2009 7:18 am

TSC rate

Post by limp »

Hi all,

I am taking some measurements using the TSC on an Atom processor and I have a doubt about the rate of this counter. Intel manual says that "for Intel Atom processors the time-stamp counter increments at a constant rate. That rate may be set by the maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at which the processor is booted".

I just want to know which of these cases applies.

Also, if the TSC increments at the maximum core-clock to bus-clock ratio, does that mean that if this ratio is e.g. 12, the TSC is incremented every 12 cycles?

If anyone knows, any help would be much appreciated.

Thanks
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: TSC rate

Post by Brendan »

Hi,
limp wrote:I am taking some measurements using the TSC on an Atom processor and I have a doubt about the rate of this counter. Intel manual says that "for Intel Atom processors the time-stamp counter increments at a constant rate. That rate may be set by the maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at which the processor is booted".

I just want to know which of these cases applies.
As long as the TSC counts at a constant rate, does it really matter how the constant rate was determined?
limp wrote:Also, if the TSC increments by the maximum core-clock to bus-clock ratio, that means that if this ratio is e.g. 12 the TSC is incremented every 12 cycles?
No...

If the TSC rate is determined by the "maximum core-clock to bus-clock ratio", then the TSC is incremented "bus_frequency * max_ratio" times per second, even if the CPU happens to be running at a slower core-clock to bus-clock ratio. For example, if the bus runs at 133.333 MHz and the maximum core-clock to bus-clock ratio is 12:1, then the TSC will be incremented about 1,600,000,000 times per second (once per CPU cycle); and if the current core-clock to bus-clock ratio is 6:1 then the TSC will still be incremented about 1,600,000,000 times per second (which works out to twice per CPU cycle, because each CPU cycle takes twice as long).

Because the CPU never runs faster than its maximum speed, the TSC is always incremented at least once per CPU cycle (and the TSC may be incremented more than once per "CPU cycle" when the CPU isn't running at its maximum speed).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
limp
Member
Posts: 90
Joined: Fri Jun 12, 2009 7:18 am

Re: TSC rate

Post by limp »

Thanks a lot Brendan, I understand how it works now.
Brendan wrote:As long as the TSC counts at a constant rate, does it really matter how the constant rate was determined?
If you want to take timing measurements using the TSC, you need to know the rate at which the TSC increments, not just that it's constant. I'm wondering why the Intel manuals don't say how this rate is determined on different processors; they just say it may be determined by one thing or another.

I guess the best way to find out is to measure it by using a different timer (like PIT or Local APIC).

Do you think the TSC is the best way of doing timing measurements, or should other timers like the HPET or the Local APIC timer be preferred?

Thanks
Matthew
Member
Posts: 48
Joined: Wed Jul 01, 2009 11:47 am

Re: TSC rate

Post by Matthew »

I measure the TSC at the same time I measure the LAPIC timer, using the PIT.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: TSC rate

Post by Brendan »

Hi,
limp wrote:If you want to take timing measurements using the TSC, you need to know the rate at which the TSC increments, not just that it's constant. I'm wondering why the Intel manuals don't say how this rate is determined on different processors; they just say it may be determined by one thing or another.

I guess the best way to find out is to measure it by using a different timer (like PIT or Local APIC).
The *only* sane way to determine the current speed of the TSC is to measure it with a different timer. Of course for some CPUs, the speed of the TSC may change immediately after you've measured it, or even while you're measuring it. For the latter case you'd need to repeatedly measure the TSC speed until several results in a row return the same (or very similar) TSC speeds (but you'd still only know what speed the TSC was running at when the measurement was done).

Basically, there are 3 different ways (that I know of) that a CPU can handle the TSC:
  • "TSC invariant" - the TSC *always* runs at the same frequency, regardless of how fast the CPU itself is running and regardless of sleep states. This makes the TSC extremely good for measuring real time (e.g. keeping track of the current time of day), but also makes it unsuitable for performance tuning (e.g. measuring how fast a certain piece of code is).
  • "TSC time" - the TSC runs at the same frequency regardless of how fast the CPU itself is running; but stops or changes frequency when the CPU is in one or more sleep states (e.g. when the CPU isn't running). This makes the TSC useful for measuring real time (e.g. keeping track of the current time of day), but if the CPU goes to sleep you can't rely on it and need to use some other timer while the CPU is in a sleep state. This is also unsuitable for performance tuning (e.g. measuring how fast a certain piece of code is).
  • "TSC cycles" - the TSC frequency depends on how fast the CPU itself is running. This makes the TSC useless for measuring real time (although in some cases, with enough work, it may still be possible to use things like the local APIC's thermal sensor IRQ, etc, and implement a "virtual TSC" that compensates for the TSC's speed changes). This is the only option that is suitable for performance tuning (e.g. measuring how fast a certain piece of code is).
Also note that some CPUs don't support any power management or thermal throttling capabilities, and in this case the TSC can be used for both real time and performance tuning. This is mostly limited to rare CPUs intended for embedded systems though (all Intel, AMD and VIA CPUs that are new enough to support TSC do have some sort of power management and/or thermal throttling).

During boot, an OS probably could/should try to detect what the TSC measures. This isn't easy. On very new CPUs there's a "TSCinvariant" flag (CPUID 0x80000007, bit 8 in EDX) which means that the TSC *always* runs at the same frequency. If this bit isn't set then you don't know if the CPU is too old to support this bit (and actually does qualify as "TSC invariant" even though the bit is clear). If the "TSCinvariant" flag in CPUID is clear or if CPUID 0x80000007 isn't supported, then the only way to find out what the TSC counts is to use the vendor/family/model from CPUID (e.g. have some sort of database lookup inside your CPU detection code). If the CPU is so old that it doesn't support CPUID, then you can assume it doesn't support TSC either.

Also, on SMP systems there's no guarantee that the TSC on one CPU is synchronized with the TSC on any other CPU. If you're not careful this can create some severe problems.
limp wrote:Do you think the TSC is the best way of doing timing measurements, or should other timers like the HPET or the Local APIC timer be preferred?
Using the TSC (where possible) is the best possible way of measuring real time - it's very precise and has very low overhead. This is why (IMHO) it's worth the hassle of finding out what the TSC counts, and implementing some work-arounds. However, if your OS is unable to determine what the TSC counts (or if the CPU doesn't support TSC) then it must be able to use other methods.


Cheers,

Brendan
robos
Member
Posts: 33
Joined: Sun Apr 06, 2008 7:04 pm
Location: Southern California

Re: TSC rate

Post by robos »

I found this paper interesting: http://svn.wildfiregames.com/public/ps/ ... tfalls.pdf

It describes the various timers that might be available, and the pros and cons of each.
- Rob
limp
Member
Posts: 90
Joined: Fri Jun 12, 2009 7:18 am

Re: TSC rate

Post by limp »

First of all, thanks very much Brendan for all this useful info (thanks also to robos for the interesting paper).

I can understand why the invariant TSC is perfect for timing measurements, but why isn't it preferable for performance measurement? If your system has an invariant TSC, you know its rate, and if you measure the duration of a specific piece of code, you can see how fast (in time) that piece of code is.

All I am saying is that we can measure performance either as time or as CPU cycles. By the way, is there any way to find out how many cycles some instructions take? I looked at the Intel manuals but nothing is mentioned there. Maybe on modern processors the number of cycles an instruction takes varies depending on which features are enabled (cache, HT), and that's why it's not mentioned?

Anyway, I have set up the TSC for taking some periodic jitter measurements of two periodic tasks. The problem is that while the task that runs very often (every ms) doesn't seem to have any jitter, the measurements for the task that runs less often show some huge, continuing oscillations.

That is, in interval 1 the jitter is ~0.028 ms, in interval 2 ~6.97 ms (~250 times bigger!), in interval 3 ~0.028 ms, in interval 4 ~5.97 ms, etc.

Does anyone have any idea why this is happening? Any idea will be much appreciated.

Regards

limp
Last edited by limp on Fri Oct 16, 2009 3:56 am, edited 1 time in total.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: TSC rate

Post by Brendan »

Hi,
limp wrote:If your system has an invariant TSC, you know its rate, and if you measure the duration of a specific piece of code, you can see how fast (in time) that piece of code is.
Yes, but usually for performance tuning you want to know how fast the code is in cycles. For example, if a piece of code took 3 ms and you change it, and the new version takes 2 ms; then did the changes make the code faster? Maybe the first measurement was done while the CPU was being throttled but the second measurement was done while the CPU was running at full speed, and the changes actually made the code a lot slower...
limp wrote:By the way, is there any way to find out how many cycles some instructions take? I looked at the Intel manuals but nothing is mentioned there. Maybe on modern processors the number of cycles an instruction takes varies depending on which features are enabled (cache, HT), and that's why it's not mentioned?
There are "Latency" and "Throughput" figures in the "Optimization Reference Manual". Latency is defined as "the number of clock cycles that are required for the execution core to complete execution of all the uops that form an instruction". Throughput is defined as "the number of clock cycles required to wait before the issue ports are free to accept the same instruction again". These are very simplistic figures though - they don't take into account instruction dependencies or the time taken to read data into the CPU (e.g. instruction fetch/decode limitations, cache fetch times, cache miss times, TLB lookup times, TLB miss times, etc). The only real way to determine how many cycles an instruction takes in a specific situation is to test it with RDTSC many times (making sure that the CPU is running at 100% speed when you do, and that any other logical CPUs in the same core are idle); then replace the instruction with NOP and test it again, and find the difference between the two measurements.

In almost every case, the exact amount of time it takes to execute one instruction is meaningless, and only the time it takes to execute a sequence of instructions makes any sense. For example, consider something like:

Code: Select all

    mov ebx,[foo]
    mul edx
    add eax,ebx
In this case, the "mov ebx,[foo]" might cause a cache miss that stalls the CPU for 200 cycles, and replacing the MUL instruction with a NOP might make no difference at all; and changing the last instruction to "add eax,ecx" might make a massive difference (even though it's exactly the same instruction).


Cheers,

Brendan
limp
Member
Posts: 90
Joined: Fri Jun 12, 2009 7:18 am

Re: TSC rate

Post by limp »

Hi again,
Brendan wrote:
limp wrote:If your system has an invariant TSC, you know its rate, and if you measure the duration of a specific piece of code, you can see how fast (in time) that piece of code is.
Yes, but usually for performance tuning you want to know how fast the code is in cycles. For example, if a piece of code took 3 ms and you change it, and the new version takes 2 ms; then did the changes make the code faster? Maybe the first measurement was done while the CPU was being throttled but the second measurement was done while the CPU was running at full speed, and the changes actually made the code a lot slower...
I see... But again, say the first time I measured a piece of code it took 200 cycles and the second time it took 300 cycles, but the first time the CPU was running at full speed and the second time it was being throttled. It's the same problem, isn't it?

Also, thanks for your explanation regarding the cycles per instruction question.

I am still getting very weird jitter measurements and trying to figure out why. Do you (or anyone else) know if there is a memory bus arbitration scheme in Intel multi-core processors (applied by an arbiter or by the memory controller), how it works, and where I can find more details about it? I have looked at the datasheet of the Northbridge but nothing is mentioned there. Some questions I could use an answer to are:

i) What happens (in a dual-core system) if both cores want to access the memory at the same time? Who gets priority?
ii) If the second core is not used at all, is the bus given entirely to core 1, or is TDMA arbitration still applied on the bus?

Any help or info regarding these issues will be very useful.

Thanks in advance.

Regards.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: TSC rate

Post by Brendan »

Hi,
limp wrote:
Brendan wrote:
limp wrote:If your system has an invariant TSC, you know its rate, and if you measure the duration of a specific piece of code, you can see how fast (in time) that piece of code is.
Yes, but usually for performance tuning you want to know how fast the code is in cycles. For example, if a piece of code took 3 ms and you change it, and the new version takes 2 ms; then did the changes make the code faster? Maybe the first measurement was done while the CPU was being throttled but the second measurement was done while the CPU was running at full speed, and the changes actually made the code a lot slower...
I see... But again, say the first time I measured a piece of code it took 200 cycles and the second time it took 300 cycles, but the first time the CPU was running at full speed and the second time it was being throttled. It's the same problem, isn't it?
It's the same problem - you've got no idea if 200 cycles is faster or slower than 300 cycles (unless you can adjust the results to compensate for CPU speed changes). Also don't forget that the CPU speed can change while you're doing the test (not just between tests), and that Intel's Nehalem CPUs have "TurboBoost" now (e.g. a CPU can be running anywhere between about 12.5% of its normal speed and about 120% of its normal speed).

Mainly, for modern CPUs you'd want to use the TSC for measuring time, then setup performance monitoring counters to measure cycles. Unfortunately this isn't easy though - each CPU does performance monitoring differently.
limp wrote:I am still getting very weird jitter measurements and trying to figure out why. Do you (or anyone else) know if there is a memory bus arbitration scheme in Intel multi-core processors (applied by an arbiter or by the memory controller), how it works, and where I can find more details about it? I have looked at the datasheet of the Northbridge but nothing is mentioned there. Some questions I could use an answer to are:

i) What happens (in a dual-core system) if both cores want to access the memory at the same time? Who gets priority?
There would have to be some sort of arbitration (2 or more "things" can't use the same bus at the same time). I'm not too sure how this is implemented for any specific CPU or bus though.
limp wrote:ii) If the second core is not used at all, is the bus given entirely to core 1, or is TDMA arbitration still applied on the bus?
In most cases there's layers. For example, CPU cores that talk to their own "per core" L1 caches (no arbitration needed), that talk to shared L2 caches (with arbitration), that talk to the "front-side-bus" (where all devices on the front-side bus use arbitration, not just CPUs).

However, buses are relatively fast (and the effects of arbitration should be spread relatively evenly) so I doubt that arbitration (at any level) is responsible for large variations; and IMHO it's more likely that your jitter is caused by something else: cache misses, TLB misses, IRQs, the combination of physical pages being used, bugs in your measurement code, etc.


Cheers,

Brendan
limp
Member
Posts: 90
Joined: Fri Jun 12, 2009 7:18 am

Re: TSC rate

Post by limp »

Hi there,
Brendan wrote:it's more likely that your jitter is caused by something else: cache misses, TLB misses, IRQs, the combination of physical pages being used, bugs in your measurement code, etc.
Well, I tried to disable the cache, but then my scheduler stops working (it initialises but it doesn't update) :(
I am also getting some undefined interrupts (even though at the moment I am using only one interrupt on the system, for driving the scheduler). Does anyone have an idea where these undefined interrupts are coming from, and why the scheduler stops working when I disable the cache? Are there any logical steps I could follow to sort this out?

Thanks in advance.

P.S. Thanks Brendan for all your help so far!
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Re: TSC rate

Post by Combuster »

Could you be a bit more specific? What IRQs do you receive, and preferably an indication of how often?

As for the cache problem, the only thing I can guess is that you are either writing the wrong value to CR0, or you are using INVD instead of WBINVD.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
limp
Member
Posts: 90
Joined: Fri Jun 12, 2009 7:18 am

Re: TSC rate

Post by limp »

Hi all,

Sorry for taking so long to reply...
Combuster wrote:Could you be a bit more specific? What IRQs do you receive, and preferably an indication of how often?
I was receiving an IRQ 3 (serial port 1) every 1-2 seconds, but now, for no apparent reason, I'm not getting it anymore.
Combuster wrote:As for the cache problem, the only thing I can guess is that you are either writing the wrong value to CR0, or you are using INVD instead of WBINVD.
I am sure that I am writing the right value to CR0 and yes I am using WBINVD.

The behaviour I get is not the same every time.

When I am not disabling the cache, everything works fine.

When the cache-disable code is in place, the behaviour differs from run to run. Sometimes the tasks run for a while (a different length of time each run) and then the scheduler stops. Other times the tasks run for a while, then stop, run a little more after a while, then stop again, etc.

Thanks
Combuster
Member
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Re: TSC rate

Post by Combuster »

Well, then the cause is most likely unrelated to this - are you zeroing memory everywhere? Can you reproduce the problem in a (different) emulator (so you can debug it)? You should also check for race conditions and the like.

Have you remapped the IRQs (to eliminate the odd chance that you get an unexpected #NP over something)? You should also mask interrupts that you are not using, for performance.
limp
Member
Posts: 90
Joined: Fri Jun 12, 2009 7:18 am

Re: TSC rate

Post by limp »

Combuster wrote:Well, then the cause is most likely unrelated to this - are you zeroing memory everywhere?
How can I zero the memory everywhere? My kernel boots via Multiboot, so I assume the bootloader is doing that for me.
Combuster wrote:Can you reproduce the problem in a (different) emulator (so you can debug it)? You should also check for race conditions and the like.
I am actually running it on real hardware, but I will try running it in an emulator as well so that I can debug it more easily.
Combuster wrote:Have you remapped the IRQs (to eliminate the odd chance that you get an unexpected #NP over something)? You should also mask interrupts that you are not using, for performance.
Yes, I've done this as well.

I had accidentally put some wrong code in my unhandled-interrupt handler, so I wasn't seeing any unhandled interrupts even though I'm pretty sure they were occurring. Now that I've fixed that, I get an unhandled interrupt, and the combined in-service register of the master and slave PICs when this occurs is 213 (0xD5). This happens just before Timer 0 is routed to IRQ0 on the 8259 (I am using the HPET, so when I start the timer I need to route it to the PIC because it isn't by default).

Thanks.