
A new approach suggested for telling CPU speed

Posted: Wed Dec 22, 2004 4:32 am
by Pype.Clicker
Hi dudes

Someone posted a suggestion for telling CPU speed on the FAQ

I wrote test code (http://www.sylvain-ulg.be.tf/resources/speed.c) for it and ran a User-mode Linux experiment, which wasn't really conclusive.

If anyone has other results, let me know...

When finally stabilizing, the 1550 XORs took 1575 cycles on an AMD x86-64 and 1601 cycles on a Pentium 3 ...

Re:A new approach suggested for telling CPU speed

Posted: Wed Dec 22, 2004 7:13 am
by Brendan
Hi,
Pype.Clicker wrote: Someone posted a suggestion for telling CPU speed on the FAQ

I wrote test code (http://www.sylvain-ulg.be.tf/resources/speed.c) for it and ran a User-mode Linux experiment, which wasn't really conclusive.

If anyone has other results, let me know...

When finally stabilizing, the 1550 XORs took 1575 cycles on an AMD x86-64 and 1601 cycles on a Pentium 3 ...
I'm having trouble determining the entire point of the exercise...

If your test code is designed to measure how many cycles the XOR instruction takes under certain conditions, then fair enough; however, I fail to see how it can be useful for determining CPU speed.

If a series of XORs is being suggested as a replacement for RDTSC on modern CPUs, then it will fail on Pentium IVs, as the XOR instruction may take half a clock cycle (if done in execution port 0) or a full cycle (if done in execution port 1) - see here. In addition, there's no guarantee that future CPUs won't execute XOR in 1/8th of a cycle, or 16 cycles, or anything else.

If a series of XORs is being suggested as a RDTSC replacement for older CPUs (that don't support RDTSC), then tests run on modern CPUs are a waste of time (and worrying about pipelining wouldn't be necessary). Also, you'd need to test (or at least account for) the differences in these older CPUs - e.g. an 80386 will take 2 cycles to do a "xor reg32,reg32" instruction, while an 80486 will take 1 cycle - see here. In this case I'd suggest that the "LODSD" instruction would be a better choice, as it consumes 5 clock cycles on all 80486 and older IA32 Intel CPUs. Of course other manufacturers may still vary.

Considering that the CPU's frequency does not relate directly to the performance of the CPU and can't/shouldn't be used for timing loops (without some form of calibration that can easily be made CPU-independent), I would question its usefulness. The only valid use for measuring the CPU's frequency is for display purposes, where it's much simpler to omit it if RDTSC isn't supported. IMHO something that measures actual CPU performance would be of far more practical value for both the user and software (e.g. Linux's BogoMIPS), even if it is only a crude benchmark.

If my poor over-tired brain has missed something in its attempt to find a purpose, let me know and I'll test the computers I have here...


Cheers,

Brendan

Re:A new approach suggested for telling CPU speed

Posted: Wed Dec 22, 2004 8:05 am
by Pype.Clicker
What I'm trying to do here is cross-validation of the XOR method. Someone suggested using pipeline-ugly XORs as a way to wait for N CPU cycles, and I'm trying to see if the number of CPU cycles required for a succession of XORs is predictable across the hardware platforms I have here.

If all platforms say the 775 XORs took N cycles then someone could calibrate a CPU by checking how much real time M XORs take. If various platforms disagree, it means the technique cannot reliably make the CPU wait for M cycles.

Re:A new approach suggested for telling CPU speed

Posted: Wed Dec 22, 2004 10:00 am
by Brendan
Hi,
Pype.Clicker wrote: If all platforms say the 775 XORs took N cycles then someone could calibrate a CPU by checking how much real time M XORs take. If various platforms disagree, it means the technique cannot reliably make the CPU wait for M cycles.
Ok. For a start, it won't work on an 80386 (or older) CPU - I doubt this matters (I haven't even been able to find a working 80386 on eBay).

I added code for more statistics to the source code (and attached it). Here's the results for my Pentium IV:

Code:

(000194) took 5308 cycles (4700 min, 7068 max, 5521 avg)
(000195) took 5972 cycles (4700 min, 7068 max, 5523 avg)
(000196) took 5324 cycles (4700 min, 7068 max, 5522 avg)
(000197) took 5548 cycles (4700 min, 7068 max, 5522 avg)
This CPU is a "Willamette" - the first and slowest Pentium IV chip. The test code was compiled with Cygwin and run under Windows 98.

The results look far too slow to me. I also get slow results from my own crude benchmark (1 instruction per cycle on average for integer instructions, as compared to 3 instructions per cycle for the same test on a Pentium III).

I'll try some other machines later (and might even do a bootable floppy version to rule out OS scheduler and IRQ's) - I should've been asleep hours ago.


Cheers,

Brendan

Re:A new approach suggested for telling CPU speed

Posted: Wed Dec 22, 2004 10:34 am
by Brendan
Hi again,

Ok, I didn't run it for long enough ::) ...

I commented out the "sleep(1)" and got a minimum of 1792 cycles for the Pentium IV after some 15000 tests. I also tried it out on a dual Pentium III server running Gentoo and got a minimum of 1682 cycles after 460000 tests.

Quite frankly, I'm surprised at the number of tests that are needed to get a clean/minimal result - I can't figure out what these OSs are doing so often that 1550 instructions can't be executed without something interrupting.

I'll write some bootable code to do the test in real mode (without IRQ's, etc) in the next day or so - should give much more stable results...


Cheers,

Brendan

Re:A new approach suggested for telling CPU speed

Posted: Thu Dec 23, 2004 3:30 am
by Brendan
Hi,
Brendan wrote: I'll write some bootable code to do the test in real mode (without IRQ's, etc) in the next day or so - should give much more stable results...
Done!

My bootable code tests 1000 XOR pairs to make sure it's all in the caches, then tests 1000 XOR pairs again and keeps this value. Then it does the same for 500 XOR pairs, and subtracts this second value from the 1000-pair value. This is done to eliminate everything except the actual timing of the XOR instructions, e.g.:

(1000pairValue + rdtsc_overhead) - (500pairValue + rdtsc_overhead) = 500pairValue without any overhead.

Also interrupts are disabled during the actual test. Because of all this the results are always identical, without any variation.

The results are:
1062 cycles - AMD K6 3D processor, 166 MHz
1005 cycles - Pentium Pro, 200 MHz
1005 cycles - dual Pentium II, 400 MHz
1005 cycles - dual Pentium III, 1 GHz
500 cycles - Pentium IV (Willamette), 1.6 GHz

I've attached a *.zip file containing all the source for the actual testing. It's implemented as a replacement for my OS's 16 bit setup code, so it needs one of my boot loaders to get it working (the boot loader source for floppies, DOS and GRUB is available at http://www.users.bigpond.com/sacabling/dl.html). A 1440 KB boot floppy image is also included in the *.zip file to save hassles (just "rawrite" it to a floppy).


Cheers,

Brendan

Re:A new approach suggested for telling CPU speed

Posted: Thu Dec 23, 2004 4:04 am
by Pype.Clicker
The results are overhead-less, right? I wonder where those "+5" for the Pentium-based processors and the "+62" come from... I suppose that's due to internal cache fills (from the L2 cache) or something...

The fact that it takes half the time on the Willamette is really disturbing... To me that means we cannot rely on a chain of XORs to count N CPU cycles in the general case...

Re:A new approach suggested for telling CPU speed

Posted: Thu Dec 23, 2004 8:24 am
by Brendan
Hi,
Pype.Clicker wrote: The results are overhead-less, right? I wonder where those "+5" for the Pentium-based processors and the "+62" come from... I suppose that's due to internal cache fills (from the L2 cache) or something...
I don't know enough about CPU internals to make a reasonable guess :-). It shouldn't be caused by cache fills, as those should occur during the first test (which is ignored) rather than the second.
Pype.Clicker wrote: The fact that it takes half the time on the Willamette is really disturbing... To me that means we cannot rely on a chain of XORs to count N CPU cycles in the general case...
In general it's never a good idea to rely on the time (or number of cycles) any 80x86 instruction might take.

It may still be possible to use the XOR method for CPUs ranging from 80486 to Pentium III...


Cheers,

Brendan

Re:A new approach suggested for telling CPU speed

Posted: Thu Dec 23, 2004 3:37 pm
by Candy
Brendan wrote:
Pype.Clicker wrote: The results are overhead-less, right? I wonder where those "+5" for the Pentium-based processors and the "+62" come from... I suppose that's due to internal cache fills (from the L2 cache) or something...
I don't know enough about CPU internals to make a reasonable guess :-). It shouldn't be caused by cache fills, as those should occur during the first test (which is ignored) rather than the second.
My guess for the first is a pipeline flush occurring at the end of the sequence, since RDTSC is a synchronized instruction. The second one might be caused by RDTSC being a vector-path instruction on the K6-1; I'll test it sometime later this week on my K6-2 laptop to see if it does that too (it probably should...). Remember, the K6-1 was one of AMD's first CPUs to have RDTSC - they might have considered it an "add-on" to the design and decided it wouldn't be used often.
Pype.Clicker wrote: The fact that it takes half the time on the Willamette is really disturbing... To me that means we cannot rely on a chain of XORs to count N CPU cycles in the general case...
In general it's never a good idea to rely on the time (or number of cycles) any 80x86 instruction might take.

It may still be possible to use the XOR method for CPUs ranging from 80486 to Pentium III...
Why do you want a static series of instructions that takes a static number of cycles? It's only usable for determining the latency of an instruction under optimal conditions, which is pretty pointless (as it's also written out in some optimization guides...). On superscalar processors you will get the quickest pipeline for executing that instruction in terms of latency, which in the case of the Willamette (and all other NetBursts, such as Xeons and Prescotts!) is the two double-clocked ALUs (actually LUs - the arithmetic part is only partly complete in one of them), so you get half a cycle. On scalar processors you'll get the latency of the instruction itself, and on non-scalar processors you'll get the number of cycles it takes.

In any case, they're clearly documented and all very pointless for determining CPU speed. If you have RDTSC, you can use that to determine how long a cycle takes. If you don't have it, you can assume that something like XOR is pretty quick (note: avoid dependencies among the XORs! If you have dependencies, there's a big chance the processor will stall on them, even though the instruction itself is fast).

You can also just decide to go for MIPS: take any arbitrary ALU instruction (such as XOR) and count how often the CPU can do it in a second. Note that MIPS also stands for "Meaningless Indication of Processor Speed", which pretty much illustrates my point.

What is there to gain in knowing how fast the processor is, except for allowing more timer interrupts? And if it's fast enough for you to care about that, it's got RDTSC.

Re:A new approach suggested for telling CPU speed

Posted: Thu Dec 23, 2004 10:46 pm
by Brendan
Hi,
Candy wrote: Why do you want a static series of instructions that takes a static number of cycles? It's only usable for determining the latency of an instruction under optimal conditions, which is pretty pointless (as it's also written out in some optimization guides...). On superscalar processors you will get the quickest pipeline for executing that instruction in terms of latency, which in the case of the Willamette (and all other NetBursts, such as Xeons and Prescotts!) is the two double-clocked ALUs (actually LUs - the arithmetic part is only partly complete in one of them), so you get half a cycle. On scalar processors you'll get the latency of the instruction itself, and on non-scalar processors you'll get the number of cycles it takes.
I think the basic idea is to use a series of instructions that takes a known/fixed number of cycles instead of RDTSC, which would be especially useful on older CPUs that don't support RDTSC (or in user space, if RDTSC access is disabled). E.g. if XOR always took 1 cycle, then you could measure how many XORs the CPU can do in a second to find cycles per second.
Candy wrote: If you have dependencies you have a big chance that the processor can stall on them, even though it's faster itself.
For the "XOR test code", dependencies are required to prevent pipelining (multiple XORs executed in multiple pipelines), and memory accesses are avoided to prevent stalls.
Candy wrote: You can also just decide to go for MIPS, take any arbitrary ALU instruction (such as XOR) and count how often it can do that in a second. Note, MIPS also stands for Meaningless Indication of Processor Speed, which pretty much illustrates my point.

What is there in knowing how fast the processor is, except for allowing more timer interrupts? And, if it's fast enough for you to care about that, it's got RDTSC.
My OS uses a varied collection of integer instructions to work out a "VMHz rating", which is then used to determine the system timer tick frequency and the basic time slice length. The code that finds the VMHz rating also measures the CPU's bus speed (if there's a local APIC) and the CPU frequency (if RDTSC is supported). Of these measurements, the CPU frequency is the only one that's not used by the OS (the bus speed is needed to calibrate the local APIC timer). I report the CPU frequency to the user because it was easy to do, but it's much less accurate/useful than VMHz, BogoMIPS or MIPS as it doesn't relate to performance at all. E.g. a Pentium III (Coppermine) running at 1 GHz is about twice as fast as a 1.6 GHz Willamette when running integer code, depending on memory bandwidth, etc. The CPU frequency is completely misleading...


Cheers,

Brendan

Re:A new approach suggested for telling CPU speed

Posted: Fri Dec 24, 2004 7:50 am
by Curufir
Like the others I'm not sure how valuable knowing the frequency actually is (Although many months back I suggested a fun scheme to get very accurate timing using rdtsc and a decent frequency approximation).

IMO what is far more valuable is knowing how long a context switch takes. One way to check this would be to load up the PIT with a long value and run a simulated set of context switches (e.g. repeatedly calling a non-EOI-producing version of the scheduler with a task consisting of a software interrupt). That value can then be used to calibrate time slices, so that the time spent context switching stays small compared to the time spent servicing applications. After all, the idea is to reduce overhead while maintaining the illusion of parallel execution.