Hi,
Crazed123 wrote:Yes, but how do I time them? With my stopwatch?!
For timing a set of instructions, typically you'd use the RDTSC instruction (ReaD Time Stamp Counter). RDTSC returns the number of cycles the CPU has done so far (the time stamp). By reading the time stamp at the start and end you can work out how many cycles the code took to run.
For e.g.:
Code: Select all
RDTSC
push eax
push edx
<code being timed>
RDTSC
pop ecx
pop ebx
sub eax,ebx
sbb edx,ecx ;edx:eax = number of cycles it took
To get very accurate results there's a pile of obvious and not-so-obvious things that need to be taken into account (the overhead that the measuring code itself adds, IRQs, CPU caches and instruction pipelines). Usually running the code 10 times in a row and finding the average is good enough for most things...
A quick note about Bochs (and QEMU too I think) - every instruction takes exactly 1 cycle! For example, as far as Bochs is concerned, something like "mov al,1" or "inc ebx" takes the same amount of time as "fdiv". Also there is no pipelining, so "mov eax,1; mov ebx, 1" will be counted as 2 cycles where a modern CPU will do both in one cycle (depending on the CPU of course). This means a highly tuned series of simple integer instructions will take the same number of cycles as a series of depandant floating point divisions - definately not accurate in any way.
Of course this isn't really
profiling as I understand it. For that you'd also need to consider how often each routine is used. For example, if one routine takes 4 seconds and it's run once a day and another routine takes 1 second but it's run once every minute, then you'd want to optimize the quicker routine before worrying about the slower routine.
For working out how often code is used there's 2 methods that I'm aware of. The first is to have a timer IRQ that takes a "sample" - if the timer interrupts routine A 100 times and interrupts routine B 20 times, then you can assume routine A is 5 times as important for performance. This can work for a lot of things, but disabling interrupts can distort the results.
The next method involves adding a counter at the beginning of every routine. For example:
Code: Select all
section .data
counterA: dd 0
counterB: dd 0
counterC: dd 0
section .text
routineA:
inc dword [counterA]
..do the work...
ret
routineB:
inc dword [counterB]
..do the work...
ret
routineC:
inc dword [counterC]
..do the work...
ret
This isn't effected by disabling interrupts, but can be tedious to implement.
In addition to all of this, there's the performance monitoring counters that are built into modern CPUs. These are all model specific (e.g. code designed for a Pentium 4 probably won't work on anything else), and they aren't easy to understand (I gave up). IIRC Intel makes a special tool for performance tuning windows applications that does use them. Anyway, if this sort of thing sounds interesting or useful to you, the details are in chapter 15 of Intel's system programming guide...
Cheers,
Brendan