Profiling

Question about which tools to use, bugs, the best way to implement a function, etc should go here. Don't forget to see if your question is answered in the wiki first! When in doubt post here.
Post Reply
Crazed123

Profiling

Post by Crazed123 »

How does one go about profiling a kernel, in Bochs or on real hardware? I'm hoping to find out where I can make this thing a bit faster.
NOTNULL

Re:Profiling

Post by NOTNULL »

You can very well use BOCHS to develop your kernel. But, once you have seen your code working in BOCHS, it is necessary to test it in the real hardware also.
AR

Re:Profiling

Post by AR »

You can try unit testing every function in the kernel whilst timing how long each execution takes, the functions that are both slow and on a central important path of execution (say in IRQ0, page faults, task switches, memory allocation, etc) would be likely candidates for optimization. I wouldn't worry too much about it until it's finished though since you may end up rewriting the functions in the kernel a few times to add new features which would undo the performance tuning.
Crazed123

Re:Profiling

Post by Crazed123 »

Yes, but how do I time them? With my stopwatch?!
Kemp

Re:Profiling

Post by Kemp »

Well, if you take a routine that takes approximately a millisecond to run and do it ten thousand or so times in a loop you can very easily time it with a stopwatch. Beyond that you can use your own timing code, just get the time or tick count when it starts and ends and take one away from the other.
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re:Profiling

Post by Brendan »

Hi,
Crazed123 wrote:Yes, but how do I time them? With my stopwatch?!
For timing a set of instructions, typically you'd use the RDTSC instruction (ReaD Time Stamp Counter). RDTSC returns the number of cycles the CPU has done so far (the time stamp). By reading the time stamp at the start and end you can work out how many cycles the code took to run.

For e.g.:

Code: Select all

    RDTSC
    push eax
    push edx

    <code being timed>

    RDTSC
    pop ecx
    pop ebx
    sub eax,ebx
    sbb edx,ecx               ;edx:eax = number of cycles it took
To get very accurate results there's a pile of obvious and not-so-obvious things that need to be taken into account (the overhead that the measuring code itself adds, IRQs, CPU caches and instruction pipelines). Usually running the code 10 times in a row and finding the average is good enough for most things...

A quick note about Bochs (and QEMU too I think) - every instruction takes exactly 1 cycle! For example, as far as Bochs is concerned, something like "mov al,1" or "inc ebx" takes the same amount of time as "fdiv". Also there is no pipelining, so "mov eax,1; mov ebx, 1" will be counted as 2 cycles where a modern CPU will do both in one cycle (depending on the CPU of course). This means a highly tuned series of simple integer instructions will take the same number of cycles as a series of depandant floating point divisions - definately not accurate in any way.

Of course this isn't really profiling as I understand it. For that you'd also need to consider how often each routine is used. For example, if one routine takes 4 seconds and it's run once a day and another routine takes 1 second but it's run once every minute, then you'd want to optimize the quicker routine before worrying about the slower routine.

For working out how often code is used there's 2 methods that I'm aware of. The first is to have a timer IRQ that takes a "sample" - if the timer interrupts routine A 100 times and interrupts routine B 20 times, then you can assume routine A is 5 times as important for performance. This can work for a lot of things, but disabling interrupts can distort the results.

The next method involves adding a counter at the beginning of every routine. For example:

Code: Select all

   section .data
counterA:   dd 0
counterB:   dd 0
counterC:   dd 0
   section .text

routineA:
   inc dword [counterA]
   ..do the work...
   ret

routineB:
   inc dword [counterB]
   ..do the work...
   ret

routineC:
   inc dword [counterC]
   ..do the work...
   ret
This isn't effected by disabling interrupts, but can be tedious to implement.

In addition to all of this, there's the performance monitoring counters that are built into modern CPUs. These are all model specific (e.g. code designed for a Pentium 4 probably won't work on anything else), and they aren't easy to understand (I gave up). IIRC Intel makes a special tool for performance tuning windows applications that does use them. Anyway, if this sort of thing sounds interesting or useful to you, the details are in chapter 15 of Intel's system programming guide...


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Post Reply