Profiling
Re:Profiling
You can very well use Bochs to develop your kernel, but once you have seen your code working in Bochs, you should also test it on real hardware.
Re:Profiling
You can try unit testing every function in the kernel while timing how long each execution takes; the functions that are both slow and on a central, important path of execution (say IRQ0, page faults, task switches, memory allocation, etc.) would be likely candidates for optimization. I wouldn't worry too much about it until the kernel is finished, though, since you may end up rewriting its functions a few times to add new features, which would undo the performance tuning.
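For example, a minimal sketch of that kind of harness (test_alloc and test_switch are hypothetical unit tests, and the timing uses the CPU's time stamp counter via RDTSC, which is described further down the thread):
Code: Select all
section .data
test_table: dd test_alloc, test_switch, 0   ; hypothetical unit tests, zero-terminated
start_lo:   dd 0
start_hi:   dd 0

section .text
run_timed_tests:
    mov esi,test_table
.next:
    mov ebx,[esi]
    test ebx,ebx
    jz .done                   ; hit the terminating zero
    push esi                   ; the test routine may clobber registers
    rdtsc
    mov [start_lo],eax
    mov [start_hi],edx
    call ebx                   ; run one unit test
    rdtsc
    sub eax,[start_lo]
    sbb edx,[start_hi]         ; edx:eax = cycles this test took
    pop esi
    ; ...record or print the result for this test...
    add esi,4
    jmp .next
.done:
    ret

test_alloc:                    ; stub - replace with a real unit test
    ret
test_switch:                   ; stub - replace with a real unit test
    ret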
Re:Profiling
Well, if you take a routine that takes approximately a millisecond to run and execute it ten thousand or so times in a loop, you can very easily time it with a stopwatch. Beyond that, you can use your own timing code: read the time or tick count when it starts and when it ends, and subtract one from the other.
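As a rough sketch of that idea (tick_count is a hypothetical variable that your timer IRQ increments, and routine_being_timed stands in for whatever you're measuring):
Code: Select all
section .data
tick_count:  dd 0              ; incremented by your timer IRQ handler (assumed)
start_ticks: dd 0
runs_left:   dd 0

section .text
time_routine:
    mov eax,[tick_count]
    mov [start_ticks],eax      ; tick count at the start
    mov dword [runs_left],10000
.again:
    call routine_being_timed   ; the code being measured
    dec dword [runs_left]
    jnz .again
    mov eax,[tick_count]
    sub eax,[start_ticks]      ; eax = ticks elapsed for 10000 runs
    ret

routine_being_timed:           ; stub - replace with the real routine
    ret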
Re:Profiling
Hi,
Crazed123 wrote: Yes, but how do I time them? With my stopwatch?!
For timing a set of instructions, typically you'd use the RDTSC instruction (ReaD Time Stamp Counter). RDTSC returns the number of cycles the CPU has done so far (the time stamp). By reading the time stamp at the start and end you can work out how many cycles the code took to run.
For example:
Code: Select all
RDTSC
push eax
push edx
<code being timed>
RDTSC
pop ecx
pop ebx
sub eax,ebx
sbb edx,ecx ;edx:eax = number of cycles it took
To get very accurate results there's a pile of obvious and not-so-obvious things that need to be taken into account (the overhead that the measuring code itself adds, IRQs, CPU caches and instruction pipelines). Usually running the code 10 times in a row and finding the average is good enough for most things...
A quick note about Bochs (and QEMU too, I think) - every instruction takes exactly 1 cycle! For example, as far as Bochs is concerned, something like "mov al,1" or "inc ebx" takes the same amount of time as "fdiv". Also there is no pipelining, so "mov eax,1; mov ebx,1" will be counted as 2 cycles where a modern CPU will do both in one cycle (depending on the CPU of course). This means a highly tuned series of simple integer instructions will take the same number of cycles as a series of dependent floating point divisions - definitely not accurate in any way.
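Going back to the accuracy point above, here's a rough sketch (on real hardware, not Bochs) of subtracting the overhead of the measurement itself and averaging 10 runs. It only uses the low 32 bits of the time stamp, so it's only suitable for short measurements, and routine_being_timed is again a hypothetical routine under test:
Code: Select all
section .data
overhead:  dd 0
start:     dd 0
total:     dd 0
runs_left: dd 0

section .text
average_cycles:
    rdtsc
    mov [start],eax
    rdtsc
    sub eax,[start]
    mov [overhead],eax          ; cycles the measurement itself costs

    mov dword [total],0
    mov dword [runs_left],10
.next_run:
    rdtsc
    mov [start],eax
    call routine_being_timed    ; hypothetical routine under test
    rdtsc
    sub eax,[start]
    sub eax,[overhead]          ; remove the measurement overhead
    add [total],eax
    dec dword [runs_left]
    jnz .next_run

    mov eax,[total]
    xor edx,edx
    mov ecx,10
    div ecx                     ; eax = average cycles per run
    ret

routine_being_timed:            ; stub - replace with the real routine
    ret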
Of course this isn't really profiling as I understand it. For that you'd also need to consider how often each routine is used. For example, if one routine takes 4 seconds and it's run once a day and another routine takes 1 second but it's run once every minute, then you'd want to optimize the quicker routine before worrying about the slower one (the quick routine costs 60 × 24 = 1440 seconds per day, versus 4 seconds per day for the slow one).
For working out how often code is used there are two methods that I'm aware of. The first is to have a timer IRQ that takes a "sample" - if the timer interrupts routine A 100 times and interrupts routine B 20 times, then you can assume routine A is 5 times as important for performance. This can work for a lot of things, but disabling interrupts can distort the results.
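A very rough sketch of that sampling approach (not a complete handler - the EOI and anything else your timer IRQ normally does are left out, and routineA here is just a stub):
Code: Select all
section .data
samplesA: dd 0                 ; how many samples landed inside routineA

section .text
; Assumes the timer interrupted kernel code at the same privilege level, so
; there's no stack switch and no error code: after "push eax" the interrupted
; EIP sits at [esp+4].
timer_irq:
    push eax
    mov eax,[esp+4]            ; EIP that was interrupted
    cmp eax,routineA
    jb .not_in_A
    cmp eax,routineA_end
    jae .not_in_A
    inc dword [samplesA]       ; the sample landed inside routineA
.not_in_A:
    ; ...repeat the range check for the other routines, send the EOI, etc...
    pop eax
    iretd

routineA:                      ; stub - the routine being profiled
    ret
routineA_end: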
The next method involves adding a counter at the beginning of every routine. For example:
Code: Select all
section .data
counterA: dd 0
counterB: dd 0
counterC: dd 0
section .text
routineA:
inc dword [counterA]
..do the work...
ret
routineB:
inc dword [counterB]
..do the work...
ret
routineC:
inc dword [counterC]
..do the work...
ret
This isn't affected by disabling interrupts, but it can be tedious to implement.
In addition to all of this, there's the performance monitoring counters that are built into modern CPUs. These are all model specific (e.g. code designed for a Pentium 4 probably won't work on anything else), and they aren't easy to understand (I gave up). IIRC Intel makes a special tool for performance tuning Windows applications that does use them. Anyway, if this sort of thing sounds interesting or useful to you, the details are in chapter 15 of Intel's system programming guide...
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.