It's been a very long time since I was last here or experimenting with OS code (about 6 years, I think!). In any event, out of interest I pulled out my old code. It compiled, I put it on a USB stick, and it booted. Amazing.
I really just wanted to try something out: profile some blocks of code and compare my OS/system startup against the same code under Windows.
Under Windows my setup is:
Core i7 quadcore (Sandy Bridge)
Windows 7 Ultimate x64
Test code assembled using JWASM
My OS (If you can really call it that):
Code assembled with FASM
Long mode, basic identity-mapped memory, no interrupts enabled, etc.
My theory was that, with no interrupts, no context switches, and less cache thrashing and fewer TLB misses, the code blocks should run at least as fast as under Windows, possibly (hopefully) a bit faster. (Not sure if anyone has tried this sort of benchmark before?)
However, this is not the case: my first test takes 33 seconds to execute under Windows and 47 seconds under my long mode startup.
So I have a couple of suspicions about what might cause this, listed in order of likelihood:
1) The caches may not be enabled properly. I enable them first during boot with BIOS calls to int 15h, then on the long mode switch with:
Code:
;------------------------------------------------------------------------------
; Clear CD and NW flags in CR0 to enable CPU caches.
;------------------------------------------------------------------------------
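mov rax,cr0          ; read the current CR0
and eax,9fffffffh    ; clear NW (bit 29) and CD (bit 30); writing EAX zeroes the upper half of RAX
mov cr0,rax          ; write CR0 back with caching enabled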
2) Under Windows I run the High performance power plan; perhaps on boot, via ACPI, the CPU is not running at its maximum speed?
(For example, if I switch Windows to the Balanced power plan, the same code takes 1 minute to run.)
3) The memory address I'm using in the test under Windows is obtained via LEA (just a small array in the code), so its physical address is probably somewhere higher up in memory than the test address of 0x90000 I'm using in the OS code.
Is it possible that this area (being under 640K, not that that really applies anymore) might be non-cacheable or have different attributes?
4) A problem with the attributes set on the paging structures.
5) Windows does some secret magic to get more performance out of the machine. (I highly doubt this, as my figures below indicate that Windows is running the test at about the optimal level.)
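Suspicions 1, 3 and 4 all come down to which memory type the CPU ends up applying to 0x90000. Below is a minimal sketch in C of how the relevant register values could be decoded once they have been read out (CR0 with a mov, the MSRs with rdmsr, the paging entry from the page tables). It only interprets numbers and does not touch hardware. The function names are made up for illustration; the bit positions and MSR numbers are as I understand them from the Intel SDM and are worth double-checking there.

```c
#include <assert.h>
#include <stdint.h>

/* CR0 cache-control bits: NW is bit 29, CD is bit 30. */
#define CR0_NW (UINT64_C(1) << 29)
#define CR0_CD (UINT64_C(1) << 30)

/* Memory-type encodings used by the MTRRs and the PAT (UC- is PAT only). */
enum { MT_UC = 0, MT_WC = 1, MT_WT = 4, MT_WP = 5, MT_WB = 6, MT_UC_MINUS = 7 };

/* Nonzero if CR0 still has CD or NW set, i.e. caching is restricted. */
static int cr0_cache_restricted(uint64_t cr0)
{
    return (cr0 & (CR0_CD | CR0_NW)) != 0;
}

/* IA32_MTRR_DEF_TYPE (MSR 0x2FF): bit 11 enables the MTRRs, bit 10 enables
   the fixed-range MTRRs that describe the first megabyte. */
static int fixed_mtrrs_active(uint64_t def_type)
{
    return ((def_type >> 11) & 1) && ((def_type >> 10) & 1);
}

/* IA32_MTRR_FIX16K_80000 (MSR 0x258) holds eight one-byte memory types,
   one per 16 KiB block of 0x80000-0x9FFFF. */
static unsigned fix16k_80000_type(uint64_t msr, uint32_t phys)
{
    assert(phys >= 0x80000u && phys <= 0x9FFFFu);
    unsigned idx = (phys - 0x80000u) / 0x4000u;
    return (unsigned)((msr >> (8 * idx)) & 0xFF);
}

/* PAT index selected by a paging entry that maps a page: PWT is bit 3,
   PCD is bit 4, and PAT is bit 7 for a 4 KiB PTE or bit 12 for a
   2 MiB / 1 GiB entry. */
static unsigned pat_index(uint64_t entry, int large_page)
{
    unsigned pwt = (unsigned)((entry >> 3) & 1);
    unsigned pcd = (unsigned)((entry >> 4) & 1);
    unsigned pat = (unsigned)((entry >> (large_page ? 12 : 7)) & 1);
    return (pat << 2) | (pcd << 1) | pwt;
}

/* Memory type stored in IA32_PAT (MSR 0x277) for a given PAT index. */
static unsigned pat_type(uint64_t pat_msr, unsigned index)
{
    return (unsigned)((pat_msr >> (8 * index)) & 0x7);
}
```

For example, with the power-on default IA32_PAT value 0x0007040600070406, an entry with PCD and PWT clear selects index 0 (WB), while an entry with only PCD set selects index 2 (UC-). The effective memory type is a combination of the MTRR and PAT types, so both are worth checking for the page that contains 0x90000.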
For reference, here is the test block, which in theory should be well balanced, as the CPU and memory bandwidth max out at about the same point.
Assuming a 64-bit bus, 8-byte reads, and a 3 GHz CPU (one read per clock), it should achieve 24 GB/s (which is very close to the theoretical limit of my RAM).
The loop reads 100,000,000 x 1,000 x 8 bytes = 800 GB in total, so my timing of 47 seconds equates to a read throughput of about 17 GB/s.
Windows achieves almost exactly the expected figure: 800 GB in 33 seconds is about 24 GB/s.
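The arithmetic behind those figures can be written out as a small C sketch. The constants come straight from the test block; the function names are just for illustration, and "GB" here means 10^9 bytes.

```c
#include <stdint.h>

/* Loop parameters taken from the test block. */
#define OUTER_ITERATIONS UINT64_C(100000000) /* mov rcx,100000000 */
#define INNER_ITERATIONS UINT64_C(1000)      /* mov rdx,1000 */
#define BYTES_PER_LOAD   UINT64_C(8)         /* mov rax,[rdi] */

/* Total bytes read by the whole test. */
static uint64_t total_bytes_read(void)
{
    return OUTER_ITERATIONS * INNER_ITERATIONS * BYTES_PER_LOAD;
}

/* Throughput in decimal gigabytes (1e9 bytes) per second. */
static double gigabytes_per_second(double seconds)
{
    return (double)total_bytes_read() / seconds / 1e9;
}

/* Inner-loop iterations completed per second, in billions. */
static double giga_iterations_per_second(double seconds)
{
    return (double)(OUTER_ITERATIONS * INNER_ITERATIONS) / seconds / 1e9;
}
```

With these, gigabytes_per_second(33.0) comes out at about 24.2 and gigabytes_per_second(47.0) at about 17.0. The iteration rates are about 3.03 and 2.13 billion per second respectively. Since each outer pass re-reads the same 8000 bytes, these figures describe the load rate on that one block rather than streaming bandwidth.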
Code:
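mov rcx,100000000     ; outer loop count: 100 million passes
outerloop:
mov rdi,90000h        ; start of the 8000-byte test region
mov rdx,1000          ; inner loop count: 1000 qword reads per pass
@@:
mov rax,[rdi]         ; 8-byte read
add rdi,8             ; advance to the next qword
dec rdx
jnz short @B
dec rcx
jnz short outerloop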
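One possible way to test suspicion 2 from inside the long mode code is to read IA32_MPERF (MSR 0xE7) and IA32_APERF (MSR 0xE8) with rdmsr before and after the loop. This assumes the CPU reports them, which as far as I know is CPUID leaf 6, ECX bit 0. The C sketch below only does the arithmetic on the four sampled values. The function names and the reference_ghz parameter (the nominal frequency you supply) are assumptions for illustration.

```c
#include <stdint.h>

/* Average core clock over an interval, from IA32_APERF / IA32_MPERF deltas.
   MPERF counts at a fixed reference frequency and APERF at the actual
   frequency, both only while the core is in C0. */
static double average_core_ghz(uint64_t aperf_start, uint64_t aperf_end,
                               uint64_t mperf_start, uint64_t mperf_end,
                               double reference_ghz)
{
    uint64_t aperf_delta = aperf_end - aperf_start;
    uint64_t mperf_delta = mperf_end - mperf_start;
    if (mperf_delta == 0)
        return 0.0; /* no reference ticks elapsed; the ratio is undefined */
    return reference_ghz * (double)aperf_delta / (double)mperf_delta;
}

/* Core clock cycles spent per inner-loop iteration. */
static double cycles_per_iteration(uint64_t aperf_delta, uint64_t iterations)
{
    return iterations ? (double)aperf_delta / (double)iterations : 0.0;
}
```

If the average clock under the long mode startup turns out well below the nominal frequency while the cycles per iteration stay the same as expected, the slowdown would point at clock speed rather than at caching or paging attributes.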
Anyone have any thoughts or suggestions as to where to look?