[SOLVED] QEMU faster than real hardware?
Posted: Sun Mar 23, 2014 3:44 pm
by theesemtheesem
Hi,
Now this is another weird one. The same code runs faster in QEMU than on real hardware.
And I mean a lot faster (around 200x). Is there any possible reason for that?
I came across this when testing my kmalloc - the same code does about 700k allocs/s in QEMU,
but only 15k allocs/s on real hardware. The only time I've seen a slowdown like this is when
the CPU's caches were disabled in the BIOS. The test setup is QEMU on 64-bit Win7 running
on a Core i5, versus booting the kernel on the same hardware directly from a USB stick.
Any ideas?
Any ideas?
Cheers,
Andrew
Re: QEMU faster than real hardware?
Posted: Sun Mar 23, 2014 3:54 pm
by ScropTheOSAdventurer
Do you mean the cache being disabled on the actual hardware or just in the emulator or both?
Re: QEMU faster than real hardware?
Posted: Sun Mar 23, 2014 4:16 pm
by Brendan
Hi,
theesemtheesem wrote:Now this is another weird one. The same code runs faster in QEMU than on real hardware.
And I mean a lot faster (around 200x). Is there any possible reason for that?
I came across this when testing my kmalloc - the same code does about 700k allocs/s in QEMU,
but only 15k allocs/s on real hardware. The only time I've seen a slowdown like this is when
the CPU's caches were disabled in the BIOS. The test setup is QEMU on 64-bit Win7 running
on a Core i5, versus booting the kernel on the same hardware directly from a USB stick.
Any ideas?
I think you'll need to be more specific about what the code actually does. For example, if you're allocating all RAM in 1 KiB pieces, then running it on Qemu emulating 8 MiB of RAM is likely to be a lot faster than running it on real hardware with 8 GiB of RAM. For another example, maybe Qemu is only emulating one CPU and your real hardware is using 8 CPUs and there's a lot of lock contention and/or cache line bouncing.
Also, how are you measuring time (host wall clock time, guest virtual time, RDTSC)?
Cheers,
Brendan
Re: QEMU faster than real hardware?
Posted: Sun Mar 23, 2014 4:59 pm
by theesemtheesem
Hi,
Once again, thanks for your valued responses.
Turns out there was a typo in the assembly code that builds the PDPTEs
Code:
bts eax, 0 ;set to PRESENT
btc eax, 3 ;clear WT ONLY
btc eax, 4 ;clear CACHE DISABLE
Obviously, if these bits are zero to begin with, btc sets them, and thus caching is disabled.
I mix this up all the time (btc so nicely expands to "bit test and clear" in my head, but
is "bit test and complement" in reality). Intel should have named it btf (bit test and flip).
The correct code is:
Code:
bts eax, 0 ;set to PRESENT
btr eax, 3 ;clear WT ONLY
btr eax, 4 ;clear CACHE DISABLE
btr being bit test and reset obviously.
Interesting side note - both Bochs and QEMU ignore these bits, as well as any cache control
in the page entries, in CR3, or even in CR0. You can disable the cache in CR0 and QEMU will run
as fast as with the cache enabled. So whatever you set in there, QEMU and Bochs still run at full speed.
I don't know about VMware and the others; it would be interesting to find out.
To answer Brendan: it wasn't time as such that was measured, but the number of kernel timer ticks
that elapsed during a big loop with lots of kmalloc/kfree calls in it.
Cheers,
Andrew
Re: QEMU faster than real hardware?
Posted: Mon Mar 24, 2014 10:04 am
by Pancakes
I know you got the problem solved but out of curiosity what kind of heap implementation are you using?
Re: QEMU faster than real hardware?
Posted: Mon Mar 24, 2014 3:17 pm
by theesemtheesem
Hi,
It's something I wrote myself, focused on minimizing memory fragmentation. I thought about using dlmalloc as my kernel's kmalloc, but since I write in Pascal it seemed like too much work. Besides, dlmalloc's approach with smallbins and trees seemed overcomplicated to me. Note that this is just for internal use by the kernel; I haven't reached the stage where I actually hand out memory to userspace programs yet.
So, as I said, my implementation focuses on minimizing heap fragmentation and on speed - in that order. It may not be memory efficient, but memory is cheap nowadays, so I don't see that as a huge problem. All blocks are 16-byte aligned so MMX/SSE instructions can be used to shuffle data between and inside them, free blocks are merged where possible and split when necessary, and the heap top is unchained where possible.
Cheers,
Andrew