[SOLVED] QEMU faster than real hardware?

theesemtheesem
Member
Posts: 31
Joined: Thu Mar 20, 2014 2:22 pm
Location: London, UK

Post by theesemtheesem »

Hi,

Now this is another weird one. The same code runs faster in QEMU than on real hardware.
And I mean a lot faster (like 200x faster). Is there any possible reason for that?

I came across this when testing my kmalloc: the same code does about 700k allocs/s in QEMU,
and only 15k allocs/s on real hardware. The only time I've seen a speed reduction like this is when
the CPU's caches have been disabled in the BIOS. The test platform is QEMU on 64-bit Windows 7 running
on a Core i5, versus booting the kernel directly from a USB stick on the same hardware.

Any ideas?

Cheers,
Andrew
Last edited by theesemtheesem on Wed Apr 02, 2014 8:04 am, edited 1 time in total.
ScropTheOSAdventurer
Member
Posts: 86
Joined: Sun Aug 25, 2013 5:47 pm
Location: Nebraska, USA

Post by ScropTheOSAdventurer »

Do you mean the cache being disabled on the actual hardware, just in the emulator, or both?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Post by Brendan »

Hi,
theesemtheesem wrote: Now this is another weird one. The same code runs faster in QEMU than on real hardware.
And I mean a lot faster (like 200x faster). Is there any possible reason for that?

I came across this when testing my kmalloc: the same code does about 700k allocs/s in QEMU,
and only 15k allocs/s on real hardware. The only time I've seen a speed reduction like this is when
the CPU's caches have been disabled in the BIOS. The test platform is QEMU on 64-bit Windows 7 running
on a Core i5, versus booting the kernel directly from a USB stick on the same hardware.

Any ideas?
I think you'll need to be more specific about what the code actually does. For example, if you're allocating all RAM in 1 KiB pieces, then running it on Qemu emulating 8 MiB of RAM is likely to be a lot faster than running it on real hardware with 8 GiB of RAM. For another example, maybe Qemu is only emulating one CPU and your real hardware is using 8 CPUs and there's a lot of lock contention and/or cache line bouncing.

Also, how are you measuring time (host wall clock time, guest virtual time, RDTSC)?


Cheers,

Brendan
theesemtheesem
Member
Posts: 31
Joined: Thu Mar 20, 2014 2:22 pm
Location: London, UK

Post by theesemtheesem »

Hi,

Once again, thanks for your valued responses.
It turns out there was a typo in the assembly code that builds the PDPTEs:

Code:

   bts eax, 0 ;set to PRESENT
   btc eax, 3 ;clear WT ONLY
   btc eax, 4 ;clear CACHE DISABLE
Obviously, if these bits are zero in the first place they become set, and thus the cache is disabled.
I mix this up all the time (btc so nicely expands to "bit test and clear" in my head, but
it is "bit test and complement" in reality). Intel should have named it btf (bit test and flip).
The correct code is:

Code:

  bts eax, 0 ; set P (present)
  btr eax, 3 ; clear PWT (page-level write-through)
  btr eax, 4 ; clear PCD (page-level cache disable)
btr being "bit test and reset", obviously.
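
The difference between the two sequences can be modelled in a few lines of C. This is only a sketch: the helper names (bit_set, bit_complement, bit_reset, build_flags_buggy, build_flags_fixed) are made up for illustration and are not from the kernel in question, and the model ignores the fact that the real BTS/BTC/BTR instructions also copy the old bit value into the carry flag.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions of the flags touched by the snippets above. */
enum {
    PTE_PRESENT = 0, /* P:   entry is present         */
    PTE_PWT     = 3, /* PWT: page-level write-through */
    PTE_PCD     = 4  /* PCD: page-level cache disable */
};

/* Models BTS: the bit ends up 1 regardless of its old value. */
static uint64_t bit_set(uint64_t value, unsigned bit) {
    assert(bit < 64);
    return value | (UINT64_C(1) << bit);
}

/* Models BTC: the bit is flipped, so a 0 becomes 1 and a 1 becomes 0. */
static uint64_t bit_complement(uint64_t value, unsigned bit) {
    assert(bit < 64);
    return value ^ (UINT64_C(1) << bit);
}

/* Models BTR: the bit ends up 0 regardless of its old value. */
static uint64_t bit_reset(uint64_t value, unsigned bit) {
    assert(bit < 64);
    return value & ~(UINT64_C(1) << bit);
}

/* The sequence with the typo: toggles PWT and PCD instead of clearing them. */
static uint64_t build_flags_buggy(uint64_t entry) {
    entry = bit_set(entry, PTE_PRESENT);
    entry = bit_complement(entry, PTE_PWT);
    entry = bit_complement(entry, PTE_PCD);
    return entry;
}

/* The corrected sequence: PWT and PCD are always cleared. */
static uint64_t build_flags_fixed(uint64_t entry) {
    entry = bit_set(entry, PTE_PRESENT);
    entry = bit_reset(entry, PTE_PWT);
    entry = bit_reset(entry, PTE_PCD);
    return entry;
}
```

Starting from an entry whose flag bits are all zero, build_flags_buggy returns 0x19 (P, PWT and PCD set, so caching is off for that mapping), while build_flags_fixed returns 0x1. The buggy version only produces the intended result when both bits happen to be set beforehand.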

Interesting side note: both Bochs and QEMU ignore these bits, as well as any cache control
in the page entries, in CR3, or in CR0 for that matter. You can disable the cache in CR0 and QEMU will run
as fast as with the cache enabled. So whatever you set in there, QEMU and Bochs will still run at full speed.
I don't know about VMware and others; it would be interesting to find out.

To answer Brendan: it was not time as such that was measured, but the number of kernel timer ticks
that elapsed during a big loop with lots of kmalloc/kfree calls in it.
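
For reference, a tick count like this turns into an allocs/s figure roughly as sketched below. The function name ops_per_second and the tick_hz parameter are assumptions for illustration; the thread does not say what rate the kernel timer runs at, and the result is only as good as the assumption that the timer really fires at that rate on the platform being measured.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: converts "ops completed while elapsed_ticks timer
 * ticks went by" into operations per second, given the timer rate in Hz.
 * Integer division truncates, and ops * tick_hz is assumed not to
 * overflow 64 bits.
 */
static uint64_t ops_per_second(uint64_t ops, uint64_t elapsed_ticks,
                               uint32_t tick_hz) {
    assert(tick_hz > 0);
    if (elapsed_ticks == 0) {
        /* The loop finished within one tick: too short to measure. */
        return 0;
    }
    return ops * tick_hz / elapsed_ticks;
}
```

With a hypothetical 1000 Hz timer, 1,000,000 kmalloc/kfree pairs completing in 2000 ticks would give ops_per_second(1000000, 2000, 1000) == 500000.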

Cheers,
Andrew
Pancakes
Member
Posts: 75
Joined: Mon Mar 19, 2012 1:52 pm

Post by Pancakes »

I know you got the problem solved, but out of curiosity, what kind of heap implementation are you using?
theesemtheesem
Member
Posts: 31
Joined: Thu Mar 20, 2014 2:22 pm
Location: London, UK

Post by theesemtheesem »

Hi,

It's something I wrote myself, focused on minimizing memory fragmentation. I thought about using dlmalloc as my kernel's kmalloc, but since I write in Pascal it seemed like too much work. Besides, dlmalloc's approach with smallbins and trees seemed overcomplicated to me. Please note that this is just for internal use by the kernel; I haven't yet reached the stage where I actually give out memory to userspace programs.

So, as I said, my implementation focuses on minimizing heap fragmentation and on speed, in that order. It might not be memory efficient, but memory is cheap nowadays, so I don't see this as a huge problem. All blocks are 16-byte aligned so that MMX/SSE instructions can be used to shuffle data between or inside these blocks; free blocks are merged where possible and split if necessary, and the heap top is unchained where possible.
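
As a rough illustration of the 16-byte alignment and the "split if necessary" rule, here is a sketch in C (the actual allocator is written in Pascal and was not posted, so none of this is the author's code). The names HEAP_ALIGN, align_up16 and should_split, and the policy of splitting only when the remainder can hold a header plus one aligned unit, are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_ALIGN 16u /* block granularity assumed from the post */

/* Rounds a requested size up to the next multiple of 16 bytes. */
static size_t align_up16(size_t size) {
    assert(size <= SIZE_MAX - (HEAP_ALIGN - 1)); /* guard against wrap-around */
    return (size + (HEAP_ALIGN - 1)) & ~(size_t)(HEAP_ALIGN - 1);
}

/*
 * Hypothetical split policy: a free block of block_size payload bytes is
 * split only if, after serving the aligned request, the remainder can hold
 * a block header plus at least one 16-byte unit. Otherwise the whole block
 * is handed out, trading a little slack for less fragmentation.
 */
static int should_split(size_t block_size, size_t request, size_t header_size) {
    size_t need = align_up16(request);
    return block_size >= need + header_size + HEAP_ALIGN;
}
```

For example, a 10-byte request is rounded up to 16 bytes; with a 16-byte header, a 64-byte free block would be split (16 + 16 + 16 fits), while a 32-byte block would be handed out whole.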

Cheers,
Andrew