Hi,
Now this is another weird one. The same code run in qemu is faster than on real hardware.
And I mean a lot faster (like 200x faster). Is there any possible reason for that?
I've come across this when testing my kmalloc - in qemu the same code does about 700k allocs/s,
and only 15k allocs/s on real hardware. The only time I've seen a slowdown like this is when
the CPU's caches have been disabled in the BIOS. The test platform is qemu on Win7 64-bit running
on a Core i5, versus just booting the kernel on the same hardware directly from a USB stick.
Any ideas?
Cheers,
Andrew
[SOLVED] QEMU faster than real hardware?
theesemtheesem
ScropTheOSAdventurer
Re: QEMU faster than real hardware?
Do you mean the cache being disabled on the actual hardware, just in the emulator, or both?
"Procrastination is the art of keeping up with yesterday."
Re: QEMU faster than real hardware?
Hi,
I think you'll need to be more specific about what the code actually does. For example, if you're allocating all RAM in 1 KiB pieces, then running it on Qemu emulating 8 MiB of RAM is likely to be a lot faster than running it on real hardware with 8 GiB of RAM. For another example, maybe Qemu is only emulating one CPU while your real hardware is using 8 CPUs, and there's a lot of lock contention and/or cache line bouncing.
Also, how are you measuring time (host wall clock time, guest virtual time, RDTSC)?
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
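For reference, a minimal sketch of the RDTSC option Brendan mentions - it assumes a constant-rate TSC, a tsc_hz value calibrated elsewhere (e.g. against the PIT), and stand-in signatures for the allocator under test:
Code:
#include <stdint.h>

extern void *kmalloc(uint32_t size);   /* signatures assumed, not from the post */
extern void  kfree(void *p);

/* Read the time-stamp counter; the lfence keeps rdtsc from being
   reordered with the surrounding work. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("lfence; rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Time N alloc/free pairs and convert cycles to allocations/second. */
uint64_t allocs_per_second(uint64_t tsc_hz)
{
    enum { N = 100000 };
    uint64_t start = rdtsc();
    for (uint32_t i = 0; i < N; i++)
        kfree(kmalloc(64));            /* allocate and free a small block */
    uint64_t cycles = rdtsc() - start;
    return cycles ? ((uint64_t)N * tsc_hz) / cycles : 0;
}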
theesemtheesem
Re: QEMU faster than real hardware?
Hi,
Once again, thanks for your valued responses.
Turns out there was a typo in the assembly code that builds the PDPTEs:
Code:
bts eax, 0 ;set to PRESENT
btc eax, 3 ;clear WT ONLY
btc eax, 4 ;clear CACHE DISABLE
Obviously, if these bits are zero in the first place they become set - and thus no cache. I mix this up all the time (btc so nicely expands to bit test and clear in my head, but is bit test and complement in reality). Intel should have named it btf (bit test and flip).
The correct code is:
Code:
bts eax, 0 ;set to PRESENT
btr eax, 3 ;clear WT ONLY
btr eax, 4 ;clear CACHE DISABLE
btr being bit test and reset, obviously.
Interesting side note - both BOCHS and QEMU ignore these bits, as well as any cache control in the page entries, in CR3, or in CR0, in fact. You can disable the cache in CR0 and qemu will run as fast as with the cache enabled. So whatever you set in there, qemu and bochs will still work at full speed. Don't know about VMware and others; it would be interesting to find out.
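Spelled out in C, the fix amounts to the following - PRESENT, WT and CACHE DISABLE are bits 0, 3 and 4 of the entry; the helper name and constants are made up for the sketch:
Code:
#include <stdint.h>

#define PDPTE_PRESENT (1ULL << 0)   /* P - entry is valid        */
#define PDPTE_PWT     (1ULL << 3)   /* page-level write-through  */
#define PDPTE_PCD     (1ULL << 4)   /* page-level cache disable  */

/* Illustrative helper: set PRESENT and force PWT/PCD off
   (the btr fix), rather than toggling them (the btc bug). */
static inline uint64_t make_pdpte(uint64_t pd_phys)
{
    uint64_t entry = pd_phys & ~0xFFFULL;   /* 4 KiB-aligned address */
    entry |= PDPTE_PRESENT;
    entry &= ~(PDPTE_PWT | PDPTE_PCD);      /* btr, not btc */
    return entry;
}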
To answer Brendan: it was not time as such that was measured, but the number of kernel timer ticks that elapse during a big loop with lots of kmalloc/kfree in it.
Cheers,
Andrew
Re: QEMU faster than real hardware?
I know you got the problem solved, but out of curiosity, what kind of heap implementation are you using?
theesemtheesem
Re: QEMU faster than real hardware?
Hi,
It's something I wrote myself, focused on minimizing memory fragmentation. I thought about using dlmalloc as my kernel's kmalloc, but since I write in Pascal it seemed like too much work. Besides, dlmalloc's approach with smallbins and trees seemed overcomplicated to me. Please note that this is just for internal use by the kernel; I haven't got to the stage where I actually give out memory to userspace programs yet.
So, as I said, my implementation focuses on minimizing heap fragmentation and speed - in that order. It might not be memory efficient, but memory is cheap nowadays, so I don't see that as a huge problem. All blocks are 16-byte aligned so MMX/SSE instructions can be used to shuffle data between/inside these blocks, free blocks are merged where possible and split if necessary, and the heap top is unchained where possible.
Cheers,
Andrew
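For the curious, a rough C sketch of a free list along the lines Andrew describes - 16-byte-aligned blocks, first-fit with split on allocation and forward merge on free. Every name and the exact layout here are invented for illustration; his actual kernel is in Pascal:
Code:
#include <stddef.h>
#include <stdint.h>

typedef struct block {
    size_t        size;   /* payload size, multiple of 16 */
    int           free;
    struct block *next;   /* next block by address        */
} block_t;

#define ALIGN16(n)  (((n) + 15) & ~(size_t)15)
#define HDR_SIZE    ALIGN16(sizeof(block_t))

static block_t *heap_head;  /* first block of a contiguous heap region */

/* Carve one big free block out of a 16-byte-aligned region. */
void heap_init_sketch(void *mem, size_t len)
{
    heap_head = (block_t *)mem;
    heap_head->size = (len - HDR_SIZE) & ~(size_t)15;
    heap_head->free = 1;
    heap_head->next = NULL;
}

void *kmalloc_sketch(size_t n)
{
    n = ALIGN16(n ? n : 16);
    for (block_t *b = heap_head; b; b = b->next) {
        if (!b->free || b->size < n)
            continue;
        if (b->size >= n + HDR_SIZE + 16) {   /* big enough to split */
            block_t *rest = (block_t *)((uint8_t *)b + HDR_SIZE + n);
            rest->size = b->size - n - HDR_SIZE;
            rest->free = 1;
            rest->next = b->next;
            b->size = n;
            b->next = rest;
        }
        b->free = 0;
        return (uint8_t *)b + HDR_SIZE;       /* payload stays 16-aligned */
    }
    return NULL;                              /* out of memory */
}

void kfree_sketch(void *p)
{
    if (!p)
        return;
    block_t *b = (block_t *)((uint8_t *)p - HDR_SIZE);
    b->free = 1;
    if (b->next && b->next->free) {           /* merge with next block only */
        b->size += HDR_SIZE + b->next->size;
        b->next  = b->next->next;
    }
}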