My OS's GUI runs slow in Bochs. That's expected. It runs fast in VirtualBox and OK on real hardware (two test PCs).
It runs quite slow in QEMU. I enable write-back caching at boot (clearing the CD and NW bits of CR0, and the PCD/PWT bits in CR3). Is there a way of speeding up a memcpy other than using the SSE aligned instructions when possible? When the source and destination are not aligned, I fall back to the unaligned SSE instructions (movdqu instead of movdqa).
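For reference, this is roughly the kind of SSE copy loop being described, written with compiler intrinsics rather than hand-written assembly (the function name and the alignment handling are just illustrative, and it assumes the kernel has already enabled SSE and handles FPU/SSE state saving):

    #include <emmintrin.h>   /* SSE2 intrinsics: __m128i, _mm_loadu_si128, ... */
    #include <stddef.h>
    #include <stdint.h>

    /* Copy len bytes, 16 at a time, with a byte loop for the tail.
       Aligned stores (movdqa) are only used when dst is 16-byte aligned. */
    static void *sse_memcpy(void *dst, const void *src, size_t len)
    {
        uint8_t *d = dst;
        const uint8_t *s = src;
        int dst_aligned = ((uintptr_t)d & 15) == 0;

        while (len >= 16) {
            __m128i x = _mm_loadu_si128((const __m128i *)s);  /* movdqu load  */
            if (dst_aligned)
                _mm_store_si128((__m128i *)d, x);             /* movdqa store */
            else
                _mm_storeu_si128((__m128i *)d, x);            /* movdqu store */
            s += 16; d += 16; len -= 16;
        }
        while (len--)                                         /* tail bytes */
            *d++ = *s++;
        return dst;
    }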
I've tried implementing AVX, and it majorly improves performance in Bochs, but decreases it on real hardware. (In a test copying 8 MB, REP MOVSB did it in ~70 milliseconds, REP MOVSQ did it in 8-9 milliseconds if I remember correctly, an AVX aligned memcpy did it in 7 milliseconds, and an SSE aligned memcpy did it in 5-6 milliseconds.) I've run these tests almost 10 times with similar results, so I removed the AVX detection and enabling code from my OS.
I've also tried setting the VESA framebuffer's MTRR to write-combining, but that doesn't do much in QEMU, doesn't do anything on my laptop because all the MTRRs are already populated by the BIOS, and somehow corrupts the framebuffer on one of my test PCs.
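In case it helps anyone reading along, a rough sketch of what programming one variable-range MTRR to write-combining can look like (the framebuffer address, size, and choice of MTRR slot are invented for the example; a real implementation must also follow the cache-disable/flush/re-enable sequence from the Intel manuals and check how many variable MTRRs exist and are free via IA32_MTRRCAP):

    #include <stdint.h>

    /* IA32_MTRR_PHYSBASEn / IA32_MTRR_PHYSMASKn MSR pairs start at 0x200. */
    #define IA32_MTRR_PHYSBASE0 0x200
    #define IA32_MTRR_PHYSMASK0 0x201
    #define MTRR_TYPE_WC        0x01            /* write-combining */
    #define MTRR_MASK_VALID     (1ULL << 11)

    static void wrmsr(uint32_t msr, uint64_t value)
    {
        __asm__ volatile("wrmsr" : : "c"(msr), "a"((uint32_t)value),
                                     "d"((uint32_t)(value >> 32)));
    }

    /* Example only: mark a 16 MB framebuffer at 0xE0000000 as write-combining
       using variable MTRR pair 0. Assumes that pair is actually free, a 36-bit
       physical address width, and that the caller wraps this in the proper
       cache-disable/flush sequence. */
    static void set_framebuffer_wc(void)
    {
        uint64_t base = 0xE0000000ULL | MTRR_TYPE_WC;
        uint64_t mask = (~(0x1000000ULL - 1) & 0xFFFFFFFFFULL) | MTRR_MASK_VALID;
        wrmsr(IA32_MTRR_PHYSBASE0, base);
        wrmsr(IA32_MTRR_PHYSMASK0, mask);
    }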
Is there anything else I am not aware of? Any help will be greatly appreciated.
Thanks in advance.
Improving performance in QEMU
- BrightLight
- Member
- Posts: 901
- Joined: Sat Dec 27, 2014 9:11 am
- Location: Maadi, Cairo, Egypt
You know your OS is advanced when you stop using the Intel programming guide as a reference.
- Kazinsal
- Member
- Posts: 559
- Joined: Wed Jul 13, 2011 7:38 pm
- Libera.chat IRC: Kazinsal
- Location: Vancouver
Re: Improving performance in QEMU
At best, QEMU uses emulation with dynamic translation, whereas VirtualBox uses the hardware virtualization extensions. QEMU can use them as well if you install KVM and pass -enable-kvm to QEMU. You need a CPU with VT-x support for this -- any recent Intel CPU will do.
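For example (assuming a raw disk image called myos.img; adjust the memory size and image name to taste):

    qemu-system-x86_64 -enable-kvm -m 512 -drive format=raw,file=myos.img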
- BrightLight
- Member
- Posts: 901
- Joined: Sat Dec 27, 2014 9:11 am
- Location: Maadi, Cairo, Egypt
Re: Improving performance in QEMU
Kazinsal wrote:At best, QEMU uses emulation with dynamic translation, whereas VirtualBox uses the hardware virtualization extensions. QEMU can use them as well if you install KVM and pass -enable-kvm to QEMU. You need a CPU with VT-x support for this -- any recent Intel CPU will do.
Some other graphical OSes run fast in the same QEMU configuration. I meant: are there any more ways to speed up my code, not QEMU itself?
You know your OS is advanced when you stop using the Intel programming guide as a reference.
Re: Improving performance in QEMU
Hi,
omarrx024 wrote:Is there anything else I am not aware of?
First determine why it's slow. If it spends 256 ms redrawing everything in a buffer in RAM and then another 2 ms copying everything from the buffer to display memory, it would be silly to worry about how fast you can copy data while ignoring how fast you draw things.
If the final blit from buffer to display memory is actually a problem; the next step is to minimise the amount of data you copy. This means keeping track of which areas changed and which areas didn't, and only copying areas that actually changed; which means modifying the code that draws things so that it updates whatever you use to keep track of modified/not modified. One of the simplest ways to do this is to have a "row modified" flag for each row of pixels (e.g. 768 flags if you're using a 1024 * 768 video mode) and only copy rows of pixels that changed. More complex methods include "dirty rectangles".
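A rough sketch of the "row modified" idea, assuming a 1024 * 768 * 32 bpp mode with pitch equal to width * 4 (all names are invented for the example):

    #include <stdint.h>
    #include <string.h>

    #define SCREEN_W 1024
    #define SCREEN_H 768

    static uint32_t backbuffer[SCREEN_W * SCREEN_H];   /* drawn into by the GUI  */
    static uint32_t *framebuffer;                      /* mapped display memory  */
    static uint8_t row_dirty[SCREEN_H];                /* one flag per pixel row */

    /* Drawing code calls this for every row it touches. */
    static void mark_rows_dirty(int first_row, int last_row)
    {
        for (int y = first_row; y <= last_row; y++)
            row_dirty[y] = 1;
    }

    /* Blit: copy only the rows that changed since the last blit. */
    static void blit_dirty_rows(void)
    {
        for (int y = 0; y < SCREEN_H; y++) {
            if (!row_dirty[y])
                continue;
            memcpy(framebuffer + y * SCREEN_W,
                   backbuffer + y * SCREEN_W,
                   SCREEN_W * sizeof(uint32_t));
            row_dirty[y] = 0;
        }
    }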
Also note that (because video display memory is much slower to access than RAM) often you can improve performance by using another buffer in RAM (that contains the pixel data for the previous frame) and doing "if( olddata[src] != newdata[src] ) { olddata[src] = newdata[src]; videoDisplayMemory[dest] = newdata[src]; }" so that you only write (groups of) pixels that are different (and don't write pixels that remained the same colour despite being modified).
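Expanded slightly, that comparison loop might look like this (made-up names, 32 bpp assumed, and the caller is responsible for keeping the old-frame copy in RAM):

    #include <stdint.h>

    /* Compare the new frame against a copy of the previous frame and only
       touch display memory where a pixel actually changed colour. */
    static void blit_changed_pixels(const uint32_t *newframe, uint32_t *oldframe,
                                    uint32_t *display, int pixel_count)
    {
        for (int i = 0; i < pixel_count; i++) {
            if (oldframe[i] != newframe[i]) {
                oldframe[i] = newframe[i];    /* remember it for next frame */
                display[i]  = newframe[i];    /* write-only access to VRAM  */
            }
        }
    }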
There is rarely any reason to be doing misaligned writes to video display memory. The trickiest case is 24-bpp modes, where it's fairly trivial to do groups of 4 pixels as 3 aligned dwords.
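For what it's worth, the "4 pixels as 3 aligned dwords" trick for 24 bpp can be sketched like this (assuming pixels are passed as 0x00RRGGBB values and display memory uses the usual little-endian B, G, R byte order per pixel):

    #include <stdint.h>

    /* Write 4 24-bpp pixels (given as 0x00RRGGBB) as 3 aligned 32-bit stores.
       dst must be 4-byte aligned; count must be a multiple of 4 pixels. */
    static void write_pixels_24bpp(uint32_t *dst, const uint32_t *src, int count)
    {
        for (int i = 0; i < count; i += 4) {
            uint32_t p0 = src[i], p1 = src[i + 1], p2 = src[i + 2], p3 = src[i + 3];
            dst[0] = p0 | (p1 << 24);            /* bytes B0 G0 R0 B1 */
            dst[1] = (p1 >> 8) | (p2 << 16);     /* bytes G1 R1 B2 G2 */
            dst[2] = (p2 >> 16) | (p3 << 8);     /* bytes R2 B3 G3 R3 */
            dst += 3;
        }
    }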
A single "blit data from buffer to display memory" function is probably a bad idea. Use a function pointer; and have multiple functions for blitting (one for the "bytes_between_lines == horizontal_resolution * bytes_per_pixel" special case, one for the rare misaligned cases, etc) and determine which blitting function to use (and set the "blitter function pointer") when you initialise your code.
Finally; things like "memcpy()" have a setup overhead (where it figures out what it's doing) followed by "per byte of data" overhead (which is often limited by bus bandwidth, especially for video display memory). For blitting (where you're typically copying many small pieces and not doing a single large copy) the setup overhead can matter more than the "per byte of data" overhead. For implementations that use SSE and AVX the setup overhead is significantly higher (they're typically designed for "low per byte of data overhead" instead of being useful for well designed code); which means that for a large number of small copies (that you get when you're doing blitting right) they're extremely bad.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Improving performance in QEMU
omarrx024 wrote:I meant: are there any more ways to speed up my code, not QEMU itself?
You didn't even tell us what your code is doing slowly, so how do we know what needs to be improved?
Are you drawing and redrawing too much too often? If you are, maybe you shouldn't.
- BrightLight
- Member
- Posts: 901
- Joined: Sat Dec 27, 2014 9:11 am
- Location: Maadi, Cairo, Egypt
Re: Improving performance in QEMU
I need a way to speed up memcpy. I know that I redraw the screen too much, but this will be fixed soon.
P.S.: I know memcpy is what is making things slow, because every time I move a window, I redraw the background before redrawing the windows. The background is a BMP image decoded into a plain 32-bit pixel buffer, which means that in 32-bit VBE modes (the default), it just gets memcpy'd line by line. When I take away the background and replace it with a solid color, performance is quite nice in QEMU and even acceptable in Bochs. That's why I'm asking how to speed up memcpy, or whether there are other ways to redraw an image background.
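For what it's worth, that background restore can be sketched as a rectangle copy (names are illustrative): passing the whole screen reproduces the current line-by-line memcpy, while passing only the area a moved window uncovered copies far less data.

    #include <stdint.h>
    #include <string.h>

    /* Copy a rectangle of the decoded 32-bit background into the backbuffer.
       Assumes background and backbuffer share the same layout (screen_w
       pixels per line). */
    static void restore_background(uint32_t *backbuffer, const uint32_t *background,
                                   int screen_w, int x, int y, int w, int h)
    {
        for (int row = y; row < y + h; row++)
            memcpy(backbuffer + row * screen_w + x,
                   background + row * screen_w + x,
                   (size_t)w * sizeof(uint32_t));
    }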
You know your OS is advanced when you stop using the Intel programming guide as a reference.