Double Buffering Slower Than DBA

Octacone
Member
Posts: 1138
Joined: Fri Aug 07, 2015 6:13 am

Double Buffering Slower Than DBA

Post by Octacone »

DBA = Direct Buffer Access
Hello. Today I was finally able to enable double buffering. It took me quite a while to fix all the bugs I was having. Now I am really confused: my double buffering is slower than writing directly to screen (linear frame buffer) memory. I blame my poor memory copy function. Do you think this thing is any good? Do any of you have a super ultra fast memory copy that can be used for double buffering?

Code: Select all

void MemoryCopy(const void* from, void* to, uint32_t size)
{
	const uint8_t* src = (const uint8_t*) from;
	uint8_t* dst = (uint8_t*) to;

	/* 16 bytes per iteration. Note: MOVNTDQ requires the destination
	   to be 16-byte aligned, so 'to' must be aligned accordingly. */
	for(uint32_t i = 0; i < size / 16; i++)
	{
		__asm__ __volatile__("movups (%0), %%xmm0\n"
		                     "movntdq %%xmm0, (%1)\n"
		                     : : "r"(src), "r"(dst) : "memory", "xmm0");
		src += 16;
		dst += 16;
	}
	/* The leftover is size % 16, i.e. size & 15 -- the original
	   'size & 7' silently dropped 8-byte remainders. */
	if(size & 15)
	{
		size = size & 15;
		int d0, d1, d2;
		__asm__ __volatile__(
		"rep ; movsl\n\t"
		"testb $2,%b4\n\t"
		"je 1f\n\t"
		"movsw\n"
		"1:\ttestb $1,%b4\n\t"
		"je 2f\n\t"
		"movsb\n"
		"2:"
		: "=&c" (d0), "=&D" (d1), "=&S" (d2)
		: "0" (size/4), "q" (size), "1" ((long) dst), "2" ((long) src)
		: "memory");
	}
}
OS: Basic OS
About: 32 Bit Monolithic Kernel Written in C++ and Assembly, Custom FAT 32 Bootloader
Ycep
Member
Posts: 401
Joined: Mon Dec 28, 2015 11:11 am

Re: Double Buffering Slower Than DBA

Post by Ycep »

There is no such thing as "direct buffer access", as far as I know.
The code you posted down there looks like complete nonsense, though.
I don't think video double buffering can be compared with something called DBA.
If you need faster memory access, use SSE.
BrightLight
Member
Posts: 901
Joined: Sat Dec 27, 2014 9:11 am
Location: Maadi, Cairo, Egypt

Re: Double Buffering Slower Than DBA

Post by BrightLight »

Hi,

I didn't fully read your memcpy() function, but here are some tips.
First, check whether the source and destination are 16-byte aligned. If so, use the MOVDQA instruction both for reading from the source and for writing to the destination. If they are not aligned, use the MOVDQU instruction. MOVUPS and MOVAPS perform some floating-point checks, which waste a few CPU cycles that are very valuable when it comes to graphics programming. MOVDQA and MOVDQU don't perform floating-point checks; they are meant for packed integers, so no checks are performed.
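As a minimal sketch of that aligned/unaligned dispatch, here is a version using the SSE2 intrinsics that map to MOVDQA/MOVDQU, assuming GCC or Clang with `<emmintrin.h>`; the function name `sse2_copy` and the byte-loop tail are illustrative, not from the thread:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: copies 'size' bytes, taking the aligned path
   (MOVDQA) when both pointers are 16-byte aligned, else the unaligned
   path (MOVDQU). Leftover bytes fall back to a plain byte loop. */
void sse2_copy(void* to, const void* from, uint32_t size)
{
    uint8_t* dst = (uint8_t*) to;
    const uint8_t* src = (const uint8_t*) from;
    uint32_t blocks = size / 16;

    if ((((uintptr_t) dst | (uintptr_t) src) & 15) == 0) {
        for (uint32_t i = 0; i < blocks; i++) {
            __m128i x = _mm_load_si128((const __m128i*) src);  /* MOVDQA load */
            _mm_store_si128((__m128i*) dst, x);                /* MOVDQA store */
            src += 16; dst += 16;
        }
    } else {
        for (uint32_t i = 0; i < blocks; i++) {
            __m128i x = _mm_loadu_si128((const __m128i*) src); /* MOVDQU load */
            _mm_storeu_si128((__m128i*) dst, x);               /* MOVDQU store */
            src += 16; dst += 16;
        }
    }
    for (uint32_t i = 0; i < (size & 15); i++)  /* tail: size % 16 bytes */
        dst[i] = src[i];
}
```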
Next, double buffering will always use more CPU time if all you're doing is a bunch of clear screens and draw rectangles. The point of double buffering has always been to prevent flickering. For example, when your window manager is complete, and you have a desktop background, desktop icons, and two windows open, the screen will flicker if you draw all of them directly to the hardware framebuffer. A user may see the background only before the desktop icons are drawn; they may see one window before the other is drawn; etc... To work around this, you draw everything to a back buffer, and then memcpy() it to the screen. Optimize your memcpy() function, and you'll see the benefits of double buffering later.
One last thing that's worth mentioning: there are other things you can do to improve graphics performance. The simplest one (for now) is enabling caching if it's not already enabled (clearing bits 30 and 29 of the CR0 register). MTRRs and AVX are bonus things to boost performance even more when you get further into development.
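A sketch of that caching tip: the bit masking can live in a small testable helper (the constant names are made up), while the actual CR0 read/write must of course run in ring 0:

```c
#include <stdint.h>

/* CR0 bit 30 = CD (cache disable), bit 29 = NW (not write-through).
   Clearing both enables normal caching. */
#define CR0_CD (1u << 30)
#define CR0_NW (1u << 29)

static inline uint32_t cr0_enable_caching(uint32_t cr0)
{
    return cr0 & ~(CR0_CD | CR0_NW);
}

/* In the kernel, applying it might look like (ring 0 only):
   uint32_t cr0;
   __asm__ __volatile__("mov %%cr0, %0" : "=r"(cr0));
   cr0 = cr0_enable_caching(cr0);
   __asm__ __volatile__("mov %0, %%cr0" : : "r"(cr0));
*/
```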

Cheers.
You know your OS is advanced when you stop using the Intel programming guide as a reference.
Ch4ozz
Member
Posts: 170
Joined: Mon Jul 18, 2016 2:46 pm
Libera.chat IRC: esi

Re: Double Buffering Slower Than DBA

Post by Ch4ozz »

Of course double buffering is slower on virtual machines...
Just think about how it works internally: the VM's "VRAM" is just normal RAM, and some VMs don't even use copy-on-write. You will only see its true usefulness on real hardware.

Your memcpy uses 1 of the 8 XMM registers, so it's much slower than using all of the registers at once.
You should also use the PREFETCHNTA instruction to prefetch the memory.

SSE is able to copy 128 bytes in one loop using 16 instructions.
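A hedged sketch of such a 128-bytes-per-iteration loop (8 loads plus 8 stores, with PREFETCHNTA hints); the function name is invented, MOVUPS is used so alignment doesn't matter, and the tail is a plain byte loop:

```c
#include <stdint.h>

/* Copies 128 bytes per iteration using all eight XMM registers. */
void sse_copy128(void* to, const void* from, uint32_t size)
{
    const uint8_t* src = (const uint8_t*) from;
    uint8_t* dst = (uint8_t*) to;

    for (uint32_t i = 0; i < size / 128; i++) {
        __asm__ __volatile__(
            "prefetchnta 256(%0)\n\t"   /* hint: pull the next blocks in */
            "prefetchnta 320(%0)\n\t"
            "movups    (%0), %%xmm0\n\t"
            "movups  16(%0), %%xmm1\n\t"
            "movups  32(%0), %%xmm2\n\t"
            "movups  48(%0), %%xmm3\n\t"
            "movups  64(%0), %%xmm4\n\t"
            "movups  80(%0), %%xmm5\n\t"
            "movups  96(%0), %%xmm6\n\t"
            "movups 112(%0), %%xmm7\n\t"
            "movups %%xmm0,    (%1)\n\t"
            "movups %%xmm1,  16(%1)\n\t"
            "movups %%xmm2,  32(%1)\n\t"
            "movups %%xmm3,  48(%1)\n\t"
            "movups %%xmm4,  64(%1)\n\t"
            "movups %%xmm5,  80(%1)\n\t"
            "movups %%xmm6,  96(%1)\n\t"
            "movups %%xmm7, 112(%1)\n\t"
            : : "r"(src), "r"(dst)
            : "memory", "xmm0", "xmm1", "xmm2", "xmm3",
              "xmm4", "xmm5", "xmm6", "xmm7");
        src += 128;
        dst += 128;
    }
    for (uint32_t i = 0; i < (size & 127); i++)  /* leftover bytes */
        dst[i] = src[i];
}
```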

omarrx024 wrote:The point of double buffering has always been to prevent flickering.
Not quite right.
Its main point is that if you have a lot of overlapping windows and you don't want to do clipping checks, you simply draw them all onto the back buffer and don't waste time calculating clipping.
Then you write the buffered result ONCE into the slow VRAM, instead of writing all of the overlapping windows directly into the VRAM.

BTW: the term "DBA" doesn't exist.
Octacone
Member
Posts: 1138
Joined: Fri Aug 07, 2015 6:13 am

Re: Double Buffering Slower Than DBA

Post by Octacone »

Lukand wrote:There is no such thing as "direct buffer access", as far as I know.
The code you posted down there looks like complete nonsense, though.
I don't think video double buffering can be compared with something called DBA.
If you need faster memory access, use SSE.
Direct buffer access = writing directly to the linear frame buffer.
I already have SSE2 enabled; I'm just not using it for some reason. :|
Octacone
Member
Posts: 1138
Joined: Fri Aug 07, 2015 6:13 am

Re: Double Buffering Slower Than DBA

Post by Octacone »

omarrx024 wrote:Hi,

I didn't fully read your memcpy() function, but here are some tips.
First, check whether the source and destination are 16-byte aligned. If so, use the MOVDQA instruction both for reading from the source and for writing to the destination. If they are not aligned, use the MOVDQU instruction. MOVUPS and MOVAPS perform some floating-point checks, which waste a few CPU cycles that are very valuable when it comes to graphics programming. MOVDQA and MOVDQU don't perform floating-point checks; they are meant for packed integers, so no checks are performed.
Next, double buffering will always use more CPU time if all you're doing is a bunch of clear screens and draw rectangles. The point of double buffering has always been to prevent flickering. For example, when your window manager is complete, and you have a desktop background, desktop icons, and two windows open, the screen will flicker if you draw all of them directly to the hardware framebuffer. A user may see the background only before the desktop icons are drawn; they may see one window before the other is drawn; etc... To work around this, you draw everything to a back buffer, and then memcpy() it to the screen. Optimize your memcpy() function, and you'll see the benefits of double buffering later.
One last thing that's worth mentioning: there are other things you can do to improve graphics performance. The simplest one (for now) is enabling caching if it's not already enabled (clearing bits 30 and 29 of the CR0 register). MTRRs and AVX are bonus things to boost performance even more when you get further into development.

Cheers.
Thanks for answering. Do I need to enable MTRR and AVX in order to use those instructions? I still haven't enabled caching (I didn't know it was that easy to do). I have SSE2 enabled; is it better than MTRR? The main reason I am doing this is flickering. I guess I'll have to upgrade my assembly knowledge because of that memcpy.
Octacone
Member
Posts: 1138
Joined: Fri Aug 07, 2015 6:13 am

Re: Double Buffering Slower Than DBA

Post by Octacone »

Ch4ozz wrote: [broken quote] this is my reply
Yeah, I do not want to mess around with clipping; double buffering is way easier. The VRAM being just an emulation (speaking about VBox and QEMU) is shocking, I didn't know that. Thanks!


BTW: the term "DBA" doesn't exist.
I made that up. :D
BrightLight
Member
Posts: 901
Joined: Sat Dec 27, 2014 9:11 am
Location: Maadi, Cairo, Egypt

Re: Double Buffering Slower Than DBA

Post by BrightLight »

octacone wrote:Thanks for answering. Do I need to enable MTRR and AVX in order to use those instructions? I still haven't enabled caching (I didn't know it was that easy to do). I have SSE2 enabled; is it better than MTRR? The main reason I am doing this is flickering. I guess I'll have to upgrade my assembly knowledge because of that memcpy.
MOVDQA and MOVDQU are SSE2 instructions. You don't need AVX or MTRR to use them.
It seems you have some misunderstanding here: SSE is a SIMD (Single Instruction, Multiple Data) technology that operates on 128-bit registers. AVX (Advanced Vector Extensions) is the same idea, but it extends the SSE registers to 256 bits and thus offers better speed. Forget about AVX for now; most 32-bit hardware doesn't support it. MTRR (Memory Type Range Register) is not a SIMD technology; it's a way of assigning specific caching types to specific memory ranges, and you don't need that just yet. Just work on your memcpy() function first using the tips I gave in my earlier post; they should guide you in the right direction.
The Intel manuals give good enough information on MTRRs if you're interested.
EDIT: Forgot to say this: deep assembly knowledge is irrelevant here; with wrapper functions, all of this can be done from C.
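As a sketch of how feature detection looks from C, here is a CPUID check for SSE2 and AVX, assuming GCC/Clang's `<cpuid.h>` helper (note: a complete AVX check would also verify the OSXSAVE bit and XCR0 state, which is omitted here):

```c
#include <cpuid.h>   /* GCC/Clang __get_cpuid helper, x86 only */

/* CPUID leaf 1: SSE2 is EDX bit 26, AVX is ECX bit 28. */
static int has_sse2(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((edx >> 26) & 1);
}

static int has_avx(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ecx >> 28) & 1);  /* OSXSAVE/XCR0 checks omitted */
}
```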
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Double Buffering Slower Than DBA

Post by Brendan »

Hi,

For optimising "blit from double buffer to display memory", the single most important optimisation is avoiding copying data for no reason (e.g. when those pixels are the same colour as last time anyway). Everything else is typically "relatively irrelevant" in comparison.

If you are doing something to avoid copying data for no reason, then you'll find that almost all of the time you're copying very small pieces of memory, and all the checks and branches you'll have (to decide if source and/or destination is aligned, if size is large enough to use SSE, if there's left over bytes, etc) destroy performance.

Don't forget that if you use SSE/AVX the kernel has to load and save its state (either during task switches or before/after using it); and that on modern CPUs the CPU puts all the SSE/AVX stuff to sleep to save power when it's not being used and once that happens it takes thousands of cycles before SSE/AVX can wake up properly and return to normal speed. These things combined mean that occasional use of SSE/AVX (e.g. "once every 16.66 ms") costs more than just the overhead you see within "memcpy()" alone.

Finally; you shouldn't have one generic "blit from double buffer to display memory" function, but should have one for each different case (e.g. some for 4 bytes per pixel, some for "no padding between lines in display memory", etc). In the same way you should not have one generic "memcpy()" and should have different memory copying code designed specifically for (and built directly into) each of your blitting functions.
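To illustrate the per-case idea, here is a hedged sketch of two specialized 32 bpp blitters, one for the "no padding between lines" case and one that honors the display pitch; all names and the pixel format are assumptions for the example:

```c
#include <stdint.h>
#include <string.h>

/* Case 1: pitch == width * 4, no padding -- one big copy suffices. */
static void blit32_contiguous(uint32_t* fb, const uint32_t* back,
                              uint32_t width, uint32_t height)
{
    memcpy(fb, back, (size_t) width * height * 4);
}

/* Case 2: display pitch larger than width * 4 -- copy line by line,
   skipping the padding bytes at the end of each display line. */
static void blit32_padded(uint8_t* fb, uint32_t pitch,
                          const uint32_t* back,
                          uint32_t width, uint32_t height)
{
    for (uint32_t y = 0; y < height; y++)
        memcpy(fb + (size_t) y * pitch,
               back + (size_t) y * width,
               (size_t) width * 4);
}
```

Picking the variant once at mode-set time removes the per-call branches that a single generic routine would need.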


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Octacone
Member
Posts: 1138
Joined: Fri Aug 07, 2015 6:13 am

Re: Double Buffering Slower Than DBA

Post by Octacone »

Brendan wrote:Hi,

For optimising "blit from double buffer to display memory", the single most important optimisation is avoiding copying data for no reason (e.g. when those pixels are the same colour as last time anyway). Everything else is typically "relatively irrelevant" in comparison.

If you are doing something to avoid copying data for no reason, then you'll find that almost all of the time you're copying very small pieces of memory, and all the checks and branches you'll have (to decide if source and/or destination is aligned, if size is large enough to use SSE, if there's left over bytes, etc) destroy performance.

Don't forget that if you use SSE/AVX the kernel has to load and save its state (either during task switches or before/after using it); and that on modern CPUs the CPU puts all the SSE/AVX stuff to sleep to save power when it's not being used and once that happens it takes thousands of cycles before SSE/AVX can wake up properly and return to normal speed. These things combined mean that occasional use of SSE/AVX (e.g. "once every 16.66 ms") costs more than just the overhead you see within "memcpy()" alone.

Finally; you shouldn't have one generic "blit from double buffer to display memory" function, but should have one for each different case (e.g. some for 4 bytes per pixel, some for "no padding between lines in display memory", etc). In the same way you should not have one generic "memcpy()" and should have different memory copying code designed specifically for (and built directly into) each of your blitting functions.


Cheers,

Brendan
What do you mean by copying data for no reason? Pixels must contain colors; you can't ignore them. If you are talking about dirty regions, isn't that a bit slow? Let's say I have 40 rectangles: that means I need to check each one of them to see whether it needs to be repainted every single time I move my mouse. Also, what about the desktop? It needs to be repainted in like 90% of all the draw calls; how do I optimize that? SSE is not an option since it takes 16 ms. So I need to make 3 memory copy solutions, each one serving a different size purpose. Yeah, I think I get what you mean. Anyways, I hate optimizing... so many different approaches... Thanks!
Ch4ozz
Member
Posts: 170
Joined: Mon Jul 18, 2016 2:46 pm
Libera.chat IRC: esi

Re: Double Buffering Slower Than DBA

Post by Ch4ozz »

Using SSE I had about 200 fps when moving a window pretty fast on real hardware using VESA.
Not sure how checking every pixel's color is supposed to be faster, though.

Now that I've locked the frame rate to 60 fps, there's practically nothing on VBox or real hardware that can slow the rendering down below that.
Even when copying the whole double buffer into the VRAM each frame, I'm able to get ~57 fps.
All those double checks for clipping etc. just slowed down my rendering, because copying one big rect is simply faster than copying a lot of small rects and doing all the calculations beforehand.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Double Buffering Slower Than DBA

Post by Brendan »

Hi,
octacone wrote:What do you mean by copying data for no reason? Pixels must contain colors; you can't ignore them.
Imagine (for a random/simple example) you have a screen full of black text on a white background, and you scroll the text. Every pixel is scrolled so every pixel needs to be updated, right? Wrong.

About 20% is probably white space at the ends of lines before and after; and about 10% of "rows of pixels" are gaps between lines before and after; so that's about 30% of the screen that was white before and is white again now that doesn't need to be updated.

Of the remaining 70%, for each pixel there's a 50% chance it's the same colour as before (e.g. maybe it was a black pixel that was part of the letter "A" and now it's a black pixel that's part of a letter "B", but it's still a black pixel and hasn't changed colour). That brings it down to 35% of pixels that need to be updated.

If you only update 35% of the pixels (and avoid updating the 65% of pixels where it's unnecessary), then you're only pushing 35% of the data across the relatively slow PCI bus (the likely bottleneck), so you can do the blit 2.85 times faster.
octacone wrote:If you are talking about dirty regions, isn't that a bit slow? Let's say I have 40 rectangles: that means I need to check each one of them to see whether it needs to be repainted every single time I move my mouse.
What if you had a "1 bit per pixel" bitmap, cleared the bitmap, and then drew 40 rectangles into that bitmap using "OR"? You could do this extremely quickly in RAM (because the data is not going over the "extremely slow in comparison" PCI bus). Then you can do "if(bit in bitmap is set) { copy pixel to display memory }" and avoid a lot of "extremely slow in comparison" traffic across the PCI bus.
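A minimal sketch of that dirty-bitmap idea, assuming a tiny 64x32 "screen" and invented names, with OR-marking in RAM and a copy that only touches marked pixels:

```c
#include <stdint.h>
#include <string.h>

#define W 64
#define H 32

static uint8_t dirty[(W * H + 7) / 8];  /* 1 bit per pixel */

/* OR a rectangle into the dirty bitmap -- cheap, all in RAM. */
static void mark_rect(uint32_t x, uint32_t y, uint32_t w, uint32_t h)
{
    for (uint32_t j = y; j < y + h; j++)
        for (uint32_t i = x; i < x + w; i++) {
            uint32_t p = j * W + i;
            dirty[p / 8] |= (uint8_t)(1u << (p % 8));
        }
}

/* Copy only the marked pixels to "display memory", then clear the map. */
static void blit_dirty(uint32_t* display, const uint32_t* back)
{
    for (uint32_t p = 0; p < W * H; p++)
        if (dirty[p / 8] & (1u << (p % 8)))
            display[p] = back[p];   /* only dirty pixels cross the bus */
    memset(dirty, 0, sizeof dirty); /* ready for the next frame */
}
```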
octacone wrote:Also, what about the desktop? It needs to be repainted in like 90% of all the draw calls; how do I optimize that? SSE is not an option since it takes 16 ms. So I need to make 3 memory copy solutions, each one serving a different size purpose. Yeah, I think I get what you mean. Anyways, I hate optimizing... so many different approaches... Thanks!
One of the tricks I do is keep a copy of the screen's data in RAM; and then do something like:

Code: Select all

    for each dword in buffer {
        if( buffer[dword] != copy_in_RAM[dword] ) {
            copy_in_RAM[dword] = buffer[dword];
            display_memory[dword] = buffer[dword];
        }
    }
Again, because RAM is much faster than the PCI bus, this ends up improving performance (because it avoids a lot of PCI bus traffic). Note that this doesn't need any special code (e.g. to keep track of dirty rectangles, etc) and can be completely contained in the "blit" function. Of course it's trivial to speed this up by also doing (e.g.) dirty rectangles, or having a "changed/unchanged flag" for each row of pixels, or...
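For reference, a literal C rendering of that sketch (the signature is invented; in real use, `display_memory` and `copy_in_RAM` start out in sync with each other):

```c
#include <stdint.h>

/* Compare each dword against a RAM copy of the screen; write only the
   dwords that changed, updating the RAM copy as we go. */
static void blit_compare(volatile uint32_t* display_memory,
                         uint32_t* copy_in_RAM,
                         const uint32_t* buffer,
                         uint32_t dwords)
{
    for (uint32_t i = 0; i < dwords; i++) {
        if (buffer[i] != copy_in_RAM[i]) {
            copy_in_RAM[i] = buffer[i];
            display_memory[i] = buffer[i]; /* slow PCI write, only when needed */
        }
    }
}
```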


Cheers,

Brendan
Schol-R-LEA
Member
Posts: 1925
Joined: Fri Oct 27, 2006 9:42 am
Location: Athens, GA, USA

Re: Double Buffering Slower Than DBA

Post by Schol-R-LEA »

octacone: I think the relevant factor here is that you are thinking in terms of 'painting', when (as Brendan says) you should be thinking in terms of blitting: writing a mask of changes over the existing video memory. Instead of writing out the whole local (system) memory buffer, you should compose the changes in local memory, separate from the local copy of the screen, then write only the changed pixels to video memory (with a write-through to the local copy).

You have to remember that while local memory is fast and video memory is even faster (for some things), the bus between them is far slower. The bus is the bottleneck; the fewer bits you send through it, the better. A fast 'write the whole block of pixels to video memory' is still going to be slower than a slower 'write only the specific pixels that have changed', because you will be sending more data through the PCI bus.

As I understand it, most video cards today, with their gigabytes of video memory, have far greater capacity than they need for holding even a very high-definition 16:9 video page. While some of that memory is used by the GPU when it is in use (more on that in a moment), there is still plenty for multiple buffers on the video adapter. Video adapters which don't rely on sharing system memory usually (I think) have hardware banking of those buffers, meaning that you can have the active page locked and refreshing the screen (or even updating itself through some programmed operation) while you write to other pages from the bus, and then automatically blit the pages you were building with the active page to yet another page, which you then flip to. At least, that's my understanding; I would love any disproofs, corrections, or clarifications on these points.

The point is, not only do you not need to write the whole rectangle you are updating; doing so is actually slower than cherry-picking the pixels you need to update and writing them one at a time to a second buffer that is in video memory rather than local memory. The local pixel map is for your program's use, to keep it in sync with the GPU's copy; it shouldn't be what you are sending to the video adapter, in most instances.

Another part of the problem is that right now, you aren't really using the GPU at all, because you don't have a way to talk to the proprietary parts. Most modern GPUs have a whole set of higher-level video operations that will let them do a lot of these things for you, provided you have or can write a driver to talk to the GPU to use those GPU-specific operations. Ideally, you should just be able to pass the GPU a set of changes to make without actually writing a pixel map in local memory at all, but that's not really something most of us as hobby OS devs are in a position to do (and there are often reasons to have a local pixel map anyway, at least of some specific sections of memory).
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.