SVGA driver optimizations

NickJohnson · Post by **NickJohnson** » Wed Apr 13, 2011 8:47 am

My relatively simple SVGA driver (for my microkernel system) has become a bottleneck for graphics operations, and I want to try to optimize it. Here's the situation. The SVGA driver has a shared memory region containing a 32-bit RGB(A) buffer of all pixels (the A is ignored), which the process writes to directly. The process sends a message telling the SVGA driver to flip a specified rectangle, and the driver then copies that rectangle to the screen. My measurements show that the latency of flipping a rectangle is nearly proportional to its area, so the message passing is not the bottleneck. I'm therefore focusing on optimizing things within the driver.

Internally, the SVGA driver uses a generic putpixel function that converts a pixel value from the shared memory region to a pixel value appropriate for the given mode (only direct color is supported at the moment) and then writes it to the video buffer (using either linear or paged addressing). The rectangle-flipping function calls this function in a loop. The relevant code is here: https://github.com/nickbjohnson4224/rho ... vga/svga.c, in the svga_putbyte, svga_putpixel, and svga_fliprect functions.

What sort of internal optimizations might benefit this setup?

thepowersgang · Post by **thepowersgang** » Wed Apr 13, 2011 9:08 am

Well, inlining putpixel into fliprect will speed things up a bit (it removes the need to recalculate the index for every pixel).

Designing your loop to reduce bank switches could be an idea too (but from a quick look, it seems that would not be a problem).
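
A rough sketch of what inlining the per-pixel work into the rectangle loop might look like, assuming a linear framebuffer in RGB565 and a 0x00RRGGBB shared buffer with the same width as the screen. The `struct mode` layout and the function names here are hypothetical and are not taken from the linked svga.c:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mode description; the real driver's structures may differ. */
struct mode {
    uint8_t *fb;     /* start of the linear framebuffer */
    size_t   pitch;  /* bytes per scanline in the framebuffer */
    int      width;  /* visible width in pixels */
    int      height; /* visible height in pixels */
};

/* Convert one 0x00RRGGBB pixel to RGB565 (5 bits red, 6 green, 5 blue). */
static inline uint16_t xrgb_to_rgb565(uint32_t p)
{
    return (uint16_t)(((p >> 8) & 0xF800) |
                      ((p >> 5) & 0x07E0) |
                      ((p >> 3) & 0x001F));
}

/* Flip a rectangle: the source and destination row addresses are computed
 * once per scanline instead of once per pixel, and the conversion is inlined. */
static void fliprect_rgb565(const struct mode *m, const uint32_t *src,
                            int x, int y, int w, int h)
{
    assert(x >= 0 && y >= 0 && x + w <= m->width && y + h <= m->height);

    for (int row = y; row < y + h; row++) {
        const uint32_t *s = src + (size_t)row * (size_t)m->width + x;
        uint16_t *d = (uint16_t *)(m->fb + (size_t)row * m->pitch) + x;

        for (int i = 0; i < w; i++)
            d[i] = xrgb_to_rgb565(s[i]);
    }
}
```

In a paged (banked) mode the same row-oriented loop would let the bank be checked once per scanline segment rather than once per pixel, which is the second point above.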

JamesM · Post by **JamesM** » Wed Apr 13, 2011 10:55 am

Have you looked at what assembly the compiler is emitting at -O3? That will give you an idea of where it can't optimise well.

NickJohnson · Post by **NickJohnson** » Wed Apr 13, 2011 11:18 am

I was going to do that next; I just wanted to know if there were any architectural or algorithmic optimizations that could be made. For example, would it be beneficial to track which pixels have changed since the last flip, so that the minimum number of writes to video memory is made, or would that cause too much overhead?

Combuster · Post by **Combuster** » Wed Apr 13, 2011 12:04 pm

You potentially have one copy too many. If the actual video mode is linear RGBA, you can share VRAM itself instead. You can even share a framebuffer (either VRAM or shared memory) that is not in R8G8B8A8 format, if the application can deal with it. That saves you from needing any colour-conversion logic at all: static graphics can be preconverted to the correct format at load time, after which you need little more than one blitter per bpp rather than one per format.

Having the user provide dirty rectangles instead of just flipping the entire screen can save you quite a bit in GUI scenarios but will be of little help with visually rich scenarios where most of the screen gets updated anyway and where the first suggestion can prove much more effective. Note that actually accessing each pixel for comparison will cost you an amount that's likely on par with just writing it anyway.
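
As an illustration of the "one blitter per bpp" idea, here is a minimal sketch that copies a rectangle between two buffers holding the same pixel format, so no conversion is needed and each row becomes a single `memcpy`. The function name and parameters are made up for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a w-by-h rectangle at (x, y) from src to dst. Both buffers use the
 * same pixel format; only the pixel size and the two pitches matter. */
static void blit_rect(uint8_t *dst, size_t dst_pitch,
                      const uint8_t *src, size_t src_pitch,
                      size_t bytes_per_pixel,
                      size_t x, size_t y, size_t w, size_t h)
{
    assert(bytes_per_pixel >= 1 && bytes_per_pixel <= 4);

    for (size_t row = y; row < y + h; row++) {
        memcpy(dst + row * dst_pitch + x * bytes_per_pixel,
               src + row * src_pitch + x * bytes_per_pixel,
               w * bytes_per_pixel);
    }
}
```

The same routine serves RGB565 and any other 16-bit layout, since it never looks inside a pixel.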

turdus · Post by **turdus** » Wed Apr 13, 2011 3:58 pm

+1 to berkus. Using a rectangle eliminates many expensive calls to putpixel, and avoids recalculating the index too. This is what X11 does, for example, and you can't say it's slow.

Modifying the MTRRs (to mark the framebuffer as write-combining) would also be a good idea.
Alternatively, you can use two triangles if you have an OpenGL driver.

Brendan · Post by **Brendan** » Wed Apr 13, 2011 6:15 pm

Hi,

Some general notes...

a) Allowing applications to access a pixel buffer directly fails as soon as the application is spread across multiple video drivers/monitors, and makes it virtually impossible for any video driver to (eventually) support any hardware acceleration effectively. It also makes it impossible to do things like remote desktop (where a "pseudo driver" pretends it's a video driver while sending a description of what to draw to a remote computer) efficiently - for example, to draw a single large white rectangle you'd be looking at several MiB of network traffic rather than about 20 bytes. Finally, you can't make things easy for applications by abstracting away low-level details (including things like resolution independence and pixel format independence).

b) Anything that uses a "putpixel()" routine needs to be optimised until it doesn't. You should also consider using function pointers. For example, you might have a function pointer that points to a "draw rectangle" function, and 5 different "draw rectangle" functions (one for each supported pixel format), where the code that changes video modes also changes the function pointers so they point to the right functions for the pixel format.

c) Dirty rectangles are good until you get too many dirty rectangles and have to spend all your CPU time trying to detect and handle overlapping rectangles. There are simpler methods that are O(1), like having a "needs to be updated" flag for each group of pixels. These are slower for a small number of changes (where performance doesn't matter so much) but faster for a large number of changes (where performance matters a lot more). With a good video driver interface, it shouldn't matter how the video driver handles this internally (e.g. a video driver could support 5 different methods of minimising writes to display memory: dirty rectangles, "needs to be updated" flags, a "just blit everything" method, etc.), and the applications shouldn't know or care which of these methods the video driver happens to be using at any given time.
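
The "needs to be updated" flags from point (c) could be sketched roughly as follows. The 16-pixel tile size, the fixed-size flag array, and all names are assumptions made for this illustration. Marking a rectangle only sets flags, so overlapping updates never need to be detected or merged:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TILE_SIZE   16   /* pixels per tile edge; an arbitrary choice */
#define MAX_TILES_X 128
#define MAX_TILES_Y 128

struct dirty_map {
    int  tiles_x, tiles_y;
    bool dirty[MAX_TILES_Y][MAX_TILES_X];
};

static void dirty_init(struct dirty_map *d, int width, int height)
{
    d->tiles_x = (width + TILE_SIZE - 1) / TILE_SIZE;
    d->tiles_y = (height + TILE_SIZE - 1) / TILE_SIZE;
    assert(d->tiles_x <= MAX_TILES_X && d->tiles_y <= MAX_TILES_Y);
    memset(d->dirty, 0, sizeof d->dirty);
}

/* Flag every tile the rectangle touches, clipped to the screen. */
static void dirty_mark(struct dirty_map *d, int x, int y, int w, int h)
{
    if (w <= 0 || h <= 0)
        return;
    int x1 = x + w - 1, y1 = y + h - 1;
    if (x1 < 0 || y1 < 0)
        return;
    int tx0 = (x < 0 ? 0 : x) / TILE_SIZE;
    int ty0 = (y < 0 ? 0 : y) / TILE_SIZE;
    int tx1 = x1 / TILE_SIZE, ty1 = y1 / TILE_SIZE;
    if (tx1 >= d->tiles_x) tx1 = d->tiles_x - 1;
    if (ty1 >= d->tiles_y) ty1 = d->tiles_y - 1;

    for (int ty = ty0; ty <= ty1; ty++)
        for (int tx = tx0; tx <= tx1; tx++)
            d->dirty[ty][tx] = true;
}

/* Scan forward from *cursor for the next dirty tile, clear its flag, and
 * report its pixel origin. Returns false once no dirty tile remains. */
static bool dirty_take(struct dirty_map *d, int *cursor, int *px, int *py)
{
    int total = d->tiles_x * d->tiles_y;
    for (int i = *cursor; i < total; i++) {
        int tx = i % d->tiles_x, ty = i / d->tiles_x;
        if (d->dirty[ty][tx]) {
            d->dirty[ty][tx] = false;
            *cursor = i + 1;
            *px = tx * TILE_SIZE;
            *py = ty * TILE_SIZE;
            return true;
        }
    }
    *cursor = total;
    return false;
}
```

A driver would call `dirty_mark` for each update request and, once per refresh, loop over `dirty_take` starting from a cursor of 0, copying one tile to display memory per iteration.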

Cheers,

Brendan
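
For point (b), a minimal sketch of the function-pointer approach, shown here with per-format row converters rather than full "draw rectangle" functions to keep it short. The two formats, the table, and every name below are hypothetical. The mode-setting code picks the routine once, so the hot loop never branches on the pixel format:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum pixel_format { FMT_XRGB8888, FMT_RGB565, FMT_COUNT };

/* Converts n pixels of 0x00RRGGBB from src into the mode's format at dst. */
typedef void (*convert_row_fn)(void *dst, const uint32_t *src, size_t n);

static void row_xrgb8888(void *dst, const uint32_t *src, size_t n)
{
    uint32_t *d = dst;
    for (size_t i = 0; i < n; i++)
        d[i] = src[i] & 0x00FFFFFF;   /* drop the ignored alpha byte */
}

static void row_rgb565(void *dst, const uint32_t *src, size_t n)
{
    uint16_t *d = dst;
    for (size_t i = 0; i < n; i++) {
        uint32_t p = src[i];
        d[i] = (uint16_t)(((p >> 8) & 0xF800) |
                          ((p >> 5) & 0x07E0) |
                          ((p >> 3) & 0x001F));
    }
}

/* One routine per supported pixel format, indexed by enum pixel_format. */
static const convert_row_fn row_converters[FMT_COUNT] = {
    [FMT_XRGB8888] = row_xrgb8888,
    [FMT_RGB565]   = row_rgb565,
};

/* Points at the routine for the current mode. */
static convert_row_fn current_convert_row;

/* Called by the (hypothetical) mode-setting code after a mode change. */
static void set_mode_format(enum pixel_format fmt)
{
    assert((unsigned)fmt < FMT_COUNT);
    current_convert_row = row_converters[fmt];
}
```

The rectangle-flipping code then calls `current_convert_row(dst_row, src_row, w)` once per scanline, whatever the mode.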

roboman · Post by **roboman** » Thu Apr 14, 2011 9:33 pm

Sounds like you are talking about a blitter, bob, tile, sprite or even a few other things, depending on a few details. A reasonably good page on sprites: http://www.nondot.org/sabre/graphpro/index_sprite.html
