Optimizing VESA

Post by pcmattman »

Hi,

I'm reaching the point where I'm really happy with my kernel and can almost release it (once I put some syscalls in). However, I'm still not happy with my graphics functions, which I'm sure can be optimized. These functions won't be in the next release, but will be ready for later and I'd like to have them working nicely.

How do I go about making drawing to the frame buffer faster (and I'm already using a double-buffer - maybe I should use hardware double-buffering?)

I notice Windows draws reasonably fast in Bochs and Qemu, but in Bochs my GUI takes a good minute to draw (Qemu is fast though). How do they do it?

Any ideas?

Post by astrocrep »

Bochs has very slow VGA updates for some reason... you can try cranking up the manual override and see if that helps... I would use QEMU or VMware Server (free) for more advanced testing. Bochs is great when you cannot figure out why you're crashing, but it falls short once you get further down the road.

-Rich

Post by Brendan »

Hi,
pcmattman wrote:I notice Windows draws reasonably fast in Bochs and Qemu, but in Bochs my GUI takes a good minute to draw (Qemu is fast though). How do they do it?
In Bochs, there is no "emulated hardware acceleration", so Windows has to be using more optimized code. Also in Bochs, there's no caching, so optimizing cache access (e.g. using write-combining, prefetching cache lines, not polluting the cache, etc) would make no difference.

I seem to be having trouble finding legally downloadable source code for Windows, so I'm not sure exactly how they've implemented their "non-accelerated" video code to make it fast :roll:. Likewise, I haven't seen your source code, so I'm not sure exactly how you've implemented your video code to make it "less fast".... ;)

General principles would apply though - use MMX and/or SSE to do more at once if you can, don't do anything that can be skipped (e.g. don't update parts of the screen that haven't changed), find better algorithms (e.g. don't work on individual pixels), minimize access to video display memory as much as possible (e.g. never read from video display memory), etc.
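The "don't update parts of the screen that haven't changed" advice is commonly implemented by tracking a dirty rectangle and copying only that region from the back buffer to display memory. The following is a minimal sketch of that idea, not pcmattman's code: `DirtyRect`, `markDirty`, `flushDirty` and the `pitch` parameter (bytes per scan line) are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: bounding box of everything drawn since the last flush.
// The box is half-open: [x0, x1) horizontally and [y0, y1) vertically.
struct DirtyRect {
    std::size_t x0, y0, x1, y1;
    bool empty() const { return x0 >= x1 || y0 >= y1; }
};

// Grow the dirty rectangle so it also covers the w x h area at (x, y).
inline void markDirty(DirtyRect &d, std::size_t x, std::size_t y,
                      std::size_t w, std::size_t h) {
    if (w == 0 || h == 0) return;
    if (d.empty()) { d = DirtyRect{x, y, x + w, y + h}; return; }
    d.x0 = std::min(d.x0, x);      d.y0 = std::min(d.y0, y);
    d.x1 = std::max(d.x1, x + w);  d.y1 = std::max(d.y1, y + h);
}

// Copy only the dirty region from the back buffer to the front buffer
// (the destination is written, never read), then reset the rectangle.
// Returns the number of bytes copied.
inline std::size_t flushDirty(DirtyRect &d, std::uint8_t *front,
                              const std::uint8_t *back, std::size_t pitch,
                              std::size_t bytesPerPixel) {
    if (d.empty()) return 0;
    assert(d.x1 * bytesPerPixel <= pitch);
    const std::size_t rowBytes = (d.x1 - d.x0) * bytesPerPixel;
    std::size_t copied = 0;
    for (std::size_t y = d.y0; y < d.y1; ++y) {
        const std::size_t off = y * pitch + d.x0 * bytesPerPixel;
        std::memcpy(front + off, back + off, rowBytes);
        copied += rowBytes;
    }
    d = DirtyRect{};   // all zeros, i.e. empty again
    return copied;
}
```

Each drawing routine would call `markDirty()` for the area it touched, and the code that presents a frame would call `flushDirty()` once. A single bounding box is the simplest choice; a list of rectangles copies less when changes are scattered, at the cost of more bookkeeping.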
pcmattman wrote:How do I go about making drawing to the frame buffer faster (and I'm already using a double-buffer - maybe I should use hardware double-buffering?)
First determine where your performance problems are coming from. I'm guessing that most of the problems are in the operations that draw things in the double-buffer (e.g. "setPixel()", "fillRectangle()", "drawCharacter()", "doHorizontalLine()", "doVerticalLine()", etc), and the code that copies the double-buffer to video display memory is mostly irrelevant.

@astrocrep: The video refresh setting will affect the performance of Windows' code and pcmattman's code equally, so a difference in performance between Windows' code and pcmattman's code can't be attributed to the video refresh setting in Bochs. If pcmattman's code takes a minute or so of processing to generate a screen then a higher refresh rate will make this worse, because the host CPU will be spending more time per second copying data from the emulated video card's display memory to a window or something on the host computer, and less time per second running the emulated CPU (and pcmattman's video code).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.

Post by pcmattman »

Brendan wrote:Likewise, I haven't seen your source code, so I'm not sure exactly how you've implemented your video code to make it "less fast".... ;)
You can take a look at my CVS if you like (and now SVN, I'm using both now).
Brendan wrote: General principles would apply though - use MMX and/or SSE to do more at once if you can
I never have used MMX or SSE, care to give me a link to a tutorial or some pointers? I'll Google it in the meantime.
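As a starting point for the SSE question, here is a minimal sketch of filling a run of 32-bit pixels with SSE2 intrinsics. It assumes a 32 bits-per-pixel buffer and a GCC or Clang style compiler (which define `__SSE2__` when SSE2 is targeted); `fillRow32` is a hypothetical name, and the code falls back to plain stores when SSE2 is not available. In a kernel, SSE also has to be enabled and its register state managed before code like this can run.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__)
#include <emmintrin.h>   // SSE2 intrinsics
#endif

// Fill `count` 32-bit pixels starting at `dst` with `color`.
// Uses 16-byte SSE2 stores when the compiler targets SSE2,
// and ordinary 32-bit stores for the remainder (or for everything otherwise).
inline void fillRow32(std::uint32_t *dst, std::size_t count, std::uint32_t color) {
    assert(dst != nullptr || count == 0);
    std::size_t i = 0;
#if defined(__SSE2__)
    // Four copies of the pixel in one 128-bit register.
    const __m128i four = _mm_set1_epi32(static_cast<int>(color));
    for (; i + 4 <= count; i += 4) {
        // Unaligned store: correct for any dst; aligned stores may be faster.
        _mm_storeu_si128(reinterpret_cast<__m128i *>(dst + i), four);
    }
#endif
    for (; i < count; ++i) {
        dst[i] = color;
    }
}
```

Calling `fillRow32(row, width, color)` once per scan line of a rectangle is enough to see whether wider stores help on a given emulator or machine; measuring is the only reliable way to tell.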
Brendan wrote: don't do anything that can be skipped (e.g. don't update parts of the screen that haven't changed), find better algorithms (e.g. don't work on individual pixels), minimize access to video display memory as much as possible (e.g. never read from video display memory), etc.
Already doing this.
Brendan wrote: draw things in the double-buffer (e.g. "setPixel()", "fillRectangle()", "drawCharacter()", "doHorizontalLine()", "doVerticalLine()", etc), and the code that copies the double-buffer to video display memory is mostly irrelevant
I can tell you now, the slow functions are not SetPixel, but DrawBitmap (not a BMP, but an already loaded array of pixels), DrawRectangle et al... but I have no idea how to optimize those functions.

Thanks for all your help.

Post by Brendan »

Hi,

Apologies to all for this "slightly large" post....
pcmattman wrote:
Brendan wrote: draw things in the double-buffer (e.g. "setPixel()", "fillRectangle()", "drawCharacter()", "doHorizontalLine()", "doVerticalLine()", etc), and the code that copies the double-buffer to video display memory is mostly irrelevant
I can tell you now, the slow functions are not SetPixel, but DrawBitmap (not a BMP, but an already loaded array of pixels), DrawRectangle et al... but I have no idea how to optimize those functions.
Is this the right code?

If it is, I'm not surprised it takes a minute or so to draw a screen... ;)

First, there are a few places where you do something like this:

Code: Select all

            // do not draw if it's already there!
            if( *((uchar_t*) addr) == (uchar_t) color )
                return;

            // draw it
            *((uchar_t*) addr ) = (uchar_t) color;
This doesn't help performance. Reading from the video card is slower than writing to it, and the conditional branch will be mispredicted by the CPU fairly often (causing expensive pipeline flushes). Even when it doesn't cause a pipeline flush the CPU won't be able to do the write until the read completes. Your video code would be at least twice as fast if you write to the video card regardless of whether you need to or not in this case.

For example, "SetPixel()" should be:

Code: Select all

void VGA::SetPixel( uint_t x, uint_t y, uint_t color )
{
    // get the address
    unsigned int addr = m_dbuff + ( ( m_width * y ) + x );

    // sets a pixel
    switch( m_bpp )
    {
        case 8: /** 8-bit **/
            *((uchar_t*) addr ) = (uchar_t) color;
            break;
        case 16: /** 16-bit **/
            *((ushort_t*) addr ) = (ushort_t) color;
            break;
        case 24: /** 24-bit (32-bit with NO alpha) **/
// WRONG -> *((uint_t*) addr ) = color & 0x00FFFFFF;
            *((uchar_t*) addr ) = (uchar_t) color;
            *((uchar_t*) (addr + 1) ) = (uchar_t) (color >> 8);
            *((uchar_t*) (addr + 2) ) = (uchar_t) (color >> 16);
            break;
        case 32: /** 32-bit (1 channel for ALPHA) **/
// TWICE?   *((uint_t*) m_dbuff + ( ( m_width * y ) + x ) ) = color;
            *((uint_t*) addr ) = color;
            break;
    }
}
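One thing the snippet above leaves implicit is how "x" and the line width scale for modes deeper than 8 bits. A sketch of the address calculation with the units spelled out is below; it assumes a linear buffer described by a byte pitch (VBE mode information reports bytes per scan line separately from the horizontal resolution) and a little-endian byte order. `Surface`, `pixelOffset` and `setPixel` are hypothetical names, not part of pcmattman's class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of a linear frame buffer (or back buffer).
struct Surface {
    std::uint8_t *base;   // first byte of the buffer
    std::size_t pitch;    // bytes per scan line (may exceed width * bytes per pixel)
    unsigned bpp;         // bits per pixel: 8, 16, 24 or 32
};

// Byte offset of pixel (x, y) from the start of the buffer.
inline std::size_t pixelOffset(const Surface &s, std::size_t x, std::size_t y) {
    return y * s.pitch + x * (s.bpp / 8);
}

// Write one pixel without reading the destination first.
// Bytes are stored low byte first (little-endian layout).
inline void setPixel(const Surface &s, std::size_t x, std::size_t y,
                     std::uint32_t color) {
    std::uint8_t *p = s.base + pixelOffset(s, x, y);
    switch (s.bpp) {
        case 8:
            p[0] = static_cast<std::uint8_t>(color);
            break;
        case 16:
            p[0] = static_cast<std::uint8_t>(color);
            p[1] = static_cast<std::uint8_t>(color >> 8);
            break;
        case 24:
            p[0] = static_cast<std::uint8_t>(color);         // blue
            p[1] = static_cast<std::uint8_t>(color >> 8);    // green
            p[2] = static_cast<std::uint8_t>(color >> 16);   // red
            break;
        case 32:
            p[0] = static_cast<std::uint8_t>(color);
            p[1] = static_cast<std::uint8_t>(color >> 8);
            p[2] = static_cast<std::uint8_t>(color >> 16);
            p[3] = static_cast<std::uint8_t>(color >> 24);
            break;
        default:
            assert(!"unsupported depth");
    }
}
```

Keeping the pitch in bytes means the same row-stepping code works for every depth, and it still works if the video mode pads each scan line.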

Next, comment out the "SetPixel()" function - it should never be used by any of the other functions (all of the other functions should access video display memory directly). For an example, consider your "Rectangle()" function:

Code: Select all

void VGA::Rectangle( uint_t x, uint_t y, uint_t w, uint_t h, uint_t color )
{
    for( int _y = y; _y < (y+h); _y++ )
    {
        for( int _x = x; _x < (x+w); _x++ )
        {
            SetPixel( _x, _y, color );
        }
    }
}
Now insert the (modified) "SetPixel()" into it to see how silly it looks:

Code: Select all

void VGA::Rectangle( uint_t x, uint_t y, uint_t w, uint_t h, uint_t color )
{
    for( int _y = y; _y < (y+h); _y++ )
    {
        for( int _x = x; _x < (x+w); _x++ )
        {
            // get the address
            unsigned int addr = m_dbuff + ( ( m_width * _y ) + _x );

            // sets a pixel
            switch( m_bpp )
            {
                case 8: /** 8-bit **/
                    *((uchar_t*) addr ) = (uchar_t) color;
                    break;
                case 16: /** 16-bit **/
                    *((ushort_t*) addr ) = (ushort_t) color;
                    break;
                case 24: /** 24-bit (32-bit with NO alpha) **/
                    *((uchar_t*) addr ) = (uchar_t) color;
                    *((uchar_t*) (addr + 1) ) = (uchar_t) (color >> 8);
                    *((uchar_t*) (addr + 2) ) = (uchar_t) (color >> 16);
                    break;
                case 32: /** 32-bit (1 channel for ALPHA) **/
                    *((uint_t*) addr ) = color;
                    break;
            }
        }
    }
}
Now optimize to remove the silliness:

Code: Select all

void VGA::Rectangle( uint_t x, uint_t y, uint_t w, uint_t h, uint_t color )
{
    unsigned int addr;

    switch( m_bpp )
    {
        case 8: /** 8-bit **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                addr = m_dbuff + ( ( m_width * _y ) + x );
                for( int _x = x; _x < (x+w); _x++ )
                {
                    *((uchar_t*) addr ) = (uchar_t) color;
                    addr += 1;
                }
            }
            break;
        case 16: /** 16-bit **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                addr = m_dbuff + ( ( m_width * _y ) + x );
                for( int _x = x; _x < (x+w); _x++ )
                {
                    *((ushort_t*) addr ) = (ushort_t) color;
                    addr += 2;
                }
            }
            break;
        case 24: /** 24-bit (32-bit with NO alpha) **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                addr = m_dbuff + ( ( m_width * _y ) + x );
                for( int _x = x; _x < (x+w); _x++ )
                {
                    *((uchar_t*) addr ) = (uchar_t) color;
                    *((uchar_t*) (addr + 1) ) = (uchar_t) (color >> 8);
                    *((uchar_t*) (addr + 2) ) = (uchar_t) (color >> 16);
                    addr += 3;
                }
            }
            break;
        case 32: /** 32-bit (1 channel for ALPHA) **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                addr = m_dbuff + ( ( m_width * _y ) + x );
                for( int _x = x; _x < (x+w); _x++ )
                {
                    *((uint_t*) addr ) = color;
                    addr += 4;
                }
            }
            break;
    }
}

Notice here that the switch/case thing is only done once on the outside of the loops (instead of for every single pixel on the inside of both loops), and that "addr" is only calculated once per horizontal line (not recalculated for every single pixel).
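Writing the same pair of loops once per depth gets repetitive. One way to keep the "decide the depth once, outside the loops" structure without the duplication is a small template, sketched below under a few assumptions: the buffer is addressed in bytes with a byte `pitch` per line, and `fillRectT` and `fillRect` are hypothetical names. `std::memcpy` is used for the stores to stay within the aliasing rules; compilers typically turn a fixed-size copy like this into a single store. 24-bit is left out because it has no native 3-byte type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Fill a w x h block of Pixel-sized values. `row` points at the first pixel of
// the first line; `pitch` is the distance between lines in bytes.
template <typename Pixel>
inline void fillRectT(std::uint8_t *row, std::size_t pitch,
                      std::size_t w, std::size_t h, Pixel value) {
    for (std::size_t y = 0; y < h; ++y) {
        std::uint8_t *p = row;                       // set once per line
        for (std::size_t x = 0; x < w; ++x) {
            std::memcpy(p, &value, sizeof(Pixel));   // one Pixel-sized store
            p += sizeof(Pixel);
        }
        row += pitch;                                // no multiply in the loop
    }
}

// The depth is examined once per call, not once per pixel.
// Returns false for depths this sketch does not handle.
inline bool fillRect(std::uint8_t *base, std::size_t pitch, unsigned bpp,
                     std::size_t x, std::size_t y, std::size_t w, std::size_t h,
                     std::uint32_t color) {
    assert(bpp % 8 == 0);
    std::uint8_t *row = base + y * pitch + x * (bpp / 8);
    switch (bpp) {
        case 8:
            fillRectT<std::uint8_t>(row, pitch, w, h, static_cast<std::uint8_t>(color));
            return true;
        case 16:
            fillRectT<std::uint16_t>(row, pitch, w, h, static_cast<std::uint16_t>(color));
            return true;
        case 32:
            fillRectT<std::uint32_t>(row, pitch, w, h, color);
            return true;
        default:
            return false;
    }
}
```

The compiler generates one specialised loop per pixel type, so the result should be close to the hand-expanded version while leaving only one loop body to maintain.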

Now we can optimize it more. We can simplify the calculation for "addr" so there's no multiplication in the loop. For 24-bit colour it's splitting the colour into bytes on the inside of the loop, which could be done on the outside of the loop.

The code becomes something like:

Code: Select all

void VGA::Rectangle( uint_t x, uint_t y, uint_t w, uint_t h, uint_t color )
{
    unsigned int addr, _addr;   // byte addresses, as in SetPixel()
    uchar_t red, green, blue;

    addr = m_dbuff + ( m_width * y ) + x;
    switch( m_bpp )
    {
        case 8: /** 8-bit **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                _addr = addr;
                for( int _x = x; _x < (x+w); _x++ )
                {
                    *((uchar_t*) _addr ) = (uchar_t) color;
                    _addr += 1;
                }
                addr += m_width;
            }
            break;
        case 16: /** 16-bit **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                _addr = addr;
                for( int _x = x; _x < (x+w); _x++ )
                {
                    *((ushort_t*) _addr ) = (ushort_t) color;
                    _addr += 2;
                }
                addr += m_width;
            }
            break;
        case 24: /** 24-bit (32-bit with NO alpha) **/
            blue = (uchar_t) color;
            green = (uchar_t) (color >> 8);
            red = (uchar_t) (color >> 16);

            for( int _y = y; _y < (y+h); _y++ )
            {
                _addr = addr;
                for( int _x = x; _x < (x+w); _x++ )
                {
                    *((uchar_t*) _addr ) = blue;
                    *((uchar_t*) (_addr + 1) ) = green;
                    *((uchar_t*) (_addr + 2) ) = red;
                    _addr += 3;
                }
                addr += m_width;
            }
            break;
        case 32: /** 32-bit (1 channel for ALPHA) **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                _addr = addr;
                for( int _x = x; _x < (x+w); _x++ )
                {
                    *((uint_t*) _addr ) = color;
                    _addr += 4;
                }
                addr += m_width;
            }
            break;
    }
}

Next, "_x" is only being used to control the inner loops (it's never used inside them), and we're also maintaining "_addr". It makes more sense to use "_addr" to control the loop, and get rid of "_x" completely.

Code: Select all

void VGA::Rectangle( uint_t x, uint_t y, uint_t w, uint_t h, uint_t color )
{
    unsigned int addr, _addr;   // byte addresses, as in SetPixel()
    uchar_t red, green, blue;

    addr = m_dbuff + ( m_width * y ) + x;
    switch( m_bpp )
    {
        case 8: /** 8-bit **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                for( _addr = addr; _addr < addr + w; _addr += 1 )
                {
                    *((uchar_t*) _addr ) = (uchar_t) color;
                }
                addr += m_width;
            }
            break;
        case 16: /** 16-bit **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                for( _addr = addr; _addr < addr + (w * 2); _addr += 2 )

                {
                    *((ushort_t*) _addr ) = (ushort_t) color;
                }
                addr += m_width;
            }
            break;
        case 24: /** 24-bit (32-bit with NO alpha) **/
            blue = (uchar_t) color;
            green = (uchar_t) (color >> 8);
            red = (uchar_t) (color >> 16);

            for( int _y = y; _y < (y+h); _y++ )
            {
                for( _addr = addr; _addr < addr + (w * 3); _addr += 3 )
                {
                    *((uchar_t*) _addr ) = blue;
                    *((uchar_t*) (_addr + 1) ) = green;
                    *((uchar_t*) (_addr + 2) ) = red;
                }
                addr += m_width;
            }
            break;
        case 32: /** 32-bit (1 channel for ALPHA) **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                for( _addr = addr; _addr < addr + (w * 4); _addr += 4 )
                {
                    *((uint_t*) _addr ) = color;
                }
                addr += m_width;
            }
            break;
    }
}
Now, for 8-bit it's doing 4 separate byte writes to display memory where it could write 4 pixels at a time as a single 32-bit write. This is a bit tricky because you'd want to make sure the 32-bit writes are aligned. Similarly, 16-bit could do 2 pixels at a time. 24-bit is a little trickier because a pixel is 3 bytes, so we need to do 12 bytes at a time (4 pixels as three 32-bit writes) to keep the 32-bit writes aligned.
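The packed values used for those wide writes can be checked in isolation. The sketch below builds them for each depth, assuming a little-endian CPU (so the low byte of each 32-bit word is the first byte in memory) and a blue, green, red byte order for 24-bit pixels; `pack8`, `pack16`, `pack24` and `Packed24` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Replicate one 8-bit pixel into all four bytes of a 32-bit word.
inline std::uint32_t pack8(std::uint32_t color) {
    return (color & 0xFFu) * 0x01010101u;
}

// Replicate one 16-bit pixel into both halves of a 32-bit word.
inline std::uint32_t pack16(std::uint32_t color) {
    color &= 0xFFFFu;
    return color | (color << 16);
}

// Four 24-bit pixels occupy 12 bytes, laid out in memory as
//   B G R B | G R B G | R B G R
// which is three 32-bit words.
struct Packed24 { std::uint32_t w0, w1, w2; };

inline Packed24 pack24(std::uint32_t color) {
    const std::uint32_t b = color & 0xFFu;
    const std::uint32_t g = (color >> 8) & 0xFFu;
    const std::uint32_t r = (color >> 16) & 0xFFu;
    Packed24 p;
    p.w0 = b | (g << 8) | (r << 16) | (b << 24);
    p.w1 = g | (r << 8) | (b << 16) | (g << 24);
    p.w2 = r | (b << 8) | (g << 16) | (r << 24);
    return p;
}
```

For example, `pack24(0x00112233)` gives the words `0x33112233`, `0x22331122` and `0x11223311`, which read back byte by byte as 33 22 11 repeated four times.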

Code: Select all

void VGA::Rectangle( uint_t x, uint_t y, uint_t w, uint_t h, uint_t color )
{
    unsigned int addr, _addr;   // byte addresses, so "& 3" and "% 12" work
    uchar_t red, green, blue;
    uint_t temp1, temp2, temp3;

    addr = m_dbuff + ( m_width * y ) + x;
    switch( m_bpp )
    {
        case 8: /** 8-bit **/
            temp1 = color & 0xFF;
            temp1 |= (temp1 << 8) | (temp1 << 16) | (temp1 << 24);

            for( int _y = y; _y < (y+h); _y++ )
            {
                _addr = addr;
                while( ( _addr < addr + w ) && ( (_addr & 3) != 0) ) {
                    *((uchar_t*) _addr ) = (uchar_t) color;
                    _addr++;
                }
                while( _addr + 3 < addr + w ) {
                    *((uint_t*) _addr ) = temp1;
                    _addr += 4;
                }
                while( _addr < addr + w ) {
                    *((uchar_t*) _addr ) = (uchar_t) color;
                    _addr++;
                }
                addr += m_width;
            }
            break;
        case 16: /** 16-bit **/
            temp1 = color & 0xFFFF;
            temp1 |= temp1 << 16;

            for( int _y = y; _y < (y+h); _y++ )
            {
                _addr = addr;
                while( ( _addr < addr + (w * 2) ) && ( (_addr & 3) != 0) ) {
                    *((ushort_t*) _addr ) = (ushort_t) color;
                    _addr += 2;
                }
                while( _addr + 3 < addr + (w * 2) ) {
                    *((uint_t*) _addr ) = temp1;
                    _addr += 4;
                }

                while( _addr < addr + (w * 2) ) {
                    *((ushort_t*) _addr ) = (ushort_t) color;
                    _addr += 2;
                }
                addr += m_width;
            }
            break;
        case 24: /** 24-bit (32-bit with NO alpha) **/
            blue = (uchar_t) color;
            green = (uchar_t) (color >> 8);
            red = (uchar_t) (color >> 16);

            temp1 = blue | (green << 8) | (red << 16) | (blue << 24);
            temp2 = green | (red << 8) | (blue << 16) | (green << 24);
            temp3 = red | (blue << 8) | (green << 16) | (red << 24);

            for( int _y = y; _y < (y+h); _y++ )
            {
                _addr = addr;
                while( ( _addr < addr + (w * 3) ) && ( (_addr % 12) != 0) ) {
                    *((uchar_t*) _addr ) = blue;
                    *((uchar_t*) (_addr + 1) ) = green;
                    *((uchar_t*) (_addr + 2) ) = red;
                    _addr += 3;
                }
                while( _addr + 11 < addr + (w * 3) ) {
                    *((uint_t*) _addr ) = temp1;
                    *((uint_t*) (_addr + 4) ) = temp2;
                    *((uint_t*) (_addr + 8) ) = temp3;
                    _addr += 12;
                }
                while( _addr < addr + (w * 3) ) {
                    *((uchar_t*) _addr ) = blue;
                    *((uchar_t*) (_addr + 1) ) = green;
                    *((uchar_t*) (_addr + 2) ) = red;
                    _addr += 3;
                }
                addr += m_width;
            }
            break;
        case 32: /** 32-bit (1 channel for ALPHA) **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                for( _addr = addr; _addr < addr + (w * 4); _addr += 4 )
                {
                    *((uint_t*) _addr ) = color;
                }
                addr += m_width;
            }
            break;
    }
}
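As a standalone illustration of the head/body/tail idea used for the 8-bit case above, here is a sketch of a single-row fill that can be tested outside the kernel. `fillRow8` is a hypothetical name; the 32-bit store goes through `std::memcpy`, which compilers typically reduce to one store instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Fill `count` bytes at `dst` with the 8-bit pixel `color`, using 32-bit
// stores for the aligned middle and single bytes for the ragged edges.
inline void fillRow8(std::uint8_t *dst, std::size_t count, std::uint8_t color) {
    assert(dst != nullptr || count == 0);
    const std::uint32_t packed = color * 0x01010101u;   // colour in all 4 bytes
    std::uint8_t *const end = dst + count;

    // Head: single bytes until dst reaches a 4-byte boundary (or the end).
    while (dst < end && (reinterpret_cast<std::uintptr_t>(dst) & 3u) != 0) {
        *dst++ = color;
    }
    // Body: whole aligned dwords.
    while (static_cast<std::size_t>(end - dst) >= 4) {
        std::memcpy(dst, &packed, 4);
        dst += 4;
    }
    // Tail: whatever is left (0 to 3 bytes).
    while (dst < end) {
        *dst++ = color;
    }
}
```

Running it over every combination of starting offset 0 to 3 and a range of lengths, with guard bytes on both sides, is a cheap way to catch off-by-one mistakes in the loop bounds before the code goes anywhere near display memory.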
Lastly, if the width of the rectangle is small it'll only need the first "while" loop, so we can do a special case to make that faster.

If the starting address of one horizontal line is aligned on a 4 byte boundary, then we could assume that the starting address for the next horizontal line will also be aligned. In this case, if the assumption is true, then we could do a special case to make it faster (by skipping the first "while" loop). We can make sure this assumption is true by only ever allowing buffers that are suitable sizes. For 8-bit pixels the horizontal resolution must be a multiple of 4, for 16-bit pixels the horizontal resolution must be a multiple of 2, and for 24-bit pixels the horizontal resolution must be a multiple of 4. That sounds acceptable to me... ;)

The same thinking applies to the ending addresses - if the ending address of one horizontal line is aligned on a 4 byte boundary, then we could assume that the ending address for the next horizontal line will also be aligned, and we could skip the last "while" loop to make that faster.

This gives 5 special cases - small width, left and right edges aligned, left only aligned, right only aligned, and left and right edges unaligned:
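The choice between those five cases depends only on the row's starting address and its length in bytes, so it can be written as a small pure function. The sketch below is one way to express it; `FillCase`, `classifyRow` and the `smallLimit` parameter are hypothetical, and the threshold is a tunable assumption rather than a measured value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class FillCase { Small, BothAligned, LeftAligned, RightAligned, Unaligned };

// Decide which loop applies to a row of `bytes` bytes starting at address
// `start`. Rows shorter than `smallLimit` bytes use the simple loop.
inline FillCase classifyRow(std::uintptr_t start, std::size_t bytes,
                            std::size_t smallLimit) {
    assert(smallLimit >= 4);   // the wide cases need room for at least one dword
    if (bytes < smallLimit) return FillCase::Small;
    const bool left  = (start & 3u) == 0;            // first byte on a dword boundary
    const bool right = ((start + bytes) & 3u) == 0;  // one-past-end on a dword boundary
    if (left && right) return FillCase::BothAligned;
    if (left)          return FillCase::LeftAligned;
    if (right)         return FillCase::RightAligned;
    return FillCase::Unaligned;
}
```

When the distance between lines is a multiple of 4 bytes, every row of a rectangle falls into the same case as the first row, so the classification only has to be done once per rectangle.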

Code: Select all

void VGA::Rectangle( uint_t x, uint_t y, uint_t w, uint_t h, uint_t color )
{
    unsigned int addr, _addr;
    unsigned int addr1, addr2, addr3, addr4;
    uchar_t red, green, blue;
    uint_t temp1, temp2, temp3;

    addr = m_dbuff + ( m_width * y ) + x;
    switch( m_bpp )
    {
        case 8: /** 8-bit **/
            if(w < 7) {
                // Small width
                for( int _y = y; _y < (y+h); _y++ )
                {
                    for( _addr = addr; _addr < addr + w; _addr += 1 )
                    {
                        *((uchar_t*) _addr ) = (uchar_t) color;
                    }
                    addr += m_width;

                }
            } else {
                addr1 = addr;            // Address of first byte at start of line
                addr2 = (addr + 3) & ~3; // Address of first aligned dword at start of line
                addr3 = (addr + w) & ~3; // Address of last dword boundary at or before end of line
                addr4 = addr + w;        // Address of byte after end of line
                temp1 = color & 0xFF;
                temp1 |= (temp1 << 8) | (temp1 << 16) | (temp1 << 24);

                if(addr1 == addr2) {
                    if(addr3 == addr4) {
                        // Both edges aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr2; _addr < addr3; _addr += 4 ) {
                                *((uint_t*) _addr ) = temp1;
                            }
                            addr2 += m_width;
                            addr3 += m_width;
                        }
                    } else {
                        // Left edge aligned, right edge not aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr2; _addr < addr3; _addr += 4 ) {
                                *((uint_t*) _addr ) = temp1;
                            }
                            for( _addr = addr3; _addr < addr4; _addr++ ) {
                                *((uchar_t*) _addr ) = (uchar_t) color;
                            }
                            addr2 += m_width;
                            addr3 += m_width;
                            addr4 += m_width;
                        }
                    }
                } else {
                    if(addr3 == addr4) {
                        // Left edge not aligned, right edge aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr1; _addr < addr2; _addr++ ) {
                                *((uchar_t*) _addr ) = (uchar_t) color;
                            }
                            for( _addr = addr2; _addr < addr3; _addr += 4 ) {
                                *((uint_t*) _addr ) = temp1;
                            }
                            addr1 += m_width;
                            addr2 += m_width;
                            addr3 += m_width;
                        }
                    } else {
                        // Both edges not aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr1; _addr < addr2; _addr++ ) {
                                *((uchar_t*) _addr ) = (uchar_t) color;
                            }
                            for( _addr = addr2; _addr < addr3; _addr += 4 ) {
                                *((uint_t*) _addr ) = temp1;
                            }
                            for( _addr = addr3; _addr < addr4; _addr++ ) {
                                *((uchar_t*) _addr ) = (uchar_t) color;
                            }
                            addr1 += m_width;
                            addr2 += m_width;
                            addr3 += m_width;
                            addr4 += m_width;
                        }
                    }
                }
            }
            break;
        case 16: /** 16-bit **/
            if(w < 3) {
                // Small width
                for( int _y = y; _y < (y+h); _y++ )
                {
                    for( _addr = addr; _addr < addr + (w * 2); _addr += 2 )
                    {
                        *((ushort_t*) _addr ) = (ushort_t) color;
                    }
                    addr += m_width;
                }
            } else {
                addr1 = addr;                // Address of first byte at start of line
                addr2 = (addr + 3) & ~3;     // Address of first aligned dword at start of line
                addr3 = (addr + w * 2) & ~3; // Address of last dword boundary at or before end of line
                addr4 = addr + w * 2;        // Address of byte after end of line
                temp1 = color & 0xFFFF;
                temp1 |= temp1 << 16;

                if(addr1 == addr2) {
                    if(addr3 == addr4) {
                        // Both edges aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr2; _addr < addr3; _addr += 4 ) {
                                *((uint_t*) _addr ) = temp1;
                            }
                            addr2 += m_width;
                            addr3 += m_width;
                        }
                    } else {
                        // Left edge aligned, right edge not aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr2; _addr < addr3; _addr += 4 ) {
                                *((uint_t*) _addr ) = temp1;
                            }
                            for( _addr = addr3; _addr < addr4; _addr += 2 ) {
                                *((ushort_t*) _addr ) = (ushort_t) color;
                            }
                            addr2 += m_width;
                            addr3 += m_width;
                            addr4 += m_width;
                        }
                    }
                } else {
                    if(addr3 == addr4) {
                        // Left edge not aligned, right edge aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr1; _addr < addr2; _addr += 2 ) {
                                *((ushort_t*) _addr ) = (ushort_t) color;
                            }
                            for( _addr = addr2; _addr < addr3; _addr += 4 ) {
                                *((uint_t*) _addr ) = temp1;
                            }
                            addr1 += m_width;
                            addr2 += m_width;
                            addr3 += m_width;
                        }
                    } else {
                        // Both edges not aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr1; _addr < addr2; _addr += 2 ) {
                                *((ushort_t*) _addr ) = (ushort_t) color;
                            }
                            for( _addr = addr2; _addr < addr3; _addr += 4 ) {
                                *((uint_t*) _addr ) = temp1;
                            }
                            for( _addr = addr3; _addr < addr4; _addr += 2 ) {
                                *((ushort_t*) _addr ) = (ushort_t) color;
                            }
                            addr1 += m_width;
                            addr2 += m_width;
                            addr3 += m_width;
                            addr4 += m_width;
                        }
                    }
                }
            }
            break;
        case 24: /** 24-bit (32-bit with NO alpha) **/
            blue = (uchar_t) color;
            green = (uchar_t) (color >> 8);
            red = (uchar_t) (color >> 16);

            if(w < 11) {
                // Small width
                for( int _y = y; _y < (y+h); _y++ )
                {
                    for( _addr = addr; _addr < addr + (w * 3); _addr += 3 )
                    {
                        *((uchar_t*) _addr ) = blue;
                        *((uchar_t*) (_addr + 1) ) = green;
                        *((uchar_t*) (_addr + 2) ) = red;
                    }
                    addr += m_width;
                }
            } else {
                addr1 = addr;                // Address of first byte at start of line
                addr2 = addr;                // Address of first pixel that starts on a dword boundary
                while( (addr2 & 3) != 0 ) addr2 += 3;
                addr4 = addr + w * 3;        // Address of byte after end of line
                addr3 = addr2 + ( (addr4 - addr2) / 12 ) * 12; // End of the last whole 12-byte group
                temp1 = blue | (green << 8) | (red << 16) | (blue << 24);
                temp2 = green | (red << 8) | (blue << 16) | (green << 24);
                temp3 = red | (blue << 8) | (green << 16) | (red << 24);

                if(addr1 == addr2) {
                    if(addr3 == addr4) {
                        // Both edges aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr2; _addr < addr3; _addr += 12 ) {
                                *((uint_t*) _addr ) = temp1;
                                *((uint_t*) (_addr + 4) ) = temp2;
                                *((uint_t*) (_addr + 8) ) = temp3;
                            }
                            addr2 += m_width;
                            addr3 += m_width;
                        }
                    } else {
                        // Left edge aligned, right edge not aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr2; _addr < addr3; _addr += 12 ) {
                                *((uint_t*) _addr ) = temp1;
                                *((uint_t*) (_addr + 4) ) = temp2;
                                *((uint_t*) (_addr + 8) ) = temp3;
                            }
                            for( _addr = addr3; _addr < addr4; _addr += 3 ) {
                                *((uchar_t*) _addr ) = blue;
                                *((uchar_t*) (_addr + 1) ) = green;
                                *((uchar_t*) (_addr + 2) ) = red;
                            }
                            addr2 += m_width;
                            addr3 += m_width;
                            addr4 += m_width;
                        }
                    }
                } else {
                    if(addr3 == addr4) {
                        // Left edge not aligned, right edge aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr1; _addr < addr2; _addr += 3 ) {
                                *((uchar_t*) _addr ) = blue;
                                *((uchar_t*) (_addr + 1) ) = green;
                                *((uchar_t*) (_addr + 2) ) = red;
                            }
                            for( _addr = addr2; _addr < addr3; _addr += 12) {
                                *((uint_t*) _addr ) = temp1;
                                *((uint_t*) (_addr + 4) ) = temp2;
                                *((uint_t*) (_addr + 8) ) = temp3;
                            }
                            addr1 += m_width;
                            addr2 += m_width;
                            addr3 += m_width;
                        }
                    } else {
                        // Both edges not aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr1; _addr < addr2; _addr += 3) {
                                *((uchar_t*) _addr ) = blue;
                                *((uchar_t*) (_addr + 1) ) = green;
                                *((uchar_t*) (_addr + 2) ) = red;
                            }
                            for( _addr = addr2; _addr < addr3; _addr += 12) {
                                *((uint_t*) _addr ) = temp1;
                                *((uint_t*) (_addr + 4) ) = temp2;
                                *((uint_t*) (_addr + 8) ) = temp3;
                            }
                            for( _addr = addr3; _addr < addr4; _addr += 3) {
                                *((uchar_t*) _addr ) = blue;
                                *((uchar_t*) (_addr + 1) ) = green;
                                *((uchar_t*) (_addr + 2) ) = red;
                            }
                            addr1 += m_width;
                            addr2 += m_width;
                            addr3 += m_width;
                            addr4 += m_width;
                        }
                    }
                }
            }
            break;
        case 32: /** 32-bit (1 channel for ALPHA) **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                for( _addr = addr; _addr < addr + (w * 4); _addr += 4 )
                {
                    *((uint_t*) _addr ) = color;
                }
                addr += m_width;
            }
            break;
    }
}
AFAIK to get any more performance out of this code you'd need to use:
    a) non-temporal prefetches, or prefetches, or preloading (select the first one that's supported by the CPU).
    b) SSE, or 64-bit code and MMX, or MMX (select the first one that's supported by the CPU, if any).
This means that for absolute maximum performance, you'd need to detect what the CPU supports and have 12 different "Rectangle()" functions.
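
As a rough sketch of how that selection could be done once at boot, here is one way to pick a variant for list (b) only. The `CpuFeatures` struct, the stub functions and `SelectRectangle()` are all made-up names for illustration; a real kernel would fill the flags in from CPUID and the stubs would contain the actual drawing code. Adding the three choices from list (a) would multiply the number of variants in the same way.

```cpp
#include <cstdint>

// Hypothetical feature flags; a real kernel would fill these in from CPUID.
struct CpuFeatures {
    bool sse;
    bool long_mode;  // 64-bit code is available
    bool mmx;
};

using RectangleFn = void (*)(uint32_t x, uint32_t y, uint32_t w, uint32_t h,
                             uint32_t color);

// Records which stub ran last, only so the sketch can be checked.
static int g_last_variant = 0;

// Placeholder variants; each would hold a differently optimized fill loop.
void RectangleSSE(uint32_t, uint32_t, uint32_t, uint32_t, uint32_t)   { g_last_variant = 1; }
void RectangleMMX64(uint32_t, uint32_t, uint32_t, uint32_t, uint32_t) { g_last_variant = 2; }
void RectangleMMX(uint32_t, uint32_t, uint32_t, uint32_t, uint32_t)   { g_last_variant = 3; }
void RectanglePlain(uint32_t, uint32_t, uint32_t, uint32_t, uint32_t) { g_last_variant = 4; }

// Select the first variant that the CPU supports, in the order of list (b).
RectangleFn SelectRectangle(const CpuFeatures& f) {
    if (f.sse) return RectangleSSE;
    if (f.long_mode && f.mmx) return RectangleMMX64;
    if (f.mmx) return RectangleMMX;
    return RectanglePlain;  // the "if any" case: no SIMD at all
}
```

The idea is that the rest of the GUI calls through the returned pointer, so the detection cost is paid once rather than on every call.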

It's still not finished though - you should protect yourself from bad input parameters. For example, at the start of the function add the following:

Code: Select all

    if(x >= the_horizontal_resolution_of_the_buffer) return;
    if(y >= the_vertical_resolution_of_the_buffer) return;
    if(x + w >= the_horizontal_resolution_of_the_buffer) {
        w = the_horizontal_resolution_of_the_buffer - x;
    }
    if(y + h >= the_vertical_resolution_of_the_buffer) {
        h = the_vertical_resolution_of_the_buffer - y;
    }
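
The same checks could also be pulled out into a helper so every drawing function can share them. This is only a sketch: `ClipRectangle()` and its parameter names are invented here, and it assumes the coordinates are unsigned 32-bit values (the original function's types may differ). Comparing against `res_x - x` instead of computing `x + w` avoids an overflow when the caller passes a huge width.

```cpp
#include <cstdint>

// Clips a rectangle to a buffer that is res_x by res_y pixels.
// Returns false if nothing is left to draw.
bool ClipRectangle(uint32_t& x, uint32_t& y, uint32_t& w, uint32_t& h,
                   uint32_t res_x, uint32_t res_y) {
    if (x >= res_x || y >= res_y) return false;  // starts outside the buffer
    if (w > res_x - x) w = res_x - x;            // trim the right edge
    if (h > res_y - y) h = res_y - y;            // trim the bottom edge
    return w != 0 && h != 0;
}
```

A caller would do something like `if (!ClipRectangle(x, y, w, h, res_x, res_y)) return;` before calculating any addresses.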
I'll leave the rest of the code for you to optimize. If you give all of your functions the same treatment as I did for "Rectangle()", then you'll probably be able to reduce the time it takes to redraw the screen on Bochs from 1 minute to a fraction of a second... :)

One last thing: none of the above is tested - I might be missing a few type casts, etc.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.

Post by pcmattman »

Wow. It's going to take a while to digest all that goodness!

Thanks heaps Brendan, that code was the code I intended for you to look at.

Post by pcmattman »

Thanks Brendan, my GUI is now 70% faster!

One question: is it possible to change the LFB address on the fly? I'm thinking of switching between the double buffer and the LFB every time the display needs to be blitted, or is there a better way to do this?

Post by Brendan »

Hi,
pcmattman wrote:One question: is it possible to change the LFB address on the fly? I'm thinking of switching between the double buffer and the LFB every time the display needs to be blitted, or is there a better way to do this?
It is technically possible to change the physical address of the LFB on the fly, but it involves messing with PCI configuration space for the card (or device specific I/O ports for non-PCI video cards) and making sure that caching is also adjusted. Adjusting caching may include MTRRs, PCI to PCI bridges and HyperTransport links, and making sure all CPUs use the same caching to avoid cache coherency problems.

It's much easier to change the linear address of the LFB on the fly. In this case all you'd do is change page table entries to point to the LFB instead of RAM (or the reverse).
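
To give a rough idea of what "change page table entries" means, here is a minimal sketch that assumes plain 32-bit paging with 4 KiB pages (no PAE). `RemapRange()` and the constant names are invented for this example, and the page table is just an array so the logic can be followed without a kernel around it.

```cpp
#include <cstddef>
#include <cstdint>

constexpr uint32_t kPageSize = 4096;
constexpr uint32_t kPresent  = 1u << 0;  // page is present
constexpr uint32_t kWritable = 1u << 1;  // page is writable

// Points `pages` consecutive page table entries, starting at `first_entry`,
// at the physical range that begins at `phys_base` (must be 4 KiB aligned).
void RemapRange(uint32_t* page_table, size_t first_entry,
                uint32_t phys_base, size_t pages, uint32_t flags) {
    for (size_t i = 0; i < pages; ++i) {
        uint32_t phys = static_cast<uint32_t>(phys_base + i * kPageSize);
        page_table[first_entry + i] = phys | flags;
        // A real kernel would also invalidate the TLB entry for this page
        // here (for example with INVLPG), which is left out of this sketch.
    }
}
```

Calling it once with the LFB's physical address and once with the RAM buffer's physical address swaps what the same linear addresses refer to. The caching flags for the two cases would normally differ too, which this sketch doesn't try to handle.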

There are other problems though. Typically a double buffer is used because it's faster to do graphics operations on fast/cached RAM, especially if those graphics operations either read from the buffer, or set the same pixels more than once, or don't work on the most efficient size transfers for the video card's bus.

The other reason people use double buffers is that it stops shearing effects. For example, if you draw a white rectangle and then draw some black text in it; then the user might see half a white rectangle, then a full white rectangle with a little bit of text, then the completed thing. A double buffer prevents this because you only make the data visible (blit it to the video card) when you've finished drawing it, so users can't see partially drawn screens.

There's a similar problem involving video synchronization. If the video card is currently sending data to the monitor and you blit new data from RAM to the video card, then the video card might send the first half of one frame to the monitor before the blit, and the second half of the next frame to the monitor after the blit. In this case the user sees half the old frame and half the new frame for a little while. To avoid this most systems let you wait for vertical retrace (which is a little gap of time between when the video card finishes sending one frame to the monitor and when it starts sending the next frame). Basically you wait until the video card has sent an entire frame to the monitor, then blit the new frame to the video card before it starts sending the next frame to the monitor.
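
For a VGA compatible card, the wait can be sketched as a polling loop on the Input Status #1 register (I/O port 0x3DA in colour modes), where bit 3 is set while vertical retrace is in progress. This is an assumption about the hardware: a card running a VESA mode isn't guaranteed to keep this register meaningful. The port read is passed in as `read_port` (a made-up parameter, standing in for your kernel's "inb" routine) so the loop itself can be tried outside a kernel.

```cpp
#include <cstdint>

constexpr uint16_t kInputStatus1 = 0x3DA;  // VGA Input Status #1 (colour modes)
constexpr uint8_t  kVRetraceBit  = 0x08;   // bit 3: vertical retrace in progress

// Waits for the start of the next vertical retrace.
// `read_port` is any callable that takes a port number and returns a byte.
template <typename ReadPort>
void WaitForVerticalRetrace(ReadPort read_port) {
    // If a retrace is already in progress, wait for it to end first,
    // otherwise we might start blitting with almost no time left.
    while (read_port(kInputStatus1) & kVRetraceBit) {
    }
    // Now wait for the next retrace to begin.
    while (!(read_port(kInputStatus1) & kVRetraceBit)) {
    }
}
```

The blit would start immediately after this returns. Busy-waiting like this burns CPU time, so it's only a starting point.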

There are alternative approaches though.

For example, it's possible to use "page flipping" where the video card's display memory is split into 2 areas. At any time one of the areas is "active" (its contents are being read by the video card and sent to the monitor) while the other area is "inactive" (its contents aren't being read by the video card, and can be changed without causing shearing effects). Once you've finished drawing in the inactive area, (and possibly after waiting for vertical retrace) you tell the video card to switch buffers. For example, if bufferA is active you'd draw into bufferB, then tell the video card to make bufferA inactive and bufferB active, then start drawing the next frame in bufferA (then tell the video card to make bufferB inactive and bufferA active).
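
The bookkeeping for this is small. Here's a sketch where `PageFlipper` and `SetActiveFn` are invented names, and the callback stands in for whatever the hardware actually needs to change the area being displayed (for VBE that would probably be the "set display start" function, but treat that as an assumption).

```cpp
#include <cstdint>

// Hypothetical stand-in for the hardware call that selects the displayed area.
using SetActiveFn = void (*)(int buffer_index);

class PageFlipper {
public:
    PageFlipper(uint8_t* buffer_a, uint8_t* buffer_b, SetActiveFn set_active)
        : buffers_{buffer_a, buffer_b}, active_(0), set_active_(set_active) {}

    // The area that is safe to draw into (not being sent to the monitor).
    uint8_t* Inactive() const { return buffers_[1 - active_]; }

    // Makes the freshly drawn area visible; the old one becomes the target
    // for the next frame.
    void Flip() {
        active_ = 1 - active_;
        set_active_(active_);
    }

private:
    uint8_t* buffers_[2];
    int active_;
    SetActiveFn set_active_;
};
```

Each frame you'd draw into `Inactive()`, optionally wait for vertical retrace, then call `Flip()`.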

However, often it's still faster to do your graphics operations on fast/cached RAM. Sometimes it's worth the hassle of using "triple-buffering". This is the same as page-flipping, but with an additional buffer in RAM. You draw your data into the buffer in RAM, blit it to the currently inactive buffer in display memory, then (possibly after waiting for vertical retrace) tell the video card to switch buffers (make the inactive buffer active, and the active buffer inactive). This reduces slow video display memory access penalties. It can also mean that the video driver waits for vertical retrace before it switches active/inactive buffers but your code never needs to wait for vertical retrace before it starts drawing the next frame in RAM.

There are still some problems here - you don't know what the video driver (and video hardware) may or may not support. This is relatively easy to fix - if the video driver has a "blit_my_buffer_from_RAM()" function then it could use double buffering or page flipping internally (with or without waiting for vertical retrace), without other software caring what happens inside the video driver.
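
As a sketch of that idea, the rest of the OS could see only an interface like the one below. `VideoDriver`, `BlitFromRam()` and `CopyToLfbDriver` are names made up for this example; the one driver shown does the simplest possible thing (a plain copy into the LFB), and a different driver could use page flipping behind the same call.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// What the rest of the system sees: "here is a finished frame, show it".
class VideoDriver {
public:
    virtual ~VideoDriver() = default;
    virtual void BlitFromRam(const uint8_t* frame, size_t bytes) = 0;
};

// Simplest possible driver: copy the frame straight into the frame buffer.
class CopyToLfbDriver : public VideoDriver {
public:
    CopyToLfbDriver(uint8_t* lfb, size_t lfb_bytes)
        : lfb_(lfb), lfb_bytes_(lfb_bytes) {}

    void BlitFromRam(const uint8_t* frame, size_t bytes) override {
        if (bytes > lfb_bytes_) bytes = lfb_bytes_;  // never write past the LFB
        std::memcpy(lfb_, frame, bytes);
    }

private:
    uint8_t* lfb_;
    size_t lfb_bytes_;
};
```

The GUI code only ever holds a `VideoDriver` reference, so swapping the strategy doesn't touch it.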

The only problem left is 2D/3D acceleration. If a very talented video device driver programmer decides to support 2D/3D acceleration in their video driver, they're screwed and you'll still be doing slow graphics operations in software. To get around this I'd have something like a video script in addition to the ability to upload raw pixel data, where the video script tells the video driver what to do to generate a frame (which may or may not include using previously uploaded raw pixel data). For example, an application might upload pixel data for some text, then send a script that tells the video driver to draw a large white rectangle and blit the pixel data on top of the white rectangle (with alpha blending).

In this case the video driver might do it all in software or hardware accelerate all or some of it, and the video driver can still use double buffering or page flipping internally (with or without waiting for vertical retrace), without other software caring what happens inside the video driver. You could also still have a "blit_my_buffer_from_RAM()" function for those legacy applications, and this could be implemented in a graphics library (where the library function uploads the raw pixel data, then sends a simple video script that just says "blit the uploaded pixel data without alpha blending"). In this case the legacy applications still won't ever get any benefits from 2D/3D acceleration, but that's probably what they deserve... ;)
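
To make the "video script" idea a bit more concrete, here's a very reduced sketch of what a script and a software fallback might look like. Everything in it (`Op`, `VideoCommand`, `RunScript()`) is hypothetical, it assumes a 32-bpp buffer, it expects commands to be clipped already, and it leaves out alpha blending to stay short. A driver with hardware acceleration would walk the same list and hand the commands to the card instead.

```cpp
#include <cstdint>
#include <vector>

enum class Op { FillRect, BlitUploaded };

struct VideoCommand {
    Op op;
    uint32_t x, y, w, h;     // destination rectangle, in pixels
    uint32_t color;          // used by FillRect
    const uint32_t* pixels;  // used by BlitUploaded: w * h uploaded pixels
};

// Software fallback: runs a script against a 32-bpp buffer that is
// `pitch` pixels wide. Commands are assumed to be clipped already.
void RunScript(const std::vector<VideoCommand>& script,
               uint32_t* buffer, uint32_t pitch) {
    for (const VideoCommand& c : script) {
        for (uint32_t row = 0; row < c.h; ++row) {
            uint32_t* dst = buffer + (c.y + row) * pitch + c.x;
            for (uint32_t col = 0; col < c.w; ++col) {
                dst[col] = (c.op == Op::FillRect)
                               ? c.color
                               : c.pixels[row * c.w + col];
            }
        }
    }
}
```

The white rectangle with text on top would then be a two entry script: one `FillRect` followed by one `BlitUploaded`.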


Cheers,

Brendan