Optimizing VESA
-
- Member
- Posts: 2566
- Joined: Sun Jan 14, 2007 9:15 pm
- Libera.chat IRC: miselin
- Location: Sydney, Australia (I come from a land down under!)
- Contact:
Optimizing VESA
Hi,
I'm reaching the point where I'm really happy with my kernel and can almost release it (once I put some syscalls in). However, I'm still not happy with my graphics functions, which I'm sure can be optimized. These functions won't be in the next release, but will be ready for later and I'd like to have them working nicely.
How do I go about making drawing to the frame buffer faster? I'm already using a double-buffer - maybe I should use hardware double-buffering?
I notice Windows draws reasonably fast in Bochs and Qemu, but in Bochs my GUI takes a good minute to draw (Qemu is fast though). How do they do it?
Any ideas?
Re: Optimizing VESA
Hi,
pcmattman wrote: I notice Windows draws reasonably fast in Bochs and Qemu, but in Bochs my GUI takes a good minute to draw (Qemu is fast though). How do they do it?
In Bochs, there is no "emulated hardware acceleration", so Windows has to be using more optimized code. Also in Bochs, there's no caching, so optimizing cache access (e.g. using write-combining, prefetching cache lines, not polluting the cache, etc) would make no difference.
I seem to be having trouble finding legally downloadable source code for Windows, so I'm not sure exactly how they've implemented their "non-accelerated" video code to make it fast. Likewise, I haven't seen your source code, so I'm not sure exactly how you've implemented your video code to make it "less fast"...
General principles would apply though - use MMX and/or SSE to do more at once if you can, don't do anything that can be skipped (e.g. don't update parts of the screen that haven't changed), find better algorithms (e.g. don't work on individual pixels), minimize access to video display memory as much as possible (e.g. never read from video display memory), etc.
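As one illustration of the "don't update parts of the screen that haven't changed" principle, here is a minimal sketch of dirty-rectangle tracking for an 8-bit back buffer. The names (`DirtyTracker`, `mark`, `flush`) are hypothetical and not taken from anyone's kernel; the only idea is that drawing code records the bounding box it touched, and the blit copies just that box.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: remembers the bounding box of everything drawn since
// the last flush, so only that region is copied to video memory.
struct DirtyTracker {
    uint32_t x0, y0, x1, y1;   // half-open box [x0,x1) x [y0,y1)
    bool dirty;

    DirtyTracker() : x0(0), y0(0), x1(0), y1(0), dirty(false) {}

    // Grow the box to include the rectangle that was just drawn.
    void mark(uint32_t x, uint32_t y, uint32_t w, uint32_t h) {
        if (w == 0 || h == 0) return;
        if (!dirty) {
            x0 = x; y0 = y; x1 = x + w; y1 = y + h;
            dirty = true;
        } else {
            x0 = std::min(x0, x);
            y0 = std::min(y0, y);
            x1 = std::max(x1, x + w);
            y1 = std::max(y1, y + h);
        }
    }

    // Copy only the dirty box from the back buffer to the front buffer
    // (8 bits per pixel, both buffers 'pitch' bytes per line).
    // Returns the number of bytes copied.
    std::size_t flush(const uint8_t *back, uint8_t *front, uint32_t pitch) {
        if (!dirty) return 0;
        assert(x1 <= pitch);   // the box must fit inside one line
        std::size_t copied = 0;
        for (uint32_t y = y0; y < y1; y++) {
            std::memcpy(front + y * pitch + x0, back + y * pitch + x0, x1 - x0);
            copied += x1 - x0;
        }
        dirty = false;
        return copied;
    }
};
```

A single bounding box is the simplest possible choice; two small changes in opposite corners would make it copy nearly the whole screen, so a list of boxes or a tile grid may work better in practice.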
pcmattman wrote: How do I go about making drawing to the frame buffer faster (and I'm already using a double-buffer - maybe I should use hardware double-buffering?)
First determine where your performance problems are coming from. I'm guessing that most of the problems are in the operations that draw things in the double-buffer (e.g. "setPixel()", "fillRectangle()", "drawCharacter()", "doHorizontalLine()", "doVerticalLine()", etc), and the code that copies the double-buffer to video display memory is mostly irrelevant.
@astrocrep: The video refresh setting will affect the performance of Windows' code and pcmattman's code equally, so a difference in performance between Windows' code and pcmattman's code can't be attributed to the video refresh setting in Bochs. If pcmattman's code takes a minute or so of processing to generate a screen then a higher refresh rate will make this worse because the host CPU will be spending more time per second copying data from the emulated video card's display memory to a window or something on the host computer, and less time per second running the emulated CPU (and pcmattman's video code).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Optimizing VESA
Brendan wrote: Likewise, I haven't seen your source code, so I'm not sure exactly how you've implemented your video code to make it "less fast"....
You can take a look at my CVS if you like (and now SVN, I'm using both now).
Brendan wrote: General principles would apply though - use MMX and/or SSE to do more at once if you can
I have never used MMX or SSE; care to give me a link to a tutorial or some pointers? I'll Google it in the meantime.
Brendan wrote: don't do anything that can be skipped (e.g. don't update parts of the screen that haven't changed), find better algorithms (e.g. don't work on individual pixels), minimize access to video display memory as much as possible (e.g. never read from video display memory), etc.
Already doing this.
Brendan wrote: draw things in the double-buffer (e.g. "setPixel()", "fillRectangle()", "drawCharacter()", "doHorizontalLine()", "doVerticalLine()", etc), and the code that copies the double-buffer to video display memory is mostly irrelevant
I can tell you now, the slow functions are not SetPixel, but DrawBitmap (not a BMP, but an already loaded array of pixels), DrawRectangle et al... but I have no idea how to optimize those functions.
Thanks for all your help.
Re: Optimizing VESA
Hi,
Apologies to all for this "slightly large" post....
Brendan wrote: draw things in the double-buffer (e.g. "setPixel()", "fillRectangle()", "drawCharacter()", "doHorizontalLine()", "doVerticalLine()", etc), and the code that copies the double-buffer to video display memory is mostly irrelevant
pcmattman wrote: I can tell you now, the slow functions are not SetPixel, but DrawBitmap (not a BMP, but an already loaded array of pixels), DrawRectangle et al... but I have no idea how to optimize those functions.
Is this the right code?
If it is, I'm not surprised it takes a minute or so to draw a screen...
First, there's a few places where you do something like this:
Code: Select all
// do not draw if it's already there!
if( *((uchar_t*) addr) == (uchar_t) color )
return;
// draw it
*((uchar_t*) addr ) = (uchar_t) color;
This doesn't help performance. Reading from the video card is slower than writing to it, and the conditional branch will be mispredicted by the CPU fairly often (causing expensive pipeline flushes). Even when it doesn't cause a pipeline flush the CPU won't be able to do the write until the read completes. Your video code would be at least twice as fast if you write to the video card regardless of whether you need to or not in this case.
For example, "SetPixel()" should be:
Code: Select all
void VGA::SetPixel( uint_t x, uint_t y, uint_t color )
{
    // get the address: the pixel offset is scaled by the number of bytes per pixel
    // (assumes m_width is in pixels)
    unsigned int addr = m_dbuff + ( ( m_width * y ) + x ) * ( m_bpp / 8 );
    // sets a pixel
    switch( m_bpp )
    {
        case 8: /** 8-bit **/
            *((uchar_t*) addr ) = (uchar_t) color;
            break;
        case 16: /** 16-bit **/
            *((ushort_t*) addr ) = (ushort_t) color;
            break;
        case 24: /** 24-bit (32-bit with NO alpha) **/
            // WRONG (writes 4 bytes, so it clobbers the next pixel) -> *((uint_t*) addr ) = color & 0x00FFFFFF;
            *((uchar_t*) addr ) = (uchar_t) color;
            *((uchar_t*) (addr + 1) ) = (uchar_t) (color >> 8);
            *((uchar_t*) (addr + 2) ) = (uchar_t) (color >> 16);
            break;
        case 32: /** 32-bit (1 channel for ALPHA) **/
            // TWICE? *((uint_t*) m_dbuff + ( ( m_width * y ) + x ) ) = color;
            *((uint_t*) addr ) = color;
            break;
    }
}
Next, comment out the "SetPixel()" function - it should never be used by any of the other functions (all of the other functions should access video display memory directly). For an example, consider your "Rectangle()" function:
Code: Select all
void VGA::Rectangle( uint_t x, uint_t y, uint_t w, uint_t h, uint_t color )
{
    for( int _y = y; _y < (y+h); _y++ )
    {
        for( int _x = x; _x < (x+w); _x++ )
        {
            SetPixel( _x, _y, color );
        }
    }
}
Now insert the (modified) "SetPixel()" into it to see how silly it looks:
Code: Select all
void VGA::Rectangle( uint_t x, uint_t y, uint_t w, uint_t h, uint_t color )
{
    for( int _y = y; _y < (y+h); _y++ )
    {
        for( int _x = x; _x < (x+w); _x++ )
        {
            // get the address (note: the loop variables _x and _y, not x and y)
            unsigned int addr = m_dbuff + ( ( m_width * _y ) + _x ) * ( m_bpp / 8 );
            // sets a pixel
            switch( m_bpp )
            {
                case 8: /** 8-bit **/
                    *((uchar_t*) addr ) = (uchar_t) color;
                    break;
                case 16: /** 16-bit **/
                    *((ushort_t*) addr ) = (ushort_t) color;
                    break;
                case 24: /** 24-bit (32-bit with NO alpha) **/
                    *((uchar_t*) addr ) = (uchar_t) color;
                    *((uchar_t*) (addr + 1) ) = (uchar_t) (color >> 8);
                    *((uchar_t*) (addr + 2) ) = (uchar_t) (color >> 16);
                    break;
                case 32: /** 32-bit (1 channel for ALPHA) **/
                    *((uint_t*) addr ) = color;
                    break;
            }
        }
    }
}
Now optimize to remove the silliness:
Code: Select all
void VGA::Rectangle( uint_t x, uint_t y, uint_t w, uint_t h, uint_t color )
{
    unsigned int addr;
    switch( m_bpp )
    {
        case 8: /** 8-bit **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                // address of the first pixel on this line (1 byte per pixel)
                addr = m_dbuff + ( ( m_width * _y ) + x );
                for( int _x = x; _x < (x+w); _x++ )
                {
                    *((uchar_t*) addr ) = (uchar_t) color;
                    addr += 1;
                }
            }
            break;
        case 16: /** 16-bit **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                // 2 bytes per pixel
                addr = m_dbuff + ( ( m_width * _y ) + x ) * 2;
                for( int _x = x; _x < (x+w); _x++ )
                {
                    *((ushort_t*) addr ) = (ushort_t) color;
                    addr += 2;
                }
            }
            break;
        case 24: /** 24-bit (32-bit with NO alpha) **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                // 3 bytes per pixel
                addr = m_dbuff + ( ( m_width * _y ) + x ) * 3;
                for( int _x = x; _x < (x+w); _x++ )
                {
                    *((uchar_t*) addr ) = (uchar_t) color;
                    *((uchar_t*) (addr + 1) ) = (uchar_t) (color >> 8);
                    *((uchar_t*) (addr + 2) ) = (uchar_t) (color >> 16);
                    addr += 3;
                }
            }
            break;
        case 32: /** 32-bit (1 channel for ALPHA) **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                // 4 bytes per pixel
                addr = m_dbuff + ( ( m_width * _y ) + x ) * 4;
                for( int _x = x; _x < (x+w); _x++ )
                {
                    *((uint_t*) addr ) = color;
                    addr += 4;
                }
            }
            break;
    }
}
Notice here that the switch/case thing is only done once on the outside of the loops (instead of for every single pixel on the inside of both loops), and that "addr" is only calculated once per horizontal line (not recalculated for every single pixel).
Now we can optimize it more. We can simplify the calculation for "addr" so there's no multiplication in the loop. For 24-bit colour it's splitting the colour into bytes on the inside of the loop, which could be done on the outside of the loop.
The code becomes something like:
Code: Select all
void VGA::Rectangle( uint_t x, uint_t y, uint_t w, uint_t h, uint_t color )
{
    unsigned int addr, _addr;
    unsigned int bytes = m_bpp / 8;       // bytes per pixel
    unsigned int pitch = m_width * bytes; // bytes per line (assumes m_width is in pixels)
    uchar_t red, green, blue;
    // address of the first pixel on the first line
    addr = m_dbuff + ( pitch * y ) + ( x * bytes );
    switch( m_bpp )
    {
        case 8: /** 8-bit **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                _addr = addr;
                for( int _x = x; _x < (x+w); _x++ )
                {
                    *((uchar_t*) _addr ) = (uchar_t) color;
                    _addr += 1;
                }
                addr += pitch;
            }
            break;
        case 16: /** 16-bit **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                _addr = addr;
                for( int _x = x; _x < (x+w); _x++ )
                {
                    *((ushort_t*) _addr ) = (ushort_t) color;
                    _addr += 2;
                }
                addr += pitch;
            }
            break;
        case 24: /** 24-bit (32-bit with NO alpha) **/
            // split the colour into bytes once, outside the loops
            blue = (uchar_t) color;
            green = (uchar_t) (color >> 8);
            red = (uchar_t) (color >> 16);
            for( int _y = y; _y < (y+h); _y++ )
            {
                _addr = addr;
                for( int _x = x; _x < (x+w); _x++ )
                {
                    *((uchar_t*) _addr ) = blue;
                    *((uchar_t*) (_addr + 1) ) = green;
                    *((uchar_t*) (_addr + 2) ) = red;
                    _addr += 3;
                }
                addr += pitch;
            }
            break;
        case 32: /** 32-bit (1 channel for ALPHA) **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                _addr = addr;
                for( int _x = x; _x < (x+w); _x++ )
                {
                    *((uint_t*) _addr ) = color;
                    _addr += 4;
                }
                addr += pitch;
            }
            break;
    }
}
Next, we're using "_x" only to control the inner loops (its value isn't used for anything else), and we're also using "_addr". It makes more sense to use "_addr" to control the loop, and get rid of "_x" completely.
Code: Select all
void VGA::Rectangle( uint_t x, uint_t y, uint_t w, uint_t h, uint_t color )
{
    unsigned int addr, _addr;
    unsigned int bytes = m_bpp / 8;       // bytes per pixel
    unsigned int pitch = m_width * bytes; // bytes per line (assumes m_width is in pixels)
    uchar_t red, green, blue;
    // address of the first pixel on the first line
    addr = m_dbuff + ( pitch * y ) + ( x * bytes );
    switch( m_bpp )
    {
        case 8: /** 8-bit **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                for( _addr = addr; _addr < addr + w; _addr += 1 )
                {
                    *((uchar_t*) _addr ) = (uchar_t) color;
                }
                addr += pitch;
            }
            break;
        case 16: /** 16-bit **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                for( _addr = addr; _addr < addr + (w * 2); _addr += 2 )
                {
                    *((ushort_t*) _addr ) = (ushort_t) color;
                }
                addr += pitch;
            }
            break;
        case 24: /** 24-bit (32-bit with NO alpha) **/
            blue = (uchar_t) color;
            green = (uchar_t) (color >> 8);
            red = (uchar_t) (color >> 16);
            for( int _y = y; _y < (y+h); _y++ )
            {
                for( _addr = addr; _addr < addr + (w * 3); _addr += 3 )
                {
                    *((uchar_t*) _addr ) = blue;
                    *((uchar_t*) (_addr + 1) ) = green;
                    *((uchar_t*) (_addr + 2) ) = red;
                }
                addr += pitch;
            }
            break;
        case 32: /** 32-bit (1 channel for ALPHA) **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                for( _addr = addr; _addr < addr + (w * 4); _addr += 4 )
                {
                    *((uint_t*) _addr ) = color;
                }
                addr += pitch;
            }
            break;
    }
}
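Now, for 8-bit it's doing 4 separate writes to display memory when it could do 4 pixels at a time as one 32-bit write. This is a bit tricky because you'd want to make sure 32-bit writes are aligned. Similarly, 16-bit could do 2 pixels at a time. 24-bit is a little trickier because a pixel is 3 bytes, so we need to do 12 bytes at a time (4 pixels as three 32-bit writes) to keep the 32-bit writes aligned.
Code: Select all
void VGA::Rectangle( uint_t x, uint_t y, uint_t w, uint_t h, uint_t color )
{
    unsigned int addr, _addr;
    unsigned int bytes = m_bpp / 8;       // bytes per pixel
    unsigned int pitch = m_width * bytes; // bytes per line (assumes m_width is in pixels)
    uchar_t red, green, blue;
    uint_t temp1, temp2, temp3;
    addr = m_dbuff + ( pitch * y ) + ( x * bytes );
    switch( m_bpp )
    {
        case 8: /** 8-bit **/
            // the same 8-bit colour repeated 4 times
            temp1 = color & 0xFF;
            temp1 |= (temp1 << 8) | (temp1 << 16) | (temp1 << 24);
            for( int _y = y; _y < (y+h); _y++ )
            {
                _addr = addr;
                // single pixels until the address is 4-byte aligned
                while( ( _addr < addr + w ) && ( (_addr & 3) != 0) ) {
                    *((uchar_t*) _addr ) = (uchar_t) color;
                    _addr++;
                }
                // 4 pixels per aligned 32-bit write
                while( _addr + 3 < addr + w ) {
                    *((uint_t*) _addr ) = temp1;
                    _addr += 4;
                }
                // whatever is left at the end of the line
                while( _addr < addr + w ) {
                    *((uchar_t*) _addr ) = (uchar_t) color;
                    _addr++;
                }
                addr += pitch;
            }
            break;
        case 16: /** 16-bit **/
            // the same 16-bit colour repeated twice
            temp1 = color & 0xFFFF;
            temp1 |= temp1 << 16;
            for( int _y = y; _y < (y+h); _y++ )
            {
                _addr = addr;
                while( ( _addr < addr + (w * 2) ) && ( (_addr & 3) != 0) ) {
                    *((ushort_t*) _addr ) = (ushort_t) color;
                    _addr += 2;
                }
                // 2 pixels per aligned 32-bit write
                while( _addr + 3 < addr + (w * 2) ) {
                    *((uint_t*) _addr ) = temp1;
                    _addr += 4;
                }
                while( _addr < addr + (w * 2) ) {
                    *((ushort_t*) _addr ) = (ushort_t) color;
                    _addr += 2;
                }
                addr += pitch;
            }
            break;
        case 24: /** 24-bit (32-bit with NO alpha) **/
            blue = (uchar_t) color;
            green = (uchar_t) (color >> 8);
            red = (uchar_t) (color >> 16);
            // 4 pixels (BGR BGR BGR BGR) packed into three little-endian dwords
            temp1 = (uint_t) blue | ((uint_t) green << 8) | ((uint_t) red << 16) | ((uint_t) blue << 24);
            temp2 = (uint_t) green | ((uint_t) red << 8) | ((uint_t) blue << 16) | ((uint_t) green << 24);
            temp3 = (uint_t) red | ((uint_t) blue << 8) | ((uint_t) green << 16) | ((uint_t) red << 24);
            for( int _y = y; _y < (y+h); _y++ )
            {
                _addr = addr;
                // single pixels until a pixel starts on a 4-byte boundary (at most 3 pixels)
                while( ( _addr < addr + (w * 3) ) && ( (_addr & 3) != 0) ) {
                    *((uchar_t*) _addr ) = blue;
                    *((uchar_t*) (_addr + 1) ) = green;
                    *((uchar_t*) (_addr + 2) ) = red;
                    _addr += 3;
                }
                // 4 pixels per three aligned 32-bit writes
                while( _addr + 11 < addr + (w * 3) ) {
                    *((uint_t*) _addr ) = temp1;
                    *((uint_t*) (_addr + 4) ) = temp2;
                    *((uint_t*) (_addr + 8) ) = temp3;
                    _addr += 12;
                }
                while( _addr < addr + (w * 3) ) {
                    *((uchar_t*) _addr ) = blue;
                    *((uchar_t*) (_addr + 1) ) = green;
                    *((uchar_t*) (_addr + 2) ) = red;
                    _addr += 3;
                }
                addr += pitch;
            }
            break;
        case 32: /** 32-bit (1 channel for ALPHA) **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                for( _addr = addr; _addr < addr + (w * 4); _addr += 4 )
                {
                    *((uint_t*) _addr ) = color;
                }
                addr += pitch;
            }
            break;
    }
}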
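The three-loop pattern (unaligned head, aligned 32-bit body, leftover tail) is easy to try outside the kernel on an ordinary array. The following is a minimal sketch for 8-bit pixels only; `fill_span8` is a hypothetical name, and it uses `memcpy` for the 32-bit store so that it stays valid on hosts that are strict about aliasing.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Fill 'count' 8-bit pixels starting at 'dst' with 'color':
// single bytes until a 4-byte boundary, then 32-bit stores, then leftovers.
void fill_span8(uint8_t *dst, uint32_t count, uint8_t color) {
    uint8_t *end = dst + count;
    uint32_t packed = color * 0x01010101u;   // same byte in all four lanes

    // head: single bytes until the address is 4-byte aligned
    while (dst < end && (reinterpret_cast<uintptr_t>(dst) & 3) != 0)
        *dst++ = color;
    // body: four pixels per write (typically compiled to one 32-bit store)
    while (end - dst >= 4) {
        std::memcpy(dst, &packed, 4);
        dst += 4;
    }
    // tail: whatever is left (0 to 3 pixels)
    while (dst < end)
        *dst++ = color;
}
```

Calling it on a deliberately misaligned start, e.g. `fill_span8(buf + 1, 13, 0xAB)`, exercises all three loops.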
Lastly, if the width of the rectangle is small it'll only need the first "while" loop, so we can do a special case to make that faster.
If the starting address of one horizontal line is aligned on a 4 byte boundary, then we could assume that the starting address for the next horizontal line will also be aligned. In this case, if the assumption is true, then we could do a special case to make it faster (by skipping the first "while" loop). We can make sure this assumption is true by only ever allowing buffers that are suitable sizes. For 8-bit pixels the horizontal resolution must be a multiple of 4, for 16-bit pixels the horizontal resolution must be a multiple of 2, and for 24-bit pixels the horizontal resolution must be a multiple of 4. That sounds acceptable to me...
The same thinking applies to the ending addresses - if the ending address of one horizontal line is aligned on a 4 byte boundary, then we could assume that the ending address for the next horizontal line will also be aligned, and we could skip the last "while" loop to make that faster.
This gives 5 special cases - small width, left and right edges aligned, left only aligned, right only aligned, and left and right edges unaligned:
Code: Select all
void VGA::Rectangle( uint_t x, uint_t y, uint_t w, uint_t h, uint_t color )
{
    unsigned int addr, _addr;
    unsigned int addr1, addr2, addr3, addr4;
    unsigned int bytes = m_bpp / 8;       // bytes per pixel
    unsigned int pitch = m_width * bytes; // bytes per line (assumes m_width is in pixels)
    uchar_t red, green, blue;
    uint_t temp1, temp2, temp3;
    addr = m_dbuff + ( pitch * y ) + ( x * bytes );
    switch( m_bpp )
    {
        case 8: /** 8-bit **/
            if(w < 7) {
                // Small width
                for( int _y = y; _y < (y+h); _y++ )
                {
                    for( _addr = addr; _addr < addr + w; _addr += 1 )
                    {
                        *((uchar_t*) _addr ) = (uchar_t) color;
                    }
                    addr += pitch;
                }
            } else {
                addr1 = addr; // Address of first byte at start of line
                addr2 = (addr + 3) & ~3; // Address of first aligned dword at start of line
                addr3 = (addr + w) & ~3; // Address just past the last whole aligned dword in the line
                addr4 = addr + w; // Address of byte after end of line
                temp1 = color & 0xFF;
                temp1 |= (temp1 << 8) | (temp1 << 16) | (temp1 << 24);
                if(addr1 == addr2) {
                    if(addr3 == addr4) {
                        // Both edges aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr2; _addr < addr3; _addr += 4) {
                                *((uint_t*) _addr ) = temp1;
                            }
                            addr2 += pitch;
                            addr3 += pitch;
                        }
                    } else {
                        // Left edge aligned, right edge not aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr2; _addr < addr3; _addr += 4) {
                                *((uint_t*) _addr ) = temp1;
                            }
                            for( _addr = addr3; _addr < addr4; _addr++) {
                                *((uchar_t*) _addr ) = (uchar_t) color;
                            }
                            addr2 += pitch;
                            addr3 += pitch;
                            addr4 += pitch;
                        }
                    }
                } else {
                    if(addr3 == addr4) {
                        // Left edge not aligned, right edge aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr1; _addr < addr2; _addr++) {
                                *((uchar_t*) _addr ) = (uchar_t) color;
                            }
                            for( _addr = addr2; _addr < addr3; _addr += 4) {
                                *((uint_t*) _addr ) = temp1;
                            }
                            addr1 += pitch;
                            addr2 += pitch;
                            addr3 += pitch;
                        }
                    } else {
                        // Both edges not aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr1; _addr < addr2; _addr++) {
                                *((uchar_t*) _addr ) = (uchar_t) color;
                            }
                            for( _addr = addr2; _addr < addr3; _addr += 4) {
                                *((uint_t*) _addr ) = temp1;
                            }
                            for( _addr = addr3; _addr < addr4; _addr++) {
                                *((uchar_t*) _addr ) = (uchar_t) color;
                            }
                            addr1 += pitch;
                            addr2 += pitch;
                            addr3 += pitch;
                            addr4 += pitch;
                        }
                    }
                }
            }
            break;
        case 16: /** 16-bit **/
            if(w < 3) {
                // Small width
                for( int _y = y; _y < (y+h); _y++ )
                {
                    for( _addr = addr; _addr < addr + (w * 2); _addr += 2 )
                    {
                        *((ushort_t*) _addr ) = (ushort_t) color;
                    }
                    addr += pitch;
                }
            } else {
                addr1 = addr; // Address of first byte at start of line
                addr2 = (addr + 3) & ~3; // Address of first aligned dword at start of line
                addr3 = (addr + w * 2) & ~3; // Address just past the last whole aligned dword in the line
                addr4 = addr + w * 2; // Address of byte after end of line
                temp1 = color & 0xFFFF;
                temp1 |= temp1 << 16;
                if(addr1 == addr2) {
                    if(addr3 == addr4) {
                        // Both edges aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr2; _addr < addr3; _addr += 4) {
                                *((uint_t*) _addr ) = temp1;
                            }
                            addr2 += pitch;
                            addr3 += pitch;
                        }
                    } else {
                        // Left edge aligned, right edge not aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr2; _addr < addr3; _addr += 4) {
                                *((uint_t*) _addr ) = temp1;
                            }
                            for( _addr = addr3; _addr < addr4; _addr += 2) {
                                *((ushort_t*) _addr ) = (ushort_t) color;
                            }
                            addr2 += pitch;
                            addr3 += pitch;
                            addr4 += pitch;
                        }
                    }
                } else {
                    if(addr3 == addr4) {
                        // Left edge not aligned, right edge aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr1; _addr < addr2; _addr += 2) {
                                *((ushort_t*) _addr ) = (ushort_t) color;
                            }
                            for( _addr = addr2; _addr < addr3; _addr += 4) {
                                *((uint_t*) _addr ) = temp1;
                            }
                            addr1 += pitch;
                            addr2 += pitch;
                            addr3 += pitch;
                        }
                    } else {
                        // Both edges not aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr1; _addr < addr2; _addr += 2) {
                                *((ushort_t*) _addr ) = (ushort_t) color;
                            }
                            for( _addr = addr2; _addr < addr3; _addr += 4) {
                                *((uint_t*) _addr ) = temp1;
                            }
                            for( _addr = addr3; _addr < addr4; _addr += 2) {
                                *((ushort_t*) _addr ) = (ushort_t) color;
                            }
                            addr1 += pitch;
                            addr2 += pitch;
                            addr3 += pitch;
                            addr4 += pitch;
                        }
                    }
                }
            }
            break;
        case 24: /** 24-bit (32-bit with NO alpha) **/
            blue = (uchar_t) color;
            green = (uchar_t) (color >> 8);
            red = (uchar_t) (color >> 16);
            if(w < 11) {
                // Small width
                for( int _y = y; _y < (y+h); _y++ )
                {
                    for( _addr = addr; _addr < addr + (w * 3); _addr += 3 )
                    {
                        *((uchar_t*) _addr ) = blue;
                        *((uchar_t*) (_addr + 1) ) = green;
                        *((uchar_t*) (_addr + 2) ) = red;
                    }
                    addr += pitch;
                }
            } else {
                addr1 = addr; // Address of first byte at start of line
                addr2 = addr + (addr & 3) * 3; // Address of first pixel that starts on an aligned dword
                addr4 = addr + w * 3; // Address of byte after end of line
                addr3 = addr2 + ((addr4 - addr2) / 12) * 12; // Address just past the last whole group of 4 pixels (12 bytes)
                temp1 = (uint_t) blue | ((uint_t) green << 8) | ((uint_t) red << 16) | ((uint_t) blue << 24);
                temp2 = (uint_t) green | ((uint_t) red << 8) | ((uint_t) blue << 16) | ((uint_t) green << 24);
                temp3 = (uint_t) red | ((uint_t) blue << 8) | ((uint_t) green << 16) | ((uint_t) red << 24);
                if(addr1 == addr2) {
                    if(addr3 == addr4) {
                        // Both edges aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr2; _addr < addr3; _addr += 12) {
                                *((uint_t*) _addr ) = temp1;
                                *((uint_t*) (_addr + 4) ) = temp2;
                                *((uint_t*) (_addr + 8) ) = temp3;
                            }
                            addr2 += pitch;
                            addr3 += pitch;
                        }
                    } else {
                        // Left edge aligned, right edge not aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr2; _addr < addr3; _addr += 12) {
                                *((uint_t*) _addr ) = temp1;
                                *((uint_t*) (_addr + 4) ) = temp2;
                                *((uint_t*) (_addr + 8) ) = temp3;
                            }
                            for( _addr = addr3; _addr < addr4; _addr += 3) {
                                *((uchar_t*) _addr ) = blue;
                                *((uchar_t*) (_addr + 1) ) = green;
                                *((uchar_t*) (_addr + 2) ) = red;
                            }
                            addr2 += pitch;
                            addr3 += pitch;
                            addr4 += pitch;
                        }
                    }
                } else {
                    if(addr3 == addr4) {
                        // Left edge not aligned, right edge aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr1; _addr < addr2; _addr += 3) {
                                *((uchar_t*) _addr ) = blue;
                                *((uchar_t*) (_addr + 1) ) = green;
                                *((uchar_t*) (_addr + 2) ) = red;
                            }
                            for( _addr = addr2; _addr < addr3; _addr += 12) {
                                *((uint_t*) _addr ) = temp1;
                                *((uint_t*) (_addr + 4) ) = temp2;
                                *((uint_t*) (_addr + 8) ) = temp3;
                            }
                            addr1 += pitch;
                            addr2 += pitch;
                            addr3 += pitch;
                        }
                    } else {
                        // Both edges not aligned
                        for( int _y = y; _y < (y+h); _y++ )
                        {
                            for( _addr = addr1; _addr < addr2; _addr += 3) {
                                *((uchar_t*) _addr ) = blue;
                                *((uchar_t*) (_addr + 1) ) = green;
                                *((uchar_t*) (_addr + 2) ) = red;
                            }
                            for( _addr = addr2; _addr < addr3; _addr += 12) {
                                *((uint_t*) _addr ) = temp1;
                                *((uint_t*) (_addr + 4) ) = temp2;
                                *((uint_t*) (_addr + 8) ) = temp3;
                            }
                            for( _addr = addr3; _addr < addr4; _addr += 3) {
                                *((uchar_t*) _addr ) = blue;
                                *((uchar_t*) (_addr + 1) ) = green;
                                *((uchar_t*) (_addr + 2) ) = red;
                            }
                            addr1 += pitch;
                            addr2 += pitch;
                            addr3 += pitch;
                            addr4 += pitch;
                        }
                    }
                }
            }
            break;
        case 32: /** 32-bit (1 channel for ALPHA) **/
            for( int _y = y; _y < (y+h); _y++ )
            {
                for( _addr = addr; _addr < addr + (w * 4); _addr += 4 )
                {
                    *((uint_t*) _addr ) = color;
                }
                addr += pitch;
            }
            break;
    }
}
AFAIK to get any more performance out of this code you'd need to use:
- a) non-temporal prefetches, or prefetches, or preloading (select the first one that's supported by the CPU).
- b) SSE, or 64-bit code and MMX, or MMX (select the first one that's supported by the CPU, if any).
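Since pcmattman asked for pointers on SSE, here is a minimal sketch of a 32-bit-per-pixel span fill using the SSE2 intrinsics from `<emmintrin.h>`. `fill_span32` is a hypothetical name; the sketch assumes the compiler defines `__SSE2__` when SSE2 code generation is enabled, and it falls back to a plain loop otherwise. In a kernel you would also have to enable SSE (the relevant CR0/CR4 bits) and take care of saving SSE state before using it; none of that is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Fill 'count' 32-bit pixels at 'dst' with 'color'.
void fill_span32(uint32_t *dst, std::size_t count, uint32_t color) {
    std::size_t i = 0;
#if defined(__SSE2__)
    // scalar writes until the address is 16-byte aligned
    while (i < count && (reinterpret_cast<uintptr_t>(dst + i) & 15) != 0)
        dst[i++] = color;
    // four pixels per aligned 128-bit store
    __m128i v = _mm_set1_epi32(static_cast<int>(color));
    for (; i + 4 <= count; i += 4)
        _mm_store_si128(reinterpret_cast<__m128i *>(dst + i), v);
#endif
    // leftovers (or everything, when SSE2 is not available)
    for (; i < count; i++)
        dst[i] = color;
}
```

For writes that go straight to video display memory, `_mm_stream_si128` (a non-temporal store) may be worth trying in place of `_mm_store_si128`; whether it helps depends on the hardware, so measure it.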
It's still not finished though - you should protect yourself from bad input parameters. For example, at the start of the function add the following:
Code: Select all
if(x >= the_horizontal_resolution_of_the_buffer) return;
if(y >= the_vertical_resolution_of_the_buffer) return;
if(x + w > the_horizontal_resolution_of_the_buffer) {
    w = the_horizontal_resolution_of_the_buffer - x;
}
if(y + h > the_vertical_resolution_of_the_buffer) {
    h = the_vertical_resolution_of_the_buffer - y;
}
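The same parameter check can be written as a small stand-alone helper that is easy to test on its own. This is only a sketch: `clip_rect`, `res_x` and `res_y` are hypothetical names, and it compares against the remaining space rather than computing `x + w`, so a huge width can't wrap around.

```cpp
#include <cassert>
#include <cstdint>

// Clip the rectangle (x, y, w, h) against a buffer of res_x by res_y pixels.
// w and h are shrunk in place; returns false when nothing is left to draw.
bool clip_rect(uint32_t x, uint32_t y, uint32_t &w, uint32_t &h,
               uint32_t res_x, uint32_t res_y) {
    if (x >= res_x || y >= res_y) return false;
    // compare against the remaining space so that x + w cannot overflow
    if (w > res_x - x) w = res_x - x;
    if (h > res_y - y) h = res_y - y;
    return w != 0 && h != 0;
}
```

Each drawing function would call it first and return straight away when it reports that nothing is visible.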
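I'll leave the rest of the code for you to optimize. If you give all of your functions the same treatment as I did for "Rectangle()", then you'll probably be able to reduce the time it takes to redraw the screen on Bochs from 1 minute to a fraction of a second...
One last thing: none of the above is tested - I might be missing a few type casts, etc.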
Cheers,
Brendan
Hi,
It's much easier to change the linear address of the LFB on the fly. In this case all you'd do is change page table entries to point to the LFB instead of RAM (or the reverse).
There are other problems though. Typically a double buffer is used because it's faster to do graphics operations on fast/cached RAM, especially if those graphics operations either read from the buffer, or set the same pixels more than once, or don't work on the most efficient size transfers for the video card's bus.
The other reason people use double buffers is that it stops shearing effects. For example, if you draw a white rectangle and then draw some black text in it; then the user might see half a white rectangle, then a full white rectangle with a little bit of text, then the completed thing. A double buffer prevents this because you only make the data visible (blit it to the video card) when you've finished drawing it, so users can't see partially drawn screens.
There's a similar problem involving video synchronization. If the video card is currently sending data to the monitor and you blit new data from RAM to the video card, then the video card might send the first half of one frame to the monitor before the blit, and the second half of the next frame to the monitor after the blit. In this case the user sees half the old frame and half the new frame for a little while. To avoid this most systems let you wait for vertical retrace (which is a little gap of time between when the video card finishes sending one frame to the monitor and when it starts sending the next frame). Basically you wait until the video card has sent an entire frame to the monitor, then blit the new frame to the video card before it starts sending the next frame to the monitor.
There are alternative approaches though.
For example, it's possible to use "page flipping" where the video card's display memory is split into 2 areas. At any time one of the areas is "active" (its contents are being read by the video card and sent to the monitor) while the other area is "inactive" (its contents aren't being read by the video card, and can be changed without causing shearing effects). Once you've finished drawing in the inactive area, (and possibly after waiting for vertical retrace) you tell the video card to switch buffers. For example, if bufferA is active you'd draw into bufferB, then tell the video card to make bufferA inactive and bufferB active, then start drawing the next frame in bufferA (then tell the video card to make bufferB inactive and bufferA active).
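To make the bookkeeping concrete, here is a minimal sketch that models page flipping on a plain array. `PageFlipper` and its members are hypothetical names; on real VBE hardware the switch would typically go through the VBE "Set Display Start" function (4F07h), which is reduced to flipping an index here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of page flipping: display memory holds two pages and the
// "video card" scans out whichever one is active.
struct PageFlipper {
    uint8_t *vram;            // start of display memory (both pages)
    std::size_t page_bytes;   // size of one page in bytes
    int active;               // page currently being displayed (0 or 1)

    PageFlipper(uint8_t *mem, std::size_t bytes)
        : vram(mem), page_bytes(bytes), active(0) {}

    // Inactive page: safe to draw into.
    uint8_t *back() { return vram + (1 - active) * page_bytes; }
    // Active page: currently being sent to the monitor.
    const uint8_t *front() const { return vram + active * page_bytes; }

    // Make the finished page visible. A real driver would wait for vertical
    // retrace here and then reprogram the display start address.
    void flip() { active = 1 - active; }
};
```

Drawing always targets `back()`, and nothing drawn there shows up in `front()` until `flip()` is called.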
However, often it's still faster to do your graphics operations on fast/cached RAM. Sometimes it's worth the hassle of using "triple-buffering". This is the same as page-flipping, but with an additional buffer in RAM. You draw your data into the buffer in RAM, blit it to the currently inactive buffer in display memory, then (possibly after waiting for vertical retrace) tell the video card to switch buffers (make the inactive buffer active, and the active buffer inactive). This reduces slow video display memory access penalties. It can also mean that the video driver waits for vertical retrace before it switches active/inactive buffers but your code never needs to wait for vertical retrace before it starts drawing the next frame in RAM.
There's still some problems here - you don't know what the video driver (and video hardware) may or may not support. This is relatively easy to fix - if the video driver has a "blit_my_buffer_from_RAM()" function then it could use double buffering or page flipping internally (with or without waiting for vertical retrace), without other software caring what happens inside the video driver.
The only problem left is 2D/3D acceleration. If a very talented video device driver programmer decides to support 2D/3D acceleration in their video driver, they're screwed: with nothing but a "blit this buffer" interface, applications have no way to ask the driver to do the drawing, so you'll still be doing slow graphics operations in software. To get around this I'd have something like a video script in addition to the ability to upload raw pixel data, where the video script tells the video driver what to do to generate a frame (which may or may not include using previously uploaded raw pixel data). For example, an application might upload pixel data for some text, then send a script that tells the video driver to draw a large white rectangle and blit the pixel data on top of the white rectangle (with alpha blending).
In this case the video driver might do it all in software or hardware accelerate all or some of it, and the video driver can still use double buffering or page flipping internally (with or without waiting for vertical retrace), without other software caring what happens inside the video driver. You could also still have a "blit_my_buffer_from_RAM()" function for those legacy applications, and this could be implemented in a graphics library (where the library function uploads the raw pixel data, then sends a simple video script that just says "blit the uploaded pixel data without alpha blending"). In this case the legacy applications still won't ever get any benefits from 2D/3D acceleration, but that's probably what they deserve...
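To make the video script idea a bit more concrete, here is a small sketch of a command list plus a software-only interpreter. Everything in it (the `vs_` names, the opcodes, the 0xAARRGGBB pixel format) is invented for illustration; a driver with acceleration would walk the same list and hand the commands to the hardware instead. It works pixel by pixel purely to stay short; a real software path would process whole spans.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum vs_op { VS_FILL_RECT, VS_BLIT, VS_BLIT_ALPHA };

/* Previously uploaded raw pixel data, assumed to be 0xAARRGGBB. */
struct vs_image { int w, h; const uint32_t *argb; };

struct vs_cmd {
    enum vs_op op;
    int x, y;                    /* destination position */
    int w, h;                    /* size (VS_FILL_RECT only) */
    uint32_t color;              /* fill colour (VS_FILL_RECT only) */
    const struct vs_image *img;  /* source image (blit ops only) */
};

struct vs_target { int w, h; uint32_t *pixels; };

/* Blend src over dst using src's alpha; the result is fully opaque. */
static uint32_t vs_blend(uint32_t dst, uint32_t src)
{
    uint32_t a = src >> 24, out = 0xFF000000u;
    for (int shift = 0; shift < 24; shift += 8) {
        uint32_t s = (src >> shift) & 0xFF, d = (dst >> shift) & 0xFF;
        out |= ((s * a + d * (255 - a) + 127) / 255) << shift;
    }
    return out;
}

static void vs_put(struct vs_target *t, int x, int y, uint32_t c, int alpha)
{
    if (x < 0 || y < 0 || x >= t->w || y >= t->h)
        return;                               /* clip to the target */
    uint32_t *p = &t->pixels[(size_t)y * (size_t)t->w + (size_t)x];
    *p = alpha ? vs_blend(*p, c) : c;
}

/* Software fallback: execute every command in order. */
static void vs_run(struct vs_target *t, const struct vs_cmd *cmds, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const struct vs_cmd *c = &cmds[i];
        int fill = (c->op == VS_FILL_RECT);
        assert(fill || c->img != NULL);
        int w = fill ? c->w : c->img->w;
        int h = fill ? c->h : c->img->h;
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                vs_put(t, c->x + x, c->y + y,
                       fill ? c->color
                            : c->img->argb[(size_t)y * (size_t)c->img->w + (size_t)x],
                       c->op == VS_BLIT_ALPHA);
    }
}
```

The "legacy" blit_my_buffer_from_RAM() library call would then amount to a one-command script: a single `VS_BLIT` of the uploaded frame at (0, 0).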
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.