Alpha Blending - A painful process

PearOs · Post by **PearOs** » Thu Jan 30, 2014 11:57 pm

Hey guys, so I have been working on Alpha blending. Well it works, and my assembly routine without MMX or SSE is 35 instructions long total. Not too bad I guess. However its not the fastest thing in the world, and I knew this from the git go.

So here's my equation: (Source Color * Alpha + Dest Color * (255 - Alpha)) >> 8 where Source Color is a Color component of a Pixel, ex R or G.

Would it make a difference to use a lookup table instead? To kind of remove some of the multiplications or would it be better to try to use SEE to do it since it has large registers?

Thanks guys, Matt

gerryg400 · Post by **gerryg400** » Fri Jan 31, 2014 1:11 am

Don't forget the most obvious optimisation

Code: Select all

        if (alpha == 0) {
            /* Do nothing */
        }
        else if (alpha == 0xff) {
            d = s;
        }
        else {
            /* Your blending code */
        }

I think that lookup tables would require some sort of multiply.

Combuster · Post by **Combuster** » Fri Jan 31, 2014 1:15 am

PearOs wrote:Source Color * Alpha

In your case, is this ever going to change?

try to use SEE

SSE; And why not let the compiler handle that for you?

Brendan · Post by **Brendan** » Fri Jan 31, 2014 1:31 am

Hi,

PearOs wrote:Hey guys, so I have been working on Alpha blending. Well it works, and my assembly routine without MMX or SSE is 35 instructions long total. Not too bad I guess. However its not the fastest thing in the world, and I knew this from the git go.

So here's my equation: (Source Color * Alpha + Dest Color * (255 - Alpha)) >> 8 where Source Color is a Color component of a Pixel, ex R or G.

Would it make a difference to use a lookup table instead? To kind of remove some of the multiplications or would it be better to try to use SEE to do it since it has large registers?

First decide if the equation is right. For example, if the source colour is 255, dest colour is 255 and the alpha is 128; then:

Code: Select all

( 255*128 + 255*(255-128) ) >> 8
= ( 255*128 + 255*127 ) >> 8
= ( 32640 + 32385 ) >> 8
= 65025 >> 8
= 254

The correct answer would be 255, but this equation gave us 254. What happened is that your shift is dividing by 256 but it should be dividing by 255 instead. The correct equation would be: ( src * alpha + dest * (255 - alpha) ) / 255

Division is slow and you want to avoid it. Multiplications aren't too fast either. Something like "result = lookupTable[src][dest][alpha]" would cost 16 MiB (and that's too much).

Instead, you could consider something like this:

Code: Select all

    result = ( src * (alpha << 8 + alpha) + dest * ( (alpha ^ 0xFF) << 8 + (alpha ^ 0xFF) ) ) >> 16;

That looks ugly; but with a lookup table you can do this:

Code: Select all

    result = ( lookupTable[src][alpha] + lookupTable[dest][alpha ^ 0xFF] ) >> 8;

That gives you a 128 KiB lookup table (small enough to fit in cache). It also means 6 lookups per pixel.

So now there's 3 options:

use the "wrong" equation with SSE (6 multiplications and 3 shifts per pixel)
use the "right" equation with SSE multiplication and shift (7 multiplications and 3 shifts per pixel)
use the "right" equation with lookup table (6 lookups and 3 shifts per pixel)

Depending on what you're doing, none of these options might be the best way. For example, if alpha is the same for all pixels (and there's enough pixels), then you could generate a table to suit the alpha, then do this:

Code: Select all

    result = (lookupTable[src.red][dest.red] << 16) | (lookupTable[src.green][dest.green] << 8) | lookupTable[src.blue][dest.blue];

That gives you a 64 KiB lookup table and only 3 lookups per pixel.

Cheers,

Brendan

Owen · Post by **Owen** » Fri Jan 31, 2014 3:49 am

Brendan, I'm surprised. I thought you of all people would have noticed the glaring flaw.

Lets say we have a black background, white foreground, alpha 0.5 (Lets use 0-1 for our range; it makes things simpler)

I think you'll agree that we should end up with a grey, half intensity image?

The naive blending function will give you an image at value 0.5, which has a brightness of 0.72*. The correct answer is a value of 0.21*, brightness 0.5

I'd cover gamma, but don't have the time to do so right now. My general advice would be to do all of the maths with floats and unpack from sRGB/gamma-RGB at the start into linear RGB, and repack at the end.

* (Approx, for 2.2 gamma. Doing sRGB gamma requires more time than I have right now)

Brendan · Post by **Brendan** » Fri Jan 31, 2014 5:24 am

Hi,

Owen wrote:I'd cover gamma, but don't have the time to do so right now. My general advice would be to do all of the maths with floats and unpack from sRGB/gamma-RGB at the start into linear RGB, and repack at the end.

The last time I did anything with pretty graphics, I did all processing using "no gamma, CIE XYZ" and converted to sRGB (with gamma) after everything was drawn. It tends to simplify things a lot.

Of course if you're doing that you will need something better than 8:8:8 (e.g. floats) because "perceived intensity" isn't linear (resulting in a noticeable lack of precision if you use 8-bit integers during calculations); but then you need something better than 8:8:8 once you start looking at things like HDR anyway.

Cheers,

Brendan

PearOs · Post by **PearOs** » Fri Jan 31, 2014 10:55 am

Wow thanks for the reply's guys. I will try to answer everyone's stuff.

Ok so on the dividing by 256, I know about this and there is a slight offset I think by like 1 on each color, however that equation seems to work without any noticeable issues.

A table lookup, I have a way to actually remove almost all multiplication, and have a table size of 65536 bytes.
Here's how:

So we have ((Color * Alpha) + (Color2 * InvAlpha)) >> 8
Well if you did ((Color * Alpha)>>8) + ((Color2 * InvAlpha) >>8) is the same thing. As far as my math skills go.
So if you generated a table like:

byte[,] array = new byte[256, 256] and then access is by x = Alpha, and y = Color where color is either (R, G, B)
and setup the Array where you do

For (int color = 0; color < 256; color++) {
For (int alpha = 0; alpha < 256; alpha++) {
array[color, alpha] = ((color * alpha) >> 8);
}
}

This way when you loop all you have to do is supply the colors for the lookup table, and what I would do is just use a Memory amount of 65536
and use MemoryLocation = (Color << 8) + Alpha to find the index location.

Result, only one shift, and a few additions per color.

Not bad, in theory. :)

- Matt

JAAman · Post by **JAAman** » Fri Jan 31, 2014 11:34 am

So we have ((Color * Alpha) + (Color2 * InvAlpha)) >> 8
Well if you did ((Color * Alpha)>>8) + ((Color2 * InvAlpha) >>8) is the same thing. As far as my math skills go.

except for data loss...

you are loosing data when you shift, therefore when you shift before the addition, you are loosing part which might have caused a carry into the part that was kept:

Code: Select all

 10011110
+00001111
--------------
10101101 >> 4
--------------
00001010

where the new way would be:

Code: Select all

 10011110 >>4
+00001111 >>4
-------------------SHIFTED
 00001001
+00000000
----------------
00001001

so you get different results

PearOs · Post by **PearOs** » Fri Jan 31, 2014 12:14 pm

JAAman wrote:
So we have ((Color * Alpha) + (Color2 * InvAlpha)) >> 8
Well if you did ((Color * Alpha)>>8) + ((Color2 * InvAlpha) >>8) is the same thing. As far as my math skills go.
except for data loss...

you are loosing data when you shift, therefore when you shift before the addition, you are loosing part which might have caused a carry into the part that was kept:
Code: Select all
 10011110
+00001111
--------------
10101101 >> 4
--------------
00001010
where the new way would be:
Code: Select all
 10011110 >>4
+00001111 >>4
-------------------SHIFTED
 00001001
+00000000
----------------
00001001
so you get different results

Unless I divide by 255 instead of shifting when generating the table, that way technically there is no loss, and then I could just do one muliplication by 255 per color. Would the data loss be enough to cause a problem?

Or you could just use ushorts for the data table instead of bytes, that way each entry is (color * alpha) and then when you extract them and add and then do >> 8 without losing any extra data, just would be 131072 bytes which isn't too bad.

-Matt

PearOs · Post by **PearOs** » Sat Feb 01, 2014 3:35 am

Ok guys, so my results for Alpha Blending without something like MMX, well I literally reached the max speed you could achieve doing color per color, I used a byte lookup table and even on an i7 I was getting like 15 fps for a 600x400 picture that was alpha blended. However.. I attempted to use MMX, and I found this code:

Code: Select all

dest = alpha * (source - dest) / 255 + dest;

pxor mm3, mm3 

mov eax, dword ptr [dst] 
movd mm0, dword ptr [eax] 
movd mm1, dword ptr [color] //src 
movd mm2, dword ptr [alpha] 

punpcklbw mm0, mm3 //unpack dst to words 
punpcklbw mm1, mm3 //unpack color 
punpcklbw mm2, mm2 //unpack alpha 
punpcklbw mm2, mm2 
punpcklbw mm2, mm3 

psubusb mm1, mm0 //(color - dest) 
pmullw mm1, mm2 //alpha * (color - dest) 
psrlw    mm1, 8 //alpha * (color - dest) / 256 
paddusw mm1, mm0 //alpha * (color - dest) / 256 + dest 

packuswb mm1, mm3 
movd dword ptr [eax], mm1

However his equation is flawed the one that's simply and works well is dst = ((source * alpha) + (dst * invalpha)) >> 8

So I tried to convert mine into MMX...

Code: Select all

pxor mm3, mm3
movd MM0, dword[edi] ;Dest Color
movd MM1, dword[eax] ;Source Color
movd MM2, dword[Blit_32_TestBlend_Data] ;Alpha
movd MM4, dword[Blit_32_TestBlend_Data+4] ;InvAlpha
punpcklbw mm0, mm3
punpcklbw mm1, mm3 
punpcklbw mm2, mm2
punpcklbw mm2, mm2
punpcklbw mm2, mm3
punpcklbw mm4, mm4
punpcklbw mm4, mm4
punpcklbw mm4, mm3
 

pmullw mm1, mm2 ;Color1 * Alpha
pmullw mm0, mm4 ;Color2 * InvAlpha
paddusw mm1, mm0
psrlw mm0, 8 ;Result >>8
packuswb mm1, mm3 
movd dword [eax], mm1

However something is off and I don't understand MMX enough to be able to tell what exactly is going on. Anyone know MMX enough to know what's wrong?

Thanks guys, I attached a picture of the result.. It's ugly

bluemoon · Post by **bluemoon** » Sat Feb 01, 2014 3:48 am

I have a feeling that this is more suit for general programming board.

Anyway, check the manual for the meaning of each instruction, it should not be too difficult to decode what's going on.

However, I would suggest to forget MMX now and go directly for SSE and process multiple pixels at once (or convert to floats); machines that do not support SSE or even SSE2 is already many years old and I would consider them obsolete enough to just use the fallback x86 instructions, or otherwise write the MMX routines later.

Even better, if you can't write decent MMX/SSE code, turn on the SSE code generation flag on compiler and let them do the black magic.

PearOs · Post by **PearOs** » Sat Feb 01, 2014 3:51 am

bluemoon wrote:I have a feeling that this is more suit for general programming board.

Anyway, check the manual for the meaning of each instruction, it should not be too difficult to decode what's going on.

However, I would suggest to forget MMX now and go directly for SSE and process multiple pixels at once (or convert to floats); machines that do not support SSE or even SSE2 is already many years old and I would consider them obsolete enough to just use the fallback x86 instructions, or otherwise write the MMX routines later.

Even better, if you can't write decent MMX/SSE code, turn on the SSE code generation flag on compiler and let them do the black magic.

Well the compiler option sounds so good..... however I'm rolling my own compiler

so that doesn't work. I have read up on the MMX docs and I have a feeling its either the multiplying, adding, or when I unpack the registers. I would like to learn this though and not just go directly to SSE however I appreciate your advice. Fallback x86 instructions are good, I mean I can Alpha blend really fast right now, but its not fast enough to do over and over and over again. It would be better for like a one time thing. I would like real time alpha blending with MMX, and it's possible, I just need need to figure out how to use MMX. That's why I came here with it.

Thanks, Matt

bluemoon · Post by **bluemoon** » Sat Feb 01, 2014 4:51 am

I mean, MMX does not out-perform very much to x86 version on modern computer for such simple blender, the real bottleneck is on memory throughput, where you could further boost the performance by processing more pixels at once; and that's the idea of SIMD.

Combuster · Post by **Combuster** » Sat Feb 01, 2014 5:31 am

the real bottleneck is on memory throughput

Even so, the FPS is low considering I can do software raytracing at higher resolutions and framerates.

But bluemoon has a point - it may not be the alphablending that is slow. Do you get significantly higher framerates if you do solid screens with a rotating colour (like clear_screen(colour++); ), and the same if you fill the screen the black in one go followed by an XOR with a rotating colour (which would yield the exact same visual result but also demonstrates you the real memory bottleneck if it exists)

PearOs · Post by **PearOs** » Sat Feb 01, 2014 2:42 pm

Combuster wrote:
the real bottleneck is on memory throughput
Even so, the FPS is low considering I can do software raytracing at higher resolutions and framerates.

But bluemoon has a point - it may not be the alphablending that is slow. Do you get significantly higher framerates if you do solid screens with a rotating colour (like clear_screen(colour++); ), and the same if you fill the screen the black in one go followed by an XOR with a rotating colour (which would yield the exact same visual result but also demonstrates you the real memory bottleneck if it exists)

Oh I did not think of that. Yeah I am reaching the memory bottle neck then. Alright so can someone show me how to write this equation: ((Color * Alpha) + (Color2 * (InvAlpha)) >> 8 In SSE, or walk me through it? Would it be good to do two pixels at a time in SSE, or three or four?

Thanks a ton for the help guys, Matt

P.S. Thanks Combuster, and everyone else

OSDev.org

Alpha Blending - A painful process

Alpha Blending - A painful process

Re: Alpha Blending - A painful process

Re: Alpha Blending - A painful process

Re: Alpha Blending - A painful process

Re: Alpha Blending - A painful process

Re: Alpha Blending - A painful process

Re: Alpha Blending - A painful process

Re: Alpha Blending - A painful process

Re: Alpha Blending - A painful process

Re: Alpha Blending - A painful process

Re: Alpha Blending - A painful process

Re: Alpha Blending - A painful process

Re: Alpha Blending - A painful process

Re: Alpha Blending - A painful process

Re: Alpha Blending - A painful process