Alpha Blending - A painful process
Alpha Blending - A painful process
Hey guys, so I have been working on Alpha blending. Well it works, and my assembly routine without MMX or SSE is 35 instructions long total. Not too bad I guess. However its not the fastest thing in the world, and I knew this from the git go.
So here's my equation: (Source Color * Alpha + Dest Color * (255 - Alpha)) >> 8 where Source Color is a Color component of a Pixel, ex R or G.
Would it make a difference to use a lookup table instead? To kind of remove some of the multiplications or would it be better to try to use SEE to do it since it has large registers?
Thanks guys, Matt
So here's my equation: (Source Color * Alpha + Dest Color * (255 - Alpha)) >> 8 where Source Color is a Color component of a Pixel, ex R or G.
Would it make a difference to use a lookup table instead? To kind of remove some of the multiplications or would it be better to try to use SEE to do it since it has large registers?
Thanks guys, Matt
Re: Alpha Blending - A painful process
Don't forget the most obvious optimisation
I think that lookup tables would require some sort of multiply.
Code: Select all
if (alpha == 0) {
/* Do nothing */
}
else if (alpha == 0xff) {
d = s;
}
else {
/* Your blending code */
}
If a trainstation is where trains stop, what is a workstation ?
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
Re: Alpha Blending - A painful process
In your case, is this ever going to change?PearOs wrote:Source Color * Alpha
SSE; And why not let the compiler handle that for you?try to use SEE
Re: Alpha Blending - A painful process
Hi,
The correct answer would be 255, but this equation gave us 254. What happened is that your shift is dividing by 256 but it should be dividing by 255 instead. The correct equation would be: ( src * alpha + dest * (255 - alpha) ) / 255
Division is slow and you want to avoid it. Multiplications aren't too fast either. Something like "result = lookupTable[src][dest][alpha]" would cost 16 MiB (and that's too much).
Instead, you could consider something like this:
That looks ugly; but with a lookup table you can do this:
That gives you a 128 KiB lookup table (small enough to fit in cache). It also means 6 lookups per pixel.
So now there's 3 options:
That gives you a 64 KiB lookup table and only 3 lookups per pixel.
Cheers,
Brendan
First decide if the equation is right. For example, if the source colour is 255, dest colour is 255 and the alpha is 128; then:PearOs wrote:Hey guys, so I have been working on Alpha blending. Well it works, and my assembly routine without MMX or SSE is 35 instructions long total. Not too bad I guess. However its not the fastest thing in the world, and I knew this from the git go.
So here's my equation: (Source Color * Alpha + Dest Color * (255 - Alpha)) >> 8 where Source Color is a Color component of a Pixel, ex R or G.
Would it make a difference to use a lookup table instead? To kind of remove some of the multiplications or would it be better to try to use SEE to do it since it has large registers?
Code: Select all
( 255*128 + 255*(255-128) ) >> 8
= ( 255*128 + 255*127 ) >> 8
= ( 32640 + 32385 ) >> 8
= 65025 >> 8
= 254
Division is slow and you want to avoid it. Multiplications aren't too fast either. Something like "result = lookupTable[src][dest][alpha]" would cost 16 MiB (and that's too much).
Instead, you could consider something like this:
Code: Select all
result = ( src * (alpha << 8 + alpha) + dest * ( (alpha ^ 0xFF) << 8 + (alpha ^ 0xFF) ) ) >> 16;
Code: Select all
result = ( lookupTable[src][alpha] + lookupTable[dest][alpha ^ 0xFF] ) >> 8;
So now there's 3 options:
- use the "wrong" equation with SSE (6 multiplications and 3 shifts per pixel)
- use the "right" equation with SSE multiplication and shift (7 multiplications and 3 shifts per pixel)
- use the "right" equation with lookup table (6 lookups and 3 shifts per pixel)
Code: Select all
result = (lookupTable[src.red][dest.red] << 16) | (lookupTable[src.green][dest.green] << 8) | lookupTable[src.blue][dest.blue];
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
- Owen
- Member
- Posts: 1700
- Joined: Fri Jun 13, 2008 3:21 pm
- Location: Cambridge, United Kingdom
- Contact:
Re: Alpha Blending - A painful process
Brendan, I'm surprised. I thought you of all people would have noticed the glaring flaw.
Lets say we have a black background, white foreground, alpha 0.5 (Lets use 0-1 for our range; it makes things simpler)
I think you'll agree that we should end up with a grey, half intensity image?
The naive blending function will give you an image at value 0.5, which has a brightness of 0.72*. The correct answer is a value of 0.21*, brightness 0.5
I'd cover gamma, but don't have the time to do so right now. My general advice would be to do all of the maths with floats and unpack from sRGB/gamma-RGB at the start into linear RGB, and repack at the end.
* (Approx, for 2.2 gamma. Doing sRGB gamma requires more time than I have right now)
Lets say we have a black background, white foreground, alpha 0.5 (Lets use 0-1 for our range; it makes things simpler)
I think you'll agree that we should end up with a grey, half intensity image?
The naive blending function will give you an image at value 0.5, which has a brightness of 0.72*. The correct answer is a value of 0.21*, brightness 0.5
I'd cover gamma, but don't have the time to do so right now. My general advice would be to do all of the maths with floats and unpack from sRGB/gamma-RGB at the start into linear RGB, and repack at the end.
* (Approx, for 2.2 gamma. Doing sRGB gamma requires more time than I have right now)
Re: Alpha Blending - A painful process
Hi,
Of course if you're doing that you will need something better than 8:8:8 (e.g. floats) because "perceived intensity" isn't linear (resulting in a noticeable lack of precision if you use 8-bit integers during calculations); but then you need something better than 8:8:8 once you start looking at things like HDR anyway.
Cheers,
Brendan
The last time I did anything with pretty graphics, I did all processing using "no gamma, CIE XYZ" and converted to sRGB (with gamma) after everything was drawn. It tends to simplify things a lot.Owen wrote:I'd cover gamma, but don't have the time to do so right now. My general advice would be to do all of the maths with floats and unpack from sRGB/gamma-RGB at the start into linear RGB, and repack at the end.
Of course if you're doing that you will need something better than 8:8:8 (e.g. floats) because "perceived intensity" isn't linear (resulting in a noticeable lack of precision if you use 8-bit integers during calculations); but then you need something better than 8:8:8 once you start looking at things like HDR anyway.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Alpha Blending - A painful process
Wow thanks for the reply's guys. I will try to answer everyone's stuff.
Ok so on the dividing by 256, I know about this and there is a slight offset I think by like 1 on each color, however that equation seems to work without any noticeable issues.
A table lookup, I have a way to actually remove almost all multiplication, and have a table size of 65536 bytes.
Here's how:
So we have ((Color * Alpha) + (Color2 * InvAlpha)) >> 8
Well if you did ((Color * Alpha)>>8) + ((Color2 * InvAlpha) >>8) is the same thing. As far as my math skills go.
So if you generated a table like:
byte[,] array = new byte[256, 256] and then access is by x = Alpha, and y = Color where color is either (R, G, B)
and setup the Array where you do
For (int color = 0; color < 256; color++) {
For (int alpha = 0; alpha < 256; alpha++) {
array[color, alpha] = ((color * alpha) >> 8);
}
}
This way when you loop all you have to do is supply the colors for the lookup table, and what I would do is just use a Memory amount of 65536
and use MemoryLocation = (Color << 8) + Alpha to find the index location.
Result, only one shift, and a few additions per color.
Not bad, in theory. :)
- Matt
Ok so on the dividing by 256, I know about this and there is a slight offset I think by like 1 on each color, however that equation seems to work without any noticeable issues.
A table lookup, I have a way to actually remove almost all multiplication, and have a table size of 65536 bytes.
Here's how:
So we have ((Color * Alpha) + (Color2 * InvAlpha)) >> 8
Well if you did ((Color * Alpha)>>8) + ((Color2 * InvAlpha) >>8) is the same thing. As far as my math skills go.
So if you generated a table like:
byte[,] array = new byte[256, 256] and then access is by x = Alpha, and y = Color where color is either (R, G, B)
and setup the Array where you do
For (int color = 0; color < 256; color++) {
For (int alpha = 0; alpha < 256; alpha++) {
array[color, alpha] = ((color * alpha) >> 8);
}
}
This way when you loop all you have to do is supply the colors for the lookup table, and what I would do is just use a Memory amount of 65536
and use MemoryLocation = (Color << 8) + Alpha to find the index location.
Result, only one shift, and a few additions per color.
Not bad, in theory. :)
- Matt
Re: Alpha Blending - A painful process
except for data loss...So we have ((Color * Alpha) + (Color2 * InvAlpha)) >> 8
Well if you did ((Color * Alpha)>>8) + ((Color2 * InvAlpha) >>8) is the same thing. As far as my math skills go.
you are loosing data when you shift, therefore when you shift before the addition, you are loosing part which might have caused a carry into the part that was kept:
Code: Select all
10011110
+00001111
--------------
10101101 >> 4
--------------
00001010
Code: Select all
10011110 >>4
+00001111 >>4
-------------------SHIFTED
00001001
+00000000
----------------
00001001
Re: Alpha Blending - A painful process
Unless I divide by 255 instead of shifting when generating the table, that way technically there is no loss, and then I could just do one muliplication by 255 per color. Would the data loss be enough to cause a problem?JAAman wrote:except for data loss...So we have ((Color * Alpha) + (Color2 * InvAlpha)) >> 8
Well if you did ((Color * Alpha)>>8) + ((Color2 * InvAlpha) >>8) is the same thing. As far as my math skills go.
you are loosing data when you shift, therefore when you shift before the addition, you are loosing part which might have caused a carry into the part that was kept:
where the new way would be:Code: Select all
10011110 +00001111 -------------- 10101101 >> 4 -------------- 00001010
so you get different resultsCode: Select all
10011110 >>4 +00001111 >>4 -------------------SHIFTED 00001001 +00000000 ---------------- 00001001
Or you could just use ushorts for the data table instead of bytes, that way each entry is (color * alpha) and then when you extract them and add and then do >> 8 without losing any extra data, just would be 131072 bytes which isn't too bad.
-Matt
Re: Alpha Blending - A painful process
Ok guys, so my results for Alpha Blending without something like MMX, well I literally reached the max speed you could achieve doing color per color, I used a byte lookup table and even on an i7 I was getting like 15 fps for a 600x400 picture that was alpha blended. However.. I attempted to use MMX, and I found this code:
However his equation is flawed the one that's simply and works well is dst = ((source * alpha) + (dst * invalpha)) >> 8
So I tried to convert mine into MMX...
However something is off and I don't understand MMX enough to be able to tell what exactly is going on. Anyone know MMX enough to know what's wrong?
Thanks guys, I attached a picture of the result.. It's ugly
Code: Select all
dest = alpha * (source - dest) / 255 + dest;
pxor mm3, mm3
mov eax, dword ptr [dst]
movd mm0, dword ptr [eax]
movd mm1, dword ptr [color] //src
movd mm2, dword ptr [alpha]
punpcklbw mm0, mm3 //unpack dst to words
punpcklbw mm1, mm3 //unpack color
punpcklbw mm2, mm2 //unpack alpha
punpcklbw mm2, mm2
punpcklbw mm2, mm3
psubusb mm1, mm0 //(color - dest)
pmullw mm1, mm2 //alpha * (color - dest)
psrlw mm1, 8 //alpha * (color - dest) / 256
paddusw mm1, mm0 //alpha * (color - dest) / 256 + dest
packuswb mm1, mm3
movd dword ptr [eax], mm1
So I tried to convert mine into MMX...
Code: Select all
pxor mm3, mm3
movd MM0, dword[edi] ;Dest Color
movd MM1, dword[eax] ;Source Color
movd MM2, dword[Blit_32_TestBlend_Data] ;Alpha
movd MM4, dword[Blit_32_TestBlend_Data+4] ;InvAlpha
punpcklbw mm0, mm3
punpcklbw mm1, mm3
punpcklbw mm2, mm2
punpcklbw mm2, mm2
punpcklbw mm2, mm3
punpcklbw mm4, mm4
punpcklbw mm4, mm4
punpcklbw mm4, mm3
pmullw mm1, mm2 ;Color1 * Alpha
pmullw mm0, mm4 ;Color2 * InvAlpha
paddusw mm1, mm0
psrlw mm0, 8 ;Result >>8
packuswb mm1, mm3
movd dword [eax], mm1
Thanks guys, I attached a picture of the result.. It's ugly
Re: Alpha Blending - A painful process
I have a feeling that this is more suit for general programming board.
Anyway, check the manual for the meaning of each instruction, it should not be too difficult to decode what's going on.
However, I would suggest to forget MMX now and go directly for SSE and process multiple pixels at once (or convert to floats); machines that do not support SSE or even SSE2 is already many years old and I would consider them obsolete enough to just use the fallback x86 instructions, or otherwise write the MMX routines later.
Even better, if you can't write decent MMX/SSE code, turn on the SSE code generation flag on compiler and let them do the black magic.
Anyway, check the manual for the meaning of each instruction, it should not be too difficult to decode what's going on.
However, I would suggest to forget MMX now and go directly for SSE and process multiple pixels at once (or convert to floats); machines that do not support SSE or even SSE2 is already many years old and I would consider them obsolete enough to just use the fallback x86 instructions, or otherwise write the MMX routines later.
Even better, if you can't write decent MMX/SSE code, turn on the SSE code generation flag on compiler and let them do the black magic.
Re: Alpha Blending - A painful process
Well the compiler option sounds so good..... however I'm rolling my own compiler so that doesn't work. I have read up on the MMX docs and I have a feeling its either the multiplying, adding, or when I unpack the registers. I would like to learn this though and not just go directly to SSE however I appreciate your advice. Fallback x86 instructions are good, I mean I can Alpha blend really fast right now, but its not fast enough to do over and over and over again. It would be better for like a one time thing. I would like real time alpha blending with MMX, and it's possible, I just need need to figure out how to use MMX. That's why I came here with it.bluemoon wrote:I have a feeling that this is more suit for general programming board.
Anyway, check the manual for the meaning of each instruction, it should not be too difficult to decode what's going on.
However, I would suggest to forget MMX now and go directly for SSE and process multiple pixels at once (or convert to floats); machines that do not support SSE or even SSE2 is already many years old and I would consider them obsolete enough to just use the fallback x86 instructions, or otherwise write the MMX routines later.
Even better, if you can't write decent MMX/SSE code, turn on the SSE code generation flag on compiler and let them do the black magic.
Thanks, Matt
Re: Alpha Blending - A painful process
I mean, MMX does not out-perform very much to x86 version on modern computer for such simple blender, the real bottleneck is on memory throughput, where you could further boost the performance by processing more pixels at once; and that's the idea of SIMD.
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
Re: Alpha Blending - A painful process
Even so, the FPS is low considering I can do software raytracing at higher resolutions and framerates.the real bottleneck is on memory throughput
But bluemoon has a point - it may not be the alphablending that is slow. Do you get significantly higher framerates if you do solid screens with a rotating colour (like clear_screen(colour++); ), and the same if you fill the screen the black in one go followed by an XOR with a rotating colour (which would yield the exact same visual result but also demonstrates you the real memory bottleneck if it exists)
Re: Alpha Blending - A painful process
Oh I did not think of that. Yeah I am reaching the memory bottle neck then. Alright so can someone show me how to write this equation: ((Color * Alpha) + (Color2 * (InvAlpha)) >> 8 In SSE, or walk me through it? Would it be good to do two pixels at a time in SSE, or three or four?Combuster wrote:Even so, the FPS is low considering I can do software raytracing at higher resolutions and framerates.the real bottleneck is on memory throughput
But bluemoon has a point - it may not be the alphablending that is slow. Do you get significantly higher framerates if you do solid screens with a rotating colour (like clear_screen(colour++); ), and the same if you fill the screen the black in one go followed by an XOR with a rotating colour (which would yield the exact same visual result but also demonstrates you the real memory bottleneck if it exists)
Thanks a ton for the help guys, Matt
P.S. Thanks Combuster, and everyone else