For the ASM coders, a little game: "Beat the compiler!"

JamesM · Post by **JamesM** » Thu May 28, 2009 3:31 am

Hi,

I'm familiar with but not in any way an expert in x86 protected mode assembly - I have a routine, written in C++, that needs to be fast. It's fairly simple, monolithic (no jumps) and I ran it through G++ 4.3.2 with -O3 (pretty much full optimisation). I have the original source code and the generated assembly. I'm not asking anyone to spend any of their precious time actually making a better version of this - all I'm asking is, with hand-crafted assembly, do you think that this target code can be improved?

I ask because there seems an *awful* lot of redundant moves in there - granted it needs plenty because of imul only dealing with %edx but still...

Anyways, without further ado, the source (I should also mention that this is simplified to try and grab more out of G++'s optimiser - there is a clamp missing but that would cause jumps):

Code: Select all

uint32_t FreetypeXterm::interpolateColour(uint32_t f, uint32_t b, uint8_t a)
{
    Display::PixelFormat pf = m_Mode.pf;

    // Extract the red field.
    uint8_t range = 1 << pf.mRed;
    uint8_t fr = (f >> pf.pRed) & (range-1);
    uint8_t br = (b >> pf.pRed) & (range-1);

    uint8_t r = (fr * a + br * (256-a)) / 256;

    // Green
    range = 1 << pf.mGreen;
    uint8_t fg = (f >> pf.pGreen) & (range-1);
    uint8_t bg = (b >> pf.pGreen) & (range-1);

    uint8_t g = (fg * a + bg * (256-a)) / 256;

    // Blue
    range = 1 << pf.mBlue;
    uint8_t fb = (f >> pf.pBlue) & (range-1);
    uint8_t bb = (b >> pf.pBlue) & (range-1);

    uint8_t bl = (fb * a + bb * (256-a)) / 256;

    return 0 |
         (static_cast<uint32_t>(r) << pf.pRed) |
         (static_cast<uint32_t>(g) << pf.pGreen) |
         (static_cast<uint32_t>(bl) << pf.pBlue);
}

And the generated target code:

Code: Select all

00000000 <_ZN13FreetypeXterm17interpolateColourEmmh>:
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   57                      push   %edi
   4:   56                      push   %esi
   5:   53                      push   %ebx
   6:   83 ec 24                sub    $0x24,%esp
   9:   8d 7d e4                lea    -0x1c(%ebp),%edi
   c:   8b 75 08                mov    0x8(%ebp),%esi
   f:   83 c6 18                add    $0x18,%esi
  12:   b9 04 00 00 00          mov    $0x4,%ecx
  17:   f3 a5                   rep movsl %ds:(%esi),%es:(%edi)
  19:   31 c0                   xor    %eax,%eax
  1b:   8a 45 e5                mov    -0x1b(%ebp),%al
  1e:   89 45 d4                mov    %eax,-0x2c(%ebp)
  21:   31 c9                   xor    %ecx,%ecx
  23:   8a 4d e4                mov    -0x1c(%ebp),%cl
  26:   bb 01 00 00 00          mov    $0x1,%ebx
  2b:   89 d8                   mov    %ebx,%eax
  2d:   d3 e0                   shl    %cl,%eax
  2f:   8d 48 ff                lea    -0x1(%eax),%ecx
  32:   88 4d db                mov    %cl,-0x25(%ebp)
  35:   31 c0                   xor    %eax,%eax
  37:   8a 45 14                mov    0x14(%ebp),%al
  3a:   89 45 dc                mov    %eax,-0x24(%ebp)
  3d:   bf 00 01 00 00          mov    $0x100,%edi
  42:   29 c7                   sub    %eax,%edi
  44:   8a 55 e7                mov    -0x19(%ebp),%dl
  47:   81 e2 ff 00 00 00       and    $0xff,%edx
  4d:   89 d6                   mov    %edx,%esi
  4f:   31 c9                   xor    %ecx,%ecx
  51:   8a 4d e6                mov    -0x1a(%ebp),%cl
  54:   89 d8                   mov    %ebx,%eax
  56:   d3 e0                   shl    %cl,%eax
  58:   8d 40 ff                lea    -0x1(%eax),%eax
  5b:   88 45 d3                mov    %al,-0x2d(%ebp)
  5e:   31 d2                   xor    %edx,%edx
  60:   8a 55 e9                mov    -0x17(%ebp),%dl
  63:   89 55 e0                mov    %edx,-0x20(%ebp)
  66:   31 c0                   xor    %eax,%eax
  68:   8a 45 e8                mov    -0x18(%ebp),%al
  6b:   88 c1                   mov    %al,%cl
  6d:   d3 e3                   shl    %cl,%ebx
  6f:   4b                      dec    %ebx
  70:   8b 45 10                mov    0x10(%ebp),%eax
  73:   89 f1                   mov    %esi,%ecx
  75:   d3 e8                   shr    %cl,%eax
  77:   22 45 d3                and    -0x2d(%ebp),%al
  7a:   25 ff 00 00 00          and    $0xff,%eax
  7f:   0f af c7                imul   %edi,%eax
  82:   8b 55 0c                mov    0xc(%ebp),%edx
  85:   d3 ea                   shr    %cl,%edx
  87:   20 55 d3                and    %dl,-0x2d(%ebp)
  8a:   31 c9                   xor    %ecx,%ecx
  8c:   8a 4d d3                mov    -0x2d(%ebp),%cl
  8f:   0f af 4d dc             imul   -0x24(%ebp),%ecx
  93:   01 c8                   add    %ecx,%eax
  95:   0f b6 c4                movzbl %ah,%eax
  98:   89 f1                   mov    %esi,%ecx
  9a:   d3 e0                   shl    %cl,%eax
  9c:   8b 55 10                mov    0x10(%ebp),%edx
  9f:   8a 4d d4                mov    -0x2c(%ebp),%cl
  a2:   d3 ea                   shr    %cl,%edx
  a4:   22 55 db                and    -0x25(%ebp),%dl
  a7:   81 e2 ff 00 00 00       and    $0xff,%edx
  ad:   0f af d7                imul   %edi,%edx
  b0:   8b 75 0c                mov    0xc(%ebp),%esi
  b3:   d3 ee                   shr    %cl,%esi
  b5:   89 f1                   mov    %esi,%ecx
  b7:   20 4d db                and    %cl,-0x25(%ebp)
  ba:   31 c9                   xor    %ecx,%ecx
  bc:   8a 4d db                mov    -0x25(%ebp),%cl
  bf:   0f af 4d dc             imul   -0x24(%ebp),%ecx
  c3:   01 ca                   add    %ecx,%edx
  c5:   0f b6 d6                movzbl %dh,%edx
  c8:   8a 4d d4                mov    -0x2c(%ebp),%cl
  cb:   d3 e2                   shl    %cl,%edx
  cd:   09 d0                   or     %edx,%eax
  cf:   8a 4d e0                mov    -0x20(%ebp),%cl
  d2:   d3 6d 10                shrl   %cl,0x10(%ebp)
  d5:   8a 55 10                mov    0x10(%ebp),%dl
  d8:   21 da                   and    %ebx,%edx
  da:   81 e2 ff 00 00 00       and    $0xff,%edx
  e0:   0f af d7                imul   %edi,%edx
  e3:   d3 6d 0c                shrl   %cl,0xc(%ebp)
  e6:   23 5d 0c                and    0xc(%ebp),%ebx
  e9:   81 e3 ff 00 00 00       and    $0xff,%ebx
  ef:   0f af 5d dc             imul   -0x24(%ebp),%ebx
  f3:   01 da                   add    %ebx,%edx
  f5:   0f b6 d6                movzbl %dh,%edx
  f8:   d3 e2                   shl    %cl,%edx
  fa:   09 d0                   or     %edx,%eax
  fc:   83 c4 24                add    $0x24,%esp
  ff:   5b                      pop    %ebx
 100:   5e                      pop    %esi
 101:   5f                      pop    %edi
 102:   5d                      pop    %ebp
 103:   c3                      ret
 104:   8d b6 00 00 00 00       lea    0x0(%esi),%esi
 10a:   8d bf 00 00 00 00       lea    0x0(%edi),%edi

Cheers!

James

Combuster · Post by **Combuster** » Thu May 28, 2009 3:55 am

- Any reason for avoiding vector instructions?
- If not, you might efficiently multiply red and blue in one packed dword due to range of alpha (putting a limit of 16 bits)
- Alternatively, precompute the color masks, don't shift the components but only mask them.
- there's no advantage being taken of -fomit-frame-pointer.
- You can inline pixel formats, or use lookup tables rather than shifts for a component's bitmasks.

So yes, there's room for improvement. Many of them based on the things GCC can not know.

Not to mention there are likely a few whole-program-optimisations.

JamesM · Post by **JamesM** » Thu May 28, 2009 4:16 am

Hi,

Combuster wrote:- Any reason for avoiding vector instructions?

Nope, if they're available on all x86's, not at all.

- If not, you might efficiently multiply red and blue in one packed dword due to range of alpha (putting a limit of 16 bits)

I can't impose that limit, however since 24-bit and 32-bit modes are so damn simple, it seems reasonable to special-case the 16-bit calculations.

- Alternatively, precompute the color masks, don't shift the components but only mask them.

An excellent idea.

EDIT: In retrospect, this could cause a possible overflow with RGBA formats - multiplying the R component (highest-order 8 bits) by 256, then dividing by 256 again afterwards causes data loss.

- there's no advantage being taken of -fomit-frame-pointer.

It would save about 4 instructions total, and wouldn't be worth it. -fomit-frame-pointer is enabled in our release build, but for debug builds we keep it in for ease of backtracing.

- You can inline pixel formats, or use lookup tables rather than shifts for a component's bitmasks.

Again, a good suggestion. I'll look into it.

So yes, there's room for improvement. Many of them based on the things GCC can not know.

Not to mention there are likely a few whole-program-optimisations.

Is that a subtle way of saying that my code sucks?

Cheers,

James

Combuster · Post by **Combuster** » Thu May 28, 2009 4:51 am

JamesM wrote:
Combuster wrote:- Any reason for avoiding vector instructions?
Nope, if they're available on all x86's, not at all.

They're not available on all x86, that's my problem. Do you mind substitute calls to point to a mmx-enabled version?

- If not, you might efficiently multiply red and blue in one packed dword due to range of alpha (putting a limit of 16 bits)
I can't impose that limit, however since 24-bit and 32-bit modes are so damn simple, it seems reasonable to special-case the 16-bit calculations.

I meant, non-mmx vector multiply:
(00000000rrrrrrrr00000000bbbbbbbb * a + 00000000rrrrrrrr00000000bbbbbbbb * (256-a)) & ff00ff00

- Alternatively, precompute the color masks, don't shift the components but only mask them.
An excellent idea.

EDIT: In retrospect, this could cause a possible overflow with RGBA formats - multiplying the R component (highest-order 8 bits) by 256, then dividing by 256 again afterwards causes data loss.

for x86, alpha is bits 24-31. And if you don't want to assume that, the mul generates unpacks to a 64-bit number, div takes a 64-bit number.

- there's no advantage being taken of -fomit-frame-pointer.
It would save about 4 instructions total, and wouldn't be worth it. -fomit-frame-pointer is enabled in our release build, but for debug builds we keep it in for ease of backtracing.

And you missed the fact that EBP is then free as a general purpose register, reducing register strain and keeping another variable out of memory.

Not to mention there are likely a few whole-program-optimisations.

Is that a subtle way of saying that my code sucks?

No, that you should use LLVM

(And to be honest, I was under the impression I was looking at freetype or somesuch)

In reality, the hint was that you shouldn't limit your focus on one function but view it in context.

whowhatwhere · Post by **whowhatwhere** » Thu May 28, 2009 5:22 am

Good SSE Reference
Better SSE Reference

bewing · Post by **bewing** » Thu May 28, 2009 7:19 am

Can the # of bits in each color bitfield be different? Or must they be identical?
As Combuster said, in pf you should precompute range-1 (for each color, if necessary -- or once if the # of bits must be identical). I disagree with him about masking in place, because that causes divisions that can be avoided easily. How about something like this ASM (intel syntax) pseudocode?

Code: Select all

	push eax		; storage for return value
	pusha
	mov esi, "f"		; get values off stack
	mov edi, "b"
	mov ebx, "a"
	mov bh, 256
	sub bh, bl		; bh = 256 - a
	mov edx, "pf.range-1 mask"
	; calculate blue FIRST, because it needs no shift
	mov eax, esi
	and eax, edx		; clear upper eax, use range mask
	mul bl
	mov ecx, eax		; copy ax into cx (useful val is in ch)
	mov eax, edi
	and al, dl
	mul bh
	add ecx, eax		; combine the fg and bg blue values
	movzx ebp, ch		; divide by 256, store in ebp
	mov cl, "pf.green shift"
	push ecx		; save the green shift for a moment
	shr esi, cl
	shr edi, cl
	mov eax, esi
	and al, dl
	mul bl
	mov ecx, eax		; copy ax into cx (useful val is in ch)
	mov eax, edi
	and al, dl
	mul bh
	add ecx, eax		; combine the fg and bg blue values
	movzx eax, ch		; divide by 256, put in al temporarily
	pop ecx			; recover the green shift
	shl eax, cl
	or ebp, eax		; merge green into blue
	; if the shifts are the same size, then the red shift = the green shift!
	push ecx		; save the red shift for a moment
	; note: if the shifts are the same size, then use mov ecx, [esp] and not "pop" above, and get rid of this push
	shr esi, cl
	shr edi, cl
	mov eax, esi
	and al, dl
	mul bl
	mov ecx, eax		; copy ax into dx (useful val is in ch)
	mov eax, edi
	and al, dl
	mul bh
	add ecx, eax		; combine the fg and bg blue values
	movzx eax, ch		; divide by 256, put in al temporarily
	pop ecx			; recover the red shift
	shl ecx, 1		; times 2
	shl eax, cl
	or ebp, eax		; merge red into green + blue
	mov [esp + 28], ebp	; put the return value back on the stack
	popa
	pop eax		; return value
	ret

No divisions, very few memory operations -- should be fast. If you do a speed test of the two against each other, I'd be very interested in hearing the results!

Combuster · Post by **Combuster** » Thu May 28, 2009 8:47 am

@bewing: div = slow; shr = fast (and does the same)

32 bits BGRA or RGBA, 24 bits RGB or BGR: (this one can be made really short with mmx) (v2)

Code: Select all

push esi
push edi
push ebp
mov eax, "f"
mov esi, "b"
movzx ecx, byte "a"
mov edx, eax
mov edi, esi
and eax, 0x00ff00ff
and edx, 0x0000ff00
imul eax, ecx
and esi, 0x00ff00ff
mov ebp, 256
imul edx, ecx
sub ebp, ecx
and edi, 0x0000ff00
imul esi, ebp
imul edi, ebp
add eax, esi
and eax, 0xff00ff00
add edx, edi
shr eax, 8
bswap edx
pop ebp
pop edi
pop esi
mov ah, dh
ret

16 bits (565), can be made 15 bits using other constants for the ANDs.

Code: Select all

push esi
push edi
push ebp
mov si, "f"
mov di, "b"
bswap esi  
bswap edi
movzx ecx, byte "a"
mov si, "f"
mov di, "b"
mov eax, esi
mov edx, edi
and esi, 0x00f8001f 
and eax, 0x00007e0
mov ebp, 256
imul esi, ecx
sub ebp, ecx
and edi, 0x00f8001f
imul eax, ecx
and edx, 0x000007e0
imul edi, ebp
imul edx, ebp
add esi, edi
add edx, eax
shr esi, 8
and edx, 0x0007e000
mov eax, esi
shr edx, 8
bswap esi
or eax, edx ' collect components
and esi, 0x0000f800
pop ebp
or eax, esi
pop edi
pop esi
ret

(neither has been tested, don't have an assembler here atm)

bewing · Post by **bewing** » Thu May 28, 2009 12:52 pm

Heh. OK, OK.

But I think I should get some points for doubling the efficiency of GCC's best shot, in ten minutes, using JamesM's original algorithm -- when I was sleepy.

And why are you and gcc using signed multiply on unsigned quantities?

Combuster · Post by **Combuster** » Thu May 28, 2009 4:11 pm

why is there only SHL and no SAL, but there is SHR and SAR

Observe the unsigned multiplication, followed by truncation:

Code: Select all

        00001000 8     8
        11111111 255  -1  X
     ----------------------
     11111111000 2040 
     ----------------------
        11111000 248  -8

multiplication is, like addition, sign-insensitive

bewing · Post by **bewing** » Thu May 28, 2009 4:30 pm

Combuster wrote:multiplication is, like addition, sign-insensitive

While admittedly I have not run the numbers for all possible cases, the very fact that there exist two entirely different sets of opcodes for the two operations (MUL and IMUL) makes me doubt that statement is true in all cases. I don't think they did it for the slightly different EFLAGS results, or for their health.

Combuster · Post by **Combuster** » Thu May 28, 2009 4:41 pm

they exist for a different reason, sign extension. As part of some multiplications, the storage size increases. At that point, it does matter:

Code: Select all

        11111000 -8 or 248
0000000011111000 248
1111111111111000 -8

for imul reg,reg the storage size is constant, so there is no sign extension, and hence no difference between signed and unsigned multiplication.

JamesM · Post by **JamesM** » Fri May 29, 2009 3:31 am

Cheers guys, much appreciated. I'm going to have the above algorithm as the generic one and then optimised (possibly in assembler, ruthlessly taken from this thread) routines for common formats.

Cheers again!

Brendan · Post by **Brendan** » Fri May 29, 2009 5:33 am

Hi,

Did anyone notice the bug?

For 15-bpp, if the foreground is white and the background is white, and the alpha is 50%, then the answer is....

Light grey...

Huh?

The problem is precision loss. The code does "31 / 2 + 31 / 2 = 15 + 15 = 30" instead of doing "31 / 2 + 31 / 2 = 15.5 + 15.5 = 31".

You could play with fixed point (e.g. "foreground * 65536 / alpha + background * 65536 / alpha = result * 65536") to reduce the precision loss, but ...

How about:

Code: Select all

typedef struct {
	float blue;
	float green;
	float red;
	float alpha;
} PIXEL_DATA;

void FreetypeXterm::interpolateColour(PIXEL_DATA *foreground, PIXEL_DATA *background, float alpha);

It looks slower, until you realize that you don't need to separate the red/green/blue from a single packed integer anymore (which is what caused most of the hassle in the original code); and that using this format for pixel data lets you load a pixel's data into a (packed single precision) SSE register and do the majority of it with about five SSE instructions (e.g. "MULPS", "SUBPS", "MULPS", "ADDPS")...

Basically what I'm suggesting is to convert colours (once) from whatever format they're in into a standard floating point pixel format; then do *all* calculations, etc using this floating point pixel format; then when everything is done convert it back into the desired pixel format as the last step. The other advantages here is that you can operate on mixed pixel formats without any problem (e.g. 15-bpp foreground and 24-bpp background used to create a 16-bpp result), and it can easily be extended to high-end pixel formats and future pixel formats (some high-end video cards support 64-bpp and some support floating point pixel formats).

Cheers,

Brendan

Love4Boobies · Post by **Love4Boobies** » Fri May 29, 2009 6:07 am

Modern x86 CPUs treat DIV as shifts when they can so there's no performance loss when you use that

OSDev.org

For the ASM coders, a little game: "Beat the compiler!"

For the ASM coders, a little game: "Beat the compiler!"

Re: For the ASM coders, a little game: "Beat the compiler!"

Re: For the ASM coders, a little game: "Beat the compiler!"

Re: For the ASM coders, a little game: "Beat the compiler!"

Re: For the ASM coders, a little game: "Beat the compiler!"

Re: For the ASM coders, a little game: "Beat the compiler!"

Re: For the ASM coders, a little game: "Beat the compiler!"

Re: For the ASM coders, a little game: "Beat the compiler!"

Re: For the ASM coders, a little game: "Beat the compiler!"

Re: For the ASM coders, a little game: "Beat the compiler!"

Re: For the ASM coders, a little game: "Beat the compiler!"

Re: For the ASM coders, a little game: "Beat the compiler!"

Re: For the ASM coders, a little game: "Beat the compiler!"

Re: For the ASM coders, a little game: "Beat the compiler!"