Open Watcom 16-Bit Optimization

Rhob
Posts: 22
Joined: Fri Apr 13, 2007 10:08 am
Location: Florida

Open Watcom 16-Bit Optimization

Post by Rhob »

Hey everyone,

I've been playing around with Open Watcom's 16-bit C++ compiler lately. I wanted to see what kind of assembly output it generates. From what I've seen so far, it seems like there's quite a bit of extra assembly code that comes out of the compiler. Below is a sample class and assembly output to illustrate.

Class header:

Code:

#include "stddef.h"

class Foo
{
    private:
        int x, y;
        
    public:
        cdecl Foo(int a = 0, int b = 0) : x(a), y(b) {}
        cdecl ~Foo() {}
        int cdecl bar();
        int cdecl test(int, int);
};
Class implementation:

Code:

#include "test.hpp"

int Foo::bar()
{
    return 0;
}

int Foo::test(int a, int b)
{
    return a + b;
}

int main()
{
    Foo f(1, 2);
    f.bar();        // the object is named f, not foo
    f.test(1, 1);
}
Assembly output:

Code:

Module: C:\WATCOM\projects\test.cpp
GROUP: 'DGROUP' CONST,CONST2,_DATA,_BSS

File contains no line numbers.
Segment: _TEXT BYTE USE16 0000008D bytes
0000                          int near Foo::bar():
0000    56                        push        si 
0001    57                        push        di 
0002    55                        push        bp 
0003    89 E5                     mov         bp,sp 
0005    81 EC 02 00               sub         sp,0x0002 
0009    C7 46 FE 00 00            mov         word ptr -0x2[bp],0x0000 
000E    8B 46 FE                  mov         ax,word ptr -0x2[bp] 
0011    89 EC                     mov         sp,bp 
0013    5D                        pop         bp 
0014    5F                        pop         di 
0015    5E                        pop         si 
0016    C3                        ret         

Routine Size: 23 bytes,    Routine Base: _TEXT + 0000

0017                          int near Foo::test( int, int ):
0017    56                        push        si 
0018    57                        push        di 
0019    55                        push        bp 
001A    89 E5                     mov         bp,sp 
001C    81 EC 02 00               sub         sp,0x0002 
0020    8B 46 0A                  mov         ax,word ptr 0xa[bp] 
0023    03 46 0C                  add         ax,word ptr 0xc[bp] 
0026    89 46 FE                  mov         word ptr -0x2[bp],ax 
0029    8B 46 FE                  mov         ax,word ptr -0x2[bp] 
002C    89 EC                     mov         sp,bp 
002E    5D                        pop         bp 
002F    5F                        pop         di 
0030    5E                        pop         si 
0031    C3                        ret         

Routine Size: 27 bytes,    Routine Base: _TEXT + 0017

0032                          main_:
0032    53                        push        bx 
0033    51                        push        cx 
0034    52                        push        dx 
0035    56                        push        si 
0036    57                        push        di 
0037    55                        push        bp 
0038    89 E5                     mov         bp,sp 
003A    81 EC 0E 00               sub         sp,0x000e 
003E    8D 46 F2                  lea         ax,-0xe[bp] 
0041    89 46 F8                  mov         word ptr -0x8[bp],ax 
0044    C7 46 FA 01 00            mov         word ptr -0x6[bp],0x0001 
0049    C7 46 FC 02 00            mov         word ptr -0x4[bp],0x0002 
004E    8B 46 FA                  mov         ax,word ptr -0x6[bp] 
0051    89 46 F2                  mov         word ptr -0xe[bp],ax 
0054    8B 46 FC                  mov         ax,word ptr -0x4[bp] 
0057    89 46 F4                  mov         word ptr -0xc[bp],ax 
005A    8D 46 F2                  lea         ax,-0xe[bp] 
005D    89 46 FE                  mov         word ptr -0x2[bp],ax 
0060    8D 46 F2                  lea         ax,-0xe[bp] 
0063    50                        push        ax 
0064    E8 00 00                  call        int near Foo::bar() 
0067    83 C4 02                  add         sp,0x0002 
006A    B8 01 00                  mov         ax,0x0001 
006D    50                        push        ax 
006E    B8 01 00                  mov         ax,0x0001 
0071    50                        push        ax 
0072    8D 46 F2                  lea         ax,-0xe[bp] 
0075    50                        push        ax 
0076    E8 00 00                  call        int near Foo::test( int, int ) 
0079    83 C4 06                  add         sp,0x0006 
007C    C7 46 F6 00 00            mov         word ptr -0xa[bp],0x0000 
0081    8B 46 F6                  mov         ax,word ptr -0xa[bp] 
0084    89 EC                     mov         sp,bp 
0086    5D                        pop         bp 
0087    5F                        pop         di 
0088    5E                        pop         si 
0089    5A                        pop         dx 
008A    59                        pop         cx 
008B    5B                        pop         bx 
008C    C3                        ret         

Routine Size: 91 bytes,    Routine Base: _TEXT + 0032

No disassembly errors

Segment: CONST BYTE USE16 00000000 bytes

Segment: CONST2 PARA USE16 00000000 bytes

Segment: _DATA BYTE USE16 00000000 bytes

Segment: _BSS BYTE USE16 00000000 bytes

BSS Size: 0 bytes
My real concern is creating a Foo object on the stack in the fake main() method. It reserves space on the stack for 14 bytes (sub sp, 0x000E) but the size of a Foo object is only 4 bytes - 2 bytes each for the two int variables. Wouldn't it be more efficient for the assembly code to just reserve 4 bytes on the stack and put the two constructor arguments there? Or am I missing something?

Please let me know if there's anything I need to clarify.
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: Open Watcom 16-Bit Optimization

Post by Owen »

A constructor is called as if defined

Code:

void my_constructor(Object* this, void* arg1, void* arg2, ...);
No calling convention allows a pointed-to object (*this) and the argument stack space to overlap. That would be insane.

That still doesn't explain the reservation size, however (but I haven't had an in-depth look; gotta run soon).
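
To make the point above concrete, here is a minimal sketch of a constructor call with the 'this' pointer written out as an ordinary parameter. The names FooData, foo_construct and demo_construct are hypothetical, and returning the pointer is an assumption for illustration, not something confirmed for Open Watcom.

```cpp
#include <cassert>

// Hypothetical stand-in for the data members of Foo from the first post.
struct FooData {
    int x;
    int y;
};

// Roughly what a constructor call amounts to once 'this' is made explicit.
// Returning the pointer is an assumption made for this sketch.
FooData* foo_construct(FooData* self, int a, int b) {
    self->x = a;  // corresponds to x(a) in the initializer list
    self->y = b;  // corresponds to y(b) in the initializer list
    return self;
}

// The object's storage and the argument values are separate things:
// f lives in its own slot, while 1 and 2 are passed alongside its address.
int demo_construct() {
    FooData f;
    FooData* p = foo_construct(&f, 1, 2);
    assert(p == &f);
    return f.x + f.y;
}
```

The object being constructed is reached through a pointer, so the arguments cannot simply be "placed into" the object by the caller; the callee copies them there.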
Rhob
Posts: 22
Joined: Fri Apr 13, 2007 10:08 am
Location: Florida

Re: Open Watcom 16-Bit Optimization

Post by Rhob »

If you'll notice, in the assembly output, there's no actual call to the Foo class constructor in the fake main() method. So what's the point of pushing constructor arguments on the stack, or storing them in registers?
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance

Re: Open Watcom 16-Bit Optimization

Post by Combuster »

You put code in the header file. Watcom is apparently smart enough to inline the result :wink:
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
Rhob
Posts: 22
Joined: Fri Apr 13, 2007 10:08 am
Location: Florida

Re: Open Watcom 16-Bit Optimization

Post by Rhob »

Fair enough. :P But as far as I can tell, it only did a partial inlining. A full inlining would render the pushing of constructor arguments on the stack completely superfluous, would it not?
Rhob
Posts: 22
Joined: Fri Apr 13, 2007 10:08 am
Location: Florida

Re: Open Watcom 16-Bit Optimization

Post by Rhob »

I'd like to summarize my stance on this issue.

It seems clear to me that, as Combuster wrote, the Open Watcom C++ compiler was smart enough to inline the constructor code where it's called by the fake main() function - especially since the constructor is empty aside from an initializer list. However, for some reason it still went through the motions of pushing the constructor arguments (two ints) onto the stack. An even better result, IMO, would be the following:

Code:

main_:
    ; Preserve register values
    push    bx
    push    cx
    push    dx
    push    si
    push    di
    push    bp
    mov     bp, sp
    
    ; Inline constructor code
    sub     sp, 0x0004
    lea     ax, -0x4[bp]                ; 'this' pointer
    mov     word ptr -0x4[bp], 0x0001
    mov     word ptr -0x2[bp], 0x0002
    
    ; Call Foo::bar() using cdecl
    push    ax                          ; 'this' pointer
    call    int near Foo::bar()
    add     sp, 0x0002
    
    ; Now call Foo::test(int, int) using cdecl
    push    0x0001
    push    0x0001
    lea     ax, -0x4[bp]                ; 'this' pointer
    push    ax
    call    int near Foo::test(int, int)
    add     sp, 0x0006
    
    ; Restore register values and return 0
    mov     ax, 0x0000
    mov     sp, bp
    pop     bp
    pop     di
    pop     si
    pop     dx
    pop     cx
    pop     bx
    ret
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: Open Watcom 16-Bit Optimization

Post by gerryg400 »

What are your debugging/optimisation settings? Do you have debugging disabled and optimisation enabled?
If a trainstation is where trains stop, what is a workstation ?
Rhob
Posts: 22
Joined: Fri Apr 13, 2007 10:08 am
Location: Florida

Re: Open Watcom 16-Bit Optimization

Post by Rhob »

I have both debugging and optimization disabled, actually. I also have stack-frame generation disabled, if that makes any difference.
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: Open Watcom 16-Bit Optimization

Post by gerryg400 »

Rhob wrote:I have both debugging and optimization disabled, actually. I also have stack-frame generation disabled, if that makes any difference.
If you have disabled optimisation then I'm not surprised that you are able to produce better code than the compiler. Try it with optimisation ON. You'll probably find that your main function is almost empty because it doesn't really do anything.
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance

Re: Open Watcom 16-Bit Optimization

Post by Combuster »

Rhob wrote:I have both debugging and optimization disabled
Which explains why it only inlines the constructor verbatim and nothing more: because it has to do that - the implementation isn't defined elsewhere.

Also, I noticed your "optimized" code doesn't run on an 8086 in case you are looking for a fair comparison :wink:.
Rhob
Posts: 22
Joined: Fri Apr 13, 2007 10:08 am
Location: Florida

Re: Open Watcom 16-Bit Optimization

Post by Rhob »

Combuster wrote:Which explains why it only inlines the constructor verbatim and nothing more: because it has to do that - the implementation isn't defined elsewhere.
I'm sorry but I don't understand why it throws in that extra code anyway.
Combuster wrote:Also, I noticed your "optimized" code doesn't run on an 8086 in case you are looking for a fair comparison :wink:.
... Do you think I'm being arrogant or something? Otherwise, I can't explain your tone here.

Maybe you'd like to tell me why it apparently doesn't run on an 8086?
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance

Re: Open Watcom 16-Bit Optimization

Post by Combuster »

Rhob wrote:I'm sorry but I don't understand why it throws in that extra code anyway.
A compiler that is not optimizing translates code 1:1 to its assembly counterpart, without any regard for the operations that follow. If you declare a variable, it is allocated space on the stack; if you set a variable, that location is actually written. The resulting assembly looks rather bloated because it completely lacks the resulting load-store forwarding.
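
As a rough illustration of that store-then-reload pattern, the sketch below mimics in C++ what the listing for Foo::test does (mov word ptr -0x2[bp],ax followed by mov ax,word ptr -0x2[bp]). The function names are hypothetical, and volatile is only used here so that a modern optimizing compiler keeps the round trip through memory.

```cpp
#include <cassert>

// What the source code says: compute and return.
int add_direct(int a, int b) {
    return a + b;
}

// Roughly what the unoptimized listing does: the result is written to a
// stack slot and immediately read back before returning.
int add_via_stack_slot(int a, int b) {
    volatile int slot;  // stands in for the temporary at -0x2[bp]
    slot = a + b;       // store the result to the stack
    return slot;        // reload it into the return register
}

// Both versions compute the same value; only the memory traffic differs.
void demo_spill() {
    assert(add_direct(1, 1) == add_via_stack_slot(1, 1));
}
```

An optimizer would forward the stored value straight to the return, which is the step that is missing in the listing.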
Rhob wrote:Maybe you'd like to tell me why it apparently doesn't run on an 8086?
Because you are using instructions that did not exist on the 8086/8088: in this case, push immediate. The compiler generates universally correct code by emitting mov ax, imm; push ax, which makes the compiler's code better than yours in that the result works everywhere. You, however, claim it is worse because it uses double the number of CPU cycles and more bytes of code; that comparison is not entirely fair, and you were apparently unaware of that fact.
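
For a sense of the size difference being discussed, here is a small sketch that just compares encodings. The first byte sequence is copied from the listing in the first post; the other two are the standard push-immediate encodings that arrived with the 80186. The array and function names are hypothetical, and which form a given assembler would pick for push 0x0001 is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 8086-compatible sequence, bytes as shown in the listing:
//   B8 01 00   mov ax,0x0001
//   50         push ax
const std::uint8_t kMovThenPush[] = {0xB8, 0x01, 0x00, 0x50};

// 80186 and later only: push imm16 (opcode 0x68, 16-bit immediate).
const std::uint8_t kPushImm16[] = {0x68, 0x01, 0x00};

// 80186 and later only: push imm8, sign-extended (opcode 0x6A).
const std::uint8_t kPushImm8[] = {0x6A, 0x01};

// Bytes saved per pushed constant if the shortest form is used.
std::size_t bytes_saved() {
    return sizeof(kMovThenPush) - sizeof(kPushImm8);
}

void demo_encoding() {
    assert(bytes_saved() == 2);
}
```

So the saving is real, but it is only available when the target is known to be newer than an 8086/8088.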
Rhob
Posts: 22
Joined: Fri Apr 13, 2007 10:08 am
Location: Florida

Re: Open Watcom 16-Bit Optimization

Post by Rhob »

Combuster wrote:A compiler that is not optimizing translates code 1:1 to its assembly counterpart, without any regard for the operations that follow. If you declare a variable, it is allocated space on the stack; if you set a variable, that location is actually written. The resulting assembly looks rather bloated because it completely lacks the resulting load-store forwarding.
In the following line of code:

Code:

Foo f(1, 2);
only the Foo instance is explicitly declared. I see no reason for the compiler to treat the int arguments to the constructor as implicit variables, even when not told to optimize. However, I guess I was mistaken about how typical (C/C++) compilers operate - they're "dumber" than I expected them to be when it comes to things like this. Then again, even without being told to optimize, the Open Watcom C++ compiler is still "smart" enough to inline the initialization code instead of going through the motions of actually calling the constructor method. So is it really doing a 1:1 code translation?
Combuster wrote:Because you are using instructions that did not exist on the 8086/8088: in this case, push immediate. The compiler generates universally correct code by emitting mov ax, imm; push ax, which makes the compiler's code better than yours in that the result works everywhere. You, however, claim it is worse because it uses double the number of CPU cycles and more bytes of code; that comparison is not entirely fair, and you were apparently unaware of that fact.
I assume you're talking about these two lines of my alternative code:

Code:

push 0x0001
push 0x0001
That wasn't my intended focus for this thread (it was the object creation and initialization), and I indeed wasn't aware that push-immediate instructions didn't exist for the 8086/8088. So I gladly stand corrected on that point, thanks. :)
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany

Re: Open Watcom 16-Bit Optimization

Post by Solar »

The -O0 path of code generation is usually the least well-tested one, and the one the designers spent the least work on, to the point that its use is discouraged for anything except debugging.
Every good solution is obvious once you've found it.
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance

Re: Open Watcom 16-Bit Optimization

Post by Combuster »

Rhob wrote:In the following line of code:

Code:

Foo f(1, 2);
only the Foo instance is explicitly declared. I see no reason for the compiler to treat the int arguments to the constructor as implicit variables, even when not told to optimize. However, I guess I was mistaken about how typical (C/C++) compilers operate - they're "dumber" than I expected them to be when it comes to things like this. Then again, even without being told to optimize, the Open Watcom C++ compiler is still "smart" enough to inline the initialization code instead of going through the motions of actually calling the constructor method. So is it really doing a 1:1 code translation?
This compilation exercise has the following steps:
Declare a variable f.
Initialize f, and for that supply the arguments 1 and 2 to the constructor.

The declaration requires the size of the class to be known, and space for it to be allocated. The size is added to the stack frame in question, and at a later stage the offset relative to bp is used where needed.
Then the initialisation is a constructor function with the format (Foo *, int, int). Since this is a nested function, we have to use the same stack frame. The number of arguments is more than for any previous function, so we tell the prologue to include at least three slots for call stack storage. Then put the arguments on the stack in reverse order: push the value of b, push the value of a, push the address of f.
Then the constructor is called. The code is included in the class definition, so create a scope block and recursively include the code.
Add the local variables of the constructor to the required stack space, of which there are none.
Initialize the member variables: copy a, passed on the stack, to x, which is n bytes relative to the pointer passed elsewhere on the stack. Repeat for b and y.
Function scope ends. Since the function was nested, we do not need to include a separate prologue and epilogue.


Looks like KISS compiler behaviour to me.
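
Reading the bp-relative offsets in the main_ listing from the first post against the steps above, the 14 reserved bytes appear to break down into seven 16-bit slots. The sketch below is only one interpretation of that listing: the struct and all field names are hypothetical, and the role assigned to each slot is inferred from the instructions that touch it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One reading of main_'s locals, lowest address first (sub sp,0x000e).
struct MainFrame {
    std::uint16_t f_x;          // -0xe[bp]: Foo::x, written from the 'a' slot
    std::uint16_t f_y;          // -0xc[bp]: Foo::y, written from the 'b' slot
    std::uint16_t return_tmp;   // -0xa[bp]: holds main's return value 0
    std::uint16_t ctor_this;    // -0x8[bp]: address of f for the inlined ctor
    std::uint16_t ctor_a;       // -0x6[bp]: constructor argument a = 1
    std::uint16_t ctor_b;       // -0x4[bp]: constructor argument b = 2
    std::uint16_t ctor_result;  // -0x2[bp]: address of f stored after the ctor
};

static_assert(sizeof(MainFrame) == 14, "seven 16-bit slots make 14 bytes");

// Convert an offset inside MainFrame to a displacement relative to bp.
int bp_offset(std::size_t member_offset) {
    return static_cast<int>(member_offset) - static_cast<int>(sizeof(MainFrame));
}

void demo_frame() {
    assert(bp_offset(offsetof(MainFrame, f_x)) == -14);
    assert(bp_offset(offsetof(MainFrame, ctor_a)) == -6);
}
```

Under that reading, only the first two slots are the Foo object itself; the other ten bytes are temporaries that an optimizing build would be expected to eliminate.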