Page 2 of 2

Posted: Fri Jul 13, 2007 6:15 am
by XCHG
hailstorm,

That method won't do either when you have local variables. A local variable must always be accessed with EBP that is pushed as the last value while creating the stack frame. For example:

Code: Select all

; ——————————————————————————————————————————————————
  __Dummy:
    ; void __Dummy (void); StdCall;
    PUSH    EAX
    PUSH    EDX
    PUSH    EBP
    MOV     EBP , ESP
    
    ; [EBP - 0x04] = LocalVariable1
    ; [EBP - 0x08] = LocalVariable2
    
    POP     EBP
    POP     EDX
    POP     EAX
    RET
; ——————————————————————————————————————————————————
But if you push EBP first, then you will have to calculate the offset of local variables with more complexity.

Posted: Fri Jul 13, 2007 9:30 am
by Combuster
EBP is *not* required. Setting up stack frames cost you a few clocks every time you call that function, not to mention one of the eight general purpose registers. The standard prologue and epilogue are just a trick to make programming and debugging easier using the stack. You can directly reference relative to ESP instead of using EBP, and if you're more into assembly you can use register calling conventions which are even easier to write. (but more difficult to maintain)

Some other notes:
StdCall dictates that the stack be cleaned by the callee. That is, the called function must clean up the stack before returning.

Code: Select all

push arg
call stdcallfunction
; when we get here, arg has been popped
Also, when you want to use local variables, you must create space on the stack for them:

Code: Select all

MOV EBP, ESP
SUB ESP, arg_count * 4
MOV ..., [EBP - 4 * arg_no - 4]
or without stackframe referencing

Code: Select all

SUB ESP, arg_count * 4
MOV ..., [ESP + 4 * (arg_count - arg_no)]

Posted: Sun Jul 15, 2007 6:33 am
by XCHG
I really don't like accessing local variables or parameters using the ESP because of these reasons:


1) Accessing local variables and parameters through ESP is literally useless when coding recursive functions.

2) Each reference to ESP is encoded into opcodes 1 byte longer than the equivalent using EBP. So code gets bigger if you use ESP instead of EBP (by one byte per each reference).

I have not seen any good HLL compilers using ESP to access these values and they are, IMHO, not just *tricks* for debugging. They make the programming easier and the code more understandable and easier to maintain.

Posted: Sun Jul 15, 2007 6:13 pm
by Brendan
Hi,
XCHG wrote:I really don't like accessing local variables or parameters using the ESP because of these reasons:

1) Accessing local variables and parameters through ESP is literally useless when coding recursive functions.
I can't see why it'd make any difference whether the function is recursive or not.
XCHG wrote:2) Each reference to ESP is encoded into opcodes 1 byte longer than the equivalent using EBP. So code gets bigger if you use ESP instead of EBP (by one byte per each reference).
It takes 4 extra bytes to setup EBP (push ebp; mov ebp, esp; pop ebp). Also, one local variable can be put into EBP instead of on the stack, which saves more bytes of code (and can reduce register dependancies).
XCHG wrote:I have not seen any good HLL compilers using ESP to access these values and they are, IMHO, not just *tricks* for debugging. They make the programming easier and the code more understandable and easier to maintain.
Did you try "GCC -fomit-frame-pointer"? According to Redhat's GCC options web page it's enabled on all platforms that benefit from it when any optimization is being done (including, -O1, -Os, etc).

Unfortunately, on my system (Gentoo) this isn't the case, and I need to specify "-fomit-frame-pointer" (even with "-O3").

For an example, I tried this:

Code: Select all

int foo(int a, int b) {
   int c = 1;

   c += a;

   if(a != 0) c = foo(b - 1, c);

   return (a & b | c);
}
Compiled with "gcc -fomit-frame-pointer -S foo.c" it produced:

Code: Select all

.globl foo
	.type	foo, @function
foo:
	subl	$24, %esp
	movl	$1, 20(%esp)
	movl	28(%esp), %eax
	addl	%eax, 20(%esp)
	cmpl	$0, 28(%esp)
	je	.L2
	movl	32(%esp), %eax
	decl	%eax
	movl	20(%esp), %edx
	movl	%edx, 4(%esp)
	movl	%eax, (%esp)
	call	foo
	movl	%eax, 20(%esp)
.L2:
	movl	32(%esp), %eax
	andl	28(%esp), %eax
	orl	20(%esp), %eax
	addl	$24, %esp
	ret

Cheers,

Brendan

Posted: Sun Jul 15, 2007 10:37 pm
by XCHG
1) Let me go through basics again :roll: A recursive function is a function that CALLs itself. Each time the function CALLs a label within itself, the CALL instruction will PUSH the DWORD EIP of the next instruction right after the CALL into the stack, causing the ESP to be decremented 4 times. Now when the function has finished recursing, you are left with an ESP that you don't even know how many times has been decremented. Now how are you going to access local parameters or even worse, local variables with this ESP? An example:

Code: Select all

; ——————————————————————————————————————————————————
  __RecursiveFunction:
    ; void __RecursiveFunction (DWORD Count, Boolean* MissionAccomplished); StdCall;
    PUSH    EAX
    PUSH    ECX
    
    ; [ESP]        = ECX
    ; [ESP + 0x04] = EAX
    ; [ESP + 0x08] = EIP
    ; [ESP + 0x0C] = Count
    ; [ESP + 0x10] = *MissionAccomplished
    
    MOV     ECX , DWORD PTR [ESP + 0x0C]
    .DummyLoop:
      ; Do some calculations here
      DEC     ECX
      JZ      .End
      CALL    .DummyLoop
    .End:
      ; What is the offset of [MissionAccomplished] from ESP now?
    POP     ECX
    POP     EAX
    RET     0x08
; ——————————————————————————————————————————————————
Could you tell me what the offset of the [MissionAccomplished] parameter will be after the function is done recursing?




2) You mean you would sacrifice a whole lot of benefits that are given to you by using EBP, just because setting up the stack pointer would cost you a few couple of clock cycles? You will lose your local variables; lose your parameters, your opcodes are each 1 byte longer than when you use EBP and you don't see all this? Just because you want to save a few clock cycles by preventing creating the stack frame? :roll:

Posted: Mon Jul 16, 2007 3:10 am
by Brendan
Hi,
XCHG wrote:1) Let me go through basics again :roll: A recursive function is a function that CALLs itself. Each time the function CALLs a label within itself, the CALL instruction will PUSH the DWORD EIP of the next instruction right after the CALL into the stack, causing the ESP to be decremented 4 times. Now when the function has finished recursing, you are left with an ESP that you don't even know how many times has been decremented. Now how are you going to access local parameters or even worse, local variables with this ESP? An example:

Code: Select all

; ——————————————————————————————————————————————————
  __RecursiveFunction:
    ; void __RecursiveFunction (DWORD Count, Boolean* MissionAccomplished); StdCall;
    PUSH    EAX
    PUSH    ECX
    
    ; [ESP]        = ECX
    ; [ESP + 0x04] = EAX
    ; [ESP + 0x08] = EIP
    ; [ESP + 0x0C] = Count
    ; [ESP + 0x10] = *MissionAccomplished
    
    MOV     ECX , DWORD PTR [ESP + 0x0C]
    .DummyLoop:
      ; Do some calculations here
      DEC     ECX
      JZ      .End
      CALL    .DummyLoop
    .End:
      ; What is the offset of [MissionAccomplished] from ESP now?
    POP     ECX
    POP     EAX
    RET     0x08
; ——————————————————————————————————————————————————
Could you tell me what the offset of the [MissionAccomplished] parameter will be after the function is done recursing?
That is NOT a recursive function - it's an abomination, as ".DummyLoop" isn't an entry point for any function, and for any non-zero count you'd pop more off the stack than was pushed on it and trash the return EIP.

To do it correctly you'd want something like:

Code: Select all

; ——————————————————————————————————————————————————
  __RecursiveFunction:
    ; void __RecursiveFunction (DWORD Count, Boolean* MissionAccomplished); StdCall;
    PUSH    EAX
    PUSH    ECX
    
    ; [ESP]        = ECX
    ; [ESP + 0x04] = EAX
    ; [ESP + 0x08] = EIP
    ; [ESP + 0x0C] = Count
    ; [ESP + 0x10] = *MissionAccomplished
    
    MOV     ECX , DWORD PTR [ESP + 0x0C]
    .DummyLoop:
      ; Do some calculations here
      DEC     ECX
      JZ      .End
      PUSH    [ESP + 0x0C]      ;Push Count
      PUSH    [ESP + 0x10 + 4]  ;Push *MissionAccomplished
      CALL    __RecursiveFunction
    .End:
      ; What is the offset of [MissionAccomplished] from ESP now?
    POP     ECX
    POP     EAX
    RET     0x08
; ——————————————————————————————————————————————————

For the correct version, the offset of "*MissionAccomplished" is "[esp + 0x10]" both before the recursive call and after it.


Of course the original code would've also been correct if you did something like this:

Code: Select all

; ——————————————————————————————————————————————————
  __RecursiveFunction:
    ; void __RecursiveFunction (DWORD Count, Boolean* MissionAccomplished); StdCall;
    PUSH    EAX
    PUSH    ECX
    
    ; [ESP]        = ECX
    ; [ESP + 0x04] = EAX
    ; [ESP + 0x08] = EIP
    ; [ESP + 0x0C] = Count
    ; [ESP + 0x10] = *MissionAccomplished
    
    MOV     ECX , DWORD PTR [ESP + 0x0C]
    .DummyLoop:
      ; Do some calculations here
      DEC     ECX
      JZ      .End
      JMP     .DummyLoop
    .End:
      ; What is the offset of [MissionAccomplished] from ESP now?
    POP     ECX
    POP     EAX
    RET     0x08
; ——————————————————————————————————————————————————
..but then it's not recursive at all.
XCHG wrote:2) You mean you would sacrifice a whole lot of benefits that are given to you by using EBP, just because setting up the stack pointer would cost you a few couple of clock cycles? You will lose your local variables; lose your parameters, your opcodes are each 1 byte longer than when you use EBP and you don't see all this? Just because you want to save a few clock cycles by preventing creating the stack frame? :roll:
I mean I'd sacrifice all the disadvantages given to me by using a frame pointer, given that I can't find a single advantage of wasting an entire register, especially on 32-bit 80x86 where there really aren't enough registers to waste.

You don't lose local variables or parameters by not wasting EBP. For a quick example:

Code: Select all

__FOO:           ;int FOO(int a, int b) 
    add esp,8
%define firstLocalVariable  [esp]
%define secondLocalVariable [esp+4]
%define firstParameter      [esp+12]
%define secondParameter     [esp+16]

    sub esp,8
    ret
For code size, saving 1 byte for some opcodes is a small (negligable) gain, but it can cost you more than you gain for small functions. For your own " __MemSetB" code there'd be no difference in code size (the cost of the "push ebp; mov ebp,esp; pop ebp" is equal to the 4 bytes you save) but it'd run faster with 3 less micro-ops in the trace cache.


Cheers,

Brendan

Posted: Mon Jul 16, 2007 6:15 am
by XCHG
Brendan,


I think you will agree that procedures and functions are the same from the Assembly point of view. They are just labels and they have RET. So the simplest procedure (I will state procedure instead of procedure/function from this point forth) is like this:

Code: Select all

__MyProcedure;
  ; void __MyProcedure (void); StdCall;
  RET


Therefore a procedure can be defined as an EIP and a RET. Now let me give you another example of a procedure. This one will convert an unsigned DWORD value to its equivalent null-terminated string:

Code: Select all

; --------------------------------------------------
  __DWORDToStr:
    ; void __DWORDToStr  (DWORD InDWORD, char* Buffer)
    PUSH    EAX
    PUSH    EBX
    PUSH    ECX
    PUSH    EDX
    PUSH    EDI
    PUSH    EBP
    MOV     EBP , ESP
    PUSH    OFFSET .EP
    MOV     EAX , DWORD PTR [EBP + 0x1C] ; InDWORD
    MOV     EDI , DWORD PTR [EBP + 0x20] ; Buffer
    MOV     ECX , 0xCCCCCCCD             ; Divisor
    .Loop:
      MOV     EBX , EAX
      MUL     ECX
      SHR     EDX , 0x00000003
      MOV     EAX , EDX
      ADD     EDX , EDX
      LEA     EDX , [EDX + 0x04*EDX]
      SUB     EBX , EDX
      OR      EBX , 0x00000030
      PUSH    EBX
      TEST    EAX , EAX
      JE      .Extract
      CALL    .Loop
    .Extract:
      POP     EAX
      MOV     BYTE PTR [EDI] , AL
      INC     EDI
      RET
    .EP:
      MOV     BYTE PTR [EDI] , 0x00
      POP     EBP
      POP     EDI
      POP     EDX
      POP     ECX
      POP     EBX
      POP     EAX
    RET     0x08
; --------------------------------------------------

Now if you look at the [.Loop] label, you will see that it has RET at the end of the [.Extract] label therefore, the [.Loop] up to the end of the [.Extract] label is a separate procedure. [__DWORDToStr] is another procedure. So far, we have nested procedures.

Now look at the CALL instruction before the [.Extract] label. It will push the EIP of the [.Extract] label into the stack every time it transfers the control to the [.Loop] label so [/b]IT IS RECURSIVE[/b]. To state the matter differently, The [.Loop] is a NESTED RECURSIVE[b/] procedure. So that's that! ESP is usable in such procedures neither for accessing local variables nor parameters.



About using ESP instead of creating EBP, maybe it is how I feel but I think you don't code in Assembly as much as you do in high level languages because if you did, you would have known that maintainability is one of THE most important parts of coding in Assembly. I have coded procedures that are literally hundreds of lines of code and if I hadn't udes EBP as a steady reference to my local parameters and variables, I could have simply wasted a lot of time re-calculating all the offsets relative to ESP. There are times when you PUSH and POP a lot of values onto and off of the stack and if you had chosen to access the parameters and/or local variables using ESP instead of EBP, you would have gone insane. Now could it make you happy if you still had your EBP as a free GPR while all of the offsets and re-calculating them could take you hours? Sorry I just don't buy that.

About the speed of execution of the codes that use ESP instead of EBP, you are NOT at all considering the decoding time of instructions, huh? You think it wouldn't matter if you had referenced ESP 100 times instead of EBP and you had made the CPU decode 100 more bytes. That is CRIME against the CPU. [calls 911 :roll:]

Posted: Mon Jul 16, 2007 8:56 am
by Combuster
XCHG wrote:Now look at the CALL instruction before the [.Extract] label. It will push the EIP of the [.Extract] label into the stack every time it transfers the control to the [.Loop] label so [/b]IT IS RECURSIVE[/b]. To state the matter differently, The [.Loop] is a NESTED RECURSIVE[b/] procedure. So that's that! ESP is usable in such procedures neither for accessing local variables nor parameters.

Oh? Actually you ARE using ESP for stack referencing in the recursive part without even USING EBP. All operations are implicit stack operations (push, pop, call, ret) and operate on ESP. Also, I can omit the EBP setup and use ESP instead, it would take less bytes in total and fill less cpu instruction slots.

About using ESP instead of creating EBP, maybe it is how I feel but I think you don't code in Assembly as much as you do in high level languages because if you did, you would have known that maintainability is one of THE most important parts of coding in Assembly. I have coded procedures that are literally hundreds of lines of code and if I hadn't udes EBP as a steady reference to my local parameters and variables, I could have simply wasted a lot of time re-calculating all the offsets relative to ESP. There are times when you PUSH and POP a lot of values onto and off of the stack and if you had chosen to access the parameters and/or local variables using ESP instead of EBP, you would have gone insane. Now could it make you happy if you still had your EBP as a free GPR while all of the offsets and re-calculating them could take you hours? Sorry I just don't buy that.

And when you reuse code, you change the arguments? Normally, the reused code will be identical, and you do not have to spend time changing the references. Even if you do, the changes will be the same for either method: you just add or subtract a number. When you really need to recalculate ESP relatives, you wouldn't be reusing code, you'd be modifying or even rewriting it.
Also, the main purpose of freeing EBP setup is speed. Performance optimisations takes time, something you are apparently not willing to spend. Consequently bothering with speed is something I don't buy.

Also, it has been said that you should rewrite your code once your assumptions change. If you wonder, that comes from an Quake programmer and expert optimiser. I'll find you the quote if you don't believe me - its somewhere in Graphics Programming Black Book.

About the speed of execution of the codes that use ESP instead of EBP, you are NOT at all considering the decoding time of instructions, huh? You think it wouldn't matter if you had referenced ESP 100 times instead of EBP and you had made the CPU decode 100 more bytes. That is CRIME against the CPU. [calls 911 :roll:]
That's based on many assumptions.
#1: the decoder must be the bottleneck. That's doubtful, and even void for the P4's trace cache.
#2: the decoder must be quicker with decoding EBP+xxx than ESP+xxx. If the decoder has an fixed instructions/clock limit, this problem does not exist. For several other decoders this depends on the surrounding instructions and/or actual position in memory.
#3: freeing EBP for other purposes does not result in fewer references to the stack that reduce the latencies involved in #1 and #2. The main reason for freeing EBP is that you have auxiliary storage for variables that would otherwise be in memory. Referencing memory is on average slower and takes more bytes than using a register. Chances are very likely that you save more clocks and bytes here than you waste clocks and bytes decoding the ESP-referencing instructions.

Posted: Mon Jul 16, 2007 9:52 am
by Brendan
Hi,
XCHG wrote:Now if you look at the [.Loop] label, you will see that it has RET at the end of the [.Extract] label therefore, the [.Loop] up to the end of the [.Extract] label is a separate procedure. [__DWORDToStr] is another procedure. So far, we have nested procedures.
Your ".loop" could be considered a seperate procedure that passes parameters in registers, but it doesn't use any HLL calling convention and doesn't reference ESP or EBP at all. It's an example of what can be done once you throw away the stack frame.

Modern CPUs keep track of return values so that the target of a RET can be predicted and doesn't cause stalls - the code you've posted should be optimized to so that the "PUSH OFFSET .EP" doesn't stuff this up:

Code: Select all

; --------------------------------------------------
  __DWORDToStr:
    ; void __DWORDToStr  (DWORD InDWORD, char* Buffer)
    PUSH    EAX
    PUSH    EBX
    PUSH    ECX
    PUSH    EDX
    PUSH    EDI
    MOV     EAX , DWORD PTR [ESP + 0x18] ; InDWORD
    MOV     EDI , DWORD PTR [ESP + 0x1C] ; Buffer
    MOV     ECX , 0xCCCCCCCD             ; Divisor
    CALL    .Loop
    MOV     BYTE PTR [EDI] , 0x00
    POP     EDI
    POP     EDX
    POP     ECX
    POP     EBX
    POP     EAX
    RET     0x08

    .Loop:
      MOV     EBX , EAX
      MUL     ECX
      SHR     EDX , 0x00000003
      MOV     EAX , EDX
      ADD     EDX , EDX
      LEA     EDX , [EDX + 0x04*EDX]
      SUB     EBX , EDX
      OR      EBX , 0x00000030
      PUSH    EBX
      TEST    EAX , EAX
      JE      .Extract
      CALL    .Loop
    .Extract:
      POP     EAX
      MOV     BYTE PTR [EDI] , AL
      INC     EDI
      RET
; --------------------------------------------------
Of course the idea of using up to 80 bytes of stack space, having a call/ret pair in the inner loop, and doing up to 10 byte writes instead of fewer larger writes just doesn't seem right to me. It might be fun to see if this is faster:

Code: Select all

; --------------------------------------------------
  __DWORDToStr:
    ; void __DWORDToStr  (DWORD InDWORD, char* Buffer)
    PUSH    EAX
    PUSH    EBX
    PUSH    ECX
    PUSH    EDX
    PUSH    ESI
    PUSH    EDI
    PUSH    EBP
    MOV     EAX , DWORD PTR [ESP + 0x20] ; InDWORD
    MOV     EDI , 0xCCCCCCCD             ; Divisor
    XOR     EBP , EBP
    XOR     ECX , ECX
    XOR     ESI , ESI

 .L1:
      MOV     EBX , EAX
      MUL     EDI
      SHR     EDX , 0x00000003
      MOV     EAX , EDX
      ADD     EDX , EDX
      LEA     EDX , [EDX + 0x04*EDX]
      SUB     EBX , EDX
      OR      EBX , 0x00000030
      SHLD    ESI, EBP, 8
      SHLD    EBP, ECX, 8
      SHL     ECX, 8
      TEST    EAX , EAX
      MOV     CL, BL
      JNE     .L1

    MOV     EDI , DWORD PTR [ESP + 0x24] ; Buffer
    MOV     [EDI] , ECX
    MOV     [EDI+4] , EBP
    MOV     [EDI+8] , SI
    MOV     [EDI+10] , AL

    POP     EBP
    POP     EDI
    POP     ESI
    POP     EDX
    POP     ECX
    POP     EBX
    POP     EAX
    RET     0x08

; --------------------------------------------------
Hmm - nearly ran out of registers for that (must've been a coincidence).... :D

XCHG wrote:About using ESP instead of creating EBP, maybe it is how I feel but I think you don't code in Assembly as much as you do in high level languages because if you did, you would have known that maintainability is one of THE most important parts of coding in Assembly. I have coded procedures that are literally hundreds of lines of code and if I hadn't udes EBP as a steady reference to my local parameters and variables, I could have simply wasted a lot of time re-calculating all the offsets relative to ESP. There are times when you PUSH and POP a lot of values onto and off of the stack and if you had chosen to access the parameters and/or local variables using ESP instead of EBP, you would have gone insane. Now could it make you happy if you still had your EBP as a free GPR while all of the offsets and re-calculating them could take you hours? Sorry I just don't buy that.
For the record, I've been programming assembly for over 20 years (although I've only been doing 80x86 for about 15 of those years). I've spent more time programming in Commodore 64 BASIC than I have in any other high level language.

If you're using local variables, then you shouldn't be pushing and popping stuff everywhere to begin with (setup the stack on entry, then write your code so it spills values to local variables on the previously setup stack, then undo the stack and return). If you write a procedure that's hundreds of lines long, then it's not maintainable (regardless of what you do or don't do with the stack frame), and you should've split it into smaller parts before you got that far.
XCHG wrote:About the speed of execution of the codes that use ESP instead of EBP, you are NOT at all considering the decoding time of instructions, huh? You think it wouldn't matter if you had referenced ESP 100 times instead of EBP and you had made the CPU decode 100 more bytes. That is CRIME against the CPU. [calls 911 :roll:]
For small procedures using EBP for the frame pointer will cost you more bytes of code. For larger procedures a few extra bytes of code is nothing compared to having an extra register to help avoid spilling values to the stack, help avoid register dependancies, and help with better instruction scheduling.


Cheers,

Brendan