Assembly mem_copy ( )

Lprogster · Post by **Lprogster** » Thu May 17, 2007 12:15 pm

I've been trying to improve the performance of my mem_copy ( ) function.

...See later posts...

Any hints, ideas or solutions (especially) would be great.

Thanks,
Lster

Korona · Post by **Korona** » Thu May 17, 2007 1:39 pm

Code: Select all

mov edx, [esp]
mov eax, 8
div ecx

This piece of code loads the return address into edx and then divides edx:eax trough ecx.
Try:

Code: Select all

mov eax, [esp + 4]
xor edx, edx
mov ecx, 8
div ecx
mov eax, ecx

Much faster would be:

Code: Select all

mov ecx, [esp + 4]
shr ecx, 8

Code: Select all

mov ebx, [esp + 4]
mov edx, [esp + 8]

ebx and edx are loaded with the wrong values, the first argument is esp + 4, the second is esp + 8 and the third is esp + 12.

It would be faster to load more than one qword in one iteration (pipelining):
movq mm0, [edx]
movq mm1, [edx + 8]
....

Lprogster · Post by **Lprogster** » Thu May 17, 2007 1:55 pm

Thank you very much for your reply. I still get errors, though (page fault...). Here is my updated code:

...

Thanks again,
Lster

B.E · Post by **B.E** » Thu May 17, 2007 3:00 pm

I would try somthing like this (I can't remember if movs* decreases ecx or rep does it.), I don't know if this is faster, but it's supported by all >286 processors

Code: Select all

mem_copy_a:
   
   mov ecx, [esp + 4]
   mov eax, ecx
   and eax,3

   shr ecx, 3
   
   mov edi, [esp + 8]
   mov esi, [esp + 12]
   
   shr eax,1
   jnc TwoByteWrite
   movsb
TwoByteWrite:
   shr eax,1
   jnc Dword_Write
   movsw
Dword_Write:
   rep movsd
   ret

Also you page fault may be due to the fact that your writing to much data. (i.e what happens if the code want's to copy 9,17,25 or 33 bytes of data.

B.E · Post by **B.E** » Thu May 17, 2007 3:04 pm

Also it may be better to use a 128 bit move with. and use offsets rather than adding to edx, ebx

Code: Select all

  movdqa xmm1,[esi+ebx];
  movdqa [edi+ebx],xmm1;

XCHG · Post by **XCHG** » Fri May 18, 2007 12:31 am

Maybe I am missing on something but how in the world can you assume that the programmer is trying to move >8 bytes? What if we are moving 2 bytes only? You are moving 8 bytes with the MOVQ MMX instruction.

Another point to notice is that when you issue the DEC instruction with any of the General Purpose Registers, the Zero Flag in the EFLAGS register will be set to 1. So this code that you have written:

Code: Select all

      dec ecx
      cmp ecx, 0
      jne .again

Is not optimized. You can remove the CMP instruction like this:

Code: Select all

  DEC     ECX
  JNZ     .DoSomethingIfECXIsNotZeroYet

Another thing that you might already know is that you "must" ask the programmer how many bytes he would like to copy. Then you will be able to pull off a lot of tricks for copying from the source to the destination.

For example, first find the quotient of the "count" by 8 and see how many QWORDs (Using MMX instruction) you can move to the destination. After that, find the quotient of the "count / 8" divided by 4 and find out how many DWORDs you can move to the destination, then use that for WORD and finally BYTE.

For example, if the user asks you to move 19 bytes from the source to the destination, then 19 bytes of transfer can be done like this:

Two QWORDs (= 2 * 8 = 16 bytes)
Now subtract 16 (the number of bytes we have already moved) from 19 to get 3. 3 will not yield a positive quotient when divided by 4 so divide it by 2 to get 1 as the quotient. The remainder will be 1. So you have to move 1 WORD and 1 Bytes.

As a result, to transfer 19 bytes,

Transfer 2 QWORDs (QWORD = 8 bytes, 2*8= 16 bytes)
Transfer 1 WORD (2 bytes)
Transfer 1 Byte

I have written on __MMXZeroMemory function that I think will give you the idea. However, I have not used DWORD/WORD movements in this code. I have written comments for the code but haven't had time to write documentation. Post back here if you have any questions.

Code: Select all

  __MMXZeroMemory:
    ; void __MMXZeroMemory (void* Destination, DWORD Length)
    PUSH    EAX                         ; Push the accumulator onto the stack
    PUSH    ECX                         ; Push the count register onto the stack
    PUSH    EDX                         ; Push the data register onto the stack
    PUSH    EDI                         ; Push the destination index onto the stack
    PUSH    EBP                         ; Push the base pointer onto the stack
    MOV     EBP , ESP                   ; Move the stack pointer to the base pointer
    MOV     ECX , DWORD PTR [EBP + 0x1C]; *ECX = The [Length] parameter
    TEST    ECX , ECX                   ; See if the requested length is zero
    JZ      .EP                         ; Jump to the end of the procedure if yes
    MOV     EDI , DWORD PTR [EBP + 0x18]; *EDI = The [Destination] parameter
    MOV     EAX , ECX                   ; *EAX = The [Length] parameter
    XOR     EDX , EDX                   ; Clear the buffer for Byte-moves
    SHR     ECX , 0x00000003            ; *ECX = Number of QWORDs that we have to move to the Destination
    AND     EAX , 0x00000007            ; *EAX = Number of parity bytes that we have to move
    TEST    ECX , ECX                   ; See if the number of QWORDs to move is zero
    JZ      .MoveBytes                  ; Start moving bytes (Up to 7 bytes) if yes
    EMMS                                ; Empty MMX Technology State
    PXOR    MM0 , MM0                   ; The 8-Byte MMX byte is zero
    .MoveQWORDs:                        ; Start zeroing memory 8 bytes at a time
      MOVQ    QWORD PTR [EDI] , MM0     ; Zero the current QWORD
      ADD     EDI , 0x00000008          ; Move to the next QWORD in the destination
      DEC     ECX                       ; Decrement the number of QWORDs to move
      JNZ     .MoveQWORDs               ; Keep moving QWORDs to the destination
    TEST    EAX , EAX                   ; See if the number of parity bytes to move is zero
    JZ      .EP                         ; Jump to the end of the procedure if yes
    .MoveBytes:                         ; We are left to move up to 7 bytes now
      MOV     BYTE PTR [EDI] , DL       ; Write one byte to the current destination's location 
      INC     EDI                       ; Move to the next byte of the destination
      DEC     EAX                       ; Decrement the number of bytes to move
      JNZ     .MoveBytes                ; Keep moving parity bytes while EAX>0
    .EP:                                ; End of the procedure routine
      POP     EBP                       ; Restore the base pointer
      POP     EDI                       ; Restore the destination index
      POP     EDX                       ; Restore the data register
      POP     ECX                       ; Restore the count register
      POP     EAX                       ; Restore the accumulator
    RET     0x08                        ; Return to the calling procedure
                                        ; And sweep 2 parameters off the stack
                                        ; And sweep 1 parameter off the stack

Alright, I coded one [__MemCpy] right now and you are free to use it too. But there are certain things you have to bear in mind about the code and these can be used as general programming skills:

When you want to divide a number by n where n is a power of 2 (For example, 1, 2, 4, 8, 16, 32, etc), just find x while 2^x = n
For example, to divide a number by 16, find x for 2^x = 16
This can be evaluated as

Log.2(16) = x
Log.2(2^4) = x
4*Log.2(2) = x
4=x or x=4

Therefore, to be able to divide a number by 16, you can shift it to the right 4 times. This allows us to find the quotient of a number divided by 16. We are going to need this value to be able to find out how many QWORDs/DWORDs/etc we have to move to a buffer. For example, if the user wants to move 34 bytes to a buffer, we shift 34 to the right 2 times to be able to find out how many DWORDs we have to move to the buffer. When you shift a number to the right n times, that number will be divided by 2^n. So shifting a number to the right 2 times will divide the number by 2^2 = 4 (DWORD).

Now that we know the quotient of our division, we have to find the remainder also. Finding the remainder of a division of a number by a power of 2 is really easy, also. If you want to find this remainder, just AND the number of N-1 when N is the quotient. For example, to find the remainder of the division (14/8), just subtract 1 from 8 (8-1 = 7) then AND 14 with 7 to get 6. So 6 is the remainder of the division of (14/8). 1*8 = 8, 8+6 = 14.

To wrap it up, let's say we need the remainder and the quotient of the division (NUMBER / 32).

Quotient = NUMBER SHR x
2^x = 32 => x = 5
Quotient = NUMBER SHR 5

Remainder = NUMBER AND (32-1) = NUMBER AND 31

That's that. Now to be able to understand the code below, you need to know a few optimization tricks.

1) If the required length specified by the programmer is zero, then exit the function immediately.
2) If the source and the destination are equal, then quit the function immediately.
3) Use the MOVSD instruction to move ECX number of DWORDs at a time from DS:ESI to DS:EDI.
4) Use the MOVSB instruction to move ECX number of BYTES at a time from DS:ESI to DS:EDI.
5) Do not assume that the processor your kernel is running in is equipped with MMX/SSE instructions.
6) Do not trash your general purpose registers; PUSH and POP them to save their current state.
7) Clear the direction flag when using string instructions (although not necessarily) with the CLD instruction.
8) I use the StdCall calling convention in my kernel. You should push the parameters from right to left in this calling convention.

I didn't have enough time to write documentation for this code either. If you need further help/explanation, write back here.

Code: Select all

; â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”â€”
  __MemCpy:
    ; void __MemCpy (void* Destination, const void* Source, DWORD NumberofBytesToCopy); StdCall;
    PUSHFD                              ; Push the EFLAGS register onto the stack
    PUSH    EAX                         ; Push the accumulator onto the stack
    PUSH    ECX                         ; Push the count register onto the stack
    PUSH    ESI                         ; Push the source index onto the stack
    PUSH    EDI                         ; Push the destination index onto the satck
    PUSH    EBP                         ; Push the base pointer onto the stack
    MOV     EBP , ESP                   ; Move the stack pointer to the base pointer
    CLD                                 ; Clear the direction flag, Move forward with string instructions
    MOV     ECX , DWORD PTR [EBP + 0x24]; *ECX = The [NumberofBYtesToCopy] parameter
    TEST    ECX , ECX                   ; See if the number of bytes we have to copy is zero
    JZ      .EP                         ; Jump to the end of the procedure if yes
    MOV     ESI , DWORD PTR [EBP + 0x20]; *ESI = The [Source] parameter
    MOV     EDI , DWORD PTR [EBP + 0x1C]; *EDI = The [Destination] parameter
    CMP     ESI , EDI                   ; Are the source and the destination the same memory location?
    JE      .EP                         ; If yes, then jump to the end of the procedure
    MOV     EAX , ECX                   ; EAX holds the [NumberofBytesToCopy] parameter too
    AND     EAX , 0x00000003            ; *EAX = Remainder of ([NumberofBytesToCopy] / 4)
    SHR     ECX , 0x00000002            ; *ECX = Quotient of ([NumberofBytesToCopy] / 4)
    REP     MOVSD                       ; Store as many as ECX DWORDs from source to destination
    MOV     ECX , EAX                   ; Put the count in the count register
    REP     MOVSB                       ; Now start moving bytes from Source to Destination
    .EP:                                ; End of the procedure
      POP     EBP                       ; Restore the base pointer
      POP     EDI                       ; Restore the destination index
      POP     ESI                       ; Restore the source index
      POP     ECX                       ; Restore the count register
      POP     EAX                       ; Restore the base index
      POPFD                             ; Restore the EFLAGS register
    RET     0x0C                        ; Return to the calling procedure
                                        ; And sweep 3 parameters off the stack

Hope that helps!

Brendan · Post by **Brendan** » Fri May 18, 2007 2:17 am

Hi,

Lprogster wrote:I've been trying to improve the performance of my mem_copy ( ) function. So, this is my attempt to use MMX to copy faster (in Nasm BTW):

Don't.

A high performance generic "mem_copy()" needs to take into account various values for count and alignment issues, and also needs to worry about setup overhead.

For example, for a small transfer (e.g. 16 bytes) where source and destination are aligned to an 8 byte boundary, do you need to check if MMX is supported, then use FXSAVE, EMMS and FXLOAD (in case the caller is using the FPU), and could it cause an device not available exception? In this case "rep movsb" would be faster because there's much less setup overhead.

For a high level language this would be optimised by the compiler itself, so that if the compiler knows the source and destination are aligned it inlines code that doesn't bother checking alignment, and if the compiler knows the count is a multiple of 4 or 8 it inlines code that doesn't bother transferring bytes, words, dwords, etc.

If you really want high performance, then you need to use the same tricks. Basically you need either an optimised macro for each specific case or custom code for each specific case. A generic "mem_copy()" routine is guaranteed to give worse performance, simply because it's generic.

@XCHG: the INC and DEC instructions only update some of the flags, and therefore these instructions can't complete until after previous instruction/s have modified the flags (it creates a false dependancy). Because of this it's better to use ADD and SUB which replace all flags and has no false dependancy. For example:

Code: Select all

    cmp foo, bar   ;An instruction that modifies flags
     add foo,bar     ;Can complete without knowing flags left by the last instruction (NO STALL)
     adc foo,bar     ;Can't complete without knowing flags left by the last instruction (STALL)
     dec foo           ;Can't complete without knowing flags left by the last instruction (STALL)
     inc foo           ;Can't complete without knowing flags left by the last instruction (STALL)
     sub foo,1        ;Can complete without knowing flags left by the last instruction (NO STALL)

Cheers,

Brendan

crackers · Post by **crackers** » Fri May 18, 2007 2:57 am

Brendan wrote: @XCHG: the INC and DEC instructions only update some of the flags, and therefore these instructions can't complete until after previous instruction/s have modified the flags (it creates a false dependancy). Because of this it's better to use ADD and SUB which replace all flags and has no false dependancy.

I'm not sure if I understand. Does it mean that it's better to use add ax, 1 than inc ax ? If yes than by "stall" you mean what is happening in out-of-order buffer? If I've just written something stupid, please don't laugh at me.

XCHG · Post by **XCHG** » Fri May 18, 2007 4:13 am

Brendan,

What you are referring to as a stall is Partial-Register-Access-Stall. This "stall" does not apply in all circumstances and is well addressed in Agner Fog's optimization manuals. Instructions such as INC/DES/SUB/ADD all pair in both the u and the v pipes. In order to be able to say whether they stall or not you have to trace the code to its origin and see what instruction pairs with what. For example,

Code: Select all

  LEA     ESI , Source
  LEA     EDI , Destination
  MOV     ECX , 100
  @Transfer:
    MOV     EAX , DWORD PTR [ESI]
    MOV     DWORD PTR [EDI] , EAX
    ADD     ESI , 4
    ADD     EDI , 4
    DEC     ECX
    JNZ     @Transfer

According to your post, the above code should take longer to execute compared to when the DEC isntruction is replaced by SUB 1. The above code runs at the speed of 313 clock cycles under the below circumstances:

Code Segment aligned on a DWORD boundary.
Stack Segment aligned on a DWORD boundary.
Both Source and Destination variables are arrays of 100 DWORDs, aligned on a DWORD boundary.
CPU: PIII 800MHZ
RAM: 512MB SD

Now if I change the DEC to a SUB, the code will run at 336 clock cycles on the same CPU with the same conditions as stated above. The clock cycles stated above include the creation and the destruction of the stack frame.

Brendan · Post by **Brendan** » Fri May 18, 2007 4:54 am

Hi,

XCHG wrote:What you are referring to as a stall is Partial-Register-Access-Stall. This "stall" does not apply in all circumstances and is well addressed in Agner Fog's optimization manuals. Instructions such as INC/DES/SUB/ADD all pair in both the u and the v pipes. In order to be able to say whether they stall or not you have to trace the code to its origin and see what instruction pairs with what. For example,

IIRC the U and V pipes only apply to Pentium CPUs.

On Pentium III backward branches are predicted as taken, and the CPU will speculatively start executing the next iteration of the loop before it knows if the branch is taken or not. In this case the choice of sub or dec should only really effect the speed that the CPU exits the loop (where the branch would be mispredicted until the flags are known), nothing else.

XCHG wrote:Now if I change the DEC to a SUB, the code will run at 336 clock cycles on the same CPU with the same conditions as stated above. The clock cycles stated above include the creation and the destruction of the stack frame.

How many times did you try it (are the figures you've stated an average of many tests), were the caches in the same state (e.g. all code in the cache beforehand, so we know the add wasn't slower because it caused an extra cache line fetch), and did you do a serialising instruction before the first RDTSC?

More importantly, did you try it on a Pentium 4 or any other CPUs?

From Intel's Optimization Manual, "Use of the inc and dec Instructions" (page 2-68 in my copy):

The inc and dec instructions modify only a subset of the bits in the flag register. This creates a dependance on all previous writes of the flag register. This is especially problematic when these instructions are on the critical path because they are used to change an address for a load on which many other instructions depend.

Assembly/Complier Coding Rule 42. (medium impact, high generality) inc and dec instructions should be replaced with an add or sub instruction, because add and sub overwrite all flags, whereas inc and dec do not, therefore creating false dependancies on earlier instructions that set the flags.

Cheers,

Brendan

Candy · Post by **Candy** » Sat May 19, 2007 7:52 am

Brendan wrote: IIRC the U and V pipes only apply to Pentium CPUs.

Yep. U/V pipelining is total nonsense when applied to other cpu's. I have seen an internal implementation of memcpy for VS2005 which was hand-optimized for the Pentium I though, so you're not the first to do that wrong. In general, the things that worked on a P1 will continue to work, but more could work.

XCHG wrote:Now if I change the DEC to a SUB, the code will run at 336 clock cycles on the same CPU with the same conditions as stated above. The clock cycles stated above include the creation and the destruction of the stack frame.
More importantly, did you try it on a Pentium 4 or any other CPUs?

From Intel's Optimization Manual, "Use of the inc and dec Instructions" (page 2-68 in my copy):

Yep. That applies only for the P4 NetBurst architecture and probably the newer Core architectures as well.

OSDev.org

Assembly mem_copy ( )

Assembly mem_copy ( )

Re: Assembly mem_copy ( )