Assembly mem_copy ( )
Assembly mem_copy ( )
I've been trying to improve the performance of my mem_copy ( ) function.
...See later posts...
Any hints, ideas or solutions (especially) would be great.
Thanks,
Lster
...See later posts...
Any hints, ideas or solutions (especially) would be great.
Thanks,
Lster
Last edited by Lprogster on Fri May 18, 2007 3:47 am, edited 1 time in total.
Code: Select all
mov edx, [esp]
mov eax, 8
div ecx
Try:
Code: Select all
mov eax, [esp + 4]
xor edx, edx
mov ecx, 8
div ecx
mov eax, ecx
Code: Select all
mov ecx, [esp + 4]
shr ecx, 8
Code: Select all
mov ebx, [esp + 4]
mov edx, [esp + 8]
It would be faster to load more than one qword in one iteration (pipelining):
movq mm0, [edx]
movq mm1, [edx + 8]
....
I would try somthing like this (I can't remember if movs* decreases ecx or rep does it.), I don't know if this is faster, but it's supported by all >286 processors
Also you page fault may be due to the fact that your writing to much data. (i.e what happens if the code want's to copy 9,17,25 or 33 bytes of data.
Code: Select all
mem_copy_a:
mov ecx, [esp + 4]
mov eax, ecx
and eax,3
shr ecx, 3
mov edi, [esp + 8]
mov esi, [esp + 12]
shr eax,1
jnc TwoByteWrite
movsb
TwoByteWrite:
shr eax,1
jnc Dword_Write
movsw
Dword_Write:
rep movsd
ret
Microsoft: "let everyone run after us. We'll just INNOV~1"
Also it may be better to use a 128 bit move with. and use offsets rather than adding to edx, ebx
Code: Select all
movdqa xmm1,[esi+ebx];
movdqa [edi+ebx],xmm1;
Microsoft: "let everyone run after us. We'll just INNOV~1"
Maybe I am missing on something but how in the world can you assume that the programmer is trying to move >8 bytes? What if we are moving 2 bytes only? You are moving 8 bytes with the MOVQ MMX instruction.
Another point to notice is that when you issue the DEC instruction with any of the General Purpose Registers, the Zero Flag in the EFLAGS register will be set to 1. So this code that you have written:
Is not optimized. You can remove the CMP instruction like this:
Another thing that you might already know is that you "must" ask the programmer how many bytes he would like to copy. Then you will be able to pull off a lot of tricks for copying from the source to the destination.
For example, first find the quotient of the "count" by 8 and see how many QWORDs (Using MMX instruction) you can move to the destination. After that, find the quotient of the "count / 8" divided by 4 and find out how many DWORDs you can move to the destination, then use that for WORD and finally BYTE.
For example, if the user asks you to move 19 bytes from the source to the destination, then 19 bytes of transfer can be done like this:
Two QWORDs (= 2 * 8 = 16 bytes)
Now subtract 16 (the number of bytes we have already moved) from 19 to get 3. 3 will not yield a positive quotient when divided by 4 so divide it by 2 to get 1 as the quotient. The remainder will be 1. So you have to move 1 WORD and 1 Bytes.
As a result, to transfer 19 bytes,
Transfer 2 QWORDs (QWORD = 8 bytes, 2*8= 16 bytes)
Transfer 1 WORD (2 bytes)
Transfer 1 Byte
I have written on __MMXZeroMemory function that I think will give you the idea. However, I have not used DWORD/WORD movements in this code. I have written comments for the code but haven't had time to write documentation. Post back here if you have any questions.
Alright, I coded one [__MemCpy] right now and you are free to use it too. But there are certain things you have to bear in mind about the code and these can be used as general programming skills:
When you want to divide a number by n where n is a power of 2 (For example, 1, 2, 4, 8, 16, 32, etc), just find x while 2^x = n
For example, to divide a number by 16, find x for 2^x = 16
This can be evaluated as
Log.2(16) = x
Log.2(2^4) = x
4*Log.2(2) = x
4=x or x=4
Therefore, to be able to divide a number by 16, you can shift it to the right 4 times. This allows us to find the quotient of a number divided by 16. We are going to need this value to be able to find out how many QWORDs/DWORDs/etc we have to move to a buffer. For example, if the user wants to move 34 bytes to a buffer, we shift 34 to the right 2 times to be able to find out how many DWORDs we have to move to the buffer. When you shift a number to the right n times, that number will be divided by 2^n. So shifting a number to the right 2 times will divide the number by 2^2 = 4 (DWORD).
Now that we know the quotient of our division, we have to find the remainder also. Finding the remainder of a division of a number by a power of 2 is really easy, also. If you want to find this remainder, just AND the number of N-1 when N is the quotient. For example, to find the remainder of the division (14/8), just subtract 1 from 8 (8-1 = 7) then AND 14 with 7 to get 6. So 6 is the remainder of the division of (14/8). 1*8 = 8, 8+6 = 14.
To wrap it up, let's say we need the remainder and the quotient of the division (NUMBER / 32).
Quotient = NUMBER SHR x
2^x = 32 => x = 5
Quotient = NUMBER SHR 5
Remainder = NUMBER AND (32-1) = NUMBER AND 31
That's that. Now to be able to understand the code below, you need to know a few optimization tricks.
1) If the required length specified by the programmer is zero, then exit the function immediately.
2) If the source and the destination are equal, then quit the function immediately.
3) Use the MOVSD instruction to move ECX number of DWORDs at a time from DS:ESI to DS:EDI.
4) Use the MOVSB instruction to move ECX number of BYTES at a time from DS:ESI to DS:EDI.
5) Do not assume that the processor your kernel is running in is equipped with MMX/SSE instructions.
6) Do not trash your general purpose registers; PUSH and POP them to save their current state.
7) Clear the direction flag when using string instructions (although not necessarily) with the CLD instruction.
8) I use the StdCall calling convention in my kernel. You should push the parameters from right to left in this calling convention.
I didn't have enough time to write documentation for this code either. If you need further help/explanation, write back here.
Hope that helps!
Another point to notice is that when you issue the DEC instruction with any of the General Purpose Registers, the Zero Flag in the EFLAGS register will be set to 1. So this code that you have written:
Code: Select all
dec ecx
cmp ecx, 0
jne .again
Code: Select all
DEC ECX
JNZ .DoSomethingIfECXIsNotZeroYet
Another thing that you might already know is that you "must" ask the programmer how many bytes he would like to copy. Then you will be able to pull off a lot of tricks for copying from the source to the destination.
For example, first find the quotient of the "count" by 8 and see how many QWORDs (Using MMX instruction) you can move to the destination. After that, find the quotient of the "count / 8" divided by 4 and find out how many DWORDs you can move to the destination, then use that for WORD and finally BYTE.
For example, if the user asks you to move 19 bytes from the source to the destination, then 19 bytes of transfer can be done like this:
Two QWORDs (= 2 * 8 = 16 bytes)
Now subtract 16 (the number of bytes we have already moved) from 19 to get 3. 3 will not yield a positive quotient when divided by 4 so divide it by 2 to get 1 as the quotient. The remainder will be 1. So you have to move 1 WORD and 1 Bytes.
As a result, to transfer 19 bytes,
Transfer 2 QWORDs (QWORD = 8 bytes, 2*8= 16 bytes)
Transfer 1 WORD (2 bytes)
Transfer 1 Byte
I have written on __MMXZeroMemory function that I think will give you the idea. However, I have not used DWORD/WORD movements in this code. I have written comments for the code but haven't had time to write documentation. Post back here if you have any questions.
Code: Select all
__MMXZeroMemory:
; void __MMXZeroMemory (void* Destination, DWORD Length)
PUSH EAX ; Push the accumulator onto the stack
PUSH ECX ; Push the count register onto the stack
PUSH EDX ; Push the data register onto the stack
PUSH EDI ; Push the destination index onto the stack
PUSH EBP ; Push the base pointer onto the stack
MOV EBP , ESP ; Move the stack pointer to the base pointer
MOV ECX , DWORD PTR [EBP + 0x1C]; *ECX = The [Length] parameter
TEST ECX , ECX ; See if the requested length is zero
JZ .EP ; Jump to the end of the procedure if yes
MOV EDI , DWORD PTR [EBP + 0x18]; *EDI = The [Destination] parameter
MOV EAX , ECX ; *EAX = The [Length] parameter
XOR EDX , EDX ; Clear the buffer for Byte-moves
SHR ECX , 0x00000003 ; *ECX = Number of QWORDs that we have to move to the Destination
AND EAX , 0x00000007 ; *EAX = Number of parity bytes that we have to move
TEST ECX , ECX ; See if the number of QWORDs to move is zero
JZ .MoveBytes ; Start moving bytes (Up to 7 bytes) if yes
EMMS ; Empty MMX Technology State
PXOR MM0 , MM0 ; The 8-Byte MMX byte is zero
.MoveQWORDs: ; Start zeroing memory 8 bytes at a time
MOVQ QWORD PTR [EDI] , MM0 ; Zero the current QWORD
ADD EDI , 0x00000008 ; Move to the next QWORD in the destination
DEC ECX ; Decrement the number of QWORDs to move
JNZ .MoveQWORDs ; Keep moving QWORDs to the destination
TEST EAX , EAX ; See if the number of parity bytes to move is zero
JZ .EP ; Jump to the end of the procedure if yes
.MoveBytes: ; We are left to move up to 7 bytes now
MOV BYTE PTR [EDI] , DL ; Write one byte to the current destination's location
INC EDI ; Move to the next byte of the destination
DEC EAX ; Decrement the number of bytes to move
JNZ .MoveBytes ; Keep moving parity bytes while EAX>0
.EP: ; End of the procedure routine
POP EBP ; Restore the base pointer
POP EDI ; Restore the destination index
POP EDX ; Restore the data register
POP ECX ; Restore the count register
POP EAX ; Restore the accumulator
RET 0x08 ; Return to the calling procedure
; And sweep 2 parameters off the stack
; And sweep 1 parameter off the stack
Alright, I coded one [__MemCpy] right now and you are free to use it too. But there are certain things you have to bear in mind about the code and these can be used as general programming skills:
When you want to divide a number by n where n is a power of 2 (For example, 1, 2, 4, 8, 16, 32, etc), just find x while 2^x = n
For example, to divide a number by 16, find x for 2^x = 16
This can be evaluated as
Log.2(16) = x
Log.2(2^4) = x
4*Log.2(2) = x
4=x or x=4
Therefore, to be able to divide a number by 16, you can shift it to the right 4 times. This allows us to find the quotient of a number divided by 16. We are going to need this value to be able to find out how many QWORDs/DWORDs/etc we have to move to a buffer. For example, if the user wants to move 34 bytes to a buffer, we shift 34 to the right 2 times to be able to find out how many DWORDs we have to move to the buffer. When you shift a number to the right n times, that number will be divided by 2^n. So shifting a number to the right 2 times will divide the number by 2^2 = 4 (DWORD).
Now that we know the quotient of our division, we have to find the remainder also. Finding the remainder of a division of a number by a power of 2 is really easy, also. If you want to find this remainder, just AND the number of N-1 when N is the quotient. For example, to find the remainder of the division (14/8), just subtract 1 from 8 (8-1 = 7) then AND 14 with 7 to get 6. So 6 is the remainder of the division of (14/8). 1*8 = 8, 8+6 = 14.
To wrap it up, let's say we need the remainder and the quotient of the division (NUMBER / 32).
Quotient = NUMBER SHR x
2^x = 32 => x = 5
Quotient = NUMBER SHR 5
Remainder = NUMBER AND (32-1) = NUMBER AND 31
That's that. Now to be able to understand the code below, you need to know a few optimization tricks.
1) If the required length specified by the programmer is zero, then exit the function immediately.
2) If the source and the destination are equal, then quit the function immediately.
3) Use the MOVSD instruction to move ECX number of DWORDs at a time from DS:ESI to DS:EDI.
4) Use the MOVSB instruction to move ECX number of BYTES at a time from DS:ESI to DS:EDI.
5) Do not assume that the processor your kernel is running in is equipped with MMX/SSE instructions.
6) Do not trash your general purpose registers; PUSH and POP them to save their current state.
7) Clear the direction flag when using string instructions (although not necessarily) with the CLD instruction.
8) I use the StdCall calling convention in my kernel. You should push the parameters from right to left in this calling convention.
I didn't have enough time to write documentation for this code either. If you need further help/explanation, write back here.
Code: Select all
; ——————————————————————————————————————————————————
__MemCpy:
; void __MemCpy (void* Destination, const void* Source, DWORD NumberofBytesToCopy); StdCall;
PUSHFD ; Push the EFLAGS register onto the stack
PUSH EAX ; Push the accumulator onto the stack
PUSH ECX ; Push the count register onto the stack
PUSH ESI ; Push the source index onto the stack
PUSH EDI ; Push the destination index onto the satck
PUSH EBP ; Push the base pointer onto the stack
MOV EBP , ESP ; Move the stack pointer to the base pointer
CLD ; Clear the direction flag, Move forward with string instructions
MOV ECX , DWORD PTR [EBP + 0x24]; *ECX = The [NumberofBYtesToCopy] parameter
TEST ECX , ECX ; See if the number of bytes we have to copy is zero
JZ .EP ; Jump to the end of the procedure if yes
MOV ESI , DWORD PTR [EBP + 0x20]; *ESI = The [Source] parameter
MOV EDI , DWORD PTR [EBP + 0x1C]; *EDI = The [Destination] parameter
CMP ESI , EDI ; Are the source and the destination the same memory location?
JE .EP ; If yes, then jump to the end of the procedure
MOV EAX , ECX ; EAX holds the [NumberofBytesToCopy] parameter too
AND EAX , 0x00000003 ; *EAX = Remainder of ([NumberofBytesToCopy] / 4)
SHR ECX , 0x00000002 ; *ECX = Quotient of ([NumberofBytesToCopy] / 4)
REP MOVSD ; Store as many as ECX DWORDs from source to destination
MOV ECX , EAX ; Put the count in the count register
REP MOVSB ; Now start moving bytes from Source to Destination
.EP: ; End of the procedure
POP EBP ; Restore the base pointer
POP EDI ; Restore the destination index
POP ESI ; Restore the source index
POP ECX ; Restore the count register
POP EAX ; Restore the base index
POPFD ; Restore the EFLAGS register
RET 0x0C ; Return to the calling procedure
; And sweep 3 parameters off the stack
On the field with sword and shield amidst the din of dying of men's wails. War is waged and the battle will rage until only the righteous prevails.
Re: Assembly mem_copy ( )
Hi,
A high performance generic "mem_copy()" needs to take into account various values for count and alignment issues, and also needs to worry about setup overhead.
For example, for a small transfer (e.g. 16 bytes) where source and destination are aligned to an 8 byte boundary, do you need to check if MMX is supported, then use FXSAVE, EMMS and FXLOAD (in case the caller is using the FPU), and could it cause an device not available exception? In this case "rep movsb" would be faster because there's much less setup overhead.
For a high level language this would be optimised by the compiler itself, so that if the compiler knows the source and destination are aligned it inlines code that doesn't bother checking alignment, and if the compiler knows the count is a multiple of 4 or 8 it inlines code that doesn't bother transferring bytes, words, dwords, etc.
If you really want high performance, then you need to use the same tricks. Basically you need either an optimised macro for each specific case or custom code for each specific case. A generic "mem_copy()" routine is guaranteed to give worse performance, simply because it's generic.
@XCHG: the INC and DEC instructions only update some of the flags, and therefore these instructions can't complete until after previous instruction/s have modified the flags (it creates a false dependancy). Because of this it's better to use ADD and SUB which replace all flags and has no false dependancy. For example:
Cheers,
Brendan
Don't.Lprogster wrote:I've been trying to improve the performance of my mem_copy ( ) function. So, this is my attempt to use MMX to copy faster (in Nasm BTW):
A high performance generic "mem_copy()" needs to take into account various values for count and alignment issues, and also needs to worry about setup overhead.
For example, for a small transfer (e.g. 16 bytes) where source and destination are aligned to an 8 byte boundary, do you need to check if MMX is supported, then use FXSAVE, EMMS and FXLOAD (in case the caller is using the FPU), and could it cause an device not available exception? In this case "rep movsb" would be faster because there's much less setup overhead.
For a high level language this would be optimised by the compiler itself, so that if the compiler knows the source and destination are aligned it inlines code that doesn't bother checking alignment, and if the compiler knows the count is a multiple of 4 or 8 it inlines code that doesn't bother transferring bytes, words, dwords, etc.
If you really want high performance, then you need to use the same tricks. Basically you need either an optimised macro for each specific case or custom code for each specific case. A generic "mem_copy()" routine is guaranteed to give worse performance, simply because it's generic.
@XCHG: the INC and DEC instructions only update some of the flags, and therefore these instructions can't complete until after previous instruction/s have modified the flags (it creates a false dependancy). Because of this it's better to use ADD and SUB which replace all flags and has no false dependancy. For example:
Code: Select all
cmp foo, bar ;An instruction that modifies flags
add foo,bar ;Can complete without knowing flags left by the last instruction (NO STALL)
adc foo,bar ;Can't complete without knowing flags left by the last instruction (STALL)
dec foo ;Can't complete without knowing flags left by the last instruction (STALL)
inc foo ;Can't complete without knowing flags left by the last instruction (STALL)
sub foo,1 ;Can complete without knowing flags left by the last instruction (NO STALL)
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
I'm not sure if I understand. Does it mean that it's better to use add ax, 1 than inc ax ? If yes than by "stall" you mean what is happening in out-of-order buffer? If I've just written something stupid, please don't laugh at me.Brendan wrote: @XCHG: the INC and DEC instructions only update some of the flags, and therefore these instructions can't complete until after previous instruction/s have modified the flags (it creates a false dependancy). Because of this it's better to use ADD and SUB which replace all flags and has no false dependancy.
Brendan,
What you are referring to as a stall is Partial-Register-Access-Stall. This "stall" does not apply in all circumstances and is well addressed in Agner Fog's optimization manuals. Instructions such as INC/DES/SUB/ADD all pair in both the u and the v pipes. In order to be able to say whether they stall or not you have to trace the code to its origin and see what instruction pairs with what. For example,
According to your post, the above code should take longer to execute compared to when the DEC isntruction is replaced by SUB 1. The above code runs at the speed of 313 clock cycles under the below circumstances:
Code Segment aligned on a DWORD boundary.
Stack Segment aligned on a DWORD boundary.
Both Source and Destination variables are arrays of 100 DWORDs, aligned on a DWORD boundary.
CPU: PIII 800MHZ
RAM: 512MB SD
Now if I change the DEC to a SUB, the code will run at 336 clock cycles on the same CPU with the same conditions as stated above. The clock cycles stated above include the creation and the destruction of the stack frame.
What you are referring to as a stall is Partial-Register-Access-Stall. This "stall" does not apply in all circumstances and is well addressed in Agner Fog's optimization manuals. Instructions such as INC/DES/SUB/ADD all pair in both the u and the v pipes. In order to be able to say whether they stall or not you have to trace the code to its origin and see what instruction pairs with what. For example,
Code: Select all
LEA ESI , Source
LEA EDI , Destination
MOV ECX , 100
@Transfer:
MOV EAX , DWORD PTR [ESI]
MOV DWORD PTR [EDI] , EAX
ADD ESI , 4
ADD EDI , 4
DEC ECX
JNZ @Transfer
According to your post, the above code should take longer to execute compared to when the DEC isntruction is replaced by SUB 1. The above code runs at the speed of 313 clock cycles under the below circumstances:
Code Segment aligned on a DWORD boundary.
Stack Segment aligned on a DWORD boundary.
Both Source and Destination variables are arrays of 100 DWORDs, aligned on a DWORD boundary.
CPU: PIII 800MHZ
RAM: 512MB SD
Now if I change the DEC to a SUB, the code will run at 336 clock cycles on the same CPU with the same conditions as stated above. The clock cycles stated above include the creation and the destruction of the stack frame.
On the field with sword and shield amidst the din of dying of men's wails. War is waged and the battle will rage until only the righteous prevails.
Hi,
On Pentium III backward branches are predicted as taken, and the CPU will speculatively start executing the next iteration of the loop before it knows if the branch is taken or not. In this case the choice of sub or dec should only really effect the speed that the CPU exits the loop (where the branch would be mispredicted until the flags are known), nothing else.
More importantly, did you try it on a Pentium 4 or any other CPUs?
From Intel's Optimization Manual, "Use of the inc and dec Instructions" (page 2-68 in my copy):
Cheers,
Brendan
IIRC the U and V pipes only apply to Pentium CPUs.XCHG wrote:What you are referring to as a stall is Partial-Register-Access-Stall. This "stall" does not apply in all circumstances and is well addressed in Agner Fog's optimization manuals. Instructions such as INC/DES/SUB/ADD all pair in both the u and the v pipes. In order to be able to say whether they stall or not you have to trace the code to its origin and see what instruction pairs with what. For example,
On Pentium III backward branches are predicted as taken, and the CPU will speculatively start executing the next iteration of the loop before it knows if the branch is taken or not. In this case the choice of sub or dec should only really effect the speed that the CPU exits the loop (where the branch would be mispredicted until the flags are known), nothing else.
How many times did you try it (are the figures you've stated an average of many tests), were the caches in the same state (e.g. all code in the cache beforehand, so we know the add wasn't slower because it caused an extra cache line fetch), and did you do a serialising instruction before the first RDTSC?XCHG wrote:Now if I change the DEC to a SUB, the code will run at 336 clock cycles on the same CPU with the same conditions as stated above. The clock cycles stated above include the creation and the destruction of the stack frame.
More importantly, did you try it on a Pentium 4 or any other CPUs?
From Intel's Optimization Manual, "Use of the inc and dec Instructions" (page 2-68 in my copy):
The inc and dec instructions modify only a subset of the bits in the flag register. This creates a dependance on all previous writes of the flag register. This is especially problematic when these instructions are on the critical path because they are used to change an address for a load on which many other instructions depend.
Assembly/Complier Coding Rule 42. (medium impact, high generality) inc and dec instructions should be replaced with an add or sub instruction, because add and sub overwrite all flags, whereas inc and dec do not, therefore creating false dependancies on earlier instructions that set the flags.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Yep. U/V pipelining is total nonsense when applied to other cpu's. I have seen an internal implementation of memcpy for VS2005 which was hand-optimized for the Pentium I though, so you're not the first to do that wrong. In general, the things that worked on a P1 will continue to work, but more could work.Brendan wrote: IIRC the U and V pipes only apply to Pentium CPUs.
Yep. That applies only for the P4 NetBurst architecture and probably the newer Core architectures as well.More importantly, did you try it on a Pentium 4 or any other CPUs?XCHG wrote:Now if I change the DEC to a SUB, the code will run at 336 clock cycles on the same CPU with the same conditions as stated above. The clock cycles stated above include the creation and the destruction of the stack frame.
From Intel's Optimization Manual, "Use of the inc and dec Instructions" (page 2-68 in my copy):