Brendan wrote:
The "rep movsd" instruction is (sort of) the assembly language equivalent of "memcpy()"; except it works on dwords not bytes.
Indeed, the
'move string to string' instructions (or the equivalent AVX and SSE instructions) are sometimes used in implementing memcpy().
The basic instruction ('movsw', which moves a single 16-bit word) takes no explicit operands and is written in the form
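Code:
    movsw        ; copy the word at DS:[SI] to ES:[DI] and adjust SI and DI by 2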
and copies the word value pointed to by SI (or ESI or RSI, depending on the CPU mode) to the location pointed to by DI (EDI, RDI). There are 8-, 16-, 32-, and 64-bit variants (movsb/movsw/movsd/movsq), and it can also be written with explicit operands to indicate the size instead (though this is rarely done):
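In MASM-style syntax that looks something like

Code:
    movs word ptr es:[di], word ptr ds:[si]    ; same operation as movsw above

where the operands only establish the operand size (and any segment override for the source); the addresses still come implicitly from SI and DI.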
The 'repeat string operation' prefixes ('rep', 'repz', 'repnz', etc.) are a special class of instruction prefix for the string operations, indicating that the operation should be repeated until some condition is met. For example, REP checks for a non-zero value in [E|R]CX, performs the string operation, decrements [E|R]CX, and starts again, in effect performing a definite loop.
Since the larger the word size, the fewer memory cycles the copy takes, you can often speed up a memcpy() function by using the largest string-move operation that the processor mode permits. However, if you want to implement memcpy() with this, you do need to watch for two things. First, as Brendan says,
movsd operates on doublewords (there is a similar
movsq that moves quadwords, added for 64-bit mode), so a memcpy() function using one needs to check whether the data is a) a multiple of the data size in length, and b) aligned on the data size in memory. If it isn't aligned, it needs to copy the leading bytes until the pointer is aligned before performing the 'rep movs'; then, if there are trailing bytes which don't make up a full word (doubleword, quadword), set the count register to the number of full data elements, perform the repeated string instruction, and finish by using 'movsb' to copy the trailing bytes.
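A rough sketch of that approach (32-bit NASM-style code, untested; it assumes flat segments, the source and destination already in ESI and EDI, a byte count in ECX, and it aligns on the destination - the label memcpy_rep is just a name I picked):

Code:
    memcpy_rep:
        cld                      ; make sure the pointers increment
    .head:
        test edi, 3              ; is the destination dword-aligned yet?
        jz   .bulk
        test ecx, ecx            ; out of bytes?
        jz   .done
        movsb                    ; copy one byte, advance ESI/EDI
        dec  ecx
        jmp  .head
    .bulk:
        mov  edx, ecx
        shr  ecx, 2              ; number of whole dwords
        rep  movsd               ; copy the aligned bulk a dword at a time
        mov  ecx, edx
        and  ecx, 3              ; bytes left over that don't fill a dword
        rep  movsb               ; copy the trailing bytes
    .done:
        ret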
Second, the performance of the repeated string ops is highly dependent on the processor implementation, meaning that different CPU models may do better or worse with them compared to a conventional for()-loop memcpy(). While this is pretty much a thing of the past, since newer CPUs should all have the repeats optimized for the memory pipeline, it is still something to be aware of. The same applies to the SSE and AVX instructions, so this is one place where - if you are up to it - you might want to have multiple versions of a given function (memcpy() in this case) and have the OS installer select the right one for the CPU model at installation time. It's not something most of us here have to concern ourselves with, but it is definitely a factor if you think you can move into the big leagues.
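If you do go down that road, the selection itself can be as simple as testing a CPUID feature bit once and parking the result in a function pointer that callers go through - the same test works whether the installer does it or the kernel does it at boot. A rough sketch (32-bit NASM-style; memcpy_sse2 and memcpy_rep_movsd are placeholders for the actual routines, and I'm assuming CPUID leaf 1 is available):

Code:
    extern memcpy_sse2, memcpy_rep_movsd

    select_memcpy:
        mov  eax, 1                                ; CPUID leaf 1: feature flags
        cpuid
        test edx, 1 << 26                          ; EDX bit 26 = SSE2
        jz   .plain
        mov  dword [memcpy_ptr], memcpy_sse2       ; placeholder SSE2 routine
        ret
    .plain:
        mov  dword [memcpy_ptr], memcpy_rep_movsd  ; placeholder rep movsd routine
        ret

    memcpy_ptr: dd 0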
Mind you, I have always wondered why the standard memory subsystem doesn't have a hardware memory-to-memory DMA transfer; it would seem like an obvious way to bypass transfers to and from the CPU, though it would probably require some sort of interlock mechanism to prevent the CPU from trying to access memory in the process of being copied. It would be similar to but simpler than the
Bit BLIT hardware in most graphics cards, so it isn't as if it would be difficult.
Write-combining is (AFAIK) similar to, but not quite the same as, what I mean, but I don't know enough about it to really say how applicable it would be.