Brendan wrote:For modern CPUs, "rep movsb/w/d/q" is optimised to work on cache lines, but only in specific cases - e.g. if the destination address is aligned on a certain boundary (and maybe the source address too), if the count is large enough, if the source and destination areas don't overlap too much, etc. This means that a "rep movsb/w/d/q" has some initial overhead before it starts (while the CPU is determining if it can work on cache lines instead) which makes it slow when the count is tiny.
Depending on the CPU model, the conditions under which REP MOVS moves an entire cache line per iteration (except maybe for the last one) are:
- The counter must be high (as I often am).
- Both the source and destination must be properly aligned.
- The direction must be forward.
- The distance between the source and destination must be at least the size of a cache line.
- The memory type for both the source and destination must be either write-back or write-combining.
REP MOVS takes a single cycle to perform both a read and a write. However, on P2 and P3, there is an extra cycle penalty if ESI + 1/2/4/8 - EDI is divisible by 32 (i.e., if bits 2-4 are the same for the source and destination addresses) due to a cache bank conflict. The simplest way to avoid this is to align the source and destination on a QWORD boundary.
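To make that divisibility condition concrete, here is a minimal sketch of the bank-conflict test in C (the function name is mine, not a real API):
Code: Select all
#include <stdbool.h>
#include <stdint.h>

/* True when the P2/P3 penalty described above applies: the source address
 * plus the operand size, minus the destination address, is divisible by 32
 * (equivalently, bits 2-4 of the two access addresses match). */
static bool rep_movs_bank_conflict(uintptr_t esi, uintptr_t edi,
                                   unsigned opsize /* 1, 2, 4 or 8 */)
{
    return ((esi + opsize - edi) % 32) == 0;
}
With DWORD moves and both buffers QWORD-aligned, ESI + 4 - EDI is always 4 modulo 8 and therefore never divisible by 32, which is why the alignment advice works.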
Brendan wrote:- for tiny memory amounts (maybe less than 100 bytes) with a fixed count you want to unroll and generate a series of "mov" instructions
Yes, but there are two issues here:
- Unrolling loops on Sandy Bridge CPUs is not recommended, because heavily unrolled loop bodies no longer fit in the micro-op cache.
- It's not just MOV we're talking about; it's moves using the largest available registers (except for YMM, as there are currently no x86(-64) CPUs that can perform a full 256-bit write, but expect that to change).
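For illustration, a minimal sketch of such a fully unrolled copy for a fixed 32-byte block (copy32 is a made-up name; a compiler turns this into four 64-bit MOV load/store pairs):
Code: Select all
#include <stdint.h>

/* Fixed-size, fully unrolled copy: four 64-bit loads, four 64-bit stores.
 * Assumes both pointers are valid for 32 bytes and 8-byte aligned. */
static inline void copy32(uint64_t *dst, const uint64_t *src)
{
    uint64_t a = src[0], b = src[1], c = src[2], d = src[3];
    dst[0] = a;
    dst[1] = b;
    dst[2] = c;
    dst[3] = d;
}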
REP MOVS typically has some initial overhead for determining which method to use, which is why it's not recommended for small blocks. But it doesn't stop there: unrolled moves are typically the fastest option even for larger block sizes (though not always, see below). Their downside is that the block size and alignment must be known beforehand (unless self-modifying code is used, but the only justification for that would be reusing the exact same routine many times). More on alignment later.
Brendan wrote:- for tiny memory amounts (maybe less than 50 bytes) with a variable count you want to do a manual copy and avoid "rep movsb/w/d/q"
I'm not sure I understand what you mean by "manual copy". I presume you're describing the same technique as above, except using a loop in order to save space. If that's the case, I would recommend combining the two methods into a partially unrolled loop, where on Sandy Bridge the number of MOV instructions inside the loop should be much lower (again because of the micro-op cache). By the way, using MOVS without the REP prefix, as with all string instructions, is a waste of performance.
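A sketch of what I mean, assuming an unroll factor of four and a count that is a multiple of four QWORDs (both assumptions are illustrative and should be tuned per CPU):
Code: Select all
#include <stddef.h>
#include <stdint.h>

/* Partially unrolled copy loop: four 64-bit moves per iteration, so the
 * loop overhead is amortised without bloating the loop body too much. */
static void copy_unrolled(uint64_t *dst, const uint64_t *src, size_t qwords)
{
    for (size_t i = 0; i < qwords; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
}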
Brendan wrote:- for medium memory amounts (maybe between 100 bytes and 10 KiB) you want to use "rep movsb/w/d/q"
REP MOVSB and REP MOVSW should never be used if performance is the goal, so they're not really worth mentioning. Anyway, REP MOVS (with DWORD or QWORD operands) is always the technique you want to use on Nehalem and Sandy Bridge for blocks ranging anywhere from not very small to extremely big.
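For reference, a minimal sketch of such a copy using GCC/Clang inline assembly on x86-64 (assumes the direction flag is clear and the byte count is a multiple of 8):
Code: Select all
#include <stddef.h>

static inline void copy_rep_movsq(void *dst, const void *src, size_t bytes)
{
    size_t qwords = bytes / 8;
    /* RDI = destination, RSI = source, RCX = QWORD count. */
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(qwords)
                     :
                     : "memory");
}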
Brendan wrote:- for larger memory amounts (maybe between 20 KiB and 200 KiB) you start getting into trouble with cache pollution, and want to use CLFLUSH. In this case "rep movsb/w/d/q" is a bad idea because you end up doing tiny memory amounts between cache line flushes. You'd want to break it up into multiple "cache line sized" chunks and generate a series of "mov" instructions to do each small/fixed size chunk. It's also probably a good idea to use MMX or SSE (if present) in this case; and you probably want to consider using prefetch instructions too.
I think you're overcomplicating it. Non-temporal writes can be used to bypass the cache. On many newer CPUs this is not recommended for blocks smaller than half the L2 cache, so that's probably when you should start worrying.
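A sketch of the non-temporal approach with SSE2 intrinsics (MOVNTDQ via _mm_stream_si128), assuming the destination is 16-byte aligned and the size is a multiple of 16:
Code: Select all
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>

static void copy_nontemporal(void *dst, const void *src, size_t bytes)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));

    /* Streaming stores are weakly ordered; fence before anyone reads dst. */
    _mm_sfence();
}
The stores go to memory through write-combining buffers, so the copied data never evicts anything from the cache.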
Brendan wrote:- for huge memory amounts (maybe above 100 KiB) you want to find the programmer who "needs" this and crush their hands with a hammer, so that they can't type (or program) ever again. In this case you should focus on avoiding the need to copy huge amounts of memory rather than optimising the memcpy(). For a very simple example, imagine a text editor where the text file is stored in memory as one big string, and every time a character is inserted or deleted you need to copy a (potentially huge) amount of memory - simple solution is to store the text file in memory as a linked list of lines so that you only ever need to move from the cursor to the end of the line (maybe 200 bytes worst case, rather than 500 KiB worst case).
That's an overgeneralization. There are scenarios in which copying huge blocks is imperative, although none that I can think of apply to OS development.
Brendan wrote:Misaligned source and/or destination addresses are a major problem - it's tricky to get right, and it will kill the CPU's "rep movsb/w/d/q" optimisations. For cases with misalignment, you want to read an aligned value, shift it left or right, then write an aligned value. Of course this is also a good candidate for the "smash the programmer's fingers with a hammer" approach.
Things that come in handy are SSE2 (PSRLDQ/PSLLDQ/POR) and Supplemental SSE3 (PALIGNR, which is not recommended on AMD's Bobcat as it doesn't perform well there). If you have a sufficiently new CPU, add AVX and AMD's XOP to the list. Since the shift count is an immediate operand, you need 16 branches (one per shift count), unless you have XOP (VPPERM takes its selector in a register).
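To illustrate the read-aligned/shift/write-aligned idea, here is a sketch of one 16-byte chunk with PALIGNR for a source misaligned by 4 bytes (one of the 16 branches mentioned above, since the shift count must be an immediate):
Code: Select all
#include <tmmintrin.h>  /* SSSE3 */

/* Copies bytes 4..19 of the source region to an aligned destination:
 * two aligned loads are merged with PALIGNR, then stored with an
 * aligned write. */
static void copy16_offset4(__m128i *dst, const __m128i *src_aligned)
{
    __m128i lo = _mm_load_si128(src_aligned);      /* bytes  0..15 */
    __m128i hi = _mm_load_si128(src_aligned + 1);  /* bytes 16..31 */
    _mm_store_si128(dst, _mm_alignr_epi8(hi, lo, 4));
}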
Evidently, it is preferable to simply align your blocks so you don't have this problem to begin with. However, misalignment isn't as big a problem as it used to be, because newer CPUs have better cache interfaces.
Brendan wrote:Note: Far too many people benchmark their "memcpy()" implementations on large/huge memory sizes. This is stupid as people shouldn't be copying large/huge amounts of memory anyway.
As I've said earlier, I don't think this is true for every scenario. However, I think you are fully right about most situations.
Brendan wrote:Anyway, for an assembly language programmer it can be done, but you'd need a group of macros (one for "small" memory amounts, one for medium memory amounts, etc) where you use the preprocessor to detect if the count is a constant (and potentially if source and dest are constant and aligned), and not one generic (optimised for no specific case) "memcpy()" routine. You'd also need to test on a wide variety of different CPUs (and maybe have different versions for different CPUs), and if you want the best possible results in all cases then you can probably spend several months getting it right.
Optimizing compilers typically take this burden off the programmer's shoulders by picking the appropriate strategy for each case, at least in theory (vendors don't always care about implementing a bajillion edge cases). E.g., clang will even optimize something like
Code: Select all
int array[100], *ptr;  /* example declarations */
// Using a pointer is not required, but I just can't stop being a good programmer :)
for (ptr = array; ptr != &array[sizeof array / sizeof array[0]]; *ptr++ = 0);
into a call to memset, given non-floating-point types, which is far more complicated to implement by hand.
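That said, if you still want Brendan's macro-style dispatch in C, __builtin_constant_p (GCC/Clang) gives you the constant-count detection he describes. A sketch, where my_memcpy_small and my_memcpy_var are hypothetical routines and the threshold is illustrative:
Code: Select all
/* Compile-time dispatch: small constant counts take the unrolled path,
 * everything else falls back to the generic routine. */
#define MY_MEMCPY(dst, src, n)                       \
    (__builtin_constant_p(n) && (n) <= 100           \
         ? my_memcpy_small((dst), (src), (n))        \
         : my_memcpy_var((dst), (src), (n)))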