Hi,
simeonz wrote:Brendan wrote:However, bad code (that fails to avoid copying large amounts of data) is going to be slow regardless of how "memcpy()" is implemented because it's almost impossible to avoid the "RAM bandwidth" bottleneck for almost all cases (excluding extremely rare cases where, in theory, you could use "copy on write" paging tricks); and for almost all good code the amount of data copied by each individual "memcpy()" call is typically less than 1 KiB (and often the mean size is around 16 bytes). This implies that a good "memcpy()" should be optimised for tiny/small copies.
To be honest, I am a little concerned about the memory bottleneck argument, because it is a very important argument. In theory, you could have memory transfers interspersed with data processing, and the overall speed will depend on the rate of retiring instructions in the memory transfer portion as well as in the processing portion.
You're on the right track - one of the best ways to get rid of a memory copy is to merge it with whatever code touched the source previously, or merge it with whatever code will touch the destination next. This either avoids additional reads, or avoids additional writes; and the majority of 80x86 CPUs are "out-of-order enough" to end up doing processing and memory transfers in parallel.
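As a rough illustration (plain C, function names made up); the first version reads the source, writes the destination, then reads and writes the destination a second time, while the merged version touches each byte exactly once and lets an out-of-order CPU overlap the loads, the arithmetic and the stores:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Separate copy and processing: memcpy() reads src and writes dst,
   then the loop reads dst and writes it again. */
static void copy_then_brighten(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(dst[i] + 16);
}

/* Merged: src is read once and dst is written once, so the copy
   "disappears" into the processing. */
static void brighten_into(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(src[i] + 16);
}
```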
simeonz wrote:And again in theory, the processing gives enough time for the cache to lazily flush the data to the next level and ultimately to memory. But the Intel cache to my knowledge is write-back only and will flush data on eviction, unless explicitly instructed to do so immediately.
Since the 80486, Intel's caches have been controlled by global flags in CR0 and "per page" flags in page table entries that can be used to either disable caches or disable write-back/enable write-through. Then they added MTRRs (in the Pentium Pro) to allow caching to be controlled on a "per physical range" basis, then added support for write combining (in Pentium 4 I think). Of course other 80x86 CPU manufacturers (e.g. AMD, VIA) copied everything and support all of it; but Cyrix had something like MTRRs before Intel did (with a different name and a different/incompatible access mechanism), and AMD added some extra stuff to MTRRs (mostly shifting some of the memory controller configuration from the chipset into MTRRs).
The result of all of this is that for all 80x86 CPUs made in the last 10 years or more; there are multiple ways to configure an area as either "write-back" (default for RAM), "write-through", "write combining" or "uncached" (default for memory mapped devices).
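To make the "per page" part concrete; a minimal sketch of the cache-control bits in a 4 KiB page table entry (bit positions are from the Intel/AMD manuals, the helper name is made up; note that getting "write combining" on a per-page basis needs the IA32_PAT MSR to be reprogrammed first, because the power-on default PAT only contains write-back, write-through and uncached types):

```c
#include <stdint.h>

#define PTE_PWT (1ULL << 3)   /* page-level write-through */
#define PTE_PCD (1ULL << 4)   /* page-level cache disable */
#define PTE_PAT (1ULL << 7)   /* PAT bit for 4 KiB pages (bit 12 for large pages) */

/* Hypothetical helper: mark an existing PTE as uncached, e.g. for a
   memory mapped device (PCD=1, PWT=1 selects "UC" with the default PAT). */
static inline uint64_t pte_make_uncached(uint64_t pte)
{
    return pte | PTE_PCD | PTE_PWT;
}
```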
There are also a few variations of prefetch instructions, non-temporal moves (which use "write combining" behaviour for individual writes) and CLFLUSH (which explicitly flushes a cache line).
For large memory copies; you mostly want to prefetch from the source (note: hardware prefetchers don't work across page boundaries) and use CLFLUSH to explicitly evict cache lines from both the source and the destination after you're finished with them (to avoid leaving the cache full of data that won't be used again "soon" and causing everything that is more likely to be used again to be evicted from cache). For huge memory copies; you might also want to trick the CPU into prefetching TLB entries (e.g. using a dummy load, like "cmp eax,[somewhere]", where nothing depends on the instruction); but recent CPUs do have "TLB hardware prefetchers" making this unnecessary, and it shouldn't be done on CPUs that aren't "out-of-order" (e.g. first generation Atom).
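A byte-at-a-time sketch of that structure (a real version would use wide SSE/AVX moves; the memcpy_large name, the 64-byte line size and the prefetch distance are assumptions, not anything standard):

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   /* _mm_prefetch */
#include <emmintrin.h>   /* _mm_clflush */

#define LINE 64          /* assumed cache line size */

static void memcpy_large(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    size_t off;

    for (off = 0; off + LINE <= n; off += LINE) {
        /* Prefetch a few lines ahead in the source; the hardware prefetcher
           won't follow us across a page boundary, and prefetching past the
           end of the buffer is harmless because it's only a hint. */
        _mm_prefetch((const char *)(s + off + 4 * LINE), _MM_HINT_NTA);

        for (size_t i = 0; i < LINE; i++)
            d[off + i] = s[off + i];

        /* Evict the lines we're finished with so the copy doesn't push
           more useful data out of the cache. */
        _mm_clflush(s + off);
        _mm_clflush(d + off);
    }
    for (; off < n; off++)       /* tail smaller than one cache line */
        d[off] = s[off];
}
```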
simeonz wrote:And the eviction will be triggered in the transfer portion of the code, not the processing portion. Some AMD CPUs apparently use a write-through cache strategy according to this SO answer, and rely on the write combining buffers to avoid redundant transfers.
That's only talking about (e.g.) the implementation of the L1 cache, not the overall behaviour (including L2 and L3). For those AMD CPUs, if a page uses "write-back caching" then the L1 cache will "write through" to L2 (but will not "write through" to L3 or RAM).
However this is mixing different concepts (the overall behaviour, and implementation details of one of several caches in the middle).
For better terminology; Intel CPUs have typically used "exclusive caches" (where a cache line can not be in L1 and L2 at the same time) and AMD CPUs have typically used "inclusive caches" (where a cache line can be in both L1 and L2 at the same time). For "inclusive caches", to maintain cache coherency, if you modify the data in a cache line you must also update or evict any/all copies in other caches.
simeonz wrote:If so, it means that they can exploit CPU to memory parallelism better and the speed of retiring instructions in the transfer portion becomes relevant for them. Another argument could be made that multi-channel DDR3 RAM offers a lot of performance, but considering that you could retire multiple instructions per CPU cycle and those are MOVs and branches, and they are already kept decoded in the instruction cache, I don't think this is a valid argument.
An argument can be made that there are 4 or more CPUs (plus maybe an integrated GPU, plus bus mastering PCI devices for SATA and network and whatever else) all pounding the RAM at the same time; and that each individual CPU can only expect a fraction of the total memory bandwidth.
simeonz wrote:Brendan wrote:It also means that for some special cases "memcpy()" shouldn't be used at all. This includes code that is using SSE/AVX anyway (where the data is guaranteed to be aligned nicely, etc); and code that copies an entire buffer of pixels to the video card's frame buffer, where you should probably have a function pointer to one of many implementations (where the function that's selected when the video mode is set depends on things like if there's padding between rows of pixels in the video card's frame buffer, if each row of pixels is nicely aligned, etc) and where (especially for discrete video cards) you should try to avoid copying data that didn't change (dirty rectangles, "line changed flags", ...); and things like "fork()" where you should use paging tricks and copy nothing.
Even setting aside the cases in which the software has to specialize, the runtime would still need to have many variants to be suitable for most situations. Too many to be used comfortably in practice - to cater for alignment, hot and cold memory, VM pseudo-initialization, etc.
For "extreme specialisation", yes; but there's a lot of area between "no specialisation" and "extreme specialisation". For example, what if you provided "memcpy()" for when you expect the amount of memory being copied will be relatively small, plus a "memcpy_large()" for when you expect the amount of memory being copied will be relatively large? You'd be able to make a significant difference with very little specialisation.
simeonz wrote:Source level libraries can actually do better, since the interface can be simplified with generics (preferably) or macros, and by relying on decent LTO. But source level libraries have no hardware information at their disposal. Not with AOT compilation.
That depends (e.g. consider something like Gentoo Linux, where almost everything is AOT compiled from source when installed). In this case it's not the hardware information that libraries lack, it's usage information (will the caller be copying aligned data, and how much, and how often).
simeonz wrote:There are however justified uses of memcpy in my opinion.
Yes, for smaller memory copies (and not huge memory copies).
simeonz wrote:For example, applications that do not use zero-copy I/O receive a duplicate of a memory range from the system cache into a user-space buffer.
If a process doesn't care about performance and doesn't use things like memory mapped files, etc; then it deserves whatever (poor) performance it gets; and I'd argue that deliberately making performance as bad as possible is a good way to encourage developers to write better software (and/or encourage users to replace the bad software with better software) and improve performance in the long run.
simeonz wrote:Brendan wrote:Note that most of the above applies to "memset()" too. For "memset()" the special cases (where "memset()" shouldn't be used) are things like filling an executable's ".bss" area with zeros (where you should use "allocate on write" paging tricks), filling an individual page with zeros (e.g. in the page fault handler when an "allocate on write" page is written to), etc. Also; for filling (larger) areas with zeros it's worth mentioning AMD's new "CLZERO" instruction, which is designed to fill an entire cache line (64 bytes) with zeros in a way that avoids the cost of fetching the cache line's previous contents (e.g. for filling a 4096-byte page with zeros in the page fault handler, I'd strongly consider using a linear sequence of 64 CLZERO instructions with no loop at all, and I'd consider having multiple different implementations of the page fault handler where during boot the kernel sets the IDT entry to point to the best page fault handler depending on various things, including if CLZERO is supported or not and including if cache line size is 64 bytes or something else).
Reading that paragraph, I am realizing now that Intel has no easy mechanism to discard cache lines or to write to cache without performing read-for-write. Or is there?
For discarding individual cache lines there's CLFLUSH.
Writing to part of a cache line (e.g. writing 8 bytes in a 64 byte cache line) means that the unmodified bytes must be read. The only cases where read-for-write can be avoided are when the stored data is not cached ("uncached", "write combining", non-temporal stores); or when the entire cache line is being modified (CLZERO, "rep movsb/d/q" and AVX512).
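For the "not cached" escape hatch, a minimal sketch using non-temporal stores (the helper name is made up; on CPUs that support it, CLZERO would do the same thing for a whole line in one instruction) - every 64-byte line is written in full and the data bypasses the cache, so the lines' previous contents are never fetched:

```c
#include <stdint.h>
#include <xmmintrin.h>   /* _mm_sfence */
#include <emmintrin.h>   /* _mm_setzero_si128, _mm_stream_si128 */

/* Zero one 4096-byte, 16-byte aligned page with non-temporal stores:
   no read-for-write, and nothing useful gets evicted from the cache. */
static void zero_page_nt(void *page)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)page;

    for (int i = 0; i < 4096 / 16; i++)
        _mm_stream_si128(&p[i], zero);

    _mm_sfence();   /* make the weakly ordered stores globally visible */
}
```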
simeonz wrote:I had the delusion that write combining store buffers might skip read-for-writes if the code is writing out entire cache lines (in cached regions), but now I am thinking that this will not happen. The CPU cannot know if the write sequence will not end before filling the cache line, thus causing the read to be performed anyway, potentially stalling the pipeline. So, without explicit instructions this is not generally possible.
Why?
You do a bunch of little writes to a "write combining" area, and they get stored into the CPU's write combining buffer (ignoring cache) where they're combined if possible. Later (when the CPU needs to evict/flush part of the write combining buffer, hopefully after you've finished writing to that area) the CPU tries to store the "combined writes" using a small number of larger stores and knows exactly how large each store is (and knows if it does/doesn't fill an entire cache line).
Cheers,
Brendan