Hi,
For a long time, I've been using my own version of AVXmemcpy (and now AVX2memcpy, with 256-bit transfers). "Homemade" memcpy implementations seem to be popular on the forum, too. Copying 128 or 256 bits at a time is faster than 64-bit transfers, assuming your data is aligned.
But has anyone ever considered the cost of a context switch vs the benefits of the fast memcpy?
For example, in my OS, the AVX context is lazily saved only when a thread uses AVX instructions. So if no one is using AVX, context switches are fast. But if 2 threads start using AVXmemcpy, they both trigger a Device-Not-Available exception and the whole AVX context needs to be saved/restored. In the case of AVX2, that's a total of 480 bytes of transfer, twice, plus the whole overhead of the exception. So it seems to me that if those 2 threads were copying anything under 1K, it would actually be slower. Of course, if they did that 1000 times during their time-slice, it would be OK.
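To make the trade-off concrete, here is a minimal sketch (my own illustration, not from the post) of dispatching on size so that small copies never touch YMM state. The break-even threshold is an assumption and would have to be measured on real hardware:

```c
#include <stddef.h>
#include <string.h>

/* Assumed break-even point: below this, the lazy-save overhead
 * (the #NM exception plus saving/restoring the AVX context) is taken
 * to outweigh the wider transfers. Measure before trusting this number. */
#define AVX_BREAK_EVEN 1024

void *memcpy_dispatch(void *dest, const void *src, size_t n) {
    if (n < AVX_BREAK_EVEN) {
        /* Scalar copy: never triggers a Device-Not-Available fault. */
        unsigned char *d = dest;
        const unsigned char *s = src;
        while (n--)
            *d++ = *s++;
        return dest;
    }
    /* Large copy: worth materializing the AVX context once.
     * memcpy() stands in here for the hypothetical AVXmemcpy. */
    return memcpy(dest, src, n);
}
```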
Does anyone have an opinion on this? Surely I'm not the first one to think about it.
fast memcopy
Re: fast memcopy
As with most "optimizations", you can't target every possible scenario at the same time. You have to choose the best approach for the situation that you are targeting, and you have to measure the results to prove that the assumptions you made are, indeed, accurate.
If you want your code to be faster in a highly multi-threaded environment, then you should probably use fewer "shared" resources, like registers. If you want your code to run faster in a single-threaded environment, then you should use any resources that you can find that will make your code run faster.
The best advice that I can give is to implement all of these approaches as separate functions, and switch between them as needed. I would also expose all of them to the application layer, if at all possible, since the application will be in the best position to know which "optimization" works best for its usage pattern.
Or, better yet, let the user decide which method to use, preferably per-application... With a default setting for new applications...
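That suggestion could look something like this in C (a sketch of the idea only; the variant names and the function-pointer switch are my own illustration, not from the post):

```c
#include <stddef.h>
#include <string.h>

/* Expose every copy variant as a separate function and let the
 * application (or the user's per-application setting) pick one. */
typedef void *(*memcpy_fn)(void *dest, const void *src, size_t n);

static void *memcpy_bytewise(void *dest, const void *src, size_t n) {
    unsigned char *d = dest;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;     /* touches no SIMD state at all */
    return dest;
}

static void *memcpy_libc(void *dest, const void *src, size_t n) {
    return memcpy(dest, src, n);   /* stand-in for an SSE/AVX variant */
}

/* Default for new applications; overridable per application. */
static memcpy_fn active_memcpy = memcpy_bytewise;

void select_memcpy(memcpy_fn fn) { active_memcpy = fn; }

void *app_memcpy(void *dest, const void *src, size_t n) {
    return active_memcpy(dest, src, n);
}
```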
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott
Re: fast memcopy
Hi,
xmm15 wrote:Anyone has an opinion on this?
A good generic "memcpy()" is impossible. You need a "memcpy_small()" designed for tiny copies (where the start-up overhead is more significant than the actual copy, so you just do a "one byte at a time" loop); a "memcpy_medium()" designed for medium-sized copies that just uses (e.g.) "rep movsb"; plus a "memcpy_large()" just in case.
The best possible code for "memcpy_large()" is something like this:
Code: Select all
void *memcpy_large(void *dest, const void *src, size_t n) {
    if(n > 65536) {
        fprintf(stderr, "ERROR: Some noob failed to avoid a huge memory copy!\n");
        exit(EXIT_FAILURE);
    }
    return memcpy_medium(dest, src, n);
}
Basically; if you think SSE or AVX is going to help then you're solving the wrong problem.
Cheers,
Brendan
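For reference, the two helpers Brendan names could be sketched like this (my code, not his; the `rep movsb` inline assembly assumes GCC/Clang on x86, with the plain byte loop as a portable fallback):

```c
#include <stddef.h>

/* "memcpy_small()": tiny copies, where start-up overhead dominates,
 * so a one-byte-at-a-time loop is fine. */
static void *memcpy_small(void *dest, const void *src, size_t n) {
    unsigned char *d = dest;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dest;
}

/* "memcpy_medium()": medium-sized copies that just use "rep movsb". */
static void *memcpy_medium(void *dest, const void *src, size_t n) {
#if defined(__x86_64__) || defined(__i386__)
    void *d = dest;
    const void *s = src;
    /* rep movsb copies RCX bytes from [RSI] to [RDI]. */
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(s), "+c"(n)
                     :
                     : "memory");
    return dest;
#else
    return memcpy_small(dest, src, n);   /* portable fallback */
#endif
}
```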
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: fast memcopy
Brendan wrote:Hi,
xmm15 wrote:Anyone has an opinion on this?
A good generic "memcpy()" is impossible. You need a "memcpy_small()" designed for tiny copies (where the start-up overhead is more significant than the actual copy, so you just do a "one byte at a time" loop); a "memcpy_medium()" designed for medium-sized copies that just uses (e.g.) "rep movsb"; plus a "memcpy_large()" just in case.
The best possible code for "memcpy_large()" is something like this:
Code: Select all
void *memcpy_large(void *dest, const void *src, size_t n) {
    if(n > 65536) {
        fprintf(stderr, "ERROR: Some noob failed to avoid a huge memory copy!\n");
        exit(EXIT_FAILURE);
    }
    return memcpy_medium(dest, src, n);
}
Basically; if you think SSE or AVX is going to help then you're solving the wrong problem.
Cheers,
Brendan
LOL, gotta love this response.
- Schol-R-LEA
- Member
- Posts: 1925
- Joined: Fri Oct 27, 2006 9:42 am
- Location: Athens, GA, USA
Re: fast memcopy
Brendan wrote:The best possible code for "memcpy_large()" is something like this:
Code: Select all
void *memcpy_large(void *dest, const void *src, size_t n) {
    if(n > 65536) {
        fprintf(stderr, "ERROR: Some noob failed to avoid a huge memory copy!\n");
        exit(EXIT_FAILURE);
    }
    return memcpy_medium(dest, src, n);
}
Unfortunately, with the gargantuan sizes of things like image or audio data today, this may prove problematic. Still, I would say that anything larger than a page (whatever page size you are using; 4K in most cases) should be handed off to the memory mangler for lazy re-mapping - I'm thinking along the lines of mapping the two areas to a single set of read-only pages, then trapping attempted writes in such a manner that it forces copying of only the altered page. Still, better to avoid the issue whenever possible.
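The trap-on-write part of that idea can be modeled in userspace (a Linux-specific sketch of my own, not code from the thread): map a region read-only, catch the fault on the first write to each page, and unprotect only the touched page; the point where a kernel would copy the single altered page is marked in a comment.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;
static int pages_touched;

/* Fault handler: make just the page that was written to writable.
 * A real kernel would copy that one page here (copy-on-write). */
static void write_trap(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    void *base = (void *)(addr & ~((uintptr_t)page_size - 1));
    mprotect(base, page_size, PROT_READ | PROT_WRITE);
    pages_touched++;
}

/* Returns how many distinct pages took a write fault. */
int cow_demo(void) {
    page_size = sysconf(_SC_PAGESIZE);
    pages_touched = 0;

    struct sigaction sa = {0};
    sa.sa_sigaction = write_trap;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    unsigned char *region = mmap(NULL, 4 * page_size, PROT_READ,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    region[0] = 1;          /* faults once; handler unprotects page 0 */
    region[1] = 2;          /* same page: no second fault */
    region[page_size] = 3;  /* second page: one more fault */

    munmap(region, 4 * page_size);
    return pages_touched;
}
```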
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.