fast memcopy

Programming, for all ages and all languages.
xmm15
Member
Posts: 27
Joined: Mon Dec 16, 2013 6:50 pm

fast memcopy

Post by xmm15 »

Hi,

For a long time, I've been using my own version of AVXmemcpy (and now an AVX2memcpy with 256-bit transfers). The "homemade" memcpy approach seems to be popular on the forum too. Copying 128 or 256 bits at a time is faster than 64-bit transfers, assuming your data is aligned.
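
For reference, a wide-register copy along these lines can be sketched in C with intrinsics. This is a hedged illustration, not the poster's actual code: the function names are made up, it assumes GCC or Clang on x86-64 (for the `target` attribute and `__builtin_cpu_supports`), and it only takes the 256-bit path when both pointers are 32-byte aligned, falling back to a byte loop otherwise.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes with 256-bit AVX2 loads/stores. Compiled for AVX2 via the
 * target attribute; only called after a runtime CPU-support check. */
__attribute__((target("avx2")))
static void copy_avx2(uint8_t *dst, const uint8_t *src, size_t n) {
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i v = _mm256_load_si256((const __m256i *)(src + i));
        _mm256_store_si256((__m256i *)(dst + i), v);
    }
    for (; i < n; i++)          /* tail: fewer than 32 bytes left */
        dst[i] = src[i];
}

void *avx2_memcpy(void *dst, const void *src, size_t n) {
    int aligned = (((uintptr_t)dst | (uintptr_t)src) & 31) == 0;
    if (aligned && __builtin_cpu_supports("avx2")) {
        copy_avx2(dst, src, n);
    } else {                    /* unaligned or no AVX2: plain byte loop */
        uint8_t *d = dst;
        const uint8_t *s = src;
        while (n--) *d++ = *s++;
    }
    return dst;
}
```

In kernel code you would additionally have to worry about whether AVX state is even enabled for the current context, which is exactly the issue raised below.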

But has anyone ever considered the cost of a context switch vs the benefits of the fast memcpy?

For example, in my OS the AVX context is lazily saved only when a thread uses AVX instructions. So if no one is using AVX, context switches are fast. But if two threads both start using AVXmemcpy, each of them triggers a Device Not Available (#NM) exception, and the whole AVX context has to be saved and restored. In the case of AVX2 that's a total of 480 bytes of transfer, times two, plus all the overhead of the exception itself. So it seems to me that if those two threads were copying anything under 1K, it would actually be slower. Of course, if they did that 1000 times during their time slice, it would be ok.
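
The trade-off can be made concrete with back-of-envelope arithmetic. Every number below is an illustrative assumption, not a measurement: the #NM cost, the cycles-per-byte figures, and the state-copy cost all vary by CPU and should be benchmarked on real hardware.

```c
#include <assert.h>

/* Break-even estimate for lazy AVX state switching. All cycle counts
 * are assumed placeholder values; tune them for your own CPU. */
enum {
    NM_EXCEPTION_CYCLES = 300,  /* #NM trap + handler entry/exit (assumed) */
    XSAVE_BYTES         = 480,  /* AVX2 state size mentioned in the post   */
    STATE_COPY_CPB      = 1,    /* cycles per byte to save or restore      */
    GP_COPY_CPB_X10     = 10,   /* 64-bit copy: ~1.0 cycle/byte (x10 fixed-point) */
    AVX_COPY_CPB_X10    = 3     /* 256-bit copy: ~0.3 cycle/byte (x10)     */
};

/* Smallest copy size (bytes) at which AVX pays off, given that touching
 * AVX forces one #NM exception plus a save AND a restore of the state. */
long avx_break_even_bytes(void) {
    long overhead = NM_EXCEPTION_CYCLES
                  + 2L * XSAVE_BYTES * STATE_COPY_CPB;       /* save + restore */
    long saved_per_10_bytes = GP_COPY_CPB_X10 - AVX_COPY_CPB_X10; /* 0.7 cyc/byte */
    return overhead * 10 / saved_per_10_bytes;
}
```

With these assumed numbers the break-even point lands around 1.8 KB, which is at least consistent with the intuition above that copies under 1K come out slower once the lazy-save cost is charged to them.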

Anyone has an opinion on this? Surely, I'm not the first one who thought about that.
SpyderTL
Member
Posts: 1074
Joined: Sun Sep 19, 2010 10:05 pm

Re: fast memcopy

Post by SpyderTL »

As with most "optimizations", you can't target every possible scenario at the same time. You have to choose the best approach for the situation you are targeting, and you have to measure the results to prove that the assumptions you made are, indeed, accurate.

If you want your code to be faster in a highly multi-threaded environment, then you should probably use fewer "shared" resources, like registers. If you want your code to run faster in a single-threaded environment, then you should use any resources that you can find that will make your code run faster.

The best advice that I can give is to implement all of these approaches as separate functions, and switch between them as needed. I would also expose all of them to the application layer, if at all possible, since the application will be in the best position to know which "optimization" works best for its usage pattern.

Or, better yet, let the user decide which method to use, preferably per-application... With a default setting for new applications...
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: fast memcopy

Post by Brendan »

Hi,
xmm15 wrote:Anyone has an opinion on this?
A good generic "memcpy()" is impossible. You need a "memcpy_small()" designed for tiny copies (where the start-up overhead is more significant than the actual copy so you just do a "one byte at a time" loop); a "memcpy_medium()" designed for medium sized copies that just uses (e.g.) "rep movsb"; plus a "memcpy_large()" just in case.

The best possible code for "memcpy_large()" is something like this:

Code:

void *memcpy_large(void *dest, const void *src, size_t n) {
    if(n > 65536) {
        fprintf(stderr, "ERROR: Some noob failed to avoid a huge memory copy!\n");
        exit(EXIT_FAILURE);
    }
    return memcpy_medium(dest, src, n);
}
Basically; if you think SSE or AVX is going to help then you're solving the wrong problem.
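
The small/medium split described above can be sketched as follows. This is an editor's illustration, not Brendan's code: the 32-byte cutoff is an assumed threshold, and the `rep movsb` path really belongs behind an ERMS (Enhanced REP MOVSB) CPUID check in production code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Tiny copies: start-up overhead dominates, so a byte loop is fine. */
static void *memcpy_small(void *dest, const void *src, size_t n) {
    uint8_t *d = dest;
    const uint8_t *s = src;
    while (n--)
        *d++ = *s++;
    return dest;
}

/* Medium copies: "rep movsb" is fast on CPUs with ERMS (check the
 * CPUID feature bit in real code). Portable fallback elsewhere. */
static void *memcpy_medium(void *dest, const void *src, size_t n) {
#if defined(__x86_64__) || defined(__i386__)
    void *d = dest;
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy_small(dest, src, n);
#endif
    return dest;
}

/* Dispatcher; the 32-byte threshold is an assumption, measure it. */
void *my_memcpy(void *dest, const void *src, size_t n) {
    return (n < 32) ? memcpy_small(dest, src, n)
                    : memcpy_medium(dest, src, n);
}
```

Note that neither path touches SSE/AVX registers, so neither triggers a lazy-state #NM exception, which is the point of the argument above.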


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
kzinti
Member
Posts: 898
Joined: Mon Feb 02, 2015 7:11 pm

Re: fast memcopy

Post by kzinti »

=)
tsdnz
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: fast memcopy

Post by tsdnz »

LOL, gotta love this response.
Brendan wrote:Hi,
xmm15 wrote:Anyone has an opinion on this?
A good generic "memcpy()" is impossible. You need a "memcpy_small()" designed for tiny copies (where the start-up overhead is more significant than the actual copy so you just do a "one byte at a time" loop); a "memcpy_medium()" designed for medium sized copies that just uses (e.g.) "rep movsb"; plus a "memcpy_large()" just in case.

The best possible code for "memcpy_large()" is something like this:

Code:

void *memcpy_large(void *dest, const void *src, size_t n) {
    if(n > 65536) {
        fprintf(stderr, "ERROR: Some noob failed to avoid a huge memory copy!\n");
        exit(EXIT_FAILURE);
    }
    return memcpy_medium(dest, src, n);
}
Basically; if you think SSE or AVX is going to help then you're solving the wrong problem.


Cheers,

Brendan
Schol-R-LEA
Member
Posts: 1925
Joined: Fri Oct 27, 2006 9:42 am
Location: Athens, GA, USA

Re: fast memcopy

Post by Schol-R-LEA »

Brendan wrote: The best possible code for "memcpy_large()" is something like this:

Code:

void *memcpy_large(void *dest, const void *src, size_t n) {
    if(n > 65536) {
        fprintf(stderr, "ERROR: Some noob failed to avoid a huge memory copy!\n");
        exit(EXIT_FAILURE);
    }
    return memcpy_medium(dest, src, n);
}
Unfortunately, with the gargantuan sizes of things like image or audio data today, this may prove problematic. Still, I would say that anything larger than a page (whatever page size you are using, 4K in most cases) should be handed off to the memory mangler for lazy re-mapping - I'm thinking along the lines of mapping the two areas to a single set of read-only pages, then trapping attempted writes in such a manner that only the altered page gets copied. Still, better to avoid the issue whenever possible.
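
Incidentally, the kernel's fork() path uses exactly the trick described here: parent and child share read-only pages until one of them writes, at which point only the faulting page is copied. The POSIX-only demo below (function name made up for illustration) shows that semantics from userspace: the child's write is invisible to the parent because the write fault copied the page.

```c
#include <assert.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Demonstrates copy-on-write semantics via fork(): after the fork both
 * processes share the page holding buf read-only; the child's write
 * triggers a fault that copies that one page, leaving the parent's
 * copy untouched. */
int cow_demo(void) {
    static char buf[4096];
    memset(buf, 'A', sizeof buf);

    pid_t pid = fork();
    if (pid == 0) {             /* child: write triggers the COW fault */
        buf[0] = 'B';
        _exit(buf[0] == 'B' ? 0 : 1);
    }
    int status = 0;
    waitpid(pid, &status, 0);

    /* Child saw its private copy change; parent's page is unchanged. */
    return WIFEXITED(status) && WEXITSTATUS(status) == 0 && buf[0] == 'A';
}
```

Doing the same thing for memcpy inside a kernel means aliasing source and destination onto shared read-only mappings and copying pages in the write-fault handler, as suggested above.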
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.