Hi,
For a long time, I've been using my own version of AVXmemcpy (and now AVX2memcpy, with 256-bit transfers). "Homemade" memcpy implementations seem to be popular on the forum, too. Copying 128 or 256 bits at a time is faster than 64-bit transfers, assuming your data is aligned.
But has anyone ever considered the cost of a context switch vs the benefits of the fast memcpy?
For example, in my OS, the AVX context is lazily saved only when a thread uses AVX instructions. So if no one is using AVX, context switches are fast. But if 2 threads start using AVXmemcpy, they both trigger a Device-Not-Available exception and the whole AVX context needs to be saved/restored. In the case of AVX2, that's a total of 480 bytes of transfer, twice, plus the whole overhead of the exception. So it seems to me that if those 2 threads were copying anything under 1K, it would actually be slower. Of course, if they did that 1000 times during their time-slice, it would be OK.
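To make the trade-off concrete, here is a minimal sketch (my own illustration, not from the post) of dispatching on size so that small copies never touch YMM state. The break-even threshold is an assumption and would have to be measured on real hardware:

```c
#include <stddef.h>
#include <string.h>

/* Assumed break-even point: below this, the lazy-save overhead
 * (the #NM exception plus saving/restoring the AVX context) is taken
 * to outweigh the wider transfers. Measure before trusting this number. */
#define AVX_BREAK_EVEN 1024

void *memcpy_dispatch(void *dest, const void *src, size_t n) {
    if (n < AVX_BREAK_EVEN) {
        /* Scalar copy: never triggers a Device-Not-Available fault. */
        unsigned char *d = dest;
        const unsigned char *s = src;
        while (n--)
            *d++ = *s++;
        return dest;
    }
    /* Large copy: worth materializing the AVX context once.
     * memcpy() stands in here for the hypothetical AVXmemcpy. */
    return memcpy(dest, src, n);
}
```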
Does anyone have an opinion on this? Surely I'm not the first one to think about it.
fast memcopy
Re: fast memcopy
As with most "optimizations", you can't target every possible scenario at the same time. You have to choose the best approach for the situation that you are targeting, and you have to measure the results to prove that the assumptions you made are, indeed, accurate.
If you want your code to be faster in a highly multi-threaded environment, then you should probably use fewer "shared" resources, like registers. If you want your code to run faster in a single-threaded environment, then you should use any resources that you can find that will make your code run faster.
The best advice that I can give is to implement all of these approaches as separate functions, and switch between them as needed. I would also expose all of them to the application layer, if at all possible, since the application will be in the best position to know which "optimization" works best for its usage pattern.
Or, better yet, let the user decide which method to use, preferably per-application... With a default setting for new applications...
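That suggestion could look something like this in C (a sketch of the idea only; the variant names and the function-pointer switch are my own illustration, not from the post):

```c
#include <stddef.h>
#include <string.h>

/* Expose every copy variant as a separate function and let the
 * application (or the user's per-application setting) pick one. */
typedef void *(*memcpy_fn)(void *dest, const void *src, size_t n);

static void *memcpy_bytewise(void *dest, const void *src, size_t n) {
    unsigned char *d = dest;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;     /* touches no SIMD state at all */
    return dest;
}

static void *memcpy_libc(void *dest, const void *src, size_t n) {
    return memcpy(dest, src, n);   /* stand-in for an SSE/AVX variant */
}

/* Default for new applications; overridable per application. */
static memcpy_fn active_memcpy = memcpy_bytewise;

void select_memcpy(memcpy_fn fn) { active_memcpy = fn; }

void *app_memcpy(void *dest, const void *src, size_t n) {
    return active_memcpy(dest, src, n);
}
```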
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott
Re: fast memcopy
Hi,
xmm15 wrote:Anyone has an opinion on this?
A good generic "memcpy()" is impossible. You need a "memcpy_small()" designed for tiny copies (where the start-up overhead is more significant than the actual copy, so you just do a "one byte at a time" loop); a "memcpy_medium()" designed for medium-sized copies that just uses (e.g.) "rep movsb"; plus a "memcpy_large()" just in case.
The best possible code for "memcpy_large()" is something like this:
Code: Select all
void *memcpy_large(void *dest, const void *src, size_t n) {
    if(n > 65536) {
        fprintf(stderr, "ERROR: Some noob failed to avoid a huge memory copy!\n");
        exit(EXIT_FAILURE);
    }
    return memcpy_medium(dest, src, n);
}
Basically; if you think SSE or AVX is going to help then you're solving the wrong problem.
Cheers,
Brendan
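For reference, the two helpers Brendan names could be sketched like this (my code, not his; the `rep movsb` inline assembly assumes GCC/Clang on x86, with the plain byte loop as a portable fallback):

```c
#include <stddef.h>

/* "memcpy_small()": tiny copies, where start-up overhead dominates,
 * so a one-byte-at-a-time loop is fine. */
static void *memcpy_small(void *dest, const void *src, size_t n) {
    unsigned char *d = dest;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dest;
}

/* "memcpy_medium()": medium-sized copies that just use "rep movsb". */
static void *memcpy_medium(void *dest, const void *src, size_t n) {
#if defined(__x86_64__) || defined(__i386__)
    void *d = dest;
    const void *s = src;
    /* rep movsb copies RCX bytes from [RSI] to [RDI]. */
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(s), "+c"(n)
                     :
                     : "memory");
    return dest;
#else
    return memcpy_small(dest, src, n);   /* portable fallback */
#endif
}
```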
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: fast memcopy
Brendan wrote:Hi,
xmm15 wrote:Anyone has an opinion on this?
A good generic "memcpy()" is impossible. You need a "memcpy_small()" designed for tiny copies (where the start-up overhead is more significant than the actual copy, so you just do a "one byte at a time" loop); a "memcpy_medium()" designed for medium-sized copies that just uses (e.g.) "rep movsb"; plus a "memcpy_large()" just in case.
The best possible code for "memcpy_large()" is something like this:
Code: Select all
void *memcpy_large(void *dest, const void *src, size_t n) {
    if(n > 65536) {
        fprintf(stderr, "ERROR: Some noob failed to avoid a huge memory copy!\n");
        exit(EXIT_FAILURE);
    }
    return memcpy_medium(dest, src, n);
}
Basically; if you think SSE or AVX is going to help then you're solving the wrong problem.
Cheers,
Brendan
LOL, gotta love this response.
- Schol-R-LEA
- Member
- Posts: 1925
- Joined: Fri Oct 27, 2006 9:42 am
- Location: Athens, GA, USA
Re: fast memcopy
Brendan wrote:The best possible code for "memcpy_large()" is something like this:
Code: Select all
void *memcpy_large(void *dest, const void *src, size_t n) {
    if(n > 65536) {
        fprintf(stderr, "ERROR: Some noob failed to avoid a huge memory copy!\n");
        exit(EXIT_FAILURE);
    }
    return memcpy_medium(dest, src, n);
}
Unfortunately, with the gargantuan sizes of things like image or audio data today, this may prove problematic. Still, I would say that anything larger than a page (whatever page size you are using; 4K in most cases) should be handed off to the memory mangler for lazy re-mapping - I'm thinking along the lines of mapping the two areas to a single set of read-only pages, then trapping attempted writes in such a manner that it forces copying of only the altered page. Still, better to avoid the issue whenever possible.
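The trap-on-write part of that idea can be modeled in userspace (a Linux-specific sketch of my own, not code from the thread): map a region read-only, catch the fault on the first write to each page, and unprotect only the touched page; the point where a kernel would copy the single altered page is marked in a comment.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;
static int pages_touched;

/* Fault handler: make just the page that was written to writable.
 * A real kernel would copy that one page here (copy-on-write). */
static void write_trap(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    void *base = (void *)(addr & ~((uintptr_t)page_size - 1));
    mprotect(base, page_size, PROT_READ | PROT_WRITE);
    pages_touched++;
}

/* Returns how many distinct pages took a write fault. */
int cow_demo(void) {
    page_size = sysconf(_SC_PAGESIZE);
    pages_touched = 0;

    struct sigaction sa = {0};
    sa.sa_sigaction = write_trap;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    unsigned char *region = mmap(NULL, 4 * page_size, PROT_READ,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    region[0] = 1;          /* faults once; handler unprotects page 0 */
    region[1] = 2;          /* same page: no second fault */
    region[page_size] = 3;  /* second page: one more fault */

    munmap(region, 4 * page_size);
    return pages_touched;
}
```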
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.