Test memcpy performance, please...
Posted: Tue Jun 05, 2012 12:51 pm
by AlfaOmega08
Hi, I wrote quite a standard memcpy implementation:
Code: Select all
void *memcpy(void *__restrict dst, const void *__restrict src, size_t count)
{
    char *__restrict s = (char *) src;
    char *__restrict d = (char *) dst;

    while (count-- > 0)
        *s++ = *d++;

    return dst;
}
And I'm using a makefile to compile it several times with different flags (use MMX, use SSE, use SSE2, use AVX, ...) to see the performance difference between the extensions. I expect gcc to use the processor extension I specify on each command line. Unfortunately I don't own a Core i7 Sandy Bridge with the AVX extension, so I couldn't test that last configuration.
Can you please test this and report back? If you have AVX, that's great too.
Re: Test memcpy performance, please...
Posted: Tue Jun 05, 2012 1:25 pm
by JamesM
Just checking - make sure you use -ftree-vectorize to enable GCC's autovectorizer. Else you won't get good SSE code.
Re: Test memcpy performance, please...
Posted: Tue Jun 05, 2012 1:53 pm
by jbemmel
Most memcpy implementations I know copy bytes from src to dest, i.e. *d++ = *s++; yours copies in the other direction.
You could add "const" to s so the compiler catches this.
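With the operands swapped and const added, the loop would look like this (a minimal sketch; the name memcpy_fixed is just for illustration, to avoid clashing with the libc symbol):

```c
#include <stddef.h>

/* Corrected direction: read from src, write to dst. */
void *memcpy_fixed(void *restrict dst, const void *restrict src, size_t count)
{
    const char *restrict s = src;
    char *restrict d = dst;

    while (count-- > 0)
        *d++ = *s++;

    return dst;
}
```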
Re: Test memcpy performance, please...
Posted: Tue Jun 05, 2012 2:11 pm
by stlw
The implementation you wrote is in C, so you are trusting the compiler to build the fastest code for you.
On the other hand, the same compiled binary won't necessarily be best for all processors.
The performance will vary. Don't expect AVX to give you any speedup here until Haswell: neither the Sandy Bridge nor the Bulldozer AVX implementation has a full 256-bit data path to the data cache, so using AVX instructions instead of SSE is not going to help.
But there is one implementation which is guaranteed to give the best performance, at least on all of Intel's processor generations since the Core 2 Duo.
And it is ... rep movsb. A smart compiler will emit the rep movsb instruction for a general-purpose memcpy.
The internal CPU implementation can differ per processor; on Sandy Bridge, for example, it will use 128-bit reads and writes. On Ivy Bridge, Intel also introduced the "fast string" extension (ERMSB), which makes memcpy even faster.
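For illustration, a rep movsb based memcpy can be written with GCC extended asm on x86/x86-64; the name memcpy_rep and the exact constraint choices below are just one way to do it, not taken from this thread:

```c
#include <stddef.h>

/* Sketch of a rep movsb memcpy using GCC extended asm (x86/x86-64 only).
   rep movsb copies RCX bytes from [RSI] to [RDI], advancing both pointers. */
void *memcpy_rep(void *dst, const void *src, size_t count)
{
    void *ret = dst;

    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(count)
                     :
                     : "memory");
    return ret;
}
```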
Stanislav
Re: Test memcpy performance, please...
Posted: Tue Jun 05, 2012 3:18 pm
by AlfaOmega08
jbemmel wrote:Most memcpy implementations I know copy bytes from src to dest, i.e. *d++ = *s++
You could add "const" to s to catch this
Damn! Well, I wrote that very fast and didn't pay much attention; thanks for catching that bug...
So I added const to s (IIRC it generally allows the compiler to make some extra assumptions and enables some more aggressive optimizations).
I added -ftree-vectorize as suggested by JamesM, but the timing doesn't change and neither does the code (I just noticed that the code emitted for SSE2, SSE3, SSSE3, SSE4, SSE4.1, and SSE4.2 is identical).
stlw wrote:The performance will vary, don't expect AVX to give you any speedup here until Haswell, both Sandy Bridge and Bulldozer AVX implementations don't have full 256-bit data path to data cache so using AVX instructions above SSE not going to help.
I hope I misunderstood: do you mean that AVX is built on top of the SSE data path in current implementations??
stlw wrote:But there is one implementation which is guaranteed to be best performance at least on all Intel's processor generations since Core2Duo.
And it is ... rep movsb.
REP MOVSB is supposed to be faster than SSE2?? Doubt mode on...
Anyway, here's my fixed archive; sorry for the bad code.
Re: Test memcpy performance, please...
Posted: Wed Jun 06, 2012 1:46 am
by stlw
stlw wrote:The performance will vary, don't expect AVX to give you any speedup here until Haswell, both Sandy Bridge and Bulldozer AVX implementations don't have full 256-bit data path to data cache so using AVX instructions above SSE not going to help.
I hope I misunderstood: do you mean that AVX is built on top of the SSE data path in current implementations??
Yes, the data cache has no full 256-bit datapath in Sandy Bridge or Bulldozer. If you need to execute a 256-bit load, the hardware will split it into two 128-bit loads.
Have fun reading the optimization guide manuals.
stlw wrote:But there is one implementation which is guaranteed to be best performance at least on all Intel's processor generations since Core2Duo.
And it is ... rep movsb.
REP MOVSB is supposed to be faster than SSE2?? Doubt mode on...
You will have to measure different memcpy sizes, from small sizes up to several kilobytes.
Of course rep movsb won't always be faster, especially for smaller memory blocks, but in general it will be. Try and measure.
That's also because the internal implementation of rep movsb in the processor is already based on 128-bit memory accesses. Actually it's even better than that: x86 has no other way to write-combine stores, for example, without making them non-temporal.
Stanislav
Re: Test memcpy performance, please...
Posted: Wed Jun 06, 2012 2:05 am
by Solar
AlfaOmega08 wrote:Hi, I wrote quite a standard memcpy implementation...
Such a "naive" implementation is nice to have as a fallback. (No finger-pointing here; my PDCLib doesn't have better code yet either.)
But if you want to optimize things, and are willing to rely on GCC, the best thing you can do is realize that GCC provides "__builtin_memcpy", and treats "memcpy" itself as a builtin too unless overruled...
Re: Test memcpy performance, please...
Posted: Wed Jun 06, 2012 4:19 am
by Love4Boobies
A (provably) optimal assembly implementation of memcpy takes about 500 LoC. I am too lazy to benchmark it right now, but someone (froggey from IRC) benchmarked my implementation of memset over a year ago (clicky). As you can see, there is potential to beat the pants off current compilers. At any rate, on Sandy Bridge it's fastest to use REP MOVSQ (it's likely that Intel will continue making this construct the fastest in future CPUs). And the biggest mistake is to unroll loops (due to the micro-op cache).
Re: Test memcpy performance, please...
Posted: Wed Jun 06, 2012 6:34 am
by stlw
berkus wrote:If you routinely copy over 8Kb of data you should be castrated anyway.
(IIRC CPU stops the copy at every 8Kb to check for interrupts).
There is no need to stop the copy at any point to check for interrupts.
The hardware knows to open an interrupt window when needed, and it doesn't take any execution cycles.
Stanislav
Re: Test memcpy performance, please...
Posted: Wed Jun 06, 2012 8:15 am
by qw
I wouldn't rely on __builtin_memcpy(). It might just as well yield a call to memcpy().
EDIT: Yep, Cygwin GCC 4.5.3 does this.
Code: Select all
#include <string.h>

void *test(void *__restrict__ d, const void *__restrict__ s, size_t n)
{
    return __builtin_memcpy(d, s, n);
}
Code: Select all
	.file	"test.c"
	.text
	.p2align 4,,15
.globl _test
	.def	_test;	.scl	2;	.type	32;	.endef
_test:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$8, %esp
	leave
	jmp	_memcpy
	.def	_memcpy;	.scl	2;	.type	32;	.endef
Re: Test memcpy performance, please...
Posted: Wed Jun 06, 2012 8:33 am
by Solar
Yes... because memcpy itself is also a builtin, unless you overrule it with -fno-builtin or a definition of your own.
Re: Test memcpy performance, please...
Posted: Thu Jun 07, 2012 1:45 am
by qw
Solar,
My point is that one can't use __builtin_memcpy() instead of memcpy(), because the former may simply call the latter.
Using the GNU Compiler Collection wrote:Many of these functions are only optimized in certain cases; if they are not optimized in a particular case, a call to the library function will be emitted.
EDIT: Perhaps my wording wasn't clear enough. I mean that one can't use __builtin_memcpy() without a "real" memcpy() defined somewhere. This, for example, will not work:
Code: Select all
#include <stddef.h>

void *memcpy(void *d, const void *s, size_t n)
{
    return __builtin_memcpy(d, s, n);
}
However convenient that would be.
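For the record, the usual way out is a plain loop that the compiler cannot collapse back into a memcpy call. A sketch (the name memcpy_fallback is illustrative; in a real libc this would be memcpy itself, and the file would be built with -fno-builtin to prevent self-recursion):

```c
#include <stddef.h>

/* Plain byte-wise fallback. In a freestanding libc this would be the
   real memcpy, compiled with -fno-builtin so GCC does not replace the
   loop with a call to memcpy (infinite recursion). */
void *memcpy_fallback(void *restrict dst, const void *restrict src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;

    return dst;
}
```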