Test memcpy performance, please...

This forum is for OS project announcements, including project openings, new releases, update notices, test requests, and job openings (both paying and volunteer).
AlfaOmega08
Member
Posts: 226
Joined: Wed Nov 07, 2007 12:15 pm
Location: Italy

Test memcpy performance, please...

Post by AlfaOmega08 »

Hi, I wrote quite a standard memcpy implementation:

Code: Select all

void *memcpy(void *__restrict dst, const void *__restrict src, size_t count)
{
	char *__restrict s = (char *) src;
	char *__restrict d = (char *) dst;
	
	while (count-- > 0)
		*s++ = *d++;
		
	return dst;
}
And I'm using a makefile to compile it several times with different flags (use MMX, use SSE, use SSE2, use AVX, ...) to see the difference in performance between different extensions. I expect gcc to use the processor extension I specified in each command line. Unfortunately I do not own a Core i7 Sandy Bridge with AVX extension, so I couldn't test the last piece.

Can you please test this and report? If you have AVX, that's great too :)
Attachments
memcpy-perf.tar.bz2
(6.83 KiB) Downloaded 158 times
Please, correct my English...
Motherboard: ASUS Rampage II Extreme
CPU: Core i7 950 @ 3.06 GHz OC at 3.6 GHz
RAM: 4 GB 1600 MHz DDR3
Video: nVidia GeForce 210 GTS... it sucks...
JamesM
Member
Posts: 2935
Joined: Tue Jul 10, 2007 5:27 am
Location: York, United Kingdom
Contact:

Re: Test memcpy performance, please...

Post by JamesM »

Just checking - make sure you use -ftree-vectorize to enable GCC's autovectorizer. Else you won't get good SSE code.
jbemmel
Member
Posts: 53
Joined: Fri May 11, 2012 11:54 am

Re: Test memcpy performance, please...

Post by jbemmel »

Most memcpy implementations I know copy bytes from src to dest, i.e. *d++ = *s++

You could add "const" to s to catch this
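For reference, a sketch of the corrected loop with both fixes applied (copy direction reversed, source made const); the name my_memcpy is mine, to avoid clashing with the libc symbol:

```c
#include <stddef.h>

/* Corrected copy: read from the const source, write to the destination.
 * With s declared const, the original *s++ = *d++ typo would no longer
 * compile. */
void *my_memcpy(void *restrict dst, const void *restrict src, size_t count)
{
    const char *restrict s = (const char *) src;
    char *restrict d = (char *) dst;

    while (count-- > 0)
        *d++ = *s++;

    return dst;
}
```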
stlw
Member
Posts: 357
Joined: Fri Apr 04, 2008 6:43 am
Contact:

Re: Test memcpy performance, please...

Post by stlw »

The implementation you wrote is in C, so you're trusting the compiler to build the fastest code for you.
On the other hand, the same compiled binary will not necessarily be best for all processors.
The performance will vary. Don't expect AVX to give you any speedup here until Haswell: neither the Sandy Bridge nor the Bulldozer AVX implementation has a full 256-bit data path to the data cache, so using AVX instructions instead of SSE is not going to help.

But there is one implementation which is guaranteed to give the best performance, at least on all Intel processor generations since the Core 2 Duo.
And it is ... rep movsb. A smart compiler will emit the rep movsb instruction for a general-purpose memcpy.
The internal CPU implementation can differ in each processor; on Sandy Bridge, for example, it uses 128-bit reads and writes. On Ivy Bridge Intel also introduced the "fast string" extension, which makes memcpy even faster.
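A minimal sketch of the rep movsb approach, using GCC extended inline assembly (x86/x86-64 only; the function name is mine):

```c
#include <stddef.h>

/* Copy count bytes with a single REP MOVSB and let the CPU choose the
 * internal access width. RDI/RSI/RCX are updated by the instruction,
 * hence the "+" read-write constraints. */
void *memcpy_movsb(void *dst, const void *src, size_t count)
{
    void *ret = dst;
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (count)
                      : /* no extra inputs */
                      : "memory");
    return ret;
}
```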

Stanislav
AlfaOmega08
Member
Posts: 226
Joined: Wed Nov 07, 2007 12:15 pm
Location: Italy

Re: Test memcpy performance, please...

Post by AlfaOmega08 »

jbemmel wrote:Most memcpy implementations I know copy bytes from src to dest, i.e. *d++ = *s++

You could add "const" to s to catch this
Damn! Well, I wrote that very fast and didn't pay much attention. Thanks for catching that bug... #-o #-o #-o

So I added const to s (it generally allows the compiler to make new assumptions and enables some more complex optimizations, IIRC).
I added -ftree-vectorize as suggested by JamesM, but neither the timing nor the generated code changes (I just noticed that the code emitted for SSE2, SSE3, SSSE3, SSE4, SSE4.1, and SSE4.2 is identical).
stlw wrote:The performance will vary. Don't expect AVX to give you any speedup here until Haswell: neither the Sandy Bridge nor the Bulldozer AVX implementation has a full 256-bit data path to the data cache, so using AVX instructions instead of SSE is not going to help.
I hope I've misunderstood: do you mean that AVX is built on top of SSE in current implementations??
stlw wrote:But there is one implementation which is guaranteed to give the best performance, at least on all Intel processor generations since the Core 2 Duo.
And it is ... rep movsb.
REP MOVSB should be faster than SSE2?? Doubt mode on...

Anyway, here's my fixed archive. Sorry for the bad code :roll:
Attachments
memcpy-perf.tar.bz2
(6.86 KiB) Downloaded 183 times
Please, correct my English...
Motherboard: ASUS Rampage II Extreme
CPU: Core i7 950 @ 3.06 GHz OC at 3.6 GHz
RAM: 4 GB 1600 MHz DDR3
Video: nVidia GeForce 210 GTS... it sucks...
stlw
Member
Posts: 357
Joined: Fri Apr 04, 2008 6:43 am
Contact:

Re: Test memcpy performance, please...

Post by stlw »

stlw wrote:The performance will vary. Don't expect AVX to give you any speedup here until Haswell: neither the Sandy Bridge nor the Bulldozer AVX implementation has a full 256-bit data path to the data cache, so using AVX instructions instead of SSE is not going to help.
I hope I've misunderstood: do you mean that AVX is built on top of SSE in current implementations??
Yes, the data cache has no full 256-bit datapath in Sandy Bridge or Bulldozer. If you execute a 256-bit load, the hardware splits it into two 128-bit loads.
Happy reading of the optimization guide manuals.
stlw wrote:But there is one implementation which is guaranteed to give the best performance, at least on all Intel processor generations since the Core 2 Duo.
And it is ... rep movsb.
REP MOVSB should be faster than SSE2?? Doubt mode on...
You will have to measure different memcpy sizes, from small sizes up to several kilobytes.
Of course rep movsb won't always be faster, especially for smaller memory blocks, but in general it will be. Try and measure.
That's also because the internal implementation of rep movsb in the processor is already based on 128-bit memory accesses. Actually it can be even more optimal than that: in x86 there is no other way to write-combine stores, for example, without making them non-temporal.

Stanislav
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany
Contact:

Re: Test memcpy performance, please...

Post by Solar »

AlfaOmega08 wrote:Hi, I wrote quite a standard memcpy implementation...
Such a "naive" implementation is nice to have as a fallback. (No finger-pointing there; my PDCLib doesn't have better code yet either.)

But if you want to optimize things, and are willing to rely on GCC, the best thing you could do is to realize that GCC provides "__builtin_memcpy", and handles "memcpy" as a builtin too unless overruled...
Every good solution is obvious once you've found it.
Love4Boobies
Member
Posts: 2111
Joined: Fri Mar 07, 2008 5:36 pm
Location: Bucharest, Romania

Re: Test memcpy performance, please...

Post by Love4Boobies »

A (provably) optimal assembly implementation of memcpy takes about 500 LoC. I am too lazy to benchmark it right now, but someone (froggey from IRC) benchmarked my implementation of memset over one year ago (clicky). As you can see, there is potential to beat the pants off current compilers. At any rate, on Sandy Bridge it's fastest to use REP MOVSQ (it's likely that Intel will continue making this construct the fastest in future CPUs). And the biggest mistake is to unroll loops (due to the micro-op cache).
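For illustration, a sketch of the REP MOVSQ idea: copy 8-byte words, then mop up the remainder with REP MOVSB (x86-64 GCC inline assembly; the name and the remainder handling are my own, not the 500-line version mentioned above):

```c
#include <stddef.h>

/* Move count/8 qwords, then the remaining count%8 bytes. After the
 * first REP, RCX is zero, so it is reloaded with the remainder before
 * the byte copy. */
void *memcpy_movsq(void *dst, const void *src, size_t count)
{
    void *ret = dst;
    size_t qwords = count / 8;
    size_t rest = count % 8;
    __asm__ volatile ("rep movsq\n\t"
                      "mov %3, %%rcx\n\t"
                      "rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (qwords)
                      : "r" (rest)
                      : "memory");
    return ret;
}
```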
"Computers in the future may weigh no more than 1.5 tons.", Popular Mechanics (1949)
[ Project UDI ]
stlw
Member
Posts: 357
Joined: Fri Apr 04, 2008 6:43 am
Contact:

Re: Test memcpy performance, please...

Post by stlw »

berkus wrote:If you routinely copy over 8Kb of data you should be castrated anyway.
(IIRC CPU stops the copy at every 8Kb to check for interrupts).
There is no need to stop the copy at any point to check for interrupts.
The hardware knows to open an interrupt window when needed, and it doesn't take any execution cycles.

Stanislav
qw
Member
Posts: 792
Joined: Mon Jan 26, 2009 2:48 am

Re: Test memcpy performance, please...

Post by qw »

I wouldn't rely on __builtin_memcpy(). It might just as well yield a call to memcpy().

EDIT: Yep, Cygwin GCC 4.5.3 does this.

Code: Select all

#include <string.h>

void *test(void *__restrict__ d, const void *__restrict__ s, size_t n)
{
    return __builtin_memcpy(d, s, n);
}

Code: Select all

gcc -S -O3 test.c

Code: Select all

	.file	"test.c"
	.text
	.p2align 4,,15
.globl _test
	.def	_test;	.scl	2;	.type	32;	.endef
_test:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$8, %esp
	leave
	jmp	_memcpy
	.def	_memcpy;	.scl	2;	.type	32;	.endef
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany
Contact:

Re: Test memcpy performance, please...

Post by Solar »

Yes... because memcpy itself is also a builtin, unless you overrule it with -fno-builtin or a definition of your own.
Every good solution is obvious once you've found it.
qw
Member
Member
Posts: 792
Joined: Mon Jan 26, 2009 2:48 am

Re: Test memcpy performance, please...

Post by qw »

Solar,

My point is that one can't use __builtin_memcpy() instead of memcpy(), because the former may simply call the latter.
Using the GNU Compiler Collection wrote:Many of these functions are only optimized in certain cases; if they are not optimized in a particular case, a call to the library function will be emitted.
EDIT: Perhaps my wording wasn't clear enough. I mean one can't use __builtin_memcpy() without a "real" memcpy() defined somewhere. Also this, for example, will not work:

Code: Select all

void *memcpy(void *d, const void *s, size_t n)
{
    return __builtin_memcpy(d, s, n);
}
However convenient that would be.
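What does work, as far as I know, is the other way around: provide one real out-of-line definition, and compile that translation unit with -fno-builtin-memcpy (or -fno-builtin) so GCC cannot recognize the loop and turn it back into a call to memcpy itself. A sketch:

```c
/* The "real" memcpy that GCC's builtin falls back to when it does not
 * inline the copy. Build this file with -fno-builtin-memcpy so the
 * loop below is not itself optimized into a (recursive) memcpy call. */
#include <stddef.h>

void *memcpy(void *restrict dst, const void *restrict src, size_t n)
{
    const unsigned char *s = src;
    unsigned char *d = dst;

    while (n--)
        *d++ = *s++;

    return dst;
}
```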