jbemmel wrote: Most memcpy implementations I know copy bytes from src to dest, i.e. *d++ = *s++
You could add "const" to s to catch this.
Damn! Well, I wrote that very fast and didn't pay much attention. Thanks for spotting that bug...
So I added const to s (IIRC it also lets the compiler make some extra assumptions, which can enable more complex optimizations).
I added -ftree-vectorize as suggested by JamesM, but neither the timing nor the generated code changes (I also just noticed that the code emitted for SSE2, SSE3, SSSE3, SSE4, SSE4.1 and SSE4.2 is identical).
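For reference, here is a minimal sketch of the fixed byte loop. The name copy_bytes is made up for this post so that it does not clash with the real memcpy, and this is not the code from the archive. With s declared as a pointer to const, writing the copy backwards (*s++ = *d++) becomes a compile error instead of a silent bug.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name, chosen so it does not clash with the real memcpy. */
void *copy_bytes(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;  /* const: "*s++ = *d++" no longer compiles */

    assert(n == 0 || (d != NULL && s != NULL));

    while (n--)
        *d++ = *s++;               /* read from src, write to dest */

    return dest;                   /* same return convention as memcpy */
}
```

Like memcpy, this sketch assumes the two buffers do not overlap.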
stlw wrote: The performance will vary; don't expect AVX to give you any speedup here until Haswell. Neither the Sandy Bridge nor the Bulldozer AVX implementation has a full 256-bit data path to the data cache, so using AVX instructions instead of SSE is not going to help.
I hope I have misunderstood: do you mean that AVX is built on top of SSE in current implementations?
stlw wrote: But there is one implementation which is guaranteed to give the best performance, at least on all of Intel's processor generations since Core2Duo.
And it is ... rep movsb.
REP MOVSB should be faster than SSE2?? Doubt mode on...
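To put the claim to the test, something like the following could be dropped into the benchmark next to the SSE2 version. It is only a sketch, assuming GCC or Clang extended asm on x86 or x86-64; the name copy_rep_movsb is made up, and on any other target it falls back to a plain byte loop. It makes no promise about which variant is faster.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name; the asm path is GCC/Clang on x86 or x86-64 only. */
void *copy_rep_movsb(void *dest, const void *src, size_t n)
{
    void *ret = dest;

    assert(n == 0 || (dest != NULL && src != NULL));

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* rep movsb copies (R/E)CX bytes from [(R/E)SI] to [(R/E)DI].
     * The instruction modifies all three registers, so they are
     * read-write operands. The direction flag is assumed clear,
     * as the ABI requires at function entry. */
    __asm__ volatile ("rep movsb"
                      : "+D"(dest), "+S"(src), "+c"(n)
                      :
                      : "memory");
#else
    /* Portable fallback so the sketch still builds elsewhere. */
    unsigned char *d = dest;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif

    return ret;
}
```

The "memory" clobber tells the compiler that the asm reads and writes memory it cannot see through the operands, so it does not reorder or cache accesses to the buffers around the copy.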
Anyway, here's my fixed archive; sorry for the bad code.
Please, correct my English...
Motherboard: ASUS Rampage II Extreme
CPU: Core i7 950 @ 3.06 GHz OC at 3.6 GHz
RAM: 4 GB 1600 MHz DDR3
Video: nVidia GeForce 210 GTS... it sucks...