Hey, I decided to attempt to make an SSE-optimized (and maybe MMX later) library of core functions such as memcpy, memcmp, strlen, strcmp, memmove, etc...
I'm trying to weigh non-temporal moves, which need serializing fences to keep their ordering sane, against regular (temporal) moves that can take full advantage of out-of-order execution. I'm using PREFETCHNTA before moving large blocks and a cache-line flush (CLFLUSH) at the end of each major iteration, which I hope somewhat balances the scale.
I'd like to post a few functions here and, hopefully as a bit of a by-proxy team effort, have the code scrutinized into an efficient function.
I am aware of existing libraries out there that do similar things, but I'd like to make one especially for small OS's and optimize the code for easy porting (to different OS's, not different architectures). Well, let's get started. Here's the big hitter, the memcpy! (I'm aware it's not 100% POSIX compliant, but I never said it would be.)
This function gets called by a stub (memcpy) when the stub detects SSE 4.1 support. There are currently three memcpys this stub can call: memcpy_std (standard, non-SSE), memcpy_sse2 (used when SSE2 is available but 4.1 isn't), and memcpy_sse4_1. The SSE 4.1 (MOVNTDQA) version is the big block below; first, a rough sketch of the stub.
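The stub itself is nothing fancy, just a one-time feature check that picks an implementation. Conceptually it looks something like this (a sketch only; I'm using GCC's __builtin_cpu_init/__builtin_cpu_supports for brevity, whereas a freestanding kernel would more likely issue CPUID itself, and the exact signatures here are my own):
Code:
#include <stddef.h>

/* the three real implementations (names as described above) */
void *memcpy_std(void *dst, const void *src, size_t n);
void *memcpy_sse2(void *dst, const void *src, size_t n);
void *memcpy_sse4_1(void *dst, const void *src, size_t n);

/* chosen once, on the first call */
static void *(*memcpy_impl)(void *, const void *, size_t) = NULL;

void *memcpy(void *dst, const void *src, size_t n)
{
    if(memcpy_impl == NULL)
    {
        /* GCC/Clang builtins; a freestanding kernel would instead
         * read the SSE2/SSE4.1 feature bits from CPUID directly */
        __builtin_cpu_init();
        if(__builtin_cpu_supports("sse4.1"))
            memcpy_impl = memcpy_sse4_1;
        else if(__builtin_cpu_supports("sse2"))
            memcpy_impl = memcpy_sse2;
        else
            memcpy_impl = memcpy_std;
    }
    return memcpy_impl(dst, src, n);
}
With that out of the way, here's the SSE 4.1 (MOVNTDQA) memcpy itself.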
Code:
/* =====================================================================
 * ===== ALIGNED SSE4.1 MEMORY COPY OPERATIONS (MOVNTDQA == SSE4.1) ====
* ===================================================================== */
static const void * memcpy_sse4_1_aligned(const void * const dst, const void * const src, const size_t m_count)
{
size_t i;
i = 0;
/* is "src" aligned on a SSE_XMM_SIZE boundary */
if(!((size_t)src & (SSE_XMM_SIZE-1)))
{ }
else
{
/* lets make sure we don't copy 'too' many bytes (i < m_count) */
while((((size_t)src + i) & (SSE_XMM_SIZE-1)) && i < m_count)
{
asm("movsb;"::"S"((size_t)src + i), "D"((size_t)dst + i));
i++;
}
}
/* check to see if "dst" is aligned on a SSE_XMM_SIZE boundary */
if(!((size_t)dst & (SSE_XMM_SIZE-1)))
{
/* each iteration consumes a 128-byte chunk of memory (256 bytes on x86_64, using all 16 XMM registers) */
#ifdef __x86_64__
for(; i + 256 < m_count; i += 256)
#else
for(; i + 128 < m_count; i += 128)
#endif
{
/* fill all the XMM 128-bit SSE registers! */
asm (" mfence; "
" prefetchnta 0(%0); "
" prefetchnta 32(%0); "
" prefetchnta 64(%0); "
" prefetchnta 96(%0); "
" prefetchnta 0(%1); "
" prefetchnta 32(%1); "
" prefetchnta 64(%1); "
" prefetchnta 96(%1); "
" movntdqa 0(%0) , %%xmm0; "
" movntdqa 16(%0), %%xmm1; "
" movntdqa 32(%0), %%xmm2; "
" movntdqa 48(%0), %%xmm3; "
" movntdqa 64(%0), %%xmm4; "
" movntdqa 80(%0), %%xmm5; "
" movntdqa 96(%0), %%xmm6; "
" movntdqa 112(%0), %%xmm7; "
#ifdef __x86_64__
" movntdqa 128(%0), %%xmm8; "
" movntdqa 144(%0), %%xmm9; "
" movntdqa 160(%0), %%xmm10; "
" movntdqa 176(%0), %%xmm11; "
" movntdqa 192(%0), %%xmm12; "
" movntdqa 208(%0), %%xmm13; "
" movntdqa 224(%0), %%xmm14; "
" movntdqa 240(%0), %%xmm15; "
#endif
" movntdq %%xmm0, 0(%1); "
" movntdq %%xmm1, 16(%1); "
" movntdq %%xmm2, 32(%1); "
" movntdq %%xmm3, 48(%1); "
" movntdq %%xmm4, 64(%1); "
" movntdq %%xmm5, 80(%1); "
" movntdq %%xmm6, 96(%1); "
" movntdq %%xmm7, 112(%1); "
#ifdef __x86_64__
" movntdq %%xmm8, 128(%1); "
" movntdq %%xmm9, 144(%1); "
" movntdq %%xmm10, 160(%1); "
" movntdq %%xmm11, 176(%1); "
" movntdq %%xmm12, 192(%1); "
" movntdq %%xmm13, 208(%1); "
" movntdq %%xmm14, 224(%1); "
" movntdq %%xmm15, 240(%1); "
" clflush 0(%0); "
" clflush 32(%0); "
" clflush 64(%0); "
" clflush 96(%0); "
" clflush 0(%1); "
" clflush 32(%1); "
" clflush 64(%1); "
" clflush 96(%1); "
#endif
::"r"((size_t)src + i), "r"((size_t)dst + i));
}
}
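/* destination isn't 16-byte aligned: same non-temporal loads from the
 * source, but plain unaligned (MOVDQU) stores into the destination */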
else
{
#ifdef __x86_64__
for(; i + 256 < m_count; i += 256)
#else
for(; i + 128 < m_count; i += 128)
#endif
{
asm (" mfence; "
" prefetchnta 0(%0); "
" prefetchnta 32(%0); "
" prefetchnta 64(%0); "
" prefetchnta 96(%0); "
" prefetchnta 0(%1); "
" prefetchnta 32(%1); "
" prefetchnta 64(%1); "
" prefetchnta 96(%1); "
" movntdqa 0(%0) , %%xmm0; "
" movntdqa 16(%0), %%xmm1; "
" movntdqa 32(%0), %%xmm2; "
" movntdqa 48(%0), %%xmm3; "
" movntdqa 64(%0), %%xmm4; "
" movntdqa 80(%0), %%xmm5; "
" movntdqa 96(%0), %%xmm6; "
" movntdqa 112(%0), %%xmm7; "
#ifdef __x86_64__
" movntdqa 128(%0), %%xmm8; "
" movntdqa 144(%0), %%xmm9; "
" movntdqa 160(%0), %%xmm10; "
" movntdqa 176(%0), %%xmm11; "
" movntdqa 192(%0), %%xmm12; "
" movntdqa 208(%0), %%xmm13; "
" movntdqa 224(%0), %%xmm14; "
" movntdqa 240(%0), %%xmm15; "
#endif
" movdqu %%xmm0, 0(%1); "
" movdqu %%xmm1, 16(%1); "
" movdqu %%xmm2, 32(%1); "
" movdqu %%xmm3, 48(%1); "
" movdqu %%xmm4, 64(%1); "
" movdqu %%xmm5, 80(%1); "
" movdqu %%xmm6, 96(%1); "
" movdqu %%xmm7, 112(%1); "
#ifdef __x86_64__
" movdqu %%xmm8, 128(%1); "
" movdqu %%xmm9, 144(%1); "
" movdqu %%xmm10, 160(%1); "
" movdqu %%xmm11, 176(%1); "
" movdqu %%xmm12, 192(%1); "
" movdqu %%xmm13, 208(%1); "
" movdqu %%xmm14, 224(%1); "
" movdqu %%xmm15, 240(%1); "
" clflush 0(%0); "
" clflush 32(%0); "
" clflush 64(%0); "
" clflush 96(%0); "
" clflush 0(%1); "
" clflush 32(%1); "
" clflush 64(%1); "
" clflush 96(%1); "
#endif
::"r"((size_t)src + i), "r"((size_t)dst + i));
}
}
/* copy whatever tail remains (fewer than one full block) with REP MOVSB;
 * read-write constraints because it advances ESI/EDI and zeroes ECX */
{
size_t s = (size_t)src + i, d = (size_t)dst + i, n = m_count - i;
asm volatile(" rep movsb; " : "+S"(s), "+D"(d), "+c"(n) : : "memory");
}
i = m_count;
return (void *)(((size_t)dst) + i);
}
In the above, I assume PREFETCHNTA will load a bare minimum of 32 bytes per instruction, and that those prefetches cover the entire iteration's worth of data for both the source and destination areas.
I have a few questions myself to get started. First off, how can I tell how much PREFETCHNTA actually prefetches? Does it pull in the entire cache line that the given address falls in, or just part of it (from that address onward)? Also, is there any latency involved in PREFETCHNTA, or is it negligible? I know it's generally better to reuse the same XMM register, but isn't it more efficient here to do all the loads into 8/16 registers at once instead of load-store-load-store?
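On the first question, I suppose I could at least stop hard-coding the 32-byte assumption and ask the CPU for its line size at runtime, then derive the prefetch/CLFLUSH stride from that. Something like this sketch (it uses GCC's <cpuid.h> helper; a freestanding build might have to issue CPUID by hand, and CPUID.01H:EBX[15:8] is technically the CLFLUSH line size, which I'm assuming matches the prefetch granularity):
Code:
#include <stddef.h>
#include <cpuid.h>  /* GCC/Clang helper; a kernel might issue CPUID by hand instead */

/* CPUID leaf 1, EBX bits 15:8 = CLFLUSH line size in 8-byte units.
 * On the CPUs I care about this matches the data-cache line size,
 * which is the granularity CLFLUSH (and, I assume, PREFETCHNTA) works at. */
static size_t cache_line_size(void)
{
    unsigned int eax, ebx, ecx, edx;
    if(!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 32; /* fall back to my current hard-coded assumption */
    return (size_t)((ebx >> 8) & 0xff) * 8;
}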