Optimized memory functions?

Discussions on more advanced topics such as monolithic vs micro-kernels, transactional memory models, and paging vs segmentation should go here. Use this forum to expand and improve the wiki!
imate900
Member
Posts: 80
Joined: Sat Feb 28, 2009 11:43 am

Re: Optimized memory functions?

Post by imate900 »

I implemented my memset like this in assembly (NASM):

Code: Select all

memset:
        ; [eax-4] has the buffer
        ; [eax+8] has the size
        ; and [eax+4] has the integer/char
        mov     ecx, [eax+8]
        mov     edi, [eax-4]
        mov     [es:edi], edi
        mov     al, [eax+4]
rep     stosb
        mov     eax, [es:edi]
        ret
Current work on an OS: SauOS (project homepage: http://code.google.com/p/sauos/)
01000101
Member
Posts: 1599
Joined: Fri Jun 22, 2007 12:47 pm

Re: Optimized memory functions?

Post by 01000101 »

imate900 wrote: I implemented my memset like this in assembly (NASM):

Code: Select all

memset:
        ; [eax-4] has the buffer
        ; [eax+8] has the size
        ; and [eax+4] has the integer/char
        mov     ecx, [eax+8]
        mov     edi, [eax-4]
        mov     [es:edi], edi
        mov     al, [eax+4]
rep     stosb
        mov     eax, [es:edi]
        ret
and?
unless I'm looking at it wrong, it seems like a very basic plain ol' memset with no optimizing instructions (besides the REP prefix).

As for the previous posts, I may get around to re-testing those numbers with different methods (serializing, speed-independent timers, etc.).
imate900
Member
Posts: 80
Joined: Sat Feb 28, 2009 11:43 am

Re: Optimized memory functions?

Post by imate900 »

01000101 wrote: and?
unless I'm looking at it wrong, it seems like a very basic plain ol' memset with no optimizing instructions (besides the REP prefix).

As for the previous posts, I may get around to re-testing those numbers with different methods (serializing, speed-independent timers, etc.).
I optimized for size not speed.
Current work on an OS: SauOS (project homepage: http://code.google.com/p/sauos/)
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance

Re: Optimized memory functions?

Post by Combuster »

Calling convention's off. And if you're really optimizing for size, try this (8 bytes and nasty):

Code: Select all

pop edx
pop edi
pop eax
pop ecx
rep stosb
push edx
ret
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
imate900
Member
Posts: 80
Joined: Sat Feb 28, 2009 11:43 am

Re: Optimized memory functions?

Post by imate900 »

Combuster wrote: Calling convention's off. And if you're really optimizing for size, try this (8 bytes and nasty):

Code: Select all

pop edx
pop edi
pop eax
pop ecx
rep stosb
push edx
ret
By the way, my code is optimized to be called from C, not assembly.
Current work on an OS: SauOS (project homepage: http://code.google.com/p/sauos/)
Craze Frog
Member
Posts: 368
Joined: Sun Sep 23, 2007 4:52 am

Re: Optimized memory functions?

Post by Craze Frog »

Combuster wrote: Calling convention's off. And if you're really optimizing for size, try this (8 bytes and nasty):

Code: Select all

pop edx
pop edi
pop eax
pop ecx
rep stosb
push edx
ret
That most definitely doesn't fit any calling convention. In both stdcall and cdecl, edi is callee-save, but you wipe it.
jal
Member
Posts: 1385
Joined: Wed Oct 31, 2007 9:09 am

Re: Optimized memory functions?

Post by jal »

imate900 wrote: By the way, my code is optimized to be called from C, not assembly.
You are fighting a lost battle here, really. Stop embarrassing yourself.


JAL
01000101
Member
Posts: 1599
Joined: Fri Jun 22, 2007 12:47 pm

Re: Optimized memory functions?

Post by 01000101 »

This thread was meant to discuss speed optimizations (as you would have noticed had you actually read any of the topic posts). While that is indeed a small routine, the "big" routines aren't much larger.
Craze Frog
Member
Posts: 368
Joined: Sun Sep 23, 2007 4:52 am

Re: Optimized memory functions?

Post by Craze Frog »

No.
jal
Member
Posts: 1385
Joined: Wed Oct 31, 2007 9:09 am

Re: Optimized memory functions?

Post by jal »

Craze Frog wrote:No.
And that's a good thing, otherwise it would be extremely easy for a thread to steal CPU time :)

Code: Select all

mov ecx, 0xffffffff   ; we'll get that pesky scheduler!
rep movsd

JAL
Last edited by jal on Tue Mar 10, 2009 3:07 am, edited 1 time in total.
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance

Re: Optimized memory functions?

Post by Combuster »

I assume you meant ECX there... :wink:
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
jal
Member
Posts: 1385
Joined: Wed Oct 31, 2007 9:09 am

Re: Optimized memory functions?

Post by jal »

Combuster wrote:I assume you meant ECX there... :wink:
Yeah, typed too fast :) At first I had "mosb" as well, but at least I corrected that one :))


JAL
01000101
Member
Posts: 1599
Joined: Fri Jun 22, 2007 12:47 pm

Re: Optimized memory functions?

Post by 01000101 »

Hey, I decided to attempt to make an SSE-optimized (and maybe MMX later) library of core functions such as memcpy, memcmp, strlen, strcmp, memmove, etc.

I'm trying to weigh non-temporal moves, which are serializing, against non-non-temporal ( :D ) moves that can utilize out-of-order execution. I'm using PREFETCHNTA before moving large blocks and a cache-line flush (CLFLUSH) at the end of each major iteration; I hope that somewhat balances the scale.

I'd like to post a few functions here and, hopefully as a by-proxy team effort, scrutinize the code into an efficient function.

I am aware of existing libraries out there that do similar things, but I'd like to make one especially for small OSes and optimize the code for easy porting (to different OSes, not different architectures). Well, let's get started. Here's the big hitter, the memcpy! (I'm aware it's not 100% POSIX compliant, but I never said that it would be 8) ).

This function gets called by a stub (memcpy) when the stub detects the presence of SSE 4.1 capabilities. There are currently three memcpy variants that this stub can call: memcpy_std (standard, non-SSE), memcpy_sse2 (used if SSE 4.1 isn't available), and memcpy_sse4_1. Here's the SSE 4.1 version (movntdqa).

Code: Select all


/* =====================================================================
 * === ALIGNED SSE4.1 MEMORY COPYING OPERATIONS (MOVNTDQA == SSE4.1) ===
 * ===================================================================== */

static const void * memcpy_sse4_1_aligned(const void * const dst, const void * const src, const size_t m_count)
{
	size_t i;

	i = 0;

	/* is "src" aligned on a SSE_XMM_SIZE boundary */
	if(!((size_t)src & (SSE_XMM_SIZE-1)))
	{ }
	else
	{ 
		/* lets make sure we don't copy 'too' many bytes (i < m_count) */
		while((((size_t)src + i) & (SSE_XMM_SIZE-1)) && i < m_count)
		{
			asm("movsb;"::"S"((size_t)src + i), "D"((size_t)dst + i));
			i++;
		}
	}

	/* check to see if "dst" is aligned on a SSE_XMM_SIZE boundary */
	if(!((size_t)dst & (SSE_XMM_SIZE-1)))
	{
		/* each iteration consumes a 128-byte (256-byte on x86_64) chunk of memory */

			
#ifdef __x86_64__
		for(; i + 256 < m_count; i += 256)
#else
		for(; i + 128 < m_count; i += 128)
#endif
		{
			/* fill all the XMM 128-bit SSE registers! */
			asm (" mfence;					 "
				 " prefetchnta  0(%0);		 "
				 " prefetchnta 32(%0);		 "
				 " prefetchnta 64(%0);		 "
				 " prefetchnta 96(%0);		 "
				 " prefetchnta  0(%1);		 "
				 " prefetchnta 32(%1);		 "
				 " prefetchnta 64(%1);		 "
				 " prefetchnta 96(%1);		 "
			     " movntdqa 0(%0) , %%xmm0;  "
				 " movntdqa 16(%0), %%xmm1;  " 
				 " movntdqa 32(%0), %%xmm2;  " 
				 " movntdqa 48(%0), %%xmm3;  " 
				 " movntdqa 64(%0), %%xmm4;  " 
				 " movntdqa 80(%0), %%xmm5;  " 
				 " movntdqa 96(%0), %%xmm6;  " 
				 " movntdqa 112(%0), %%xmm7; " 
#ifdef __x86_64__
				 " movntdqa 128(%0), %%xmm8;  "
				 " movntdqa 144(%0), %%xmm9;  " 
				 " movntdqa 160(%0), %%xmm10; " 
				 " movntdqa 176(%0), %%xmm11; " 
				 " movntdqa 192(%0), %%xmm12; " 
				 " movntdqa 208(%0), %%xmm13; " 
				 " movntdqa 224(%0), %%xmm14; " 
				 " movntdqa 240(%0), %%xmm15; " 
#endif
				 " movntdq %%xmm0, 0(%1);    " 
				 " movntdq %%xmm1, 16(%1);	 " 
				 " movntdq %%xmm2, 32(%1);	 " 
				 " movntdq %%xmm3, 48(%1);	 " 
				 " movntdq %%xmm4, 64(%1);   " 
				 " movntdq %%xmm5, 80(%1);	 " 
				 " movntdq %%xmm6, 96(%1);	 " 
				 " movntdq %%xmm7, 112(%1);	 " 
#ifdef __x86_64__
				 " movntdq %%xmm8, 128(%1);  " 
				 " movntdq %%xmm9, 144(%1);	 " 
				 " movntdq %%xmm10, 160(%1); " 
				 " movntdq %%xmm11, 176(%1); " 
				 " movntdq %%xmm12, 192(%1); " 
				 " movntdq %%xmm13, 208(%1); " 
				 " movntdq %%xmm14, 224(%1); " 
				 " movntdq %%xmm15, 240(%1); " 	
				 " clflush  0(%0);			 "
				 " clflush 32(%0);			 "
				 " clflush 64(%0);			 "
				 " clflush 96(%0);			 "
				 " clflush  0(%1);			 "
				 " clflush 32(%1);			 "
				 " clflush 64(%1);			 "
				 " clflush 96(%1);			 "
#endif 
				 ::"r"((size_t)src + i), "r"((size_t)dst + i)); 
		}
	}
	else
	{
#ifdef __x86_64__
		for(; i + 256 < m_count; i += 256)
#else
		for(; i + 128 < m_count; i += 128)
#endif
		{ 
			asm (" mfence; "
				 " prefetchnta  0(%0);		 "
				 " prefetchnta 32(%0);		 "
				 " prefetchnta 64(%0);		 "
				 " prefetchnta 96(%0);		 "
				 " prefetchnta  0(%1);		 "
				 " prefetchnta 32(%1);		 "
				 " prefetchnta 64(%1);		 "
				 " prefetchnta 96(%1);		 "
				 " movntdqa 0(%0) , %%xmm0;  "
				 " movntdqa 16(%0), %%xmm1;  " 
				 " movntdqa 32(%0), %%xmm2;  " 
				 " movntdqa 48(%0), %%xmm3;  " 
				 " movntdqa 64(%0), %%xmm4;  " 
				 " movntdqa 80(%0), %%xmm5;  " 
				 " movntdqa 96(%0), %%xmm6;  " 
				 " movntdqa 112(%0), %%xmm7; " 
#ifdef __x86_64__
				 " movntdqa 128(%0), %%xmm8;  "
				 " movntdqa 144(%0), %%xmm9;  " 
				 " movntdqa 160(%0), %%xmm10; " 
				 " movntdqa 176(%0), %%xmm11; " 
				 " movntdqa 192(%0), %%xmm12; " 
				 " movntdqa 208(%0), %%xmm13; " 
				 " movntdqa 224(%0), %%xmm14; " 
				 " movntdqa 240(%0), %%xmm15; " 
#endif
				 " movdqu %%xmm0, 0(%1);     " 
				 " movdqu %%xmm1, 16(%1);	 " 
				 " movdqu %%xmm2, 32(%1);	 " 
				 " movdqu %%xmm3, 48(%1);	 " 
				 " movdqu %%xmm4, 64(%1);    " 
				 " movdqu %%xmm5, 80(%1);	 " 
				 " movdqu %%xmm6, 96(%1);	 " 
				 " movdqu %%xmm7, 112(%1);	 " 
#ifdef __x86_64__
				 " movdqu %%xmm8, 128(%1);  " 
				 " movdqu %%xmm9, 144(%1);	" 
				 " movdqu %%xmm10, 160(%1); " 
				 " movdqu %%xmm11, 176(%1); " 
				 " movdqu %%xmm12, 192(%1); " 
				 " movdqu %%xmm13, 208(%1); " 
				 " movdqu %%xmm14, 224(%1); " 
				 " movdqu %%xmm15, 240(%1); " 
				 " clflush  0(%0);			 "
				 " clflush 32(%0);			 "
				 " clflush 64(%0);			 "
				 " clflush 96(%0);			 "
				 " clflush  0(%1);			 "
				 " clflush 32(%1);			 "
				 " clflush 64(%1);			 "
				 " clflush 96(%1);			 "
#endif
				 ::"r"((size_t)src + i), "r"((size_t)dst + i)); 
		}
	}

	i += m_count - i;
	asm(" rep movsb; " :: "S"((size_t)src + i), "D"((size_t)dst + i), "c"(i):"memory");

	return (void *)(((size_t)dst) + i);
}

In the above, I assume PREFETCHNTA will load the bare minimum of 32 bytes and prefetch the entire iteration's worth of data for both the source and destination areas.

I have a few questions myself to get started. First off, how can I tell how much PREFETCHNTA actually prefetches? Does it prefetch the entire cache line that the operand address is on, or just part of it (from where the address starts)? Also, is there any latency involved in PREFETCHNTA, or is it negligible? I know it's better to repeat using the same XMM register, but is it not more efficient to do all the loads into 8/16 registers at once instead of load-store-load-store?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Optimized memory functions?

Post by Brendan »

Hi,

Some general notes...

For "write-back" caching, the CPU operates on cache lines (not individual bytes, words, dwords, qwords, etc). For example, if you write a byte then the CPU will fetch an entire cache line, set the byte in the cache line, then (eventually) write the entire cache line to RAM. Prefetching prefetches a cache line (not just a byte).

There's no point prefetching a cache line immediately before reading from the cache line. The idea is to prefetch the cache line so that it's already in the cache *before* the CPU needs it. Because it takes time for the cache line to be fetched from RAM, for prefetching to work properly you need to prefetch the cache line at least 150 cycles before you read from the cache line. However, the exact "prefetch distance" depends on the memory controller, RAM speeds, instruction timings, etc.
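The "prefetch distance" idea can be sketched in C with the SSE prefetch intrinsic (a hedged illustration, not code from this thread; the 512-byte distance and 64-byte line size are placeholders that would have to be tuned per machine, as Brendan notes):

```c
#include <stddef.h>
#include <string.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_NTA (GCC/Clang/MSVC) */

/* Copy while issuing a software prefetch well ahead of the read cursor,
 * so each cache line is (ideally) already resident when it is read. */
static void copy_with_prefetch(char *dst, const char *src, size_t n)
{
    const size_t dist = 512;  /* placeholder "prefetch distance" */
    size_t i = 0;

    for (; i + 64 <= n; i += 64) {
        if (i + dist < n)
            _mm_prefetch(src + i + dist, _MM_HINT_NTA);
        memcpy(dst + i, src + i, 64);   /* one cache line per iteration */
    }
    memcpy(dst + i, src + i, n - i);    /* tail */
}
```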

If you don't prefetch at all, then if your reads are sequential the CPU's hardware prefetcher will do the prefetching anyway.

Prefetching does not work if there's a TLB miss. If the linear address isn't in the TLB already then the prefetch is ignored. To get around this you can "pre-load" instead (e.g. read from RAM into a register well before the contents of that register are needed, to force a TLB fetch for the page and force the data to be read into cache at the same time). In this case you'd want to "pre-load" at least 300 cycles before the contents of the register are needed (e.g. twice as much as the "prefetch distance", where the exact timing depends on the memory controller, RAM speeds, instruction timings, etc). The hardware prefetcher won't cross a page boundary, so "pre-loading" at page boundaries might have a lot more effect on performance than prefetching does.
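The "pre-load" trick is expressible in plain C with a volatile read (again a sketch under assumptions: 4 KiB pages, and a compiler that doesn't elide the volatile access):

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096u

/* Touch the next source page before copying the current one.  Unlike a
 * prefetch instruction, this read cannot be dropped on a TLB miss: it
 * forces the TLB fill and pulls the line into the cache. */
static void copy_with_preload(char *dst, const char *src, size_t n)
{
    size_t i = 0;

    for (; i + PAGE_SIZE <= n; i += PAGE_SIZE) {
        if (i + PAGE_SIZE < n)  /* pre-load one page ahead */
            (void)*(volatile const char *)(src + i + PAGE_SIZE);
        memcpy(dst + i, src + i, PAGE_SIZE);
    }
    memcpy(dst + i, src + i, n - i);    /* tail */
}
```

In a real copy loop the pre-load would sit much further ahead, per the "twice the prefetch distance" rule of thumb; one page ahead is used here purely for brevity.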

For moving from RAM/cache into a register, there's no point bothering with a non-temporal hint, because the non-temporal hint only affects writes to RAM/cache.

For moving from a register into RAM, the non-temporal hint tells the CPU to use "write-combining" protocol (instead of the normal "write-back" protocol), and therefore the data is never stored in the cache and doesn't need to be CLFLUSHed to avoid cache pollution.
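The write side of that can be shown with the SSE2 streaming-store intrinsic (a hedged sketch, not code from this thread: a memset-style fill, assuming a 16-byte-aligned destination and a size that is a multiple of 16):

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_set1_epi8, _mm_stream_si128, _mm_sfence */

/* Non-temporal fill: the MOVNTDQ stores go through the write-combining
 * buffers straight to RAM, so the destination never lands in the cache
 * and no CLFLUSH is needed afterwards. */
static void memset_nt(void *dst, uint8_t value, size_t n)
{
    __m128i v = _mm_set1_epi8((char)value);
    char *p = (char *)dst;

    for (size_t i = 0; i < n; i += 16)
        _mm_stream_si128((__m128i *)(p + i), v);
    _mm_sfence();  /* make the WC stores globally visible before returning */
}
```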

In general, regardless of what you do you'll be limited by memory bandwidth. Optimizing the memory copying code isn't likely to reduce the time it takes to copy memory.

For moving (not copying) large amounts of data, if the starting source and destination addresses are aligned to a page boundary, then you can move page table entries instead of moving the data itself. For copying data (not moving it), if the starting source and destination addresses are aligned to a page boundary, then you can copy page table entries and use "copy on write" instead of copying the data itself. This is the only way to exceed the memory bandwidth limitation. For example, by copying/moving 4 KiB of "page table entry data" you can make it look like 4 MiB was copied/moved; and if you can copy/move 1 GiB per second with a normal copy/move operation then you can probably copy/move 1 TiB per second with this method.

Also, it's probably better to use function pointers than repeatedly testing if certain features are present. For example, have an "init" function that detects which features are present and sets function pointers; then call whatever function is pointed to by the function pointer instead of checking CPU features and branching.
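That init-time dispatch might look like this (a sketch, not code from this thread: the two variants are stand-ins, and `__get_cpuid`/`bit_SSE4_1` are GCC/Clang-specific helpers from <cpuid.h>):

```c
#include <stddef.h>
#include <string.h>
#include <cpuid.h>  /* GCC/Clang: __get_cpuid, bit_SSE4_1 */

/* Hypothetical variants -- stand-ins for memcpy_std/memcpy_sse4_1. */
static void *memcpy_std(void *d, const void *s, size_t n)    { return memcpy(d, s, n); }
static void *memcpy_sse4_1(void *d, const void *s, size_t n) { return memcpy(d, s, n); }

/* Resolved once at init; every later call is a plain indirect call
 * with no per-call CPUID test or branch. */
static void *(*memcpy_ptr)(void *, const void *, size_t) = memcpy_std;

static void memcpy_init(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 1, ECX bit 19 advertises SSE 4.1. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & bit_SSE4_1))
        memcpy_ptr = memcpy_sse4_1;
}
```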

Finally, measure your code's performance. For example, use RDTSC to see if your code is better/worse than "rep movsd", then keep changing your code and use RDTSC to see if your changes helped or not. For a generic library you need to do this on many different computers, because improving your code for one CPU might make it worse for another.
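A minimal RDTSC harness along those lines (a sketch; `__rdtsc` is the GCC/Clang x86 intrinsic, and since RDTSC is not serializing, best-of-N filters some noise but CPUID/RDTSCP fencing would be needed for precise numbers):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>  /* __rdtsc (GCC/Clang, x86 only) */

/* Time one copy routine with the TSC and keep the best of many runs,
 * which filters out interrupts and cold-cache effects. */
static uint64_t time_memcpy(void *dst, const void *src, size_t n, int runs)
{
    uint64_t best = UINT64_MAX;

    for (int r = 0; r < runs; r++) {
        uint64_t t0 = __rdtsc();
        memcpy(dst, src, n);
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Running the same measurement over a "rep movsd" version on each target machine gives the comparison Brendan describes.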


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
xyzzy
Member
Posts: 391
Joined: Wed Jul 25, 2007 8:45 am
Libera.chat IRC: aejsmith
Location: London, UK

Re: Optimized memory functions?

Post by xyzzy »

I have a memcpy benchmarking program here; IIRC someone posted it on #osdev a while ago, and I've modified it a bit. I plugged in your memcpy, and it doesn't perform brilliantly on the few tests that do work - it seems to be buggy and corrupts malloc()'s data structures 8)

Code: Select all

*** glibc detected *** ./memcpy-speed: free(): invalid pointer: 0x0000000002295680 ***
======= Backtrace: =========
/lib/libc.so.6[0x7fa364bd28b8]
/lib/libc.so.6(cfree+0x76)[0x7fa364bd43f6]
./memcpy-speed[0x4015c8]
./memcpy-speed[0x4015f7]
/lib/libc.so.6(__libc_start_main+0xe6)[0x7fa364b7e546]
./memcpy-speed[0x4005e9]
======= Memory map: ========
00400000-00402000 r-xp 00000000 08:03 188170                             /home/alex/src/exclaim/testing/memcpy/memcpy-speed
00601000-00602000 rw-p 00001000 08:03 188170                             /home/alex/src/exclaim/testing/memcpy/memcpy-speed
02295000-022b6000 rw-p 02295000 00:00 0                                  [heap]
7fa360000000-7fa360021000 rw-p 7fa360000000 00:00 0 
7fa360021000-7fa364000000 ---p 7fa360021000 00:00 0 
7fa364949000-7fa36495f000 r-xp 00000000 08:01 7061                       /usr/lib/libgcc_s.so.1
7fa36495f000-7fa364b5f000 ---p 00016000 08:01 7061                       /usr/lib/libgcc_s.so.1
7fa364b5f000-7fa364b60000 rw-p 00016000 08:01 7061                       /usr/lib/libgcc_s.so.1
7fa364b60000-7fa364caa000 r-xp 00000000 08:01 2743                       /lib/libc-2.9.so
7fa364caa000-7fa364eaa000 ---p 0014a000 08:01 2743                       /lib/libc-2.9.so
7fa364eaa000-7fa364eae000 r--p 0014a000 08:01 2743                       /lib/libc-2.9.so
7fa364eae000-7fa364eaf000 rw-p 0014e000 08:01 2743                       /lib/libc-2.9.so
7fa364eaf000-7fa364eb4000 rw-p 7fa364eaf000 00:00 0 
7fa364eb4000-7fa364ed1000 r-xp 00000000 08:01 2744                       /lib/ld-2.9.so
7fa3650b0000-7fa3650b2000 rw-p 7fa3650b0000 00:00 0 
7fa3650ce000-7fa3650d0000 rw-p 7fa3650ce000 00:00 0 
7fa3650d0000-7fa3650d1000 r--p 0001c000 08:01 2744                       /lib/ld-2.9.so
7fa3650d1000-7fa3650d2000 rw-p 0001d000 08:01 2744                       /lib/ld-2.9.so
7fff6d0bd000-7fff6d0d2000 rw-p 7ffffffea000 00:00 0                      [stack]
7fff6d1ff000-7fff6d200000 r-xp 7fff6d1ff000 00:00 0                      [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
Aborted
Anyway, I find that program to be helpful when tweaking my memcpy function to see how it compares to others. You might find it useful too :)