Optimized memory functions?

Discussions on more advanced topics such as monolithic vs micro-kernels, transactional memory models, and paging vs segmentation should go here. Use this forum to expand and improve the wiki!
User avatar
01000101
Member
Posts: 1599
Joined: Fri Jun 22, 2007 12:47 pm
Contact:

Optimized memory functions?

Post by 01000101 »

Hey,
I've been brainstorming up some ways to optimize core memory functions such as 'memset' and 'memcpy'.
I wanted to make it so that a call like 'memset(&cArray, 0x00, 128);' doesn't boil down to a loop of 128 single-byte copies. Most memset functions I have seen just decrement the 'count' variable and do a char copy over and over. I figured that, since I'm programming a 64-bit OS, I might as well take advantage of it; the same idea applies to 32-bit OSs as well.

Here's a rough prototype that I'm testing right now; I'd like any feedback on these memset and memcpy functions.

Code:

void *memcpy(void *dest, const void *src, uint64 count)
{
  uint8 *d = (uint8*)dest;            // byte pointers: arithmetic on void* is not standard C
  const uint8 *s = (const uint8*)src;
  while(count >= 8){ *(uint64*)d = *(const uint64*)s; memcpy_transfers_64++; d += 8; s += 8; count -= 8; }
  while(count >= 4){ *(uint32*)d = *(const uint32*)s; memcpy_transfers_32++; d += 4; s += 4; count -= 4; }
  while(count >= 2){ *(uint16*)d = *(const uint16*)s; memcpy_transfers_16++; d += 2; s += 2; count -= 2; }
  while(count >= 1){ *d = *s; memcpy_transfers_8++; d += 1; s += 1; count -= 1; }
  return dest;                        // the standard memcpy returns the destination pointer
}

void *memset(void *dest, uint8 sval, uint64 count)
{
  uint8 *d = (uint8*)dest;
  uint64 val = sval;    // replicate the byte into all eight lanes of a 64-bit word
  val |= val << 8;
  val |= val << 16;
  val |= val << 32;

  while(count >= 8){ *(uint64*)d = val; memset_transfers_64++; d += 8; count -= 8; }
  while(count >= 4){ *(uint32*)d = (uint32)val; memset_transfers_32++; d += 4; count -= 4; }
  while(count >= 2){ *(uint16*)d = (uint16)val; memset_transfers_16++; d += 2; count -= 2; }
  while(count >= 1){ *d = sval; memset_transfers_8++; d += 1; count -= 1; }
  return dest;
}


Say you want to memset 128 bytes: this performs 16 64-bit transfers instead of 128 8-bit transfers. Or, for 18 bytes, it performs 2 64-bit transfers and 1 16-bit transfer instead of 18 byte operations.
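To sanity-check those numbers, the tier breakdown can be computed on its own (transfer_counts is an illustrative helper of mine, not part of the functions above):

```c
#include <stdint.h>

/* Count how many transfers of each width the tiered loops above would
   issue for a given byte count. */
static void transfer_counts(uint64_t n, uint64_t out[4])
{
    out[0] = n / 8;  n %= 8;   /* 64-bit stores */
    out[1] = n / 4;  n %= 4;   /* 32-bit stores */
    out[2] = n / 2;  n %= 2;   /* 16-bit stores */
    out[3] = n;                /* 8-bit stores  */
}
```

For 128 bytes this gives {16, 0, 0, 0}; for 18 bytes, {2, 0, 1, 0}.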

What do you think of this type of optimization? What would be the pitfalls?

[edit]By the way, the memxxx_transfers_xx variables are just logging counters; if I weren't so lazy I would have taken them out.[/edit]
User avatar
Colonel Kernel
Member
Posts: 1437
Joined: Tue Oct 17, 2006 6:06 pm
Location: Vancouver, BC, Canada
Contact:

Re: Optimized memory functions?

Post by Colonel Kernel »

You need to handle the case where the starting address is not aligned properly.
Top three reasons why my OS project died:
  1. Too much overtime at work
  2. Got married
  3. My brain got stuck in an infinite loop while trying to design the memory manager
Don't let this happen to you!
CodeCat
Member
Posts: 158
Joined: Tue Sep 23, 2008 1:45 pm
Location: Eindhoven, Netherlands

Re: Optimized memory functions?

Post by CodeCat »

You want to optimise the code in the loop as much as possible, so you should reduce the number of variables changed inside the loop. For memset, in fact, all you need is a single int64 pointer, incremented every iteration. Before the loop, create a pointer just past the last 64-bit value to be written: int64* end = (int64*) dest + (count / 8); Then the while loop only has to compare the loop pointer against the end pointer each time, which comes down to just a 'cmp' and a 'je' instruction.
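As a rough illustration of that end-pointer loop (a sketch of mine; memset64_endptr and its alignment assumptions are not part of the post):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the end-pointer idea: one pointer advances, and the loop
   condition is a single compare against 'end' (cmp + jne in the generated
   code). Assumes dest is 8-byte aligned and count is a multiple of 8. */
static void memset64_endptr(void *dest, uint64_t val64, size_t count)
{
    uint64_t *p = (uint64_t *)dest;
    uint64_t *end = p + count / 8;
    while (p != end)
        *p++ = val64;
}
```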

Also, is using inline assembly an option? If so, you might want to consider the movsd/movsq instructions to automate the moving, and to further reduce the amount of code inside the loop.
User avatar
bewing
Member
Posts: 1401
Joined: Wed Feb 07, 2007 1:45 pm
Location: Eugene, OR, US

Re: Optimized memory functions?

Post by bewing »

I have seen memsets that do what you are doing. As Colonel Kernel says, you need a few byte operations first to align the pointer on a qword boundary, then the 64-bit ops, then maybe a few more byte ops to finish up. Don't bother with the dword and word ops, I'd say.
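That align-first scheme might look roughly like this in C (a sketch of mine; memset_aligned and the byte-replication multiply are assumptions, not bewing's code):

```c
#include <stdint.h>
#include <stddef.h>

/* Byte stores up to the next 8-byte boundary, aligned 64-bit stores for
   the bulk, then byte stores for the 0..7 byte tail. */
static void *memset_aligned(void *dest, uint8_t val, size_t count)
{
    uint8_t *d = dest;
    uint64_t pattern = val * UINT64_C(0x0101010101010101); /* replicate byte */

    while (count && ((uintptr_t)d & 7)) { *d++ = val; count--; }  /* prologue */
    for (; count >= 8; count -= 8, d += 8)
        *(uint64_t *)d = pattern;                                 /* aligned body */
    while (count--) *d++ = val;                                   /* epilogue */
    return dest;
}
```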

Also, precalculate the number of 64-bit loops, at least, and count them down to 0. It is always more efficient to test against 0.

And there are no pitfalls.
User avatar
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Optimized memory functions?

Post by Brendan »

Hi,

Some quick notes:
  • For copying or filling small areas of memory, the code to handle the first few bytes (alignment) and last few bytes will cause branch mis-predictions that will cost more than they're worth. For areas less than 128 bytes it's probably better to avoid these branches by moving bytes (e.g. REP MOVSB/STOSB).
  • For large areas, cache management is important - prefetch cache lines (or pre-load them if prefetch isn't supported) before you need them and flush cache lines as soon as you're finished with them to avoid polluting the cache (e.g. CLFLUSH instruction). For copying data, this includes the source and the destination. For example, if you write one dword in a cache line then the CPU may need to read the rest of the cache line's data before it can do the write (even if the rest of the cache line's data will be overwritten by later writes).
  • For very large areas, consider setting the destination area to "write combining" so that the CPU won't need to read any cache lines that are being modified. Unfortunately, setting pages to write combining involves invalidating the TLB entry for each page (INVLPG). Without using the PAT attributes in page tables you'd need to modify MTRRs and flush all caches (on all CPUs at the same time) which can be *expensive*. Don't forget to change back to "write-back" when you're done.
  • The CPU's string instructions (e.g. REP MOVSD/Q) are optimized so that under certain conditions they work on cache lines instead of individual dwords or qwords. The conditions depend on the CPU - for some, source and destination must be aligned (but not for others). This also has some setup costs, and the CPU won't do this if the area is too small (to avoid the setup costs).
  • Using SSE can simplify cache management (non-temporal stores). SSE can also complicate things (alignment issues). For example, if source and destination aren't mutually aligned you can still get good performance by doing an aligned load, followed by shifting, followed by an aligned store.
  • A generic routine can't be optimized for any specific case. With several different special case routines a programmer can choose the best routine for the situation. For example, you could have a routine specifically designed for small memory areas that just uses "REP MOVSB" (and is always inlined - e.g. written as a macro) to minimize startup overheads.
  • Don't assume your code is better - always measure the performance on a variety of different computers. Something that's faster on one CPU may be slower on another CPU, and something that's faster on one CPU may be slower on a different computer with the exact same CPU because of different RAM speeds.
  • Avoiding the need to copy or fill memory is always going to be faster.
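One way to read the "several special case routines" advice is a size-based dispatch. This is only a sketch of mine: copy_small stands in for an inlined REP MOVSB, and the 128-byte threshold comes from the note above, not from measurement:

```c
#include <stdint.h>
#include <stddef.h>

/* Small path: no alignment prologue, no extra branches. */
static void copy_small(uint8_t *d, const uint8_t *s, size_t n)
{
    while (n--) *d++ = *s++;    /* stands in for an inlined REP MOVSB */
}

/* Large path: can afford the alignment handling. */
static void copy_large(uint8_t *d, const uint8_t *s, size_t n)
{
    while (n && ((uintptr_t)d & 7)) { *d++ = *s++; n--; }   /* align dest */
    for (; n >= 8; n -= 8, d += 8, s += 8)
        *(uint64_t *)d = *(const uint64_t *)s;  /* s may stay unaligned; fine on x86 */
    while (n--) *d++ = *s++;                    /* tail */
}

static void *copy_dispatch(void *dest, const void *src, size_t n)
{
    if (n < 128) copy_small(dest, src, n);
    else         copy_large(dest, src, n);
    return dest;
}
```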
I'd also recommend finding out what the compiler does already. Most compilers have additional support for optimized memcpy() and memset() functions, and don't treat them like normal library functions. It's possible that your own "optimized" versions will make performance worse.

Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
jgraef
Member
Posts: 47
Joined: Wed Jan 16, 2008 7:37 am
Location: Wonsheim, Germany

Re: Optimized memory functions?

Post by jgraef »

These are my memcpy and memset functions. They are optimized for 32-bit.

Code:

void *memcpy(void *dest,const void *src,size_t n) {
  uint32_t num_dwords = n/4;
  uint32_t num_bytes = n%4;
  uint32_t *dest32 = (uint32_t*)dest;
  uint32_t *src32 = (uint32_t*)src;
  uint8_t *dest8 = ((uint8_t*)dest)+num_dwords*4;
  uint8_t *src8 = ((uint8_t*)src)+num_dwords*4;
  uint32_t i;

  for (i=0;i<num_dwords;i++) {
    dest32[i] = src32[i];
  }
  for (i=0;i<num_bytes;i++) {
    dest8[i] = src8[i];
  }
  return dest;
}

Code:

void *memset(void *dest,int val,size_t n) {
  uint32_t num_dwords = n/4;
  uint32_t num_bytes = n%4;
  uint32_t *dest32 = (uint32_t*)dest;
  uint8_t *dest8 = ((uint8_t*)dest)+num_dwords*4;
  uint8_t val8 = (uint8_t)val;
  uint32_t val32 = val|(val<<8)|(val<<16)|(val<<24);
  uint32_t i;

  for (i=0;i<num_dwords;i++) {
    dest32[i] = val32;
  }
  for (i=0;i<num_bytes;i++) {
    dest8[i] = val8;
  }
  return dest;
}
So what they basically do is handle 4 bytes at once, and then when fewer than 4 bytes are left they finish byte-wise.

Maybe it would be more architecture-independent to use int (or unsigned int) instead of uint32_t, and sizeof(int) wherever the size of the fastest integer is needed.
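That sizeof-based idea might look like this (a sketch of mine; treating unsigned long as the native word is an assumption):

```c
#include <stddef.h>
#include <limits.h>

typedef unsigned long word_t;   /* assumption: unsigned long is the native word */

/* Same structure as the memset above, but sized off the word type instead
   of hard-coding uint32_t. */
static void *memset_word(void *dest, int val, size_t n)
{
    size_t num_words = n / sizeof(word_t);
    size_t num_bytes = n % sizeof(word_t);
    word_t *dw = (word_t *)dest;
    unsigned char *db = (unsigned char *)dest + num_words * sizeof(word_t);
    unsigned char v8 = (unsigned char)val;
    word_t vw = 0;
    size_t i;

    for (i = 0; i < sizeof(word_t); i++)    /* replicate the byte */
        vw = (vw << CHAR_BIT) | v8;
    for (i = 0; i < num_words; i++) dw[i] = vw;
    for (i = 0; i < num_bytes; i++) db[i] = v8;
    return dest;
}
```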
User avatar
Jeko
Member
Posts: 500
Joined: Fri Mar 17, 2006 12:00 am
Location: Napoli, Italy

Re: Optimized memory functions?

Post by Jeko »

Code:

static inline void *memcpy(void *to, const void *from, unsigned int n)
{
	void *ret = to;	/* 'to' is advanced below, so save the return value */
	if(cpu.feature.flags.edx.sse)
	{
		unsigned int i;
		for(i=0; i<n/16; i++)
		{
			/* note: movntdq requires a 16-byte aligned destination */
			__asm__ __volatile__ ("movups (%0), %%xmm0\n" "movntdq %%xmm0, (%1)\n"::"r"(from), "r"(to) : "memory");

			from += 16;
			to += 16;
		}
		n &= 15;	/* only the remainder is left to copy */
	}
	if(n >= 8 && cpu.feature.flags.edx.mmx)
	{
		unsigned int i;
		for(i=0; i<n/8; i++)
		{
			__asm__ __volatile__ ("movq (%0), %%mm0\n" "movq %%mm0, (%1)\n"::"r"(from), "r"(to):"memory");
			from += 8;
			to += 8;
		}
		n &= 7;	/* remember to emms before using the FPU again */
	}
	if(n)
	{
		int d0, d1, d2;
		__asm__ __volatile__(
		"rep ; movsl\n\t"
		"testb $2,%b4\n\t"
		"je 1f\n\t"
		"movsw\n"
		"1:\ttestb $1,%b4\n\t"
		"je 2f\n\t"
		"movsb\n"
		"2:"
		: "=&c" (d0), "=&D" (d1), "=&S" (d2)
		:"0" (n/4), "q" (n),"1" ((long) to),"2" ((long) from)
		: "memory");
	}
	return ret;
}
This is my memcpy, optimized for SSE and MMX. I will also support SSE2 and SSE3, if possible.
This memcpy is really, really fast. When I implemented a GUI for my old OS, moving windows was impossibly slow without this optimized memcpy.

And this is memset:

Code:

static inline void *memset(void *s, char c, unsigned int count)
{
	int d0, d1;
	__asm__ __volatile__(
	"rep\n\t"
	"stosb"
	: "=&c" (d0), "=&D" (d1)
	:"a" (c),"1" (s),"0" (count)
	:"memory");
	return s;
}
Rewriting virtual memory manager - Working on ELF support - Working on Device Drivers Handling

http://sourceforge.net/projects/jeko - Jeko Operating System
geppy
Member
Posts: 27
Joined: Wed Jun 11, 2008 5:30 pm

Re: Optimized memory functions?

Post by geppy »

01000101 wrote:I figured that, since I'm programming a 64-bit OS, that I might as well take advantage of it
Sure thing, you should. All long mode CPUs have SSE2, I think.

so here it goes - clearing allocated memory before giving it to the user process:

Code:

;input: rsi - 4KB aligned addr, eax - number of 4KB blocks 

  shl  eax, 12
  pxor xmm0, xmm0
  neg  rax
  sub  rsi, rax
.zero:
  movaps [rsi+rax], xmm0
  movaps [rsi+rax+16], xmm0
  movaps [rsi+rax+32], xmm0
  movaps [rsi+rax+48], xmm0
  movaps [rsi+rax+64], xmm0
  movaps [rsi+rax+80], xmm0
  movaps [rsi+rax+96], xmm0
  movaps [rsi+rax+112], xmm0
  movaps [rsi+rax+128], xmm0
  movaps [rsi+rax+144], xmm0
  movaps [rsi+rax+160], xmm0
  movaps [rsi+rax+176], xmm0
  movaps [rsi+rax+192], xmm0
  movaps [rsi+rax+208], xmm0
  movaps [rsi+rax+224], xmm0
  movaps [rsi+rax+240], xmm0
  add    rax, 256
  jnz    .zero                  
Tip: having multiple movs (movaps, movq, ...) in a single loop pass does speed things up.
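The same unrolling idea, sketched in C for comparison (my sketch; zero64 assumes an 8-byte aligned destination and a byte count that is a multiple of 64):

```c
#include <stdint.h>
#include <stddef.h>

/* Eight 64-bit stores per pass cut the loop overhead (add/compare/branch)
   to once per 64 bytes, like the unrolled movaps loop above. */
static void zero64(uint64_t *dest, size_t count_bytes)
{
    uint64_t *end = dest + count_bytes / 8;
    while (dest != end) {
        dest[0] = 0; dest[1] = 0; dest[2] = 0; dest[3] = 0;
        dest[4] = 0; dest[5] = 0; dest[6] = 0; dest[7] = 0;
        dest += 8;
    }
}
```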

iammisc
Member
Posts: 269
Joined: Thu Nov 09, 2006 6:23 pm

Re: Optimized memory functions?

Post by iammisc »

Personally, on a 64-bit machine, I would do 64-bit copies and implement Duff's device.

http://en.wikipedia.org/wiki/Duff%27s_device
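For reference, a Duff's device copy over 64-bit words might look like this (a sketch based on the Wikipedia form; duff_copy64 is illustrative and guards the count == 0 case the classic version mishandles):

```c
#include <stdint.h>
#include <stddef.h>

/* The switch jumps into the middle of the unrolled loop to handle the
   remainder, then the do/while runs full 8-element rounds. Counts are in
   64-bit words here, not bytes. */
static void duff_copy64(uint64_t *to, const uint64_t *from, size_t words)
{
    if (words == 0) return;             /* classic Duff assumes count > 0 */
    size_t rounds = (words + 7) / 8;
    switch (words % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--rounds > 0);
    }
}
```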
User avatar
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Optimized memory functions?

Post by Brendan »

Hi,
geppy wrote:Tip: having multiple mov(movaps,movq,...) in a single loop pass does speed up things
Especially for copying data, I wouldn't recommend using MOVAPS for arbitrary bytes, because MOVAPS works on single precision floating point numbers - the CPU may modify your data while converting from an unusual floating point encoding into a usual floating point format (e.g. converting denormals into normals, or converting denormals to zero), and some combinations of arbitrary bytes may represent SNaNs (where SNaNs may trigger exceptions). Note: for clearing memory it doesn't make much difference, because the integer 0x00000000 is the floating point encoding for positive zero.

Instead, it'd be better to use MOVDQA (which works on integer double quadwords). For filling many pages with zero it'd be even better to use MOVNTDQ (which also works on integer double quadwords, but includes a non-temporal hint) to avoid filling the caches with zeros. Note: MOVNTDQ won't improve the speed of the copy/fill operation itself, but will prevent lots of cache misses in code that runs after the copy/fill operation has completed.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
thre3dee
Member
Posts: 34
Joined: Mon Nov 03, 2008 7:42 pm

Re: Optimized memory functions?

Post by thre3dee »

Jeko wrote:
[snip - the SSE/MMX memcpy and the rep stosb memset quoted in full in Jeko's post above]
Hi Jeko, I'd love to use this code in my MSVC kernel; however, I can't use this inline assembly as it's not compatible with MSVC.

Could you possibly translate this into generic assembly? I've never encountered this type of inline assembly with numbers and string quotes everywhere. :P Unless you know the MSVC inline assembly syntax:

Code:

__asm {
    mov eax, [myFunctionVar]
    int 0x10
    ; etc etc
}
It would be a great help, as I've never used any extended CPU features like SSE1/2/3DNow!...
tsp
Posts: 14
Joined: Tue Apr 24, 2007 5:35 am

Re: Optimized memory functions?

Post by tsp »

iammisc wrote:...Duff's device...
Thanks for this piece of information! I decided to try Duff's Device in my kernel project and the results were pretty good. The starting situation was that I used the following piece of code for my memory copy function:

Code:

    while (length != 0)
        {
        *(p_dst++) = *(p_src++);
        length--;
        }
I changed the implementation to Duff's Device (as shown in the wiki, plus a small change so that the dst pointer is also incremented). Initially, on my 32-bit ARM7TDMI based platform running @ 48MHz, the data transfer rate (from SRAM to SRAM) was 1.29MB/s; with Duff's Device it is now 1.88MB/s, so Duff's Device gives ~45% better performance for memory copy operations on this HW platform.
User avatar
kmtdk
Member
Posts: 263
Joined: Sat May 17, 2008 4:05 am
Location: Cyperspace, Denmark
Contact:

Re: Optimized memory functions?

Post by kmtdk »

Well, if I remember right, why not use SIMD to transfer the big blocks (I don't know much about SIMD) and rep movsd on the small parts?

But I feel I have missed something in this thread?

KMT dk
well, what to say, to much to do in too little space.
when it goes up hill, increase work, when it goes straight, test yourself but when going down, slow down.
CodeCat
Member
Posts: 158
Joined: Tue Sep 23, 2008 1:45 pm
Location: Eindhoven, Netherlands

Re: Optimized memory functions?

Post by CodeCat »

See Brendan's post. ;)
User avatar
kmtdk
Member
Posts: 263
Joined: Sat May 17, 2008 4:05 am
Location: Cyperspace, Denmark
Contact:

Re: Optimized memory functions?

Post by kmtdk »

I see now, I think I might have overlooked it.

sorry .....



KMT dk
well, what to say, to much to do in too little space.
when it goes up hill, increase work, when it goes straight, test yourself but when going down, slow down.