Optimized memory functions?

01000101
Member
Posts: 1599
Joined: Fri Jun 22, 2007 12:47 pm
Contact:

Re: Optimized memory functions?

Post by 01000101 »

Interesting.
I'll hopefully have some time today to test this out using the setup you posted.

Brendan: Thanks for the info. Direct cache utilization has always been a bit of a mystery to me. The revision just previous to the one posted didn't have PREFETCHNTA or CLFLUSH at all. So would you suggest just using plain ol' MOVDQA instead of its non-temporal storing brethren? I figured the non-temporal storing would improve cache usage later as it wouldn't be filling cache lines with one-shot data.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Optimized memory functions?

Post by Brendan »

Hi,
01000101 wrote: So would you suggest just using plain ol' MOVDQA instead of its non-temporal storing brethren? I figured the non-temporal storing would improve cache usage later as it wouldn't be filling cache lines with one-shot data.
I'd suggest using "movdqa 0(%0), %%xmm0" then "movntdq %%xmm0, 0(%1)" then "clflush 0(%0)" (but not "clflush 0(%1)").
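
For readers who prefer intrinsics to inline assembly, the sequence above could be sketched roughly as follows. This is only an illustration, not code from this thread: the name copy_streaming is hypothetical, and it assumes that both pointers are 64-byte aligned, that the size is a multiple of 64, and that a cache line is 64 bytes. It falls back to plain memcpy when the compiler is not targeting SSE2.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper. Assumptions: dst and src are 64-byte aligned,
 * size is a multiple of 64, and the cache line size is 64 bytes. */
static void copy_streaming(void *dst, const void *src, size_t size)
{
#if defined(__SSE2__)
    const char *s = (const char *)src;
    char *d = (char *)dst;

    for (size_t off = 0; off < size; off += 64) {
        for (size_t i = 0; i < 64; i += 16) {
            /* Normal aligned load from the source (movdqa). */
            __m128i v = _mm_load_si128((const __m128i *)(s + off + i));
            /* Non-temporal store to the destination (movntdq). */
            _mm_stream_si128((__m128i *)(d + off + i), v);
        }
        /* Flush the source line only; the destination is not flushed. */
        _mm_clflush(s + off);
    }
    /* Order the non-temporal stores before any later stores. */
    _mm_sfence();
#else
    memcpy(dst, src, size);
#endif
}
```

Whether this actually beats a plain copy depends on the CPU and on how soon the data is touched again, so it is worth measuring rather than assuming.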


Cheers,

Brendan
01000101
Member
Posts: 1599
Joined: Fri Jun 22, 2007 12:47 pm
Contact:

Re: Optimized memory functions?

Post by 01000101 »

I'm not sure if anyone here would be interested in this, but I created a wiki article about my optimized library with full source code and info. It would be greatly appreciated if those of you who are interested could help out and make the code better. It's probably very incomplete at the time you read this, but I will be working on it.

http://wiki.osdev.org/User:01000101/optlib/
tsdnz
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Optimized memory functions?

Post by tsdnz »

Hi guys, it's 2013 now, a few years since the last post.

This is mine:

As I am targeting 64-bit AMD (x86-64), I am going to convert the while loops to rep stosq/stosb.

Code: Select all

il void SetMem(void* pMem, byte To, dword Size)
{
	if (Size < 128)
	{
		byte* p = (byte*)pMem;
		byte* pEnd = (byte*)pMem + Size;
			
		while (p < pEnd)
			*p++ = To;

		return;
	}

	byte* p = (byte*)pMem;
	byte* pEnd = (byte*)pMem + Size - sizeof(qword);

	qword Toqword = CharsToqword(To, To, To, To, To, To, To, To);

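	// Align to qword boundary: write the 0-7 bytes still missing up to the
	// next multiple of 8 (the cases fall through on purpose)
	switch ((0 - (qword)pMem) & 0x7)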
	{
		case 7:		*p++ = To;
		case 6:		*p++ = To;
		case 5:		*p++ = To;
		case 4:		*p++ = To;
		case 3:		*p++ = To;
		case 2:		*p++ = To;
		case 1:		*p++ = To;
	}

	while (p < pEnd)
	{
		*(qword*)p = Toqword;
		p += sizeof(qword);
	}

	pEnd += sizeof(qword);
	while (p < pEnd)
	{
		*(byte*)p = To;
		p += sizeof(byte);
	}
}
Last edited by tsdnz on Fri Aug 09, 2013 5:12 am, edited 1 time in total.
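
The head/body/tail idea in the function above can also be sketched in portable C. This is a minimal illustration under stated assumptions, not the poster's code: set_mem_sketch is a hypothetical name, and the 8-byte stores go through memcpy with a constant size so that the sketch does not depend on the poster's byte/qword typedefs or on type punning.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; illustrates head / body / tail filling. */
static void *set_mem_sketch(void *mem, unsigned char value, size_t size)
{
    unsigned char *p = (unsigned char *)mem;
    unsigned char *end = p + size;
    /* Replicate the byte into all 8 bytes of a 64-bit pattern. */
    uint64_t pattern = (uint64_t)value * UINT64_C(0x0101010101010101);

    /* Head: byte stores until p is 8-byte aligned or the buffer ends. */
    while (p < end && ((uintptr_t)p & 7) != 0)
        *p++ = value;

    /* Body: one 8-byte store per iteration while at least 8 bytes remain. */
    while ((size_t)(end - p) >= sizeof pattern) {
        memcpy(p, &pattern, sizeof pattern);
        p += sizeof pattern;
    }

    /* Tail: the remaining 0-7 bytes. */
    while (p < end)
        *p++ = value;

    return mem;
}
```

Because the head loop is driven by the pointer value rather than by a switch, there is no separate small-size path to get wrong: short buffers simply run out in the head or tail loop.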
sortie
Member
Posts: 931
Joined: Wed Mar 21, 2012 3:01 pm
Libera.chat IRC: sortie

Re: Optimized memory functions?

Post by sortie »

You do realize that your function doesn't work correctly with non-zero To bytes with arrays below 128 bytes?
tsdnz
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Optimized memory functions?

Post by tsdnz »

Awesome, it will soon. I did not notice that.

About to convert it to rep stosq/stosb.

Is there any way to tell the compiler to preserve the registers used by a function, for example for interrupt handlers?

Currently I have a stub in my secondary bootloader that calls qword [n*8], n being the interrupt number.
This points to my kernel interrupt handler, which pushes all registers.
No biggie, I just like it tidy.

Thanks again
sortie
Member
Posts: 931
Joined: Wed Mar 21, 2012 3:01 pm
Libera.chat IRC: sortie

Re: Optimized memory functions?

Post by sortie »

That question is off-topic in this thread; please make a new one. Additionally, I fail to comprehend exactly what you are asking. You cannot write interrupt handlers in C; you need to write a short stub in assembly that saves all the registers, calls the interrupt handler, and then reloads all the old registers.
tsdnz
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Optimized memory functions?

Post by tsdnz »

Ok, thanks, just what I am doing.
tsdnz
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Optimized memory functions?

Post by tsdnz »

Hi, this is the memset using rep stos, writing qwords where possible.
I had to use my inline macro (il) to make it inline.

Code: Select all

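il void* SetMem(void* pMem, byte To, dword Count)
{
	void* pDst = pMem;			// rep stos advances rdi, so let the asm modify a copy
	qword Bytes = Count & 0x7;	// leading bytes; rep stos uses rcx as its counter

	asm volatile(
		"cld\n\t"
		"jecxz 1f\n\t"
		"rep stosb\n\t"
		"1:\n\t"
		"movl %%edx, %%ecx\n\t"		// qword count (writing ecx zero-extends into rcx)
		"test %%edx, %%edx\n\t"
		"jz 2f\n\t"
		"rep stosq\n\t"
		"2:\n\t"
		:"+D"(pDst),"+c"(Bytes)		// both registers are modified by rep stos
		:"d"(Count >> 3),"a"((((qword)To) << 0) | (((qword)To) << 8) | (((qword)To) << 16) | (((qword)To) << 24) | (((qword)To) << 32) | (((qword)To) << 40) | (((qword)To) << 48) | (((qword)To) << 56))	// a qword is 8 bytes: Count & 7 bytes, then Count >> 3 qwords
		:"memory","cc"				// the asm writes memory and changes the flags
		);

	return pMem;
}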
memset:
extern "C" void* memset(void * s, int c, size_t count)
{
	return SetMem(s, c, count);
}


This is the generated assembly; I have marked the memset code with ****.

Code: Select all

 1000000:	41 57                	push   r15
 1000002:	48 b8 41 41 41 41 41 	movabs rax,0x4141414141414141     **** 
 1000009:	41 41 41     
 100000c:	ba 00 03 00 00       	mov    edx,0x300   **** 
 1000011:	31 c9                	xor    ecx,ecx    **** 
 1000013:	48 bf 00 f0 02 01 00 	movabs rdi,0x102f000    **** 
 100001a:	00 00 00 
 100001d:	41 56                	push   r14
 100001f:	41 55                	push   r13
 1000021:	41 54                	push   r12
 1000023:	55                   	push   rbp
 1000024:	53                   	push   rbx
 1000025:	48 81 ec 38 03 00 00 	sub    rsp,0x338
 100002c:	fc                   	cld     ****     
 100002d:	67 67 e3 02          	addr64 jecxz 1000033 <_Z11StartKernelv+0x33>   **** 
 1000031:	f3 aa                	rep stos BYTE PTR es:[rdi],al    **** 
 1000033:	89 d1                	mov    ecx,edx   **** 
 1000035:	85 d2                	test   edx,edx    **** 
 1000037:	74 03                	je     100003c <_Z11StartKernelv+0x3c>   **** 
 1000039:	f3 48 ab             	rep stos QWORD PTR es:[rdi],rax   **** 
 100003c:	50                   	push   rax
 100003d:	51                   	push   rcx
Antti
Member
Posts: 923
Joined: Thu Jul 05, 2012 5:12 am
Location: Finland

Re: Optimized memory functions?

Post by Antti »

Just a general note: I do not understand why anyone would write code like this:

Code: Select all

il void* SetMem(void* pMem, byte To, dword Count)
{
   void* pDst = pMem;           // rep stos advances rdi, so let the asm modify a copy
   qword Bytes = Count & 0x7;   // leading bytes; rep stos uses rcx as its counter

   asm volatile(
      "cld\n\t"
      "jecxz 1f\n\t"
      "rep stosb\n\t"
      "1:\n\t"
      "movl %%edx, %%ecx\n\t"
      "test %%edx, %%edx\n\t"
      "jz 2f\n\t"
      "rep stosq\n\t"
      "2:\n\t"
      :"+D"(pDst),"+c"(Bytes)
      :"d"(Count >> 3),"a"((((qword)To) << 0) | (((qword)To) << 8) | (((qword)To) << 16) | (((qword)To) << 24) | (((qword)To) << 32) | (((qword)To) << 40) | (((qword)To) << 48) | (((qword)To) << 56))
      :"memory","cc"
      );

   return pMem;
}
Why not have pure assembly routines for functions like these? As a matter of fact, I have not used inline assembly at all. I think it is much more elegant to have "high-level code" and "low-level code" clearly separated. However, that is just my opinion.
Nable
Member
Posts: 453
Joined: Tue Nov 08, 2011 11:35 am

Re: Optimized memory functions?

Post by Nable »

Antti wrote: Just a general note: I do not understand why anyone would write code like this:

Code: Select all

...
Why not have pure assembly routines for functions like these? As a matter of fact, I have not used inline assembly at all. I think it is much more elegant to have "high-level code" and "low-level code" clearly separated. However, that is just my opinion.
1) Inline assembly lets the compiler inline the code (less overhead from saving registers, making the call, returning, and restoring registers).
2) The compiler may be able to better optimize the blocks of code around the inline asm.

My note would be the following: why would anyone bother with hand-crafted memcpy/memset now, when the compiler is able to emit better code (for example, replace a memcpy loop with a few MOVs for short lengths, or use special instructions for long, aligned blocks)?
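
As a small illustration of the last point: when the size passed to memcpy is a compile-time constant and small, optimizing compilers will typically replace the call with one or two plain moves. The helper names load_u32 and store_u32 below are hypothetical and only serve as a sketch; what the compiler actually emits depends on the compiler, the target and the optimization level.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: read and write a 32-bit value at an address that
 * may be unaligned. The size is a constant 4, so at -O2 a compiler will
 * usually turn each memcpy into a single 32-bit move instead of a call. */
static inline uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

static inline void store_u32(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```

Looking at the disassembly of a caller (as done elsewhere in this thread) is the easiest way to check what a given compiler really does with it.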
tsdnz
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Optimized memory functions?

Post by tsdnz »

Hi guys, I have to hand craft it.

There is a bug in gcc 4.8.1 (my version) where memset ends up calling memset, resulting in infinite recursion.

http://www.marshut.com/qnktz/infinite-r ... ction.html
http://forum.osdev.org/viewtopic.php?f=1&t=27016

Just found out that "-fno-tree-loop-distribute-patterns" removes the recursion.
I should have read the bug report fully.
And the generated code is much better, as shown below:

Code: Select all

extern "C" void* memset(void * s, int c, size_t count)
{
	byte* b = (byte*)s;

	while (count-- > 0)
		*b++ = c;

	return s;
} 
Which results in 16-byte moves:

Code: Select all

1000000:	48 ba 30 78 02 01 00 	movabs rdx,0x1027830
 1000007:	00 00 00 
 100000a:	48 b8 00 f0 02 01 00 	movabs rax,0x102f000
 1000011:	00 00 00 
 1000014:	48 b9 00 20 03 01 00 	movabs rcx,0x1032000
 100001b:	00 00 00 
 100001e:	66 0f 6f 02          	movdqa xmm0,XMMWORD PTR [rdx]
 1000022:	66 0f 7f 00          	movdqa XMMWORD PTR [rax],xmm0
 1000026:	48 83 c0 10          	add    rax,0x10
 100002a:	48 39 c8             	cmp    rax,rcx
 100002d:	75 f3                	jne    1000022 <_Z11StartKernelv+0x22>
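
If changing the flags for a whole file is not convenient, GCC also accepts the same option per function through its optimize attribute. The sketch below assumes GCC; k_memset and NO_LOOP_TO_LIBCALL are hypothetical names chosen so that the example does not clash with the C library, and other compilers simply get an empty macro. GCC's documentation describes the optimize attribute as intended mainly for debugging, so passing the flag on the command line for the file that holds memset is the more conservative choice.

```c
#include <stddef.h>

/* Ask GCC not to turn the loop below into a call to memset itself.
 * Hypothetical macro name; expands to nothing on other compilers. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_LOOP_TO_LIBCALL \
    __attribute__((optimize("no-tree-loop-distribute-patterns")))
#else
#define NO_LOOP_TO_LIBCALL
#endif

/* Hypothetical name; in a freestanding kernel this would be memset. */
NO_LOOP_TO_LIBCALL
void *k_memset(void *s, int c, size_t count)
{
    unsigned char *b = (unsigned char *)s;

    /* Plain byte loop; the compiler may still unroll or vectorize it. */
    while (count-- > 0)
        *b++ = (unsigned char)c;

    return s;
}
```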
pcmattman
Member
Posts: 2566
Joined: Sun Jan 14, 2007 9:15 pm
Libera.chat IRC: miselin
Location: Sydney, Australia (I come from a land down under!)
Contact:

Re: Optimized memory functions?

Post by pcmattman »

The pattern matching that turns a memset-like loop into a call to memset is hardly a bug. It is an optimisation that is quite sane for most cases, and in a freestanding environment there is an option you can pass when compiling your memset.c to disable it.

Also, using xmmN registers is nice, but they should only ever be used in the kernel if you fully understand what you are required to do before using them (e.g. save floating point state, make sure everything is ready for the instructions, make sure the running CPU supports the instruction, etc.) and are prepared to accept the extra overhead that comes as a result. For this reason, it's usually wise to compile your kernel code with a set of flags that prevents the compiler from emitting any SSE/MMX/3DNow/etc. instructions, and to write code that uses them yourself when (... if) the situation warrants.

I haven't outright said "none of that in the kernel" because your system could be a single-tasking system, it could be devoid of a userspace, and so on. As always your mileage may vary and there's no "one size fits all" solution.
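
As one concrete piece of the checklist above, "make sure the running CPU supports the instruction" can be sketched with the cpuid.h header that ships with GCC and Clang. The function name cpu_has_sse2 is hypothetical. The sketch only covers the CPUID check: it says nothing about whether the kernel has enabled SSE or saves and restores the SSE state, which are separate requirements.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical helper: returns 1 if CPUID reports SSE2, otherwise 0. */
static int cpu_has_sse2(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 1 holds the basic feature flags; __get_cpuid returns 0 if
     * the leaf is not available. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    /* CPUID.01H:EDX bit 26 is the SSE2 feature flag. */
    return (int)((edx >> 26) & 1);
#else
    return 0; /* not an x86 target */
#endif
}
```

On x86-64 this check always succeeds because SSE2 is part of the base architecture, so it mainly matters for 32-bit kernels or for newer extensions checked the same way.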
tsdnz
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Optimized memory functions?

Post by tsdnz »

Ok, good points.

My project is single task, no user space, cpu is set up for SSE/MMX
dozniak
Member
Posts: 723
Joined: Thu Jul 12, 2012 7:29 am
Location: Tallinn, Estonia

Re: Optimized memory functions?

Post by dozniak »

Why not let gcc generate you a memset? Why do you have to write it yourself?