OSDev.org

Posted: **Thu Mar 19, 2009 7:25 am**

Interesting.
I'll hopefully have some time today to test this out using the setup you posted.

Brendan: Thanks for the info. Direct cache utilization has always been a bit of a mystery to me. The revision just previous to the one posted didn't have PREFETCHNTA or CLFLUSH at all. So would you suggest just using plain ol' MOVDQA instead of its non-temporal storing brethren? I figured the non-temporal storing would improve cache usage later as it wouldn't be filling cache lines with one-shot data.

Posted: **Thu Mar 19, 2009 8:56 am**

Hi,

01000101 wrote:So would you suggest just using plain ol' MOVDQA instead of its non-temporal storing brethren? I figured the non-temporal storing would improve cache usage later as it wouldn't be filling cache lines with one-shot data.

I'd suggest using "movdqa 0(%0), %%xmm0" then "movntdq %%xmm0, 0(%1)" then "clflush 0(%0)" (but not "clflush 0(%1)").

Cheers,

Brendan
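
For readers who prefer C, the sequence suggested above (aligned load, non-temporal store, then flush only the source line) could be sketched with SSE2 intrinsics roughly as follows. This is a minimal sketch and not code from the thread: the name copy_stream is hypothetical, and it assumes 16-byte-aligned pointers, a size that is a multiple of 64, and a 64-byte cache line. On targets without SSE2 it falls back to a plain byte loop.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics for movdqa, movntdq, clflush */
#endif

/* Hypothetical helper (not from the thread). Assumes dst and src are
   16-byte aligned, bytes is a multiple of 64, and cache lines are 64 bytes. */
static void copy_stream(void *dst, const void *src, size_t bytes)
{
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t n = bytes / 64; n != 0; n--) {
        /* Aligned loads (movdqa) of one 64-byte source line. */
        __m128i a = _mm_load_si128(s + 0);
        __m128i b = _mm_load_si128(s + 1);
        __m128i c = _mm_load_si128(s + 2);
        __m128i e = _mm_load_si128(s + 3);

        /* Non-temporal stores (movntdq) to the destination. */
        _mm_stream_si128(d + 0, a);
        _mm_stream_si128(d + 1, b);
        _mm_stream_si128(d + 2, c);
        _mm_stream_si128(d + 3, e);

        /* Flush the source line only, never the destination. */
        _mm_clflush(s);

        s += 4;
        d += 4;
    }
    _mm_sfence(); /* order the non-temporal stores before later stores */
#else
    /* Portable fallback: plain byte copy. */
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    while (bytes--)
        *d++ = *s++;
#endif
}
```

Whether the flush actually helps depends on the workload, so it is worth measuring both variants.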

Posted: **Fri Mar 27, 2009 9:27 pm**

I'm not sure if anyone here would be interested in this, but I created a wiki article about my optimized library with full source code and info. It would be greatly appreciated if those of you who are interested could help out and make the code better. It's probably very incomplete at the time you are reading this, but I will be working on it.

http://wiki.osdev.org/User:01000101/optlib/

Posted: **Fri Aug 09, 2013 4:32 am**

Hi guys, 2013 now, a few years since the last post.

This is mine:

As I am targeting AMD64, I am going to convert the while loops to rep stosq/stosb.

Code:

il void SetMem(void* pMem, byte To, dword Size)
{
	if (Size < 128)
	{
		byte* p = (byte*)pMem;
		byte* pEnd = (byte*)pMem + Size;
			
		while (p < pEnd)
			*p++ = To;

		return;
	}

	byte* p = (byte*)pMem;
	byte* pEnd = (byte*)pMem + Size - sizeof(qword);

	qword Toqword = CharsToqword(To, To, To, To, To, To, To, To);

	// Align to qword boundary: write the 0-7 bytes needed to reach it
	switch ((8 - (((qword)pMem) & 0x7)) & 0x7)
	{
		case 7:		*p++ = To;
		case 6:		*p++ = To;
		case 5:		*p++ = To;
		case 4:		*p++ = To;
		case 3:		*p++ = To;
		case 2:		*p++ = To;
		case 1:		*p++ = To;
	}

	while (p < pEnd)
	{
		*(qword*)p = Toqword;
		p += sizeof(qword);
	}

	pEnd += sizeof(qword);
	while (p < pEnd)
	{
		*(byte*)p = To;
		p += sizeof(byte);
	}
}
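
For comparison, here is a minimal portable sketch of the same head/body/tail idea using only standard types. It is not the poster's code: the name set_mem and the u64_alias typedef are assumptions made for illustration, and the may_alias attribute is a GCC/Clang extension used so the 8-byte stores do not run into strict-aliasing problems. The head count is computed as the number of bytes needed to reach the next 8-byte boundary, capped at the total size.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: a 64-bit type that may alias any other type. */
typedef uint64_t __attribute__((may_alias)) u64_alias;

/* Hypothetical helper (not from the thread). */
static void *set_mem(void *mem, unsigned char value, size_t size)
{
    unsigned char *p = (unsigned char *)mem;
    unsigned char *end = p + size;

    /* Head: bytes needed to reach the next 8-byte boundary (0..7). */
    size_t head = (size_t)(-(uintptr_t)p & 7u);
    if (head > size)
        head = size;
    while (head--)
        *p++ = value;

    /* Body: aligned 8-byte stores of the replicated byte. */
    uint64_t pattern = 0x0101010101010101ULL * value;
    while ((size_t)(end - p) >= sizeof(uint64_t)) {
        *(u64_alias *)p = pattern;
        p += sizeof(uint64_t);
    }

    /* Tail: whatever is left (0..7 bytes). */
    while (p < end)
        *p++ = value;

    return mem;
}
```

Because the head is capped at the size, no separate small-size path is needed in this version.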

Posted: **Fri Aug 09, 2013 4:51 am**

You do realize that your function doesn't work correctly with non-zero To bytes with arrays below 128 bytes?

Posted: **Fri Aug 09, 2013 5:04 am**

Awesome, it will soon. I did not notice that.

About to convert it to rep stosq/stosb.

Is there any way to tell the compiler to preserve the registers used by a function, for example for interrupt handlers?

Currently I have a stub in my secondary bootloader that calls qword [n*8], n being the interrupt number.
This points to my kernel interrupt handler, which pushes all registers.
No biggie, I would just like it tidy.

Thanks again

Posted: **Fri Aug 09, 2013 5:09 am**

That question is off-topic in this thread - make a new one. Additionally, I fail to comprehend exactly what you are asking. You cannot write interrupt handlers in C, you need to write a short stub in assembly that saves all the registers, calls the interrupt handler, and then reloads all the old registers.

Posted: **Fri Aug 09, 2013 5:10 am**

Ok, thanks, just what I am doing.

Posted: **Sun Aug 11, 2013 2:58 am**

Hi, this is the memset using rep stos, qword aligned.
I had to use my inline proc to make it inline.

Code:

il void* SetMem(void* pMem, word To, dword Count)
{
	asm volatile(
		"cld\n\t"
		"jecxz 1f\n\t"
		"rep stosb\n\t"
		"1:\n\t"
		"movl %%edx, %%ecx\n\t"
		"test %%edx, %%edx\n\t"
		"jz 2f\n\t"
		"rep stosq\n\t"
		"2:\n\t"
		:
		:"D"(pMem),"c"(Count & 0xF),"d"(Count >> 4),"a"((((qword)To) << 0) | (((qword)To) << 8) | (((qword)To) << 16) | (((qword)To) << 24) | (((qword)To) << 32) | (((qword)To) << 40) | (((qword)To) << 48) | (((qword)To) << 56))
		);
	asm("": : :"%edi","%ecx","cc");

	return pMem;
}

memset:
extern "C" void* memset(void * s, int c, size_t count)
{
	return SetMem(s, c, count);
}
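
As a point of comparison, a rep stos fill could also be written roughly like this. It is a hedged sketch rather than the poster's routine: the name set_mem_stos is hypothetical. Since rep stosq stores 8 bytes per iteration, the count is split with count & 7 and count >> 3. The registers that the instructions modify are declared as read-write operands ("+D", "+c") in the same asm statement, and a "memory" clobber tells the compiler that memory is written. A plain loop is used on anything other than x86-64 with GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the thread). */
static inline void *set_mem_stos(void *mem, int value, size_t count)
{
    unsigned char *p = (unsigned char *)mem;
    /* Replicate the low byte of value into all 8 bytes of a qword. */
    uint64_t pattern = 0x0101010101010101ULL * (unsigned char)value;
    size_t bytes  = count & 7;  /* leftover bytes, stored first */
    size_t qwords = count >> 3; /* number of 8-byte stores */

#if defined(__x86_64__) && defined(__GNUC__)
    /* rep stos with a zero count stores nothing, so no branch is needed. */
    __asm__ volatile("cld\n\t"
                     "rep stosb"
                     : "+D"(p), "+c"(bytes)
                     : "a"(pattern)
                     : "memory", "cc");
    __asm__ volatile("rep stosq"
                     : "+D"(p), "+c"(qwords)
                     : "a"(pattern)
                     : "memory");
#else
    /* Portable fallback with the same byte-then-qword split. */
    while (bytes--)
        *p++ = (unsigned char)value;
    while (qwords--) {
        for (int i = 0; i < 8; i++)
            *p++ = (unsigned char)value;
    }
#endif
    return mem;
}
```

Note that this does not align the destination first; it only splits the count, so the qword stores are aligned only if the pointer plus the leftover byte count happens to be.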

This is the resulting disassembly; I have marked the memset code with ****.

Code:

 1000000:	41 57                	push   r15
 1000002:	48 b8 41 41 41 41 41 	movabs rax,0x4141414141414141     **** 
 1000009:	41 41 41     
 100000c:	ba 00 03 00 00       	mov    edx,0x300   **** 
 1000011:	31 c9                	xor    ecx,ecx    **** 
 1000013:	48 bf 00 f0 02 01 00 	movabs rdi,0x102f000    **** 
 100001a:	00 00 00 
 100001d:	41 56                	push   r14
 100001f:	41 55                	push   r13
 1000021:	41 54                	push   r12
 1000023:	55                   	push   rbp
 1000024:	53                   	push   rbx
 1000025:	48 81 ec 38 03 00 00 	sub    rsp,0x338
 100002c:	fc                   	cld     ****     
 100002d:	67 67 e3 02          	addr64 jecxz 1000033 <_Z11StartKernelv+0x33>   **** 
 1000031:	f3 aa                	rep stos BYTE PTR es:[rdi],al    **** 
 1000033:	89 d1                	mov    ecx,edx   **** 
 1000035:	85 d2                	test   edx,edx    **** 
 1000037:	74 03                	je     100003c <_Z11StartKernelv+0x3c>   **** 
 1000039:	f3 48 ab             	rep stos QWORD PTR es:[rdi],rax   **** 
 100003c:	50                   	push   rax
 100003d:	51                   	push   rcx

Posted: **Sun Aug 11, 2013 5:17 am**

Just a general note: I do not understand why one would write code like this:

Code:

il void* SetMem(void* pMem, word To, dword Count)
{
   asm volatile(
      "cld\n\t"
      "jecxz 1f\n\t"
      "rep stosb\n\t"
      "1:\n\t"
      "movl %%edx, %%ecx\n\t"
      "test %%edx, %%edx\n\t"
      "jz 2f\n\t"
      "rep stosq\n\t"
      "2:\n\t"
      :
      :"D"(pMem),"c"(Count & 0xF),"d"(Count >> 4),"a"((((qword)To) << 0) | (((qword)To) << 8) | (((qword)To) << 16) | (((qword)To) << 24) | (((qword)To) << 32) | (((qword)To) << 40) | (((qword)To) << 48) | (((qword)To) << 56))
      );
   asm("": : :"%edi","%ecx","cc");

   return pMem;
}

Why not have pure assembly routines for functions like these? As a matter of fact, I have not used inline assembly at all. I think it is much more elegant to have "high-level code" and "low-level code" clearly separated. However, that is just my opinion.

Posted: **Sun Aug 11, 2013 8:20 am**

Antti wrote: Just a general note: I do not understand why one would write code like this:
Code:
...
Why not have pure assembly routines for functions like these? As a matter of fact, I have not used inline assembly at all. I think it is much more elegant to have "high-level code" and "low-level code" clearly separated. However, that is just my opinion.

1) Inline assembly lets the compiler inline the code (less overhead from saving registers, making the call, returning, and restoring registers).
2) The compiler may be able to better optimize the blocks of code around the inline asm.

My note would be the following: why would anyone bother hand-crafting memcpy/memset now, when the compiler is able to emit better code (for example, replacing a memcpy loop with a few MOVs for short lengths, or using special instructions for long, aligned blocks)?
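
To illustrate the point about short, fixed lengths: when the size is a compile-time constant, GCC and Clang will typically expand the builtin forms of memset and memcpy into a few plain stores or loads instead of a call. The struct and function names below are hypothetical, and whether a call is actually avoided depends on the compiler, target, and optimization level, so this is a sketch to inspect with a disassembler rather than a guarantee.

```c
#include <stdint.h>

/* Hypothetical 8-byte structure, for illustration only. */
struct packet_header {
    uint32_t id;
    uint16_t length;
    uint16_t flags;
};

/* Constant 8-byte size: an optimizing compiler usually emits one store. */
static inline void header_clear(struct packet_header *h)
{
    __builtin_memset(h, 0, sizeof *h);
}

/* Constant 4-byte size: usually compiles to a single (unaligned) load. */
static inline uint32_t load_u32(const void *src)
{
    uint32_t v;
    __builtin_memcpy(&v, src, sizeof v);
    return v;
}
```

Comparing the output of objdump for these helpers with and without optimization shows what the compiler does on a given setup.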

Posted: **Sun Aug 11, 2013 1:59 pm**

Hi guys, I have to hand craft it.

There is a bug in gcc 4.8.1 (my version) where memset calls memset, resulting in an infinite loop.

http://www.marshut.com/qnktz/infinite-r ... ction.html
http://forum.osdev.org/viewtopic.php?f=1&t=27016

Just found out that "-fno-tree-loop-distribute-patterns" removes the recursive loop.
I should have read the bug report fully.
And the code is much better, as shown below:

Code:

extern "C" void* memset(void * s, int c, size_t count)
{
	byte* b = (byte*)s;

	while (count-- > 0)
		*b++ = c;

	return s;
}

Which results in 16-byte moves:

Code:

1000000:	48 ba 30 78 02 01 00 	movabs rdx,0x1027830
 1000007:	00 00 00 
 100000a:	48 b8 00 f0 02 01 00 	movabs rax,0x102f000
 1000011:	00 00 00 
 1000014:	48 b9 00 20 03 01 00 	movabs rcx,0x1032000
 100001b:	00 00 00 
 100001e:	66 0f 6f 02          	movdqa xmm0,XMMWORD PTR [rdx]
 1000022:	66 0f 7f 00          	movdqa XMMWORD PTR [rax],xmm0
 1000026:	48 83 c0 10          	add    rax,0x10
 100002a:	48 39 c8             	cmp    rax,rcx
 100002d:	75 f3                	jne    1000022 <_Z11StartKernelv+0x22>

Posted: **Sun Aug 11, 2013 9:25 pm**

The pattern matching that turns a memset-style loop into a call to memset is hardly a bug. It is an optimisation that is quite sane for most cases, and in a freestanding environment there is an option you can pass when compiling your memset.c to disable it.
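
Besides passing -fno-tree-loop-distribute-patterns on the command line for the file that defines memset, GCC also accepts the same option per function through its optimize attribute. The sketch below assumes GCC; the macro NO_LOOP_PATTERNS and the name kmemset are hypothetical (the function is not called memset here so that it can be built and tested next to a C library). On other compilers the macro expands to nothing.

```c
#include <stddef.h>

/* GCC only: disable loop-to-library-call pattern matching for one function.
   NO_LOOP_PATTERNS is a hypothetical macro name. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_LOOP_PATTERNS \
    __attribute__((__optimize__("-fno-tree-loop-distribute-patterns")))
#else
#define NO_LOOP_PATTERNS
#endif

/* Hypothetical stand-in for a kernel's own memset. */
NO_LOOP_PATTERNS
void *kmemset(void *s, int c, size_t count)
{
    unsigned char *b = (unsigned char *)s;

    while (count--)
        *b++ = (unsigned char)c;

    return s;
}
```

The command-line flag mentioned in the thread remains the more direct route; the attribute is just a way to keep the setting next to the function it protects.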

Also, using xmmN registers is nice, but they should only ever be used in the kernel if you fully understand what you are required to do before using them (e.g., save the floating-point state, make sure everything is set up for the instructions, make sure the running CPU supports them, etc.) and are prepared to accept the extra overhead that comes as a result. For this reason, it's usually wise to compile your kernel code with a set of flags that prevents the compiler from emitting any SSE/MMX/3DNow/etc. instructions, and to write code that uses them yourself when (... if) the situation warrants.

I haven't outright said "none of that in the kernel" because your system could be a single-tasking system, it could be devoid of a userspace, and so on. As always your mileage may vary and there's no "one size fits all" solution.
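
As one small piece of the checklist above, the "make sure the running CPU supports the instruction" step can be sketched with the __get_cpuid helper from the cpuid.h header that ships with GCC and Clang. The function name cpu_has_sse2 is hypothetical, and this only covers the CPUID check: a kernel would still have to enable SSE in the control registers and deal with saving the state, which is not shown here.

```c
#include <stdbool.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* GCC/Clang helper header for the CPUID instruction */
#endif

/* Hypothetical helper: reports whether CPUID advertises SSE2. */
static bool cpu_has_sse2(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 if the requested leaf is not supported. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;

    /* CPUID leaf 1, EDX bit 26 indicates SSE2. */
    return (edx & (1u << 26)) != 0;
#else
    return false; /* not an x86 target */
#endif
}
```

A routine such as an SSE-based memset could then be selected at boot only when this returns true, with a plain loop as the fallback.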

Posted: **Sun Aug 11, 2013 10:32 pm**

Ok, good points.

My project is single-tasking with no user space, and the CPU is set up for SSE/MMX.

Posted: **Mon Aug 12, 2013 2:18 am**

Why not let gcc generate you a memset? Why do you have to write it yourself?

Optimized memory functions?
