Optimized memory functions?
Re: Optimized memory functions?
interesting.
I'll hopefully have some time today to test this out using the setup you posted.
Brendan: Thanks for the info. Direct cache utilization has always been a bit of a mystery to me. The revision just previous to the one posted didn't have PREFETCHNTA or CLFLUSH at all. So would you suggest just using plain ol' MOVDQA instead of its non-temporal storing brethren? I figured the non-temporal storing would improve cache usage later as it wouldn't be filling cache lines with one-shot data.
Website: https://joscor.com
Re: Optimized memory functions?
Hi,
01000101 wrote: So would you suggest just using plain ol' MOVDQA instead of its non-temporal storing brethren? I figured the non-temporal storing would improve cache usage later as it wouldn't be filling cache lines with one-shot data.
I'd suggest using "movdqa 0(%0), %%xmm1" then "movntdq %%xmm0, 0(%1)" then "clflush 0(%0)" (but not "clflush 0(%1)").
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
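For readers who prefer intrinsics to inline asm, the sequence Brendan describes (aligned load, non-temporal store, flush the source line but not the destination) might look roughly like the sketch below. It assumes an x86-64 target with SSE2, 64-byte cache lines, 64-byte-aligned buffers and a length that is a multiple of 64; `copy_nt` is a hypothetical name, not something from the posted library.

```cpp
#include <cassert>      // assert for the precondition checks
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics: _mm_load_si128, _mm_stream_si128, _mm_clflush

// Hypothetical helper: copy `bytes` from src to dst without leaving the
// one-shot data in the cache. Assumes 64-byte cache lines.
static void copy_nt(void* dst, const void* src, std::size_t bytes)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 64 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 64 == 0);
    assert(bytes % 64 == 0);

    const char* s = static_cast<const char*>(src);
    char* d = static_cast<char*>(dst);

    for (std::size_t off = 0; off < bytes; off += 64) {
        for (std::size_t i = 0; i < 64; i += 16) {
            // movdqa: aligned 16-byte load from the source
            __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(s + off + i));
            // movntdq: non-temporal 16-byte store to the destination
            _mm_stream_si128(reinterpret_cast<__m128i*>(d + off + i), v);
        }
        // clflush on the source line only; the destination is not flushed
        _mm_clflush(s + off);
    }
    _mm_sfence();  // make the non-temporal stores globally visible
}
```

Whether this is actually faster than a plain MOVDQA copy depends on the CPU and on how soon the data is touched again, so it is worth measuring both.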
Re: Optimized memory functions?
I'm not sure if anyone here would be interested in this, but I created a wiki article about my optimized library with full source code and info. It would be greatly appreciated if those of you who are interested could help out and make the code better. It's probably very incomplete at the time you are reading this, but I will be working on it.
http://wiki.osdev.org/User:01000101/optlib/
Re: Optimized memory functions?
Hi guys, 2013 now, a few years since the last post.
This is mine:
As I am using AMD 64 bit I am going to convert the whiles to rep stosq/stosb
Code: Select all
il void SetMem(void* pMem, byte To, dword Size)
{
    // Small buffers: a plain byte loop is enough.
    if (Size < 128)
    {
        byte* p = (byte*)pMem;
        byte* pEnd = (byte*)pMem + Size;
        while (p < pEnd)
            *p++ = To;
        return;
    }
    byte* p = (byte*)pMem;
    byte* pEnd = (byte*)pMem + Size - sizeof(qword);
    qword Toqword = CharsToqword(To, To, To, To, To, To, To, To);
    // Align to qword boundary: write (8 - misalignment) leading bytes.
    // The cases fall through on purpose.
    switch ((8 - (((qword)pMem) & 0x7)) & 0x7)
    {
        case 7: *p++ = To;
        case 6: *p++ = To;
        case 5: *p++ = To;
        case 4: *p++ = To;
        case 3: *p++ = To;
        case 2: *p++ = To;
        case 1: *p++ = To;
    }
    // Bulk fill, one qword at a time.
    while (p < pEnd)
    {
        *(qword*)p = Toqword;
        p += sizeof(qword);
    }
    // Tail: the remaining bytes.
    pEnd += sizeof(qword);
    while (p < pEnd)
    {
        *(byte*)p = To;
        p += sizeof(byte);
    }
}
Last edited by tsdnz on Fri Aug 09, 2013 5:12 am, edited 1 time in total.
Re: Optimized memory functions?
You do realize that your function doesn't work correctly with non-zero To bytes with arrays below 128 bytes?
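A quick way to catch this kind of problem is to sweep sizes on both sides of the small-buffer threshold, every alignment, and several non-zero fill values, checking that exactly the requested range is written. The sketch below is a hosted test, not kernel code; `set_mem`, `check_fill` and `check_all` are hypothetical names, and `set_mem` is a portable restatement using standard types rather than the poster's exact routine.

```cpp
#include <cassert>   // assert for quick checks
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical portable restatement of the fill routine (not the poster's exact code).
static void set_mem(void* mem, std::uint8_t value, std::size_t size)
{
    std::uint8_t* p = static_cast<std::uint8_t*>(mem);
    std::uint8_t* end = p + size;

    // Leading bytes up to the next 8-byte boundary.
    while (p < end && (reinterpret_cast<std::uintptr_t>(p) & 7u) != 0)
        *p++ = value;

    // Bulk: replicate the byte into all eight lanes and store 8 bytes at a time.
    const std::uint64_t pattern = value * 0x0101010101010101ULL;
    while (static_cast<std::size_t>(end - p) >= sizeof pattern) {
        std::memcpy(p, &pattern, sizeof pattern);  // aliasing-safe 8-byte store
        p += sizeof pattern;
    }

    // Tail bytes.
    while (p < end)
        *p++ = value;
}

// True if set_mem writes exactly [offset, offset + size) and nothing else.
static bool check_fill(std::size_t offset, std::size_t size, std::uint8_t value)
{
    std::uint8_t buf[512];
    const std::uint8_t guard = static_cast<std::uint8_t>(~value);  // always differs from value
    std::memset(buf, guard, sizeof buf);
    set_mem(buf + offset, value, size);
    for (std::size_t i = 0; i < sizeof buf; ++i) {
        const bool inside = i >= offset && i < offset + size;
        if (buf[i] != (inside ? value : guard))
            return false;
    }
    return true;
}

// Sweep sizes 0..300, offsets 0..7, and a few fill values including non-zero ones.
static bool check_all()
{
    const std::uint8_t values[] = {0x00, 0x01, 0x41, 0xFF};
    for (std::uint8_t v : values)
        for (std::size_t off = 0; off < 8; ++off)
            for (std::size_t n = 0; n <= 300; ++n)
                if (!check_fill(off, n, v))
                    return false;
    return true;
}
```

The guard bytes around the filled range also catch overruns from the alignment and tail handling, which a zero-only test would miss.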
Re: Optimized memory functions?
Awesome, it will soon. I did not notice that.
About to convert it to rep stosq/stosb.
Is there any way to tell the compiler to preserve registers used with a function?
Interrupts?
Currently I have a stub in my secondary bootloader that calls qword [n*8], n being the interrupt.
This points to my kernel interrupt handler which pushes all registers.
No biggie, I just like it tidy.
Thanks again
Re: Optimized memory functions?
That question is off-topic in this thread - make a new one. Additionally, I fail to comprehend exactly what you are asking. You cannot write interrupt handlers in C, you need to write a short stub in assembly that saves all the registers, calls the interrupt handler, and then reloads all the old registers.
Re: Optimized memory functions?
Ok, thanks, just what I am doing.
Re: Optimized memory functions?
Hi, this is the memset using rep stos, qword aligned.
I had to use my inline proc to make it inline.
Code: Select all
il void* SetMem(void* pMem, byte To, dword Count)
{
    void* pDst = pMem;          // rep stos advances rdi, so work on a copy
    qword Tail = Count & 0x7;   // leftover bytes; stosq stores 8 bytes, so split at 8
    asm volatile(
        "cld\n\t"
        "jecxz 1f\n\t"
        "rep stosb\n\t"           // tail bytes first
        "1:\n\t"
        "movl %%edx, %%ecx\n\t"   // whole-qword count
        "test %%edx, %%edx\n\t"
        "jz 2f\n\t"
        "rep stosq\n\t"
        "2:\n\t"
        : "+D"(pDst), "+c"(Tail)  // both are modified by rep stos
        : "d"(Count >> 3), "a"((qword)To * 0x0101010101010101ULL)  // byte replicated into all 8 lanes
        : "cc", "memory"
    );
    return pMem;
}
extern "C" void* memset(void * s, int c, size_t count)
{
    return SetMem(s, c, count);
}
This is the assembly, I have put **** to show the memset code.
Code: Select all
1000000: 41 57 push r15
1000002: 48 b8 41 41 41 41 41 movabs rax,0x4141414141414141 ****
1000009: 41 41 41
100000c: ba 00 03 00 00 mov edx,0x300 ****
1000011: 31 c9 xor ecx,ecx ****
1000013: 48 bf 00 f0 02 01 00 movabs rdi,0x102f000 ****
100001a: 00 00 00
100001d: 41 56 push r14
100001f: 41 55 push r13
1000021: 41 54 push r12
1000023: 55 push rbp
1000024: 53 push rbx
1000025: 48 81 ec 38 03 00 00 sub rsp,0x338
100002c: fc cld ****
100002d: 67 67 e3 02 addr64 jecxz 1000033 <_Z11StartKernelv+0x33> ****
1000031: f3 aa rep stos BYTE PTR es:[rdi],al ****
1000033: 89 d1 mov ecx,edx ****
1000035: 85 d2 test edx,edx ****
1000037: 74 03 je 100003c <_Z11StartKernelv+0x3c> ****
1000039: f3 48 ab rep stos QWORD PTR es:[rdi],rax ****
100003c: 50 push rax
100003d: 51 push rcx
Re: Optimized memory functions?
Just a general note: I do not understand why one would write code like this:
Code: Select all
il void* SetMem(void* pMem, byte To, dword Count)
{
    void* pDst = pMem;          // rep stos advances rdi, so work on a copy
    qword Tail = Count & 0x7;   // leftover bytes; stosq stores 8 bytes, so split at 8
    asm volatile(
        "cld\n\t"
        "jecxz 1f\n\t"
        "rep stosb\n\t"           // tail bytes first
        "1:\n\t"
        "movl %%edx, %%ecx\n\t"   // whole-qword count
        "test %%edx, %%edx\n\t"
        "jz 2f\n\t"
        "rep stosq\n\t"
        "2:\n\t"
        : "+D"(pDst), "+c"(Tail)  // both are modified by rep stos
        : "d"(Count >> 3), "a"((qword)To * 0x0101010101010101ULL)  // byte replicated into all 8 lanes
        : "cc", "memory"
    );
    return pMem;
}
Why not have pure assembly routines for functions like these? As a matter of fact, I have not used inline assembly at all. I think it is much more elegant to have "high-level code" and "low-level code" clearly separated. However, that is just my opinion.
Re: Optimized memory functions?
Antti wrote: Just a general note: I do not understand why to write code like this:
Code: Select all
...
Why not having pure assembly routines for functions like these? As a matter of fact, I have not used inline assembly at all. I think it is much more elegant to have "high-level code" and "low-level code" clearly separated. However, that is just my opinion.
1) Inline assembly helps the compiler to inline code (less overhead from things such as saving registers, doing the call, doing the return, and restoring registers).
2) It's possible that the compiler can better optimize the blocks of code around this inline asm.
My note would be the following: why would anyone bother now with hand-crafting memcpy/memset when the compiler is able to emit better code (for example, replacing a memcpy loop with a few MOVs for short lengths, or using special instructions for long, aligned blocks)?
Re: Optimized memory functions?
Hi guys, I have to hand craft it.
There is a bug in gcc 4.8.1 (my version) where memset calls memset, resulting in an infinite loop.
http://www.marshut.com/qnktz/infinite-r ... ction.html
http://forum.osdev.org/viewtopic.php?f=1&t=27016
Just found out that "-fno-tree-loop-distribute-patterns" removes the recursive loop.
I should have read the bug report fully.
And the code is much better, as below:
Code: Select all
extern "C" void* memset(void * s, int c, size_t count)
{
    byte* b = (byte*)s;
    while (count-- > 0)
        *b++ = c;
    return s;
}
Which results in 16-byte moves:
Code: Select all
1000000: 48 ba 30 78 02 01 00 movabs rdx,0x1027830
1000007: 00 00 00
100000a: 48 b8 00 f0 02 01 00 movabs rax,0x102f000
1000011: 00 00 00
1000014: 48 b9 00 20 03 01 00 movabs rcx,0x1032000
100001b: 00 00 00
100001e: 66 0f 6f 02 movdqa xmm0,XMMWORD PTR [rdx]
1000022: 66 0f 7f 00 movdqa XMMWORD PTR [rax],xmm0
1000026: 48 83 c0 10 add rax,0x10
100002a: 48 39 c8 cmp rax,rcx
100002d: 75 f3 jne 1000022 <_Z11StartKernelv+0x22>
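If changing the flags for the whole file is inconvenient, GCC's `optimize` function attribute can apply the same option to a single function. The following is a minimal sketch assuming GCC; other compilers simply get an empty macro. `k_memset` and `NO_LOOP_TO_LIBCALL` are hypothetical names chosen so the example can be built next to a hosted libc; in a kernel the function would be the real `memset`.

```cpp
#include <cassert>   // assert for quick checks
#include <cstddef>

// GCC only: stop the byte loop below from being recognised as a memset
// pattern and replaced by a call to memset (same effect as
// -fno-tree-loop-distribute-patterns, but for this one function).
#if defined(__GNUC__) && !defined(__clang__)
#define NO_LOOP_TO_LIBCALL __attribute__((optimize("no-tree-loop-distribute-patterns")))
#else
#define NO_LOOP_TO_LIBCALL
#endif

// Hypothetical name; the loop body is the same as in the post above.
extern "C" NO_LOOP_TO_LIBCALL void* k_memset(void* s, int c, std::size_t count)
{
    unsigned char* b = static_cast<unsigned char*>(s);
    while (count-- > 0)
        *b++ = static_cast<unsigned char>(c);
    return s;
}
```

The command-line flag is still the simpler choice when a whole source file contains only such routines.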
Re: Optimized memory functions?
The pattern matching that turns a memset-like loop into a call to memset is hardly a bug. It is an optimisation that is quite sane for most cases, and in a freestanding environment there is an option you can pass when compiling your memset.c to disable it.
Also, using xmmN registers is nice, but should only ever be used in the kernel if you fully understand what you are required to do before using them (eg, save floating point state, make sure everything is ready for the instructions, make sure the running CPU supports the instruction, etc...) and are prepared to accept the extra overhead that comes as a result. For this reason, it's usually wise to make sure you compile your kernel code with a set of flags that disallows the compiler from emitting any SSE/MMX/3DNow/etc instructions, and writing code to use them yourself when (... if) the situation warrants.
I haven't outright said "none of that in the kernel" because your system could be a single-tasking system, it could be devoid of a userspace, and so on. As always your mileage may vary and there's no "one size fits all" solution.
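As one small piece of the checklist above, "make sure the running CPU supports the instruction" can be sketched with the `<cpuid.h>` helper that ships with GCC and Clang. This only covers the feature check (SSE2 is reported in CPUID leaf 1, EDX bit 26); saving the floating point state and enabling SSE in the control registers are separate steps not shown here. `cpu_has_sse2` is a hypothetical name.

```cpp
#include <cassert>   // assert for quick checks
#include <cpuid.h>   // GCC/Clang wrapper around the CPUID instruction

// Hypothetical helper: true if CPUID leaf 1 reports SSE2 (EDX bit 26).
static bool cpu_has_sse2()
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;  // leaf 1 is not available on this CPU
    return (edx & (1u << 26)) != 0;
}
```

On any x86-64 CPU this returns true, since SSE2 is part of the baseline; the check matters mainly for 32-bit targets.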
Re: Optimized memory functions?
Ok, good points.
My project is single task, no user space, cpu is set up for SSE/MMX
Re: Optimized memory functions?
Why not let gcc generate you a memset? Why do you have to write it yourself?
Learn to read.