
Re: Optimized memory functions?

Posted: Tue Nov 18, 2008 8:46 am
by Masterkiller
I'm not sure (I'll test it), but I remember that the REP prefix works a bit faster than an explicit loop jump.

Code: Select all

;faster
REP STOSD

Code: Select all

;slower
s_loop: STOSD
LOOP s_loop
The second version is quite a bit slower because it recalculates EIP and takes the jump on every iteration.

Re: Optimized memory functions?

Posted: Tue Nov 25, 2008 6:19 pm
by 01000101
I used GCC's -O3 flag for the code I posted originally, and it actually proved to be faster than my hand-written assembly using XMM/MMX registers and MOVNTDQ. Even when testing repeated memcpy/memset calls in a row, there were no noticeable issues related to temporal hints or cache-line misses with the GCC optimizations. The GCC version came in about 10-20 *ticks* (measured using the RDTSC instruction and calculating the time elapsed) under the hand-written assembly.
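For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of RDTSC timing (assuming GCC/Clang on x86-64; `__rdtsc` comes from `x86intrin.h`, and the function name is mine):

Code: Select all

```c
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>  /* __rdtsc; GCC/Clang on x86 */

/* Time one 512-byte memcpy in TSC ticks: read the counter before
   and after, and return the difference. */
static uint64_t time_memcpy_512(void)
{
    static char dst[512], src[512];
    uint64_t start = __rdtsc();
    memcpy(dst, src, sizeof dst);
    return __rdtsc() - start;
}
```

Note this is the naive version: it doesn't serialize around the RDTSC reads, so out-of-order execution can blur the boundaries a little, as discussed further down the thread.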

Re: Optimized memory functions?

Posted: Tue Nov 25, 2008 9:09 pm
by geppy
As you probably read in Brendan's post and from the links that I posted, movntdqa is not faster than movdqa for memory sizes that fit into the L1 cache.

Another thing to consider is that mov(nt)dqa has a latency greater than 1 on the 65nm Core 2 Duo, and even larger on earlier Intel CPUs. Mov(nt)dqa standalone (not paired with other SSE instructions or with itself) might not be fast.

If you use prefetch (t0/t1/nta), then your loop had better write/read a cache line's worth of data (64/128 bytes) per pass, and most certainly from cache-line-aligned addresses.

Did you try 64-bit MMX on an x64 processor? You really shouldn't.

I hope your hand-written assembly used the counter trick that I illustrated. Basically, you should have a bunch of instructions that read/write from/to memory, a single decrement/increment of the counter, and a single conditional jump based on the flags set by that counter update.
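In C, the shape described above (a block of transfers, one counter update, one conditional jump) might be sketched like this; the 64-byte block size and the function name are my own assumptions:

Code: Select all

```c
#include <stdint.h>

/* One cache line (64 bytes) per pass: eight 64-bit transfers, then a
   single counter decrement and a single conditional branch. */
static void copy_lines64(uint64_t *dst, const uint64_t *src, uint64_t lines)
{
    while (lines) {
        dst[0] = src[0]; dst[1] = src[1]; dst[2] = src[2]; dst[3] = src[3];
        dst[4] = src[4]; dst[5] = src[5]; dst[6] = src[6]; dst[7] = src[7];
        dst += 8; src += 8;
        lines--;   /* the flags from this decrement feed the loop branch */
    }
}
```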

IMHO, you should use the CLFLUSH or WBINVD instruction after each testing pass (not like those guys on the masm32 forum, getting 1 cycle for a dword-to-ASCII-hex conversion).

PS: Overall, OSDev is not the right place to discuss general programming, and especially not its low-level optimization. Maybe you should first clock your algorithms against the ones that assembly programming sites provide (under Windows).

Re: Optimized memory functions?

Posted: Thu Nov 27, 2008 1:00 am
by thepowersgang
geppy wrote: PS: Overall, OSDev is not the right place to discuss general programming, and especially not its low-level optimization. Maybe you should first clock your algorithms against the ones that assembly programming sites provide (under Windows).
On the contrary, this is a perfect place to discuss low level optimization, because the number one place you do not want bloat and inefficiency is in the OS kernel.

Would it be a good idea to have a memcpy/memset function using

Code: Select all

rep movsd
for use in an emulator like Bochs, or would it be better to use an SSE version?

Re: Optimized memory functions?

Posted: Thu Nov 27, 2008 7:03 am
by Combuster
if bochs is on a realtime clock, i expect rep movsd to be the fastest, since bochs has special repeat speedups for it (which eliminate the decoding cost of each repeated instruction).

if bochs is on a slowdown clock, you need to optimize for the minimal number of instructions. That means in 32-bit mode SSE is better (double the transfer per instruction, minus some loop overhead). In long mode, rep movsq (64 bits / 1 instruction) is faster than an SSE move-move pair (128 bits / 2 instructions, plus extra code for loop overhead).

That is, if my knowledge of the bochs internals is accurate here.

Re: Optimized memory functions?

Posted: Tue Dec 23, 2008 10:01 am
by jal
01000101 wrote:here's a rough prototype that I'm testing right now, I wanted to know of any feedback regarding these memset and memcpy functions.

Code: Select all

void memcpy(void *dest, void *src, uint64 count)
{
  if(!count){return;} // nothing to copy?
  while(count >= 8){ *(uint64*)dest = *(uint64*)src; memcpy_transfers_64++; dest += 8; src += 8; count -= 8; }
  while(count >= 4){ *(uint32*)dest = *(uint32*)src; memcpy_transfers_32++; dest += 4; src += 4; count -= 4; }
  while(count >= 2){ *(uint16*)dest = *(uint16*)src; memcpy_transfers_16++; dest += 2; src += 2; count -= 2; }
  while(count >= 1){ *(uint8*)dest = *(uint8*)src; memcpy_transfers_8++; dest += 1; src += 1; count -= 1; }
  return;
}

void memset(void *dest, uint8 sval, uint64 count)
{
  uint64 val = (sval & 0xFF); // create a 64-bit version of 'sval'
  val |= ((val << 8) & 0xFF00);
  val |= ((val << 16) & 0xFFFF0000);
  val |= ((val << 32) & 0xFFFFFFFF00000000);
    
  if(!count){return;} // nothing to copy?
  while(count >= 8){ *(uint64*)dest = (uint64)val; memset_transfers_64++; dest += 8; count -= 8; }
  while(count >= 4){ *(uint32*)dest = (uint32)val; memset_transfers_32++; dest += 4; count -= 4; }
  while(count >= 2){ *(uint16*)dest = (uint16)val; memset_transfers_16++; dest += 2; count -= 2; }
  while(count >= 1){ *(uint8*)dest = (uint8)val; memset_transfers_8++; dest += 1; count -= 1; }
  return; 
}
A few remarks on this specific code which I don't think have been mentioned yet:
1) Unless you plan to call the functions mostly with 0 for count, don't put an if (!count) in there. It adds slight overhead on every call, and when count is 0 it will fall through all the whiles anyway.
2) Only the first of the whiles needs to be a while (the one with count >= 8). For the others, an if suffices, as each will run at most once. This also means you don't need the increments in the last while/if, since nothing comes after it.
3) I'm not entirely sure whether you can perform pointer arithmetic on a void pointer (dest += ...). Your mileage may vary depending on the compiler. I'd do something like (*(uint64*)dest)++ = ... instead.
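As an illustration of point 3, one portable way around the void-pointer arithmetic is to do the arithmetic on byte pointers and cast only at the access. This is just a sketch (not 01000101's actual code), and it glosses over alignment and strict-aliasing concerns just as the original does:

Code: Select all

```c
#include <stdint.h>

/* Byte pointers make the += arithmetic well-defined in standard C;
   the wide-access cast happens only at the load/store itself. */
void *memcpy_sketch(void *dest, const void *src, uint64_t count)
{
    uint8_t *d = dest;
    const uint8_t *s = src;
    while (count >= 8) { *(uint64_t *)d = *(const uint64_t *)s;
                         d += 8; s += 8; count -= 8; }
    while (count--)    { *d++ = *s++; }   /* byte tail */
    return dest;
}
```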


JAL

Re: Optimized memory functions?

Posted: Tue Dec 23, 2008 10:22 am
by 01000101
Thanks!
Since then, I have removed the trailing while statements, but I hadn't even thought of the if(!count) statement being useless.

As for the pointer arithmetic, it is valid. If anything, it would be a bit safer to add the intended memory size as the typecast (something like: "((uint64)dest)++;"). Using *(uint64*) would increment the value pointed to, not the pointer.

Re: Optimized memory functions?

Posted: Tue Dec 23, 2008 2:13 pm
by quok
01000101 wrote:Thanks!
As for the pointer arithmetic, it is valid. If anything, it would be a bit safer to add the intended memory size as the typecast (something like: "((uint64)dest)++;"). Using *(uint64*) would increment the value pointed to, not the pointer.
I had to look at that twice because the first time I thought you were doing arithmetic with void pointers, which is not valid (except as a rather stupid GCC extension).

A couple of nitpicky things:
- Your function prototypes don't follow the standard.
- uint64 isn't really a valid type, it should be uint64_t. Same goes for uint8/uint8_t. Unless you're programming in matlab. ;)
- count should be of type size_t, not uint64_t, even if size_t is 64 bits on your system.
- In memcpy, src is supposed to be const, and if you're using C99 you should really use the restrict keyword as well.
- In memset, sval should be plain ol' int, and later cast to uint8_t.
- If you're using C99, you should make use of the restrict keyword. The proper C99 prototype for memcpy is: memcpy(void *restrict s1, const void *restrict s2, size_t n)
- As others have pointed out, it may also be faster if you make sure that src and dest are properly aligned first and handle that as needed.
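Putting those prototype points together, a standard-shaped pair might look like this (the my_ prefixes are mine, to avoid clashing with the C library; byte-at-a-time for simplicity):

Code: Select all

```c
#include <stddef.h>

/* C99 prototype: restrict-qualified pointers, const source, size_t count. */
void *my_memcpy(void *restrict s1, const void *restrict s2, size_t n)
{
    unsigned char *d = s1;
    const unsigned char *s = s2;
    while (n--)
        *d++ = *s++;
    return s1;
}

/* The fill value arrives as plain int and is converted to unsigned char. */
void *my_memset(void *s, int c, size_t n)
{
    unsigned char *p = s;
    while (n--)
        *p++ = (unsigned char)c;
    return s;
}
```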

Re: Optimized memory functions?

Posted: Tue Dec 23, 2008 2:50 pm
by jal
01000101 wrote:As for the pointer arithmetic, it is valid. If anything it would be bit safer to add the default memory-size as the typecast (something like: "((uint64)dest)++;"). Using *(uint64*) would increment the value pointed to, not the pointer.
Heh, any correction contains at least one error :) I forget which law that was :). *((uint64*)someptr++) should work then, although I'm not entirely sure the dereference takes place before the increment. I've gotten a bit rusty with the rules...


JAL

Re: Optimized memory functions?

Posted: Tue Dec 23, 2008 3:03 pm
by 01000101
jal wrote:
01000101 wrote:As for the pointer arithmetic, it is valid. If anything it would be bit safer to add the default memory-size as the typecast (something like: "((uint64)dest)++;"). Using *(uint64*) would increment the value pointed to, not the pointer.
Heh, any correction contains at least one error :) I forget which law that was :). *((uint64*)someptr++) should work then, although I'm not entirely sure the dereference takes place before the increment. I've gotten a bit rusty with the rules...


JAL
I believe you're referring to Muphry's law. =)
I have a hard time wrapping my head around which part gets incremented in the expression *(uint64*)pointer++. Also, in that expression, is it incremented by one or by sizeof(uint64)? C can be funky with pointers at times.
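For what it's worth, the rule is that postfix ++ binds tighter than the cast: it is the pointer itself that gets incremented, by the size of its declared type (not the cast's type), and the old value is what gets dereferenced. A small test illustrates this (helper names are mine):

Code: Select all

```c
#include <stdint.h>

/* Declared type uint64_t*: ++ advances by sizeof(uint64_t) = 8 bytes,
   and the OLD pointer value is the one dereferenced. */
static uint64_t take64(uint64_t **pp) { return *(*pp)++; }

/* Declared type uint8_t*: the cast does NOT change the stride, so ++
   still advances by only 1 byte. */
static uint64_t peek64(uint8_t **pp) { return *(uint64_t *)(*pp)++; }
```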

Re: Optimized memory functions?

Posted: Tue Dec 23, 2008 6:42 pm
by LoseThos
With intelligent chips like Intel makes, I've found almost no benefit from such optimizations. I was really depressed when I made my compiler much more optimal and it had no impact. Intel chips reorder instructions while executing and do all kinds of crazy things, possibly executing two instructions at once. The funny thing is that a good compiler could do the same, and it almost seems silly to waste silicon on it.

Test your results and let me know -- I'm curious if they shorten timings. If you do improve timings, I really want to know!

I remember playing with STOSB and STOSQ and I don't think it helped.

I built many functions into my compiler. It doesn't allow "inlining" of arbitrary functions, but I can build functions into the compiler pretty easily. The built-in ones can do things like recognize a constant size in MemSet() and pick from STOSB, STOSW, STOSD or STOSQ. For tiny sizes it will even substitute MOV [RDI],0 / MOV 8[RDI],0 and so on.
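The constant-size idea can be sketched in C like this (the function name is mine, not LoseThos's; with optimization on, compilers lower each fixed-size memcpy to a single MOV of the matching width):

Code: Select all

```c
#include <stdint.h>
#include <string.h>

/* For a small known size, emit one direct store of the matching width
   instead of a string instruction or a loop. memcpy is used for the
   wide stores to sidestep alignment/aliasing rules. */
static void zero_small(void *dst, size_t n)
{
    switch (n) {
    case 8: { uint64_t z = 0; memcpy(dst, &z, 8); break; }  /* MOV qword */
    case 4: { uint32_t z = 0; memcpy(dst, &z, 4); break; }  /* MOV dword */
    case 2: { uint16_t z = 0; memcpy(dst, &z, 2); break; }  /* MOV word  */
    case 1: *(uint8_t *)dst = 0; break;                     /* MOV byte  */
    default: memset(dst, 0, n); break;                      /* fall back */
    }
}
```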

I have BTS, BTR, BT and BTC built in. That's cool. I have read-time-stamp built in too. Does anyone know if RDTSC's frequency changes on a Core i7? I might be in trouble #-o

Re: Optimized memory functions?

Posted: Tue Dec 23, 2008 7:24 pm
by 01000101
I actually just redid my speed tests on memcpy/memset.

using movntdqa to load and movdqu to store (from xmm registers) had an average tick count of 712 ticks per 512-byte copy or set. Done in a separate .asm file in nasm.

using just the plain ol' "rep movsb" or "rep stosb" in a .asm file yielded an average of 10 ticks per 512-byte copy or set.

using __uint128_t and uint8_t within a C memcpy/set yielded a result of about 400 ticks per 512-byte copy or set. That was with -O3 -m64 -ftracer -msse4.1 .

so, to me at least, it seems that the simplest of designs really is the best one in this case.
RDTSC doesn't really change per se, but it counts up faster on faster processors, as it stores the number of clock ticks. Thus faster systems that can ram through code faster will of course report more ticks elapsed per second.

Re: Optimized memory functions?

Posted: Tue Dec 23, 2008 9:19 pm
by Brendan
Hi,
LoseThos wrote:I was really depressed when I made my compiler much more optimal and it had no impact. Intel chips reorder instructions while executing and do all kinds of crazy things, possibly executing two instructions at once. The funny thing is that a good compiler could do the same, and it almost seems silly to waste silicon on it.
Don't be depressed - instead, buy yourself something with Intel's Atom CPU in it... ;)
LoseThos wrote:Does anyone know if RDTSC's frequency changes on a core i7. I might be in trouble #-o
I'd assume that certain power management events will cause a Core i7's RDTSC to change frequency (just like certain power management events cause RDTSC to change frequency on lots of Intel CPUs, starting with the Pentium-M). However, I'm not sure which power management events affect RDTSC frequency. SpeedStep normally does affect RDTSC frequency, but I don't think STPCLK and software-controlled thermal throttling do. I'm not sure about the new "turbo mode" either - I think it does affect RDTSC frequency, and I think it's also tied into the local APIC's "thermal management interrupt" so that an OS can find out when RDTSC changes frequency (for both SpeedStep and Turbo Mode).

Normally I'd double check, but I don't have the time at the moment (I get a short lunch break today due to Christmas induced consumer insanity).


Cheers,

Brendan

Re: Optimized memory functions?

Posted: Wed Dec 24, 2008 11:54 am
by quok
Brendan wrote:
LoseThos wrote:Does anyone know if RDTSC's frequency changes on a core i7. I might be in trouble #-o
I'd assume that certain power management events will cause a Core i7's RDTSC to change frequency (just like certain power management events cause RDTSC to change frequency on lots of Intel CPUs, starting with the Pentium-M). However, I'm not sure which power management events affect RDTSC frequency. SpeedStep normally does affect RDTSC frequency, but I don't think STPCLK and software-controlled thermal throttling do. I'm not sure about the new "turbo mode" either - I think it does affect RDTSC frequency, and I think it's also tied into the local APIC's "thermal management interrupt" so that an OS can find out when RDTSC changes frequency (for both SpeedStep and Turbo Mode).
Actually, the Core i7 has a constant TSC: it increments at the maximum CPU rate regardless of the current CPU rate. It also has an invariant TSC, which means the TSC keeps incrementing even when the clock is stopped in a deep C-state. I can't remember off the top of my head whether anything special needs to be done to enable that behavior, however.
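For completeness, whether the invariant TSC is present can at least be detected with CPUID leaf 0x80000007 (EDX bit 8); here's a sketch using GCC's cpuid.h helper:

Code: Select all

```c
#include <cpuid.h>   /* __get_cpuid; GCC/Clang */

/* CPUID.80000007H:EDX bit 8 is the "Invariant TSC" flag. Returns 1 if
   the TSC ticks at a constant rate across P-/C-state changes, and 0
   otherwise (or if the leaf is unsupported). */
static int tsc_is_invariant(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        return 0;
    return (edx >> 8) & 1;
}
```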

Re: Optimized memory functions?

Posted: Wed Dec 24, 2008 11:59 am
by quok
01000101 wrote:I actually just redid my speed tests on memcpy/memset.

using movntdqa to load and movdqu to store (from xmm registers) had an average tick count of 712 ticks per 512-byte copy or set. Done in a separate .asm file in nasm.

using just the plain ol' "rep movsb" or "rep stosb" in a .asm file yielded an average of 10 ticks per 512-byte copy or set.

using __uint128_t and uint8_t within a C memcpy/set yielded a result of about 400 ticks per 512-byte copy or set. That was with -O3 -m64 -ftracer -msse4.1 .

so, to me at least, it seems that the simplest of designs really is the best one in this case.
RDTSC doesn't really change per se, but it counts up faster on faster processors, as it stores the number of clock ticks. Thus faster systems that can ram through code faster will of course report more ticks elapsed per second.
Hmm, out of curiosity, did you use plain ol' RDTSC, or the serializing RDTSCP? It may not be necessary (I'm not familiar enough with the SSE instructions to know whether they serialize), but I'd be interested in another set of benchmarks where you use RDTSCP to make sure things weren't executed out of order. :)