Code: Select all
;faster
REP STOSD
Code: Select all
;slower
s_loop: STOSD
LOOP s_loop
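For context, the "faster" string-instruction form above can be driven from C with a little inline assembly. This is only a sketch (GCC/Clang on x86; `stosd_fill` is a hypothetical helper name, not from the thread):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: fill 'count' dwords at 'dest' with 'value' using the
   "faster" REP STOSD form above. "stosl" is the AT&T mnemonic
   for STOSD. GCC/Clang x86 inline asm only. */
static void stosd_fill(void *dest, uint32_t value, size_t count)
{
    __asm__ volatile ("rep stosl"
                      : "+D" (dest), "+c" (count)  /* RDI/RCX updated by the instruction */
                      : "a" (value)                /* EAX holds the fill pattern */
                      : "memory");
}
```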
On the contrary, this is a perfect place to discuss low-level optimization, because the number one place you do not want bloat and inefficiency is in the OS kernel.

geppy wrote:PS: Overall, OSDev is not the right place to discuss general programming, and especially its low-level optimization. Maybe you should clock your algorithms against the ones that assembly programming sites provide first (under Windows).
Code: Select all
rep movsd
A few remarks on this specific code which I don't think have been mentioned yet:

01000101 wrote:here's a rough prototype that I'm testing right now; I wanted to know of any feedback regarding these memset and memcpy functions.
Code: Select all
void memcpy(void *dest, void *src, uint64 count)
{
    if(!count){ return; } // nothing to copy?

    while(count >= 8){
        *(uint64*)dest = *(uint64*)src;
        memcpy_transfers_64++;
        dest += 8; src += 8; count -= 8;
    }
    while(count >= 4){
        *(uint32*)dest = *(uint32*)src;
        memcpy_transfers_32++;
        dest += 4; src += 4; count -= 4;
    }
    while(count >= 2){
        *(uint16*)dest = *(uint16*)src;
        memcpy_transfers_16++;
        dest += 2; src += 2; count -= 2;
    }
    while(count >= 1){
        *(uint8*)dest = *(uint8*)src;
        memcpy_transfers_8++;
        dest += 1; src += 1; count -= 1;
    }
}

void memset(void *dest, uint8 sval, uint64 count)
{
    uint64 val = (sval & 0xFF); // create a 64-bit version of 'sval'
    val |= ((val << 8) & 0xFF00);
    val |= ((val << 16) & 0xFFFF0000);
    val |= ((val << 32) & 0xFFFFFFFF00000000);

    if(!count){ return; } // nothing to set?

    while(count >= 8){
        *(uint64*)dest = (uint64)val;
        memset_transfers_64++;
        dest += 8; count -= 8;
    }
    while(count >= 4){
        *(uint32*)dest = (uint32)val;
        memset_transfers_32++;
        dest += 4; count -= 4;
    }
    while(count >= 2){
        *(uint16*)dest = (uint16)val;
        memset_transfers_16++;
        dest += 2; count -= 2;
    }
    while(count >= 1){
        *(uint8*)dest = (uint8)val;
        memset_transfers_8++;
        dest += 1; count -= 1;
    }
}
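As an aside on the memset above: the shift-and-OR cascade replicates one byte across all eight bytes of a 64-bit word. Multiplying by 0x0101010101010101 is an equivalent one-step idiom; a sketch (not from the original post):

```c
#include <stdint.h>

/* Sketch: replicate one byte across all eight bytes of a 64-bit word,
   equivalent to the shift-and-OR sequence in the quoted memset. */
static uint64_t replicate_byte(uint8_t b)
{
    return (uint64_t)b * 0x0101010101010101ULL;
}
```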
I had to look at that twice, because the first time I thought you were doing arithmetic with void pointers, which is not valid (except as a rather stupid GCC extension).

01000101 wrote:Thanks!
As for the pointer arithmetic, it is valid. If anything, it would be a bit safer to add the default memory size as the typecast (something like: "((uint64)dest)++;"). Using *(uint64*) would increment the value pointed to, not the pointer.
Heh, any correction contains at least one error :) Forgot what law that was :). *((uint64*)someptr++) is supposed to work then, although I'm not entirely sure the dereference takes place before the increment. I've gotten a bit rusty with the rules...

01000101 wrote:As for the pointer arithmetic, it is valid. If anything, it would be a bit safer to add the default memory size as the typecast (something like: "((uint64)dest)++;"). Using *(uint64*) would increment the value pointed to, not the pointer.
I believe you're referring to Muphry's law. =)

jal wrote:Heh, any correction contains at least one error. Forgot what law that was. *((uint64*)someptr++) is supposed to work then, although I'm not entirely sure the dereference takes place before the increment. Gotten a bit rusty with the rules...

01000101 wrote:As for the pointer arithmetic, it is valid. If anything, it would be a bit safer to add the default memory size as the typecast (something like: "((uint64)dest)++;"). Using *(uint64*) would increment the value pointed to, not the pointer.
JAL
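Since the precedence question came up: a tiny test settles it. Postfix ++ binds tighter than the unary dereference, so *p++ reads through the old pointer value and then advances the pointer by one element. A sketch (hypothetical helper name, not from the thread):

```c
#include <stdint.h>

/* Sketch: returns 1 if *p++ dereferences the old pointer and then
   advances it by sizeof(uint64_t), 0 otherwise. */
static int postfix_increment_check(void)
{
    uint64_t data[2] = {11, 22};
    uint64_t *p = data;
    uint64_t v = *p++;   /* reads data[0], then p points at data[1] */
    return (v == 11) && (p == &data[1]);
}
```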
Don't be depressed - instead, buy yourself something with Intel's Atom CPU in it...

LoseThos wrote:I was really depressed when I made my compiler much more optimal but it had no impact. Intel chips reorder instructions while executing and do all kinds of crazy things, possibly doing two instructions at once. The funny thing is, a good compiler could do it, and it almost seems silly to waste silicon on it.
I'd assume that certain power management events will cause a Core i7's RDTSC to change frequency (just like certain power management events cause RDTSC to change frequency on lots of Intel CPUs, starting with the Pentium M). However, I'm not sure which power management events affect RDTSC frequency. SpeedStep normally does affect RDTSC frequency, but I don't think STPCLK and software-controlled thermal throttling do. I'm not sure about the new "turbo mode" either - I think it does affect RDTSC frequency, and I think it's also tied into the local APIC's "thermal management interrupt" so that an OS can find out when RDTSC changes frequency (for both SpeedStep and Turbo Mode).

LoseThos wrote:Does anyone know if RDTSC's frequency changes on a Core i7? I might be in trouble.
Actually, the Core i7 has a constant TSC: it increments at the maximum CPU rate regardless of the current CPU rate. It also has an invariant TSC, which means the TSC increments even when the clock is stopped in a deep C-state. I can't remember off the top of my head if anything special needs to be done to enable that behavior, however.

Brendan wrote:I'd assume that certain power management events will cause a Core i7's RDTSC to change frequency (just like certain power management events cause RDTSC to change frequency on lots of Intel CPUs, starting with the Pentium M). However, I'm not sure which power management events affect RDTSC frequency. SpeedStep normally does affect RDTSC frequency, but I don't think STPCLK and software-controlled thermal throttling do. I'm not sure about the new "turbo mode" either - I think it does affect RDTSC frequency, and I think it's also tied into the local APIC's "thermal management interrupt" so that an OS can find out when RDTSC changes frequency (for both SpeedStep and Turbo Mode).

LoseThos wrote:Does anyone know if RDTSC's frequency changes on a Core i7? I might be in trouble.
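For the record, the invariant-TSC property discussed above is reported by CPUID leaf 0x80000007, EDX bit 8; as far as I know nothing needs to be enabled, it's a fixed property of the part. A sketch using GCC's <cpuid.h>:

```c
#include <cpuid.h>

/* Sketch: returns 1 if CPUID leaf 0x80000007 reports the invariant
   TSC (EDX bit 8), 0 if the bit is clear or the leaf is missing. */
static int has_invariant_tsc(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        return 0;  /* extended leaf not supported on this CPU */
    return (edx >> 8) & 1;
}
```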
Hmm, out of curiosity, did you use plain ol' RDTSC, or the serializing RDTSCP? It may not be necessary - I'm not familiar enough with the SSE instructions to know whether they're serializing or not - but I'd be interested in another set of benchmarks where you made use of RDTSCP, to make sure things weren't executed out of order.

01000101 wrote:I actually just redid my speed tests on memcpy/memset.
using movntdqa to load and movdqu to store (from xmm registers) had an average tick count of 712 ticks per 512-byte copy or set. Done in a separate .asm file in NASM.
using just the plain ol' "rep movsb" or "rep stosb" in a .asm file yielded an average of 10 ticks per 512-byte copy or set.
using __uint128_t and uint8_t within a C memcpy/set yielded a result of about 400 ticks per 512-byte copy or set. That was with -O3 -m64 -ftracer -msse4.1.
so, to me at least, it seems that the simplest of designs really is the best one in this case.
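The RDTSCP suggestion above can be sketched like this (GCC/Clang with <x86intrin.h>; the 512-byte size matches the figures quoted, and `time_copy_512` is a hypothetical helper name):

```c
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>

/* Sketch: time a 512-byte memcpy with RDTSCP. RDTSCP waits for prior
   instructions to retire before reading the TSC, so the copy cannot
   drift outside the timed window. Illustrative only. */
static uint64_t time_copy_512(unsigned char *dst, const unsigned char *src)
{
    unsigned int aux;  /* receives IA32_TSC_AUX; unused here */
    uint64_t t0 = __rdtscp(&aux);
    memcpy(dst, src, 512);
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}
```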
RDTSC's rate doesn't really change per se, but it will increase faster on faster processors, since it counts clock cycles (not instructions executed). A higher-clocked system will of course accumulate more ticks per second.