I've been brainstorming ways to optimize core memory functions such as 'memset' and 'memcpy'.
The goal is that a call like 'memset(&cArray, 0x00, 128);' shouldn't need a loop of 128 single-byte stores. Most memset functions I have seen just loop, decrementing the 'count' variable and copying one char per iteration. Since I'm programming a 64-bit OS, I figured I might as well take advantage of the wider transfers; the same idea applies to 32-bit OSes.
Here's a rough prototype that I'm testing right now. I'd appreciate any feedback on these memset and memcpy functions.
Code:
void memcpy(void *dest, const void *src, uint64 count)
{
    // byte-typed cursors: arithmetic directly on a void* is a compiler extension, not standard C
    uint8 *d = (uint8*)dest;
    const uint8 *s = (const uint8*)src;
    if(!count){return;} // nothing to copy?
    // widest transfers first
    while(count >= 8){ *(uint64*)d = *(const uint64*)s; memcpy_transfers_64++; d += 8; s += 8; count -= 8; }
    // fewer than 8 bytes remain here, so each narrower width runs at most once
    if(count >= 4){ *(uint32*)d = *(const uint32*)s; memcpy_transfers_32++; d += 4; s += 4; count -= 4; }
    if(count >= 2){ *(uint16*)d = *(const uint16*)s; memcpy_transfers_16++; d += 2; s += 2; count -= 2; }
    if(count >= 1){ *d = *s; memcpy_transfers_8++; }
    return;
}
void memset(void *dest, uint8 sval, uint64 count)
{
    // byte-typed cursor: arithmetic directly on a void* is a compiler extension, not standard C
    uint8 *d = (uint8*)dest;
    uint64 val = sval;  // create a 64-bit version of 'sval' by doubling the pattern
    val |= (val << 8);  // 2 copies of the byte
    val |= (val << 16); // 4 copies
    val |= (val << 32); // 8 copies
    if(!count){return;} // nothing to set?
    // widest transfers first
    while(count >= 8){ *(uint64*)d = val; memset_transfers_64++; d += 8; count -= 8; }
    // fewer than 8 bytes remain here, so each narrower width runs at most once
    if(count >= 4){ *(uint32*)d = (uint32)val; memset_transfers_32++; d += 4; count -= 4; }
    if(count >= 2){ *(uint16*)d = (uint16)val; memset_transfers_16++; d += 2; count -= 2; }
    if(count >= 1){ *d = (uint8)val; memset_transfers_8++; }
    return;
}
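The 'val' setup at the top of memset can be checked in isolation. Below is a minimal sketch, assuming a hosted compiler with <stdint.h> (so uint64_t/uint8_t stand in for my own uint64/uint8 typedefs). The names replicate_shift and replicate_mul are made up for illustration. The second helper shows what I believe is an equivalent way to build the same pattern with a single multiply:

```c
#include <stdint.h>

/* Hypothetical helper: replicate one byte into all eight bytes of a
   64-bit word using shift-and-OR doubling, as memset's 'val' setup does. */
static uint64_t replicate_shift(uint8_t b)
{
    uint64_t v = b;
    v |= v << 8;   /* 2 copies of the byte */
    v |= v << 16;  /* 4 copies */
    v |= v << 32;  /* 8 copies */
    return v;
}

/* Hypothetical alternative: multiply by a constant that has 0x01 in every
   byte. Each byte of the product is b, and since b < 256 nothing carries
   from one byte position into the next. */
static uint64_t replicate_mul(uint8_t b)
{
    return (uint64_t)b * 0x0101010101010101ULL;
}
```

For example, both helpers should turn 0xAB into 0xABABABABABABABAB.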
Say you want to memset 128 bytes: this would perform 16 64-bit transfers instead of 128 8-bit transfers. Likewise, a memset of 18 bytes would perform two 64-bit transfers and one 16-bit transfer instead of 18 single-byte operations.
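The arithmetic behind those numbers can be worked out without touching memory. Here is a small sketch, assuming <stdint.h> types; the struct transfer_counts and the function count_transfers are hypothetical names, not part of the prototype. It mirrors the 8/4/2/1 cascade and reports how many transfers of each width a given byte count would take:

```c
#include <stdint.h>

/* Hypothetical bookkeeping struct: number of transfers of each width. */
struct transfer_counts {
    uint64_t n64;  /* 8-byte transfers */
    uint64_t n32;  /* 4-byte transfers */
    uint64_t n16;  /* 2-byte transfers */
    uint64_t n8;   /* 1-byte transfers */
};

/* Split 'count' bytes the same way the widest-first loops do:
   as many 8-byte transfers as fit, then at most one each of 4, 2 and 1. */
static struct transfer_counts count_transfers(uint64_t count)
{
    struct transfer_counts t;
    t.n64 = count / 8; count %= 8;
    t.n32 = count / 4; count %= 4;
    t.n16 = count / 2; count %= 2;
    t.n8  = count;
    return t;
}
```

With this, count_transfers(128) gives 16 64-bit transfers and nothing else, and count_transfers(18) gives two 64-bit transfers plus one 16-bit transfer. The worst case for the tail is a remainder of 7 bytes, which takes one transfer each of 4, 2 and 1 bytes.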
What do you think of this type of optimization? What would be the pitfalls?
[edit]btw, the memxxx_transfers_xx variables are just logging counters; if I weren't so lazy, I would have taken them out.[/edit]