Questions about which tools to use, bugs, the best way to implement a function, etc. should go here. Don't forget to see if your question is answered in the wiki first! When in doubt, post here.
MarkOS wrote:Where can I find optimized functions of string.h?
GCC and other compilers have built-in versions of the really time-critical functions like memcpy / memset. You could also look into the source code of available open-source C libraries, of course.
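As a minimal sketch of what this looks like in practice: GCC's `__builtin_memcpy` asks the compiler to expand the copy inline when it can, and ordinary `memcpy` calls are usually recognized and expanded the same way at `-O2` (the wrapper function name here is invented for illustration):

```c
#include <assert.h>
#include <string.h>

/* __builtin_memcpy requests inline expansion of the copy; with a
   constant size the compiler typically emits a few wide moves
   instead of a library call. */
static void copy_greeting(char *dst, const char *src, unsigned n)
{
    __builtin_memcpy(dst, src, n);
}
```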
This is my memcpy, how can I optimize it?
The best way is to build on the work of other people, e.g. using the GCC builtins.
Every good solution is obvious once you've found it.
Well, you are checking for SSE2, SSE and MMX support, which is good. However, what if the CPU doesn't support any of those technologies? (That may never happen in practice, but you can still fall back to STOSD with a REP prefix.)
If I were you, and if I were going to code these with general-purpose registers, I would perform these steps:
• Check the buffer's alignment. Fill the aligned portion first, then handle the unaligned leftover bytes.
• Check the buffer's length and see if it is divisible by 8, 4 or 2. Move QWORDs, DWORDs or WORDs on each iteration when any of those conditions is met. For example, if the buffer length is 16 bytes, you can move 2 QWORDs into it to fill it. If the buffer is, say, 13 bytes, you can move 1 QWORD (8 bytes), 1 DWORD (4 bytes) and then 1 byte into it.
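The steps above can be sketched in plain C (a hypothetical illustration, not the poster's actual code; `chunked_memcpy` is an invented name):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the two steps above: copy byte-wise until the destination
   is 8-byte aligned, then move 8/4/2-byte chunks, then mop up the
   tail a byte at a time. */
void *chunked_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* 1. Align the destination on an 8-byte boundary. */
    while (n && ((uintptr_t)d & 7)) {
        *d++ = *s++;
        n--;
    }
    /* 2. Move the largest chunks the remaining length allows; memcpy
       per chunk keeps the source side safe w.r.t. alignment rules.
       A 13-byte remainder becomes 1 QWORD + 1 DWORD + 1 byte. */
    while (n >= 8) { memcpy(d, s, 8); d += 8; s += 8; n -= 8; }
    if (n >= 4)    { memcpy(d, s, 4); d += 4; s += 4; n -= 4; }
    if (n >= 2)    { memcpy(d, s, 2); d += 2; s += 2; n -= 2; }
    if (n)         { *d = *s; }
    return dst;
}
```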
On the field with sword and shield amidst the din of dying of men's wails. War is waged and the battle will rage until only the righteous prevails.
MessiahAndrw wrote:Assembler is a 1-to-1 representation of machine code (macros don't count), so 1 instruction vs. 4 will be faster.
Not necessarily.
What happens here is that the code will require 4 units of stores and 4 units of decodes, compared to 1 store / 1 decode. However, on modern processors, if the bottleneck is not in the store or decode units, this may well even out: if you do a set of divides next to these instructions, the ALU becomes the obvious bottleneck and the processor has time to spare to commit the moves. Due to some processors' black magic, the longer sequence might even be faster.
Pentium 1s and earlier can't do the necessary scheduling, and there the longer sequence will indeed take more time.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
In that case, maybe it would be better to use right shifts to avoid divisions (aren't bitwise operations supposed to be faster?). Right-shifting by 1 bit emulates a division by 2, shifting by 2 bits a division by 4, and shifting by 3 bits a division by 8. That gives the count to use for a "REP STOSW", "REP STOSD" or "REP STOSQ" operation (which should be faster), and the shifted-out bits, saved beforehand, give the count for a small number of "REP STOSB" operations. By avoiding "DIV" instructions and transferring the largest chunk possible at once (WORDs, DWORDs or QWORDs), everything should be faster.
Also, align any routine and the targets of "JMP" or conditional jump instructions to 4 or 8 bytes using align(4) or align(8).
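A minimal sketch of the counting described above, in C (assuming a 4-byte-aligned buffer, as the post does; `zero_fill` is an invented name):

```c
#include <stddef.h>
#include <stdint.h>

/* n >> 2 gives the "REP STOSD" count, n & 3 the leftover
   "REP STOSB" count -- no DIV anywhere. */
void zero_fill(void *buf, size_t n)
{
    size_t dwords = n >> 2;   /* n / 4, emulated with a shift     */
    size_t tail   = n & 3;    /* n % 4, the saved shifted-out bits */
    uint32_t *p32 = buf;

    while (dwords--) *p32++ = 0;   /* the "REP STOSD" part */

    unsigned char *p8 = (unsigned char *)p32;
    while (tail--) *p8++ = 0;      /* the "REP STOSB" part */
}
```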
Last edited by ~ on Fri Apr 06, 2007 4:34 am, edited 1 time in total.
~ wrote:In such case, maybe it would be better then, to use right shifts to avoid divisions (aren't bitwise calculations supposed to be faster?), in this way, by right shifting 1 bit a division by 2 can be emulated, right shifting 3 bits a division by 4 can be emulated and right shifting 4 bits a division by 8 can be emulated.
Yes, for powers of two there's no point in doing integer division when a shift takes a single clock. You got the math wrong, though. The easy way to remember it is that right-shifting by n bits divides by 2^n; similarly, left-shifting by n bits multiplies by 2^n.
So for division by 2, you shift 1 bit. For division by 4 you shift 2 bits. For division by 8 you shift 3 bits, and for division by 16 you shift 4 bits.
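The rule above in code (trivial wrappers with invented names, for unsigned values):

```c
/* Right shift by n divides by 2^n; left shift by n multiplies by 2^n.
   This only holds cleanly for unsigned (or non-negative) values. */
static unsigned div_pow2(unsigned x, unsigned n) { return x >> n; }
static unsigned mul_pow2(unsigned x, unsigned n) { return x << n; }
```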
The real problem with goto is not with the control transfer, but with environments. Properly tail-recursive closures get both right.
The problem is that trying to determine the size and "alignment" of every arbitrary string could well take longer than just performing the shifting and copying naively (for example, determining whether the string is 3 bytes long -- for now we assume the buffer itself is 4- or 8-byte aligned).
Because instead of just executing an "SHR" (>>) and saving the shifted-out bits for a "REP STOSB" that copies the leftover (non-power-of-2) byte count, we would need an if() and maybe a modulus (%) to pick the best option, which itself involves extra arithmetic, or at least additional instructions and their corresponding clock cycles.
I hope you guys realize that any compiler worth using will convert divisions to shifts when possible, so it really isn't worth worrying about in C or C++. Really, I find my memcpy to be quite fast:
You can specify at compile time the maximum number of bytes to transfer at a time. This is of course not taking into account MMX or SSE or anything like that; just plain C, letting the compiler do its job well. I actually checked the resultant assembly and I would be hard pressed to do much better.
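The original listing isn't preserved in this excerpt; what follows is a hypothetical reconstruction of the idea: pick the chunk type at compile time and write plain C, leaving the wide moves and the shift/mask arithmetic to the compiler. `COPY_WORD` and `plain_copy` are invented names, and both buffers must be suitably aligned for the chunk type:

```c
#include <stddef.h>
#include <stdint.h>

#ifndef COPY_WORD
#define COPY_WORD uint32_t   /* compile-time knob: max chunk type */
#endif

void *plain_copy(void *dst, const void *src, size_t n)
{
    COPY_WORD *d = dst;
    const COPY_WORD *s = src;
    size_t words = n / sizeof(COPY_WORD);  /* compiled as a shift */
    size_t tail  = n % sizeof(COPY_WORD);  /* compiled as a mask  */

    while (words--) *d++ = *s++;

    unsigned char *db = (unsigned char *)d;
    const unsigned char *sb = (const unsigned char *)s;
    while (tail--) *db++ = *sb++;
    return dst;
}
```

Since `sizeof(COPY_WORD)` is a power-of-two constant, an optimizing compiler lowers the `/` and `%` to a shift and a mask, matching the "no IDIV" observation below.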
(Just so you know, the resultant object code has not a single IDIV instruction in it, just shifts where I divided.)
Even more fun is my memset, which multiplies the byte by a 0x01010101 mask to get a dword holding 4 copies of it, yet the compiler comes up with very efficient and creative ways to do this -- no IMUL instruction:
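Again, the original listing is missing from this excerpt; here is a sketch of the trick as described, assuming a 4-byte-aligned buffer (`set32` is an invented name):

```c
#include <stddef.h>
#include <stdint.h>

/* byte * 0x01010101 replicates the byte into all 4 lanes of a dword
   (0xAB * 0x01010101 == 0xABABABAB, with no carries between lanes),
   so whole dwords can then be stored at once. */
void *set32(void *buf, int c, size_t n)
{
    uint32_t pattern = (uint8_t)c * 0x01010101u;  /* 4 copies of c */
    uint32_t *p = buf;
    size_t dwords = n / 4;   /* the compiler lowers these to */
    size_t tail   = n % 4;   /* a shift and a mask           */

    while (dwords--) *p++ = pattern;

    unsigned char *b = (unsigned char *)p;
    while (tail--) *b++ = (unsigned char)c;
    return buf;
}
```

Whether the compiler keeps the multiply, turns it into shifts and ORs, or uses some other trick is up to its cost model, which matches the poster's observation that no IMUL appeared in the output.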