Questions about which tools to use, bugs, the best way to implement a function, etc. should go here. Don't forget to see if your question is answered in the wiki first! When in doubt, post here.
MarkOS wrote:Where can I find optimized functions of string.h?
GCC and other compilers have built-in versions of the really time-critical functions like memcpy / memset. You could also look into the source code of available open-source C libraries, of course.
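As a minimal sketch of what this looks like in practice: GCC's `__builtin_memcpy` asks the compiler to expand the copy inline when it can, and ordinary `memcpy` calls are usually recognized and expanded the same way at `-O2` (the wrapper function name here is invented for illustration):

```c
#include <assert.h>
#include <string.h>

/* __builtin_memcpy requests inline expansion of the copy; with a
   constant size the compiler typically emits a few wide moves
   instead of a library call. */
static void copy_greeting(char *dst, const char *src, unsigned n)
{
    __builtin_memcpy(dst, src, n);
}
```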
This is my memcpy, how can I optimize it?
The best way is to build on the work of other people, e.g. using the GCC builtins.
Every good solution is obvious once you've found it.
Well, you are checking for SSE2, SSE and MMX support, which is good. However, what if the CPU doesn't support any of those technologies? (That may never happen in practice, but you can still fall back to STOSD with a REP prefix.)
If I were you, and if I were going to code these with general-purpose registers, I would perform these steps:
• Check the buffer's alignment. Fill the aligned portion first, then handle the unaligned leftover bytes.
• Check the buffer's length and see if it is divisible by 8, 4 or 2. Move QWORDs, DWORDs or WORDs on each iteration when any of those conditions is met. For example, if the buffer length is 16 bytes, you can move 2 QWORDs into it to fill it. If the buffer is, say, 13 bytes, you can move 1 QWORD (8 bytes), 1 DWORD (4 bytes) and then 1 byte into it.
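The steps above can be sketched in plain C (a hypothetical illustration, not the poster's actual code; `chunked_memcpy` is an invented name):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the two steps above: copy byte-wise until the destination
   is 8-byte aligned, then move 8/4/2-byte chunks, then mop up the
   tail a byte at a time. */
void *chunked_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* 1. Align the destination on an 8-byte boundary. */
    while (n && ((uintptr_t)d & 7)) {
        *d++ = *s++;
        n--;
    }
    /* 2. Move the largest chunks the remaining length allows; memcpy
       per chunk keeps the source side safe w.r.t. alignment rules.
       A 13-byte remainder becomes 1 QWORD + 1 DWORD + 1 byte. */
    while (n >= 8) { memcpy(d, s, 8); d += 8; s += 8; n -= 8; }
    if (n >= 4)    { memcpy(d, s, 4); d += 4; s += 4; n -= 4; }
    if (n >= 2)    { memcpy(d, s, 2); d += 2; s += 2; n -= 2; }
    if (n)         { *d = *s; }
    return dst;
}
```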
On the field with sword and shield amidst the din of dying of men's wails. War is waged and the battle will rage until only the righteous prevails.
MessiahAndrw wrote:Assembler is a 1-to-1 representation of machine code (macros don't count), so 1 instruction vs. 4 will be faster.
Not necessarily.
What happens here is that the code will require 4 units of stores and 4 units of decodes, compared to 1 store / 1 decode. However, on modern processors, if the bottleneck is not in the store or decode units, this may well even out: if you do a set of divides next to these instructions, the ALU becomes the obvious bottleneck and the processor has time to spare to commit the moves. Due to some processors' black magic, the longer sequence might even be faster.
Pentium 1s and earlier can't do the necessary scheduling, and there the longer sequence will indeed take more time.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
In that case, maybe it would be better to use right shifts to avoid divisions (aren't bitwise operations supposed to be faster?). Right-shifting by 1 bit emulates a division by 2, shifting by 2 bits a division by 4, and shifting by 3 bits a division by 8. That gives the count to use for a "REP STOSW", "REP STOSD" or "REP STOSQ" operation (which should be faster), and the shifted-out bits, saved beforehand, give the count for a small number of "REP STOSB" operations. By avoiding "DIV" instructions and transferring the largest chunk possible at once (WORDs, DWORDs or QWORDs), everything should be faster.
Also, align any routine and the targets of "JMP" or conditional jump instructions to 4 or 8 bytes using align(4) or align(8).
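A minimal sketch of the counting described above, in C (assuming a 4-byte-aligned buffer, as the post does; `zero_fill` is an invented name):

```c
#include <stddef.h>
#include <stdint.h>

/* n >> 2 gives the "REP STOSD" count, n & 3 the leftover
   "REP STOSB" count -- no DIV anywhere. */
void zero_fill(void *buf, size_t n)
{
    size_t dwords = n >> 2;   /* n / 4, emulated with a shift     */
    size_t tail   = n & 3;    /* n % 4, the saved shifted-out bits */
    uint32_t *p32 = buf;

    while (dwords--) *p32++ = 0;   /* the "REP STOSD" part */

    unsigned char *p8 = (unsigned char *)p32;
    while (tail--) *p8++ = 0;      /* the "REP STOSB" part */
}
```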
Last edited by ~ on Fri Apr 06, 2007 4:34 am, edited 1 time in total.
~ wrote:In such case, maybe it would be better then, to use right shifts to avoid divisions (aren't bitwise calculations supposed to be faster?), in this way, by right shifting 1 bit a division by 2 can be emulated, right shifting 3 bits a division by 4 can be emulated and right shifting 4 bits a division by 8 can be emulated.
Yes, for powers of two there's no point in doing integer division when a shift takes a single clock. You got the math wrong, though. The easy way to remember it is that right-shifting by n bits divides by 2^n; similarly, left-shifting by n bits multiplies by 2^n.
So for division by 2, you shift 1 bit. For division by 4 you shift 2 bits. For division by 8 you shift 3 bits, and for division by 16 you shift 4 bits.
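The rule above in code (trivial wrappers with invented names, for unsigned values):

```c
/* Right shift by n divides by 2^n; left shift by n multiplies by 2^n.
   This only holds cleanly for unsigned (or non-negative) values. */
static unsigned div_pow2(unsigned x, unsigned n) { return x >> n; }
static unsigned mul_pow2(unsigned x, unsigned n) { return x << n; }
```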
The real problem with goto is not with the control transfer, but with environments. Properly tail-recursive closures get both right.
The problem is that trying to determine the size and "alignment" of every arbitrary string could well take longer than just performing the shifting and copying naively (for example, determining whether the string is 3 bytes long -- for now we assume the buffer itself is 4- or 8-byte aligned).
Because instead of just executing an "SHR" (>>) and saving the shifted-out bits for a "REP STOSB" that copies the leftover (non-power-of-2) byte count, we would need an if() and maybe a modulus (%) to pick the best option, which itself involves extra arithmetic, or at least additional instructions and their corresponding clock cycles.
I hope you guys realize that any compiler worth using will convert divisions to shifts when possible, so it really isn't worth worrying about in C or C++. Really, I find my memcpy to be quite fast:
You can specify at compile time the maximum number of bytes to transfer at a time. This is of course not taking into account MMX or SSE or anything like that; just plain C, letting the compiler do its job well. I actually checked the resultant assembly and I would be hard pressed to do much better.
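The original listing isn't preserved in this excerpt; what follows is a hypothetical reconstruction of the idea: pick the chunk type at compile time and write plain C, leaving the wide moves and the shift/mask arithmetic to the compiler. `COPY_WORD` and `plain_copy` are invented names, and both buffers must be suitably aligned for the chunk type:

```c
#include <stddef.h>
#include <stdint.h>

#ifndef COPY_WORD
#define COPY_WORD uint32_t   /* compile-time knob: max chunk type */
#endif

void *plain_copy(void *dst, const void *src, size_t n)
{
    COPY_WORD *d = dst;
    const COPY_WORD *s = src;
    size_t words = n / sizeof(COPY_WORD);  /* compiled as a shift */
    size_t tail  = n % sizeof(COPY_WORD);  /* compiled as a mask  */

    while (words--) *d++ = *s++;

    unsigned char *db = (unsigned char *)d;
    const unsigned char *sb = (const unsigned char *)s;
    while (tail--) *db++ = *sb++;
    return dst;
}
```

Since `sizeof(COPY_WORD)` is a power-of-two constant, an optimizing compiler lowers the `/` and `%` to a shift and a mask, matching the "no IDIV" observation below.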
(Just so you know, the resultant object code has not a single IDIV instruction in it, just shifts where I divided.)
Even more fun is my memset, which multiplies the byte by a 0x01010101 mask to get a dword holding 4 copies of it, yet the compiler comes up with very efficient and creative ways to do this -- no IMUL instruction:
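Again, the original listing is missing from this excerpt; here is a sketch of the trick as described, assuming a 4-byte-aligned buffer (`set32` is an invented name):

```c
#include <stddef.h>
#include <stdint.h>

/* byte * 0x01010101 replicates the byte into all 4 lanes of a dword
   (0xAB * 0x01010101 == 0xABABABAB, with no carries between lanes),
   so whole dwords can then be stored at once. */
void *set32(void *buf, int c, size_t n)
{
    uint32_t pattern = (uint8_t)c * 0x01010101u;  /* 4 copies of c */
    uint32_t *p = buf;
    size_t dwords = n / 4;   /* the compiler lowers these to */
    size_t tail   = n % 4;   /* a shift and a mask           */

    while (dwords--) *p++ = pattern;

    unsigned char *b = (unsigned char *)p;
    while (tail--) *b++ = (unsigned char)c;
    return buf;
}
```

Whether the compiler keeps the multiply, turns it into shifts and ORs, or uses some other trick is up to its cost model, which matches the poster's observation that no IMUL appeared in the output.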