String.h & memcpy

Question about which tools to use, bugs, the best way to implement a function, etc should go here. Don't forget to see if your question is answered in the wiki first! When in doubt post here.
Post Reply
User avatar
Jeko
Member
Member
Posts: 500
Joined: Fri Mar 17, 2006 12:00 am
Location: Napoli, Italy

String.h & memcpy

Post by Jeko »

Where can I find optimized functions of string.h? (memcpy, memset, strcpy, ...)

This is my memcpy, how can I optimize it?

Code: Select all

char *memcpy(char *from, char *to, unsigned int n)
{
	if(cpu.sse2)
	{
		int i;
		for(i=0; i<n/16; i++)
		{
			__asm__ __volatile__("movntdq (%0), %%xmm0\n" "movntdq %%xmm0, (%1)\n"::"r"(from), "r"(to):"memory");

			from += 16;
			to += 16;
		}
		if(n & 7)
		{
			n = n&7;
			while(n--)
				*to++ = *from++;
		}
		return (to);
	}
	if(cpu.sse)
	{
		int i;
		for(i=0; i<n/16; i++)
		{
			__asm__ __volatile__("movntq (%0), %%xmm0\n" "movntq %%xmm0, (%1)\n"::"r"(from), "r"(to):"memory");

			from += 16;
			to += 16;
		}
		if(n & 7)
		{
			n = n&7;
			while(n--)
				*to++ = *from++;
		}
		return (to);
	}
	if(cpu.mmx)
	{
		int i;
		for(i=0; i<n/8; i++)
		{
			__asm__ __volatile__("movq (%0), %%mm0\n" "movq %%mm0, (%1)\n"::"r"(from), "r"(to):"memory");

			from += 8;
			to += 8;
		}
		if(n & 7)
		{
			n = n&7;
			while(n--)
				*to++ = *from++;
		}
		return (to);
	}

	while(n--)
		*to++ = *from++;

	return (to);
}
Thank you!
User avatar
Jeko
Member
Member
Posts: 500
Joined: Fri Mar 17, 2006 12:00 am
Location: Napoli, Italy

Post by Jeko »

Another small question:
is this:

Code: Select all

mov dword[C0000000h], RGBA
faster than:

Code: Select all

mov byte[C0000000h], A
mov byte[C0000001h], G
mov byte[C0000002h], B
mov byte[C0000003h], R
User avatar
Solar
Member
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany
Contact:

Re: String.h & memcpy

Post by Solar »

MarkOS wrote:Where can I find optimized functions of string.h?
GCC and other compilers have builtin versions of the really time-critical functions like memcpy / memset. You could also look into the source code of available open-source C libraries, of course.
This is my memcpy, how can I optimize it?
The best way is to build on the work of other people, e.g. using the GCC builtins.
Every good solution is obvious once you've found it.
User avatar
XCHG
Member
Member
Posts: 416
Joined: Sat Nov 25, 2006 3:55 am
Location: Wisconsin
Contact:

Post by XCHG »

Well you are checking for SS2, SSE and MMX supports which is good. However, what if the CPU doesn't support any of those technologies? (well might not happen at all but you can still use STOSD with a REP prefix).

If I were you and if I was going to code these with General Purpose Registers, I would perform these steps:

• Check for alignment of the buffer. Start putting characters in the aligned memory then start filling the unaligned memory.
• Check for buffer's length and see if it is divisible by 8, 4 or 2. Move QWORDs, DWORDs and WORDs at each iteration if any of those conditions are met. For example, if the buffer length is 16 bytes, you can move 2 QWORDs into it in order to fill it. If the buffer is, say, 13 bytes, you can move 1 QWORD (8-bytes), 1 (DWORD) and then 1 byte into it.
On the field with sword and shield amidst the din of dying of men's wails. War is waged and the battle will rage until only the righteous prevails.
User avatar
AndrewAPrice
Member
Member
Posts: 2309
Joined: Mon Jun 05, 2006 11:00 pm
Location: USA (and Australia)

Post by AndrewAPrice »

MarkOS wrote:Another small question:
is this:

Code: Select all

mov dword[C0000000h], RGBA
faster than:

Code: Select all

mov byte[C0000000h], A
mov byte[C0000001h], G
mov byte[C0000002h], B
mov byte[C0000003h], R
Assembler is a 1-on-1 representation of machine code (macro's don't count) so 1 instruction vs. 4 will be faster.

I don't see how passing RGBA is suppose to work? Are you using macros or HLA? What is that label pointing too?

If all the variables were stored one after the other in memory, then you could pass the label pointing to the first variable as a dword.
My OS is Perception.
pcmattman
Member
Member
Posts: 2566
Joined: Sun Jan 14, 2007 9:15 pm
Libera.chat IRC: miselin
Location: Sydney, Australia (I come from a land down under!)
Contact:

Re: String.h & memcpy

Post by pcmattman »

Solar wrote:GCC and other compilers have builtin versions of the really time-critical functions like memcpy / memset.
Question: do these work in OS development, though?
User avatar
Combuster
Member
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Post by Combuster »

MessiahAndrw wrote:Assembler is a 1-on-1 representation of machine code (macro's don't count) so 1 instruction vs. 4 will be faster.
Not necessarily.

What happens here is that the code will require 4 units of stores and 4 units of decodes compared to 1 store/ 1 decode. However, on modern processors, if the bottleneck is not within the store or decode units this may well even out: If you do a set of divides next to these instructions, the ALU will become the obvious bottleneck and it'll have time to spare to commit the moves. Due to some processor's black magic, the longer sequence might even be faster.

pentium 1's and earlier can't do the necessary scheduling and the longer sequence will indeed take more time.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
User avatar
~
Member
Member
Posts: 1228
Joined: Tue Mar 06, 2007 11:17 am
Libera.chat IRC: ArcheFire

Post by ~ »

In such case, maybe it would be better then, to use right shifts to avoid divisions (aren't bitwise calculations supposed to be faster?), in this way, by right shifting 1 bit a division by 2 can be emulated, right shifting 2 bits a division by 4 can be emulated and right shifting 3 bits a division by 8 can be emulated. That would be the times to use for a "REP STOSW" "REP STOSD" or "REP STOSQ" operation (should be faster) and, having previously saved the shifted bits, can be used for a small amount of "REP STOSB" operations. In this way, by avoiding "DIV" instructions, and transferring the highest chunk possible at once (WORD, DWORD or QWORD's), everything should be faster.

Also, aligning any routine and control transfers with "JMP" or conditional jump instructions to 4 or 8 bytes using align(4) and align(8).
Last edited by ~ on Fri Apr 06, 2007 4:34 am, edited 1 time in total.
User avatar
mystran
Member
Member
Posts: 670
Joined: Thu Mar 08, 2007 11:08 am

Post by mystran »

~ wrote:In such case, maybe it would be better then, to use right shifts to avoid divisions (aren't bitwise calculations supposed to be faster?), in this way, by right shifting 1 bit a division by 2 can be emulated, right shifting 3 bits a division by 4 can be emulated and right shifting 4 bits a division by 8 can be emulated.
Yes, for powers of two there's no point in doing integer division, when shift will be "one" clock. You got the math wrong though. They easy way to remember is that right shifting by n-bits will divide by 2^n. Similarly left-shifting by n-bits will multiply by 2^n.

So for division by 2, you shift 1 bit. For division by 4 you shift 2 bits. For division by 8 you shift 3 bits, and for division by 16 you shift 4 bits.
The real problem with goto is not with the control transfer, but with environments. Properly tail-recursive closures get both right.
User avatar
~
Member
Member
Posts: 1228
Joined: Tue Mar 06, 2007 11:17 am
Libera.chat IRC: ArcheFire

Post by ~ »

The problem is that, probably, trying to determine the size and "alignment" for the number of bytes of every arbitrary string could take longer than just performing the shifting and copying process dummily (for example, determine whether the string is 3 bytes long -- by now we assume the buffer to be 4 or 8-bytes aligned though).

Because, instead of just executing an "SHR" (>>) and saving the bits previously for A "REP STOSB" to copy the number of bytes not a power of 2 (supposedly), we would need an if() and maybe modulus % to watch for the best option, and in itself it would involve some sort of math operations or at least additional instructions and its corresponding clock cycles.
User avatar
proxy
Member
Member
Posts: 108
Joined: Wed Jan 19, 2005 12:00 am
Contact:

Post by proxy »

i hope you guys realize that any compiler worth using will convert divisions to shifts if possible, so it really isn't worth worrying about in C or C++. Really I find my memcpy to be quite fast:

you can specify at compile time the max size of bytes to transfer at a time, this is of course not taking into account MMX or SSE or anything like that, just plain C and letting the compiler do it's job well. I actually checked the resultant assembly and I would be hard pressed to do it much better :-P

(just so you know the resultant object code has not a single IDIV instruction in it, just shifts where i divided.)

Code: Select all


#include <string.h>
#include <assert.h>
#include <stdint.h>

/* quick and dirty macro that tests if a pointer is properly aligned to it's
 * native boundary, we need this because some arches don't like multibyte
 * accesses to cross a page boundary
 */
#define IS_ALIGNED(x) (((uintptr_t)(x) & (sizeof(*(x)) - 1)) == 0)

/* valid options here are, 1, 2, 4 and 8 */
#define MAX_MULTIBYTE 4

/*------------------------------------------------------------------------------
// Name: memcpy(void *dest, const void *src, size_t n
//----------------------------------------------------------------------------*/
void *memcpy(void *_RESTRICT dest, const void *_RESTRICT src, size_t n) {

#if 0
	/* traditional memory copy */
	assert(dest != 0);
	assert(src != 0);

	char *d_ptr			= dest;
	const char *s_ptr	= src;
	
	while(n--) {
		*d_ptr++ = *s_ptr++;
	}
#else

	/* this one is optimized for dword and word aligned copies */
	union {
		void		*ptr;
#if MAX_MULTIBYTE >= 8
		uint64_t	*ptr64;
#endif
#if MAX_MULTIBYTE >= 4
		uint32_t	*ptr32;
#endif
#if MAX_MULTIBYTE >= 2
		uint16_t	*ptr16;
#endif
		uint8_t		*ptr8;	
	} d_ptr;
		
	union {
		const void		*ptr;
#if MAX_MULTIBYTE >= 8
		const uint64_t	*ptr64;
#endif
#if MAX_MULTIBYTE >= 4
		const uint32_t	*ptr32;
#endif
#if MAX_MULTIBYTE >= 2
		const uint16_t	*ptr16;
#endif
		const uint8_t	*ptr8;	
	} s_ptr;
	
	assert(dest != 0);
	assert(src != 0);

	d_ptr.ptr = dest;
	s_ptr.ptr = src;
	
	switch(n & (MAX_MULTIBYTE - 1)) {
	case 0:
#if MAX_MULTIBYTE >= 2
#if MAX_MULTIBYTE >= 4
#if MAX_MULTIBYTE >= 8
		if(!IS_ALIGNED(d_ptr.ptr64) || !IS_ALIGNED(s_ptr.ptr64)) {
			goto unaligned;
		}
	
		/* multiple of 8 */
		n /= 8;
		while(n--) {
			*(d_ptr.ptr64)++ = *(s_ptr.ptr64)++;
		}
		break;
	case 4:
#endif
		if(!IS_ALIGNED(d_ptr.ptr32) || !IS_ALIGNED(s_ptr.ptr32)) {
			goto unaligned;
		}
	
		/* multiple of 4 */
		n /= 4;
		while(n--) {
			*(d_ptr.ptr32)++ = *(s_ptr.ptr32)++;
		}
		break;
	case 6:
	case 2:
#endif
		if(!IS_ALIGNED(d_ptr.ptr16) || !IS_ALIGNED(s_ptr.ptr16)) {
			goto unaligned;
		}
	
		/* multiple of 2 */	
		n /= 2;		
		while(n--) {
			*(d_ptr.ptr16)++ = *(s_ptr.ptr16)++;
		}
		break;
	unaligned:
	default:
#endif
		/* multiple of 1 */
		while(n--) {
			*(d_ptr.ptr8)++ = *(s_ptr.ptr8)++;
		}
	}
#endif
	return dest;
}

#undef IS_ALIGNED
#undef MAX_MULTIBYTE

User avatar
proxy
Member
Member
Posts: 108
Joined: Wed Jan 19, 2005 12:00 am
Contact:

Post by proxy »

even more fun is my memset, which uses multiplies with a 0x01010101 mask to get a dword of 4 copies of the byte, yet the compiler comes up with very efficient and creative ways to do this..no IMUL instruction:

Code: Select all


#include <string.h>
#include <assert.h>
#include <stdint.h>

/* quick and dirty macro that tests if a pointer is properly aligned to it's
 * native boundary, we need this because some arches don't like multibyte
 * accesses to cross a page boundary
 */
#define IS_ALIGNED(x) (((uintptr_t)(x) & (sizeof(*(x)) - 1)) == 0)

/* valid options here are, 1, 2, 4 and 8 */
#define MAX_MULTIBYTE 4

/*------------------------------------------------------------------------------
// Name: memset(void *s, int c, size_t n)
//----------------------------------------------------------------------------*/
void *memset(void *s, int c, size_t n) {

#if 0
	/* traditional memset */
	char *s_ptr = s;
	const char ch = (char)(c & 0xff);
	
	assert(s != 0);
	
	while(n--) {
		*s_ptr++ = ch;
	}

#else
	/* this one is optimized for dword and word aligned copies */
	union {
		void		*ptr;
#if MAX_MULTIBYTE >= 8
		uint64_t	*ptr64;
#endif
#if MAX_MULTIBYTE >= 4
		uint32_t	*ptr32;
#endif
#if MAX_MULTIBYTE >= 2
		uint16_t	*ptr16;
#endif
		uint8_t		*ptr8;	
	} d_ptr;
	
	const char ch = (char)(c & 0xff);
		
	assert(s != 0);

	d_ptr.ptr = s;
	
	switch(n & (MAX_MULTIBYTE - 1)) {
	case 0:
#if MAX_MULTIBYTE >= 2
#if MAX_MULTIBYTE >= 4
#if MAX_MULTIBYTE >= 8
		if(!IS_ALIGNED(d_ptr.ptr64)) {
			goto unaligned;
		}
	
		/* multiple of 8 */
		do {
			const uint64_t source = ch * UINT64_C(0x0101010101010101);
			n /= 8;
			while(n--) {
				*(d_ptr.ptr64)++ = source;
			}
		} while(0);
		break;
	case 4:
#endif
		if(!IS_ALIGNED(d_ptr.ptr32)) {
			goto unaligned;
		}
	
		/* multiple of 4 */
		do {
			const uint32_t source = ch * UINT32_C(0x01010101);
			n /= 4;
			while(n--) {
				*(d_ptr.ptr32)++ = source;
			}
		} while(0);
		break;
	case 6:
	case 2:
#endif
		if(!IS_ALIGNED(d_ptr.ptr16)) {
			goto unaligned;
		}
	
		/* multiple of 2 */	
		do {
			const uint16_t source = ch * UINT16_C(0x0101);
			n /= 2;
			while(n--) {
				*(d_ptr.ptr16)++ = source;
			}
		} while(0);
		break;
	unaligned:
	default:
#endif
		/* multiple of 1 */
		while(n--) {
			*(d_ptr.ptr8)++ = ch;
		}
	}
#endif
	return s;
}

Post Reply