[Solved] Inline assembly memcpy

Tomaka17 · Post by **Tomaka17** » Tue May 19, 2009 1:14 pm

Hello,

I've got a simple but tricky problem with my 'memcpy', 'memmove' and 'memset' macros

Here is the code of my memset macro (the problem is the same with the two others):

#define	memset(d,v,n)		\
({ void* _d = (d); \
size_t _n = (n); \
asm("cld ; rep stosb" : : "a"(v), "D"(_d), "c"(_n) : "flags", "memory", "edi", "ecx"); \
_d; })

The '({' is a special GCC syntax which allows you to 'return a value' in a complex macro
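
For anyone who hasn't met it, this is GCC's "statement expression" extension: the value of the last expression inside '({ ... })' becomes the value of the whole construct. A minimal sketch, where MAX_ONCE, next_value and demo_max are made-up names for illustration only:

```c
#include <assert.h>

/* Hypothetical macro: evaluates each argument exactly once and
   yields the larger of the two (GCC/Clang extension). */
#define MAX_ONCE(a, b)      \
({ int _a = (a);            \
   int _b = (b);            \
   _a > _b ? _a : _b; })

static int calls = 0;

/* Hypothetical helper with a side effect, so double evaluation would show. */
static int next_value(void) { return ++calls; }

static int demo_max(void) {
    int m;
    calls = 0;
    m = MAX_ONCE(next_value(), 0);
    assert(calls == 1);   /* next_value() ran once, although _a is read twice */
    return m;
}
```

Copying the arguments into locals first is also why the memset macro above starts with '_d' and '_n': each macro argument is evaluated only once.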

But the problem comes from my inline assembly because each time I use this macro I get these errors :

error: can't find a register in class 'CREG' while reloading 'asm'
error: 'asm' operand has impossible constraints

If I remove "edi", "ecx" (after "memory") it compiles but only works without optimization,
i.e. if I use the -O2 or -O3 flag, GCC thinks that "edi" still has the value of "_d" at the end of the asm statement and uses it again, for example to access some data

What is confusing is that in this document they give an example macro which is exactly the same as mine (except that it is called rep_stosl)

Could it be a bug from GCC (I'm using the new 4.4 version) ?

Thanks,

whowhatwhere · Post by **whowhatwhere** » Tue May 19, 2009 1:34 pm

Tomaka17 wrote:Hello,

I've got a simple but tricky problem with my 'memcpy', 'memmove' and 'memset' macros

Here is the code of my memset macro (the problem is the same with the two others):
#define	memset(d,v,n)		\
({ void* _d = (d); \
size_t _n = (n); \
asm("cld ; rep stosb" : : "a"(v), "D"(_d), "c"(_n) : "flags", "memory", "edi", "ecx"); \
_d; })
The '({' is a special GCC syntax which allows you to 'return a value' in a complex macro

But the problem comes from my inline assembly because each time I use this macro I get these errors :
error: can't find a register in class 'CREG' while reloading 'asm'
error: 'asm' operand has impossible constraints
If I remove "edi", "ecx" (after "memory") it compiles but only works without optimization,
i.e. if I use the -O2 or -O3 flag, GCC thinks that "edi" still has the value of "_d" at the end of the asm statement and uses it again, for example to access some data

What is confusing is that in this document they give an example macro which is exactly the same as mine (except that it is called rep_stosl)

Could it be a bug from GCC (I'm using the new 4.4 version) ?

Thanks,

This is not the way I would do this. For one, inline assembly is a bad idea for things like memset, except on embedded systems. I'd make a loop and fill it "the C way" and let the compiler do optimization.
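
A minimal sketch of that "C way", assuming a plain byte loop is what is meant (loop_memset is a made-up name so it doesn't clash with the real memset):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name; fills n bytes at dst with the low byte of value. */
static void *loop_memset(void *dst, int value, size_t n) {
    unsigned char *p = (unsigned char *)dst;
    while (n > 0) {
        *p++ = (unsigned char)value;
        n--;
    }
    return dst;   /* same return convention as the standard memset */
}

/* Small usage check. */
static void loop_memset_demo(void) {
    char buf[4] = "abc";
    assert(loop_memset(buf, 'x', 2) == buf);
    assert(buf[0] == 'x' && buf[1] == 'x' && buf[2] == 'c');
}
```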

Tomaka17 · Post by **Tomaka17** » Tue May 19, 2009 2:07 pm

The problem is that the compiler does NOT do any optimization (it doesn't use 'movs' which is clearly the fastest way)

I use the memset and memcpy functions a lot in my code, so optimizing them should be a good idea

EDIT: let's say I want to code an inline 'memset' for fun, what would you answer ?

quok · Post by **quok** » Tue May 19, 2009 2:21 pm

Tomaka17 wrote:

#define   memset(d,v,n)      \
({ void* _d = (d); \
size_t _n = (n); \
asm("cld ; rep stosb" : : "a"(v), "D"(_d), "c"(_n) : "flags", "memory", "edi", "ecx"); \
_d; })

I'm going to second the suggestion that you write a simple memset() in C, and stay away from inline asm for this. If you really want it as asm, that's fine, but don't use inline asm for it. You'll be making it much harder on yourself than necessary, especially if you want a super optimized memset.

There are two big errors here. First, you cannot specify as clobbers registers that are used as inputs. If you take "edi" and "ecx" out from the clobber list, your error will go away. Second, you have no outputs, and you're not using volatile. Either of those will cause GCC to behave differently when using the optimizer. You should specify the asm as volatile, and specify _d as an output as well as an input (or just use the + modifier in front of it where it is).

You may also want to use "cc" instead of "flags" as they're basically the same thing, but "cc" is more standard than "flags". (That may have changed in GCC 4.4.0 though; I haven't used it yet and unfortunately inline asm semantics change quite a bit from GCC version to GCC version.)

This reminds me, I really need to finish up my gcc inline asm tutorial.

Tomaka17 · Post by **Tomaka17** » Tue May 19, 2009 2:49 pm

Thanks for your answers

quok wrote:First, you cannot specify as clobbers registers that are used as inputs. If you take "edi" and "ecx" out from the clobber list, your error will go away.

Yes it goes away but still doesn't work
How is GCC supposed to know that "rep stosb" modified edi and ecx?
If I were GCC I would store a pointer to a structure into "edi", call memset and then use edi again to retrieve some data from the structure

quok wrote:Second, you have no outputs, and you're not using volatile. Either of those will cause GCC to behave differently when using the optimizer. You should specify the asm as volatile, and specify _d as an output as well as an input (or just use the + modifier in front of it where it is).

Yeah I just noticed that
I was doing some tests with outputs as well, so I didn't notice that GCC was simply stripping the memset call

quok wrote:You may also want to use "cc" instead of "flags" as they're basically the same thing, but "cc" is more standard than "flags". (That may have changed in GCC 4.4.0 though; I haven't used it yet and unfortunately inline asm semantics change quite a bit from GCC version to GCC version.)

I was not sure about this because "cc" means "condition control" or something like that (like what is modified by "cmp" or "test")
But the "direction flag" is not really in this category

Tomaka17 · Post by **Tomaka17** » Tue May 19, 2009 3:01 pm

Anyway this is getting on my nerves

I'm just going to use a normal function, like I was doing before

quok · Post by **quok** » Tue May 19, 2009 3:08 pm

Tomaka17 wrote:
quok wrote:First, you cannot specify as clobbers registers that are used as inputs. If you take "edi" and "ecx" out from the clobber list, your error will go away.
Yes it goes away but still doesn't work
How is GCC supposed to know that "rep stosb" modified edi and ecx?
If I were GCC I would store a pointer to a structure into "edi", call memset and then use edi again to retrieve some data from the structure

GCC doesn't know anything at all about asm, that's the whole point of specifying operands and clobbers -- to tell GCC what changed. A register used as an input can't also be listed as a clobber; GCC assumes an input-only operand still holds its original value after the asm, so if the asm modifies it you have to make it an output as well (the + modifier). Clobbers are for the other things that you change specifically. If your asm template included "xor %%eax, %%eax" then you would have to specify eax as a clobber; gcc doesn't know anything at all about that asm statement. The constraints used (a, b, c, D, etc) tell GCC what register (technically register class) is being used.

For the example that you quoted, if GCC has something in edi already and then hits inline asm that says it will be putting something in edi, GCC arranges for that value to be kept elsewhere (another register or the stack) around the inline asm and gets it back afterward.
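
To make the clobber case concrete, a small sketch (x86 with GCC-style extended asm and the default AT&T syntax assumed; keep_across_asm is a made-up name). The template overwrites eax on its own, so eax and the flags are listed as clobbers and GCC keeps its live values out of eax across the statement:

```c
#include <assert.h>

/* Hypothetical function: 'keep' must survive an asm that trashes eax. */
static int keep_across_asm(int x) {
    int keep = x * 3;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* No operands: the asm picks eax itself, so GCC has to be told. */
    __asm__ volatile("xorl %%eax, %%eax" : : : "eax", "cc");
#endif
    return keep + 1;
}

/* Small usage check. */
static void keep_across_asm_demo(void) {
    assert(keep_across_asm(2) == 7);
}
```

On other architectures the asm is compiled out, so the function stays portable.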

quok wrote:Second, you have no outputs, and you're not using volatile. Either of those will cause GCC to behave differently when using the optimizer. You should specify the asm as volatile, and specify _d as an output as well as an input (or just use the + modifier in front of it where it is).
Yeah I just noticed that
I was doing some tests with outputs as well, so I didn't notice that GCC was simply stripping the memset call

If an asm has outputs and none of them are used, GCC may delete the asm entirely when optimizing, unless it is declared volatile (an asm with no outputs at all is treated as volatile). If you do not specify an asm as volatile, then GCC may also move the statement around (make it execute outside of a loop for instance). If you need the asm to be kept and to execute where it is, it MUST be declared volatile.

quok wrote:You may also want to use "cc" instead of "flags" as they're basically the same thing, but "cc" is more standard than "flags". (That may have changed in GCC 4.4.0 though; I haven't used it yet and unfortunately inline asm semantics change quite a bit from GCC version to GCC version.)
I was not sure about this because "cc" means "condition control" or something like that (like what is modified by "cmp" or "test")
But the "direction flag" is not really in this category

Instructions like cmp, test, add, etc all modify bits in EFLAGS like the carry flag, zero flag, sign flag, etc. CLD clears the direction flag. These bits (flags) are the same things as "condition codes". The conditional jump instructions are generally referred to as Jcc, because their behavior depends on the specific condition codes set in EFLAGS. CF, ZF, SF, and DF are all in EFLAGS, so it's rather safe to assume (and correct, as well), that DF fits in this category. Regardless though, if "flags" works then go ahead and use that.

quok · Post by **quok** » Tue May 19, 2009 3:39 pm

quok wrote:
Tomaka17 wrote:
#define   memset(d,v,n)      \
({ void* _d = (d); \
size_t _n = (n); \
asm("cld ; rep stosb" : : "a"(v), "D"(_d), "c"(_n) : "flags", "memory", "edi", "ecx"); \
_d; })
...
You should specify the asm as volatile, and specify _d as an output as well as an input (or just use the + modifier in front of it where it is).
...

Oops, I just noticed I made a mistake there. The + modifier needs to be used on an output variable, not an input.

asm volatile ("cld; rep stosb" : "+D"(_d) : "a"(v), "c"(_n) : "cc", "memory" );

That should do it for you. Completely untested, of course.

I believe the memory clobber isn't needed but I don't have time to test that right now, sorry.
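
A sketch that builds on the line above, with a few assumptions spelled out. Since rep stosb also counts ecx down to zero, the count is made an in/out operand as well, and both modified registers work on scratch copies so the macro can still return the original pointer. The "memory" clobber is kept because the asm writes to memory that no operand describes. The name asm_memset and the portable fallback are made up for this example and are not from the thread:

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Hypothetical name. _p and _n are scratch copies: rep stosb advances the
   pointer and decrements the count, so both are "+" (read-write) operands. */
#define asm_memset(d, v, n)                              \
({ void *_ret = (d);                                     \
   void *_p = _ret;                                      \
   size_t _n = (n);                                      \
   __asm__ volatile("cld; rep stosb"                     \
                    : "+D"(_p), "+c"(_n)                 \
                    : "a"((int)(v))                      \
                    : "cc", "memory");                   \
   _ret; })
#else
/* Portable fallback so the sketch still builds on other targets. */
static void *asm_memset(void *d, int v, size_t n) {
    unsigned char *p = (unsigned char *)d;
    while (n > 0) { *p++ = (unsigned char)v; n--; }
    return d;
}
#endif

/* Small usage check. */
static void asm_memset_demo(void) {
    unsigned char buf[4] = {1, 2, 3, 4};
    assert(asm_memset(buf, 0, 3) == (void *)buf);
    assert(buf[0] == 0 && buf[2] == 0 && buf[3] == 4);
}
```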

bewing · Post by **bewing** » Tue May 19, 2009 4:12 pm

@op: There are several more general points that go beyond the very important details that quok has provided.

You say that REP STOS is clearly the most efficient way to code a memset. The clarity is an illusion, unfortunately. If you are only speaking about size optimization, you are correct -- since it is only 2 bytes. However,
1) IIRC this instruction is microcoded -- so there is a significant "set up time" while the CPU encodes microinstructions into the instruction queue,
2) for any moderate sized "set" operation, this code is bound by the speed of the memory bus, which is the same no matter what opcodes you use.

The microcoding issue is the same for all the REP string instructions, except for MOVS (memcpy) -- intel promises in the manuals that they will not microcode that one.

I will agree with you that in a kernel, efficiency is important, and GCC does a much worse job of speed optimization than most people think. Probably the way around this is to write your kernel, and when it is done and perfect -- go back and hand optimize entire functions into ASM.

Using inline asm here and there is far too likely not to be portable even between successive versions of GCC, let alone portable in general.

Combuster · Post by **Combuster** » Tue May 19, 2009 4:24 pm

AFAIK rep stosx and rep movsx are special cases. Intel has a reason to optimize those since a) everybody uses them in that form, and b) they are used often. Completely microcoding would constitute a significant performance loss.

For the AMD case (for which I have stats) rep movsd is about 30% slower than a large unrolled SSE loop (using prefetches and nontemporal stores), while AMD's microcoded instructions usually cost around a factor of 5, IIRC.

Owen · Post by **Owen** » Tue May 19, 2009 4:42 pm

GCC already has optimized versions of strcpy, memcpy, memset and co. You just need to enable them. These are specific optimizations (i.e. they've been specially programmed into the compiler), but they're off by default when you build something freestanding. Generally GCC will replace the calls with optimized code of its own, which can be blindingly fast when copying a fixed size.
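
A minimal sketch of using one of them explicitly. As far as the GCC manual goes, the '__builtin_' spellings stay available even when the plain names are not treated as builtins (-fno-builtin, which -ffreestanding implies). The struct and function names here are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example type, not from the thread. */
struct point { int x, y; };

/* The size is a compile-time constant, so GCC is free to expand the copy
   inline instead of emitting a call. Whether it does is up to the compiler. */
static void copy_point(struct point *dst, const struct point *src) {
    __builtin_memcpy(dst, src, sizeof *dst);
}

/* Small usage check. */
static void copy_point_demo(void) {
    struct point a = {3, 4}, b = {0, 0};
    copy_point(&b, &a);
    assert(b.x == 3 && b.y == 4);
}
```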

NickJohnson · Post by **NickJohnson** » Tue May 19, 2009 6:46 pm

So, how do you actually access these optimized builtin functions? I know they are used when you call memset and friends in application programming, and they are somehow in the stdlib header. Can you just call them with a function prototype and the linker will take care of the rest?

And what about code portability? I've been trying to make my code work under both GCC and TCC with just drop in replacement. Do you have to check for GCC and link your own copies otherwise?

Owen · Post by **Owen** » Wed May 20, 2009 2:11 am

You need to link a memcpy anyway, for cases where the compiler is unable to inline (passing a pointer to memcpy being a prime example).

Sorry, I don't know how you enable them.
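
A sketch of the pointer case mentioned above (copy_fn and copy_with are made-up names; the hosted C library's memcpy is used only to keep the example runnable). When memcpy is passed around as an address there is no call site to expand inline, so a real function has to exist at link time:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical typedef matching memcpy's signature. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Calls through the pointer; returns 1 if the copy produced "abc". */
static int copy_with(copy_fn fn) {
    char src[4] = "abc";
    char dst[4] = {0};
    fn(dst, src, sizeof src);
    return strcmp(dst, "abc") == 0;
}

/* Small usage check. */
static void copy_with_demo(void) {
    assert(copy_with(memcpy));
}
```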

Tomaka17 · Post by **Tomaka17** » Wed May 20, 2009 4:04 am

Owen wrote:GCC already has optimized versions of strcpy, memcpy, memset and co. You just need to enable them. These are specific optimizations (i.e. they've been specially programmed into the compiler), but they're off by default when you build something freestanding. Generally GCC will replace the calls with optimized code of its own, which can be blindingly fast when copying a fixed size.

Thanks, I didn't know this
There are in fact a lot of them: http://gcc.gnu.org/onlinedocs/gcc/Other ... r-Builtins

Now I've got a stranger problem

Here is my new code for memcpy

#define test 1

void* memcpy(void* destination, const void* source, size_t num) {
#if test == 1
	__builtin_memcpy(destination, source, num);
#elif test == 2
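	/* rep movsb changes esi, edi and ecx: use in/out operands on scratch copies */
	const void* s = source; void* d = destination; size_t n = num;
	asm volatile("cld ; rep movsb" : "+S"(s), "+D"(d), "+c"(n) : : "flags", "memory");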
#else
	const unsigned char* vsource = (const unsigned char*)source;
	unsigned char* vdestination = (unsigned char*)destination;
	while (num > 0) {
		*vdestination = *vsource;
		vsource++; vdestination++; num--;
	}
#endif
	return destination;
}

With test == 1 it crashes
But with test == 2 or anything else, it's working O_O

Could __builtin_memcpy be bugged?
When disassembling I see that __builtin_memcpy is in fact a function call, maybe the __builtin are in fact in the C library?

EDIT : I didn't see this on GCC's page

Many of these functions are only optimized in certain cases; if they are not optimized in a particular case, a call to the library function will be emitted.

So if I use __builtin_memcpy in my memcpy implementation it would lead to an infinite loop

Tomaka17 · Post by **Tomaka17** » Wed May 20, 2009 4:16 am

Ok I understood the way the whole thing was designed to work

Here is (part of) string.h :

void*	memcpy	(void* s1, const void* s2, size_t n);
void*	memmove	(void* s1, const void* s2, size_t n);
void*	memset	(void* s, int c, size_t n);

#ifdef __GNUC__
#define	memcpy(d,s,n)				__builtin_memcpy(d,s,n)
#define	memset(d,v,n)				__builtin_memset(d,v,n)
#endif

And string.c :

#undef memcpy
void* memcpy(void* destination, const void* source, size_t num) {
#if defined(__GNUC__) && defined(_TARGET_X86_)
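	/* rep movsb changes esi, edi and ecx: use in/out operands on scratch copies */
	const void* s = source; void* d = destination; size_t n = num;
	asm volatile("cld ; rep movsb" : "+S"(s), "+D"(d), "+c"(n) : : "flags", "memory");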
#else
	const unsigned char* vsource = (const unsigned char*)source;
	unsigned char* vdestination = (unsigned char*)destination;
	while (num > 0) {
		*vdestination = *vsource;
		vsource++; vdestination++; num--;
	}
#endif
	return destination;
}

Consequently when I use memcpy in my O/S it is replaced by __builtin_memcpy and :
* either the compiler uses its internal optimized version
* or it calls the real memcpy which is in my string.c

Thank you again
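
For reference, the same header/implementation split squeezed into one runnable file. This is only a sketch: the k_ prefix and copy_int are made up so the example doesn't collide with the hosted C library. One difference from the real kernel follows from that renaming: a builtin that is not expanded inline would call the C library's memcpy here, not k_memcpy.

```c
#include <assert.h>
#include <stddef.h>

/* "string.h" side: prototype plus the GCC-only macro. */
void *k_memcpy(void *dst, const void *src, size_t n);

#ifdef __GNUC__
#define k_memcpy(d, s, n) __builtin_memcpy((d), (s), (n))
#endif

/* Ordinary caller: sees the macro, so GCC may expand the copy inline. */
static int copy_int(int v) {
    int out = 0;
    k_memcpy(&out, &v, sizeof out);
    return out;
}

/* "string.c" side: drop the macro, then define the real function. */
#undef k_memcpy
void *k_memcpy(void *dst, const void *src, size_t n) {
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;
    while (n > 0) { *d++ = *s++; n--; }
    return dst;
}

/* Small usage check covering both paths. */
static void k_memcpy_demo(void) {
    char a[3] = "hi", b[3] = {0};
    assert(copy_int(42) == 42);
    assert(k_memcpy(b, a, sizeof a) == (void *)b);
    assert(b[0] == 'h' && b[1] == 'i' && b[2] == '\0');
}
```

Because the macro is function-like, a bare 'k_memcpy' without parentheses (for example when taking its address) still refers to the real function.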

OSDev.org
