Octocontrabass wrote:
vvaltchev wrote: but the compilation failed because I use both -Wvla and -Werror:
You can temporarily disable -Wvla:
Code:
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wvla"
asm (...);
#pragma GCC diagnostic pop
If you don't want to do that for some reason, you can remove the length: "=m"(*(char (*)[])dest)
I did that (with the #pragma etc.), out of curiosity, to see whether:
1. it would all work as it is supposed to
2. there would be any effect on the emitted instructions
The results:
1. It worked as expected, with the new "m" memory operands, without "volatile" and without the generic "memory" clobber.
2. The code size of the kernel increased by a few hundred bytes. In my experience, a code size increase generally means worse code, but if the increase is reasonably small, it might also mean better code (some optimizations generate more instructions and, in the end, the code runs faster). I spent a little time comparing all the object files with scripts and observed that the code size increase is a "net effect": some object files became smaller after the change. I even found specific functions that are bigger than before, but I didn't dig further because, due to inlining etc., the functions are pretty big and it would take me too much time to figure out what happened. I'll just say that it's pretty non-obvious: a lot of changes here and there.
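For reference, here is a minimal sketch of what the pragma-wrapped variant from point 1 can look like. It is not the exact kernel code: the name memcpy_sized, the portable fallback branch for non-x86 targets and the copy_matches() helper are assumptions added purely for illustration. The point it tries to show is that the sized "m" operands tell the compiler which bytes are read and written, so neither "volatile" nor the "memory" clobber is needed for the copy to be observed by later code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The operand types below are VLAs (char[n]), so -Wvla is silenced for this
 * one function only and restored right after it. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wvla"
static inline void *memcpy_sized(void *dest, const void *src, size_t n)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t unused, unused2;
    /* __asm__ is used instead of asm so this also builds with -std=c99/c11. */
    __asm__ ("rep movsl\n\t"         /* copy 4 bytes at a time, n/4 times */
             "mov %%ebx, %%ecx\n\t"  /* then: ecx = ebx = n % 4 */
             "rep movsb\n\t"         /* copy 1 byte at a time, n%4 times */
             : "=b" (unused), "=c" (n), "=S" (src), "=D" (unused2),
               "=m" (*(char (*)[n])dest)        /* n bytes at dest are written */
             : "b" (n & 3), "c" (n >> 2), "S" (src), "D" (dest),
               "m" (*(const char (*)[n])src));  /* n bytes at src are read */
#else
    /* Assumption: plain byte loop so the sketch still builds on non-x86. */
    unsigned char *d = dest;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
    return dest;
}
#pragma GCC diagnostic pop

/* Hypothetical self-check, not from the original post: copies n bytes
 * (0 < n <= 32768) and returns 1 if the copy is exact and the byte right
 * after the copied range was left untouched. */
static int copy_matches(size_t n)
{
    static unsigned char src[32768], dst[32768];

    assert(n > 0 && n <= sizeof src);
    for (size_t i = 0; i < sizeof src; i++) {
        src[i] = (unsigned char)(i * 31u + 7u);
        dst[i] = 0;
    }
    memcpy_sized(dst, src, n);
    if (n < sizeof dst && dst[n] != 0)
        return 0;
    return memcmp(dst, src, n) == 0;
}
```

Calling copy_matches(25639) exercises the same length used in the Compiler Explorer test further down; a size that is not a multiple of 4 is worth trying because it goes through both the "rep movsl" and the "rep movsb" parts.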
So, I gave it one last try with Compiler Explorer, looking for any differences between the two memcpy() versions.
Here's the code on compiler explorer:
https://godbolt.org/z/Kq71an
In any case, I'll copy & paste the code here, because it's more convenient and permanent:
Code:
#include <stddef.h>
#include <stdint.h>

inline void *memcpy1(void *dest, const void *src, size_t n)
{
    uint32_t unused, unused2;
    asm volatile ("rep movsl\n\t"         // copy 4 bytes at a time, n/4 times
                  "mov %%ebx, %%ecx\n\t"  // then: ecx = ebx = n % 4
                  "rep movsb\n\t"         // copy 1 byte at a time, n%4 times
                  : "=b" (unused), "=c" (n), "=S" (src), "=D" (unused2)
                  : "b" (n & 3), "c" (n >> 2), "S"(src), "D"(dest)
                  : "memory" );
    return dest;
}

inline void *memcpy2(void *dest, const void *src, size_t n)
{
    uint32_t unused, unused2;
    asm ("rep movsl\n\t"         // copy 4 bytes at a time, n/4 times
         "mov %%ebx, %%ecx\n\t"  // then: ecx = ebx = n % 4
         "rep movsb\n\t"         // copy 1 byte at a time, n%4 times
         : "=b" (unused), "=c" (n), "=S" (src), "=D" (unused2), "=m"(*(char (*)[n])dest)
         : "b" (n & 3), "c" (n >> 2), "S"(src), "D"(dest), "m"(*(const char (*)[n])src)
         );
    return dest;
}

void copy_with_asm_volatile(void *a, void *b) { memcpy1(a, b, 25639); }
void copy_with_asm(void *a, void *b) { memcpy2(a, b, 25639); }
And here's the emitted code (GCC 10.2, x86_64, Opt: -O3):
Code:
copy_with_asm_volatile:
    push rbx
    mov ecx, 6409
    mov ebx, 3
    rep movsl
    mov %ebx, %ecx
    rep movsb
    pop rbx
    ret
copy_with_asm:
    push rbx
    mov r8, rdi // <---- Additional code
    mov ebx, 3
    mov ecx, 6409
    rep movsl
    mov %ebx, %ecx
    rep movsb
    pop rbx
    ret
For some weird reason, copy_with_asm() uses an additional instruction (mov r8, rdi). That might explain why there's a small code size increase in the whole project, but it doesn't explain why, for some translation units, the code size is smaller.
And the story gets more interesting when we consider other compilers: with clang 11, there's no difference: both functions look like copy_with_asm_volatile. But with Intel's compiler 2021.1.2:
Code:
copy_with_asm_volatile:
    push rbx #31.1
    mov ebx, 3 #7.0
    mov ecx, 6409 #7.0
    rep movsl
    mov %ebx, %ecx
    rep movsb
    pop rbx #33.1
    ret #33.1
copy_with_asm:
    push rbx #36.1
    mov ebx, 3 #20.0
    mov QWORD PTR [-16+rsp], rsi #17.14 // <---- Additional code
    mov ecx, 6409 #20.0
    rep movsl
    mov %ebx, %ecx
    rep movsb
    mov QWORD PTR [-16+rsp], rsi #20.0 // <---- Additional code
    pop rbx #38.1
    ret #38.1
We get two extra instructions! I'm speechless.
What do you think, @Octocontrabass?