Framebuffer: Draw a PSF character

Korona · Post by **Korona** » Fri Mar 19, 2021 2:29 am

bzt wrote:The problem with memcpy is, that it does not return the value (it requires a memory address which can't always be optimized).

That's true in the general case, but for local vars, it can reliably be optimized. You are right that this is a bit awkward when you need to store values globally.

One could create an inline helper function to be UB safe:

Code: Select all

uint32_t load_u32_unaligned(uint32_t *p) {
    uint32_t x;
    memcpy(&x, p, sizeof(uint32_t));
    return x;
}

EDIT: you can also write that helper using inline asm to avoid the memcpy.

vvaltchev · Post by **vvaltchev** » Fri Mar 19, 2021 8:57 am

Octocontrabass wrote:
vvaltchev wrote:but the compilation failed because I use both -Wvla and -Werror:
You can temporarily disable -Wvla:
Code: Select all
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wvla"
asm (...);
#pragma GCC diagnostic pop
If you don't want to do that for some reason, you can remove the length: "=m"(*(char (*)[])dest)

I did that (with the #pragma etc.), of out curiosity to see if:
1. it all will work as supposed to
2. there will be any effect on the emitted instructions

The results:
1. It worked as expected, with the new clobbers, without "volatile" and without the generic "memory" clobber.
2. The code size of the kernel increased by a few hundreds bytes. In my experience, generally, a code size increase means worse code, but if the code size increase is reasonably small, it might also mean better code (some optimizations generate more instructions and, at the end, the code runs faster). I spent a little time comparing with scripts all the object files and I've observed that the code size increase is a "net effect": some object files became smaller after the change. I even found specific functions that are bigger than before, but I didn't spend much time on that because due to inlining etc. functions are pretty big and it would take me too much time to figure out what happened. I'll just say that's pretty non-obvious: a lot of changes here and there.

So, I gave a last try with compiler explorer, trying to find if there are any differences between the two memcpy() versions.
Here's the code on compiler explorer: https://godbolt.org/z/Kq71an

In any case, I'll copy & paste the code here, because it's more convenient and permanent:

Code: Select all

#include <stddef.h>
#include <stdint.h>

inline void *memcpy1(void *dest, const void *src, size_t n)
{
   uint32_t unused, unused2;
   asm volatile ("rep movsl\n\t"         // copy 4 bytes at a time, n/4 times
                 "mov %%ebx, %%ecx\n\t"  // then: ecx = ebx = n % 4
                 "rep movsb\n\t"         // copy 1 byte at a time, n%4 times
                 : "=b" (unused), "=c" (n), "=S" (src), "=D" (unused2)
                 : "b" (n & 3), "c" (n >> 2), "S"(src), "D"(dest)
                 : "memory" );
   return dest;
}

inline void *memcpy2(void *dest, const void *src, size_t n)
{
   uint32_t unused, unused2;
   asm ("rep movsl\n\t"         // copy 4 bytes at a time, n/4 times
        "mov %%ebx, %%ecx\n\t"  // then: ecx = ebx = n % 4
        "rep movsb\n\t"         // copy 1 byte at a time, n%4 times
        : "=b" (unused), "=c" (n), "=S" (src), "=D" (unused2), "=m"(*(char (*)[n])dest)
        : "b" (n & 3), "c" (n >> 2), "S"(src), "D"(dest), "m"(*(const char (*)[n])src)
        );
   return dest;
}
void copy_with_asm_volatile(void *a, void *b) { memcpy1(a, b, 25639); }
void copy_with_asm(void *a, void *b) { memcpy2(a, b, 25639); }

And here's the emitted code (GCC 10.2, x86_64, Opt: -O3):

Code: Select all

copy_with_asm_volatile:
        push    rbx
        mov     ecx, 6409
        mov     ebx, 3
        rep movsl
        mov %ebx, %ecx
        rep movsb

        pop     rbx
        ret
copy_with_asm:
        push    rbx
        mov     r8, rdi                      // <---- Additional code
        mov     ebx, 3
        mov     ecx, 6409
        rep movsl
        mov %ebx, %ecx
        rep movsb

        pop     rbx
        ret

For some weird reason, copy_with_asm() used an additional instruction. That might explain while there's a small code size increase in the whole project, but it doesn't explain the case where, for some translation units, the code size is smaller.

And the story gets more interesting when we consider other compilers: with clang 11, there's no difference: both the functions are like copy_with_asm_volatile. But, with Intel's compiler 2021.1.2:

Code: Select all

copy_with_asm_volatile:
        push      rbx                                           #31.1
        mov       ebx, 3                                        #7.0
        mov       ecx, 6409                                     #7.0
        rep movsl
        mov %ebx, %ecx
        rep movsb

        pop       rbx                                           #33.1
        ret                                                     #33.1
copy_with_asm:
        push      rbx                                           #36.1
        mov       ebx, 3                                        #20.0
        mov       QWORD PTR [-16+rsp], rsi                      #17.14  // <---- Additional code
        mov       ecx, 6409                                     #20.0
        rep movsl
        mov %ebx, %ecx
        rep movsb

        mov       QWORD PTR [-16+rsp], rsi                      #20.0  // <---- Additional code
        pop       rbx                                           #38.1
        ret                                                     #38.1

We get two extra instructions! I'm speechless

What do you think, @Octocontrabass?

Octocontrabass · Post by **Octocontrabass** » Fri Mar 19, 2021 4:14 pm

GCC adds an extra instruction because it doesn't see the possibility of using RDI to satisfy the "=m"(*(char (*)[n])dest) output constraint. You can modify the constraints to convince it to do so.

Code: Select all

inline void * memcpy3( void * dest, const void * src, size_t n )
{
    void * temp = dest;
    asm( "rep movsl\n\t"
         "mov %k5, %%ecx\n\t"
         "rep movsb\n\t"
         : "+D"(dest), "+S"(src), "=c"(n), "=m"(*(char (*)[n])dest)
         : "c"(n >> 2), "r"(n & 3), "m"(*(const char (*)[n])src)
         );
    return temp;
}

I'm not sure what's up with ICC. It looks like it's miscompiling "m"(*(const char (*)[n])src) as a pointer in memory instead of an object in memory, but that doesn't explain why it feels the need to spill RSI after the inline assembly.

bzt · Post by **bzt** » Fri Mar 19, 2021 5:26 pm

Korona wrote:One could create an inline helper function to be UB safe:

Not good, because it's implementation specific if "load_u32_unaligned" gets inlined or not. If it's not inlined, then your performance is off. Pointer cast is always inlined.

Korona wrote:EDIT: you can also write that helper using inline asm to avoid the memcpy.

Not good either, inline asm isn't portable. Pointer cast uses only C syntax, so it is portable as long as all the target architectures support unaligned access.

Octocontrabass wrote:If the pointer cast is undefined behavior, either due to misalignment or aliasing, you're telling the compiler that the code is unreachable.

You're totally wrong about this. First, UB is not the same as unreachable code, not by a long shot. Second, I've already told you, if all your target architectures support unaligned access, then pointer cast ISN'T an undefined behaviour.

Octocontrabass wrote:You've told it not to optimize. Everything is fine when you turn on optimizations.

Then show me, how would you turn that optimization on for TCC or SmallerC for example.

Octocontrabass wrote:This has nothing to do with memcpy. All local variables are spilled to the stack when optimizations are disabled. You'll see the same thing with any other local variable.

Oh, but it absolutely does. Don't you really understand, that the only reason for the local variable is because memcpy needs an input memory address? On the other hand pointer cast doesn't need anything.

Octocontrabass wrote:If your library relies on undefined behavior, it's not portable.

If your library relies on unspecified and implementation specific compiler features, then it's not portable, and this isn't an UB. (BTW I have a project compiled for multiple architectures, both 32 and 64 bit, x86 and ARM, using different MSVC versions (2015 and up), Clang and gcc in all combinations. All works fine, not even a warning, so it doesn't matter what you say when the empirical facts prove you wrong.)

Octocontrabass wrote:I'm not sure what's up with ICC.

Like I told you: it is a really bad idea to rely on unspecified and implementation specific compiler features.

Cheers,
bzt

Korona · Post by **Korona** » Fri Mar 19, 2021 5:51 pm

TCC does not optimize based on alignment, so for TCC, you're save with the pointer cast.

bzt · Post by **bzt** » Fri Mar 19, 2021 6:47 pm

Korona wrote:TCC does not optimize based on alignment, so for TCC, you're save with the pointer cast.

I know, but does it eliminate local variables if they were only to support memcpy? Or it doesn't replace memcpy with MOVs in the first place?

Cheers,
bzt

Octocontrabass · Post by **Octocontrabass** » Sat Mar 20, 2021 1:12 am

bzt wrote:You're totally wrong about this. First, UB is not the same as unreachable code, not by a long shot.

Perhaps that's an oversimplification, but it's more similar than you think. The compiler assumes code will only be reachable when its behavior is defined, which means any code path that would rely on undefined behavior is considered unreachable. You're relying on the compiler not recognizing undefined behavior, but compilers keep getting smarter.

bzt wrote:Second, I've already told you, if all your target architectures support unaligned access, then pointer cast ISN'T an undefined behaviour.

Where does it say that in the C standard? All I can find is this:

ANSI C89 wrote:Among the invalid values for dereferencing a pointer by the unary * operator are a null pointer, an address inappropriately aligned for the type of object pointed to, or the address of an object that has automatic storage duration when execution of the block in which the object is declared and of all enclosed blocks has terminated.

If an invalid value has been assigned to the pointer, the behavior of the unary * operator is undefined.

bzt wrote:Then show me, how would you turn that optimization on for TCC or SmallerC for example.

Use TCC to build GCC 4.7.4, then use GCC 4.7.4 to compile the code with optimizations.

bzt wrote:(BTW I have a project compiled for multiple architectures, both 32 and 64 bit, x86 and ARM, using different MSVC versions (2015 and up), Clang and gcc in all combinations. All works fine, not even a warning, so it doesn't matter what you say when the empirical facts prove you wrong.)

"It works fine now" does not mean "it will work fine with a new compiler", and compilers can't warn you every time they find undefined behavior.

Korona · Post by **Korona** » Sat Mar 20, 2021 5:58 am

bzt wrote:
Korona wrote:TCC does not optimize based on alignment, so for TCC, you're save with the pointer cast.
I know, but does it eliminate local variables if they were only to support memcpy? Or it doesn't replace memcpy with MOVs in the first place?

No, I don't think so. So if you want to please both TCC and GCC, you probably have to use `#ifdef` (which is tedious). It's sad that there is no "this-pointer-is-unaligned-i-know-what-im-doing" attribute for GCC.

vvaltchev · Post by **vvaltchev** » Sat Mar 20, 2021 7:18 am

Octocontrabass wrote:GCC adds an extra instruction because it doesn't see the possibility of using RDI to satisfy the "=m"(*(char (*)[n])dest) output constraint. You can modify the constraints to convince it to do so.

Writing inline assembly feels like playing chess with the compiler

Octocontrabass wrote:

Code: Select all

inline void * memcpy3( void * dest, const void * src, size_t n )
{
    void * temp = dest;
    asm( "rep movsl\n\t"
         "mov %k5, %%ecx\n\t"
         "rep movsb\n\t"
         : "+D"(dest), "+S"(src), "=c"(n), "=m"(*(char (*)[n])dest)
         : "c"(n >> 2), "r"(n & 3), "m"(*(const char (*)[n])src)
         );
    return temp;
}

That code rocks, man! Just re-implementing memcpy() like that saved ~1k in code size in my project. Now, I'm using your techniques in other inline assembly functions (e.g. bzero()) to further improve my project. And it's probably faster too. In my understanding, the thing that has most impact is the use of a general-purpose register ("r") chosen by the compiler vs. forcing the use of the EBX register. That, with the simply "asm" statement, gave much more freedom to the compiler to optimize. Thanks for that!

Octocontrabass wrote:I'm not sure what's up with ICC. It looks like it's miscompiling "m"(*(const char (*)[n])src) as a pointer in memory instead of an object in memory, but that doesn't explain why it feels the need to spill RSI after the inline assembly.

Yeah.. inline asm is a tricky territory and compilers don't behave the same way. The sad thing is that it looks to me like there are no absolute rock-solid rules for inline assembly that guarantees us what will happen: there are a ton of "guidelines", but it seems like we have to "play" with the compiler to get it right. For example, now my memcpy() and bzero() are much better, memset() is simpler (but generates the same code) while I'm unable to "win the game" with the compiler about memmove(). Check these examples on compiler explorer: no matter what I do, the asm volatile implementation seems unbeatable, at least for me. Maybe you'll come with some brilliant fancy modifier/constraint, but I simply have no idea how to do any better. It seems that having both "dest" which becames "dest+n-1" and a clobber which requires the old value of "dest" might be the problem (and the same applies for "src"), but I can be completely off-road, of course.

Octocontrabass · Post by **Octocontrabass** » Sat Mar 20, 2021 10:56 am

Korona wrote:It's sad that there is no "this-pointer-is-unaligned-i-know-what-im-doing" attribute for GCC.

There is, though.

Code: Select all

typedef int this_int_is_unaligned_i_know_what_im_doing __attribute__ ((aligned (1)));

printf("%d\n", *((this_int_is_unaligned_i_know_what_im_doing *)address));

Clang seems to support it as well.

vvaltchev wrote:It seems that having both "dest" which becames "dest+n-1" and a clobber which requires the old value of "dest" might be the problem (and the same applies for "src"), but I can be completely off-road, of course.

Unfortunately, you are correct: as far as the compiler knows, you need some way to access the memory operands, so it has to allocate them separately. The result is better when the function is not inlined since the caller considers those additional registers clobbered whether they've been used or not. But, if you're really trying to optimize memmove(), you should instead focus on splitting the move into smaller blocks that don't overlap, since REP MOVSB is extremely slow when the direction flag is set.

There's no need for a "cc" clobber since you're not modifying any flags.

Korona · Post by **Korona** » Sat Mar 20, 2021 10:59 am

Oh, that's good to know. I thought that aligned() could only increase alignment but I am mistaken (i.e., for struct members, it can also decrease the alignment).

vvaltchev · Post by **vvaltchev** » Sat Mar 20, 2021 11:44 am

Octocontrabass wrote:Unfortunately, you are correct: as far as the compiler knows, you need some way to access the memory operands, so it has to allocate them separately. The result is better when the function is not inlined since the caller considers those additional registers clobbered whether they've been used or not.

I understand.

Octocontrabass wrote:But, if you're really trying to optimize memmove(), you should instead focus on splitting the move into smaller blocks that don't overlap, since REP MOVSB is extremely slow when the direction flag is set.

Ouch. I was convinced that "Enhanced REP MOVSB" worked in either direction

Fortunately, I have a very few memmove() calls in my whole project and memmove() uses memcpy() when it's possible by checking "dest <= src || ((ulong)src + n <= (ulong)dest)". But, yeah, I agree that memmove() could optimized better by splitting the data in chunks.

Octocontrabass wrote:There's no need for a "cc" clobber since you're not modifying any flags.

The fact that I'm changing twice the direction flag, doesn't matter? I mean, I agree that the "net effect" is null on EFLAGS, but do we have any guarantees that the compiler won't interleave instructions in the inline asm when there's no "volatile"?

Octocontrabass · Post by **Octocontrabass** » Sat Mar 20, 2021 12:11 pm

vvaltchev wrote:The fact that I'm changing twice the direction flag, doesn't matter? I mean, I agree that the "net effect" is null on EFLAGS, but do we have any guarantees that the compiler won't interleave instructions in the inline asm when there's no "volatile"?

"The compiler replaces tokens in the template that refer to inputs, outputs, and goto labels, and then outputs the resulting string to the assembler." No mention of splitting it apart into individual instructions.

The "cc" clobber doesn't apply to the direction flag anyway. If it did, you'd see the compiler insert CLD every time you use it.

vvaltchev · Post by **vvaltchev** » Sat Mar 20, 2021 1:39 pm

Octocontrabass wrote:The compiler replaces tokens in the template that refer to inputs, outputs, and goto labels, and then outputs the resulting string to the assembler."[/url] No mention of splitting it apart into individual instructions.

The "cc" clobber doesn't apply to the direction flag anyway. If it did, you'd see the compiler insert CLD every time you use it.

Ah OK, I get it. Sorry for asking so many questions, but the whole topic isn't trivial at all: there are traps and pitfalls everywhere

OSDev.org

Framebuffer: Draw a PSF character

Re: Framebuffer: Draw a PSF character

Re: Framebuffer: Draw a PSF character

Re: Framebuffer: Draw a PSF character

Re: Framebuffer: Draw a PSF character

Re: Framebuffer: Draw a PSF character

Re: Framebuffer: Draw a PSF character

Re: Framebuffer: Draw a PSF character

Re: Framebuffer: Draw a PSF character

Re: Framebuffer: Draw a PSF character

Re: Framebuffer: Draw a PSF character

Re: Framebuffer: Draw a PSF character

Re: Framebuffer: Draw a PSF character

Re: Framebuffer: Draw a PSF character

Re: Framebuffer: Draw a PSF character