Framebuffer: Draw a PSF character

Question about which tools to use, bugs, the best way to implement a function, etc should go here. Don't forget to see if your question is answered in the wiki first! When in doubt post here.
Octocontrabass
Member
Member
Posts: 5568
Joined: Mon Mar 25, 2013 7:01 pm

Re: Framebuffer: Draw a PSF character

Post by Octocontrabass »

vvaltchev wrote:Also, it feels unfair to me seeing C compilers "throw away" an ISA feature like unaligned access, available almost for free on modern CPUs. [...] It's like I have no other choice other can using memcpy or memmove or writing assembly, when C code could have been just fine.
GCC and Clang are both perfectly capable of using unaligned accesses when the target architecture supports it. In fact, using memcpy() tends to generate the best code. Even with most optimizations disabled, they still recognize the opportunity to use an unaligned MOV in place of a memcpy() call.

If you're compiling with options that disable the memcpy() built-in, you can use __builtin_memcpy() instead.
Korona
Member
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm
Contact:

Re: Framebuffer: Draw a PSF character

Post by Korona »

Exactly. GCC is 100% reliably translates 4-byte and 8-byte memcpy calls into loads/stores: memcpy is replaced by __builtin_memcpy and __builtin_memcpy is translated to stores/loads in GCC's backend.

GCC can use unaligned loads/stores without introducing UB, the syntax is just different (memcpy vs. raw pointer access).
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
vvaltchev
Member
Member
Posts: 274
Joined: Fri May 11, 2018 6:51 am

Re: Framebuffer: Draw a PSF character

Post by vvaltchev »

I agree, guys: __builtin_memcpy() is the solution because it completely solves the problem with new code.

There's one more problem though: the legacy code. Imagine the case when you're building a big C project from source, maybe not actively maintained or you need to build an older version of the code: unaligned access might exist lurking somewhere. -Wcast-align might help, but not in all the cases. So, I wrote a comment on GCC's bug (closed with wontfix) that you mentioned: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93031#c5, mentioning the unaligned accesses in the Linux kernel and the general problem with legacy code. I hope they will answer my questions. In particular, I asked if the following code:

Code: Select all

*(unsigned long *)(void *)(dst+res) = c;
which doesn't trigger -Wcast-align (because of the void* cast) will make any difference in reducing GCC's assumptions or it will just suppress the warning and nothing more.
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck
sj95126
Member
Member
Posts: 151
Joined: Tue Aug 11, 2020 12:14 pm

Re: Framebuffer: Draw a PSF character

Post by sj95126 »

Octocontrabass wrote:If you're compiling with options that disable the memcpy() built-in, you can use __builtin_memcpy() instead.
The only thing I might suggest is instead of using

Code: Select all

#define memcpy(x,y,z) __builtin_memcpy((x),(y),(z))
use a wrapper function instead. GCC will inline the instructions to perform the memcpy, which can get in the way when you're looking at a disassembly listing.
Korona
Member
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm
Contact:

Re: Framebuffer: Draw a PSF character

Post by Korona »

It is a common misconception that __builtin_memcpy makes GCC generate a memcpy function of its own. That is not what it does. __builtin_memcpy just recognizes small sizes and translates them to usual load/stores. For big copies, it just calls memcpy.

EDIT: so the correct pattern is something like this:

Code: Select all

// In a header:
#define memcpy __builtin_memcpy

// In a source file:
#undef memcpy

void *memcpy(void *dest, const void *src, size_t size) {
    // Your implementation here.
}
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
sj95126
Member
Member
Posts: 151
Joined: Tue Aug 11, 2020 12:14 pm

Re: Framebuffer: Draw a PSF character

Post by sj95126 »

Korona wrote:It is a common misconception that __builtin_memcpy makes GCC generate a memcpy function of its own. That is not what it does. __builtin_memcpy just recognizes small sizes and translates them to usual load/stores. For big copies, it just calls memcpy.
For what definition of small? Because this looks inline to me.

Code: Select all

        char *a, *b;

        __builtin_memcpy(a, b, 1000);
    1155:       48 8b 55 f0             mov    -0x10(%rbp),%rdx
    1159:       48 8b 45 f8             mov    -0x8(%rbp),%rax
    115d:       b9 e8 03 00 00          mov    $0x3e8,%ecx
    1162:       48 8b 30                mov    (%rax),%rsi
    1165:       48 89 32                mov    %rsi,(%rdx)
    1168:       89 ce                   mov    %ecx,%esi
    116a:       48 01 d6                add    %rdx,%rsi
    116d:       48 8d 7e 08             lea    0x8(%rsi),%rdi
    1171:       89 ce                   mov    %ecx,%esi
    1173:       48 01 c6                add    %rax,%rsi
    1176:       48 83 c6 08             add    $0x8,%rsi
    117a:       48 8b 76 f0             mov    -0x10(%rsi),%rsi
    117e:       48 89 77 f0             mov    %rsi,-0x10(%rdi)
    1182:       48 8d 7a 08             lea    0x8(%rdx),%rdi
    1186:       48 83 e7 f8             and    $0xfffffffffffffff8,%rdi
    118a:       48 29 fa                sub    %rdi,%rdx
    118d:       48 29 d0                sub    %rdx,%rax
    1190:       01 d1                   add    %edx,%ecx
    1192:       83 e1 f8                and    $0xfffffff8,%ecx
    1195:       c1 e9 03                shr    $0x3,%ecx
    1198:       89 ca                   mov    %ecx,%edx
    119a:       89 d2                   mov    %edx,%edx
    119c:       48 89 c6                mov    %rax,%rsi
    119f:       48 89 d1                mov    %rdx,%rcx
    11a2:       f3 48 a5                rep movsq %ds:(%rsi),%es:(%rdi)
Which is why I prefer a wrapper. The caller's disassembly will look the same, without the inline code getting in the way.

Sure, each inline __builtin_memcpy can be tailored to the specific needs of each copy, but at this stage of my kernel development, performance is not my biggest concern. I can always replace the wrapper with a #define later.
Korona
Member
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm
Contact:

Re: Framebuffer: Draw a PSF character

Post by Korona »

Small means known small size. Otherwise, __builtin_memcpy lowers to memcpy: https://godbolt.org/z/8vj54a.
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
vvaltchev
Member
Member
Posts: 274
Joined: Fri May 11, 2018 6:51 am

Re: Framebuffer: Draw a PSF character

Post by vvaltchev »

Korona wrote:It is a common misconception that __builtin_memcpy makes GCC generate a memcpy function of its own. That is not what it does. __builtin_memcpy just recognizes small sizes and translates them to usual load/stores. For big copies, it just calls memcpy.

EDIT: so the correct pattern is something like this:

Code: Select all

// In a header:
#define memcpy __builtin_memcpy

// In a source file:
#undef memcpy

void *memcpy(void *dest, const void *src, size_t size) {
    // Your implementation here.
}
I liked your idea, so I tried it in Tilck. It worked perfectly, and it saved me ~3.5k in code size because it stopped inlining my inline-assembly memcpy() implementation and turned many call-sites in actual function calls. It's great to save code-size, but I was worried about the overhead of all those new calls. So, I run my "fork_perf" benchmark and, unfortunately, I observed a ~4% regression. I know, it might seem a lot, but I got that result consistently, over many runs.

So, while __builtin_memcpy() is better for small and compile-time constant 'n', because it just does one or two MOVs, in all the other cases where a function call is performed is worse in terms of runtime cost. Therefore, I found a middle-ground solution that both reduces the code size by ~3k and also it's faster:

Code: Select all

inline void *memcpy(void *dest, const void *src, size_t n)
{
   u32 unused; /* See the comment in strlen() about the unused variable */
   u32 unused2;

   if (__builtin_constant_p(n) && n <= BUILTIN_SIZE_THRESHOLD)     // New code using __builtin_memcpy
      return __builtin_memcpy(dest, src, n);                       // when it's best.

   asmVolatile("rep movsl\n\t"         // copy 4 bytes at a time, n/4 times
               "mov %%ebx, %%ecx\n\t"  // then: ecx = ebx = n % 4
               "rep movsb\n\t"         // copy 1 byte at a time, n%4 times
               : "=b" (unused), "=c" (n), "=S" (src), "=D" (unused2)
               : "b" (n & 3), "c" (n >> 2), "S"(src), "D"(dest)
               : "cc", "memory");

   return dest;
}
With this implementation:
- the compiler is free to inline memcpy() or not to (the same code is also instantiated in a translation unit with extern inline).
- when 'n' is known at compile-time and it's small enough (16 or smaller), just MOVs are emitted by __builtin_memcpy()
- the code size is reduced by ~3k
- the runtime performance should be slightly better (but the improvement is not measurable, with my current tests)

Thanks @Korona, for suggesting the idea. Even if I didn't use it directly, it lead to a better code, at the end.
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck
Octocontrabass
Member
Member
Posts: 5568
Joined: Mon Mar 25, 2013 7:01 pm

Re: Framebuffer: Draw a PSF character

Post by Octocontrabass »

GCC provides options to configure the behavior of __builtin_memcpy() and friends on x86. You may be able to improve performance by tweaking the decisions.
vvaltchev wrote:

Code: Select all

   asmVolatile("rep movsl\n\t"         // copy 4 bytes at a time, n/4 times
               "mov %%ebx, %%ecx\n\t"  // then: ecx = ebx = n % 4
               "rep movsb\n\t"         // copy 1 byte at a time, n%4 times
               : "=b" (unused), "=c" (n), "=S" (src), "=D" (unused2)
               : "b" (n & 3), "c" (n >> 2), "S"(src), "D"(dest)
               : "cc", "memory");
You can replace the volatile qualifier and the two unnecessary clobbers with appropriate input and output parameters. Something like this:

Code: Select all

   asm("rep movsl\n\t"         // copy 4 bytes at a time, n/4 times
       "mov %%ebx, %%ecx\n\t"  // then: ecx = ebx = n % 4
       "rep movsb\n\t"         // copy 1 byte at a time, n%4 times
       : "=b" (unused), "=c" (n), "=S" (src), "=D" (unused2), "=m"(*(char (*)[n])dest)
       : "b" (n & 3), "c" (n >> 2), "S"(src), "D"(dest), "m"(*(const char (*)[n])src) );
But, why aren't you using REP MOVSB? The comment in your code suggests you've avoided it because it copies one byte at a time, but modern CPUs will copy entire cache lines at once. (Unless the source or destination is uncacheable. Are you using memcpy on MMIO? You should probably use a separate function for that.)
vvaltchev
Member
Member
Posts: 274
Joined: Fri May 11, 2018 6:51 am

Re: Framebuffer: Draw a PSF character

Post by vvaltchev »

Octocontrabass wrote:GCC provides options to configure the behavior of __builtin_memcpy() and friends on x86. You may be able to improve performance by tweaking the decisions.
Thank you, @Octocontrabass. It's always amazing to discover how many things our compilers support that we didn't know about.
Octocontrabass wrote:You can replace the volatile qualifier and the two unnecessary clobbers with appropriate input and output parameters.[/url] Something like this:

Code: Select all

   asm("rep movsl\n\t"         // copy 4 bytes at a time, n/4 times
       "mov %%ebx, %%ecx\n\t"  // then: ecx = ebx = n % 4
       "rep movsb\n\t"         // copy 1 byte at a time, n%4 times
       : "=b" (unused), "=c" (n), "=S" (src), "=D" (unused2), "=m"(*(char (*)[n])dest)
       : "b" (n & 3), "c" (n >> 2), "S"(src), "D"(dest), "m"(*(const char (*)[n])src) );
OK, I admit I haven't tried "=m/m" with such a fancy pointer-to-array cast. I agree that's more sophisticated than just clobbering the memory, because it's more restrictive. About the volatile, I wanted to remove it as well and I believe that at some point I did that, but later I discovered that, in some cases, the compiler thought is was OK to interleave my inline asm with other instructions and messed everything up. So, while I agree that your memory clobbering is better, do you agree that in theory clobbering "memory" should be enough? Can the compiler feel "entitled" to break "asm" when it's not volatile if just "memory" clobbering is used? I would be happy to remove that volatile, but I'm afraid to struggle again debugging some nasty UB bug.
Octocontrabass wrote:But, why aren't you using REP MOVSB? The comment in your code suggests you've avoided it because it copies one byte at a time, but modern CPUs will copy entire cache lines at once. (Unless the source or destination is uncacheable. Are you using memcpy on MMIO? You should probably use a separate function for that.)
In my understanding (might be wrong, of course), REP MOVSL is fast as REP MOVSB on modern machines, while on older machines REP MOVSL is definitively faster than the other one. Therefore, I optimized the code for older machines, without loosing anything on modern machines. Is that correct? I would be very surprised if REP MOVSB can be faster than REP MOVSL because under the hood it can be implemented as REP MOVSB with ecx=N*4 but.. anything's possible.
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck
Octocontrabass
Member
Member
Posts: 5568
Joined: Mon Mar 25, 2013 7:01 pm

Re: Framebuffer: Draw a PSF character

Post by Octocontrabass »

vvaltchev wrote:So, while I agree that your memory clobbering is better, do you agree that in theory clobbering "memory" should be enough? Can the compiler feel "entitled" to break "asm" when it's not volatile if just "memory" clobbering is used?
In this case, the "memory" clobber is enough. It's not great for the optimizer, since you don't specify which memory you're using as input or output, but it will work.

The volatile keyword is necessary when you're accessing things that aren't visible to the optimizer, such as hardware I/O.
vvaltchev wrote:In my understanding (might be wrong, of course), REP MOVSL is fast as REP MOVSB on modern machines, while on older machines REP MOVSL is definitively faster than the other one. Therefore, I optimized the code for older machines, without loosing anything on modern machines. Is that correct? I would be very surprised if REP MOVSB can be faster than REP MOVSL because under the hood it can be implemented as REP MOVSB with ecx=N*4 but.. anything's possible.
On some CPUs, REP MOVSB is faster than REP MOVSD when the source or destination is misaligned.
vvaltchev
Member
Member
Posts: 274
Joined: Fri May 11, 2018 6:51 am

Re: Framebuffer: Draw a PSF character

Post by vvaltchev »

Octocontrabass wrote:In this case, the "memory" clobber is enough. It's not great for the optimizer, since you don't specify which memory you're using as input or output, but it will work.
OK, that's what exactly what I was thinking.
Octocontrabass wrote:The volatile keyword is necessary when you're accessing things that aren't visible to the optimizer, such as hardware I/O.
Can you be more specific about what do you mean with hardware I/O? In/out instructions and RDTSC and friends?

Anyway, just to check your theory (which sounds good to me), I tried again removing the volatile keyword and I got a triple fault, immediately. No other changes. In the past, I spent some time debugging what happens precisely without "volatile" and I remember seeing other instructions interleaving with the inline asm ones. Those instructions altered some of the registers that the inline asm needed and that's why everything broke. In my understanding, when "volatile" is used, the compiler cannot split nor reorder the instructions inside the inline assembly. With plain asm it can. So, in theory, it's better to use just "asm" and leave more freedom to the compiler to optimize what it can but, in practice, even with the right clobbers etc. not always work. Do you have any ideas why "asm" might be not enough in this case? I honestly, don't.

Also, I tried the "=m"(*(char (*)[n])dest" clobber, but the compilation failed because I use both -Wvla and -Werror:

Code: Select all

/home/vlad/dev/tilck/include/tilck/common/arch/generic_x86/asm_x86_strings.h:104:16: error: ISO C90 forbids variable length array [-Werror=vla]
                : "=b" (unused), "=c" (n), "=S" (src), "=D" (unused2), "=m"(*(char (*)[n])dest)
                ^
Sad story. I mean, yours is just a clobber, not a real VLA, but the compiler doesn't care.
The hardware is always surprising!! :-)
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck
Octocontrabass
Member
Member
Posts: 5568
Joined: Mon Mar 25, 2013 7:01 pm

Re: Framebuffer: Draw a PSF character

Post by Octocontrabass »

vvaltchev wrote:Anyway, just to check your theory (which sounds good to me), I tried again removing the volatile keyword and I got a triple fault, immediately. No other changes.
Hmm, that seems to disagree with the GCC documentation, yet compiler explorer agrees, the "memory" clobber doesn't appear to be enough. Perhaps the optimizer ignores the possibility of output through a "memory" clobber? That's annoying but reasonable.
vvaltchev wrote:but the compilation failed because I use both -Wvla and -Werror:
You can temporarily disable -Wvla:

Code: Select all

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wvla"
asm (...);
#pragma GCC diagnostic pop
If you don't want to do that for some reason, you can remove the length: "=m"(*(char (*)[])dest)
User avatar
bzt
Member
Member
Posts: 1584
Joined: Thu Oct 13, 2016 4:55 pm
Contact:

Re: Framebuffer: Draw a PSF character

Post by bzt »

The problem with memcpy is, that it does not return the value (it requires a memory address which can't always be optimized). Using a pointer cast will tell the compiler to return a value, no matter what compiler it is and what optimizations it can do. With memcpy you're relying entirely on the hope that maybe the compiler will handle that. I don't like relying on implemenation-specific compiler optimization features, I prefer to optimize my code by hand.

I mean while you can do this and it will always compile perfectly into a single MOV no matter the compiler and it's optimizer capabilities (okay, two MOVs if the compiler is dummy, get address and a dereference):

Code: Select all

printf("%d", *((int*)someaddress));
you cannot do the same with memcpy

Code: Select all

printf("%d", memcpy(?, someaddress, 4));
this means that you must use a temporary memory variable. (Read: your code will unnecessarily copy the data twice).

And no gcc isn't smart enough to optimize this, here's a simple example you can try:

Code: Select all

#include <stdio.h>
#include <string.h>

static char buf[4096];

int main()
{
    int i;
    memcpy(&i, &buf[3], sizeof(int));
    printf("%d\n", i);
    return 0;
}
Compile and check the result:

Code: Select all

gcc test.c -o test
0000000000001149 <main>:
    1149:	55                   	push   %rbp
    114a:	48 89 e5             	mov    %rsp,%rbp
    114d:	48 83 ec 10          	sub    $0x10,%rsp
    1151:	64 48 8b 04 25 28 00 	mov    %fs:0x28,%rax
    1158:	00 00
    115a:	48 89 45 f8          	mov    %rax,-0x8(%rbp)
    115e:	31 c0                	xor    %eax,%eax
    1160:	8b 05 fd 2e 00 00    	mov    0x2efd(%rip),%eax        # 4063 <buf+0x3>
    1166:	89 45 f4             	mov    %eax,-0xc(%rbp)
    1169:	8b 45 f4             	mov    -0xc(%rbp),%eax
    116c:	89 c6                	mov    %eax,%esi
    116e:	48 8d 3d 8f 0e 00 00 	lea    0xe8f(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
    1175:	b8 00 00 00 00       	mov    $0x0,%eax
    117a:	e8 c1 fe ff ff       	call   1040 <printf@plt>
    117f:	b8 00 00 00 00       	mov    $0x0,%eax
    1184:	48 8b 55 f8          	mov    -0x8(%rbp),%rdx
    1188:	64 48 2b 14 25 28 00 	sub    %fs:0x28,%rdx
    118f:	00 00
    1191:	74 05                	je     1198 <main+0x4f>
    1193:	e8 98 fe ff ff       	call   1030 <__stack_chk_fail@plt>
    1198:	c9                   	leave
    1199:	c3                   	ret
as you can see even though memcpy was replaced by MOV, the temporary variable remained, meaning four MOVs and an additional variable on stack instead of single MOV with a register
1160 moves from buf to eax,
1166 moves from eax to stack,
1169 and 116c moves from stack to printf argument.
That's more than 4 times overhead. It's even worse than what a dummy compiler would generate as it accesses the stack too. Twice.

You'll have to enable certain compiler specific optimizations to make it go away, but there's absolutely no guarantee that a compiler can do this at all. Plus an stb-style header only library most certainly can't specify optimization flags in a portable way (unless it has an ifdef maze to detect compiler and use the appropriate pragma which might be or might be not supported by the actual compiler...)

So I'm not sold, I'll go with pointer casting as it doesn't require compiler specific features and it results in much better code (even on dummy compilers).
Best case:

Code: Select all

mov ($buf + 3), %esi
Worst case:

Code: Select all

mov $buf + 3, %eax
mov (%eax), %esi
That's still much better than what gcc produced with memcpy (excuse I've used "buf + 3" to make the example readable, using rip-relative address instead would go exactly the same way).

Cheers,
bzt
Octocontrabass
Member
Member
Posts: 5568
Joined: Mon Mar 25, 2013 7:01 pm

Re: Framebuffer: Draw a PSF character

Post by Octocontrabass »

bzt wrote:Using a pointer cast will tell the compiler to return a value, no matter what compiler it is and what optimizations it can do.
If the pointer cast is undefined behavior, either due to misalignment or aliasing, you're telling the compiler that the code is unreachable.
bzt wrote:And no gcc isn't smart enough to optimize this,
You've told it not to optimize. Everything is fine when you turn on optimizations.
bzt wrote:as you can see even though memcpy was replaced by MOV, the temporary variable remained, meaning four MOVs and an additional variable on stack instead of single MOV with a register
1160 moves from buf to eax,
1166 moves from eax to stack,
1169 and 116c moves from stack to printf argument.
That's more than 4 times overhead. It's even worse than what a dummy compiler would generate as it accesses the stack too. Twice.
This has nothing to do with memcpy. All local variables are spilled to the stack when optimizations are disabled. You'll see the same thing with any other local variable.
bzt wrote:You'll have to enable certain compiler specific optimizations to make it go away, but there's absolutely no guarantee that a compiler can do this at all. Plus an stb-style header only library most certainly can't specify optimization flags in a portable way
If the user of your library can't figure out how to enable compiler optimizations, that's their problem, not yours. If your library relies on undefined behavior, it's not portable.
Post Reply