Do you use compiler optimizations?

bzt · Post by **bzt** » Tue Feb 12, 2019 1:54 pm

Hi

Since I have finally enabled SSE in my bootloader, I played a bit with compiler optimizers. Here are my foundings:

First, I've used a known to be correct and perfectly working code (with "-O0") as a baseline. Then I've recompiled it with "-O2" and various optimizer flags and "-fsanitize=*" turning some features separately off. I have to say I wasn't pleased with the results, but maybe someone here knows the solution for one or more of my problems.

Incorrect resolving of struct field addresses
I map the first page as supervisor only, so that any user code that tries to dereference NULL will cause a page fault. This works great. But I didn't wanted to waste 4k, therefore I put some process related, for kernel's eye only variables there. Unfortunately both gcc and Clang miscompiles the following struct reference (where pid is not the first field in the struct, hence the accessed memory address is definitely not 0):

Code: Select all

p = *((proc_struct*)0)->pid

and instead of a correct "mov rax, [8]" instruction, they generate an "UD" instruction. Thankfully there's an easy solution, I've added "-fno-delete-null-pointer-checks" to the compiler flags. Browsing the net revealed that this particular optimization messed up the Linux kernel too really badly (although for a different reason).

Bad code generation
Someting I cannot understand, that gcc generated bad, misaligned code. I've called a C function, and in it's function prologue (before the first instruction compiled from the first expression in the C code) it crashed. Debugging revealed that the faulty instruction was a "movaps [rsp], xmm0". Checking rsp it was 0xfffffffffffa8, which is not 16 bytes aligned. Not only SysV ABI expects 16 bytes stack alignment upon function calls, so does movaps. Now the problem with this is, that the programmer cannot influence the stack pointer from C directly, nor can he tell the compiler to use movups, so it is definitely the compiler's responsibility.
The only solution I could came up with was to add "-mno-sse", which quite defeats the whole purpose of my SSE optimization experiment. And this is the better part, as at least with this kind of errors I could see in run-time that the generated code was wrong.

Changed schematics
The most extremely annoying thing, that the optimizers changed the code's behaviour silently. No errors, no run-time faults, just the code does not do what the algorithm in the C source means. This should not happen under no circumstances IMHO. One of the problematic code was:

Code: Select all

void kpanic(char *reason, ...) {
  va_list args;
  va_start(reason, args);
  char strbuf[128];
  vsprintf(strbuf, reason, args);

  kprintf("%s", strbuf);              <--- this worked

  if (debugger_enabled) {
    debugger(strbuf);
  } else {
    kprintf("Panic: ");

    kprintf("%s", strbuf);          <--- this doesn't
    ...

Debugging revealed that the optimized code simply passed an incorrect address to the second kprintf() (used the wrong register). I'd like to point out that I've added the first kprintf() after I haven't seen the string printed, so it doesn't matter if there's a kprintf() before the 'if' statement or not. I'm really curoius what madness made gcc to optimize this code incorrectly in the 'else' block, especially when the same code was ok before the 'if'.

Another problematic code was in the logger

Code: Select all

char *old = ptr;
...here are a bunch of sprintf()s to concatenate the formatted date and the message to ptr...
/* if debug console enabled, print the log message */
if (debugconsole_enabled) {
    while (old<ptr)
       debugconsole_putc(*old++);
}

Believe it or not, this only printed the first letter. Using objdump it turned out, that the entire loop was optimized away for some strange reason. I've tried

Code: Select all

for (;old<ptr;ptr++)

too, didn't work. Finally with

Code: Select all

while(*old)

gcc generated the correct code, but Clang failed no matter what. Clearly the compilers mistaken that ptr haven't changed. But it has, so how on earth do you tell the compiler don't optimize away an important loop? Gcc at least supports __attribute__((optimize(0))) as a function attribute (which is not good because that removes all the other, potentially good optimizations from the function), but Clang doesn't know that attribute. Using "-f*loop*" is not an option, because I wanted to optimize all the other loops in my code except for this one.

Conclusion: I didn't feel I wanted to guess what part of my code's schematics will be changed silently next time, so for now I went back with "-O0" and manual optimization. Maybe it's just me, but I want my generated code to do as the C source says. I just don't want any "if(ptr!=NULL)" silently removed. And I obviously don't want broken, faulty code to be generated either.

What are your experience with gcc and Clang optimizers regarding kernel development?
bzt

Octocontrabass · Post by **Octocontrabass** » Tue Feb 12, 2019 3:14 pm

bzt wrote:Incorrect resolving of struct field addresses

You are casting 0 to a pointer, and then dereferencing that pointer. In C, casting 0 to a pointer results in a null pointer (regardless of the compiler's internal representation of pointers), and dereferencing a null pointer is undefined behavior. I'm not sure why you expected anything else.

It also seems like you're over-optimizing. In a 32-bit OS with a 1GiB/3GiB kernel/user split, that's less than 0.0004% of the kernel's 1GiB address space. You're better off leaving page 0 unmapped (so it always faults) and mapping that 4kB elsewhere.

bzt wrote:Bad code generation

Do you have an example that generates this bad code? I find it hard to believe that GCC would make a mistake like this; it's more likely that you've accidentally misaligned the stack somewhere.

bzt wrote:Changed schematics

Er...

Code: Select all

  va_start(reason, args);

Those two arguments should be the other way around. Maybe that's why it's generating nonsense code?

I'm not sure what's up with the loop optimization. I think I'd have to see more context to figure out where the optimizer gets it wrong.

Solar · Post by **Solar** » Tue Feb 12, 2019 4:03 pm

bzt wrote:First, I've used a known to be correct and perfectly working code (with "-O0") as a baseline.

Errr... no.

All other things aside, you can safely assume that the optimizing code of GCC / Clang, if anything, is better tested than the -O0 code...

Code that Octocontrabass didn't cover:

Code: Select all

char *old = ptr;
/* ...here are a bunch of sprintf()s to concatenate the formatted date and the message to ptr... */
/* if debug console enabled, print the log message */
if (debugconsole_enabled) {
    while (old<ptr)
       debugconsole_putc(*old++);
}

Believe it or not, this only printed the first letter.

From the code you showed, it shouldn't even print that much, as old == ptr to begin with. So you did something in the code you didn't show us, and it's likely that this was where your intention and your execution differed. Show us self-contained examples, or it doesn't make much sense trying to "spot the bug".

I've tried
Code: Select all
for (;old<ptr;ptr++)
too, didn't work.

As you're incrementing the wrong variable, that's not surprising. I guess you just got the condition backwards while typing your post, but that just enhances the importance of self-contained examples that you write, check to actually exhibit the problem, and then copy & paste unaltered.

Finally with
Code: Select all
while(*old)
gcc generated the correct code, but Clang failed no matter what. Clearly the compilers mistaken that ptr haven't changed. But it has, so how on earth do you tell the compiler don't optimize away an important loop?

Wrong approach. First, prove that you are actually right. Check your assumptions. Show that "old" and "ptr" (and other variables involved) have the expected values, and that all functions you use work as assumed. I'd bet money that you're in for a surprise.

Conclusion: I didn't feel I wanted to guess what part of my code's schematics will be changed silently next time, so for now I went back with "-O0" and manual optimization. Maybe it's just me, but I want my generated code to do as the C source says. I just don't want any "if(ptr!=NULL)" silently removed. And I obviously don't want broken, faulty code to be generated either.

The chances of correct code input giving faulty output are very, very small. And code is correct if the language says so, not if "it works with -O0".

What are your experience with gcc and Clang optimizers regarding kernel development?

That every time my code didn't behave as expected, it turned out I made a mistake somewhere. I, not the compiler. (I never did -O0 compiles to begin with, except when I wanted to step through code in GDB, as optimization introduces some confusing jumping around in the source while you step through it.)

Schol-R-LEA · Post by **Schol-R-LEA** » Tue Feb 12, 2019 5:42 pm

@bzt: Setting aside the optimization options for a moment, what warning/error options do you have set?

bzt · Post by **bzt** » Wed Feb 13, 2019 8:27 am

Thanks, but this topic is not about debugging my code, it's about optimizers misbehaving.

As I've said, the code is compiling (with both gcc and Clang, "-O0 -ansi -Wall -ffreestanding -fno-builtins -fno-stack-protector -nostdlib -nostdinc") without errors or warnings and running perfectly without faults and doing exactly what it's expected to do. That's the baseline, because with my experiment I specify the baseline.

I've made mistakes in my exampes, sorry for that. Non the less they were just examples. Forget them. The same issue is with the Linux kernel.

Octocontrabass wrote:I'm not sure why you expected anything else.

Because in kernel development it's perfectly valid to use a struct which starts from the beginning of the memory (we are in freestanding environment not in hosted environment). If not my example, then think of real mode IVT. But this one is solved, I've found the flag which turn that stupid "optimization" off.

Solar wrote:That every time my code didn't behave as expected, it turned out I made a mistake somewhere. I, not the compiler.

Octocontrabass wrote:(on changed schemantics) Er...

See? There's no way that I have made a mistake in my C code. One kprintf() before the "if", one in the "else" block, and gcc generated different code for them (second being just wrong). Not with -O0 or Clang.

Octocontrabass wrote:Do you have an example that generates this bad code?

Code: Select all

void main()                    <--- here the stack is aligned, I've checked
{
   subsystem_init();
}

void subsystem_init()
{
    int i, j, k;                <--- movaps [rsp] generated here in the code before the "call kprintf"
    kprintf("subsystem_init");
}

But the C code doesn't really matter. This is clearly a bug, as a compiler should never generate misaligned stack access code no matter what. (This wasn't the first time I run into such a problem btw, referencing some fields in FAT BPB also caused misaligned memory access fault under AArch64. That's wasn't a stack issue, so you could defend gcc a bit in that case). The stack issue is very problematic, because you can't influence the stack pointer directly from C.

Octocontrabass wrote:Those two arguments should be the other way around. Maybe that's why it's generating nonsense code?

That was just a mistake in the example. If the argument order would be an issue, then it would be an issue with -O0 isn't that so?

Solar wrote:From the code you showed, it shouldn't even print that much

Exactly my point, it shouldn't but it did print one character! Also using equivalent "for()" and "while()" should not influence the optimization, yet it did (for gcc at least, not for Clang). My concern for optimization is precisely that, the "from the code shown, it shouldn't even do that", that's the topic.

Solar wrote:All other things aside, you can safely assume that the optimizing code of GCC / Clang, if anything, is better tested than the -O0 code...

That would be a false assumption, that's my point. With Miles Fidelman's words: "if gcc's optimizer is opening a class of security holes - then it's gcc that has to be fixed" - or optimizer not being used, may I add.

Cheers,
bzt

Solar · Post by **Solar** » Wed Feb 13, 2019 8:49 am

bzt wrote:Thanks, but this topic is not about debugging my code, it's about optimizers misbehaving.

The point is they aren't misbehaving, your code is. The odds are overwhelmingly against you.

As I've said, the code is compiling (with both gcc and Clang, "-O0 -ansi -Wall -ffreestanding -fno-builtins -fno-stack-protector -nostdlib -nostdinc") without errors or warnings and running perfectly without faults and doing exactly what it's expected to do. That's the baseline, because with my experiment I specify the baseline.

"-Wall" is nowhere near to making full use of the error-checking capabilities of your compiler. It's a mere essential baseline. Start with "-Wall -Wextra", and then work your way up to the strictest set of error-checking options possible.

And then be aware that the compiler is still by no means obliged to tell you about every mistake you might have made. Its job is compiling your code, it's not a statical code analyzer. As I said before, the fact that your code "works" with -O0 does by no means imply it isn't broken.

The fact that I have to tell you this, unfortunately, flavors my judgement of your overall skills, which is why I remain unmoved by your claims that GCC / Clang are broken. Your code is most likely riddled with bugs and ambiguities you are not even aware of at this point.

Optimization does not break compliant code. It does, however, uncover bugs that can remain hidden in unoptimized compiles. If your correct code got broken, I'll buy you a beer and apologize in public (i.e., here).

Because in kernel development it's perfectly valid to use a struct which starts from the beginning of the memory (we are in freestanding environment not in hosted environment).

In C, "rules as written", it is never valid to dereference the null pointer, not even in a freestanding environment. I agree, though, that there has to be some (implementation-defined) way for your compiler to do something to that end, because there has to be some way for it to implement the offsetof macro. Yes, I actually implemented that macro in PDCLib by dereferencing a null pointer. But you should rely on that macro instead of doing that zero-casting on your own. (It's in stddef.h, i.e. available in freestanding environments as well.)

But this one is solved, I've found the flag which turn that stupid "optimization" off.

You're doing the "angry child" routine here ("I am right and the compiler isn't, I'll shove down this instruction down its throat to make it behave"). That attitude will set you up for a lot of pain yet to come. It also degrades your code quality. You should be striving for code that gets it right with the absolute minimum amount of "special" flags, options, and / or attributes necessary.

Solar wrote:That every time my code didn't behave as expected, it turned out I made a mistake somewhere. I, not the compiler.
Octocontrabass wrote:(on changed schemantics) Er...
See? There's no way that I have made a mistake in my C code.

No, I don't see, since the code you showed was not the code giving the described behavior.

Bottom line, I will refuse to even consider a compiler bug until you have shown a Minimal, Compilable, Verifiable Example / Short, Self-Contained, Correct Example to trigger what you consider erroneous behavior.

You asked for our opinion and advice. That's it, right there.

Or, to paraphrase StackExchange parlance, voting to close.

Solar · Post by **Solar** » Wed Feb 13, 2019 9:13 am

Solar wrote:In C, "rules as written", it is never valid to dereference the null pointer, not even in a freestanding environment. I agree, though, that there has to be some (implementation-defined) way for your compiler to do something to that end, because there has to be some way for it to implement the offsetof macro. Yes, I actually implemented that macro in PDCLib by dereferencing a null pointer. But you should rely on that macro instead of doing that zero-casting on your own.

Wait a second. I've just looked at my own code comment:

Code: Select all

/* The offsetof macro
   Contract: Expand to an integer constant expression of type size_t, which
   represents the offset in bytes to the structure member from the beginning
   of the structure. If the specified member is a bitfield, behaviour is                       // <------
   undefined.
   There is no standard-compliant way to do this.
   This implementation casts an integer zero to 'pointer to type', and then
   takes the address of member. This is undefined behaviour but should work on
   most compilers.
*/
#define _PDCLIB_offsetof( type, member ) ( (size_t) &( ( (type *) 0 )->member ) )

Could it be that your proc_struct is a / contains bitfields?

bzt · Post by **bzt** » Wed Feb 13, 2019 9:34 am

Solar wrote:The point is they aren't misbehaving, your code is.

Again, it's not only my code, the same issue stands with the Linux kernel too.

Optimization does not break compliant code. If your correct code got broken, I'll buy you a beer.

According to the Linux developers and my tests, it does. Please show any C code and any combination of optimizer flags which could "correctly" generate misaligned stack access. That's just not possible with the elements of the C language, therefore it must be a gcc optimizer bug. Thank you for the beer, I hope can take your word for it one day!

You're doing the "angry child" routine here ("I am right and the compiler isn't, I'll shove down this instruction down its throat to make it behave"). That attitude will set you up for a lot of pain yet to come.

There's nothing "angry child" about it. As you've said, "there has to be some (implementation-defined) way for your compiler to do something to that end" and that flag is that. Good for me.

No, I don't see, since the code you showed was not the code giving the described behavior.

And that's was my point. The code shown shouldn't, but did do the described behavior. I'm not sure if I ever able to prove this to you, so you have to take my word for it. Every single log function call outputted a single '2' character nothing less, nothing more (that '2' was the first character from the date in the log message). Compiled with "-O0" or loop replaced with a "while(*old)" and the whole log message (staring with the full date) appeared on the console as expected.

Bottom line, I will refuse to accept the possibility to even consider a compiler bug until you have shown a Minimal, Compilable, Verifiable Example / Short, Self-Contained, Correct Example to trigger what you consider erroneous behavior.

I understand that, and fair enough. But I'm not talking about some specific issue here that you can create a PoC for. Those were just examples. It is more like compiler optimizer bugs in general. (The ones Linux developers found included, all of which have MCVE). Or take for example that case (reported on OSDev too) when the optimizer replaced the loop in a memcpy() with a "call memcpy", causing an infinite function call chain.

Because it's wildly unlikely (though, I admit, not entirely unheard of) to find an error like that. I've seen that happen once, in all those years here on OSDev and StackOverflow.

In my more than 30 years of development, I've found at least 4 gcc bugs, and 1 in llvm lld (not much, I admit, but more than one). Gcc developers (just like the GNU community in general) are not eager to admit their mistakes, but Clang developers already fixed the issue.

Or, to paraphrase StackExchange parlance, voting to close.

Maybe I've opened this topic in the wrong subforum, but it's not supposed to be a ticket that you can close unresolved, it's supposed to start a theoretical discussion. Just for the sake of the argument, forget about me, let's just assume that Linux developers are right, and gcc optimizer is indeed opening doors for security holes.

Cheers,
bzt

Solar · Post by **Solar** » Wed Feb 13, 2019 9:57 am

bzt wrote:
Solar wrote:The point is they aren't misbehaving, your code is.
Again, it's not only my code, the same issue stands with the Linux kernel too.

I will not start elaborating my optinion on the Linux kernel code quality, or we wouldn't see the end of it. Also, I didn't gather as much from the link you posted.

You are making bold statements here regarding your opinion. It's up to you to follow them up if you want us to change ours.

Please show any C code and any combination of optimizer flags which could "correctly" generate misaligned stack access.

Reversal of the burden of proof. You claimed you had such code. Why not just show it yourself? I am claiming it can't be done, so I'm not the one having to prove anything.

Thank you for the beer, I hope can take your word for it one day!

Right now you're more likely to get a jug of scorn...

As you've said, "there has to be some (implementation-defined) way for your compiler to do something to that end" and that flag is that. Good for me.

What I said was that the standard-defined offsetof() macro does what you want in a compliant and portable manner. What you are doing is forcing your non-compliant solution to be accepted by this version of this specific compiler.

No, I don't see, since the code you showed was not the code giving the described behavior.
And that's was my point. The code shown shouldn't, but did do the described behavior.

No, the code shown was faulty (reversing the parameters) and incomplete, i.e. not self-contained / the issue not reproducible.

But I'm not talking about some specific issue here that you can create a PoC for. Those were just examples. It is more like compiler optimizer bugs in general.

And I am talking about non-compliant code, obviously broken code examples, under-use of compiler warnings, and generally bad coding practices on your part.

...let's just assume that Linux developers are right...

I will work on that assumption the day hell freezes over.

Octocontrabass · Post by **Octocontrabass** » Wed Feb 13, 2019 11:26 am

bzt wrote:As I've said, the code is compiling (with both gcc and Clang, "-O0 -ansi -Wall -ffreestanding -fno-builtins -fno-stack-protector -nostdlib -nostdinc") without errors or warnings and running perfectly without faults and doing exactly what it's expected to do. That's the baseline, because with my experiment I specify the baseline.

GCC and Clang are C compilers, so they compile code according to the C standard. The C standard allows (and sometimes expects) behavior that is not intuitive. For example, the C standard states that the compiler may assume any undefined behavior is unreachable and optimize on that assumption. This causes the Linux bug you linked above: Linux dereferenced a pointer, and later checked if the pointer is null. Dereferencing a null pointer is undefined behavior, so the compiler assumes that the pointer cannot be null and optimizes away the check.

I agree that it's confusing for compilers to make that assumption. However, as far as we can tell, the compilers are correct. If you have any examples of code that is correct according to the C standard and still produces undesired behavior, by all means, please show us.

Solar wrote:In C, "rules as written", it is never valid to dereference the null pointer, not even in a freestanding environment. I agree, though, that there has to be some (implementation-defined) way for your compiler to do something to that end, because there has to be some way for it to implement the offsetof macro. Yes, I actually implemented that macro in PDCLib by dereferencing a null pointer. But you should rely on that macro instead of doing that zero-casting on your own. (It's in stddef.h, i.e. available in freestanding environments as well.)

Incidentally, GCC and Clang both provide __builtin_offsetof for C libraries to implement the offsetof macro.

bzt · Post by **bzt** » Wed Feb 13, 2019 12:50 pm

Hi Solar,

To satisfy your curiousity (and also because I like beer

), here's the full output with sources and disassembly. I've restricted the code to the relevant parts.

Sources:

Code: Select all

x86_64/start.S:
_start:
    cli
    cld

    /* ...cpu initialization, lgdt etc. irrelevant, so removed... */

    /* jump to C function main() in 64 bit code segment */
    xchg %bx, %bx
    pushq   $0x08
    pushq   $main
    lretq

Code: Select all

main.c:
#include <core.h>
void main()
{
    lang_init();
    while(1);
}

Code: Select all

lang.c:
#include <core.h>
#include "lang.h"
void lang_init()
{
    char fn[]="/sys/lang/core.\0\0\0\0\0";
    char *s,*e,*a;
    int i=0,l,k;
__asm__ __volatile__("xchg %bx, %bx");                  <--- NOTE: there's no code in this function above this line
    ...
    /* the rest of this function is irrelevant as it never gets executed. Nothing non ANSI C btw, no inline assembly. */

Please note the ottermost important thing: there's nothing in the C code which could modify or mess up the stack pointer.

Compilation with full output (I haven't removed anything, no warnings or errors of any kind):

Code: Select all

$ x86_64-elf-gcc -D_AS=1  -DDEBUG=1 -DOPTIMIZE=0 -ansi -Wall -Wextra -Wpedantic -O2 -fpic -ffreestanding -nostdinc -nostdlib -fno-stack-protector -I../../../include -I./ibmpc -mno-red-zone -c start.S -o start.o
$ x86_64-elf-gcc  -DDEBUG=1 -DOPTIMIZE=0 -D_OSZ_CORE_=1 -D__x86_64__ -D__ibmpc__ -fpic -fno-stack-protector -fno-builtin -nostdlib -nostdinc -I. -I./x86_64 -I./x86_64/ibmpc -I../../include -ansi -Wall -Wextra -Wpedantic -ffreestanding -O2 -fno-delete-null-pointer-checks -fno-stack-protector -mno-red-zone  -c main.c -o main.o
$ x86_64-elf-gcc  -DDEBUG=1 -DOPTIMIZE=0 -D_OSZ_CORE_=1 -D__x86_64__ -D__ibmpc__ -fpic -fno-stack-protector -fno-builtin -nostdlib -nostdinc -I. -I./x86_64 -I./x86_64/ibmpc -I../../include -ansi -Wall -Wextra -Wpedantic -ffreestanding -O2 -fno-delete-null-pointer-checks -fno-stack-protector -mno-red-zone  -c lang.c -o lang.o

GCC version: 8.2.1 20180831

Bochs output:

Code: Select all

01991983883i[CPU0  ] [1991983883] Stopped on MAGIC BREAKPOINT
(0) Magic breakpoint
Next at t=1991983883
(0) [0x0000001380dc] 0008:ffffffffffe020dc (_start+5c): push 0x0000000000000008   ; 6a08
<bochs:4> s
Next at t=1991983884
(0) [0x0000001380de] 0008:ffffffffffe020de (_start+5e): push 0xffffffffffe05d20   ; 68205de0ff
<bochs:5> s
Next at t=1991983885
(0) [0x0000001380e3] 0008:ffffffffffe020e3 (_start+63): retf                      ; 48cb
<bochs:6> print-stack
Stack address size 8
 | STACK 0xffffffffffffffe0 [0xffffffff:0xffe05d20]                                     <--- NOTE: RSP is properly aligned
 | STACK 0xffffffffffffffe8 [0x00000000:0x00000008]
...
<bochs:7> c
01991983900e[CPU0  ] write_linear_xmmword_aligned(): #GP misaligned access
01991983900e[CPU0  ] interrupt(long mode): IDT entry extended attributes DWORD4 TYPE != 0
01991983900e[CPU0  ] interrupt(long mode): IDT entry extended attributes DWORD4 TYPE != 0
01991983900i[CPU0  ] CPU is in long mode (active)
01991983900i[CPU0  ] CS.mode = 64 bit
01991983900i[CPU0  ] SS.mode = 64 bit
01991983900i[CPU0  ] EFER   = 0x00000d01
01991983900i[CPU0  ] | RAX=0000000000000000  RBX=0000000000000000
01991983900i[CPU0  ] | RCX=0000000077bae39f  RDX=00000000bfebfbff
01991983900i[CPU0  ] | RSP=ffffffffffffff78  RBP=0000000000001abe                       <--- NOTE: RSP is not aligned
01991983900i[CPU0  ] | RSI=0000000000003422  RDI=0000000000002b28
01991983900i[CPU0  ] |  R8=0000000000000000   R9=0000000000000000
01991983900i[CPU0  ] | R10=0000000000000000  R11=0000000000000000
01991983900i[CPU0  ] | R12=0000000000000000  R13=0000000000000000
01991983900i[CPU0  ] | R14=0000000000000000  R15=0000000000000000
01991983900i[CPU0  ] | IOPL=0 ID vip vif ac vm RF nt of df if tf SF zf AF PF cf
01991983900i[CPU0  ] | SEG sltr(index|ti|rpl)     base    limit G D
01991983900i[CPU0  ] |  CS:0008( 0001| 0|  0) 00000000 0000ffff 0 0
01991983900i[CPU0  ] |  DS:001b( 0003| 0|  3) 00000000 0fffffff 1 0
01991983900i[CPU0  ] |  SS:0010( 0002| 0|  0) 00000000 0fffffff 1 0
01991983900i[CPU0  ] |  ES:001b( 0003| 0|  3) 00000000 0fffffff 1 0
01991983900i[CPU0  ] |  FS:001b( 0003| 0|  3) 00000000 0fffffff 1 0
01991983900i[CPU0  ] |  GS:001b( 0003| 0|  3) 00000000 0fffffff 1 0
01991983900i[CPU0  ] |  MSR_FS_BASE:0000000000000000
01991983900i[CPU0  ] |  MSR_GS_BASE:0000000000000000
01991983900i[CPU0  ] | RIP=ffffffffffe05d87 (ffffffffffe05d87)
01991983900i[CPU0  ] | CR0=0xe0000011 CR2=0x0000000000000000
01991983900i[CPU0  ] | CR3=0x0000a000 CR4=0x00000368
(0).[1991983900] [0x00000013bd87] 0008:ffffffffffe05d87 (lang_init+27): movaps dqword ptr ss:[rsp+16], xmm0 ; 0f29442410
01991983900p[CPU0  ] >>PANIC<< exception(): 3rd (13) exception with no resolution
01991983900e[CPU0  ] WARNING: Any simulation after this point is completely bogus !
Next at t=1991983901
(0) [0x00000013bd87] 0008:ffffffffffe05d87 (lang_init+27): movaps dqword ptr ss:[rsp+16], xmm0 ; 0f29442410
<bochs:8> q

Let's take a look at the generated code.

Objdump:

Code: Select all

ffffffffffe02080 <_start>:
ffffffffffe02080:       fa                      cli
ffffffffffe02081:       fc                      cld
...
ffffffffffe020d9:       66 87 db                xchg   %bx,%bx                      <--- first xchg that I used to print stack
ffffffffffe020dc:       6a 08                   pushq  $0x8
ffffffffffe020de:       68 20 5d e0 ff          pushq  $0xffffffffffe05d20
ffffffffffe020e3:       48 cb                   lretq
...
ffffffffffe05d20 <main>:
ffffffffffe05d20:       48 83 ec 08             sub    $0x8,%rsp
ffffffffffe05d24:       31 c0                   xor    %eax,%eax
ffffffffffe05d26:       e8 35 00 00 00          callq  ffffffffffe05d60 <lang_init>
...
ffffffffffe05d60 <lang_init>:
ffffffffffe05d60:       41 57                   push   %r15
ffffffffffe05d62:       41 56                   push   %r14
ffffffffffe05d64:       41 55                   push   %r13
ffffffffffe05d66:       41 54                   push   %r12
ffffffffffe05d68:       55                      push   %rbp
ffffffffffe05d69:       53                      push   %rbx
ffffffffffe05d6a:       48 83 ec 38             sub    $0x38,%rsp                   <---  this causes the misalignment
ffffffffffe05d6e:       8b 05 c9 1b 00 00       mov    0x1bc9(%rip),%eax        # ffffffffffe0793d <platform_dbgputc+0x520>
ffffffffffe05d74:       f3 0f 6f 05 b1 1b 00    movdqu 0x1bb1(%rip),%xmm0        # ffffffffffe0792d <platform_dbgputc+0x510>
ffffffffffe05d7b:       00
ffffffffffe05d7c:       89 44 24 20             mov    %eax,0x20(%rsp)
ffffffffffe05d80:       0f b6 05 ba 1b 00 00    movzbl 0x1bba(%rip),%eax        # ffffffffffe07941 <platform_dbgputc+0x524>
ffffffffffe05d87:       0f 29 44 24 10          movaps %xmm0,0x10(%rsp)
ffffffffffe05d8c:       88 44 24 24             mov    %al,0x24(%rsp)
ffffffffffe05d90:       66 87 db                xchg   %bx,%bx                      <--- second xchg never reached

As you can see from bochs output, when _start jumps to main(), the stack is properly aligned. Main does nothing, just calls a function. I'd like to point out that when lang_init() is called, the stack is still properly aligned, because of "sub $8" and the 8 bytes return address "callq" pushes. So if anything is happening, that's happening with the code gcc generates into lang_init().

There's nothing important in the lang_init() function, no code that could influence the generation of "movaps". The first instruction in C is the second "xchg %bx, %bx" which is not reached. If anything is generated into the function before that, that's 100% gcc's responsibility, such as the "sub $0x38, %rsp" instruction.

Now let's see what code is generated if we use "-O0". I haven't repeated _start as that hasn't changed.

Objdump:

Code: Select all

ffffffffffe072fd <main>:
ffffffffffe072fd:       55                      push   %rbp
ffffffffffe072fe:       48 89 e5                mov    %rsp,%rbp
ffffffffffe07301:       b8 00 00 00 00          mov    $0x0,%eax
ffffffffffe07306:       e8 48 00 00 00          callq  ffffffffffe07353 <lang_init>
...
ffffffffffe07353 <lang_init>:
ffffffffffe07353:       55                      push   %rbp
ffffffffffe07354:       48 89 e5                mov    %rsp,%rbp
ffffffffffe07357:       48 83 ec 40             sub    $0x40,%rsp
ffffffffffe0735b:       48 8b 05 4b 24 00 00    mov    0x244b(%rip),%rax        # ffffffffffe097ad <platform_dbgputc+0x524>
ffffffffffe07362:       48 8b 15 4c 24 00 00    mov    0x244c(%rip),%rdx        # ffffffffffe097b5 <platform_dbgputc+0x52c>
ffffffffffe07369:       48 89 45 c0             mov    %rax,-0x40(%rbp)
ffffffffffe0736d:       48 89 55 c8             mov    %rdx,-0x38(%rbp)
ffffffffffe07371:       8b 05 46 24 00 00       mov    0x2446(%rip),%eax        # ffffffffffe097bd <platform_dbgputc+0x534>
ffffffffffe07377:       89 45 d0                mov    %eax,-0x30(%rbp)
ffffffffffe0737a:       0f b6 05 40 24 00 00    movzbl 0x2440(%rip),%eax        # ffffffffffe097c1 <platform_dbgputc+0x538>
ffffffffffe07381:       88 45 d4                mov    %al,-0x2c(%rbp)
ffffffffffe07384:       c7 45 e4 00 00 00 00    movl   $0x0,-0x1c(%rbp)
ffffffffffe0738b:       66 87 db                xchg   %bx,%bx                  <--- reached without throwing a #GP

There's a "push %rbp", so with the return address pushed by "callq", the stack is still aligned, and %rsp has exactly the same memory address as with -O2 when lang_init() starts. No important change in main() then.

In lang_init() though, this time you'll find %rbp relative addressing mostly, and no instructions that could throw a general protection fault, therefore the second "xchg %bx, %bx" is reached as it should.

Conclusion: as the stack is properly aligned (rsp & 0xF == 0) on function call in both cases, therefore the stack misalignment is caused by a gcc optimizer bug.

Cheers,
bzt

bzt · Post by **bzt** » Wed Feb 13, 2019 12:58 pm

Octocontrabass wrote:GCC and Clang are C compilers, so they compile code according to the C standard. The C standard allows (and sometimes expects) behavior that is not intuitive. For example, the C standard states that the compiler may assume any undefined behavior is unreachable and optimize on that assumption.

That's exactly why I'm suggesting a trusted kernel should never rely on compiler optimizations. Maybe it works, passes every tests you could imagine. But then one day with a newer version of the compiler (which now tries to optimize something it haven't tried until then), the compilation will silently introduce a security hole into the kernel without anybody knowing it (and for which you haven't got a test case because nobody knew about it). With manual optimizations only that could never happen, and it's much-much easier to test for a newer compiler.

Cheers,
bzt

Octocontrabass · Post by **Octocontrabass** » Wed Feb 13, 2019 2:56 pm

bzt wrote:Conclusion: as the stack is properly aligned (rsp & 0xF == 0) on function call in both cases, therefore the stack misalignment is caused by a gcc optimizer bug.

System V ABI AMD64 Supplement 0.99.7 draft wrote:The end of the input argument area shall be aligned on a 16 (32, if __m256 ispassed on stack) byte boundary. In other words, the value (%rsp+8) is always a multiple of 16 (32) when control is transferred to the function entry point.

I don't think the way you're calling your main() function adheres to the alignment requirement in the System V ABI.

bzt wrote:That's exactly why I'm suggesting a trusted kernel should never rely on compiler optimizations.

It sounds like what you want is a compiler for a language like C, but without all of the design pitfalls. (I've heard good things about Rust, although I've never tried it myself.)

alexfru · Post by **alexfru** » Thu Feb 14, 2019 1:44 am

You may have UB. The problem with UB is that it's become a creeping UB, where UB effects may appear far away from the operation actually causing it.

I remember a bug I investigated some 10+ years ago, it was something like:

Code: Select all

void foo(unsigned mask, int pos, int cond)
{
  unsigned flag = mask & (1u << pos);
  if (cond)
  {
    // use flag
  }
  else
  {
    // don't use flag
  }
}

Here pos was valid only for non-zero cond (and could be negative or greater than or equal to the number of bits in int otherwise).
The code crashed when cond was zero even though pos isn't even used in that case. But why?

It turned out that with optimizations enabled, the compiler generated the BT instruction to calculate the value of flag. What this instruction does is it converts the bit offset (pos) to a (d/q)word offset (by dividing by 16/32/64) and adds that to the address of mask. That means if pos is out of range, memory outside of the mask variable will be touched. Oops.

Per the language standard the compiler did nothing wrong. The mistake was programmer's, he did not pay attention to the possibility of UB in the code.

Linus wants to live in the days when UB was not creeping, like the 80's. In order to simulate the desired reality of the days long past he has built a wall around his code, a wall of compiler options that disable certain optimizations or allow certain deviations from the language standard. His desire is understandable (few like it when stuff breaks through no action of theirs other than picking up a new version of the compiler) and his approach is practical (since he doesn't write code to the standard anyway). I don't expect him to grow up here. Even if he does, it may be too late anyway (too much code to review, debug and fix). The situation is far from ideal, though. But I strongly recommend not to follow his path for any new code (compared to Linux your code is new).

Btw, there's also this COOGL language that's supposed to be better and safer than C (while still very close to C) for things like kernels.

MollenOS · Post by **MollenOS** » Thu Feb 14, 2019 3:09 am

bzt wrote:

Code: Select all

ffffffffffe02080 <_start>:
ffffffffffe02080:       fa                      cli
ffffffffffe02081:       fc                      cld
...
ffffffffffe020d9:       66 87 db                xchg   %bx,%bx                      <--- first xchg that I used to print stack
ffffffffffe020dc:       6a 08                   pushq  $0x8
ffffffffffe020de:       68 20 5d e0 ff          pushq  $0xffffffffffe05d20
ffffffffffe020e3:       48 cb                   lretq
...
ffffffffffe05d20 <main>:
ffffffffffe05d20:       48 83 ec 08             sub    $0x8,%rsp
ffffffffffe05d24:       31 c0                   xor    %eax,%eax
ffffffffffe05d26:       e8 35 00 00 00          callq  ffffffffffe05d60 <lang_init>
...
ffffffffffe05d60 <lang_init>:
ffffffffffe05d60:       41 57                   push   %r15
ffffffffffe05d62:       41 56                   push   %r14
ffffffffffe05d64:       41 55                   push   %r13
ffffffffffe05d66:       41 54                   push   %r12
ffffffffffe05d68:       55                      push   %rbp
ffffffffffe05d69:       53                      push   %rbx
ffffffffffe05d6a:       48 83 ec 38             sub    $0x38,%rsp                   <---  this causes the misalignment
ffffffffffe05d6e:       8b 05 c9 1b 00 00       mov    0x1bc9(%rip),%eax        # ffffffffffe0793d <platform_dbgputc+0x520>
ffffffffffe05d74:       f3 0f 6f 05 b1 1b 00    movdqu 0x1bb1(%rip),%xmm0        # ffffffffffe0792d <platform_dbgputc+0x510>
ffffffffffe05d7b:       00
ffffffffffe05d7c:       89 44 24 20             mov    %eax,0x20(%rsp)
ffffffffffe05d80:       0f b6 05 ba 1b 00 00    movzbl 0x1bba(%rip),%eax        # ffffffffffe07941 <platform_dbgputc+0x524>
ffffffffffe05d87:       0f 29 44 24 10          movaps %xmm0,0x10(%rsp)
ffffffffffe05d8c:       88 44 24 24             mov    %al,0x24(%rsp)
ffffffffffe05d90:       66 87 db                xchg   %bx,%bx                      <--- second xchg never reached

As you can see from bochs output, when _start jumps to main(), the stack is properly aligned. Main does nothing, just calls a function. I'd like to point out that when lang_init() is called, the stack is still properly aligned, because of "sub $8" and the 8 bytes return address "callq" pushes. So if anything is happening, that's happening with the code gcc generates into lang_init().

There's nothing important in the lang_init() function, no code that could influence the generation of "movaps". The first instruction in C is the second "xchg %bx, %bx" which is not reached. If anything is generated into the function before that, that's 100% gcc's responsibility, such as the "sub $0x38, %rsp" instruction.

Ok, so you may say the stack is unaligned after the subtract operation. But is it me that is completely oblivious? The value 0x38 is not a value that would cause the stack to go unaligned, this must have happened before. From what it looks to me, the stack must have been unaligned before you even called the function.

I have a really hard time believing its due to bugs in the compiler. Now I don't use gcc, I use clang, but I compile ALL of my kernel code with -O3, and I have yet to encounter an example where it was clang that fucked up, and not myself.

OSDev.org

Do you use compiler optimizations?

Do you use compiler optimizations?

Re: Do you use compiler optimizations?

Re: Do you use compiler optimizations?

Re: Do you use compiler optimizations?

Re: Do you use compiler optimizations?

Re: Do you use compiler optimizations?

Re: Do you use compiler optimizations?

Re: Do you use compiler optimizations?

Re: Do you use compiler optimizations?

Re: Do you use compiler optimizations?

Re: Do you use compiler optimizations?

Re: Do you use compiler optimizations?

Re: Do you use compiler optimizations?

Re: Do you use compiler optimizations?

Re: Do you use compiler optimizations?