OSDev.org

Posted: **Thu Mar 21, 2019 6:20 am**

linuxyne wrote:Did the code allow gcc to prove certain properties about tcba and ccb as always true (or false)?
If so, the compiler is within its rights (under the appropriate optimization-levels) to exploit them.

No. I've of course added volatile keyword (that was the first thing to do), no change, and I have expliticly told gcc not to assume invariants being true or false.

If gcc can conclude that sched_awake needs to be called even before it has a chance to check
ccb->hd_active, then it can eradicate the loop and checks (including similar checks in sched_awake).

That's my point, the compiler cannot conclude that correctly under no circumstances. Volatile or not, the contents of the queues are surely not known at compile time, therefore making any conclusions about them is invalid.

Besides the ccb->hd_active (head of the active queue at priority level i) has nothing to do with the ccb->hd_timerq (head of the blocked queue waiting for alarms). Sched_awake is only needed if the scheduler picks a task from the blocked qeueue, there's absolutely no reason to awake an already active task. Doing so would change the semantics, something no compiler or optimizer has the right to do. And there's no reason why to jump into the middle of sched_awake function without setting the input parameter and after the input checks, the compiler certainly cannot conclude correctly that any random input will be always valid at run-time. Optimizations, command line options aside, both assumptions are just plain wrong.

I'd go further: silently eradicating any kind of assert() is a huge mistake. The programmer put those there for a good reason, no optimization has the right to override that, and they should not. Eliminating input checks may be tolerateable for an application, but not for a kernel in freestanding mode.

The compiler is also within its rights to move the checks about 'ccb->hd_timerq' and 'tcba->alarmusec <= ccb->sched_ticks' out of the for and the while loops, or conclude them as always true/false (unless volatile or similar restrictions are imposed).

No, as I've said volatile didn't help. What's more, I've tried
1. "-fno-split-loops" (don't make assumption on invariants being true/false) and
2. "-fno-move-loop-invariants" (don't use RTL optimization on invariants)
3. "-fno-unswitch-loops" (don't move branches with invariant expression out of the loop) and
4. "-fno-peel-loops" (don't remove loops).
5. "-fno-crossjumping" (don't unify code of different functions)
Not to mention why did the removal of "continue" solved the issue? How could a "continue" inside a conditional branch influence the invariant or the conditional expression for the branch in any way? Why did "-fno-partial-iniling" generate the correct code when there's absoutely no inlining here? (We have a jump to sched_awake)

Cheers,
bzt

Posted: **Thu Mar 21, 2019 6:53 am**

This is the code, slightly modified:

Code: Select all

flag = ccb->hd_timerq && tcba->alarmusec <= ccb->sched_ticks;
pri = tcba->priority;

do {    
        nonempty = false;
        for(i=PRI_SYS; i<PRI_IDLE; i++) {
                if (flag && i == pri) {
                        sched_awake(tcba);
                        goto found;
                }
                
                if(!ccb->hd_active[i])
                        continue;
                
                nonempty = true;
                if(ccb->cr_active[i])
                        goto found;
                
                ccb->cr_active[i] = ccb->hd_active[i];
        }
} while(nonempty);
...
found:
...

The flag is independent of the loops. If compiler can show at compile time that flag remains true and pri remains PRI_SYS, then it knows to eliminate the loops including any hd_active and cr_active accesses inside those loops. (Edited s/is/remains).

You can see that if we call sched_awake (with i == pri == PRI_SYS), there's no need to run the loop. Given that the claim was that the loops were indeed eliminated, the above situation was my first guess.

About jumping in the middle of sched_awake, one needs to look at the disassembly in the entirely to see what the generated code actually was before concluding that jumping in the middle was actually incorrect. The layout of the code does not matter much as long as it works as intended.

I am not sure what you mean by volatile did not help. Were the loops removed even after forcing the compiler to read from memory for tcba and ccb pointers? Is it possible to get my hands on the scripts/files to reproduce the bugs myself?

If any compiler has the issues that are being suggested here, it would be unusable. Or, your code is just the right piece to expose them, in which case building a small testcase for each of the issues that you
had will help improve the compiler.

Posted: **Thu Mar 21, 2019 7:15 am**

bzt wrote:Why did "-fno-partial-iniling" generate the correct code when there's absoutely no inlining here?

GCC inlined part of sched_awake(), then removed unreachable code. Without the inlining, GCC was not able to determine that the code is unreachable.

The code you've shown isn't enough to compile, so I can't tell you why GCC thinks part of your code is unreachable.

Posted: **Thu Mar 21, 2019 6:21 pm**

Hi,

Thank you for your answers!

Code: Select all

flag = ccb->hd_timerq && tcba->alarmusec <= ccb->sched_ticks;
pri = tcba->priority;                                      <-- possible fault here

do {

@linuxyne: your code is different to mine. For example your code would throw a page fault right on the first call to sched_pick(). My code only resolves tcba fields after it verified that hd_timerq is not empty.

If compiler can show at compile time that flag remains true

That's my point, it is not possible no matter what. For example if no task calls alarm() or delay() system calls then hd_timerq will be 0 all the time and tcba wouldn't be mapped at all. Also it is not possible to evaluate wether sched_ticks is bigger than alarmusec at compile time. That purely depends on when you call sched_pick() in run-time. There's no way to correctly predict the value of "flag" at compile time.

You can see that if we call sched_awake (with i == pri == PRI_SYS), there's no need to run the loop.

Not true. There's another condition, namely sched_ticks. Also, you can't possibly predict the value of "pri" at compile time either. How could you possibly know in which priority queue will a task call alarm() (or for that matter that there'll be a task calling alarm() in the first place)? Both predictions on "flag" and "pri" is a mistake which the compiler shouldn't have made (and it does not if I remove the "continue" keyword).

But let's assume you're right and the compiler assumes both flag == true and i == pri == PRI_SYS. What's with the rest of the sched_pick() function when returned from sched_awake()? Even if the compiler eliminates the loop the instructions after the "found:" label should be still there! Those are doing things like switching to the new address space and calculating the next scheduler interrupt, none of which sched_awake() do.

About jumping in the middle of sched_awake, one needs to look at the disassembly in the entirely to see what the generated code actually was before concluding that jumping in the middle was actually incorrect.

You are missing the point here. Even if the rest of sched_pick() is moved into sched_awake(), the sched_pick() does not have an input argument like sched_awake(), and %rdi is not set at all before the jump. That can't be good, no matter what's in the disassembly (and it's not, it's causing randomly page faults depending on current %rdi value when I do the sched_pick() call).

I am not sure what you mean by volatile did not help. Were the loops removed even after forcing the compiler to read from memory for tcba and ccb pointers?

Yes exactly. It generated the same code even if I added "volatile" to ccb and tcba. I've also tried to add volatile to the struct field hd_timerq.

If any compiler has the issues that are being suggested here, it would be unusable.

I wouldn't call it unusable, but optimization is certainly not as mature as many developer think.

Or, your code is just the right piece to expose them

More likely. It seems I've came up with a simple, yet so complex algorithm that confuses the compiler's optimizer. (Just for the records, what you see here is a recursive scheduler in which reentrancy was replaced by a loop and a state machine, as stored in cr_active queues).

Either this, or those scientologist bastards hacked my computer again. Last time they broke into my house to gain physical access to my non-networked computer when I was at work. Thank God I had that cam installed and they haven't found it, one more rock-solid proof against them

Seriously, who in their right mind would think that breaking into a house is "fair" by any means??? Clear the planet, you can start by eliminating the cult of scientology!!!

@Octrocontrabass: yes I know that. What I meant inlining was an unwanted compiler's choice. I call sched_awake() from other parts as well, so this is not a called-only-once function, nor a small one, therefore it shouldn't be inlined at all. Same stands for sched_pick(), which is also called from assembly in the syscall handler. And there's no removed unreachable code in sched_awake() at all, it was working well when the message sender called it when a new message arrived to a blocked task. And compiler thinking that any other parts of sched_pick() were unreachable is clearly a mistake too.

Unfortunately I can't show you a complete compileable code right now because I'm in a middle of a refactoring for SMP. But when I'm finished with that I'll push to my remote repo and came back to this and try to reproduce the error and let you know. My code is quite complex, not easy to cut out this part alone (ccb and tcb are platform specific structs, for example ccb_t is just a normal struct under AArch64, but it aligns with TSS segment on x86_64). But here's source for sched_awake() if it helps, nothing special as you can see, only single and double linked list handling:

Code: Select all

void sched_awake(tcb_t *tcb)
{
  ccb_t *ccb;

  if(tcb->magic != TCB_MAGIC || tcb->priority > PRI_IDLE || tcb->cpuid >= numcores)
    return;

  ccb = (ccb_t*)CCB_ADDRESS(tcb->cpuid);

  if(tcb->state != TCB_RUNNING) {
    if(ccb->hd_blocked == tcb->pid) {
      ccb->hd_blocked = tcb->next;
    }
    if(ccb->hd_timerq == tcb->pid) {
      ccb->hd_timerq = tcb->next;
      if(ccb->hd_timerq)
        vmm_map(LDYN_tcbalarm, PID2PAGE(ccb->hd_timerq));
    }
    if(tcb->prev) {
      vmm_map(LDYN_tmp, PID2PAGE(tcb->prev));
      ((tcb_t*)LDYN_tmp)->next = tcb->next;
    }
    if(tcb->next) {
      vmm_map(LDYN_tmp, PID2PAGE(tcb->next));
      ((tcb_t*)LDYN_tmp)->prev = tcb->prev;
    }
    sched_add(tcb);
    if(ccb->id != ((ccb_t*)LDYN_ccb)->id && ccb->currentpid == idle_tcb.pid)
      platform_awakecpu(ccb->id);
  }
  ccb->cr_active[tcb->priority] = tcb->pid;
}

A little twist, sched_pick() operates on current cpu core's ccb (mapped at LDYN_ccb), while sched_awake() operates on the ccb of the cpu core on which the appointed task is running. Also if sched_awake() removes a task from the timerq, it maps the tcb of the new head at LDYN_tcbalam, which is relevant for the next call of sched_pick().

Cheers,
bzt

Posted: **Fri Mar 22, 2019 1:54 am**

bzt wrote:@Octrocontrabass: yes I know that. What I meant inlining was an unwanted compiler's choice. I call sched_awake() from other parts as well, so this is not a called-only-once function, nor a small one, therefore it shouldn't be inlined at all.

GCC saw an optimization that applies to sched_awake() only when it's called from sched_pick(), so it inlined the part it was able to optimize. Since part of sched_awake() is inlined, the function call from sched_pick() appears to jump somewhere in the middle of the function instead of the usual entry point. However, as far as GCC is aware, the end result is functionally equivalent.

In general, it's a good optimization. It just happens to cause trouble in this case due to a bug in your code.

bzt wrote:Unfortunately I can't show you a complete compileable code right now because I'm in a middle of a refactoring for SMP.

Without an example that can be compiled, I have no way to reproduce the issue you're seeing to narrow down the cause.

bzt wrote:But here's source for sched_awake() if it helps, nothing special as you can see, only single and double linked list handling:

The only thing that jumps out to me is the unusually high number of pointer casts. Accessing an object through an incompatible pointer type is undefined behavior.

Posted: **Fri Mar 22, 2019 5:17 am**

bzt wrote: your code is different to mine.

Ah. You are correct.

Given the claim that the properties about the code cannot be judged at compile time, it is worthwhile to look at what it was that the compiler did generate. Having something to experiment with will help quite a bit, even if it is not toned down for testing/other purposes.

Posted: **Fri Mar 22, 2019 1:08 pm**

Hi,

I've tested this issue with a 8.2.0 cross compiler, also with my distro's gcc 8.2.1, the results, objdumps, Makefile and sources are here:
https://github.com/bztsrc/gcc-optbug

@Octocontrabass: this was very annoying, so I've created a minimal compilable example you can try. I'm very curious if you got the same results with your gcc. If not, that means those sons of a bitches scientologists break into my home again and gained physical access to my computer.

@linuxyne: Interestingly the minimal example does not behave exactly as the entire build environment: it does not generate a jump to sched_awake, instead it creates a sched_pick.part.1 function and it does not call sched_awake() at all, not even with -O0! The difference could be caused by DEBUG=0 define in the minimal example and DEBUG=1 in my full build environment. Also note that both tcba and ccb are defined as volatile.

Code: Select all

0000000000000940 <sched_pick>:
 940:	31 c0                	xor    %eax,%eax
 942: e9 39 fa ff ff            jmpq 380 <sched_pick.part.1>

So with a slightly different result, but the issue remains, and this example proves there's definitely something wrong with the gcc optimizer.

Running diff reveals that the only difference between sched1.c and sched2.c is:

Code: Select all

$ diff sched1.c sched2.c
395d394
<                 continue;

and yet the output dumps contain different (and incorrect) code.

Cheers,
bzt

Posted: **Fri Mar 22, 2019 2:39 pm**

1000 LOC is hardly "minimal"...

Posted: **Fri Mar 22, 2019 5:37 pm**

bzt wrote:@Octocontrabass: this was very annoying, so I've created a minimal compilable example you can try. I'm very curious if you got the same results with your gcc.

I get the same results. However, your example doesn't demonstrate the bug.

GCC's optimizer is not perfect, and it can't always reduce two programs with identical semantics to the same internal representation. This is why the removal of an unneeded "continue" statement changes GCC's output. The resulting code behaves the same either way.

Your example disassembly doesn't include symbols the linker will use to fix up jump and call destinations. The call to sched_awake() is present, but not indicated by the disassembler.

Do you have a minimal example that demonstrates the bug?

Posted: **Fri Mar 22, 2019 7:31 pm**

Solar wrote:1000 LOC is hardly "minimal"...

I've narrowed it down to one single compilation object. Be my guest to reduce it further.

Cheers,
bzt

Posted: **Fri Mar 22, 2019 8:39 pm**

Hi,

Thank you for checking out.

Octocontrabass wrote:GCC's optimizer is not perfect, and it can't always reduce two programs with identical semantics to the same internal representation. This is why the removal of an unneeded "continue" statement changes GCC's output. The resulting code behaves the same either way.

That's not true. Without "continue" it's working, with "continue" it's generating page faults randomly at run-time. Wouldn't call that "behaves the same way". I know gcc's optimizer is not perfect, that was my point. Now you see that too.

Your example disassembly doesn't include symbols the linker will use to fix up jump and call destinations. The call to sched_awake() is present, but not indicated by the disassembler.

All symbols are available in the object file for the disassembler to use. Both sched_awake and sched_pick are defined in this single object file.

Anyway, my question still stands, why does a "continue" which does not alter control flow change the inlining of the generated code?

Btw, I've downloaded and compiled the latest gcc 8.3.0, and the code compiled with that works, there are no page faults, inlining or not. If this was an optimizer bug, it has been fixed since 8.2.1. If this is an UB, then gcc 8.3.0 handles that differently.

Cheers,
bzt

Posted: **Sat Mar 23, 2019 2:44 am**

bzt wrote:That's not true. Without "continue" it's working, with "continue" it's generating page faults randomly at run-time. Wouldn't call that "behaves the same way". I know gcc's optimizer is not perfect, that was my point. Now you see that too.

I don't see it. You haven't provided an example that shows the page faults.

If that 1000-line example of yours is showing random page faults, please tell me where the page faults are occurring so I know where to look.

bzt wrote:Anyway, my question still stands, why does a "continue" which does not alter control flow change the inlining of the generated code?

I already told you: because GCC's optimizer is not perfect. Even though you and I can see the "continue" statement makes no difference, GCC is not smart enough to reduce both examples to the same output. If you want to know why GCC isn't smart enough to figure that out, you'll have to ask the GCC developers.

Posted: **Sat Mar 23, 2019 4:42 am**

Thanks for the files, bzt. Some analysis below:

In sched1.o.O2.txt, 0x460 is the line where sched_awake should be called:

Code: Select all

 459:   48 c7 c7 00 10 00 80    mov    $0xffffffff80001000,%rdi; LDYN_tcbalarm
 460:   e8 00 00 00 00          callq  465 <sched_pick.part.1+0xe5>; call to sched_awake

Checking the relocation record for the offset 0x461:

Code: Select all

Relocation section '.rela.text' at offset 0xfa0 contains 38 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000461  000e00000004 R_X86_64_PLT32    00000000000001e0 sched_awake - 4

I suppose that after the relocations are applied, the instruction at 0x460 will correctly call sched_awake.

Similar situation in sched1.o.O0.txt,

Code: Select all

 a5c:   48 c7 45 e0 00 10 00    movq   $0xffffffff80001000,-0x20(%rbp); LDYN_tcbalarm
 a63:   80
 . . .
 ab7:   48 8b 45 e0             mov    -0x20(%rbp),%rax
 abb:   48 89 c7                mov    %rax,%rdi
 abe:   e8 00 00 00 00          callq  ac3 <sched_pick+0x77>; call to sched_awake

Edit: Similar situation in sched2.o.O2.txt

Code: Select all

 752:   48 c7 c7 00 10 00 80    mov    $0xffffffff80001000,%rdi; LDYN_tcbalarm
 759:   e8 00 00 00 00          callq  75e <sched_pick+0x8e>; call to sched_awake

objdump supports -r or --relocs to print the relocation information when disassembling the .o file.

It seems that both 1 and 2 generate similar code wrt the two functions in question, regardless of their difference in the 'continue' keyword. Similar logs should help make sense of gcc's outputs in the build environment, where other complexities are seen.

I doubt that the optimizer is introducing incorrect code.

Edit2: partial inlining

Posted: **Mon Mar 25, 2019 6:32 am**

Hi,

Thank you for your efforts.

As I've said

bzt wrote:Btw, I've downloaded and compiled the latest gcc 8.3.0, and the code compiled with that works, there are no page faults, inlining or not.

At this point it is pretty safe to assume that my computer has been hacked and those m*therfucker cult of scientists have tempered with my compilers. The fact that my cross-compiler was 8.2.0 supports that too, as I've already upgraded it to 8.3.0 in this year once (approx. a month ago). I believe I had written a post about the upgrade too, so we can check the date if we want. This wasn't the first time either when I experienced interesting behaviour from the compilers and the emulators and when a simple reinstall suddenly made all the mysterious problems go away.
viewtopic.php?f=11&t=24932
viewtopic.php?p=275369#p275369

One more bad point for the cultist. Let's get rid of them, make the world a better place!

Cheers,
bzt

Posted: **Mon Mar 25, 2019 7:29 am**

Only that it's hugely unlikely that the root cause actually was a compiler bug. It's more likely the update just covered up the issue... personally, I would not feel at ease until I had figured out what exactly made pre-8.3 compiles emit the faulty behavior, be it code bug or compiler bug...

OSDev.org

Finding an UB

Re: Finding an UB

Re: Finding an UB

Re: Finding an UB

Re: Finding an UB

Re: Finding an UB

Re: Finding an UB

Re: Finding an UB

Re: Finding an UB

Re: Finding an UB

Re: Finding an UB

Re: Finding an UB

Re: Finding an UB

Re: Finding an UB

Re: Finding an UB

Re: Finding an UB