It's interesting that no OSDev resource seems to have discussed this important topic before. On further inspection, a high proportion of hobby x86-64 kernels appear to be at risk of having their leaf functions' stack data silently overwritten when an interrupt fires in the wrong place.
Now to the story: somehow the monotonic PIT interrupts, triggered every millisecond, were badly corrupting my kernel state. At first, I thought the handler code might be corrupting the kernel stack, but the corruption persisted even after I reduced the handler to nothing more than acknowledging the local APIC:
Code:
push %rax
movq $(VIRTUAL(APIC_PHBASE) + APIC_EOI), %rax
movl $0,(%rax)
pop %rax
iretq
It was weird. Once I enabled interrupts and programmed the PIT to fire at a high rate, things went insane: random failed assert()s and page-fault exceptions were triggered all over the place. I minimized the handler even further by ditching the IOAPIC and using the PIC in Automatic EOI mode. This led to the architecturally minimal x86 IRQ handler:
Code:
iretq
After days and days of disassembly and hex dumps, I found that GCC generated this assembly for memcpy() at -O0:
Code:
ffffffff80109c88: 55 push %rbp
ffffffff80109c89: 48 89 e5 mov %rsp,%rbp
[snip]
/* Bochs magic breakpoint */
ffffffff80109caa: 66 87 db xchg %bx,%bx
/* Our manual software interrupt */
ffffffff80109cad: cd f0 int $0xf0
/* Failing code, especially the last line */
ffffffff80109caf: 48 8b 45 f8 mov -0x8(%rbp),%rax
ffffffff80109cb3: 0f b6 10 movzbl (%rax),%edx
ffffffff80109cb6: 48 8b 45 f0 mov -0x10(%rbp),%rax
ffffffff80109cba: 88 10 mov %dl,(%rax)
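The generated code is clearly not interrupt-safe: the locals at -0x8(%rbp) and -0x10(%rbp) are read back after the interrupt, yet nothing protects the memory they live in. Scanning the AMD64 ABI document for paragraphs that mention the stack revealed the reason: the red zone. This is a 128-byte area just below the stack pointer that the x86-64 ABI guarantees leaf functions may use without adjusting %rsp. Non-leaf functions may use it too between calls, but they must 'reserve' whatever parts they use by moving the stack pointer further down before calling another function. In kernel mode that guarantee breaks down: when an interrupt arrives without a stack switch, the CPU pushes its return frame (SS, RSP, RFLAGS, CS, RIP) right below the current %rsp, which is exactly where the red zone lives, so the locals are overwritten before the handler executes a single instruction. That is why even a bare iretq handler corrupted state.
All that was needed to fix the bug, like a magic pill, was instructing GCC not to use this interrupt-unsafe zone: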
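To make the overlap concrete, here is a minimal user-space sketch of what happens to a local stored at -0x8(%rbp). It is a toy model, not kernel code: the stack is a byte array, offsets stand in for addresses, and the names toy_stack, deliver_interrupt and local_survives are hypothetical. It assumes interrupt delivery without a stack switch, where the CPU aligns the stack pointer down to 16 bytes and pushes five 8-byte values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_BYTES  256
#define FRAME_QWORDS 5            /* SS, RSP, RFLAGS, CS, RIP */

/* Toy model of a kernel stack: offsets into mem play the role of addresses. */
struct toy_stack {
    uint8_t mem[STACK_BYTES];
    size_t  rsp;                  /* offset of the current stack top */
};

static void store64(struct toy_stack *s, size_t off, uint64_t v)
{
    memcpy(&s->mem[off], &v, sizeof v);
}

static uint64_t load64(const struct toy_stack *s, size_t off)
{
    uint64_t v;
    memcpy(&v, &s->mem[off], sizeof v);
    return v;
}

/* Model interrupt delivery on the current stack: align the stack pointer
 * down to 16 bytes, then push five qwords. The matching iretq pops them
 * again, so rsp itself is left unchanged; only the bytes stay clobbered. */
static void deliver_interrupt(struct toy_stack *s)
{
    size_t sp = s->rsp & ~(size_t)0xF;
    assert(sp >= FRAME_QWORDS * 8);
    for (int i = 0; i < FRAME_QWORDS; i++) {
        sp -= 8;
        store64(s, sp, 0xDEADBEEFDEADBEEFull);
    }
}

/* Returns 1 if a local kept at rbp - 8 survives an interrupt when the
 * function has moved rsp `reserved` bytes below rbp, and 0 otherwise. */
int local_survives(size_t reserved)
{
    struct toy_stack s;
    const uint64_t value = 0x1122334455667788ull;
    size_t rbp = 128;             /* 16-byte aligned, as after push %rbp */

    memset(s.mem, 0, sizeof s.mem);
    s.rsp = rbp - reserved;
    store64(&s, rbp - 8, value);  /* the equivalent of -0x8(%rbp) */
    deliver_interrupt(&s);
    return load64(&s, rbp - 8) == value;
}
```

In this model, local_survives(0) corresponds to a leaf function relying on the red zone, and local_survives(16) to one that first reserved its locals with sub $0x10,%rsp.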
Code:
-mno-red-zone
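One way to see what the flag changes, as a sketch: compile a small leaf function with `gcc -O0 -S`, once with and once without -mno-red-zone, and compare the prologues. The function below is a hypothetical stand-in for the memcpy() from the listing above (copy_bytes is my name, not the kernel's). Without the flag, GCC on x86-64 typically addresses the locals below %rsp without adjusting it, as in that listing. With the flag, I would expect a `sub $N,%rsp` after `mov %rsp,%rbp`, though the exact output depends on the GCC version.

```c
#include <stddef.h>

/* Hypothetical leaf function: it calls nothing, and at -O0 its arguments
 * and locals are kept in stack slots addressed relative to %rbp. */
void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Byte-by-byte copy, similar in spirit to a naive memcpy(). */
    while (n--)
        *d++ = *s++;
    return dst;
}
```

The function deliberately avoids calling anything else, since the red zone is only used this way in leaf functions.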
Everything became sane afterwards: the heavy test cases now work well at all optimization levels while the PIT is firing rapidly. I would really like to thank Brendan for advising me, when I was stuck, to investigate the issue further using the Bochs single-stepping debugger.