Long-mode Kernels and the AMD64 ABI 'Red Zone'
Posted: Sat Mar 20, 2010 7:46 am
It has costed me 6 days to debug and fix this, but it was really worth the effort.
It's interesting that no OSDev resources discussed this important topic before. From further inspection, there seems to be a high rate of hobby x86-64 kernels that get their leaf functions stacks silently overriden in case of an interrupt triggered in the right place.
Now to the story: somehow the montonic PIT interrupts that get triggered every 1 millisecond badly corrupted my kernel state. At first, I thought the handler code might have corrupted the kernel stack, but minimizing it only to acknowledging the local APIC: led to the same buggy behaviour.
It was weird. Once I enable interrupts and program the PIT to fire at a high rate, things go insane: random failed assert()s and page-fault exceptions get triggered all over the place. I even minimized the handler code more by ditching the IOAPIC and using the PIC in Automatic EOI mode. This has led to the absolute architecturally minimum x86 IRQ handler of: but nothing really changed: the same ugly symptomps prevailed.
After days and days of disassembly and hex dumps, I found that GCC generated this assemly for memcpy() at -O0: Noticed something? Look again, especially at how the x86 ops accessed the stack. Yes, GCC kept parts of the leaf function local state below the stack pointer. Now when the interrupt was soft-triggerd, the CPU rightfully pushed CPU counter, status word, and stack pointer (%ss, %rsp, %rflags, %cs, %rip) which meant that parts of the kernel state got corrupted through the implicit CPU stack usage!
Now certainly the generated code is interrupts unsafe. Scanning the AMD64 ABI document for any paragraphs that mentioned the stack, the reason was found: it's the red zone. The zone is a 128-byte area, below the stack, mandated by the x86-64 ABI to be safe for use to leaf functions. It's also safe for higher level functions to use before they call any other function, where they'll need to 'reserve' the used parts of the zone beforehand by moving the stack further down.
All what was needed to fix the bug, like a magic pill, was instructing GCC not to use this x86-interrupts-unsafe zone: Note that the bug was much easier to trigger at -O0 due to -O0's heavy stack usage. At -O2 and -O3, the bug got triggered with much less frequency.
Everything became sane afterwards: the heavy test cases now works well while the PIT is firing rapidly at all possible optimization levels. I would really like to thank Brendan for advising me to further investigate the issue using Bochs binary single-stepping debugger when I was stuck
It's interesting that no OSDev resources discussed this important topic before. From further inspection, there seems to be a high rate of hobby x86-64 kernels that get their leaf functions stacks silently overriden in case of an interrupt triggered in the right place.
Now to the story: somehow the montonic PIT interrupts that get triggered every 1 millisecond badly corrupted my kernel state. At first, I thought the handler code might have corrupted the kernel stack, but minimizing it only to acknowledging the local APIC:
Code: Select all
push %rax
movq $(VIRTUAL(APIC_PHBASE) + APIC_EOI), %rax
movl $0,(%rax)
pop %rax
iretq
It was weird. Once I enable interrupts and program the PIT to fire at a high rate, things go insane: random failed assert()s and page-fault exceptions get triggered all over the place. I even minimized the handler code more by ditching the IOAPIC and using the PIC in Automatic EOI mode. This has led to the absolute architecturally minimum x86 IRQ handler of:
Code: Select all
iretq
After days and days of disassembly and hex dumps, I found that GCC generated this assemly for memcpy() at -O0:
Code: Select all
ffffffff80109c88: 55 push %rbp
ffffffff80109c89: 48 89 e5 mov %rsp,%rbp
[snip]
/* Bochs magic breakpoint */
ffffffff80109caa: 66 87 db xchg %bx,%bx
/* Our manual software interrupt */
ffffffff80109cad: cd f0 int $0xf0
/* Failing code, specially last line */
ffffffff80109caf: 48 8b 45 f8 mov -0x8(%rbp),%rax
ffffffff80109cb3: 0f b6 10 movzbl (%rax),%edx
ffffffff80109cb6: 48 8b 45 f0 mov -0x10(%rbp),%rax
ffffffff80109cba: 88 10 mov %dl,(%rax)
Now certainly the generated code is interrupts unsafe. Scanning the AMD64 ABI document for any paragraphs that mentioned the stack, the reason was found: it's the red zone. The zone is a 128-byte area, below the stack, mandated by the x86-64 ABI to be safe for use to leaf functions. It's also safe for higher level functions to use before they call any other function, where they'll need to 'reserve' the used parts of the zone beforehand by moving the stack further down.
All what was needed to fix the bug, like a magic pill, was instructing GCC not to use this x86-interrupts-unsafe zone:
Code: Select all
-mno-red-zone
Everything became sane afterwards: the heavy test cases now works well while the PIT is firing rapidly at all possible optimization levels. I would really like to thank Brendan for advising me to further investigate the issue using Bochs binary single-stepping debugger when I was stuck