[Solved] Interrupt handler bug?
Hi!
I'm writing a kernel for amd64 long mode. So far only doing the basic bootstrapping (interrupts, apic, page tables etc).
I've started to see a problem though - I sometimes get strange page faults and trigger "impossible" assertions in my logic, maybe every 5th time I run through my kernel code (I'm using Bochs). Most of the time, everything seems to run exactly as I'd expect. I think I've isolated it down to the following:
- I only get the errors when compiling with -O0, never with -O2.
- I only get the errors when I'm running with interrupts enabled, never with interrupts disabled.
So my main suspicion is that I have some kind of memory/register corruption going on, that depends on the timing of how my interrupt handling runs relative to the rest of my code.
The thing is, no matter how much I've stared at my interrupt code or compared it to other sources online, I can't find anything that seems wrong. So that's why I'm hoping someone here has more wisdom and can tell if I'm doing something that would cause issues.
Oh, and the only interrupt that seems to be triggered (in both the normal case and when I get my assertions) is INT32, which would be the IRQ 0 timer interrupt, so no strange stuff going on there.
Attaching the relevant parts of my isr setup, in asm and C++.
Attachments: int.cpp (779 Bytes)
Last edited by cadaker on Wed Nov 07, 2018 12:24 pm, edited 1 time in total.
Re: Interrupt handler bug?
The System V ABI for x64 includes a red zone: the 128 bytes just below the stack pointer, which leaf functions may use without adjusting RSP. Are your interrupt handlers clobbering it?
Re: Interrupt handler bug?
Oh, I hadn't considered that. That's a good suggestion to look into, thanks!
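To make the red zone suggestion concrete, here is a toy user-space model of what can go wrong. It is only a sketch: `ToyStack`, `deliver_interrupt` and `leaf_local_survives` are made-up names, the frame values are arbitrary, and the "stack" is a plain byte vector rather than real kernel memory. It assumes an interrupt taken at CPL0 with no stack switch, where the CPU aligns RSP down to 16 bytes and pushes SS, RSP, RFLAGS, CS and RIP directly below it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy model of a downward-growing stack. Not real kernel code.
struct ToyStack {
    std::vector<std::uint8_t> mem;
    std::size_t rsp;  // index into mem playing the role of RSP

    explicit ToyStack(std::size_t size) : mem(size, 0), rsp(size) {}

    void store64(std::size_t addr, std::uint64_t v) {
        assert(addr + 8 <= mem.size());
        std::memcpy(&mem[addr], &v, 8);
    }

    std::uint64_t load64(std::size_t addr) const {
        assert(addr + 8 <= mem.size());
        std::uint64_t v;
        std::memcpy(&v, &mem[addr], 8);
        return v;
    }

    // Models interrupt delivery without a stack switch: align RSP down to
    // 16 bytes, then push SS, RSP, RFLAGS, CS, RIP (values are arbitrary).
    void deliver_interrupt() {
        std::size_t sp = rsp & ~std::size_t{0xF};
        const std::uint64_t frame[5] = {
            0x10, static_cast<std::uint64_t>(rsp), 0x202, 0x08,
            0xFFFFFFFF80001234ull};
        for (std::uint64_t v : frame) {
            sp -= 8;
            store64(sp, v);
        }
        // A real handler would run here; iretq would then restore RSP.
    }
};

// Models a leaf function holding one 8-byte local while an interrupt hits.
// With the red zone, the local lives below RSP and RSP is not adjusted.
// Without it, the function first reserves space by lowering RSP.
bool leaf_local_survives(bool red_zone_enabled) {
    ToyStack s(256);
    s.rsp = 128;
    const std::size_t local = s.rsp - 8;  // address of the local variable
    if (!red_zone_enabled) {
        s.rsp -= 16;                      // like "sub rsp, 16" in a prologue
    }
    s.store64(local, 0xDEADBEEF);
    s.deliver_interrupt();
    return s.load64(local) == 0xDEADBEEF;
}
```

In this model `leaf_local_survives(true)` returns false, because the first pushed value lands exactly on the local, while `leaf_local_survives(false)` returns true. Whether a given build actually uses the red zone in a particular function depends on the compiler and its flags.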
Re: Interrupt handler bug?
In case it helps, the usual solutions are to either disable the red zone (don't forget libgcc if you're using GCC) or have interrupts switch to a new stack.
Re: Interrupt handler bug?
You could switch to a new stack on an interrupt using an interrupt stack table. https://os.phil-opp.com/double-fault-exceptions/ is a tutorial that implements it. I know it is in Rust, but it shouldn't be too hard to convert it to C/C++.
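For a rough idea of what the C++ side of that could look like: in long mode each IDT gate is 16 bytes and carries a 3-bit IST index, where 0 means "no IST switch" and 1 to 7 select an entry of the interrupt stack table in the TSS. The sketch below only builds the descriptor. `IdtGate`, `make_gate` and `kInterruptGate` are names invented for this example, and loading the IDT or setting up the IST stacks themselves is not shown.

```cpp
#include <cassert>
#include <cstdint>

// Layout of a 64-bit IDT gate descriptor (16 bytes). All fields are
// naturally aligned here, so no packing attribute is required.
struct IdtGate {
    std::uint16_t offset_low;   // handler address bits 0..15
    std::uint16_t selector;     // kernel code segment selector
    std::uint8_t  ist;          // bits 0..2: IST index, 0 = no IST switch
    std::uint8_t  type_attr;    // present bit, DPL, gate type
    std::uint16_t offset_mid;   // handler address bits 16..31
    std::uint32_t offset_high;  // handler address bits 32..63
    std::uint32_t reserved;     // must be zero
};
static_assert(sizeof(IdtGate) == 16, "long mode IDT gates are 16 bytes");

// Present, DPL 0, type 0xE (64-bit interrupt gate).
constexpr std::uint8_t kInterruptGate = 0x8E;

// Builds a gate for the given handler address. ist_index 0 keeps the
// current stack (unless a privilege change occurs); 1..7 select IST1..IST7.
inline IdtGate make_gate(std::uint64_t handler, std::uint16_t selector,
                         unsigned ist_index) {
    assert(ist_index <= 7);
    IdtGate g{};
    g.offset_low  = static_cast<std::uint16_t>(handler & 0xFFFF);
    g.selector    = selector;
    g.ist         = static_cast<std::uint8_t>(ist_index & 0x7);
    g.type_attr   = kInterruptGate;
    g.offset_mid  = static_cast<std::uint16_t>((handler >> 16) & 0xFFFF);
    g.offset_high = static_cast<std::uint32_t>(handler >> 32);
    g.reserved    = 0;
    return g;
}
```

For example, `make_gate(handler_address, 0x08, 1)` would describe a handler that runs on IST1, assuming 0x08 is your kernel code selector.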
Re: Interrupt handler bug?
Not getting any issues anymore after rebuilding with -mno-red-zone, so this was very probably the issue.
Also realized that I'm not linking in libgcc, so I should get some proper interrupt stacks in soon, I guess, and fix this properly.
Thanks a lot everyone!
Re: [Solved] Interrupt handler bug?
Ugh, ISTs... I dislike them. You can register up to 7 interrupt stacks (it's a 3-bit field, and 0 has a special meaning), but you are going to want to handle dozens if not hundreds of interrupts. Not to speak of exceptions. So you can't use one stack for each interrupt. So what then? You also can't nest two interrupts that use the same IST, as those would clobber the stack. Whereas good old stack switching just adds the new frame to the stack.
I would seriously suggest rebuilding libgcc with special kernel options (-mno-red-zone because interrupts can always occur in kernel mode, -mcmodel=kernel in order to use the correct relocations for code that will run at -2GB, -msoft-float to prevent the use of floating point registers or SSE in kernel mode) and applying those to your kernel as well. This way, there is no need for ISTs. Well, almost no need. Some exceptions can happen at any time. Now, I have decided to ditch ISTs entirely, and therefore also need to ditch the "syscall" instruction, as that one does not switch stacks. But if you choose to use syscall, you will have times when you are at CPL0 with an invalid RSP value. And for those you might want to consider ISTs. But then you have to be careful not to nest anything, which can be a challenge for NMIs, for instance, as those are triggered externally, beyond your control. The CPU doesn't recognize further NMIs until the next "iret", but for one, this could be the iret from an exception handler, for an exception caused while handling the NMI, and for two, this could be an iret from firmware executing in system management mode, which you couldn't see, because SMM is invisible to you.
Another canonical example is double fault exceptions. But why though? All exceptions can happen at any time, anyway.
Carpe diem!
Re: [Solved] Interrupt handler bug?
Hmm, yeah, reading up on this, I guess you're right. I mean, I'd still need the IST setup to be interrupt safe w.r.t. red-zones in userspace, but there really doesn't seem to be any good way to be interrupt-safe in this way inside the kernel. How... disappointing.
Re: [Solved] Interrupt handler bug?
No, you don't need them for that. Userspace typically runs at CPL3, whereas any interrupts run at CPL0. So the stack is already switched to the CPL0 stack if a privilege change occurs. You only need to set up the TSS. The red zone only becomes important for signal handling.
Carpe diem!
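As a sketch of the TSS setup mentioned above: on an interrupt that changes privilege from CPL3 to CPL0, the CPU loads RSP from the RSP0 field of the 64-bit TSS, so that field has to point at the top of a valid kernel stack. The layout below follows the 104-byte long mode TSS. `Tss64` and `make_tss` are names made up for this example, the packing attribute assumes GCC or Clang, and installing the TSS descriptor in the GDT and executing `ltr` are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 64-bit TSS layout. It must be packed because the 64-bit fields
// start at offset 4 (GCC/Clang attribute syntax assumed).
struct __attribute__((packed)) Tss64 {
    std::uint32_t reserved0;
    std::uint64_t rsp[3];      // RSP0..RSP2: stacks loaded on privilege change
    std::uint64_t reserved1;
    std::uint64_t ist[7];      // IST1..IST7: optional interrupt stacks
    std::uint64_t reserved2;
    std::uint16_t reserved3;
    std::uint16_t iomap_base;  // offset of the I/O permission bitmap
};
static_assert(sizeof(Tss64) == 104, "64-bit TSS is 104 bytes");
static_assert(offsetof(Tss64, rsp) == 4, "RSP0 lives at offset 4");
static_assert(offsetof(Tss64, ist) == 36, "IST1 lives at offset 36");

// Builds a TSS whose RSP0 points at the top of the kernel stack used for
// interrupts arriving from CPL3. All IST entries are left unused (zero).
inline Tss64 make_tss(std::uint64_t kernel_stack_top) {
    Tss64 tss{};
    tss.rsp[0] = kernel_stack_top;
    // Pointing iomap_base at the end of the TSS means "no I/O bitmap".
    tss.iomap_base = static_cast<std::uint16_t>(sizeof(Tss64));
    return tss;
}
```

With a single kernel stack per CPU this is enough for interrupts from userspace; with one kernel stack per thread, RSP0 would presumably need updating on every context switch.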