
ISR code intermittently loops, works under debugger

Posted: Mon Oct 03, 2011 8:56 am
by tomos
*takes a breath*

Hello all!

In my x86 OS project I have a number of modules semi-implemented including a basic heap, interrupt and system call support, partial newlib support, as well as partial ATA and ext2 drivers. Now I am quite confident that my system call and interrupt code is working for the most part, but I've uncovered a very hard to track down bug.

PROBLEM:

My interrupt entry and exit code intermittently enters an infinite loop. I have an assembly language wrapper which includes 'isr_entry' and 'isr_restore'. 'isr_entry' pushes all the relevant registers, saves the context, and then calls the actual ISR. 'isr_restore' restores the context, pops all the registers, and then does an iret. What I find when debugging is that the code somehow enters a state where it irets back into the 'isr_restore' code. To make matters worse this doesn't happen consistently - it seems to behave differently on Qemu, Bochs, and VirtualBox.

SYMPTOMS:

I have code to test my ext2 driver, and the first thing it does is read the superblock and then try to parse the root directory on my test drive. My infinite loop seems to occur after an ATA disk request, apparently when returning from the ISR. If I run the program in QEMU it usually reads the first two sectors for the superblock and then loops when it tries to return from the ata_read_multiple call (after servicing the interrupts). However, sometimes when I am stepping through the program in GDB (connected to QEMU) it goes further and doesn't enter the loop. One time it got further and then looped, another time it ran to completion. So I've seen an apparent difference sometimes when running through a debugger. It never gets past reading the superblock when the debugger isn't connected.

Note, my ATA driver does interrupt-driven PIO.

With VirtualBox, the code always successfully reads the superblock and the BGD table, but then locks up when trying to read the inode table. Bochs locks up similarly to QEMU, right after reading the superblock. I don't have a debugger set up for either VirtualBox or Bochs (yet).

So three emulators, three behaviors.

QUESTIONS:

I have a feeling this has to do with interrupts firing when I'm not prepared for them. Before going further, I have a few questions:

When should the kernel disable interrupts (still no processes)? In particular, when handling other interrupts. In my case, I have the ISR stub code (isr_entry, isr_restore) which handles the context and calls the actual ISR. My ATA ISR disables interrupts upon entry and enables them before leaving. Should the stub code disable interrupts? I say this because once when I was debugging the loop I found that the isr_restore code returned into the very bottom of my ATA ISR (just after the "sti" instruction). It'd be great if someone could expand on the nuances of controlling interrupts during ISR code.

How do I handle differences in test platforms? For a bug like this, the problem seems compounded by the fact that it behaves differently in my three emulators. Why would it always read the superblock in VirtualBox, but only read it sometimes and only while I am stepping through the code in QEMU? How can you sanely manage a problem like this?

I know this is very case and implementation specific, but from your experiences what kind of situation could cause an ISR stub to enter an infinite loop? (perhaps besides some obvious programmer error)

As an example, here is the disassembly of the end of my ata_isr() function. I've noticed the code looping back to the pop instruction following sti. So there is this small window where, after enabling interrupts, another one can fire even though we are still in an ISR. However, when I think about it, the stack should still be OK even if that happens. Either way, this is the bottom of ata_isr, and I've watched it leave ata_isr, go to isr_restore, execute iret, and then return into the bottom of ata_isr at 0x103d2e.

Code: Select all

  103d2a:       fb                      sti
  103d2b:       83 c4 24                add    $0x24,%esp
  103d2e:       5b                      pop    %ebx
  103d2f:       5d                      pop    %ebp
  103d30:       c3                      ret

I also noticed that after entering this loop the stack keeps growing downward, which is strange because I expected the opposite. If it's in a loop of returning from an ISR into itself, wouldn't the stack keep shrinking (up toward higher memory)? Maybe that's the hint: why does the stack keep growing while in this loop, when most of the time it's in the isr_restore code popping the registers?

DISCLAIMER:

I am not sure what I will get with this question, but I've been banging my head against the keyboard for some time now. I am pretty sure that on its own my ATA driver is working OK. I can't really see what it could do to fudge the stack so badly as to cause this loop. I am guessing the problem has to lie with interrupts firing at unexpected times. I know this problem is very hard to solve remotely with a forum post, but I would appreciate any help. Of course, feel free to ignore this as tl;dr and move on.

Thanks!

Re: ISR code intermittently loops, works under debugger

Posted: Mon Oct 03, 2011 9:37 am
by rdos
The most common cause of interrupts re-firing is that the cause for them is not removed. This is hardware specific. If you don't read certain registers in the ISR, hardware will not deassert the interrupt, and as soon as you do an EOI, a new interrupt happens, locking your system up.
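
As a rough illustration for the ATA case discussed here: on the legacy PC layout, the primary channel's status register is at port 0x1F7 and reading it clears the drive's pending interrupt, so the read has to happen before the EOI. The sketch below assumes the 8259 PIC with the drive on IRQ 14 (hence an EOI to both the slave and the master). The `inb`/`outb` helpers and `ata_irq14_ack` are hypothetical names, and the helpers are stubbed to log port accesses so the ordering can be checked in user space; in a kernel they would wrap the `in`/`out` instructions.

```c
#include <stdint.h>
#include <stddef.h>

/* Standard legacy PC port numbers. */
#define ATA_PRIMARY_STATUS 0x1F7  /* reading this clears the drive's pending IRQ */
#define PIC1_COMMAND       0x20   /* master 8259 */
#define PIC2_COMMAND       0xA0   /* slave 8259 (IRQ 8-15) */
#define PIC_EOI            0x20

/* Hypothetical port I/O stubs: they only log which ports were touched. */
#define PORT_LOG_MAX 16
static uint16_t port_log[PORT_LOG_MAX];
static size_t port_log_len;

static void log_port(uint16_t port)
{
    if (port_log_len < PORT_LOG_MAX)
        port_log[port_log_len++] = port;
}

static uint8_t inb(uint16_t port)
{
    log_port(port);
    return 0x58; /* made-up status value: DRDY | DSC | DRQ */
}

static void outb(uint16_t port, uint8_t value)
{
    (void)value;
    log_port(port);
}

/* Remove the cause of the interrupt first, then tell the PICs we are done. */
static uint8_t ata_irq14_ack(void)
{
    uint8_t status = inb(ATA_PRIMARY_STATUS); /* drive deasserts INTRQ */
    outb(PIC2_COMMAND, PIC_EOI);              /* IRQ 14 lives on the slave */
    outb(PIC1_COMMAND, PIC_EOI);              /* the master needs an EOI too */
    return status;
}
```

Whether the EOI belongs in the handler itself or in a common exit stub is a separate design choice; the only point of the sketch is the order of the status read relative to the EOI.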

Re: ISR code intermittently loops, works under debugger

Posted: Mon Oct 03, 2011 10:12 am
by tomos
The interrupts aren't re-firing, but my stack somehow gets messed up when one fires unexpectedly. When I loop, it's not because interrupts keep firing and I keep calling the handler; at some point when returning from an ISR I return back into the ISR code itself. It has to be something with my stack, but I am not sure where it is occurring.

Re: ISR code intermittently loops, works under debugger

Posted: Mon Oct 03, 2011 1:30 pm
by gerryg400
Generally, I think it's better to have all interrupts disabled during your interrupt_entry and more importantly your interrupt_exit code. This allows the stack to unwind before the next interrupt and should make problems like this easier to track down. On x86 the CPU can disable interrupts on entry, but I guess you already do that.

During the handling of an interrupt the PIC should prevent the interrupt that you are servicing from re-occurring. You just need to not ACK the PIC until you're completely done and have disabled global interrupts again.

I do something like this when an interrupt occurs. Note that at step 0 the CPU does a CLI and at step 9 it does an implied STI.

Code: Select all

0. Interrupt occurs and CLI is done by the CPU.
1. save context
2. increase nesting level
3. if (coming-from-user-space) switch stacks

4. STI (enable global interrupts)
        5. call interrupt handlers (like c functions)
        6. handlers return
7. CLI (disable global interrupts)

8. Ack PIC
9. if (nested) 
      IRET
   else
      schedule and return to user
Of course sometimes the handler itself may need to disable interrupts for a short time but that's a separate issue.
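
The step list above can be sketched in C roughly as follows. This is only a model of the ordering, not real kernel code: `enable_interrupts`, `disable_interrupts`, `pic_ack`, `interrupt_common` and `example_handler` are hypothetical names, and the primitives just record a trace so the sequence can be checked in user space. Step 3 (the stack switch) is left out because there is no user space yet in the OS being discussed.

```c
#include <stddef.h>

enum step {
    STEP_SAVE_CONTEXT, STEP_STI, STEP_HANDLER,
    STEP_CLI, STEP_ACK_PIC, STEP_IRET, STEP_SCHEDULE
};

#define TRACE_MAX 32
static enum step trace[TRACE_MAX];
static size_t trace_len;
static int nesting_level; /* 0 while no interrupt is being handled */

static void record(enum step s)
{
    if (trace_len < TRACE_MAX)
        trace[trace_len++] = s;
}

/* Hypothetical stand-ins for the real primitives (sti, cli, EOI write). */
static void enable_interrupts(void)  { record(STEP_STI); }
static void disable_interrupts(void) { record(STEP_CLI); }
static void pic_ack(int irq)         { (void)irq; record(STEP_ACK_PIC); }

typedef void (*irq_handler_t)(int irq);

/* Placeholder for a device handler such as the ATA one. */
static void example_handler(int irq) { (void)irq; }

static void interrupt_common(int irq, irq_handler_t handler)
{
    /* Step 0 happened in hardware: the CPU cleared IF on entry. */
    record(STEP_SAVE_CONTEXT);   /* step 1 */
    nesting_level++;             /* step 2 */
    enable_interrupts();         /* step 4 */
    record(STEP_HANDLER);
    handler(irq);                /* steps 5-6 */
    disable_interrupts();        /* step 7 */
    pic_ack(irq);                /* step 8: only after interrupts are off */
    nesting_level--;
    /* Step 9: a nested interrupt just returns; the outermost one may schedule. */
    record(nesting_level > 0 ? STEP_IRET : STEP_SCHEDULE);
}
```

Calling `interrupt_common(14, example_handler)` from the outermost level records save-context, STI, handler, CLI, PIC ack and then the schedule path, and leaves `nesting_level` back at 0.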

[Edit] Note, I feel your pain with this type of bug. I recently spent a week staring at my code looking for an error before I discovered that GCC was optimising my code and swapping steps 7 and 8 so that my ISR was re-entering.

Re: ISR code intermittently loops, works under debugger

Posted: Tue Oct 04, 2011 6:52 am
by tomos
Ok, having narrowed things down as much as I have I don't know why I didn't try this. I arranged my ISR stub similar to yours gerry - disabled interrupts directly on entry, left them off for the ISR, and only enabled them right before the IRET. It's not what I want, but it's allowing me to continue and fix other things.

Theoretically, shouldn't ISR stubs be able to handle interrupts, even as they are managing the stack and context? I was thinking that interrupts should only be disabled within ISRs which really can't be interrupted. Ideally.

For now, I am CLI'ing on entry and STI'ing on exit, and there are no more hangs for the moment. Just a new problem with the ext2 driver.

I might start peeking at other people's projects to get an idea of how they handle the situation...

Thanks for the input!

Re: ISR code intermittently loops, works under debugger

Posted: Tue Oct 04, 2011 2:25 pm
by gerryg400
tomos wrote: "...only enabled them right before the IRET"
You should be enabling them during IRET. Specifically, you should leave interrupts disabled and allow IRET to pop an eflags image that has the IF bit set.
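
A minimal sketch of what such an eflags image looks like, assuming a 32-bit same-privilege return (so IRET pops only EIP, CS and EFLAGS). IF is bit 9 of EFLAGS and bit 1 is reserved and always set, which gives the familiar value 0x202. The struct and function names are hypothetical, and the 0x08 selector in the usage note is just the common flat-kernel convention, not something taken from this thread.

```c
#include <stdint.h>

#define EFLAGS_RESERVED1 (1u << 1) /* bit 1 is reserved and always 1 */
#define EFLAGS_IF        (1u << 9) /* interrupt enable flag */

/* What IRET pops for a same-privilege return, lowest address first. */
struct iret_frame {
    uint32_t eip;
    uint32_t cs;
    uint32_t eflags;
};

/* Build a frame whose eflags image has IF set, so interrupts come back on
 * atomically as part of the IRET instead of through an earlier STI. */
static struct iret_frame make_initial_frame(uint32_t entry, uint16_t code_selector)
{
    struct iret_frame f;
    f.eip = entry;
    f.cs = code_selector;
    f.eflags = EFLAGS_RESERVED1 | EFLAGS_IF;
    return f;
}
```

For example, `make_initial_frame(0x100000, 0x08)` yields a frame with `eflags == 0x202`. For an ordinary interrupt return nothing needs to be built at all: the frame the CPU pushed on entry already holds the interrupted code's eflags, with IF set if interrupts were enabled at that point.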

Note that the CLI at my step 0 and the STI at my step 9 are both automatically done by the processor.

Notice also that I have a 'nesting_level' variable per core. It is set to 0 when the cpu goes to ring 3. It is incremented every time an interrupt occurs and decremented just before IRET. It is very useful for debugging.

I allow interrupts to nest (higher priority interrupt can interrupt lower priority one) and the APIC timer interrupt can occur on 2 cores at once but I don't allow for example interrupt 14 to interrupt itself or to occur on 2 cores at once. I let the (A)PIC prevent that.

Re: ISR code intermittently loops, works under debugger

Posted: Wed Oct 05, 2011 7:16 am
by tomos
Wow, again, something that I should have caught. That's how I enable interrupts when I first load the kernel's context (sort of as a single process system atm). It was indeed working this way, but only by a matter of luck. Letting IRET pop eflags is certainly the way to go.

As for disabling interrupts automatically, I was fuzzy on this so I hit the Intel Manuals, and of course in Volume 1, section 6.4.1:
The difference between an interrupt gate and a trap gate is as follows. If an interrupt
or exception handler is called through an interrupt gate, the processor clears the
interrupt enable (IF) flag in the EFLAGS register to prevent subsequent interrupts
from interfering with the execution of the handler. When a handler is called through
a trap gate, the state of the IF flag is not changed.
So by using an interrupt gate the processor automatically clears the IF flag. Good to know.
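
In IDT terms the difference is a single bit in the gate's type field: 0xE is a 32-bit interrupt gate and 0xF is a 32-bit trap gate. A small sketch of encoding an entry, assuming protected mode with 8-byte gate descriptors; `struct idt_gate` and `make_gate` are hypothetical names, not from my code.

```c
#include <stdint.h>

#define GATE_PRESENT     0x80
#define GATE_INTERRUPT32 0x0E /* CPU clears IF on entry */
#define GATE_TRAP32      0x0F /* IF is left unchanged */

/* 32-bit IDT gate descriptor (8 bytes, naturally aligned fields). */
struct idt_gate {
    uint16_t offset_low;  /* handler address bits 0-15 */
    uint16_t selector;    /* code segment selector */
    uint8_t  zero;        /* must be 0 */
    uint8_t  type_attr;   /* P | DPL | gate type */
    uint16_t offset_high; /* handler address bits 16-31 */
};

static struct idt_gate make_gate(uint32_t handler, uint16_t selector,
                                 uint8_t type, uint8_t dpl)
{
    struct idt_gate g;
    g.offset_low  = (uint16_t)(handler & 0xFFFFu);
    g.selector    = selector;
    g.zero        = 0;
    g.type_attr   = (uint8_t)(GATE_PRESENT | ((dpl & 0x3u) << 5) | (type & 0x0Fu));
    g.offset_high = (uint16_t)(handler >> 16);
    return g;
}
```

With DPL 0 this gives a `type_attr` of 0x8E for an interrupt gate and 0x8F for a trap gate, so hardware IRQ stubs would normally be installed with `GATE_INTERRUPT32`.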

Thanks a ton for the pointers and for helping spur my thought process (I was certainly in a rut). I am going to continue to pore over the manuals and decide how I could implement a similar nested interrupt handling mechanism.