ISR code intermittently loops, works under debugger
Posted: Mon Oct 03, 2011 8:56 am
*takes a breath*
Hello all!
In my x86 OS project I have a number of modules semi-implemented, including a basic heap, interrupt and system call support, partial newlib support, and partial ATA and ext2 drivers. I am fairly confident that my system call and interrupt code works for the most part, but I've uncovered a bug that is very hard to track down.
PROBLEM:
My interrupt entry and exit code intermittently enters an infinite loop. I have an assembly language wrapper which includes 'isr_entry' and 'isr_restore'. 'isr_entry' pushes all the relevant registers, saves the context, and then calls the actual ISR. 'isr_restore' restores the context, pops all the registers, and then does an iret. What I find when debugging is that the code somehow enters a state where it irets back into the 'isr_restore' code. To make matters worse, this doesn't happen consistently: it behaves differently on QEMU, Bochs, and VirtualBox.
SYMPTOMS:
I have code to test my ext2 driver. The first thing it does is read the superblock, and then it tries to parse the root directory on my test drive. My infinite loop seems to occur after an ATA disk request, apparently when returning from the ISR. If I run the program in QEMU, it usually reads the first two sectors for the superblock and then loops when it tries to return from the ata_read_multiple call (after servicing the interrupts). However, sometimes when I am stepping through the program in GDB (connected to QEMU) it goes further and doesn't enter the loop. One time it got further and then looped; another time it ran to completion. So running under a debugger sometimes changes the behavior. It never gets past reading the superblock when the debugger isn't connected.
Note, my ATA driver does interrupt-driven PIO.
With VirtualBox, the code always successfully reads the superblock and the BGD table, but then locks up when trying to read the inode table. Bochs locks up similarly to QEMU, right after reading the superblock. I don't have a debugger set up for either VirtualBox or Bochs (yet).
So three emulators, three behaviors.
QUESTIONS:
I have a feeling this has to do with interrupts firing when I'm not prepared for them. Before going further, I have a few questions:
When should the kernel disable interrupts (still no processes)? In particular, when handling other interrupts. In my case, I have the ISR stub code (isr_entry, isr_restore) which handles the context and calls the actual ISR. My ATA ISR disables interrupts upon entry and enables them before leaving. Should the stub code disable interrupts? I say this because once when I was debugging the loop I found that the isr_restore code returned into the very bottom of my ATA ISR (just after the "sti" instruction). It'd be great if someone could expand on the nuances of controlling interrupts during ISR code.
How do I handle differences in test platforms? For a bug like this, the problem seems compounded by the fact that it behaves differently in my three emulators. Why would it always read the superblock in VirtualBox, but only read it sometimes and only while I am stepping through the code in QEMU? How can you sanely manage a problem like this?
I know this is very case- and implementation-specific, but in your experience, what kind of situation could cause an ISR stub to enter an infinite loop (besides some obvious programmer error)?
As an example, the disassembly of the tail of my ata_isr() function is included below. I've noticed the code loop to the pop instruction following sti. So there is this small window where, after enabling interrupts, another one can fire even though we are still in an ISR. However, when I think about it, the stack should still be OK even if that happens. Either way, this is the bottom of ata_isr, and I've watched it leave ata_isr, go to isr_restore, execute iret, and then return into the bottom of ata_isr at 0x103d2e.
I also noticed that after entering this loop the stack keeps growing downward, which is strange because I expected the opposite. If it's in a loop of returning from an ISR into itself, wouldn't the stack keep shrinking (up toward higher memory)? Maybe that's the hint: why does the stack keep growing while in this loop, when most of the time it's in the isr_restore code popping the registers?
Code:
  103d2a:  fb          sti
  103d2b:  83 c4 24    add    $0x24,%esp
  103d2e:  5b          pop    %ebx
  103d2f:  5d          pop    %ebp
  103d30:  c3          ret
DISCLAIMER:
I am not sure what I will get with this question, but I've been banging my head against the keyboard for some time now. I am pretty sure that on its own my ATA driver is working OK. I can't really see what it could do to fudge the stack so badly as to cause this loop. I am guessing the problem has to lie with interrupts firing at unexpected times. I know this problem is very hard to solve remotely with a forum post, but I would appreciate any help. Of course, feel free to ignore this as tl;dr and move on.
Thanks!