a disgusting bug
I have been debugging this for more than 7 hours, and I can't solve it.
My kernel runs well on QEMU and Bochs.
But on a real machine, it never responds to the same interrupt a second time.
For example, after it handles one timer interrupt, it never handles a timer interrupt again. It can still handle a network card interrupt, but after receiving one network packet, it cannot receive a second one.
I am sure I have configured the 8259A correctly, including the EOI.
One strange thing I found is that whenever the kernel traps into the handler because of an interrupt, the ISR value is zero!
Even more strangely, the IMR bit corresponding to that interrupt has been set to 1!
For example:
Before the timer interrupt (IRQ 0) occurs, IMR = 0xf3f8; afterwards it is 0xf3f9.
Before the keyboard interrupt (IRQ 1) occurs, IMR = 0xf3f9; afterwards it is 0xf3fb.
Before the NIC interrupt (IRQ 10) occurs, IMR = 0xf3fb; afterwards it is 0xf7fb.
I promise I never touched the IMR after initializing it!
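To make the pattern in those three IMR snapshots explicit, here is a minimal sketch in plain C (nothing hardware-specific) that XORs two snapshots and reports which IRQ's mask bit changed. The helper names `imr_changed_irq` and `imr_is_masked` are made up for illustration and are not part of the kernel above. It assumes the 16-bit layout used in the post: slave PIC in the high byte, master PIC in the low byte.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns the IRQ number whose mask bit differs between
 * two combined IMR snapshots (slave in the high byte, master in the low byte),
 * or -1 if no bit or more than one bit changed. */
int imr_changed_irq(uint16_t before, uint16_t after)
{
    uint16_t diff = before ^ after;
    if (diff == 0 || (diff & (diff - 1)) != 0)
        return -1;              /* zero bits or several bits differ */
    int irq = 0;
    while ((diff >> irq) != 1)  /* find the position of the single set bit */
        irq++;
    return irq;
}

/* Hypothetical helper: 1 if `irq` is masked (bit set) in the snapshot. */
int imr_is_masked(uint16_t imr, unsigned irq)
{
    assert(irq < 16);
    return (imr >> irq) & 1;
}
```

Applied to the values above, each pair differs in exactly one bit, and that bit is the IRQ that was just delivered (0, 1 and 10), which matches the description that each interrupt ends up masked after its first occurrence.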
Re: a disgusting bug
It sounds as if you are not acknowledging the interrupts. But who knows - without seeing your code people might as well roll dice to get the answer.
Re: a disgusting bug
iansjack wrote:It sounds as if you are not acknowledging the interrupts. But who knows - without seeing your code people might as well roll dice to get the answer.
OK.
【1】 init8259A(int mask)
This function is implemented in assembly. Note that it's a little ugly: its argument @mask applies only to the master 8259A chip, and the slave chip's mask is hard-coded.
Code: Select all
init8259A:
    mov al, 11h
    out 20h, al         ; send ICW1 to 0x20 [ICW4 needed]
    iodelay
    out 0a0h, al        ; send ICW1 to 0xa0
    iodelay
    mov al, 20h
    out 21h, al         ; send ICW2 to 0x21 [IRQ0 = vector 0x20]
    iodelay
    mov al, 28h
    out 0a1h, al        ; send ICW2 to 0xa1 [IRQ8 = vector 0x28]
    iodelay
    mov al, 4
    out 21h, al         ; send ICW3 to 0x21 [slave chip attached at IR2]
    iodelay
    mov al, 2
    out 0a1h, al        ; send ICW3 to 0xa1 [slave's cascade identity is 2]
    iodelay
    mov al, 1
    out 21h, al         ; send ICW4 to 0x21 [80x86 mode, normal EOI]
    iodelay
    out 0a1h, al        ; send ICW4 to 0xa1
    iodelay
    ; fetch the argument @mask
    mov al, [esp+4]
    out 21h, al         ; OCW1: master IMR = @mask
    iodelay
    mov al, 11110011b
    out 0a1h, al        ; OCW1: slave IMR, hard-coded (IRQ 10 and 11 unmasked)
    iodelay
    ret
Code: Select all
add esp,4
push 11111000b
call init8259A
add esp,4
Code: Select all
unsigned do_IRQ(stack_frame regs){
    int err_code = regs.err_code + 256;
    int irq = err_code - 0x20;
    oprintf(" !%u ", irq);
    u16 isr = pic_get_isr();
    u16 imr = read_imr_of8259();
    oprintf("imr: %x, isr: %x\n", imr, isr);
    return;
}
Code: Select all
imr:0xf3f9, isr:0
imr:0xf3fb, isr:0
imr:0xf7fb, isr:0
The third line corresponds to a NIC interrupt.
I did try sending an EOI, but it changed nothing.
Have you noticed that the IMR and ISR values are obviously wrong?
The following are some helper functions:
【read_imr_of8259()】
Code: Select all
;read_imr_of8259(void)
    xor eax, eax
    in al, 0a1h         ; read the slave IMR
    mov ah, al          ; keep it in the high byte
    in al, 21h          ; read the master IMR into the low byte
    ret
Code: Select all
void mask_and_ack_8259A(u32 irq){
if(irq >= 8) out_byte(0xa0, 0x20);
out_byte(0x20, 0x20);
}
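For reference, the EOI rule that `mask_and_ack_8259A` follows can be checked without hardware. The sketch below replaces the kernel's `out_byte` with a hypothetical stand-in that only records writes (an assumption made purely so the logic can run in user space), and `pic_send_eoi` is an illustrative name, not a function from the kernel above. The point it shows: an IRQ routed through the slave needs an EOI on both chips, slave first, while a master IRQ needs only one.

```c
#include <assert.h>
#include <stdint.h>

#define PIC1_CMD 0x20   /* master 8259A command port */
#define PIC2_CMD 0xa0   /* slave 8259A command port */
#define PIC_EOI  0x20   /* OCW2: non-specific EOI */

/* Hypothetical stand-in for the kernel's out_byte(): it records each write
 * in a small log instead of touching an I/O port. */
struct port_write { uint16_t port; uint8_t value; };
struct port_write port_log[8];
unsigned port_log_len;

void out_byte(uint16_t port, uint8_t value)
{
    assert(port_log_len < 8);
    port_log[port_log_len].port = port;
    port_log[port_log_len].value = value;
    port_log_len++;
}

/* Acknowledge `irq`: slave first (if the IRQ came through it), then master. */
void pic_send_eoi(unsigned irq)
{
    assert(irq < 16);
    if (irq >= 8)
        out_byte(PIC2_CMD, PIC_EOI);
    out_byte(PIC1_CMD, PIC_EOI);
}
```

Note that this only acknowledges the interrupt. Despite its name, the function in the post does not mask anything either, and the `do_IRQ` shown earlier does not call it at all.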
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
Re: a disgusting bug
I spy JamesM's known bugs. I do not trust your printf type errors and the fact that your printed output doesn't actually match the code, and I also do not believe anything that says irq = error_code + 224.
Re: a disgusting bug
Combuster wrote:I spy JamesM's known bugs. and I also do not believe anything that says irq = error_code + 224.
Excuse me, I don't understand what you mean. This is the interrupt entry point in my kernel (just like Linux):
Code: Select all
...
; timer interrupt
i20h:
    push 0x20 - 256
    jmp common_interrupt
; keyboard interrupt
i21h:
    push 0x21 - 256
    jmp common_interrupt
.....
Code: Select all
common_interrupt:
    SAVE_ALL
    push ret_from_intr
    jmp do_IRQ
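The arithmetic behind "irq = error_code + 224" can be spelled out as a small sketch. It assumes, as the stubs above suggest, that each stub pushes its vector number minus 256 and that the master PIC's vector base is 0x20 (the ICW2 value from init8259A). The function names are hypothetical and only illustrate the encoding.

```c
#include <assert.h>
#include <stdint.h>

#define FIRST_IRQ_VECTOR 0x20   /* ICW2 base programmed into the master PIC */

/* What a stub such as i20h pushes: the vector biased by -256, so the value
 * on the stack is always negative. */
int32_t stub_pushed_value(uint8_t vector)
{
    return (int32_t)vector - 256;
}

/* What do_IRQ computes: undo the bias, then subtract the vector base.
 * Net effect: irq = pushed + 256 - 0x20 = pushed + 224. */
int irq_from_pushed_value(int32_t pushed)
{
    int vector = pushed + 256;
    assert(vector >= FIRST_IRQ_VECTOR && vector < FIRST_IRQ_VECTOR + 16);
    return vector - FIRST_IRQ_VECTOR;
}
```

For the timer stub the pushed value is 0x20 - 256 = -224, so the result is IRQ 0. Vector 0x2a (the NIC in this thread) comes out as IRQ 10.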
Re: a disgusting bug
I think I have found some clues; at least, I have made the bug go away.
Everything became normal when I commented out a code snippet in the file "kernel.c":
Code: Select all
__asm__ __volatile__(
    "movl $0xc4500,0xfee00300\n\t"
    "mov $0xffffffff,%ecx\n\t"
    "shr $10,%ecx\n\t"
    "delay:inc %eax\n\t"
    "loop delay\n\t"
    "movl $(0xc4600+0x8),%eax\n\t"
    "movl %eax,0xfee00300\n\t"
);
In a briefer form, it amounts to:
Code: Select all
__asm__ __volatile__(
    "movl $0xc4500,0xfee00300\n\t"
    "movl $(0xc4600+0x8),%eax\n\t"
    "movl %eax,0xfee00300\n\t"
);
The Intel APIC registers are mapped to the memory region starting at 0xfee00000. The register at offset 0x300 is APIC_ICR_LOW.
The code above tries to broadcast a SIPI in an SMP environment.
This short snippet is old code from the early days of writing this kernel (at that time I tried to write a multi-core kernel, and gave up later), and I forgot to delete it. Hence this bug...
But I don't think this is a fully certain explanation. This may not be the end of it.
The APIC can interfere with the traditional PIC (i.e. the 8259A), that's true, but I didn't enable the APIC at all! The code snippet above is the only place where I touch the APIC.
The second thing that confuses me is that this snippet has existed in my kernel for a long time and lived in peace with my single-core kernel. It only suddenly (I hope that's all it is) made my kernel behave strangely. I regret that I didn't back up the last good version.
Anyway, I have decided to check all my low-level code written early on; the bug caused by hardware is really horrible.
I have wasted 3 days on this bug.
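To see what those two writes ask the local APIC to do, the constants can be split into the ICR fields. This is a sketch based on the ICR layout in the Intel SDM as I understand it. The struct and function names are hypothetical, only a few fields are decoded, and it says nothing about why the writes disturbed the 8259A on this particular machine.

```c
#include <assert.h>
#include <stdint.h>

/* Selected fields of the low dword of the local APIC Interrupt Command
 * Register (offset 0x300 from the APIC base). */
struct icr_low {
    unsigned vector;        /* bits 0-7 */
    unsigned delivery_mode; /* bits 8-10: 5 = INIT, 6 = Start-Up (SIPI) */
    unsigned level_assert;  /* bit 14 */
    unsigned shorthand;     /* bits 18-19: 3 = all excluding self */
};

struct icr_low decode_icr_low(uint32_t value)
{
    struct icr_low f;
    f.vector        = value & 0xff;
    f.delivery_mode = (value >> 8) & 0x7;
    f.level_assert  = (value >> 14) & 0x1;
    f.shorthand     = (value >> 18) & 0x3;
    return f;
}

/* For a Start-Up IPI, the target CPUs begin executing at vector * 0x1000. */
uint32_t sipi_start_address(unsigned vector)
{
    assert(vector <= 0xff);
    return (uint32_t)vector << 12;
}
```

Decoded this way, 0xc4500 is an INIT IPI sent to all processors except the sender, and 0xc4600+0x8 is a Start-Up IPI with vector 8, which would start the other processors at physical address 0x8000. On a multi-core real machine those processors would therefore begin executing whatever happens to be at that address, whereas an emulator configured with a single CPU has no other processors to wake.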
Last edited by miaowei on Mon Jul 04, 2016 8:35 am, edited 1 time in total.
- Combuster
Re: a disgusting bug
miaowei wrote:the bug caused by hardware
It's certainly not the hardware's fault. More likely, it's your style of programming that would most aptly be described as Cargo cult, and I have no idea where to start commenting on all the code snippets you consider correct when they're unnecessarily awkward at minimum and obviously bugged half the time.
At some point I put the words "printf type errors" together. For starters, do you know what a "type error" is, and how it might relate to printf?
Re: a disgusting bug
Combuster wrote:
miaowei wrote:the bug caused by hardware
It's certainly not the hardware's fault.
Yes, it's not the hardware's fault. I didn't mean that; I wanted to say "bottom layer issue is hard to debug".
I think the next time I write a post, I should pay more attention to my English. I believe many words in my post implied things I didn't intend.
I am sorry.