a disgusting bug

miaowei · Post by **miaowei** » Sun Jul 03, 2016 1:06 am

I have been debugging this for more than 7 hours, and I still can't solve it.
My kernel runs well on QEMU and Bochs.
But on a real machine, it never responds to the same interrupt a second time.
For example, after it handles one timer interrupt, it never responds to a timer interrupt again. It can still respond to a network card interrupt, but after receiving one network packet, it cannot receive a second one.

I am confident I have configured the 8259A correctly, including sending EOI.

A strange thing I found is that whenever the CPU traps into the kernel due to an interrupt, the ISR value is zero!
Even more strangely, the IMR bit corresponding to that interrupt is set to 1!
For example:
Before the timer interrupt (IRQ 0) occurs, IMR = 0xf3f8; afterwards it is 0xf3f9.
Before the keyboard interrupt (IRQ 1) occurs, IMR = 0xf3f9; afterwards it is 0xf3fb.
Before the NIC interrupt (IRQ 10) occurs, IMR = 0xf3fb; afterwards it is 0xf7fb.
I promise I never touched the IMR after initializing it!
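
Each of the three IMR transitions above differs by exactly one bit, and that bit is the IRQ that just fired. A minimal C sketch of that check, assuming the 16-bit layout used in this post (slave mask in the high byte, master mask in the low byte). The helper name `newly_masked_irq` is made up for illustration and is not part of the kernel discussed here:

```c
#include <stdint.h>

/* Hypothetical helper: given two 16-bit IMR snapshots (slave 8259A in the
 * high byte, master in the low byte), return the single IRQ number whose
 * mask bit went from 0 to 1, or -1 if no bit or several bits changed. */
int newly_masked_irq(uint16_t before, uint16_t after)
{
    uint16_t set = (uint16_t)(after & ~before);   /* bits that became 1 */

    if (set == 0 || (set & (set - 1)) != 0)       /* none, or more than one */
        return -1;

    int irq = 0;
    while ((set & 1u) == 0) {                     /* find the bit position */
        set >>= 1;
        irq++;
    }
    return irq;
}
```

Fed the values from this post, the helper reports IRQ 0, IRQ 1 and IRQ 10 respectively, which is at least consistent with something masking each line as it is serviced.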

iansjack · Post by **iansjack** » Sun Jul 03, 2016 2:41 am

It sounds as if you are not acknowledging the interrupts. But who knows - without seeing your code people might as well roll dice to get the answer.

miaowei · Post by **miaowei** » Sun Jul 03, 2016 3:38 am

iansjack wrote:It sounds as if you are not acknowledging the interrupts. But who knows - without seeing your code people might as well roll dice to get the answer.

OK.
【1】 init8259A( int mask)
This function is implemented in assembly. Note that it's a little ugly: its argument @mask only applies to the master 8259A chip, and the slave chip's mask is hard-coded.

Code: Select all

init8259A:
	mov al,11h
	out 20h,al		;send icw1 to 0x20		[icw4 needed]
	iodelay;
	out 0a0h,al		;send icw1 to 0xa0
	iodelay

	mov al,20h
	out 21h,al		;send icw2 to 0x21.		[irq0=0x20]
	iodelay
	mov al,28h
	out 0a1h,al		;send icw2 to 0xa1.		[irq8=0x28]
	iodelay

	mov al,4
	out 21h,al		;send icw3 to 0x21		[link slave chip at ir2]
	iodelay
	mov al,2
	out 0a1h,al 	;send icw3 to 0xa1		[link master chip from ir2]
	iodelay

	mov al,1
	out 21h,al		;[80X86 mod,normal EOI]
	iodelay
	out 0a1h,al
	iodelay
       
	;fetch the argument @mask
	mov al,[esp+4]
	out 21h,al
	iodelay
	mov al,11110011b	;slave mask is hard-coded
	out 0a1h,al
	iodelay
	ret

【2】It's called in "kernel.asm":

Code: Select all

add esp,4
push 11111000b
call init8259A 
add esp,4

【3】Here is one version of my debugging code.

Code: Select all

unsigned do_IRQ(stack_frame regs){
	int err_code = regs.err_code + 256;
	int irq = err_code - 0x20;
	oprintf(" !%u ", irq);

	u16 isr = pic_get_isr();
	u16 imr = read_imr_of8259();
	oprintf("imr: %x, isr: %x\n", imr, isr);
         return;

During the seven hours today, I wrote several versions, and I think this one is the most representative. It doesn't send an EOI, but I don't think that is the reason it gave output like this:

Code: Select all

imr:0xf3f9, isr:0
imr:0xf3fb, isr:0
imr:0xf7fb, isr:0

The second line corresponds to a keyboard interrupt.
The third line corresponds to a NIC interrupt.

I did try sending an EOI, but it changed nothing.
Have you noticed that the IMR and ISR values are obviously wrong?

And here are some of the helper functions:
【read_imr_of8259()】

Code: Select all

        ;read_imr_of8259(void)
	xor eax,eax
	in al, 0a1h 
	mov ah, al 
	in al, 21h 
	ret
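
pic_get_isr() is not shown in this thread, so the following is only a guess at one thing worth checking. On the 8259A, a read of the command port (0x20 or 0xa0) returns the IRR by default; it returns the ISR only after the OCW3 value 0x0b has been written to that port. The C sketch below models just that selection logic with a small simulated PIC so it can run in user space. The struct and function names are hypothetical and are not taken from the kernel in this thread:

```c
#include <stdint.h>

/* Simulated 8259A, just enough to show the OCW3 read-register select.
 * All names here are illustrative assumptions. */
struct sim_pic {
    uint8_t irr;        /* interrupt request register */
    uint8_t isr;        /* in-service register */
    uint8_t read_isr;   /* 0: command-port reads give IRR, 1: they give ISR */
};

#define OCW3_READ_IRR 0x0a
#define OCW3_READ_ISR 0x0b

/* Models a write to the command port. Only the two OCW3 "read register"
 * commands are handled; every other value is ignored in this sketch. */
void sim_pic_write_cmd(struct sim_pic *pic, uint8_t value)
{
    if (value == OCW3_READ_ISR)
        pic->read_isr = 1;
    else if (value == OCW3_READ_IRR)
        pic->read_isr = 0;
}

/* Models a read from the command port. */
uint8_t sim_pic_read_cmd(const struct sim_pic *pic)
{
    return pic->read_isr ? pic->isr : pic->irr;
}

/* Same shape as a real ISR reader: select ISR on both chips, then
 * combine slave (high byte) and master (low byte). */
uint16_t sim_pic_get_isr(struct sim_pic *master, struct sim_pic *slave)
{
    sim_pic_write_cmd(master, OCW3_READ_ISR);
    sim_pic_write_cmd(slave, OCW3_READ_ISR);
    return (uint16_t)((sim_pic_read_cmd(slave) << 8) | sim_pic_read_cmd(master));
}
```

On real hardware the two helper calls would be `out`/`in` instructions on ports 0x20 and 0xa0. If the OCW3 write is skipped, the value read back is the IRR, not the ISR.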

【mask_and_ack_8259A()】 Don't pay much attention to its name; I just copied it from the Linux kernel. This is what I used for sending an EOI.

Code: Select all

void mask_and_ack_8259A(u32 irq){
	if(irq >= 8) out_byte(0xa0, 0x20);
	out_byte(0x20, 0x20);
}

Combuster · Post by **Combuster** » Sun Jul 03, 2016 6:34 am

I spy JamesM's known bugs. I do not trust your printf type errors and the fact that your printed output doesn't actually match the code, and I also do not believe anything that says irq = error_code + 224.

miaowei · Post by **miaowei** » Mon Jul 04, 2016 4:44 am

Combuster wrote:I spy JamesM's known bugs. [...] and I also do not believe anything that says irq = error_code + 224.

Sorry, I don't understand what you mean. This is the interrupt entry code in my kernel (just like Linux).

Code: Select all

...
; timer interrupt
i20h:
	push 0x20 - 256
	jmp common_interrupt
; keyboard interrupt
i21h:
	push 0x21 - 256
	jmp common_interrupt
.....

Code: Select all

common_interrupt:
	SAVE_ALL
	push ret_from_intr
	jmp do_IRQ
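
The arithmetic Combuster refers to follows from this encoding: each stub pushes `vector - 256`, and do_IRQ adds 256 back and then subtracts 0x20, which nets out to `pushed + 224`. A small C sketch of that round trip, assuming the PIC is remapped so that IRQ0 is vector 0x20 as in init8259A. The helper names are invented for illustration:

```c
#include <stdint.h>

#define IRQ_BASE_VECTOR 0x20   /* master 8259A remapped so IRQ0 = vector 0x20 */

/* What a stub such as i21h pushes: the vector biased into the negative range. */
int32_t encode_vector(int vector)
{
    return (int32_t)vector - 256;
}

/* What do_IRQ computes from the pushed value. */
int decode_irq(int32_t pushed)
{
    int vector = (int)pushed + 256;
    return vector - IRQ_BASE_VECTOR;   /* same as pushed + 224 */
}
```

The round trip is only correct if the value do_IRQ reads as `regs.err_code` really is the word the stub pushed, which depends on the stack layout built by SAVE_ALL and on how `regs` is passed.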

miaowei · Post by **miaowei** » Mon Jul 04, 2016 5:53 am

I think I have some clues; at least, I have solved this bug.
Everything went back to normal when I commented out a code snippet in the file "kernel.c":

Code: Select all

	__asm__ __volatile__(
		"movl $0xc4500,0xfee00300\n\t"
		"mov $0xffffffff,%ecx\n\t"
		"shr $10,%ecx\n\t"
		"delay:inc %eax\n\t"
		"loop delay\n\t"
		"movl $(0xc4600+0x8),%eax\n\t"
		"movl %eax,0xfee00300\n\t"	
			);

In a shorter form, without the delay loop, it is:

Code: Select all

	__asm__ __volatile__(
		"movl $0xc4500,0xfee00300\n\t"
		"movl $(0xc4600+0x8),%eax\n\t"
		"movl %eax,0xfee00300\n\t"	
			);

The Intel APIC registers are memory-mapped starting at 0xfee00000. The register at offset 0x300 is APIC_ICR_LOW.
The code above tries to broadcast INIT and then SIPI to the other cores in an SMP environment.
This short snippet is old code from my early days of writing this kernel (at that time I tried to write a multi-core kernel, and gave up later), and I forgot to delete it. Hence this bug...
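
To make the two ICR writes easier to read, here is a small C sketch that splits an ICR low-dword value into its fields, following the layout in the Intel SDM (vector in bits 0-7, delivery mode in bits 8-10, level in bit 14, destination shorthand in bits 18-19). The struct and function names are made up for this example:

```c
#include <stdint.h>

/* Illustrative decoder for the low 32 bits of the local APIC ICR. */
struct icr_fields {
    uint8_t vector;         /* bits 0-7 */
    uint8_t delivery_mode;  /* bits 8-10: 5 = INIT, 6 = Start-up (SIPI) */
    uint8_t level_assert;   /* bit 14 */
    uint8_t shorthand;      /* bits 18-19: 3 = all excluding self */
};

struct icr_fields decode_icr_low(uint32_t icr)
{
    struct icr_fields f;
    f.vector        = (uint8_t)(icr & 0xff);
    f.delivery_mode = (uint8_t)((icr >> 8) & 0x7);
    f.level_assert  = (uint8_t)((icr >> 14) & 0x1);
    f.shorthand     = (uint8_t)((icr >> 18) & 0x3);
    return f;
}
```

With this layout, 0xc4500 decodes as an INIT IPI and 0xc4600+0x8 as a Start-up IPI with vector 0x08, both sent with the "all excluding self" shorthand.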

But I don't think this is a fully convincing explanation, and this may not be the end of it.
The APIC can interfere with the traditional PIC (i.e. the 8259A), that's true, but I didn't enable the APIC at all! The snippet above is the only place where I touched the APIC.
The second puzzle is that this snippet has existed in my kernel for a long time and lived in peace with my single-core kernel. It just suddenly (I hope that's all it is) made my kernel behave strangely. I regret that I didn't back up the last good version.

Anyway, I have decided to check all the low-level code I wrote early on; the bug caused by hardware is really horrible.
I have wasted 3 days on this bug.

Combuster · Post by **Combuster** » Mon Jul 04, 2016 7:44 am

miaowei wrote:the bug caused by hardware

It's certainly not the hardware's fault. More likely, it's your style of programming that would most aptly be described as Cargo cult, and I have no idea where to start commenting on all the code snippets you consider correct when they're unnecessarily awkward at minimum and obviously bugged half the time.

At some point I put the words "printf type errors" together. For starters, do you know what a "type error" is, and how it might relate to printf?
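
As an illustration of what a printf type error looks like, here is a minimal C sketch using the standard `snprintf`. It assumes `oprintf` follows the usual printf conventions, which this thread does not confirm. Each conversion has to match the type of the argument actually passed: `%d` for a signed `int` such as `irq`, and `%x` for an `unsigned int`, so 16-bit values are cast explicitly. The function name `format_irq_line` is hypothetical:

```c
#include <stdint.h>
#include <stdio.h>

/* Formats one debug line into buf. Returns the number of characters
 * snprintf would have written, excluding the terminating NUL. */
int format_irq_line(char *buf, size_t len, int irq, uint16_t imr, uint16_t isr)
{
    /* irq is a signed int, so %d (not %u) is the matching conversion.
     * imr and isr are promoted when passed through "...", so they are
     * cast to the unsigned int that %x expects. */
    return snprintf(buf, len, "!%d imr: 0x%04x, isr: 0x%04x",
                    irq, (unsigned int)imr, (unsigned int)isr);
}
```

A mismatch such as `%u` with a negative `int`, or a format string that does not correspond to the output being shown, makes debug output untrustworthy, which is a problem when that output is the main evidence for the bug.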

miaowei · Post by **miaowei** » Mon Jul 04, 2016 8:34 am

Combuster wrote:
miaowei wrote:the bug caused by hardware
It's certainly not the hardware's fault.

Yes, it's not the hardware's fault. I didn't mean that; I wanted to say "low-level issues are hard to debug".
I think the next time I write a post, I should pay more attention to my English. I believe many words in my post implied things I didn't intend.
I am sorry.

OSDev.org
