Kernel not working on some hardware

chnoser
Posts: 4
Joined: Sun Feb 15, 2015 2:20 pm

Kernel not working on some hardware

Post by chnoser »

Hello,

I am coding a small kernel as a (very time-consuming) hobby.
I got to the point where paging and multitasking are implemented: I am running some simple code in ring 3 (busy loops that print to the screen), and a basic scheduler is in place that switches between these 'tasks'.

The code works with QEMU, Bochs, and in "real life" on one of my computers, when booted from USB with GRUB, but it doesn't work on my development computer. The scheduler says it's switching tasks, but for some reason that is not actually happening on this computer. There is no crash; the user code is simply never called (and this is not a display problem, because if I deliberately add a bug to the busy loops, the kernel doesn't crash).

I have tried a LOT of things, and I am starting to go crazy.

What are the typical reasons for code to run on computer A but not on computer B? (In my case A and B are very similar: two netbooks.)

I'd really appreciate any new directions to explore.
Thanks a lot!
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance

Re: Kernel not working on some hardware

Post by Combuster »

Have you printed enough to the screen to establish exactly which lines of code get executed and which don't?
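
One cheap way to do that kind of tracing is to drop single-character markers into fixed screen cells, so each stage of the task switch leaves a visible trace even when normal printing is not trustworthy. The following is only a sketch: it assumes a VGA-style text buffer of 80x25 16-bit cells (on real hardware usually mapped at 0xB8000), and `checkpoint`, the slot numbering and the colour attribute are made-up names and values, not anything from the kernel discussed here. The `assert` stands in for whatever sanity check the kernel has.

```c
#include <assert.h>
#include <stdint.h>

#define VGA_COLS 80
#define VGA_ROWS 25
#define VGA_ATTR_WHITE_ON_RED 0x4F /* high nibble: background, low nibble: foreground */

/* Write one marker character into a fixed cell of a VGA-style text buffer.
   `text_buffer` is passed in so the same code can target the real screen
   memory in a kernel or an ordinary array in a hosted test. */
static void checkpoint(volatile uint16_t *text_buffer, unsigned slot, char marker)
{
    assert(slot < VGA_COLS * VGA_ROWS);
    /* Each cell is attribute byte (high) plus character byte (low). */
    text_buffer[slot] = (uint16_t)((VGA_ATTR_WHITE_ON_RED << 8) | (uint8_t)marker);
}
```

Calling it with a different slot and letter before and after each suspicious instruction (for example just before the `iret`, and as the first thing the user code does) shows how far execution really gets.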
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
chnoser
Posts: 4
Joined: Sun Feb 15, 2015 2:20 pm

Re: Kernel not working on some hardware

Post by chnoser »

Hi, thanks for the prompt answer.

If I simplify things a lot (only one task), I can see on screen that I am trying to call the code with the correct arguments. Then I enter the code where the task switch is done (I know I am executing it because if I add a page fault here, the correct ISR is called). But after iret nothing happens (and if I put a random value in the pushed EIP to force a page fault, nothing happens either).

At the same time, the timer interrupt is still working, same for the keyboard, so the code is still running.
And this only happens on one computer, not on the other, not with emulators - I have no idea how to debug this...
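
For reference, the frame that a 32-bit `iret` pops when it returns to ring 3 can be written down and sanity-checked before the switch. This is a sketch under assumptions, not the code from this kernel: the selectors `USER_CS` and `USER_SS` assume user code and data sit in GDT entries 3 and 4, and `make_user_frame` and `frame_targets_ring3` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* What a 32-bit iret pops when returning to a lower privilege level,
   listed from the lowest stack address upwards. */
struct iret_frame {
    uint32_t eip;
    uint32_t cs;
    uint32_t eflags;
    uint32_t esp;
    uint32_t ss;
};

static_assert(sizeof(struct iret_frame) == 20, "five 32-bit slots, no padding");

/* Assumed GDT layout: entry 3 = user code, entry 4 = user data, RPL 3. */
#define USER_CS 0x1Bu /* (3 << 3) | 3 */
#define USER_SS 0x23u /* (4 << 3) | 3 */

#define EFLAGS_RESERVED1 (1u << 1) /* bit 1 is always set */
#define EFLAGS_IF        (1u << 9) /* interrupts enabled after iret */

/* Build the frame for the first entry into a user task. */
static struct iret_frame make_user_frame(uint32_t entry, uint32_t user_stack_top)
{
    struct iret_frame f;
    f.eip = entry;
    f.cs = USER_CS;
    f.eflags = EFLAGS_RESERVED1 | EFLAGS_IF;
    f.esp = user_stack_top;
    f.ss = USER_SS;
    return f;
}

/* Cheap check to run (and print) just before the iret. */
static int frame_targets_ring3(const struct iret_frame *f)
{
    return (f->cs & 3u) == 3u && (f->ss & 3u) == 3u &&
           (f->eflags & EFLAGS_IF) != 0;
}
```

Printing the five values right before the switch, and checking them with something like `frame_targets_ring3`, at least rules out a malformed frame as the reason nothing happens after `iret`.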
chnoser
Posts: 4
Joined: Sun Feb 15, 2015 2:20 pm

Re: Kernel not working on some hardware

Post by chnoser »

I am still trying to understand the problem on this particular computer and exploring other possibilities... After simplifying the code over and over, I realized that even basic things are not working on this machine.
And now I am wondering whether this could be related to the fact that the processor is hyper-threaded (and as of now I am really, really far from using these features in my baby kernel).

My question is: is it possible to run a simple kernel using interrupts, like in the first tutorials, even if the machine has multiple cores / processors? Do I need to initialize something special in the hyper-threading case, or can I assume the processor is running happily like a 'simple' uniprocessor?
As the other computer that works uses almost the same processor (both are Atoms), I think the answer is yes and my problem has nothing to do with this, but I'd like to be sure so that I don't get even more lost.

Sorry for the dumb question, I searched quite a bit for an answer and nothing clear popped up (at least for me).
Thanks!
alexfru
Member
Posts: 1112
Joined: Tue Mar 04, 2014 5:27 am

Re: Kernel not working on some hardware

Post by alexfru »

If you never initialize another CPU nor run any code on it, it cannot affect your buggy code running on the boot CPU.
Here are a few types of possible bugs that may show only sometimes:
- incorrect device drivers affecting interrupt handling and the scheduler (e.g. interrupts aren't acknowledged and stop coming)
- race conditions
- uninitialized variables (global or on-stack) or registers or reading/writing data at wrong memory locations
- incorrect self-modifying code
- missing or incorrect TLB flushes when unmapping pages
- incorrect C/C++ code that gets fatally reordered by the compiler without your knowledge, leading to incorrect manipulation of page tables or contributing to race conditions
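
To illustrate the first item in the list (interrupts that aren't acknowledged and stop coming): with the legacy 8259 PIC pair, an IRQ from the slave chip (IRQ 8 to 15) needs an EOI on both chips, and forgetting one of them is an easy mistake. The sketch below assumes the standard PC command ports 0x20 and 0xA0 and a kernel that uses the PIC rather than the APIC; `eoi_plan_for_irq`, `send_eoi` and the `outb_fn` callback are hypothetical names, with the port write passed in as a function pointer so the decision logic can be checked on its own.

```c
#include <assert.h>
#include <stdint.h>

#define PIC1_COMMAND 0x20u /* master 8259 command port */
#define PIC2_COMMAND 0xA0u /* slave 8259 command port */
#define PIC_EOI      0x20u /* non-specific end-of-interrupt command */

/* Which chips must receive an EOI for a given IRQ line. */
struct eoi_plan {
    int slave;
    int master;
};

static struct eoi_plan eoi_plan_for_irq(unsigned irq)
{
    struct eoi_plan plan;
    assert(irq < 16);
    plan.slave = irq >= 8; /* IRQ 8..15 arrive through the slave chip */
    plan.master = 1;       /* the master is always involved (cascade) */
    return plan;
}

/* Port-write primitive supplied by the kernel (assumed to exist). */
typedef void (*outb_fn)(uint16_t port, uint8_t value);

static void send_eoi(unsigned irq, outb_fn outb)
{
    struct eoi_plan plan = eoi_plan_for_irq(irq);
    if (plan.slave)
        outb(PIC2_COMMAND, PIC_EOI);
    if (plan.master)
        outb(PIC1_COMMAND, PIC_EOI);
}
```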
chnoser
Posts: 4
Joined: Sun Feb 15, 2015 2:20 pm

Re: Kernel not working on some hardware

Post by chnoser »

OK, thanks for the answer on the multiprocessor case.

I'm happy to report that the code finally works. The problem was that on this computer IRQ 10 was sometimes firing. There is no handler associated with it, but I still thought it was supposed to work (in this case I am using the default handler, which sends the EOI and exits). That doesn't seem to be enough for this IRQ, and the user threads were stopped (no idea how or why exactly), while at the same time some parts of the kernel code were still running (timer and keyboard OK).

My dirty solution for now is to enable only the lines I use (timer and keyboard) and that's it, all good. My last question is: why is this happening (and how should I handle the interrupt)?

Thanks again for the help!
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Kernel not working on some hardware

Post by Brendan »

Hi,
chnoser wrote: OK, thanks for the answer on the multiprocessor case.

I'm happy to report that the code finally works. The problem was that on this computer IRQ 10 was sometimes firing. There is no handler associated with it, but I still thought it was supposed to work (in this case I am using the default handler, which sends the EOI and exits). That doesn't seem to be enough for this IRQ, and the user threads were stopped (no idea how or why exactly), while at the same time some parts of the kernel code were still running (timer and keyboard OK).

My dirty solution for now is to enable only the lines I use (timer and keyboard) and that's it, all good. My last question is: why is this happening (and how should I handle the interrupt)?
I have no idea why it's happening (and don't know which device might be sending IRQ 10). I suspect that it might be a PCI device, and that when you send the EOI it just sends another IRQ immediately (because it's level triggered and not edge triggered), so the CPU spends all of its time handling the IRQ flood.

How to handle the interrupt is "don't". During boot you should mask all IRQs in the PIC and/or IO APIC (and possibly also disable all PCI devices), and when you install a device driver you should enable that device's IRQ(s) during the driver's initialisation.
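
For the legacy PIC case, "mask everything, then unmask per driver" comes down to two mask bytes. A rough sketch, assuming the usual 8259 data ports 0x21 and 0xA1; `pic_masks`, `pic_mask_all` and `pic_unmask` are hypothetical names, and the functions only compute the bytes, leaving the actual port writes to the kernel's own `outb`.

```c
#include <assert.h>
#include <stdint.h>

#define PIC1_DATA   0x21u /* master 8259 mask register port */
#define PIC2_DATA   0xA1u /* slave 8259 mask register port */
#define IRQ_CASCADE 2u    /* master line the slave chip is wired to */

/* Mask bytes for both chips; a set bit means the line is disabled. */
struct pic_masks {
    uint8_t master; /* to be written to PIC1_DATA */
    uint8_t slave;  /* to be written to PIC2_DATA */
};

/* Boot-time state: every line masked. */
static struct pic_masks pic_mask_all(void)
{
    struct pic_masks m = { 0xFF, 0xFF };
    return m;
}

/* Called from a driver's initialisation for the line it owns. */
static void pic_unmask(struct pic_masks *m, unsigned irq)
{
    assert(irq < 16);
    if (irq < 8) {
        m->master = (uint8_t)(m->master & ~(1u << irq));
    } else {
        m->slave = (uint8_t)(m->slave & ~(1u << (irq - 8)));
        /* A slave line is only delivered if the cascade line is open too. */
        m->master = (uint8_t)(m->master & ~(1u << IRQ_CASCADE));
    }
}
```

With only the timer (IRQ 0) and keyboard (IRQ 1) unmasked, the master mask comes out as 0xFC and the slave stays at 0xFF, so a stray IRQ 10 never reaches the CPU.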

You might also want to consider implementing some sort of "IRQ flood prevention" mechanism. For example, when a device driver handles an IRQ it should return some sort of status to indicate whether or not that device was responsible for the IRQ. When an IRQ occurs but no device driver returns "It was my device!", and if the IRQ won't go away on its own (after a few tries), then you know you've got a problem and need to disable that IRQ (and terminate any/all device drivers that were sharing that IRQ).
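
A minimal sketch of such a flood-prevention scheme, under stated assumptions: each driver's handler returns true when its device raised the IRQ, and a line is given up on after `MAX_UNCLAIMED` consecutive unclaimed interrupts. The threshold, the `irq_line` structure, `irq_dispatch` and the two stub handlers are all invented for illustration; masking the line in the PIC and terminating the drivers is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_HANDLERS  4
#define MAX_UNCLAIMED 3 /* hypothetical value for "a few tries" */

/* A driver's handler returns true if its device caused the IRQ. */
typedef bool (*irq_handler)(void);

struct irq_line {
    irq_handler handlers[MAX_HANDLERS]; /* drivers sharing this line */
    unsigned handler_count;
    unsigned unclaimed; /* consecutive IRQs nobody claimed */
    bool disabled;      /* set once the line is considered broken */
};

/* Run every handler on the line. Returns true if the line should stay
   enabled, false if the caller should now mask it. */
static bool irq_dispatch(struct irq_line *line)
{
    bool claimed = false;
    assert(line->handler_count <= MAX_HANDLERS);
    for (unsigned i = 0; i < line->handler_count; i++) {
        if (line->handlers[i]())
            claimed = true;
    }
    if (claimed) {
        line->unclaimed = 0;
        return true;
    }
    if (++line->unclaimed >= MAX_UNCLAIMED)
        line->disabled = true;
    return !line->disabled;
}

/* Stub handlers standing in for real drivers. */
static bool stub_claims(void)  { return true; }
static bool stub_ignores(void) { return false; }
```

A line with no driver at all, like the IRQ 10 case above, would be disabled on the third unclaimed interrupt instead of flooding the CPU indefinitely.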


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.