Hello,
I am coding a small kernel as a (very time consuming) hobby.
I got to the point where paging and multitasking are implemented: I am running some simple code in ring 3 (busy loops that print to the screen), and a basic scheduler switches between these 'tasks'.
The code works in qemu, in bochs, and in "real life" on one of my computers when booted from USB with GRUB, but it doesn't work on my development computer. The scheduler says it's switching tasks, but on this computer no switch actually happens. There is no crash: the user code is just never called (and this is not a display problem, because if I deliberately add a bug to the busy loops the kernel doesn't crash).
I have tried a LOT of things, and I am starting to go crazy.
What are the typical reasons for code to run on computer A but not on computer B? (In my case A and B are very similar: two netbooks.)
I'd really appreciate any new directions to explore.
Thanks a lot!
Kernel not working on some hardware
- Combuster
Re: Kernel not working on some hardware
Have you printed enough to the screen to establish exactly which lines of code get executed and which don't?
Re: Kernel not working on some hardware
Hi, thanks for the prompt answer.
If I simplify things a lot (only one task), I can see on screen that I am trying to call the code with the correct arguments. Then I enter the code where the task switch is done (I know I am executing it because if I cause a page fault there, the correct ISR is called). But after iret nothing happens (and if I push a random value as the EIP to force a page fault, nothing happens either).
At the same time, the timer interrupt is still working, same for the keyboard, so the code is still running.
And this only happens on one computer, not on the other, not with emulators - I have no idea how to debug this...
Re: Kernel not working on some hardware
I am still trying to understand the problem on this particular computer, and exploring other possibilities... After simplifying the code over and over, I realized that even basic things are not working on this machine.
And now I am wondering if this could be related to the fact that the processor is hyperthreaded (and as of now I am really, really far from using these features in my baby kernel).
My question is: is it possible to run a simple kernel using interrupts like in the first tutorials, even if the machine has multiple cores / processors? Do I need to initialize something special in the hyperthreading case, or can I assume the processor is running happily like a 'simple' uniprocessor?
As the other computer that works uses almost the same processor (both are Atoms), I think the answer is yes and my problem has nothing to do with this, but I'd like to be sure so that I don't get even more lost.
Sorry for the dumb question, I searched quite a bit for an answer and nothing clear popped up (at least for me).
Thanks!
Re: Kernel not working on some hardware
If you never initialize another CPU nor run any code on it, it cannot affect your buggy code running on the boot CPU.
Here are a few types of possible bugs that may show only sometimes:
- incorrect device drivers affecting interrupt handling and the scheduler (e.g. interrupts aren't acknowledged and stop coming)
- race conditions
- uninitialized variables (global or on-stack) or registers or reading/writing data at wrong memory locations
- incorrect self-modifying code
- missing or incorrect TLB flushes when unmapping pages
- incorrect C/C++ code that gets fatally reordered by the compiler without your knowledge, leading to incorrect manipulation of page tables or contributing to race conditions
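To make the first item concrete, here is a minimal sketch of acknowledging an interrupt on the legacy 8259A PIC pair. The ports and the EOI byte are the standard ones; `outb` and the write log are hypothetical stand-ins (not from this thread) so the sketch can run outside a kernel.

```c
#include <stdint.h>

/* Standard 8259A PIC command ports and the EOI command byte. */
#define PIC1_CMD 0x20
#define PIC2_CMD 0xA0
#define PIC_EOI  0x20

/* Hypothetical stand-in for port I/O so the sketch runs hosted: it only
 * records writes. A real kernel would execute the x86 `out` instruction. */
static uint16_t port_log[8];
static uint8_t  value_log[8];
static int      log_len;

static void outb(uint16_t port, uint8_t value)
{
    if (log_len < 8) {
        port_log[log_len]  = port;
        value_log[log_len] = value;
        log_len++;
    }
}

/* Acknowledge IRQ `irq` (0-15). IRQs 8-15 arrive through the slave PIC,
 * so both chips need an EOI; skipping the slave one silences IRQs 8-15. */
void pic_send_eoi(unsigned irq)
{
    if (irq >= 8)
        outb(PIC2_CMD, PIC_EOI);
    outb(PIC1_CMD, PIC_EOI);
}
```

If an EOI like this is missing on some path through a handler, the PIC stops delivering that line and every lower-priority one, which can look like a scheduler that silently stops switching.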
Re: Kernel not working on some hardware
Ok thanks for the answer for the multi processor case.
I'm happy to report that the code finally works. The problem was that on this computer IRQ 10 was sometimes firing. No handler was associated with it, but I still thought it was supposed to work (in this case I am using the default handler, which sends the EOI and exits). This doesn't seem to be enough for this IRQ, and the user threads were stopped (no idea how or why exactly), while at the same time some parts of the kernel code were still running (timer and keyboard OK).
My dirty solution for now is to enable only the lines I use (timer and keyboard), and that's it, all good. My last question is: why is this happening (and how should I handle the interrupt)?
Thanks again for the help!
Re: Kernel not working on some hardware
Hi,
I have no idea why it's happening (and don't know which device might be sending IRQ 10). I suspect that it might be a PCI device, and when you send EOI it just sends another IRQ immediately (because it's level triggered and not edge triggered), so that a CPU spends all of its time handling the IRQ flood.
How to handle the interrupt is "don't". During boot you should mask all IRQs in the PIC and/or IO APIC (and possibly also disable all PCI devices), and when you install a device driver you should enable the device's IRQ/s during the device driver's initialisation.
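For the legacy PIC case, that boot-time policy might look roughly like the sketch below. The mask-register ports are the standard 8259A ones; `fake_imr`, `inb` and `outb` are hypothetical stand-ins so the code runs outside a kernel, and the IO APIC and PCI parts are not covered.

```c
#include <stdint.h>

#define PIC1_DATA 0x21   /* master PIC interrupt mask register */
#define PIC2_DATA 0xA1   /* slave PIC interrupt mask register */

/* Hypothetical stand-in for the two mask registers so the sketch runs
 * hosted; a real kernel would use `in`/`out` on the ports above. */
static uint8_t fake_imr[2];

static uint8_t *imr_for(uint16_t port) { return &fake_imr[port == PIC2_DATA]; }
static void    outb(uint16_t port, uint8_t v) { *imr_for(port) = v; }
static uint8_t inb(uint16_t port)             { return *imr_for(port); }

/* Early boot: mask every line on both PICs (a set bit means "masked"). */
void pic_mask_all(void)
{
    outb(PIC1_DATA, 0xFF);
    outb(PIC2_DATA, 0xFF);
}

/* Called from a driver's init once its handler is installed. */
void pic_unmask(unsigned irq)
{
    if (irq >= 8) {
        /* Slave lines reach the CPU through IRQ 2 on the master. */
        outb(PIC2_DATA, inb(PIC2_DATA) & (uint8_t)~(1u << (irq - 8)));
        irq = 2;
    }
    outb(PIC1_DATA, inb(PIC1_DATA) & (uint8_t)~(1u << irq));
}
```

With this in place, a kernel that only has timer and keyboard drivers would call `pic_mask_all()` once and then `pic_unmask(0)` and `pic_unmask(1)`, so a stray IRQ 10 never reaches the CPU.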
You might also want to consider implementing some sort of "IRQ flood prevention" mechanism. For example, when a device driver handles an IRQ it should return some sort of status to indicate whether or not that device was responsible for the IRQ. When an IRQ occurs but no device driver returns "It was my device!", and if the IRQ won't go away on its own (after a few tries), then you know you've got a problem and need to disable that IRQ (and terminate any/all device drivers that were sharing that IRQ).
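One possible shape for such a mechanism, as a sketch only: the structure, names and thresholds below (`irq_line`, `irq_dispatch`, `FLOOD_LIMIT`, the two toy handlers) are all hypothetical, and terminating the drivers that share the line is left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_SHARED   4   /* drivers that may share one line (assumed) */
#define FLOOD_LIMIT  3   /* unclaimed IRQs in a row before giving up (assumed) */

/* A handler returns true if its device raised the IRQ. */
typedef bool (*irq_handler_t)(void);

struct irq_line {
    irq_handler_t handlers[MAX_SHARED];
    size_t        count;
    unsigned      unclaimed;   /* consecutive IRQs nobody claimed */
    bool          masked;
};

/* Toy drivers for illustration only. */
static bool handler_not_mine(void) { return false; }
static bool handler_mine(void)     { return true; }

/* Ask each driver on the line; mask the line after FLOOD_LIMIT
 * consecutive IRQs that no driver claimed. Returns true if claimed. */
bool irq_dispatch(struct irq_line *line)
{
    bool claimed = false;
    for (size_t i = 0; i < line->count; i++)
        if (line->handlers[i]())
            claimed = true;

    if (claimed) {
        line->unclaimed = 0;
    } else if (++line->unclaimed >= FLOOD_LIMIT) {
        /* A real kernel would also set the PIC/IO APIC mask bit here. */
        line->masked = true;
    }
    return claimed;
}
```

A line whose only handler is `handler_not_mine` ends up masked after three dispatches, while a single claim from `handler_mine` resets the counter. The limit of three just mirrors the "after a few tries" wording above.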
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.