I am working on a 32-bit kernel loaded by GRUB 2. I am writing a small driver that reads the time from CMOS and stores it in a structure allocated on my heap. All of a sudden, I am getting an error on real hardware that states: "452: out of range pointer: 0x7." It then says to press any key to exit, after which the computer moves on to the next entry in the boot list. This leads me to believe the error is occurring in GRUB for some reason.
Note: I do not get this error in VirtualBox.
This is the code that I added which seems to result in the error:
The first thing I'd do is put a "while(true) {}" in your code as close to the entry point as possible. If it still crashes, then it's very likely the problem is GRUB.
However, maybe GRUB sets up exception handlers, and your code crashes and triggers GRUB's exception handler. In that case the problem isn't GRUB; and I'd probably start by inserting code in various places (e.g. "printf("AA\n");") to determine how far it gets before it crashes.
The other thing I'd consider is changing it so that "cmos_obtaintime()" takes a pointer to a pre-allocated buffer (e.g. "static void cmos_obtaintime(cmos_datetime*outputBuffer)"), so that the caller can pass in a buffer it already owns instead of one that comes from the heap.
If that fixes the problem, then the problem was probably your "kmalloc()".
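A minimal sketch of the pre-allocated buffer idea, so it can be tried without a heap. The fields of "cmos_datetime", the helper "cmos_read_register()" and the caller "record_boot_time()" are assumptions made for illustration (the original structure is not shown in the thread), and the register read is stubbed with fixed BCD values so the sketch runs in user space. In a kernel the stub would be replaced by the usual port 0x70/0x71 access. "static" is dropped from the suggested signature only so the function can be called from another file.

```c
#include <stdint.h>

/* Hypothetical layout; the real structure is not shown in the thread. */
typedef struct {
    uint8_t seconds, minutes, hours;
    uint8_t day, month, year;
} cmos_datetime;

/* Stand-in for the port I/O a kernel would do (select the register via
   port 0x70, read it via port 0x71). Returns fixed BCD sample values. */
static uint8_t cmos_read_register(uint8_t reg)
{
    switch (reg) {
    case 0x00: return 0x56; /* seconds */
    case 0x02: return 0x34; /* minutes */
    case 0x04: return 0x12; /* hours */
    case 0x07: return 0x09; /* day of month */
    case 0x08: return 0x11; /* month */
    case 0x09: return 0x15; /* year within the century */
    default:   return 0;
    }
}

/* The RTC commonly reports packed BCD: high nibble tens, low nibble units. */
static uint8_t bcd_to_binary(uint8_t bcd)
{
    return (uint8_t)((bcd >> 4) * 10 + (bcd & 0x0F));
}

/* Fills a buffer supplied by the caller; no allocation happens here. */
void cmos_obtaintime(cmos_datetime *outputBuffer)
{
    outputBuffer->seconds = bcd_to_binary(cmos_read_register(0x00));
    outputBuffer->minutes = bcd_to_binary(cmos_read_register(0x02));
    outputBuffer->hours   = bcd_to_binary(cmos_read_register(0x04));
    outputBuffer->day     = bcd_to_binary(cmos_read_register(0x07));
    outputBuffer->month   = bcd_to_binary(cmos_read_register(0x08));
    outputBuffer->year    = bcd_to_binary(cmos_read_register(0x09));
}

/* Statically allocated storage owned by the caller, so kmalloc() is
   taken out of the picture entirely. */
static cmos_datetime boot_time;

void record_boot_time(void)
{
    cmos_obtaintime(&boot_time);
}
```

If the crash disappears with the static buffer and returns with the heap-allocated one, that points at the allocator rather than at the CMOS code.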
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Looks like the error message comes from this line of the GRUB source: http://anonscm.debian.org/cgit/pkg-grub ... tor.c#n452
This part of the code should not be executed after control has been passed to your kernel image. So either your kernel jumps to that point for whatever strange reason (which I would expect to cause a triple fault, or at least some exception, rather than just printing a message, since you have probably already messed with the screen), or it happens in GRUB before it passes control to your kernel. This is probably hard to debug, but with the Bochs debugger it should be possible. I would take a version of your image that reproduces the bug and check in Bochs, or in QEMU with GDB, where execution ends up and what the call stack looks like.
Thanks for the reply. I carried out your suggestions and found that they did stop the error. It gets weird, though: printing the address out to the console also removes the error, even with the heap functions still being called. The printed output shows that the heap is functioning as expected.
In terms of GRUB performing exception handling, I don't really see how it could, because my OS takes over the IRQs/ISRs much earlier in its initialization process.
By adding a function in a completely different source file, the error stopped, and I haven't been able to reproduce it since. I am definitely unnerved by this disappearing bug, though my hunch is that it has something to do with GRUB not liking the image.
Jacob
You may think that this is a solution, but this kind of unrelated "repair" almost always hides important details about a bug that may still be there and can come back later. I'd undo that "repair" and do what XenOS suggests, so that both you and we can find out where the bug comes from...
When I hit a bug in my kernel's harsh early stages, before the ISRs are set up, I tend to use GDB's next and step commands. If I reach a point where EIP becomes something like 0xe000 and GDB says I'm in function ??, I have probably landed in a GRUB exception handler. The solution? Once there, dump the stack frames and you'll get the offending procedure.
Note how the entry for *(.rodata) no longer has the second '*'. I'm not entirely sure why this solved it, though. Looking at the generated linker maps, the .rodata.str1.4 (etc.) sections are now being placed before the .data section instead of after it with the other .rodata stuff.
Still searching for it; it's just difficult when any debugging code stops the bug from occurring. I am also not able to reproduce this bug in an emulator, only on real hardware.
jvc wrote: So I have concluded that my kernel's entry point never even gets called; this error is happening before my kernel runs.
In that case, there are about 6 possibilities:
The hardware is faulty, and something about the computer causes GRUB to crash. E.g. maybe there's faulty RAM.
The hardware/firmware is buggy, and something about the computer causes GRUB to crash. E.g. maybe there's a bug in the way the BIOS implemented one of the BIOS functions that GRUB uses.
GRUB is buggy, and something about the computer causes GRUB to crash. E.g. maybe GRUB expects to be able to use 636 KiB of RAM at 0x00000000, but on that computer the EBDA is larger.
GRUB is buggy, and something about your files causes GRUB to crash. E.g. maybe your kernel has a section that's supposed to be loaded at 0x10000000 and GRUB fails to check if there's RAM at 0x10000000 before attempting to load that section.
Your files are buggy. E.g. maybe your kernel violates the Multiboot specification in some way, and it's unreasonable to expect GRUB's sanity checks to detect it. Note: I can't actually think of a good example of this that would work on some computers but not others.
Something about the way you've created the boot image is wrong. For example, maybe you're booting from USB flash and that specific device tells the BIOS there's 32 sectors per track, but you've assumed 63 sectors per track in your partition table causing GRUB to load the wrong code/data using the wrong "CHS" values.
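To illustrate the last possibility, here is a small sketch of the standard CHS-to-LBA conversion. The type "chs_address" and the function "chs_to_lba()" are names made up for this example, and the geometry values in the note below are illustrative rather than taken from any particular device.

```c
#include <stdint.h>

/* Cylinder/head/sector address; sectors are numbered from 1. */
typedef struct {
    uint32_t cylinder;
    uint32_t head;
    uint32_t sector;
} chs_address;

/* Standard conversion: LBA = (C * HPC + H) * SPT + (S - 1).
   The result depends on the geometry, so both sides must agree on it. */
uint32_t chs_to_lba(chs_address a,
                    uint32_t heads_per_cylinder,
                    uint32_t sectors_per_track)
{
    return (a.cylinder * heads_per_cylinder + a.head) * sectors_per_track
           + (a.sector - 1);
}
```

With 255 heads per cylinder, the address C=1, H=2, S=3 maps to LBA 16193 when 63 sectors per track are assumed, but to LBA 8226 when the device reports 32. The same CHS values therefore point at completely different sectors.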
Mostly, you're going to need more information (and more testing to obtain more information). For example, maybe try a very minimal kernel (e.g. a 32 byte flat binary) and see if GRUB will start it correctly, or a different type of boot device (e.g. boot from CD-ROM instead of USB flash), or a different version of GRUB, or...
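One cheap check that can be run on the build machine is to confirm that the image carries a well-formed header. The sketch below assumes the kernel uses the original Multiboot (version 1) header, which has to sit 32-bit aligned within the first 8192 bytes of the image and whose magic, flags and checksum fields must sum to zero. If the kernel uses Multiboot2 instead, the magic value, alignment and search window are different. "find_multiboot_header()" is a hypothetical helper, not part of GRUB.

```c
#include <stddef.h>
#include <stdint.h>

#define MULTIBOOT_HEADER_MAGIC 0x1BADB002u
#define MULTIBOOT_SEARCH_BYTES 8192u

/* Read a little-endian 32-bit value regardless of host byte order. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Returns the byte offset of the first header whose required fields are
   consistent, or -1 if none is found in the search window. */
long find_multiboot_header(const unsigned char *image, size_t size)
{
    size_t limit = size < MULTIBOOT_SEARCH_BYTES ? size : MULTIBOOT_SEARCH_BYTES;

    for (size_t off = 0; off + 12 <= limit; off += 4) {
        uint32_t magic    = read_le32(image + off);
        uint32_t flags    = read_le32(image + off + 4);
        uint32_t checksum = read_le32(image + off + 8);

        /* The three fields must add up to zero modulo 2^32. */
        if (magic == MULTIBOOT_HEADER_MAGIC &&
            (uint32_t)(magic + flags + checksum) == 0)
            return (long)off;
    }
    return -1;
}
```

This only validates the three mandatory fields. It says nothing about load addresses or section layout, so a passing result narrows the search rather than ruling out the image.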
Cheers,
Brendan
Thanks for the reply, Brendan. I tested on a few different computers, and the bug reproduced on all of them, though it still does not appear in an emulator. I also tried three different versions of GRUB, and the error appeared on all of them. I tried a couple of different boot media as well, with the same result.
I found that the bug is highly dependent on a very specific layout of the binary: adding a single function call, or even a single parameter to an existing function call, causes the bug to disappear. Once again, the kernel does not execute, which I confirmed by placing a hlt instruction right at the entry point (after a cli instruction that was already there). I did have to remove a later instruction (a mov two lines down) to get the bug to persist.
Furthermore, the bug only appears when the kernel is loaded to virtual address 0xC0100000 (3GB and 1MB) and physical address 0x100000 (1MB). If I use the linker script to offset the kernel by a single page, the bug disappears.
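For reference, the address arithmetic behind that layout can be written out as below. The two base addresses come from the post. The macro and function names are made up for this sketch, and a 4 KiB page size is assumed.

```c
#include <stdint.h>

#define KERNEL_PHYS_BASE 0x00100000u /* 1 MiB: physical load address */
#define KERNEL_VIRT_BASE 0xC0100000u /* 3 GiB + 1 MiB: linked address */
#define PAGE_SIZE        0x1000u     /* assumed 4 KiB pages */

/* Constant difference between linked and loaded addresses (3 GiB). */
#define KERNEL_OFFSET (KERNEL_VIRT_BASE - KERNEL_PHYS_BASE)

/* Translate an address inside the kernel image between the two views. */
uint32_t virt_to_phys(uint32_t virt) { return virt - KERNEL_OFFSET; }
uint32_t phys_to_virt(uint32_t phys) { return phys + KERNEL_OFFSET; }

/* Bases after moving the whole image up by one page, which is the
   variation that made the bug disappear. */
uint32_t shifted_virt_base(void) { return KERNEL_VIRT_BASE + PAGE_SIZE; }
uint32_t shifted_phys_base(void) { return KERNEL_PHYS_BASE + PAGE_SIZE; }
```

Shifting by one page moves the bases to 0xC0101000 and 0x101000 while the offset stays at 0xC0000000, so what changes is where every section and symbol lands rather than how virtual addresses relate to physical ones.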
I really feel like this might be some really finicky GRUB thing, where it just does not like this very specific set of circumstances. Maybe I'm wrong, so I shall try a different boot loader; maybe I will get somewhere that way.