Weird behavior on real hardware
Hi, I have encountered some very weird behavior with my OS on real hardware. Here's a little background:
1. I'm using my own BIOS bootloader
2. My OS is x86-64
4. Uses APIC
5. Boots all APs
6. Uses RTC to get time of boot
7. Doesn't touch PCI devices (yet)
8. Doesn't touch ACPI (yet)
9. Doesn't touch PS2 devices (yet)
10. Uses PIT as a timer before lapic timers are ready + calibrating on each core
11. Uses VESA modes set by the bootloader to draw a basic desktop
12. I write the raw image of the virtual hard disk of my OS to a USB flash drive for testing on real hw.
So far I've tested my OS on the following:
1. VMWare (with all kinds of settings, up to 16 cores)
2. VirtualBox
3. QEMU (with + without kvm)
4. Bochs
5. My own PC with i9-9900K + new MB
I haven't been able to get it to crash even once on any of the aforementioned platforms.
However, when testing on my laptop (with an i7-7700HQ) I noticed that it only boots successfully ONCE. All attempts after that lead to a black screen (after the bootloader hands control to my OS) and a reboot after about 5 seconds (so probably a triple fault?). When I log back in to Windows, shut down, and then start my OS again, it boots perfectly fine again, and after that all consecutive attempts to boot it result in a triple fault. I've tested this behavior multiple times. What could be the reason for it? Does Windows save some state that my OS relies on?
I'm not seeing this behavior on my main PC. The only difference between the two that I can point out is that the VESA mode set by my bootloader is 1920x1080x32 on my laptop (since it supports the "get EDID" BIOS function) and 1024x768x32 on my PC. However, when testing on Bochs it also sets 1920x1080 and still works perfectly fine, so I don't think that's related... Anyway, I would really appreciate any pointers as to where to look for the potential cause. The good thing is that it's easily reproducible, so I might be able to disable certain components of my OS one by one, try again, and maybe debug it that way... If you need any more information, feel free to ask in the comments. Thanks.
Re: Weird behavior on real hardware
Are you using Legacy-BIOS mode of UEFI? Or is it a native Legacy-BIOS computer?
Greetings
Peter
Re: Weird behavior on real hardware
PeterX wrote: Are you using Legacy-BIOS mode of UEFI? Or is it a native Legacy-BIOS computer?
It's the standard UEFI with CSM, class 2.
Re: Weird behavior on real hardware
It's likely that many of the issues on real hardware lie solely with the BIOS.
When I tried my bootsect/kernel on real hardware for the first time, it failed spectacularly on three different boxes. I eventually tracked down all the issues to BIOS problems:
- some machines don't implement int 15 function E820 at all, or at least properly. Some use the carry flag to indicate the list is complete, some reset BX to 0. One machine I have gathers only garbage data, another only reports a single memory range, which doesn't seem right. It doesn't report the space over 1MB (this may be related to the process of iterating through the list)
- some machines don't support use of LBA addressing for certain device types. I was successfully loading my kernel with C/H/S under Bochs and VirtualBox as a floppy image, then later with LBA, and as a hard drive image. At least one of my real systems refused to offer LBA for floppy disks. I eventually switched to an El Torito ISO image so I can also boot from a flash drive.
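For anyone falling back to C/H/S on floppies, here is a rough sketch of the usual LBA-to-CHS conversion. It assumes a standard 1.44 MB floppy geometry (18 sectors per track, 2 heads); the names `lba_to_chs` and `struct chs` are made up for illustration, and other media would need different constants.

```c
#include <stdint.h>

/* Assumed geometry of a 1.44 MB floppy; other media need other values. */
#define SECTORS_PER_TRACK 18u
#define HEADS             2u

struct chs {
    uint16_t cylinder;
    uint8_t  head;
    uint8_t  sector;   /* 1-based, which is how CHS sector numbers are counted */
};

/* Convert a 0-based logical block address into cylinder/head/sector. */
static struct chs lba_to_chs(uint32_t lba)
{
    struct chs out;
    out.cylinder = (uint16_t)(lba / (SECTORS_PER_TRACK * HEADS));
    out.head     = (uint8_t)((lba / SECTORS_PER_TRACK) % HEADS);
    out.sector   = (uint8_t)((lba % SECTORS_PER_TRACK) + 1u);
    return out;
}
```

With this geometry, LBA 0 maps to cylinder 0, head 0, sector 1, and LBA 18 is the first sector on head 1 of the same cylinder.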
Re: Weird behavior on real hardware
Is it possible that the power button on your laptop is suspending it rather than switching it off?
Re: Weird behavior on real hardware
iansjack wrote: Is it possible that the power button on your laptop is suspending it rather than switching it off?
Maybe, it's hard to tell. I do see the same boot screen each time tho
Re: Weird behavior on real hardware
UEFI stores some data in non-volatile variables upon booting. I'm not sure if this is the reason for your problem.
Greetings
Peter
Re: Weird behavior on real hardware
8infy wrote:
1. I'm using my own BIOS bootloader
8. Doesn't touch ACPI (yet)
UEFI and ACPI both control nonvolatile configuration. Since you're not using them, it's either some kind of bug in your code accidentally clobbering something nonvolatile, or the firmware is expecting you to use ACPI and blowing up when you configure the hardware without it.
That doesn't really help narrow it down though. Perhaps try disabling various parts of your OS (especially booting APs and poking APIC) to see if it's one of those that upsets things.
sj95126 wrote: some machines don't implement int 15 function E820 at all, or at least properly. Some use the carry flag to indicate the list is complete, some reset BX to 0. One machine I have gathers only garbage data, another only reports a single memory range, which doesn't seem right. It doesn't report the space over 1MB (this may be related to the process of iterating through the list)
I'd be interested in seeing how you're calling it and what results you got. That function is standardized in ACPI so it should work pretty consistently across PCs that can boot Windows as long as you follow the spec. (And yes, the BX/carry flag thing is in the spec. The difference is that a valid memory range is returned when BX is zero, but not when the carry flag is set.)
sj95126 wrote: some machines don't support use of LBA addressing for certain device types. I was successfully loading my kernel with C/H/S under Bochs and VirtualBox as a floppy image, then later with LBA, and as a hard drive image. At least one of my real systems refused to offer LBA for floppy disks.
I can't say I'm too surprised. It's not possible for the BIOS to know how to translate LBA to CHS to access a floppy disk.
sj95126 wrote: I eventually switched to an El Torito ISO image so I can also boot from a flash drive.
I'm assuming you mean a hybrid image, since El Torito is not used when booting flash drives.
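To make the BX/carry point about E820 concrete, here is a minimal sketch of the iteration logic in C. It does not issue a real int 15h: the call is hidden behind a hypothetical wrapper type (`e820_call_fn`), and the two `fake_bios_*` functions are invented stand-ins for the two termination styles. The 'SMAP' signature check is left out for brevity. Treat it as an illustration of the termination rule, not anyone's actual bootloader code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct e820_entry {
    uint64_t base;
    uint64_t length;
    uint32_t type;
    uint32_t ext_attr;   /* only meaningful if the BIOS returned 24 bytes */
};

/* What a hypothetical wrapper around int 15h, EAX=E820h would report. */
struct e820_result {
    bool     carry;      /* CF after the call */
    uint32_t next;       /* continuation value returned in EBX */
    uint32_t size;       /* number of bytes written, returned in ECX */
};

/* Hypothetical wrapper: performs one call with EBX = cont and fills *out. */
typedef struct e820_result (*e820_call_fn)(uint32_t cont, struct e820_entry *out);

static size_t e820_collect(e820_call_fn call, struct e820_entry *map, size_t max)
{
    size_t   count = 0;
    uint32_t cont  = 0;                 /* EBX must be 0 on the first call */

    while (count < max) {
        /* Preset ext_attr so a 20-byte answer leaves a sane default behind. */
        struct e820_entry  e = { 0, 0, 0, 1 };
        struct e820_result r = call(cont, &e);

        if (r.carry)
            break;                      /* CF set: no valid entry came back */
        if (r.size < 20)
            break;                      /* too short to be a real entry */

        map[count++] = e;               /* CF clear: valid, even if next == 0 */

        if (r.next == 0)
            break;                      /* that was the last entry */
        cont = r.next;
    }
    return count;
}

/* Invented three-entry map used by both fake firmwares below. */
static const struct e820_entry fake_map[3] = {
    { 0x0000000000000000ull, 0x000000000009FC00ull, 1, 1 },
    { 0x000000000009FC00ull, 0x0000000000000400ull, 2, 1 },
    { 0x0000000000100000ull, 0x000000007FF00000ull, 1, 1 },
};

/* Style A: the last valid entry comes back with EBX == 0. */
static struct e820_result fake_bios_bx_zero(uint32_t cont, struct e820_entry *out)
{
    struct e820_result r = { false, 0, 20 };
    if (cont >= 3) { r.carry = true; return r; }
    *out   = fake_map[cont];
    r.next = (cont + 1 < 3) ? cont + 1 : 0;
    return r;
}

/* Style B: EBX stays nonzero; the call after the last entry sets CF. */
static struct e820_result fake_bios_carry(uint32_t cont, struct e820_entry *out)
{
    struct e820_result r = { false, 0, 20 };
    if (cont >= 3) { r.carry = true; return r; }
    *out   = fake_map[cont];
    r.next = cont + 1;
    return r;
}
```

Both fake firmwares yield the same three entries through `e820_collect`, which is the point: the loop stores the entry before looking at BX, and stores nothing once CF is set.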
Re: Weird behavior on real hardware
Octocontrabass wrote: I'd be interested in seeing how you're calling it and what results you got. That function is standardized in ACPI so it should work pretty consistently across PCs that can boot Windows as long as you follow the spec. (And yes, the BX/carry flag thing is in the spec. The difference is that a valid memory range is returned when BX is zero, but not when the carry flag is set.)
I'll have to get back to you on that. I'm not at the same location as most of my hardware right now.
I did look at my code and I may be short-circuiting the spec a bit. However, it's worked just fine on Bochs and VirtualBox so I didn't fix it. On two real boxes, it *appears* that the methodology is working (the table it builds has 5-6 entries which is typical) but the values were garbage. I may not have been resetting a register that was getting trashed. Even so, at least the first result returned should be valid and it wasn't.
Octocontrabass wrote: I'm assuming you mean a hybrid image, since El Torito is not used when booting flash drives.
Sorry, I sort of glibly conflated two topics. I create an El Torito no-emul ISO to boot under Bochs and VirtualBox. The boot image contained in there can also be copied to the sdX1 partition of a bootable flash drive.
Re: Weird behavior on real hardware
sj95126 wrote: I'll have to get back to you on that.
OK, getting back to that. It turns out I wasn't doing function E820 quite right. It worked on virtual machines but not real hardware. Two of the three real machines now return usable and believable results, though the third (a Lenovo laptop circa 2010) still isn't quite right. It takes 20 calls before it sets BX to 0 (CF stays clear), and some of the entries after the first ten or so are not only total garbage, but bit 0 of the extended attributes is 1, meaning DON'T ignore the entry. (Yes, the call is returning 24 bytes each time; I added debugging to print out CF, BX and CL after each call.)
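As a rough illustration of the per-entry checks that can be applied to what E820 hands back, here is a small C sketch. The function name `e820_entry_usable` and the choice of checks are my own assumptions rather than anything from the posts above, and checks like these would not necessarily catch the garbage entries described, since those had bit 0 of the extended attributes set.

```c
#include <stdbool.h>
#include <stdint.h>

struct e820_entry {
    uint64_t base;
    uint64_t length;
    uint32_t type;
    uint32_t ext_attr;   /* only present when the BIOS returned 24 bytes */
};

/* returned_size is the byte count the BIOS reported for this entry (CL/ECX). */
static bool e820_entry_usable(const struct e820_entry *e, uint32_t returned_size)
{
    if (returned_size < 20)
        return false;                   /* too short to hold base/length/type */

    /* With a 24-byte entry, bit 0 clear means "ignore this entry". */
    if (returned_size >= 24 && (e->ext_attr & 1u) == 0)
        return false;

    if (e->length == 0)
        return false;                   /* empty range, nothing to use */

    if (e->base + e->length < e->base)
        return false;                   /* range wraps past 2^64 */

    return true;
}
```

An entry that fails one of these checks would simply be skipped while building the memory map; overlapping or out-of-order entries need separate handling that is not shown here.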