Weird behavior on real hardware

8infy
Member
Posts: 185
Joined: Sun Apr 05, 2020 1:01 pm

Weird behavior on real hardware

Post by 8infy »

Hi, I have encountered some very weird behavior with my OS on real hardware. Here's a little background:
1. I'm using my own BIOS bootloader
2. My OS is x86-64
4. Uses APIC
5. Boots all APs
6. Uses RTC to get time of boot
7. Doesn't touch PCI devices (yet)
8. Doesn't touch ACPI (yet)
9. Doesn't touch PS2 devices (yet)
10. Uses the PIT as a timer until the LAPIC timers are ready, and to calibrate them on each core
11. Uses VESA modes set by the bootloader to draw a basic desktop
12. I write the raw image of the virtual hard disk of my OS to a USB flash drive for testing on real hw.

So far I've tested my OS on the following:
1. VMWare (with all kinds of settings, up to 16 cores)
2. VirtualBox
3. QEMU (with + without kvm)
4. Bochs
5. My own PC with i9-9900K + new MB

I haven't been able to get it to crash even once on any of the aforementioned platforms.
However, when testing on my laptop (with an i7-7700HQ) I noticed that it only boots successfully ONCE. Every attempt after that leads to a black screen (after the bootloader hands control to my OS) and a reboot after about 5 seconds (so probably a triple fault?). When I log back in to Windows, shut down, and then start my OS again, it boots perfectly fine once more, and after that all consecutive attempts result in a triple fault again. I've tested this behavior multiple times.

What could be the reason for such behavior? Does Windows save some state that my OS relies on? I'm not seeing this on my main PC. The only difference between the two that I can point out is that the VESA mode set by my bootloader is 1920x1080x32 on my laptop (since it supports the "get EDID" BIOS function) and 1024x768x32 on my PC. However, Bochs also sets 1920x1080 and still works perfectly fine, so I don't think that's related...

Anyway, I would really appreciate any pointers as to where to look for the potential cause. The good thing is that it's easily reproducible, so I might be able to disable certain components of my OS one by one and debug it that way... If you need any more information, feel free to ask in the comments. Thanks.
PeterX
Member
Posts: 590
Joined: Fri Nov 22, 2019 5:46 am

Re: Weird behavior on real hardware

Post by PeterX »

Are you using Legacy-BIOS mode of UEFI? Or is it a native Legacy-BIOS computer?

Greetings
Peter
8infy
Member
Posts: 185
Joined: Sun Apr 05, 2020 1:01 pm

Re: Weird behavior on real hardware

Post by 8infy »

PeterX wrote:Are you using Legacy-BIOS mode of UEFI? Or is it a native Legacy-BIOS computer?

Greetings
Peter
It's the standard UEFI with CSM, class 2.
sj95126
Member
Posts: 151
Joined: Tue Aug 11, 2020 12:14 pm

Re: Weird behavior on real hardware

Post by sj95126 »

It's likely that many of the issues on real hardware lie solely with the BIOS.

When I tried my bootsect/kernel on real hardware for the first time, it failed spectacularly on three different boxes. I eventually tracked down all the issues to BIOS problems:

- some machines don't implement int 15 function E820 at all, or at least not properly. Some use the carry flag to indicate the list is complete, some reset BX to 0. One machine I have gathers only garbage data, another only reports a single memory range, which doesn't seem right. It doesn't report the space over 1MB (this may be related to the process of iterating through the list)
- some machines don't support use of LBA addressing for certain device types. I was successfully loading my kernel with C/H/S under Bochs and VirtualBox as a floppy image, then later with LBA, and as a hard drive image. At least one of my real systems refused to offer LBA for floppy disks. I eventually switched to an El Torito ISO image so I can also boot from a flash drive.
iansjack
Member
Posts: 4703
Joined: Sat Mar 31, 2012 3:07 am
Location: Chichester, UK

Re: Weird behavior on real hardware

Post by iansjack »

Is it possible that the power button on your laptop is suspending it rather than switching it off?
8infy
Member
Posts: 185
Joined: Sun Apr 05, 2020 1:01 pm

Re: Weird behavior on real hardware

Post by 8infy »

iansjack wrote:Is it possible that the power button on your laptop is suspending it rather than switching it off?
Maybe, it's hard to tell. I do see the same boot screen each time, though.
PeterX
Member
Posts: 590
Joined: Fri Nov 22, 2019 5:46 am

Re: Weird behavior on real hardware

Post by PeterX »

UEFI stores some data in non-volatile variables upon booting. I'm not sure if this is the reason for your problem.

Greetings
Peter
Octocontrabass
Member
Posts: 5574
Joined: Mon Mar 25, 2013 7:01 pm

Re: Weird behavior on real hardware

Post by Octocontrabass »

8infy wrote:1. I'm using my own BIOS bootloader
8. Doesn't touch ACPI (yet)
UEFI and ACPI both control nonvolatile configuration. Since you're not using them, it's either some kind of bug in your code accidentally clobbering something nonvolatile, or the firmware is expecting you to use ACPI and blowing up when you configure the hardware without it.

That doesn't really help narrow it down though. Perhaps try disabling various parts of your OS (especially booting APs and poking APIC) to see if it's one of those that upsets things.
sj95126 wrote:- some machines don't implement int 15 function E820 at all, or at least not properly. Some use the carry flag to indicate the list is complete, some reset BX to 0. One machine I have gathers only garbage data, another only reports a single memory range, which doesn't seem right. It doesn't report the space over 1MB (this may be related to the process of iterating through the list)
I'd be interested in seeing how you're calling it and what results you got. That function is standardized in ACPI so it should work pretty consistently across PCs that can boot Windows as long as you follow the spec. (And yes, the BX/carry flag thing is in the spec. The difference is that a valid memory range is returned when BX is zero, but not when the carry flag is set.)
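To make the BX/carry distinction concrete, here is a minimal C sketch of the loop logic. It is not anyone's actual bootloader code: the real call has to be made from real mode in assembly, so the firmware call is abstracted behind a hypothetical `e820_call_fn` callback, and `struct e820_result`, `e820_collect` and the two `fake_bios_*` stand-ins are names invented purely for illustration. The only point is the termination rule: a call that returns with the continuation value zero still carries a valid entry, while a call that returns with the carry flag set does not.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define E820_SMAP 0x534D4150u   /* 'SMAP' signature expected in EAX */

/* One memory-map entry in its 24-byte form. */
struct e820_entry {
    uint64_t base;
    uint64_t length;
    uint32_t type;
    uint32_t ext_attr;          /* only written when 24 bytes are returned */
};

/* Register state after one call (hypothetical wrapper, not a real API). */
struct e820_result {
    bool     carry;             /* CF */
    uint32_t signature;         /* EAX */
    uint32_t next;              /* EBX, the continuation value */
    uint32_t len;               /* ECX, bytes written */
};

typedef struct e820_result (*e820_call_fn)(uint32_t cont, struct e820_entry *out);

/* Collect up to max entries; returns how many valid entries were stored. */
size_t e820_collect(e820_call_fn call, struct e820_entry *map, size_t max)
{
    size_t count = 0;
    uint32_t cont = 0;                  /* EBX must be 0 on the first call */

    while (count < max) {
        struct e820_entry e;
        memset(&e, 0, sizeof e);
        e.ext_attr = 1;                 /* so a 20-byte answer reads as valid */

        struct e820_result r = call(cont, &e);

        /* CF set (or bad signature, or short answer): nothing valid came back. */
        if (r.carry || r.signature != E820_SMAP || r.len < 20)
            break;

        map[count++] = e;               /* valid even when r.next == 0 */

        if (r.next == 0)                /* that was the last entry */
            break;
        cont = r.next;                  /* pass EBX back unmodified */
    }
    return count;
}

/* Host-side stand-ins for the firmware, only to exercise the loop. */
static const struct e820_entry fake_map[] = {
    { 0x0000000000000000ull, 0x000000000009FC00ull, 1, 1 },
    { 0x0000000000100000ull, 0x000000007FEF0000ull, 1, 1 },
    { 0x00000000FEC00000ull, 0x0000000000001000ull, 2, 1 },
};
#define FAKE_COUNT (sizeof fake_map / sizeof fake_map[0])

/* Style A: the last valid entry comes back with EBX == 0. */
struct e820_result fake_bios_ebx_zero(uint32_t cont, struct e820_entry *out)
{
    struct e820_result r = { false, E820_SMAP, 0, 24 };
    *out = fake_map[cont];
    r.next = (cont + 1 < FAKE_COUNT) ? cont + 1 : 0;
    return r;
}

/* Style B: EBX keeps counting, and one extra call fails with CF set. */
struct e820_result fake_bios_carry(uint32_t cont, struct e820_entry *out)
{
    struct e820_result r = { false, E820_SMAP, cont + 1, 24 };
    if (cont >= FAKE_COUNT) {
        r.carry = true;                 /* no entry is written on this call */
        return r;
    }
    *out = fake_map[cont];
    return r;
}
```

Both stand-ins yield the same three entries through the same loop, which is the behavior a caller wants if it handles both ways of ending the list.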
sj95126 wrote:- some machines don't support use of LBA addressing for certain device types. I was successfully loading my kernel with C/H/S under Bochs and VirtualBox as a floppy image, then later with LBA, and as a hard drive image. At least one of my real systems refused to offer LBA for floppy disks.
I can't say I'm too surprised. It's not possible for the BIOS to know how to translate LBA to CHS to access a floppy disk.
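As a rough illustration of why the geometry matters, here is the usual LBA-to-CHS arithmetic as a small C sketch. The `struct geometry`, `struct chs` and `lba_to_chs` names are made up for this example; the two geometries are the standard 1.44 MB (2 heads, 18 sectors per track) and 720 KB (2 heads, 9 sectors per track) floppy formats. The same LBA lands on a different cylinder/head/sector depending on which geometry is assumed, so the translation only works if the media format is already known.

```c
#include <stdint.h>

struct chs {
    uint16_t cylinder;
    uint8_t  head;
    uint8_t  sector;            /* 1-based, as INT 13h expects */
};

struct geometry {
    uint8_t heads;
    uint8_t sectors_per_track;
};

/* Standard floppy formats, used here only as example inputs. */
const struct geometry FLOPPY_1440K = { 2, 18 };
const struct geometry FLOPPY_720K  = { 2, 9 };

/* Translate a linear sector number into CHS for a given geometry. */
struct chs lba_to_chs(uint32_t lba, struct geometry g)
{
    struct chs r;
    uint32_t track = lba / g.sectors_per_track;    /* track index across all heads */

    r.sector   = (uint8_t)(lba % g.sectors_per_track + 1);
    r.head     = (uint8_t)(track % g.heads);
    r.cylinder = (uint16_t)(track / g.heads);
    return r;
}
```

For example, LBA 18 is cylinder 0, head 1, sector 1 on a 1.44 MB disk but cylinder 1, head 0, sector 1 on a 720 KB disk.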
sj95126 wrote:I eventually switched to an El Torito ISO image so I can also boot from a flash drive.
I'm assuming you mean a hybrid image, since El Torito is not used when booting flash drives.
sj95126
Member
Posts: 151
Joined: Tue Aug 11, 2020 12:14 pm

Re: Weird behavior on real hardware

Post by sj95126 »

Octocontrabass wrote:
sj95126 wrote:- some machines don't implement int 15 function E820 at all, or at least not properly. Some use the carry flag to indicate the list is complete, some reset BX to 0. One machine I have gathers only garbage data, another only reports a single memory range, which doesn't seem right. It doesn't report the space over 1MB (this may be related to the process of iterating through the list)
I'd be interested in seeing how you're calling it and what results you got. That function is standardized in ACPI so it should work pretty consistently across PCs that can boot Windows as long as you follow the spec. (And yes, the BX/carry flag thing is in the spec. The difference is that a valid memory range is returned when BX is zero, but not when the carry flag is set.)
I'll have to get back to you on that. I'm not at the same location as most of my hardware right now.

I did look at my code and I may be short-circuiting the spec a bit. However, it's worked just fine on Bochs and VirtualBox so I didn't fix it. On two real boxes, it *appears* that the methodology is working (the table it builds has 5-6 entries which is typical) but the values were garbage. I may not have been resetting a register that was getting trashed. Even so, at least the first result returned should be valid and it wasn't.
sj95126 wrote:I eventually switched to an El Torito ISO image so I can also boot from a flash drive.
I'm assuming you mean a hybrid image, since El Torito is not used when booting flash drives.
Sorry, I sort of glibly conflated two topics. I create an El Torito no-emul ISO to boot under Bochs and VirtualBox. The boot image contained in there can also be copied to the sdX1 partition of a bootable flash drive.
sj95126
Member
Posts: 151
Joined: Tue Aug 11, 2020 12:14 pm

Re: Weird behavior on real hardware

Post by sj95126 »

sj95126 wrote:
Octocontrabass wrote:
sj95126 wrote:- some machines don't implement int 15 function E820 at all, or at least not properly. Some use the carry flag to indicate the list is complete, some reset BX to 0. One machine I have gathers only garbage data, another only reports a single memory range, which doesn't seem right. It doesn't report the space over 1MB (this may be related to the process of iterating through the list)
I'd be interested in seeing how you're calling it and what results you got. That function is standardized in ACPI so it should work pretty consistently across PCs that can boot Windows as long as you follow the spec. (And yes, the BX/carry flag thing is in the spec. The difference is that a valid memory range is returned when BX is zero, but not when the carry flag is set.)
I'll have to get back to you on that.
OK, getting back to that. It turns out I wasn't doing function E820 quite right. It worked on virtual machines but not real hardware. Two of the three real machines now return usable and believable results, though the third (a Lenovo laptop circa 2010) still isn't quite right. It takes 20 calls before it sets BX to 0 (CF stays clear) and some of the entries after the first ten or so are not only total garbage, but bit 0 of the extended attributes is 1, meaning DON'T ignore the entry. (yes, the call is returning 24 bytes each time, I added debugging to print out CF, BX and CL after each call)
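For reference, a per-entry check along these lines might look like the C sketch below. It is only an illustration, not the code discussed above: `e820_entry_plausible` and `E820_ATTR_VALID` are hypothetical names, `returned_len` stands for the byte count the call reports in CL, and the zero-length and wrap-around tests are extra sanity checks assumed here rather than anything the firmware promises. Note that it would not catch the garbage entries described above on the attribute test alone, since those have bit 0 set; only the sanity checks could reject them.

```c
#include <stdbool.h>
#include <stdint.h>

#define E820_ATTR_VALID 0x1u    /* extended attributes bit 0: entry is not to be ignored */

/* One memory-map entry in its 24-byte form. */
struct e820_entry {
    uint64_t base;
    uint64_t length;
    uint32_t type;
    uint32_t ext_attr;          /* only meaningful when 24 bytes were returned */
};

/* Decide whether an entry looks usable, given the reported byte count. */
bool e820_entry_plausible(const struct e820_entry *e, uint32_t returned_len)
{
    /* Anything shorter than the basic 20-byte entry is not a full answer. */
    if (returned_len < 20)
        return false;

    /* With a 24-byte answer, bit 0 clear means the entry should be ignored. */
    if (returned_len >= 24 && !(e->ext_attr & E820_ATTR_VALID))
        return false;

    /* Assumed sanity check: an empty range carries no information. */
    if (e->length == 0)
        return false;

    /* Assumed sanity check: the last byte must not wrap past 2^64 - 1. */
    if (e->length - 1 > UINT64_MAX - e->base)
        return false;

    return true;
}
```

Printing each rejected entry together with CF, BX and CL, as in the debugging described above, would show which of the checks a given machine trips.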