
Crash on real hardware even when doing (basically) nothing

Posted: Tue Oct 29, 2024 9:34 am
by glolichen
I am developing an x86 long-mode kernel that runs in QEMU with OVMF emulating UEFI firmware. I have disabled every feature in the kernel, so all it currently does is write a few pixels to VGA memory. When booting on real hardware, grub2 displays the "booting os" message and then, a few milliseconds later, the screen goes black and the machine appears to crash.

The "loader" code called by grub2 is:

Code:

_start:
	mov edx, 100	; byte offset within each row, counts 100 down to 1
draw_loop:
	; write 0xFF at the same offset in seven rows, 4096 bytes apart
	mov byte [0xA0000 + edx], 255
	mov byte [0xA0000 + 4096 + edx], 255
	mov byte [0xA0000 + 2 * 4096 + edx], 255
	mov byte [0xA0000 + 3 * 4096 + edx], 255
	mov byte [0xA0000 + 4 * 4096 + edx], 255
	mov byte [0xA0000 + 5 * 4096 + edx], 255
	mov byte [0xA0000 + 6 * 4096 + edx], 255
	dec edx
	cmp edx, 0	; redundant, dec already sets ZF
	jnz draw_loop

	cli
	jmp $	; spin forever
The full code is at https://github.com/glolichen/os and any help is appreciated. Thanks!

Re: Crash on real hardware even when doing (basically) nothing

Posted: Tue Oct 29, 2024 5:24 pm
by MichaelPetch
What type of hardware? Any chance that nothing gets displayed because the hardware you are on doesn't have MMIO video memory at 0xA0000? Maybe it didn't crash and you are sitting at a black screen while in an infinite loop? Have you tried querying GRUB for the frame buffer and using the framebuffer address to write pixels?

Re: Crash on real hardware even when doing (basically) nothing

Posted: Tue Oct 29, 2024 8:51 pm
by Octocontrabass
Your code works in QEMU because QEMU emulates an ancient SVGA display adapter. Ancient SVGA display adapters always map the framebuffer at 0xA0000 to allow access from real mode.

Modern display adapters don't map the framebuffer at 0xA0000 unless they're in a VGA-compatible mode. UEFI doesn't use VGA-compatible modes. You need to parse the Multiboot2 information structure to find the framebuffer base address.
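
To make the suggestion above concrete, here is a minimal sketch in C of walking the Multiboot2 tag list to pull out the framebuffer tag (type 8). The field offsets follow the Multiboot2 framebuffer tag layout; the names `fb_info` and `mb2_find_framebuffer` are made up for illustration and are not taken from glolichen's repository. It assumes `mbi` points at a readable (already mapped) copy of the information structure.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Illustrative container for the fields of interest (hypothetical name). */
struct fb_info {
    uint64_t addr;    /* physical base address of the framebuffer */
    uint32_t pitch;   /* bytes per scanline */
    uint32_t width;   /* pixels */
    uint32_t height;  /* pixels */
    uint8_t  bpp;     /* bits per pixel */
};

enum { MB2_TAG_END = 0, MB2_TAG_FRAMEBUFFER = 8 };

/* Walk the tag list that starts 8 bytes into the info structure.
 * Returns 1 and fills *out if a framebuffer tag is found, else 0. */
static int mb2_find_framebuffer(const void *mbi, struct fb_info *out)
{
    const uint8_t *base = mbi;
    uint32_t total_size;
    memcpy(&total_size, base, sizeof total_size);

    size_t off = 8;  /* skip total_size and the reserved word */
    while (off + 8 <= total_size) {
        uint32_t type, size;
        memcpy(&type, base + off, 4);
        memcpy(&size, base + off + 4, 4);

        if (type == MB2_TAG_END || size < 8)
            break;  /* end tag or malformed tag: stop walking */

        /* The common part of the framebuffer tag is 32 bytes long. */
        if (type == MB2_TAG_FRAMEBUFFER && size >= 32 &&
            off + size <= total_size) {
            memcpy(&out->addr,   base + off + 8,  8);
            memcpy(&out->pitch,  base + off + 16, 4);
            memcpy(&out->width,  base + off + 20, 4);
            memcpy(&out->height, base + off + 24, 4);
            memcpy(&out->bpp,    base + off + 28, 1);
            return 1;
        }

        /* Tags are padded so the next one starts on an 8-byte boundary. */
        off += ((size_t)size + 7) & ~(size_t)7;
    }
    return 0;
}
```

The address returned this way is physical, so in long mode it still has to be mapped before any pixel is written.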

Re: Crash on real hardware even when doing (basically) nothing

Posted: Tue Oct 29, 2024 9:23 pm
by glolichen
Octocontrabass wrote: Tue Oct 29, 2024 8:51 pm Your code works in QEMU because QEMU emulates an ancient SVGA display adapter. Ancient SVGA display adapters always map the framebuffer at 0xA0000 to allow access from real mode.

Modern display adapters don't map the framebuffer at 0xA0000 unless they're in a VGA-compatible mode. UEFI doesn't use VGA-compatible modes. You need to parse the Multiboot2 information structure to find the framebuffer base address.
Thanks for this piece of information. I didn't know that before.

I have some code for querying VGA framebuffer address from multiboot2 information tables. It initializes a page frame allocator based on multiboot2 mmap and maps some virtual memory to the framebuffer. Then I draw a green line to test. Code is at https://github.com/glolichen/os/blob/main/src/kmain.c, lines 78-237. Of course it works in QEMU but not on a laptop.

By the way, I am testing this by building the kernel normally (GCC cross compiler and grub-mkrescue), using dd to write it to a USB drive, then booting from the USB drive via the BIOS boot menu. What other hardware, QEMU, or build-process quirks might cause issues like this? Thanks.

Re: Crash on real hardware even when doing (basically) nothing

Posted: Tue Oct 29, 2024 9:50 pm
by Octocontrabass
glolichen wrote: Tue Oct 29, 2024 9:23 pm Of course it works in QEMU but not on a laptop.
I'm assuming you mean the screen remains black. I didn't see anything that would obviously cause a black screen, but I did see you're making a lot of assumptions about the framebuffer layout instead of using the values from the Multiboot2 information. You need to at least use the width, height, and pitch instead of blindly assuming 1024, 768, and 4096.
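
As a rough illustration of using the reported geometry rather than hard-coded numbers, here is a small C sketch of a pixel write that takes pitch, width, and height as parameters. The helper name `put_pixel32` is hypothetical, and the sketch assumes a 32 bits-per-pixel mode with a pitch that is a multiple of 4; the real bpp and color layout should be taken from the framebuffer tag as well.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: writes one 32-bit pixel, using values that would
 * come from the Multiboot2 framebuffer tag instead of fixed constants. */
static void put_pixel32(volatile uint8_t *fb, uint32_t pitch,
                        uint32_t width, uint32_t height,
                        uint32_t x, uint32_t y, uint32_t color)
{
    if (x >= width || y >= height)
        return;  /* refuse to write outside the visible area */

    /* pitch is bytes per scanline and may be larger than width * 4 */
    volatile uint32_t *px =
        (volatile uint32_t *)(fb + (size_t)y * pitch + (size_t)x * 4);
    *px = color;
}
```

With the values from the QEMU log later in this thread, the call would be along the lines of `put_pixel32(fb, 4096, 1024, 768, x, y, color)`, but on other hardware all three numbers can differ.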

I also noticed this inline assembly clobbering RAX.

Re: Crash on real hardware even when doing (basically) nothing

Posted: Tue Oct 29, 2024 10:47 pm
by MichaelPetch
I think the black screen is simply the result of the video mode switch that GRUB does. When I build your project and run the ISO here in QEMU I do get the black screen and then a page fault occurs. The page fault interrupt handler does a `hlt`, so the OS is in a do-nothing state at that point and it appears as a hang with a black screen. I don't have time to debug this tonight but I'll just toss what I got here. The issue seems to be in the code that builds the memory map:

Code:

INFO:  TSS loaded
INFO:  multiboot pointer: 0x932D00
INFO:  interrupts: initialized
INFO:  announced mbi size 0xA40
INFO:  tag 21, size 0xC
INFO:  tag 1, size 0x9
INFO:  command line =
INFO:  tag 2, size 0x1D
INFO:  boot loader name = GRUB 2.06-2ubuntu7.2
INFO:  tag 10, size 0x1C
INFO:  tag 6, size 0xB8
INFO:  mmap
INFO:      base_addr = 0x0, length = 0x9FC00, type = Available, 0x932D78
INFO:      base_addr = 0x9FC00, length = 0x400, type = Reserved, 0x932D90
INFO:      base_addr = 0xF0000, length = 0x10000, type = Reserved, 0x932DA8
INFO:      base_addr = 0x100000, length = 0x7EE0000, type = Available, 0x932DC0
INFO:      base_addr = 0x7FE0000, length = 0x20000, type = Reserved, 0x932DD8
INFO:      base_addr = 0xFFFC0000, length = 0x40000, type = Reserved, 0x932DF0
INFO:      base_addr = 0xFD00000000, length = 0x300000000, type = Reserved, 0x932E08
INFO:  tag 9, size 0x594
INFO:  tag 4, size 0x10
INFO:  mem_lower = 639KB, mem_upper = 129920KB
INFO:  tag 5, size 0x14
INFO:  boot device 0xE0, 0xFFFFFFFF, 0xFFFFFFFF
INFO:  tag 7, size 0x310
INFO:  tag 8, size 0x26
INFO:  framebuffer address 0xFD000000
INFO:  framebuffer type 1
INFO:  framebuffer pitch 4096
INFO:  framebuffer width 1024
INFO:  framebuffer height 768
INFO:  framebuffer color type 255
INFO:  tag 14, size 0x1C
INFO:  total mbi size 0xFFFFFFFFFBF9CF40
INFO:  pmm: total available memory: 0x76C7DA8 bytes (30287 nodes)
INFO:  pmm: add block: start 0x918258 end 0x7FE0000
INFO:  pmm: add block: start 0x98E738 end 0x7FE0000
check_exception old: 0xffffffff new 0xe
     0: v=0e e=0000 i=0 cpl=0 IP=0008:ffffffff801011f1 pc=ffffffff801011f1 SP=0010:ffffffff80911f80 CR2=0000000081b72dd0
RAX=00000000ffffffff RBX=0000000007fe0000 RCX=0000000000000001 RDX=0000000100932d67
RSI=000000000000000a RDI=00000000000003f8 RBP=ffffffff80911fe0 RSP=ffffffff80911f80
R8 =0000000000932e08 R9 =ffffffff8010c000 R10=0000000000000000 R11=0000000000000000
R12=ffffffff8010e000 R13=0000000081b72dc0 R14=0000000000918258 R15=0000000000932d68
RIP=ffffffff801011f1 RFL=00000297 [--S-APC] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 0000000000000000 ffffffff 00af9300 DPL=0 DS   [-WA]
CS =0008 0000000000000000 ffffffff 00af9a00 DPL=0 CS64 [-R-]
SS =0010 0000000000000000 ffffffff 00af9300 DPL=0 DS   [-WA]
DS =0010 0000000000000000 ffffffff 00af9300 DPL=0 DS   [-WA]
FS =0010 0000000000000000 ffffffff 00af9300 DPL=0 DS   [-WA]
GS =0010 0000000000000000 ffffffff 00af9300 DPL=0 DS   [-WA]
LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
TR =0018 ffffffff8010a000 00000068 00008900 DPL=0 TSS64-avl
GDT=     ffffffff8010a072 00000027
IDT=     ffffffff809171e0 00000fff
CR0=80000011 CR2=0000000081b72dd0 CR3=000000000010b000 CR4=00000020
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=0000000100932d67 CCD=ffffffff81240059 CCO=SUBQ
EFER=0000000000000500
ERROR: Exception 0xE: Page Fault, code 0
ERROR: address: 0x81B72DD0
`info mem` in QEMU shows:

Code:

(qemu) info mem
0000000000000000-0000000080000000 0000000080000000 -rw
ffffffff80000000-fffffffffffb3000 000000007ffb3000 -rw
fffffffffffb4000-fffffffffffb6000 0000000000002000 -rw
fffffffffffb7000-fffffffffffb8000 0000000000001000 -r-
fffffffffffb8000-fffffffffffb9000 0000000000001000 -rw
fffffffffffbb000-fffffffffffbc000 0000000000001000 -r-
fffffffffffbd000-fffffffffffbe000 0000000000001000 -r-
fffffffffffbe000-fffffffffffbf000 0000000000001000 -rw
fffffffffffc9000-fffffffffffca000 0000000000001000 -r-
fffffffffffd1000-fffffffffffd2000 0000000000001000 -r-
fffffffffffd2000-fffffffffffd3000 0000000000001000 -rw
fffffffffffd6000-fffffffffffd7000 0000000000001000 -rw
fffffffffffd9000-fffffffffffda000 0000000000001000 -rw
fffffffffffe5000-fffffffffffe6000 0000000000001000 -rw
fffffffffffe6000-fffffffffffe7000 0000000000001000 -r-
fffffffffffe8000-fffffffffffe9000 0000000000001000 -rw
fffffffffffeb000-fffffffffffec000 0000000000001000 -r-
fffffffffffed000-fffffffffffee000 0000000000001000 -rw
ffffffffffff0000-ffffffffffff1000 0000000000001000 -rw
ffffffffffff6000-ffffffffffff7000 0000000000001000 -rw
ffffffffffffd000-ffffffffffffe000 0000000000001000 -r-
`info tlb` output is long, but there are a lot of bogus entries with bad flags. Note that in the exception output, e=0000 means that you got a page fault trying to read from a non-present page. CR2=0000000081b72dd0 is the memory address that was accessed.
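
For reference, the reading of the error code above can be sketched as a small C decoder. The bit meanings are the architectural x86 page-fault error code bits 0 to 4; the struct and function names are invented for this example.

```c
#include <stdint.h>
#include <stdbool.h>

/* Decoded view of the low bits of an x86 page-fault error code. */
struct pf_info {
    bool present;      /* bit 0: 0 = page not present, 1 = protection violation */
    bool write;        /* bit 1: 0 = read access, 1 = write access */
    bool user;         /* bit 2: 0 = supervisor mode, 1 = user mode */
    bool reserved_bit; /* bit 3: a reserved bit was set in a paging entry */
    bool instr_fetch;  /* bit 4: the fault was caused by an instruction fetch */
};

static struct pf_info decode_pf_error(uint64_t err)
{
    struct pf_info info;
    info.present      = (err >> 0) & 1;
    info.write        = (err >> 1) & 1;
    info.user         = (err >> 2) & 1;
    info.reserved_bit = (err >> 3) & 1;
    info.instr_fetch  = (err >> 4) & 1;
    return info;
}
```

With the value from the dump, `decode_pf_error(0)` reports a supervisor-mode read of a non-present page, which matches the interpretation given above.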

Re: Crash on real hardware even when doing (basically) nothing

Posted: Wed Oct 30, 2024 8:18 pm
by Octocontrabass
That sounds like what you'd get if the stack overflowed and started overwriting the page tables.

Which will happen pretty quickly if you accidentally initialize the stack so it's full instead of empty.
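
A toy model in C may help show why the starting value matters. It is only a simulation under made-up assumptions: memory is an array of 64-bit words, the words just below the stack stand in for page tables (the real layout in glolichen's kernel may differ), and `push` mimics the x86 rule of decrementing the stack pointer before storing.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy memory image: words [0, 4) stand in for page tables that happen to
 * sit just below the stack, words [4, 12) are the reserved stack area. */
enum {
    GUARD_WORDS  = 4,
    STACK_WORDS  = 8,
    MEM_WORDS    = GUARD_WORDS + STACK_WORDS,
    STACK_BOTTOM = GUARD_WORDS,  /* lowest index of the stack area */
    STACK_TOP    = MEM_WORDS     /* one past the highest index */
};

/* x86-style push: decrement the stack pointer first, then store.
 * Returns the new stack pointer (an index into mem). */
static size_t push(uint64_t *mem, size_t sp, uint64_t value)
{
    sp -= 1;
    mem[sp] = value;
    return sp;
}
```

Starting at `STACK_TOP`, the first push lands in the last word of the stack area and there is room for eight pushes. Starting at `STACK_BOTTOM`, the very first push already writes into the region below the stack.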

Re: Crash on real hardware even when doing (basically) nothing

Posted: Wed Oct 30, 2024 9:40 pm
by glolichen
Octocontrabass wrote: Wed Oct 30, 2024 8:18 pm That sounds like what you'd get if the stack overflowed and started overwriting the page tables.

Which will happen pretty quickly if you accidentally initialize the stack so it's full instead of empty.
Thanks for that, I changed the esp/rsp to the stack_top instead.
MichaelPetch wrote: Tue Oct 29, 2024 10:47 pm I think the black screen is simply the result of the video mode switch that GRUB does. When I build your project and run the ISO here in QEMU I do get the black screen and then a page fault occurs. The page fault interrupt handling does a `hlt` so the OS is in a do-nothing state at that point so it appears as a hang with a black screen. I don't have time to debug this tonight but I'll just toss what I got here. The issue seems to be in the code that builds the memory map.
I'm also getting this when running it in QEMU with less physical memory, but with a GPF instead of a page fault. There's definitely something wrong with the page frame initialization... I'll take a look at it when I have time. I appreciate you telling me this, though.

Re: Crash on real hardware even when doing (basically) nothing

Posted: Thu Oct 31, 2024 2:13 am
by MichaelPetch
I agree with octo, and make sure you fix it for both the lower-half and higher-half stacks. I think one of your other main issues is that you end up creating a linked list in memory right after your kernel. It is quite possible that the Multiboot2 information structure that GRUB passes you sits beyond the end of the kernel. The GRUB spec doesn't say where the structure will be. For example, in the QEMU output I posted yesterday, you may want to look closely at this line from my environment:

Code:

multiboot pointer: 0x932D00
And your kernel_end is at 0x918258 which can be found in the line

Code:

pmm: add block: start 0x918258 end 0x7FE0000
It just so happens that the memory map itself is likely in the same general area. When you go to create your linked list, you are very quickly going to overwrite the multiboot memory map structure as you add to the list. I noticed that I end up in an infinite loop processing a clobbered multiboot memory map, where `kmain` can't find the end tag because of the corruption.
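
One possible way to avoid the overlap, sketched in C under the assumption that only the kernel image and the Multiboot2 information structure need protecting (boot modules or other reserved ranges would need the same treatment): start the allocator's bookkeeping at the first page boundary past both. The function names are hypothetical.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Round addr up to the next multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

/* Lowest page-aligned address that lies past both the kernel image and
 * the Multiboot2 information structure. */
static uint64_t first_safe_address(uint64_t kernel_end,
                                   uint64_t mbi_addr, uint32_t mbi_size)
{
    uint64_t mbi_end = mbi_addr + mbi_size;
    uint64_t start = kernel_end > mbi_end ? kernel_end : mbi_end;
    return align_up(start, PAGE_SIZE);
}
```

With the numbers from the log (kernel end 0x918258, multiboot pointer 0x932D00, announced size 0xA40) this yields 0x934000 rather than an address inside or below the information structure. An alternative is to copy the tags the kernel needs into its own storage before building the list.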

Re: Crash on real hardware even when doing (basically) nothing

Posted: Fri Nov 01, 2024 11:52 am
by glolichen
Thanks for all your help on this topic. A significant source of error is that QEMU initializes more memory than real hardware does, and there were a number of places in the code where I presumed zeroed memory, which appears to cause the PFs/GPFs that occasionally come up.
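
As a small illustration of not relying on zeroed memory, here is a C sketch that clears a page before it is used as a page table, so every entry starts out not present. The function name is hypothetical, and in a freestanding kernel `memset` would have to be supplied by the kernel itself rather than a hosted C library.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE         4096u
#define ENTRIES_PER_TABLE 512

/* Clear a freshly obtained page before treating it as a page table.
 * A zero entry has the present bit (bit 0) clear, so stale contents
 * left in RAM can never be mistaken for a valid mapping. */
static uint64_t *init_page_table(void *page)
{
    memset(page, 0, PAGE_SIZE);
    return (uint64_t *)page;
}
```

The same idea applies to any structure carved out of raw physical memory, such as allocator nodes or descriptor tables: initialize every field explicitly instead of assuming the firmware or emulator left zeros behind.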