Second stage of bootloader can't call function

Posted: Wed Jun 06, 2018 3:12 pm
by scdown
I'm trying to have my first stage load my second stage and part of the kernel, and have my second stage load the rest of the kernel before switching to 32 bit mode.

The first stage successfully loads 18 sectors from the floppy disk into memory, and the second stage starts. I know this because I get "Started in 16-bit Real Mode", which the second stage prints, but load_kernel is never run. I've tried replacing 'call load_kernel' with 'jmp load_kernel'; it doesn't change anything. What's interesting is that if I move my include statements out from between the call and the function, the function is called, but then the function itself is unable to call a different function defined in one of the included files.

bootsect2.asm is the second stage of the bootloader. From a memory dump I can see that the entire second stage, and some of the kernel, is loaded into memory without any difficulties. I've attached my code.

Re: Second stage of bootloader can't call function

Posted: Wed Jun 06, 2018 5:10 pm
by Brendan
Hi,

For NASM; "%include" works a bit like one file is cut & pasted into another file. It doesn't do anything else, like remembering if you were using 16-bit before the "%include" and automatically switching back to 16-bit after the "%include".

For your code; the assembler is told to generate 16-bit code at the start of "bootsect2.asm" and the assembler generates 16-bit code up until it reaches the "%include "boot/32bit_print.asm"", and then (inside "boot/32bit_print.asm") the assembler is told to switch to 32-bit code. Then the assembler continues using 32-bit code for "boot/disk2.asm" and "load_kernel:". The end result is that it crashes because the CPU is executing code assembled for 32-bit while the CPU is in 16-bit real mode.

To solve this, just put "bits 16" at the end of "boot/32bit_print.asm".
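
As a rough sketch of the above (the attached files are not reproduced in this thread, so the bodies and the "print_string_pm" label below are placeholders, and the two files are shown together in one listing purely for illustration):

```nasm
; ---- boot/32bit_print.asm (sketch; real contents not shown in the thread) ----
bits 32                         ; everything after this is assembled as 32-bit code
print_string_pm:                ; hypothetical label
    ; ... 32-bit printing code ...
    ret

bits 16                         ; the fix: switch back before the file ends

; ---- bootsect2.asm (sketch) ----
bits 16
    call load_kernel
    jmp $

%include "boot/32bit_print.asm" ; pasted in here, "bits 32" included
%include "boot/disk2.asm"       ; without the fix this is assembled as 32-bit

load_kernel:                    ; without the fix this is 32-bit code too
    ; ... 16-bit disk loading code ...
    ret
```

Without the trailing "bits 16", the "call load_kernel" itself is still 16-bit, but the code it lands on was encoded for 32-bit, so the CPU decodes it differently in real mode.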

I didn't look too hard, but there were a few other (unrelated) things I noticed...

Near the start of "bootsect.asm" you load SS, then do another instruction ("mov [BOOT_DRIVE], dl"), then load SP. This is a little dangerous because an IRQ can occur after you've set SS but before you set SP, causing the IRQ handler to corrupt random memory. For a worst case example, if the BIOS left SS:SP set to 0x4000:0x7E00 (in the middle of nowhere) then an IRQ handler can interrupt immediately after the "mov [BOOT_DRIVE], dl" and could end up using "SS:SP = 0x0000:0x7E00" and overwrite your code. To guard against this you shouldn't have anything between the "mov ss,cx" and the "mov sp,bp" (which will be fine on modern CPUs because of special hacks built into the CPU to disable IRQs for the instruction after SS is set). For ancient CPUs (8086) there's no special hack and you would have to do "cli" before changing SS and SP and then "sti" after; but for these CPUs you're going to crash anyway (you assume the CPU supports 32-bit without checking and these CPUs don't support 32-bit).
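
A minimal sketch of that ordering, assuming a stack top of 0x7C00 (an arbitrary choice for illustration, the original code loads SP from BP) and DS set to zero so that "BOOT_DRIVE" resolves correctly:

```nasm
bits 16
org 0x7C00

start:
    xor cx, cx
    mov ds, cx              ; DS = 0 so [BOOT_DRIVE] matches "org 0x7C00"
    cli                     ; only matters on very old CPUs, harmless elsewhere
    mov ss, cx              ; nothing may sit between these two loads
    mov sp, 0x7C00          ; assumption: stack grows down from below the boot sector
    sti
    mov [BOOT_DRIVE], dl    ; save the BIOS drive number only after SS:SP is valid
    jmp $

BOOT_DRIVE: db 0
```

The point is only the order: SS and SP are loaded back to back, and the drive number is stored afterwards.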

For floppy disk you should have a BPB in the first sector because if you don't some poorly designed operating systems (Windows) complain that the disk is faulty and needs to be reformatted. If you do have a BPB, then you can use the "sectors per track" and "number of heads" fields in the BPB (instead of using your "SEC_COUNT" and "HEAD_COUNT" labels). This makes it easy to have (e.g.) a utility that generates disk images or a utility that formats floppy disks that sets up the BPB to suit the size of the floppy (e.g. 1440 KiB, 1680 KiB, 1200 KiB, ...), where the same boot code works for all the different floppy disk formats.
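
For illustration, a boot sector that starts with a BPB might look roughly like the sketch below. The label names are made up, and the values are the ones typically used for a 1440 KiB FAT12 floppy; a formatting utility would fill in different values for other formats. The extended fields that normally follow (drive number, volume label and so on) are left out here.

```nasm
bits 16
org 0x7C00

    jmp short start             ; the BPB begins 3 bytes into the sector
    nop

oem_name:            db "MYOS    "  ; 8 bytes, hypothetical name
bytes_per_sector:    dw 512
sectors_per_cluster: db 1
reserved_sectors:    dw 1
fat_count:           db 2
root_entries:        dw 224
total_sectors:       dw 2880        ; 2880 * 512 bytes = 1440 KiB
media_descriptor:    db 0xF0
sectors_per_fat:     dw 9
sectors_per_track:   dw 18          ; used instead of a SEC_COUNT constant
head_count:          dw 2           ; used instead of a HEAD_COUNT constant
hidden_sectors:      dd 0
total_sectors_large: dd 0

start:
    xor ax, ax
    mov ds, ax
    mov ax, [sectors_per_track]     ; geometry comes from the BPB at run time
    mov bx, [head_count]
    jmp $
```

Because the disk reading code takes the geometry from these fields rather than from assemble-time constants, the same binary can be written to differently sized floppies once the BPB is adjusted.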

For NASM, labels without a colon can be dangerous. For example, if you want an instruction like "stosd" but there's a typo and you write "stosdd" instead, then NASM will think it's a label without a colon and you won't get any error, and then you'll spend hours trying to figure out where the bug is. To avoid this, NASM has a "warn on orphaned labels" option, where it warns you about labels that don't have a colon, so you'll get a warning for "stosdd" (but you have to use colons for labels).
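
A small sketch of the failure mode ("clear_buffer" is a made-up routine). As far as I know the warning class is called "orphan-labels" (e.g. "nasm -w+orphan-labels -f bin file.asm"); newer NASM versions may spell it differently or enable it by default, so check the documentation for your version.

```nasm
bits 16

clear_buffer:               ; hypothetical routine, expects ES:DI to point at a buffer
    xor eax, eax
    cld
    stosdd                  ; typo: treated as a label named "stosdd", nothing is stored
    ret

clear_buffer_fixed:
    xor eax, eax
    cld
    stosd                   ; intended instruction: store EAX at ES:DI
    ret
```

With the warning enabled, the "stosdd" line should be flagged because it defines a label without a colon on a line containing nothing else.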

For NASM, most directives have a normal version that does not have square brackets (e.g. like "org 0x7C00" and "bits 16" and "section .text") that should almost always be used; plus a special lower-level internal version that does have square brackets (e.g. like "[org 0x7C00]" and "[bits 16]" and "[section .text]") that should only ever be used in special macros. For some of these if you use the wrong version it breaks features (e.g. if you use "[section .text]" instead of "section .text" you'll break the "__SECT__" macro); and for others (e.g. "[org 0x7C00]" and "[bits 16]") there's currently no difference but it may break features in future versions of NASM.
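
To show where the bracketed form is meant to be used, here is a sketch of the kind of "special macro" in question ("def_string" and "msg" are hypothetical names). The macro switches section temporarily with the bracketed form, then uses "__SECT__" to return to whatever section the caller selected with the normal form.

```nasm
bits 16
section .text               ; normal form: also records the section in __SECT__

%macro def_string 2         ; hypothetical helper: def_string label, "text"
    [section .data]         ; bracketed form: switches section, leaves __SECT__ alone
%1: db %2, 0
    __SECT__                ; back to the caller's section (.text here)
%endmacro

start:
    def_string msg, "hello"
    mov si, msg             ; still assembled into .text
    jmp $
```

If the file had used "[section .text]" at the top instead, "__SECT__" would not know about it and the macro could not return to the right place.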


Cheers,

Brendan

Re: Second stage of bootloader can't call function

Posted: Wed Jun 06, 2018 5:48 pm
by scdown
Thank you, that was incredibly informative. I've implemented what you said, and have started working on your other recommendations, but I have a new problem. For some reason, when I include gdt.asm or switch_pm.asm anywhere other than on the second-to-last line of the file, I get disk read errors constantly. Do you have any idea why that could be?

Re: Second stage of bootloader can't call function

Posted: Thu Jun 07, 2018 12:27 am
by Brendan
Hi,
scdown wrote:Thank you, that was incredibly informative. I've implemented what you said, and have started working on your other recommendations, but I have a new problem. For some reason, when I include gdt.asm or switch_pm.asm anywhere other than on the second-to-last line of the file, I get disk read errors constantly. Do you have any idea why that could be?
I'd assume it's a similar problem - "switch_pm.asm" changes to 32-bit and doesn't change back to 16-bit (like "32bit_print.asm" did).


Cheers,

Brendan

Re: Second stage of bootloader can't call function

Posted: Thu Jun 07, 2018 2:21 am
by MichaelFarthing
Brendan wrote: To solve this, just put "bits 16" at the end of "boot/32bit_print.asm".
I would actually put this in the original 16-bit file, after the include statement.
I think this makes the reason for it more obvious and removes the implied assumption within the 32-bit file that it has been included from a 16-bit file.
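
Roughly, that placement would look like the sketch below (only the file names and "load_kernel" come from the thread, the rest is placeholder):

```nasm
; ---- bootsect2.asm (sketch) ----
bits 16
    call load_kernel
    jmp $

%include "boot/32bit_print.asm" ; may leave the assembler in 32-bit mode
bits 16                         ; restated here, where the 16-bit requirement lives
%include "boot/disk2.asm"

load_kernel:
    ; ... 16-bit disk loading code ...
    ret
```

This way the 32-bit file does not need to know who includes it, and anyone reading the 16-bit file can see why the directive is there.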

Re: Second stage of bootloader can't call function

Posted: Thu Jun 07, 2018 4:17 pm
by scdown
Unfortunately, it wasn't a missing switch back to 16-bit. Other seemingly unrelated things were breaking it too. For example, 'int 0x10', which I was using to put the screen into VBE mode, was somehow making the disk reading code throw errors.

In the end, I gave up and put things back the way they were before, so that the first stage is in charge of loading everything.