
[SOLVED] Strange errors when kernel goes bigger

Posted: Mon Jan 06, 2014 3:23 pm
by wichtounet
Hi,

I've run into a strange error. I've added some code in a function that is not executed at bootup and now my OS refuses to boot.

Bochs outputs a lot of these errors:

Code:

00050007770e[CPU0 ] write_virtual_qword_64(): canonical failure
00050007817e[CPU0 ] write_virtual_word_64(): canonical failure
00050007864e[CPU0 ] write_virtual_word_64(): canonical failure
00050007911e[CPU0 ] write_virtual_word_64(): canonical failure
...
And then it ends in a page fault.

I was wondering what can cause a canonical failure?

From what I've found, it happens when an address uses more than 48 bits. Is that the only case where it can occur?

Thanks

Baptiste

Re: What can cause a canonical failure in Bochs ?

Posted: Mon Jan 06, 2014 5:00 pm
by Brendan
Hi,
wichtounet wrote:I was wondering what can cause a canonical failure ?
For long mode paging, the virtual address space is split in 2 usable parts with a hole in the middle, like this:
  • 0x0000000000000000 to 0x00007FFFFFFFFFFF = usable (typically "user space")
  • 0x0000800000000000 to 0xFFFF7FFFFFFFFFFF = unusable (not "canonical")
  • 0xFFFF800000000000 to 0xFFFFFFFFFFFFFFFF = usable (typically "kernel space")
Basically, the highest 17 bits of a virtual address (bits 47 to 63) must all be the same (either all 1 or all 0).
wichtounet wrote:From what I've found, it happens when an address uses more than 48 bits. Is that the only case where it can occur?
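As a minimal sketch of that rule (the function name `is_canonical48` is made up for illustration, and 48-bit virtual addressing is assumed, as in the ranges listed above), the check could look like this in C++:

```cpp
#include <cstdint>

// Returns true if `addr` is canonical for 48-bit virtual addressing:
// bits 63..47 (the 17 highest bits) must be all zeros or all ones.
constexpr bool is_canonical48(std::uint64_t addr) {
    return (addr >> 47) == 0 || (addr >> 47) == 0x1FFFF;
}

// The boundaries of the three ranges listed above.
static_assert(is_canonical48(0x00007FFFFFFFFFFFULL), "top of the lower half");
static_assert(!is_canonical48(0x0000800000000000ULL), "start of the hole");
static_assert(!is_canonical48(0xFFFF7FFFFFFFFFFFULL), "end of the hole");
static_assert(is_canonical48(0xFFFF800000000000ULL), "start of the upper half");
```

Any store through a pointer for which this returns false would be rejected before paging is even consulted, which matches the kind of error Bochs reports.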
There are no other cases where virtual addresses can be non-canonical.


Cheers,

Brendan

Re: What can cause a canonical failure in Bochs ?

Posted: Tue Jan 07, 2014 3:28 am
by wichtounet
Thanks for the complete answer :) I didn't know about the hole.

As there is no other case causing a canonical failure, it is probably my bootloader messing things up again :(

Re: Strange errors when kernel goes bigger

Posted: Tue Jan 07, 2014 2:59 pm
by wichtounet
This problem is driving me crazy; after hours of debugging, I still haven't found a solution :x

I was able to pinpoint the assembly instruction causing the issue and it is indeed a canonical failure. What I don't understand is where it comes from.

The line is this one: mov qword ptr ds:[rsi], 0x3fe0

A. When the kernel is 32248 bytes, it works perfectly; rsi has a value that makes sense (0x100000).
B. When I uncomment some code (code that is never executed) and the kernel gets bigger (37368 bytes), it doesn't work because rsi has a value that makes no sense at all (0x737361207961772D).

I noticed something that could be interesting. If I compile the kernel that works (A.) with mcmodel=large, it fails in the same fashion as (B.), with a different rsi value (0x746169636F737361). Could it be something going on with my compile options?
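An observation that is not part of the original debugging: both bogus rsi values consist entirely of printable ASCII bytes, which hints that the register was loaded from string data rather than from a real pointer. A small sketch to decode them (the helper names are hypothetical, and it runs on a host machine, not in the kernel):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Interprets a 64-bit value as the 8 bytes it occupies in little-endian
// memory and returns them as text ('.' replaces non-printable bytes).
std::string qword_as_ascii(std::uint64_t value) {
    std::string text;
    for (int i = 0; i < 8; ++i) {
        const unsigned char byte = static_cast<unsigned char>(value >> (8 * i));
        text += (byte >= 0x20 && byte < 0x7F) ? static_cast<char>(byte) : '.';
    }
    return text;
}

// The two rsi values quoted in the post.
void check_values_from_post() {
    assert(qword_as_ascii(0x737361207961772DULL) == "-way ass");  // case B
    assert(qword_as_ascii(0x746169636F737361ULL) == "associat");  // case A with mcmodel=large
}
```

Which string these fragments belong to is not stated in the thread, so this only narrows down the kind of memory rsi was read from.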

Here are the compile and link flags:

Code:

CC=x86_64-elf-g++
AS=x86_64-elf-as
OC=x86_64-elf-objcopy

WARNING_FLAGS=-Wall -Wextra -pedantic -Wold-style-cast -Wshadow
COMMON_CPP_FLAGS=-masm=intel -Iinclude/ -nostdlib -Os -std=c++11 -fno-stack-protector -fno-exceptions -funsigned-char -fno-rtti -ffreestanding -fomit-frame-pointer -mno-red-zone -mno-3dnow -mno-mmx -fno-asynchronous-unwind-tables -mcmodel=large

CPP_FLAGS_LOW=-march=i386 -m32 -fno-strict-aliasing -fno-pic -fno-toplevel-reorder -mno-sse -mno-sse2 -mno-sse3 -mno-sse4 -mno-sse4.1 -mno-sse4.2

CPP_FLAGS_16=$(COMMON_CPP_FLAGS) $(CPP_FLAGS_LOW) -mregparm=3 -mpreferred-stack-boundary=2
CPP_FLAGS_32=$(COMMON_CPP_FLAGS) $(CPP_FLAGS_LOW) -mpreferred-stack-boundary=4
CPP_FLAGS_64=$(COMMON_CPP_FLAGS) -mno-sse3 -mno-sse4 -mno-sse4.1 -mno-sse4.2

COMMON_LINK_FLAGS=-lgcc
I also attached the linker script.

It doesn't seem to come from the bootloader, since the correct code seems to be executed in both cases: when I debug with magic breakpoints, I land on the same lines of code in A and B.

I'm open to any idea, as mine didn't make a difference :(

I know this is not much information, but I don't know what else to include, as the problem really seems weird.

Thanks

Re: Strange errors when kernel goes bigger

Posted: Tue Jan 07, 2014 3:15 pm
by Combuster
Sounds like a typical case of the bootloader not loading enough so that part of the binary gets replaced with garbage.

Re: Strange errors when kernel goes bigger

Posted: Tue Jan 07, 2014 3:37 pm
by wichtounet
Combuster wrote:Sounds like a typical case of the bootloader not loading enough so that part of the binary gets replaced with garbage.
Yes, I know, that is what I thought at first, but I haven't found an error in the bootloader. And the fact that mcmodel=large breaks the code seems to point the same way, doesn't it?

The bootloader is using FAT32 to read the file. The file is 37368 bytes and 10 clusters of 4096 bytes are read and loaded into memory.

Here are the values loaded:

Code:

Cluster  LBA   Segment  Offset
3        4050  1536     0
4        4058  1792     0
5        4066  2048     0
6        4074  2304     0
7        4082  2560     0
8        4090  2816     0
9        4098  3072     0
10       4106  3328     0
11       4114  3584     0
12       4122  3840     0
I'll check it again tomorrow, but it seems to me that enough data is loaded.
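The arithmetic behind that table can be double-checked with a short sketch. The 4096-byte cluster size and the table values come from the post; the 512-byte sector size and the helper names are assumptions for illustration:

```cpp
#include <cstdint>

constexpr std::uint32_t kClusterSize = 4096;  // bytes per cluster, from the post
constexpr std::uint32_t kSectorSize  = 512;   // assumption: 512-byte sectors

// Number of clusters needed to hold `file_size` bytes, rounded up.
constexpr std::uint32_t clusters_needed(std::uint32_t file_size) {
    return (file_size + kClusterSize - 1) / kClusterSize;
}

// Real-mode segment of the n-th loaded cluster (n starts at 0).
// One segment unit is 16 bytes, so each cluster advances by 4096 / 16 = 256.
constexpr std::uint32_t segment_for(std::uint32_t first_segment, std::uint32_t n) {
    return first_segment + n * (kClusterSize / 16);
}

static_assert(clusters_needed(37368) == 10, "10 clusters cover the 37368-byte file");
static_assert(segment_for(1536, 9) == 3840, "matches the last row of the table");
static_assert(kClusterSize / kSectorSize == 8, "the LBA advances by 8 per cluster");
```

So the file itself is fully covered by the 10 clusters; this check says nothing about memory the kernel uses beyond the end of the file.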

Re: Strange errors when kernel goes bigger

Posted: Tue Jan 07, 2014 3:45 pm
by Owen
The obvious thing to do now is to set a breakpoint somewhere before rsi is loaded with the bogus value and step through it to see where it comes from.

-mcmodel=large causing issues would agree with the possibility of the bootloader not loading enough; it tends to increase the size of the binary somewhat.

Re: Strange errors when kernel goes bigger

Posted: Tue Jan 07, 2014 4:09 pm
by iansjack
Stack colliding with code perhaps? As Owen said, time to start debugging.

Re: Strange errors when kernel goes bigger

Posted: Wed Jan 08, 2014 1:10 am
by wichtounet
Thanks for your ideas.
Owen wrote:-mcmodel=large causing issues would agree with the possibility of the bootloader not loading enough; it tends to increase the size of the binary somewhat.
Yes, you're right, I didn't think about that; the kernel is indeed quite a bit larger with the large mcmodel.
iansjack wrote:Stack colliding with code perhaps?
Unfortunately not, the stack is set at 0x0:0x4000:

Code:

    xor ax, ax
    mov ss, ax
    mov sp, 0x4000
Owen wrote:The obvious thing to do now is to set a breakpoint somewhere before rsi is loaded with the bogus value and step through it to see where it comes from.
iansjack wrote:As Owen said, time to start debugging.
I'll have to try to use remote gdb on Bochs, because the generated assembly code is too big and complicated to be debugged by hand. I hope that works...

Re: Strange errors when kernel goes bigger

Posted: Wed Jan 08, 2014 2:35 pm
by wichtounet
I finally found the problem...

The .bss section was not included in my flat binary :(

I was creating the flat binary using objcopy:

$(OC) -R .note -R .comment -O binary kernel.bin.o kernel.bin

but that did not include the .bss section. I had to use this command:

$(OC) -R .note -R .comment -O binary --set-section-flags .bss=alloc,load,contents kernel.bin.o kernel.bin

I cannot believe I lost so much time on this single error :(

Re: Strange errors when kernel goes bigger

Posted: Thu Jan 09, 2014 9:33 am
by xenos
Well, normally .bss should not be included in the binary, as it should not contain any data before the kernel writes something there. Initially it should contain only zeros (and including those zeros into the kernel file is a waste of space). So the bootloader should load your kernel's .code and .data sections and write zeros to the .bss. However, as you are obviously using a flat binary, the bootloader does not know anything about the .bss, and therefore does not reserve that space, unless you forcefully include those zeros into the file.

The proper solution to this would be to use a binary format with a section (or rather segment) header such as ELF. Then the bootloader can read this header and determine which parts of the kernel file should be loaded where in memory, and which should initially be filled with zeros.
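A rough sketch of what such a loader does for one loadable segment, written in C++ for readability rather than 16-bit assembly. The field names `p_offset`, `p_filesz` and `p_memsz` follow the ELF64 program header; the `LoadSegment` struct and `load_segment` function are simplified stand-ins, not a full ELF loader:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Simplified stand-in for the ELF64 program header fields that matter here.
struct LoadSegment {
    std::uint64_t p_offset;  // where the segment's bytes start in the file
    std::uint64_t p_filesz;  // bytes present in the file (code, data)
    std::uint64_t p_memsz;   // bytes occupied in memory (p_filesz plus .bss)
};

// Copies the file-backed part of a segment to `dest`, then zero-fills the
// remainder up to p_memsz. That remainder is where .bss lives.
void load_segment(const unsigned char* file, const LoadSegment& seg,
                  unsigned char* dest) {
    assert(seg.p_memsz >= seg.p_filesz);
    std::memcpy(dest, file + seg.p_offset,
                static_cast<std::size_t>(seg.p_filesz));
    std::memset(dest + seg.p_filesz, 0,
                static_cast<std::size_t>(seg.p_memsz - seg.p_filesz));
}
```

The difference between `p_memsz` and `p_filesz` is exactly the information a flat binary has no place to store.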

Re: Strange errors when kernel goes bigger

Posted: Thu Jan 09, 2014 10:53 am
by wichtounet
XenOS wrote:Well, normally .bss should not be included in the binary, as it should not contain any data before the kernel writes something there. Initially it should contain only zeros (and including those zeros into the kernel file is a waste of space). So the bootloader should load your kernel's .code and .data sections and write zeros to the .bss. However, as you are obviously using a flat binary, the bootloader does not know anything about the .bss, and therefore does not reserve that space, unless you forcefully include those zeros into the file.

The proper solution to this would be to use a binary format with a section (or rather segment) header such as ELF. Then the bootloader can read this header and determine which parts of the kernel file should be loaded where in memory, and which should initially be filled with zeros.
Yes, I know that, but I don't want to implement all this in my bootloader in 16-bit assembly... That is what I'm going to do in my OS when I load programs. I want to spend the little time I have writing my OS, not spend too much of it on my bootloader (I know, I should use GRUB in this case, but I don't like that option either). Moreover, as a flat binary is not ELF, I would have thought that the .bss section would be included by default.

Re: Strange errors when kernel goes bigger

Posted: Fri Jan 10, 2014 10:27 pm
by palk
The BSS is never included because there's nothing in it. It's the loader's job to zero the BSS, but it's not a bad idea to also zero the BSS in your kernel as a "just in case."

It's pretty simple code... no reason not to do it in real mode unless the addressing prevents it. All you have to do is set up ES:(E)DI to point to the start of the BSS segment, clear AL, set up (E)CX with the size of the BSS segment, and then just REP STOSB.

Example code for real mode:

Code:

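; Assumes BSS_START is below 1 MiB and BSS_SIZE is below 64 KiB
mov ax, (BSS_START >> 4)    ; real-mode segment = physical address / 16
mov es, ax                  ; segment registers cannot take an immediate
mov di, (BSS_START & 0xF)   ; remaining offset within that segment
xor al, al                  ; fill value: zero
mov cx, BSS_SIZE            ; number of bytes to clear
cld                         ; make STOSB advance DI upward
rep stosb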

Re: Strange errors when kernel goes bigger

Posted: Sat Jan 11, 2014 4:04 am
by wichtounet
Yes, the assembly code for writing zeros is simple... but the assembly code to find BSS_START and BSS_SIZE is not... Moreover, it implies an ELF executable, which I don't have, since I'm only using a flat binary.

Re: Strange errors when kernel goes bigger

Posted: Sat Jan 11, 2014 1:14 pm
by xenos
wichtounet wrote:Yes, the assembly code for writing zeros is simple... but the assembly code to find BSS_START and BSS_SIZE is not... Moreover, it implies an ELF executable, which I don't have, since I'm only using a flat binary.
Why? It is very simple to define these two symbols in your linker script and to use them in your assembly. This also works for flat binary files.
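For illustration, here is the same idea seen from the C++ side of the kernel instead of assembly. The symbol names `__bss_start` and `__bss_end` are hypothetical and are not taken from the linker script attached to the thread; the sketch keeps them in comments so that it also compiles on a host:

```cpp
#include <cassert>

// In a real kernel the two boundary symbols would be defined in the linker
// script, for example (hypothetical names):
//     .bss : { __bss_start = .; *(.bss) *(COMMON) __bss_end = .; }
// and declared in the kernel as:
//     extern "C" char __bss_start[], __bss_end[];

// Zeroes the byte range [begin, end).
void zero_range(char* begin, char* end) {
    assert(begin <= end);  // host-only sanity check; drop it in a freestanding kernel
    for (char* p = begin; p != end; ++p) {
        *p = 0;
    }
}

// Early in kernel startup one would then call:
//     zero_range(__bss_start, __bss_end);
```

Because the symbols are resolved at link time, this works with a flat binary just as well as with ELF, provided the memory after the loaded image is actually free to use.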