BSS not getting zeroed properly on my laptop?

8infy
Member
Posts: 185
Joined: Sun Apr 05, 2020 1:01 pm

BSS not getting zeroed properly on my laptop?

Post by 8infy »

Hi! I have a problem with my kernel not working on one of my laptops (by "not working" I mean triple faulting in a loop).
(It's an old HP laptop with an Intel i5-5xx-something CPU.)

So far I've tested on every emulator I could get my hands on (VMware, VirtualBox, Bochs, QEMU) plus a few other laptops and desktops, and it worked fine on all of them.
However, on this one laptop it triple faults, apparently because the BSS isn't being zeroed properly. (I should also mention that I'm using my own bootloader and a flat binary, so I'm the one responsible for zeroing the BSS.)
After hours of debugging, I realized that it triple faults in a function called very early from kernel main, and after adding a few tests I've confirmed that. It hangs in this code:

Code: Select all

    // logger initialization function
    void Logger::initialize()
    {
        // This is a test line of code I've added and indeed it hangs here
        // these are both static pointer members of the logger class and are supposed to be zeroed
        if (s_sinks || s_write_lock) {
            hang(); // hangs here
        }

        ASSERT(s_sinks == nullptr);
        ASSERT(s_write_lock == nullptr);

        s_sinks = new DynamicArray<LogSink*>(2);

        if (E9LogSink::is_supported())
            s_sinks->emplace(new E9LogSink());

        s_sinks->emplace(new SerialSink());

        s_write_lock = new InterruptSafeSpinLock;
    }
Removing both assertions as well as my test if statement lets the laptop boot further; however, it still hangs later during scheduler initialization,
on what looks like another static pointer that's supposed to be zero.

My linker script:

Code: Select all

ENTRY(start)
OUTPUT_FORMAT("binary")

SECTIONS
{
    kernel_space_begin = 0xFFFFFFFF80000000; /* MAX - 2GB */
    . = kernel_space_begin + 0x100000; /* 1MB into the address space */

    .text ALIGN(4K) : AT (ADDR (.text) - kernel_space_begin)
    {
        *(.entry)
        *(.text)
    }

    .rodata ALIGN(4K) : AT (ADDR (.rodata) - kernel_space_begin)
    {
        global_constructors_begin = .;
        *(.ctors)
        global_constructors_end = .;

        *(.rodata)
    }

    .data ALIGN(4K) : AT (ADDR (.data) - kernel_space_begin)
    {
        *(.data)
    }

    .magic ALIGN(4K) : AT (ADDR (.magic) - kernel_space_begin)
    {
        *(.magic)
    }


    section_bss_begin = .;
    .bss ALIGN(4K) : AT (ADDR (.bss) - kernel_space_begin)
    {
        *(COMMON)
        *(.bss)
    }
    section_bss_end = .;
    section_bss_size = section_bss_end - section_bss_begin;
}
Code for zeroing the BSS:

Code: Select all

    ; zero section bss
    mov rdi, section_bss_begin
    mov rcx, section_bss_size
    mov rbp, rax ; save the rax
    mov rax, 0
    rep stosb
    mov rax, rbp
It could of course be related to some other issue, maybe some other part of the kernel overwriting the BSS for some reason, but that doesn't happen on any other device or emulator I have.
Anyway, if you have any idea what I'm doing wrong here, I would really appreciate it if you could tell me; I'm kind of out of ideas.
Octocontrabass
Member
Posts: 5568
Joined: Mon Mar 25, 2013 7:01 pm

Re: BSS not getting zeroed properly on my laptop?

Post by Octocontrabass »

Hard to say what could be going wrong without seeing the rest of your code.
8infy wrote:

Code: Select all

    rep stosb
You did clear the direction flag at some point, right?
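A hosted C++ model makes the direction-flag concern concrete. This is not kernel code: rep_stosb is a hypothetical helper and a std::vector stands in for memory. With DF clear, each store moves RDI up by one; with DF set, it moves RDI down, so the same zeroing loop would wipe the bytes below section_bss_begin and leave .bss itself untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of "rep stosb": store AL at [RDI], RCX times.
// After each store RDI is incremented when DF = 0 (after cld)
// or decremented when DF = 1 (after std).
// "Memory" is a byte vector indexed by address; the caller keeps
// the touched range inside the vector.
void rep_stosb(std::vector<std::uint8_t>& mem, std::size_t rdi,
               std::uint8_t al, std::size_t rcx, bool direction_flag)
{
    while (rcx != 0) {
        assert(rdi < mem.size());
        mem[rdi] = al;
        rdi = direction_flag ? rdi - 1 : rdi + 1;
        --rcx;
    }
}
```

Since nothing guarantees the state of DF when the kernel's entry point is reached, a cld before the rep stosb is the usual precaution.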
8infy
Member
Posts: 185
Joined: Sun Apr 05, 2020 1:01 pm

Re: BSS not getting zeroed properly on my laptop?

Post by 8infy »

Octocontrabass wrote:You did clear the direction flag at some point, right?
Good catch, I don't think I did. I just added cld at the top of the entry point; however, it doesn't seem to help.

I swear, if I just comment out the assertions and my if statement, it gets super far into the boot process.
Here's me testing the panic screen (way after that logging initialization function):
[Image: panic screen test]

I'm almost 100% certain it's because of garbage in the BSS, and I have absolutely no clue how it gets there...
Who's messing with my memory? :cry:
linuxyne
Member
Posts: 211
Joined: Sat Jul 02, 2016 7:02 am

Re: BSS not getting zeroed properly on my laptop?

Post by linuxyne »

This might not fix the problem, but section_bss_begin is assigned a value first, and only then is the alignment for the .bss section performed. If the location counter isn't already 4K-aligned at that point, section_bss_begin and the start of .bss end up being different values.

Code: Select all

    section_bss_begin = .;
    .bss ALIGN(4K) : AT (ADDR (.bss) - kernel_space_begin)
    {
        *(COMMON)
        *(.bss)
    }
    section_bss_end = .;
You might want to try the following:

Code: Select all

    .bss ALIGN(4K) : AT (ADDR (.bss) - kernel_space_begin)
    {
        section_bss_begin = .;
        *(COMMON)
        *(.bss)
        section_bss_end = .;
    }
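The alignment effect can be checked with the arithmetic ALIGN(4K) performs on the location counter. This is a hosted sketch; align_up and layout_with_symbol_outside are hypothetical names, and the starting address 0xffffffff80125123 is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Models ld's ALIGN(align): round the location counter up to a
// power-of-two boundary.
std::uint64_t align_up(std::uint64_t addr, std::uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0); // power of two only
    return (addr + align - 1) & ~(align - 1);
}

// "section_bss_begin = .;" placed *before* ".bss ALIGN(4K)" captures the
// unaligned location counter, so it can be lower than ADDR(.bss).
struct BssSymbols {
    std::uint64_t section_bss_begin; // value of '.' before alignment
    std::uint64_t bss_start;         // ADDR(.bss) after ALIGN(4K)
};

BssSymbols layout_with_symbol_outside(std::uint64_t dot)
{
    return { dot, align_up(dot, 0x1000) };
}
```

Since section_bss_end is still taken after .bss, the range from section_bss_begin to section_bss_end covers all of .bss plus the alignment padding in front of it, which fits the caveat that this change alone might not fix the problem.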

MichaelPetch
Member
Posts: 797
Joined: Fri Aug 26, 2016 1:41 pm
Libera.chat IRC: mpetch

Re: BSS not getting zeroed properly on my laptop?

Post by MichaelPetch »

Did you properly enable the A20 line? Did you read enough sectors into memory? Were the sectors loaded in the proper order? Did you properly call the global constructors, since this is C++? Did you use inline assembly but miss something that caused undefined behaviour? Did you write assembly code called from C++ that didn't properly follow the calling convention, causing undefined behaviour? The issue could be any of a multitude of things not directly related to the code you are showing or to how you are building the bootloader/kernel. If you put your project on GitHub or a similar service, we might be able to see what the possible problems are.
8infy
Member
Posts: 185
Joined: Sun Apr 05, 2020 1:01 pm

Re: BSS not getting zeroed properly on my laptop?

Post by 8infy »

MichaelPetch wrote:Did you properly enable the A20 line? [...] If you put your project into Github or some other similar service we might be able to see what the possible problems are.
Those are all possible, of course, but extremely unlikely, because, like I said, I've tried to reproduce this many times on emulators and real hardware and just couldn't.
It only happens on this specific laptop. That said, I have the entire kernel+bootloader code up on GitHub, but I didn't think anyone would be interested enough to investigate it,
so I just provided a condensed description of the problem. Here's the link to the repo in case anyone wants to try, or can spot an obvious error: https://github.com/UltraOS/Ultra
MichaelPetch
Member
Posts: 797
Joined: Fri Aug 26, 2016 1:41 pm
Libera.chat IRC: mpetch

Re: BSS not getting zeroed properly on my laptop?

Post by MichaelPetch »

Problems like this usually happen because something hasn't been initialized or properly set up, and real hardware is generally less forgiving about leaving things in a state you may have made assumptions about.

Just because it works in an emulator doesn't mean it will work on real hardware. QEMU (without KVM), for example, skips a lot of checks for speed and efficiency.

Timing issues on real hardware can also make things behave differently than in an emulator.

Famous last words I often hear: "but it works on all the emulators, so the code must not be wrong". It is almost certain you have made some kind of assumption or coding error that manifests itself as an unexpected failure on a real system.
MichaelPetch
Member
Posts: 797
Joined: Fri Aug 26, 2016 1:41 pm
Libera.chat IRC: mpetch

Re: BSS not getting zeroed properly on my laptop?

Post by MichaelPetch »

I did see you provided a link to the code. I can look at it later, but maybe others have time now. I wanted to point out that your bootloader makes the false assumption that the segment register DS is 0, which isn't guaranteed. About the only things you can assume are that DL holds the BIOS boot drive number and that a small BIOS stack (of unknown size) is set up somewhere in RAM.

Code: Select all

init:
    ; initialize memory segments
    cli
    mov [boot_drive], dl                         ; <----- DS may not be zero so these 2 moves may not update right memory
    mov [this_partition], bx
    xor ax, ax
    mov ds, ax
    mov es, ax
    mov ss, ax
    mov sp, 0x7C00
    sti
I'm not in any way suggesting this has anything to do with the problems you see, but it is an example, in the first few lines of code, of assumptions you are making about memory and registers that may not hold on real hardware.
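The order of those instructions matters because of real-mode address translation: a store to [boot_drive] goes to physical address DS × 16 + offset. A small hosted sketch of that arithmetic follows; real_mode_physical is a hypothetical helper, and the offset 0x7C10 is a made-up label address assuming the code is assembled with org 0x7C00.

```cpp
#include <cassert>
#include <cstdint>

// Real-mode translation: physical = segment * 16 + offset.
// (A20 wrap-around is ignored; it only affects results at or above 1 MiB.)
std::uint32_t real_mode_physical(std::uint16_t segment, std::uint16_t offset)
{
    return (static_cast<std::uint32_t>(segment) << 4) + offset;
}
```

So with DS = 0x07C0 the same store lands 0x7C00 bytes higher than intended (0xF810 instead of 0x7C10). Loading DS before the first memory access avoids depending on whatever the previous stage left in it.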
8infy
Member
Posts: 185
Joined: Sun Apr 05, 2020 1:01 pm

Re: BSS not getting zeroed properly on my laptop?

Post by 8infy »

MichaelPetch wrote:[...] I wanted to point out that in your bootloader you make a false assumption that the segment register DS is set to 0 which isn't guaranteed. [...]
Thanks, it's probably something subtle like this that breaks everything :roll: I'll fix this specific thing.
(UPD: I think you were looking at the VBR code first, which is loaded by the MBR, so the state is well defined in this case. Yeah, I have this kinda legacy approach where I go MBR->VBR->...
I'm planning to rewrite the entire bootloader to be 2 stages, with the 2nd stage written entirely in C++; I just couldn't find the time to do it.
And the bootloader I have right now works literally everywhere and I've never had problems with it, so I don't think it's a priority right now.)
linuxyne
Member
Posts: 211
Joined: Sat Jul 02, 2016 7:02 am

Re: BSS not getting zeroed properly on my laptop?

Post by linuxyne »

Could this be a problem? Look at the 2nd instruction below: sub doesn't support an imm64 operand.

Code: Select all

    mov rcx, section_bss_end
    sub rcx, section_bss_begin
    mov rdi, section_bss_begin

Code: Select all

  17:	48 b9 00 00 00 00 00 	movabs $0x0,%rcx
  1e:	00 00 00 
  21:	48 81 e9 00 00 00 00 	sub    $0x0,%rcx
  28:	48 bf 00 00 00 00 00 	movabs $0x0,%rdi
  2f:	00 00 00 

Code: Select all

0000000000000019 R_X86_64_64       section_bss_end
0000000000000024 R_X86_64_32S      section_bss_begin
000000000000002a R_X86_64_64       section_bss_begin
---

Edit: I think it's not a problem.

Code: Select all

.data:0000000d 48 bc 50 bb 12 80 ff ff ff ff    movabs rsp,0xffffffff8012bb50
.data:00000017 48 b9 50 bb 12 80 ff ff ff ff    movabs rcx,0xffffffff8012bb50
.data:00000021 48 81 e9 00 60 12 80             sub    rcx,0xffffffff80126000
.data:00000028 48 bf 00 60 12 80 ff ff ff ff    movabs rdi,0xffffffff80126000
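The last listing can be checked by hand. sub r/m64, imm32 (REX.W 81 /5, here 48 81 e9 with RCX as the destination) sign-extends its 32-bit immediate, so the bytes 00 60 12 80 encode 0x80126000, which the CPU treats as 0xffffffff80126000. A hosted sketch of that decoding, with sign_extended_imm32 as a hypothetical helper:

```cpp
#include <cassert>
#include <cstdint>

// Decode the little-endian imm32 that follows "48 81 e9" (sub rcx, imm32)
// and sign-extend it to 64 bits the way the CPU does.
std::uint64_t sign_extended_imm32(const std::uint8_t bytes[4])
{
    std::uint32_t raw = static_cast<std::uint32_t>(bytes[0])
                      | static_cast<std::uint32_t>(bytes[1]) << 8
                      | static_cast<std::uint32_t>(bytes[2]) << 16
                      | static_cast<std::uint32_t>(bytes[3]) << 24;
    // uint32 -> int32 wraps modulo 2^32 (guaranteed in C++20, and what
    // GCC/Clang do in earlier modes); int32 -> int64 then sign-extends.
    auto as_signed = static_cast<std::int32_t>(raw);
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(as_signed));
}
```

Subtracting that from 0xffffffff8012bb50 leaves 0x5b50 (23376 bytes) in RCX, i.e. the BSS size the code intends.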

8infy
Member
Posts: 185
Joined: Sun Apr 05, 2020 1:01 pm

Re: BSS not getting zeroed properly on my laptop?

Post by 8infy »

linuxyne wrote:Could this be a problem - 2nd instruction below? sub doesn't support imm64. [...]

I'm super confused: I looked in the Intel manual and I can't find sub with an imm64 either.
Yet here I'm testing this code on Bochs and it looks correct?
[Image: Bochs screenshot]

Anyway, the code I currently have computes the BSS size directly in the linker script (the version I posted with this question), so I'm not doing any sub there at all.
linuxyne
Member
Posts: 211
Joined: Sat Jul 02, 2016 7:02 am

Re: BSS not getting zeroed properly on my laptop?

Post by linuxyne »

Someone with more knowledge about this can perhaps verify.

NASM isn't going to use or emit anything like sub with an imm64. But then, would the linker take care of confirming that truncating the address is safe (here it indeed is, since the kernel is mapped in the top 2GB), and would it fail to link if it determined that the truncation isn't safe?

Edit: I guess that, since we are using the canonical form of addresses, such a subtraction on addresses is safe. (But it probably wouldn't work for addresses that don't fit in a sign-extended 32-bit value.)
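The condition behind this is that the 64-bit value must survive truncation to 32 bits followed by sign extension, which is what an R_X86_64_32S field requires. That only holds in the bottom 2 GiB and the top 2 GiB of the address space, which is why a kernel linked at 0xFFFFFFFF80000000 and above is fine. A hosted sketch of the check (fits_sign_extended_imm32 is a hypothetical name):

```cpp
#include <cassert>
#include <cstdint>

// True if v can be encoded as a sign-extended 32-bit immediate, i.e. it
// round-trips through "truncate to 32 bits, then sign-extend to 64".
bool fits_sign_extended_imm32(std::uint64_t v)
{
    const auto truncated = static_cast<std::int32_t>(static_cast<std::uint32_t>(v));
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(truncated)) == v;
}
```

When the check fails, GNU ld refuses to link and reports a "relocation truncated to fit" error rather than silently producing a wrong address.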
kzinti
Member
Posts: 898
Joined: Mon Feb 02, 2015 7:11 pm

Re: BSS not getting zeroed properly on my laptop?

Post by kzinti »

linuxyne wrote:But then, would the linker take care of confirming that the truncation of the address is safe (here it indeed is, since mapping in -2GB, etc.), and would it fail to link if it determined that the truncation isn't safe?
Yes, if it was a problem, the assembler or linker would complain.

It would probably be easier to just have your assembly code call a C function to clear the BSS.

Code: Select all

    call clearBss
and then:

Code: Select all

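#include <stdint.h> // uintptr_t; memset() is assumed to come from the kernel's own library

extern "C" void clearBss() // extern "C" so the assembly can call it by its unmangled name
{
    extern void* section_bss_begin[];
    extern void* section_bss_end[];

    auto bssSize = (uintptr_t)section_bss_end - (uintptr_t)section_bss_begin;
    memset(section_bss_begin, 0, bssSize);
}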
Personally I would call this function something like _init() and also use it to initialize other things like calling global constructors (if you have any).
Last edited by kzinti on Mon Oct 05, 2020 1:03 am, edited 2 times in total.
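One way to read the _init() suggestion is a single startup routine that clears the BSS and then walks the constructor list that the linker script already brackets with global_constructors_begin and global_constructors_end. The hosted sketch below takes the bounds as parameters so it can run outside a kernel; init_runtime and Constructor are hypothetical names, and it calls the entries front to back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

using Constructor = void (*)();

// In the kernel the four bounds would be section_bss_begin/section_bss_end
// and global_constructors_begin/global_constructors_end from the linker script.
void init_runtime(unsigned char* bss_begin, unsigned char* bss_end,
                  Constructor* ctors_begin, Constructor* ctors_end)
{
    assert(bss_begin <= bss_end);
    // Zero the BSS first: constructors may read zero-initialized statics.
    std::memset(bss_begin, 0, static_cast<std::size_t>(bss_end - bss_begin));

    // Then run every collected constructor in list order.
    for (Constructor* ctor = ctors_begin; ctor != ctors_end; ++ctor)
        (*ctor)();
}
```

Depending on how the cross-compiler is configured, GCC may emit constructors into .init_array rather than .ctors, in which case the linker script would need to collect that section as well.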
8infy
Member
Posts: 185
Joined: Sun Apr 05, 2020 1:01 pm

Re: BSS not getting zeroed properly on my laptop?

Post by 8infy »

kzinti wrote:It probably would be easier to just have your assembler call a C function to clear the BSS. [...] Personally I would call this function something like _init() and also use it to initialize other things like calling global constructors (if you have any).
Yeah, it would be easier to do it this way; I'm not sure why I'm not. I am calling global constructors in C++, though, but I do that after initializing the heap allocator.
linuxyne
Member
Posts: 211
Joined: Sat Jul 02, 2016 7:02 am

Re: BSS not getting zeroed properly on my laptop?

Post by linuxyne »

kzinti wrote: Yes, if it was a problem, the assembler or linker would complain.
Thanks for the clarification.

---

Is it possible for s_sinks to lie outside of the BSS?

Below are the limits of the bss for a run:

Code: Select all

0xffffffff80126000 to 0xffffffff8012bb50
See below, which is very likely the Logger::initialize function comparing the s_sinks and s_write_lock resp.

One can see that, for s_sinks, it looks beyond the limits of the BSS: 0xffffffff8012bb58.

Code: Select all

ffffffff801179e2:	83 3d 6f 41 01 00 00 	cmpl   $0x0,0x1416f(%rip)        # 0xffffffff8012bb58
ffffffff801179e9:	48 89 e5             	mov    %rsp,%rbp
ffffffff801179ec:	41 56                	push   %r14
ffffffff801179ee:	41 55                	push   %r13
ffffffff801179f0:	41 54                	push   %r12
ffffffff801179f2:	53                   	push   %rbx
ffffffff801179f3:	0f 85 71 02 00 00    	jne    0xffffffff80117c6a
ffffffff801179f9:	48 83 3d 4f 41 01 00 	cmpq   $0x0,0x1414f(%rip)        # 0xffffffff8012bb50
ffffffff80117a00:	00 
ffffffff80117a01:	0f 85 44 02 00 00    	jne    0xffffffff80117c4b
The disassembly was generated by running the following on Kernel.bin, whose first byte is at 0xffffffff80100000:

Code: Select all

objdump -b binary -D -m i386:x86-64 --adjust-vma=0xffffffff80100000
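Plugging the numbers from that run into a half-open range check shows that neither address compared in Logger::initialize is covered by the zeroing loop: 0xffffffff8012bb58 is past the end, and 0xffffffff8012bb50 is exactly section_bss_end, which is already one byte past the last zeroed byte. A hosted sketch (zeroed_by_bss_clear is a hypothetical name):

```cpp
#include <cassert>
#include <cstdint>

// True if an object of `size` bytes at `addr` lies entirely inside the
// half-open range [bss_begin, bss_end) that the zeroing loop clears.
bool zeroed_by_bss_clear(std::uint64_t addr, std::uint64_t size,
                         std::uint64_t bss_begin, std::uint64_t bss_end)
{
    assert(bss_begin <= bss_end);
    // Check addr <= bss_end first so the subtraction cannot wrap.
    return addr >= bss_begin && addr <= bss_end && size <= bss_end - addr;
}
```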
Edit:

Heh.

Code: Select all

0000000000000000 l    d  .bss._ZN6kernel6Logger7s_sinksE	0000000000000000 .bss._ZN6kernel6Logger7s_sinksE
0000000000000000  w    O .bss._ZN6kernel6Logger7s_sinksE	0000000000000008 _ZN6kernel6Logger7s_sinksE
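s_sinks lives in its own input section, .bss._ZN6kernel6Logger7s_sinksE, which the pattern *(.bss) does not match, so it apparently ends up outside the range between section_bss_begin and section_bss_end. You might want to try adding *(.bss.*) and/or *(.bss*), whichever is more general and allowed, to the .bss output section in the linker script.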
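Input-section patterns are matched against the whole section name: .bss selects only a section literally named .bss, whereas .bss.* and .bss* also select per-symbol sections like the one above (GCC emits these, for example, with -fdata-sections or for some weak/COMDAT definitions). A simplified hosted matcher illustrating this; it handles only the * wildcard, whereas ld also accepts ? and [...] classes, and section_pattern_matches is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Simplified model of ld input-section name matching, '*' wildcard only.
bool section_pattern_matches(std::string_view pattern, std::string_view name)
{
    if (pattern.empty())
        return name.empty();
    if (pattern.front() == '*') {
        // '*' matches any run of characters, including an empty one.
        for (std::size_t skip = 0; skip <= name.size(); ++skip)
            if (section_pattern_matches(pattern.substr(1), name.substr(skip)))
                return true;
        return false;
    }
    return !name.empty() && pattern.front() == name.front()
        && section_pattern_matches(pattern.substr(1), name.substr(1));
}
```

Note that .bss* is the broader of the two: it matches .bss itself as well as .bss.anything, while .bss.* does not match plain .bss.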