OSDev.org

Posted: **Tue Apr 22, 2025 9:27 pm**

Currently, I'm implementing fork() and multitasking in a custom x86 kernel and have encountered a very strange bug. I've debugged carefully and reviewed the code multiple times, but I still can't figure out what's wrong, so I'm seeking advice from anyone with OS development experience.

Here's the flow of execution:

Code:

void init_process(void)
{
    struct task_struct *task;

    task = kmalloc(sizeof(struct task_struct));
    if (!task)
        do_panic("init process failed");
    
    task->pid = INIT_PROCESS_PID;
    task->uid = task->pid; // stub
    task->cr3 = current_cr3();
    task->esp0 = (uint32_t)&stack_top;
    task->time_slice_remaining = DEFAULT_TIMESLICE;
    task->parent = NULL;
    task->vblocks.by_base = RB_ROOT;
    task->vblocks.by_size = RB_ROOT;
    task->mapping_files.by_base = RB_ROOT;
    task->state = PROCESS_RUNNING;

    init_list_head(&task->children);
    init_list_head(&task->ready);
    
    pid_table_register(task);
    
    current = task;

    exec_fn(init_process_code);
}

This function is called in kernel mode. It allocates and initializes the first user process structure, then calls exec_fn() to jump into user space.

Code:

void exec_fn(void (*func)())
{
    user_vspace_clean(&current->vblocks, &current->mapping_files, CL_TLB_INVL | CL_RECYCLE);
    user_vspace_init();
    memcpy((void *)USER_CODE_BASE, func, USER_CODE_SIZE);
    set_rdonly();
    jmp_to_entry_point();
}

This function removes all previous user virtual memory mappings, initializes a new space (currently only code and stack regions are defined with static base/size), copies the function pointed to by func to the user code segment, marks it read-only, and jumps to the entry point.

Code:

void init_process_code(void)
{
    char buf[10];
    char base[11] = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '\0'};
    int pid;
    int magic_number = 1234567;

    for (int i = 0; i < 10; i++) {
        pid = fork();
        if (pid < 0) {
            write("fork failed\n");
            while (true);
        }
        if (pid == 0) {
            write("child created my pid: ");
            t_number_to_string(buf, getuid(), 10, base);
            write(buf);
            magic_number++;
            write("magic number: ");
            t_number_to_string(buf, magic_number, 10, base);
            write(buf);
            write("\n"); 
            while (true);
        }
        else {
            write("fork succeed child pid: ");
            t_number_to_string(buf, pid, 10, base);
            write(buf);
            write("\n");
        }
    }
    if (magic_number != 1234567)
        write("cow failed.. \n");
    else
        write("ok..");
    while(true);
}

This is a simple test to check whether fork() and multitasking are working properly. All functions used here are marked with __attribute__((always_inline)), so the code copied into the user region contains no calls back into kernel address space. System calls are made via int $0x80.

However, only the else block executes, and the child process (if (pid == 0) block) never runs. It seems like only the parent continues executing, and I can’t figure out why. To narrow things down, I simplified the code like this:

Code:

void init_process_code(void)
{
    int pid;

    pid = fork();
    while (true);
}

Here’s my fork() implementation:

Code:

static inline int create_child(struct task_struct **out_child)
{
    struct task_struct *child;
    int ret = -ENOMEM;

    child = kmalloc(sizeof(struct task_struct));
    if (!child)
        goto out;

    child->pid = alloc_pid();
    if (child->pid == PID_NONE) {
        ret = -EAGAIN;
        goto out_clean_child;
    }
    child->uid = child->pid; // stub
    
    child->esp0 = (uint32_t)kmalloc(KERNEL_STACK_SIZE);
    if (!child->esp0)
        goto out_clean_pid;
    child->esp0 += KERNEL_STACK_SIZE;
    
    child->vblocks.by_base = RB_ROOT;
    child->vblocks.by_size = RB_ROOT;
    if (vblocks_clone(&child->vblocks) < 0)
        goto out_clean_kstack;
    
    child->mapping_files.by_base = RB_ROOT;
    if (mapping_files_clone(&child->mapping_files) < 0)
        goto out_clean_vblocks;

    pages_set_cow();
    child->cr3 = pgdir_clone();
    
    child->time_slice_remaining = DEFAULT_TIMESLICE;
    child->parent = current;
    child->state = PROCESS_READY;

    init_list_head(&child->children);

    *out_child = child;
    return 0;

out_clean_vblocks:
    vblocks_clean(&child->vblocks);
out_clean_kstack:
    kstack_free(child);
out_clean_pid:
    free_pid(child->pid);
out_clean_child:
    kfree(child);
out:
    return ret;
}

Code:

static inline int sys_fork(void)
{
    struct task_struct *child;
    int ret;

    ret = create_child(&child);
    if (ret < 0)
        return ret;

    stack_copy_and_adjust(child);

    if (current == child)
        return 0;

    add_child_to_parent(current, child);
    pid_table_register(child);
    ready_queue_enqueue(child);
    return child->pid;
}

The create_child() function clones the current process’s virtual memory, allocates a new kernel stack, and sets up the new task_struct. stack_copy_and_adjust() copies the parent's kernel stack into the child's and stores the child's esp at the same offset within the new stack.

Here’s the stack_copy_and_adjust() assembly code:

Code:

section .text
global stack_copy_and_adjust
extern current
extern memcpy32
stack_copy_and_adjust:
    push ebx
    push esi
    push edi
    push ebp

    mov eax, [current]
    mov ebx, [esp + 20]

    mov esi, [eax + OFFSET_TASK_ESP0]
    sub esi, 8192
    mov edi, [ebx + OFFSET_TASK_ESP0]
    sub edi, 8192
    
    push 2048
    push esi
    push edi
    call memcpy32
    add esp, 12

    mov eax, esp
    sub eax, esi
    add edi, eax

    mov [ebx + OFFSET_TASK_ESP], edi  

    mov ebx, [esp + 12]
    mov esi, [esp + 8]
    mov edi, [esp + 4]
    add esp, 16

    ret

I’ve verified that esi and edi point correctly and memory is copied successfully.

Here’s the task switching function:

Code:

global switch_to_task
extern current
extern tss
switch_to_task:
    push ebx
    push esi
    push edi
    push ebp

    mov esi, [current]
    mov [esi + OFFSET_TASK_ESP], esp     

    mov edi, [esp + 20]
    mov [current], edi

    mov esp, [edi + OFFSET_TASK_ESP]
    mov eax, [edi + OFFSET_TASK_CR3]        
    mov edx, [edi + OFFSET_TASK_ESP0]
    mov ebx, [tss]
    mov [ebx + OFFSET_TSS_ESP0], edx
    mov ecx, cr3

    cmp eax, ecx
    je .doneVAS
    mov cr3, eax
    
.doneVAS:
    pop ebp
    pop edi
    pop esi
    pop ebx

    ret

Code:

static inline void schedule(void)
{
    struct task_struct *next_task;

    if (current->pid == 2) {
        printk("aaaaaaa\n");
        while (true);
    }
    next_task = list_next_entry(current, ready);

    current->state = PROCESS_READY;
    next_task->state = PROCESS_RUNNING;

    switch_to_task(next_task);
}

To test if a child is running, I added a debug check:

Code:

if (current->pid == 2) {
    printk("aaaaaaa\n");
    while (true);
}

When the child is switched to for the first time and .doneVAS: is reached, the four registers pushed in stack_copy_and_adjust() are popped from the child's copied stack, and after the ret, execution should resume in sys_fork() at the if (current == child) check.

Here’s the modified sys_fork() for debugging:

Code:

if (current == child) {
    printk("child pid = %u esp = %x esp0 = %x \n", current->pid, current->esp, current->esp0);
    return 0;
}
// printk("child pid = %u esp = %x esp0 = %x \n", child->pid, child->esp, child->esp0);

Here’s the weird part: if I comment out the printk() line, esp and esp0 are incorrect. But if the line is active, the printed values are exactly correct. This suggests that something is uninitialized or corrupted, and that the printk() call happens to mask it.

At this point, I’m completely stuck. Any advice, insight, or suggestions would be deeply appreciated, especially from those with experience implementing fork() or task switching on x86.

Also, I can't use GDB as it freezes after jumping into user mode.

Posted: **Tue Apr 22, 2025 10:35 pm**

zerone015 wrote: ↑Tue Apr 22, 2025 9:27 pm This function is called in kernel mode. It allocates and initializes the first user process structure, then calls exec_fn() to jump into user space.

Normally you don't jump into user space, you return to it. You do this by initializing each new task's kernel stack as if that task had already been running in user space and was interrupted (or performing a system call) right before executing its first instruction.

zerone015 wrote: ↑Tue Apr 22, 2025 9:27 pm stack_copy_and_adjust() copies the kernel stack and adjusts the esp.

This is why it doesn't work. The kernel stack contains absolute addresses, including addresses pointing to itself. You must create a new kernel stack for the child task entirely from scratch. If you're already creating a new kernel stack each time you spawn a new thread from scratch, it should be pretty easy to do the same thing for fork().
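
As a minimal illustration of the point about absolute addresses, the sketch below models a stack frame that holds a pointer to one of its own slots. It is plain hosted C rather than kernel code, and `struct frame`, `copy_frame`, and `points_into` are hypothetical names that do not appear in the kernel above. After a byte-for-byte copy, the pointer stored in the copy still refers to the original:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of a stack frame that stores the address of its own local. */
struct frame {
    int  local;      /* a local variable living in the frame       */
    int *local_ptr;  /* an absolute address pointing at that local */
};

/* Byte-for-byte copy, analogous to copying a whole kernel stack. */
static void copy_frame(struct frame *dst, const struct frame *src)
{
    assert(dst && src);
    memcpy(dst, src, sizeof *dst);
}

/* Returns 1 if f's stored pointer refers to owner's local, 0 otherwise. */
static int points_into(const struct frame *f, const struct frame *owner)
{
    return f->local_ptr == &owner->local;
}
```

If `parent.local_ptr` is set to `&parent.local` and the frame is copied into `child`, then `points_into(&child, &parent)` holds, and a write through `child.local_ptr` changes `parent.local` rather than `child.local`.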

Posted: **Tue Apr 22, 2025 11:43 pm**

Octocontrabass wrote: ↑Tue Apr 22, 2025 10:35 pm Normally you don't jump into user space, you return to it. You do this by initializing each new task's kernel stack as if that task had already been running in user space and was interrupted (or performing a system call) right before executing its first instruction.

Thank you for the explanation. I think there might have been a misunderstanding caused by how I described my setup. I’m actually not jumping into user space directly — I return to it using iret, as shown in the function below:

Code:

static inline void jmp_to_entry_point(void) {
    __asm__ (
        "mov %[user_ds], %%ax\n\t"
        "mov %%ax, %%ds\n\t"
        "mov %%ax, %%es\n\t"
        "mov %%ax, %%fs\n\t"
        "mov %%ax, %%gs\n\t"
        "pushl %[user_ss]\n\t"
        "pushl %[user_esp]\n\t"
        "pushf\n\t"
        "orl $0x200, (%%esp)\n\t"
        "pushl %[user_cs]\n\t"
        "pushl %[user_eip]\n\t"
        "iret"
        :
        : [user_ss] "i" (GDT_SELECTOR_DATA_PL3),
          [user_esp] "i" (USER_STACK_TOP),
          [user_cs] "i" (GDT_SELECTOR_CODE_PL3),
          [user_eip] "i" (USER_CODE_BASE),
          [user_ds] "i" (GDT_SELECTOR_DATA_PL3)
        : "memory", "eax"
    );
}

So I believe the behavior aligns with the approach you described — returning to user space as if the task had been interrupted.

Octocontrabass wrote: ↑Tue Apr 22, 2025 10:35 pm This is why it doesn't work. The kernel stack contains absolute addresses, including addresses pointing to itself. You must create a new kernel stack for the child task entirely from scratch. If you're already creating a new kernel stack each time you spawn a new thread from scratch, it should be pretty easy to do the same thing for fork().

I initially assumed that the stack would never contain the address of the stack itself, and that everything would always be accessed relative to esp. But now that I think about it, there are cases like pushad or when passing the address of a local variable as a function argument, where this assumption can break.

In the current implementation, the CPU context is already saved on the stack, so it is not separately stored in the task_struct structure. Therefore, to execute the child process in the same context as the parent process, the kernel stack of the child must be manipulated so that it continues to return (ret) to the return address that was stored in the parent's kernel stack.

In what other cases does the kernel stack contain addresses that point into itself? And if copying the parent's kernel stack to the child's kernel stack is not the correct approach, how can the child's kernel stack be properly constructed from scratch?

Posted: **Wed Apr 23, 2025 10:34 pm**

Ok. I now realize what I was missing. My kernel is being compiled with the -O2 option, which implicitly includes the -fomit-frame-pointer flag. So I assumed that the compiler would not use ebp as the base pointer for the stack frame and would instead always access the stack using only esp.

However, that was a complete misunderstanding. After examining the disassembly, I saw that the compiler actually uses ebp as the base pointer and relies on this value to restore esp.
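
To make the effect concrete, here is a small hosted-C sketch using simulated 32-bit addresses. It assumes an epilogue of the form `lea esp, [ebp - N]`, which is an assumption about the generated code rather than something shown in this thread, and `rebase_if_in_stack` and `epilogue_esp` are hypothetical helpers. It shows that a saved ebp copied verbatim sends esp back into the parent's stack, and what shifting that one value by the distance between the two stacks would do:

```c
#include <assert.h>
#include <stdint.h>

/*
 * If value lies inside [old_base, old_base + size], return the matching
 * address in the stack that starts at new_base; otherwise return it unchanged.
 * The upper bound is inclusive so that the empty-stack top is also covered.
 */
static uint32_t rebase_if_in_stack(uint32_t value, uint32_t old_base,
                                   uint32_t new_base, uint32_t size)
{
    assert(size > 0);
    if (value >= old_base && value - old_base <= size)
        return new_base + (value - old_base);
    return value;
}

/* Models an epilogue that recomputes esp from ebp: lea esp, [ebp - n]. */
static uint32_t epilogue_esp(uint32_t ebp, uint32_t n)
{
    return ebp - n;
}
```

This covers only a saved ebp. Other absolute addresses on the copied stack, such as pointers to locals, would stay stale, so it illustrates the failure rather than offering a general fix.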

Thanks to Octocontrabass, I was able to solve it. I'm really grateful.

Posted: **Mon Apr 28, 2025 10:11 am**

zerone015 wrote: ↑Tue Apr 22, 2025 11:43 pm So I believe the behavior aligns with the approach you described — returning to user space as if the task had been interrupted.

But it doesn't. You're still calling a dedicated function to switch to user mode. I'm talking about initializing the new task's state so that you can perform an ordinary task switch, relying on the return addresses stored in the new task's stack to cause it to return to user mode.

zerone015 wrote: ↑Tue Apr 22, 2025 11:43 pm And if copying the parent's kernel stack to the child's kernel stack is not the correct approach, how can the child's kernel stack be properly constructed from scratch?

When you call switch_to_task to run the new task for the first time, what values will it pop from the stack? Make sure those values are at the top of the new task's kernel stack. One of those values is a return address, which you can set as if it were called by your assembly interrupt/syscall handler. The interrupt/syscall handler will then pop the userspace registers and return to user mode via something like IRET.

If you need to run some additional code before the task returns to user mode for the first time, you can set the switch_to_task return address to run that code, then insert a new return address pointing to the interrupt/syscall handler.
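
A rough sketch of that layout, written as hosted C over an array of 32-bit words so it can be checked outside a kernel. Everything here is an assumption for illustration: `struct user_regs`, `build_child_kstack`, the selector values, and in particular the pop order of the return stub (edi, esi, ebp, ebx, edx, ecx, eax, then iret), which would have to match the real interrupt/syscall exit path. The bottom five words follow the pop order of the switch_to_task posted earlier in the thread (ebp, edi, esi, ebx, then ret):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values for illustration; a kernel would use its own selectors. */
#define USER_CS           0x1Bu
#define USER_SS           0x23u
#define EFLAGS_IF         0x200u
#define KSTACK_INIT_WORDS 17u   /* 5 iret + 7 user regs + 1 ret + 4 saved */

/* User-mode state captured at the parent's syscall entry (hypothetical layout). */
struct user_regs {
    uint32_t edi, esi, ebp, ebx, edx, ecx, eax;
    uint32_t eip, eflags, esp;
};

/*
 * Fill the top of stack[0..nwords) and return the index of the word whose
 * address would become the child's saved esp. stack + nwords plays the role
 * of esp0.
 */
static size_t build_child_kstack(uint32_t *stack, size_t nwords,
                                 const struct user_regs *r,
                                 uint32_t return_stub)
{
    size_t sp = nwords;
    assert(nwords >= KSTACK_INIT_WORDS);

    /* Frame consumed by iret when returning to ring 3. */
    stack[--sp] = USER_SS;
    stack[--sp] = r->esp;
    stack[--sp] = r->eflags | EFLAGS_IF;
    stack[--sp] = USER_CS;
    stack[--sp] = r->eip;

    /* User registers popped by the hypothetical return stub. */
    stack[--sp] = 0;        /* eax: fork() returns 0 in the child */
    stack[--sp] = r->ecx;
    stack[--sp] = r->edx;
    stack[--sp] = r->ebx;
    stack[--sp] = r->ebp;
    stack[--sp] = r->esi;
    stack[--sp] = r->edi;

    /* Consumed by switch_to_task: pop ebp, edi, esi, ebx, then ret. */
    stack[--sp] = return_stub;  /* address of the return-to-user stub */
    stack[--sp] = 0;            /* ebx */
    stack[--sp] = 0;            /* esi */
    stack[--sp] = 0;            /* edi */
    stack[--sp] = 0;            /* ebp */

    return sp;
}
```

In a kernel, the address of `stack[sp]` would be stored in the child's saved esp field, and nothing in the new stack would point back into the parent's stack. Reloading the data segment registers is left out of the sketch and would belong in the stub.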

This is unrelated, but I've never liked how Brendan's switch_to_task does so much more than just switching stacks. Most of that extra work could be done by the caller...

[Solved] Weird fork() Bug in Custom x86 Kernel – Child Process Never Runs