triple fault after context switch

Posted: Mon Aug 08, 2016 11:01 pm
by szhou42
So, after my kernel set up all the memory management and file system stuff, I decided to load a user-mode entry program (a flat binary, not ELF) from the hard disk and run it in user mode.

This is how I do it, in detail.
In kernel mode, I load a program called "userentry.bin", which is basically a program that loops forever:

Code:

load_program("userentry.bin");
In load_program:

Code:

    vfs_node_t * program = file_open(filename, 0);
    if(!program) {
        printf("Failed to open %s, does it even exist?\n", filename);
        return;
    }
    uint32_t size = vfs_get_file_size(program);
    char * program_code = kcalloc(size, 1);
    vfs_read(program, 0, size, program_code);
    // Create a process. kcalloc() zeroes the PCB, but the memcpy() below then copies every
    // field, including the saved registers, from current_process. Set eip to the start of
    // the program code and set the IF flag.
    pcb_t * p1 = kcalloc(sizeof(pcb_t), 1);
    memcpy(p1, current_process, sizeof(pcb_t));
    p1->regs.eip = (uint32_t)program_code;
    p1->regs.eflags = 0x200; // enable interrupts (IF, bit 9)
    // Insert the process into the process list
    p1->self = list_insert_front(process_list, p1);
    // Call yield via a system call (eax = 1), although we're actually still in kernel mode.
    // A single asm statement keeps the compiler from reusing eax between loading it and the int;
    // "+a" also tells it that the handler may change eax.
    uint32_t syscall_nr = 1;
    asm volatile("int $0x80" : "+a"(syscall_nr) : : "memory");
The yield system call then calls the scheduler function, which finds the next process to run; that has to be the process created in load_program().
The scheduler then does a context switch to the program, and the program does start running.
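For reference, here is a minimal sketch of the frame a kernel commonly builds to enter ring 3 with `iret`. The code above only sets eip and eflags and leaves every other saved register, presumably including the segment selectors and esp, as copied from current_process. The selector values 0x1B and 0x23 assume a hypothetical GDT with user code at index 3 and user data at index 4, and `build_iret_frame` is an illustrative name, not something from the poster's kernel. The frame is only built in memory here so its layout can be checked on a host.

```c
#include <stdint.h>

/* Hypothetical GDT layout: index 3 = user code, index 4 = user data. */
#define GDT_USER_CODE 3u
#define GDT_USER_DATA 4u
#define RPL_USER      3u
#define SELECTOR(index, rpl) (((index) << 3) | (rpl))

#define EFLAGS_RESERVED_1 0x002u /* bit 1 always reads as 1 */
#define EFLAGS_IF         0x200u /* bit 9: interrupts enabled */

/* Frame consumed by a 32-bit iret that returns to a lower privilege level,
   listed from the lowest address (where esp points) upward. */
typedef struct {
    uint32_t eip;
    uint32_t cs;
    uint32_t eflags;
    uint32_t esp; /* user stack pointer: must point at real, mapped memory */
    uint32_t ss;
} iret_frame_t;

static iret_frame_t build_iret_frame(uint32_t entry, uint32_t user_stack_top) {
    iret_frame_t f;
    f.eip = entry;
    f.cs = SELECTOR(GDT_USER_CODE, RPL_USER);   /* 0x1B: RPL 3 means ring 3 */
    f.eflags = EFLAGS_RESERVED_1 | EFLAGS_IF;    /* 0x202 */
    f.esp = user_stack_top;
    f.ss = SELECTOR(GDT_USER_DATA, RPL_USER);    /* 0x23 */
    return f;
}
```

A context-switch stub would copy this frame onto the new process's kernel stack and execute `iret`. Once the code runs in ring 3, interrupts switch to the ring 0 stack recorded in the TSS (esp0), so that field has to be valid as well.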

However, a triple fault occurs after about one second.
I've tried setting breakpoints on the interrupt and exception handlers, but the kernel always triple faults before any exception or interrupt handler runs.
I also tried inserting an int 0x80 instruction into the program, and it immediately triple faults and resets the machine.

I understand that my way of loading a process is a bit weird, since I didn't even create a separate address space for the new program, but I'm just experimenting with running a few programs concurrently in user mode.

I suspect that the context I created manually for the process leads to the triple fault.
Can someone explain what's going on?
My OS code is here for reference: https://github.com/szhou42/osdev/tree/master/src

Any help would be appreciated! Thanks!

Re: triple fault after context switch

Posted: Tue Aug 09, 2016 12:35 am
by iansjack
What have you done in the way of debugging? Are you running under a debugger? Do you have exception handlers that will stop execution when there is an exception and display the contents of important registers? Are you using paging?

You really need to treat this as a chance to hone your debugging skills so that you at least know where the program is failing and what the contents of registers and memory are at that point. It should then be fairly simple to find the error. A likely cause is a corrupted stack.

Re: triple fault after context switch

Posted: Tue Aug 09, 2016 2:59 am
by Ch4ozz
I don't see you allocating any memory for the stack.
It seems you are currently using the kernel thread's stack.
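A minimal sketch of the fix being pointed at here, assuming a hypothetical `context_t` with an `esp` field and using host `calloc()` as a stand-in for the kernel's `kcalloc()`: give each new process its own stack and point the saved stack pointer at the top of it, because x86 stacks grow downward and `push` decrements esp before storing.

```c
#include <stdint.h>
#include <stdlib.h>

#define PROCESS_STACK_SIZE 4096u /* hypothetical: one 4 KiB page */

/* Hypothetical subset of a saved register context. */
typedef struct {
    uintptr_t eip;
    uintptr_t esp;
    uint32_t eflags;
} context_t;

/* Allocate a zeroed stack for a new process and set its saved esp.
   Returns the base of the allocation so the caller can free it later,
   or NULL if the allocation failed (esp is left untouched then). */
static void *setup_stack(context_t *ctx, size_t size) {
    unsigned char *base = calloc(size, 1);
    if (base == NULL)
        return NULL;
    /* esp starts one past the highest byte: the first push writes to base + size - 4. */
    ctx->esp = (uintptr_t)(base + size);
    return base;
}
```

In the original load_program() the same idea would presumably mean setting `p1->regs.esp` (if `regs` has such a field) to the top of a freshly allocated stack, instead of inheriting whatever value was copied from current_process.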

Re: triple fault after context switch

Posted: Tue Aug 09, 2016 10:19 am
by szhou42
iansjack wrote:What have you done in the way of debugging? Are you running under a debugger? Do you have exception handlers that will stop execution when there is an exception and display the contents of important registers? Are you using paging?

You really need to treat this as a chance to hone your debugging skills so that you at least know where the program is failing and what the contents of registers and memory are at that point. It should then be fairly simple to find the error. A likely cause is a corrupted stack.
Thanks for your quick reply! But please don't assume we post here because we're too lazy to debug ourselves. My situation is a bit different: the whole machine resets before any useful information is printed, and it's not as easy as single-stepping to find exactly where the triple fault happens. Yes, I am using paging.

I've been debugging with gdb (at both the source and the machine level) and stepped through every assembly instruction until the loaded program was actually running.

Code:

loop:
xchg bx,bx
xchg bx,bx
xchg bx,bx
xchg bx,bx
xchg bx,bx
jmp loop
What I've found is that if I disable interrupts, nothing happens: the program just loops forever. However, if interrupts are enabled, a triple fault occurs within about one second.
I've tried setting breakpoints around my exception and interrupt handlers, but the program always seems to cause a triple fault before any exception or interrupt handler runs. Still, I'm sure it has something to do with interrupt or exception handling; otherwise the program wouldn't run normally with the IF flag cleared.

I tried inserting an int 0x80 into the program to trigger a software interrupt and see whether something is wrong with interrupt handling, and the int 0x80 immediately causes a triple fault. So I now suspect that the context (registers) I created manually prevents the CPU from delivering the interrupt. I'll look into some documentation about this now.

Re: triple fault after context switch

Posted: Tue Aug 09, 2016 10:24 am
by szhou42
Ch4ozz wrote:I don't see you allocating any memory for the stack.
It seems you are currently using the kernel thread's stack.
Hi Ch4ozz, you're right, but the program doesn't use the stack at all; I'm just trying out loading and running a program.

Re: triple fault after context switch

Posted: Tue Aug 09, 2016 11:20 am
by onlyonemac
szhou42 wrote:What I've found is that if I disable interrupts, nothing happens: the program just loops forever. However, if interrupts are enabled, a triple fault occurs within about one second.
I've tried setting breakpoints around my exception and interrupt handlers, but the program always seems to cause a triple fault before any exception or interrupt handler runs. Still, I'm sure it has something to do with interrupt or exception handling; otherwise the program wouldn't run normally with the IF flag cleared.
What's happening is that the timer interrupt comes along and the CPU somehow can't deliver it to its handler, so it raises an exception; it can't deliver that either, so it raises a double fault; and when it can't deliver the double fault, the CPU triple-faults. Most likely, the CPL has been messed up somewhere and your interrupt handler can't be called from whatever CPL your "userspace binary" is running in.
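One concrete way the CPL matters, sketched under the assumption of the usual 32-bit IDT layout and a hypothetical kernel code selector of 0x08: hardware interrupts and exceptions ignore a gate's DPL, but a software `int 0x80` executed at CPL 3 raises #GP unless that gate's DPL is 3. The helper `make_gate` is illustrative, and `__attribute__((packed))` assumes GCC or Clang.

```c
#include <stdint.h>

/* 32-bit IDT gate descriptor (8 bytes), fields in hardware order. */
typedef struct __attribute__((packed)) {
    uint16_t offset_low;   /* handler address bits 0..15 */
    uint16_t selector;     /* code segment selector of the handler */
    uint8_t  zero;         /* reserved, must be 0 */
    uint8_t  type_attr;    /* P (bit 7) | DPL (bits 5..6) | gate type (bits 0..3) */
    uint16_t offset_high;  /* handler address bits 16..31 */
} idt_gate_t;

#define GATE_PRESENT 0x80u
#define GATE_DPL(n)  (((n) & 3u) << 5)
#define GATE_INT32   0x0Eu /* 32-bit interrupt gate: clears IF on entry */

/* Build a present 32-bit interrupt gate. dpl is the lowest privilege
   (highest CPL) allowed to reach it with a software INT n. */
static idt_gate_t make_gate(uint32_t handler, uint16_t selector, unsigned dpl) {
    idt_gate_t g;
    g.offset_low = (uint16_t)(handler & 0xFFFFu);
    g.selector = selector;
    g.zero = 0;
    g.type_attr = (uint8_t)(GATE_PRESENT | GATE_DPL(dpl) | GATE_INT32);
    g.offset_high = (uint16_t)(handler >> 16);
    return g;
}
```

`make_gate(handler, 0x08, 0)` gives the usual type byte 0x8E for exception and IRQ gates, while the system-call gate needs DPL 3 (type byte 0xEE) if ring 3 code is to reach it with `int 0x80`.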

Re: triple fault after context switch

Posted: Tue Aug 09, 2016 12:13 pm
by iansjack
szhou42 wrote:
Thanks for your quick reply! But please don't assume we post here because we're too lazy to debug ourselves. My situation is a bit different: the whole machine resets before any useful information is printed, and it's not as easy as single-stepping to find exactly where the triple fault happens. Yes, I am using paging.
Rest assured that I assume nothing. This is why I need to ask when you give no details of what debugging steps you have taken.

Now that you have given that information, I would suggest that you write handlers for all exceptions. At their simplest they will just halt the processor. Then you can use gdb to determine exactly where the program is faulting and which exception it is throwing; this should give you a good idea of what the problem is. The triple fault will never occur as you will catch the very first exception.
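A minimal sketch of the reporting half of such a catch-all handler, assuming a hypothetical `fault_frame_t` that an assembly stub fills in before calling into C. It is written to compile on a host, so the final halt appears only as a comment; in a kernel, the handler would print the report and then stop with a `cli`/`hlt` loop.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical frame saved by an assembly stub before calling into C. */
typedef struct {
    uint32_t vector;
    uint32_t err_code;
    uint32_t eip;
    uint32_t cs;
    uint32_t eflags;
    uint32_t esp;
} fault_frame_t;

/* Names for CPU exception vectors 0..19. */
static const char *const exception_names[] = {
    "Divide Error", "Debug", "NMI", "Breakpoint", "Overflow",
    "BOUND Range Exceeded", "Invalid Opcode", "Device Not Available",
    "Double Fault", "Coprocessor Segment Overrun", "Invalid TSS",
    "Segment Not Present", "Stack-Segment Fault", "General Protection",
    "Page Fault", "Reserved", "x87 Floating-Point Error", "Alignment Check",
    "Machine Check", "SIMD Floating-Point",
};

static const char *exception_name(uint32_t vector) {
    size_t n = sizeof exception_names / sizeof exception_names[0];
    return vector < n ? exception_names[vector] : "Unlisted exception";
}

/* Format a one-line report into buf; returns snprintf's result.
   In a kernel: print buf, then loop forever on cli; hlt so the first fault is kept. */
static int format_fault(const fault_frame_t *f, char *buf, size_t len) {
    return snprintf(buf, len,
                    "%s (vector %u, err 0x%x) at eip=0x%08x cs=0x%04x eflags=0x%08x esp=0x%08x",
                    exception_name(f->vector), (unsigned)f->vector, (unsigned)f->err_code,
                    (unsigned)f->eip, (unsigned)f->cs, (unsigned)f->eflags, (unsigned)f->esp);
}
```

This only helps while the CPU can still push an exception frame; as the later posts in this thread show, with esp = 0 none of the handlers ever ran.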

Re: triple fault after context switch

Posted: Tue Aug 09, 2016 1:23 pm
by szhou42
iansjack wrote:
szhou42 wrote:
Thanks for your quick reply! But please don't assume we post here because we're too lazy to debug ourselves. My situation is a bit different: the whole machine resets before any useful information is printed, and it's not as easy as single-stepping to find exactly where the triple fault happens. Yes, I am using paging.
Rest assured that I assume nothing. This is why I need to ask when you give no details of what debugging steps you have taken.

Now that you have given that information, I would suggest that you write handlers for all exceptions. At their simplest they will just halt the processor. Then you can use gdb to determine exactly where the program is faulting and which exception it is throwing; this should give you a good idea of what the problem is. The triple fault will never occur as you will catch the very first exception.
This is weird: I've written handlers for every exception and IRQ, and I've set breakpoints on each and every one of them, but none are hit.
Anyway, you're right that it's caused by stack corruption: esp was accidentally set to zero while the user program was running. So I guess it triple faults because the CPU can't push anything onto the stack.
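The arithmetic behind esp = 0 can be checked directly. This sketch assumes the interrupt is taken without a privilege change, so the CPU pushes EFLAGS, CS and EIP onto the current stack (an interrupt taken in ring 3 would switch to the TSS stack instead), a flat 4 GiB stack segment, and classic two-level 4 KiB paging; the helper names are hypothetical.

```c
#include <stdint.h>

/* esp after n 32-bit pushes: each push decrements esp by 4 first,
   and the 32-bit register wraps modulo 2^32. */
static uint32_t esp_after_pushes(uint32_t esp, unsigned pushes) {
    return (uint32_t)(esp - 4u * pushes);
}

/* Split a linear address into page directory and page table indices
   (10 + 10 + 12 bits with 4 KiB pages). */
static uint32_t pd_index(uint32_t addr) { return addr >> 22; }
static uint32_t pt_index(uint32_t addr) { return (addr >> 12) & 0x3FFu; }
```

With esp = 0 the first push lands at 0xFFFFFFFC, in the last entry of the last page table. If that page is not mapped, the push page-faults; delivering the page fault needs the same broken stack, so the failure escalates to a double fault and then a triple fault.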
Thank you very much, i've learned more about how to debug a triple fault now :).

Re: triple fault after context switch

Posted: Tue Aug 09, 2016 1:26 pm
by szhou42
onlyonemac wrote:
szhou42 wrote:What I've found is that if I disable interrupts, nothing happens: the program just loops forever. However, if interrupts are enabled, a triple fault occurs within about one second.
I've tried setting breakpoints around my exception and interrupt handlers, but the program always seems to cause a triple fault before any exception or interrupt handler runs. Still, I'm sure it has something to do with interrupt or exception handling; otherwise the program wouldn't run normally with the IF flag cleared.
What's happening is that the timer interrupt comes along and the CPU somehow can't deliver it to its handler, so it raises an exception; it can't deliver that either, so it raises a double fault; and when it can't deliver the double fault, the CPU triple-faults. Most likely, the CPL has been messed up somewhere and your interrupt handler can't be called from whatever CPL your "userspace binary" is running in.
Hi onlyonemac, I eventually fixed the bug. You're right that the CPU triple faults during the timer IRQ, but it's actually because esp was accidentally set to 0, not because the CPU couldn't find the handler. Thanks for your help :D !

Re: triple fault after context switch

Posted: Tue Aug 09, 2016 2:35 pm
by onlyonemac
szhou42 wrote:Hi onlyonemac, I eventually fixed the bug. You're right that the CPU triple faults during the timer IRQ, but it's actually because esp was accidentally set to 0, not because the CPU couldn't find the handler. Thanks for your help :D !
I see. Yes, that would also prevent the CPU from being able to execute the interrupt handler and subsequent exception handlers.
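The escalation described above follows the double-fault rules in the Intel SDM, which decide what happens when a second exception occurs while the CPU is still delivering a first one. The sketch below is a simplified model of those rules; it does not cover vectors 20 and 21 added by later CPUs, and the names are illustrative.

```c
/* Exception classes used by the SDM's double-fault rules, plus #DF itself. */
typedef enum { BENIGN, CONTRIBUTORY, PAGE_FAULT, DOUBLE_FAULT } fault_class_t;

/* What the CPU does when a second exception arrives while delivering a first. */
typedef enum { HANDLE_SERIALLY, RAISE_DOUBLE_FAULT, SHUTDOWN } fault_result_t;

static fault_class_t classify(int vector) {
    switch (vector) {
    case 0: case 10: case 11: case 12: case 13:
        return CONTRIBUTORY;   /* #DE, #TS, #NP, #SS, #GP */
    case 14:
        return PAGE_FAULT;     /* #PF */
    case 8:
        return DOUBLE_FAULT;   /* #DF */
    default:
        /* 1-7, 9 and 16-19 are benign, as are external interrupts and INT n.
           Vectors 20 and 21 from later CPUs are not modeled here. */
        return BENIGN;
    }
}

/* `first` is the event being delivered; `second` is the exception raised
   during that delivery (never #DF itself). */
static fault_result_t combine(fault_class_t first, fault_class_t second) {
    if (first == BENIGN || second == BENIGN)
        return HANDLE_SERIALLY;
    if (first == DOUBLE_FAULT)
        return SHUTDOWN;             /* contributory or #PF during #DF: triple fault */
    if (first == CONTRIBUTORY && second == PAGE_FAULT)
        return HANDLE_SERIALLY;
    return RAISE_DOUBLE_FAULT;       /* contributory+contributory, #PF+contributory, #PF+#PF */
}
```

For the esp = 0 case, assuming the top page is unmapped and the timer IRQ is remapped to vector 32: `combine(classify(32), classify(14))` is HANDLE_SERIALLY (the IRQ's push page-faults and the #PF is delivered next), `combine(classify(14), classify(14))` is RAISE_DOUBLE_FAULT, and `combine(classify(8), classify(14))` is SHUTDOWN, which is the triple fault.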

Re: triple fault after context switch

Posted: Tue Aug 09, 2016 3:44 pm
by iansjack
One of the joys of 64-bit mode is that you can set different stacks for different interrupts/exceptions, so you can guarantee that there is always a valid stack for exceptions even if there isn't a change in privilege level.
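A minimal sketch of that mechanism, assuming GCC or Clang for `__attribute__((packed))` and C11 `static_assert`: in long mode the TSS holds rsp0 to rsp2 plus seven Interrupt Stack Table (IST) pointers, and any IDT gate whose 3-bit IST field is non-zero makes the CPU switch to that stack unconditionally, even without a privilege change. The kernel code selector 0x08 and the helper name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit TSS layout (104 bytes). */
typedef struct __attribute__((packed)) {
    uint32_t reserved0;
    uint64_t rsp[3];      /* rsp0..rsp2: stacks loaded on privilege changes */
    uint64_t reserved1;
    uint64_t ist[7];      /* ist[0] is IST1, ..., ist[6] is IST7 */
    uint64_t reserved2;
    uint16_t reserved3;
    uint16_t iomap_base;
} tss64_t;

static_assert(sizeof(tss64_t) == 104, "64-bit TSS is 104 bytes");
static_assert(offsetof(tss64_t, ist) == 0x24, "IST1 lives at offset 0x24");

/* 64-bit IDT gate (16 bytes); bits 0..2 of `ist` select IST1..IST7, 0 = no switch. */
typedef struct __attribute__((packed)) {
    uint16_t offset_low;
    uint16_t selector;
    uint8_t  ist;
    uint8_t  type_attr;
    uint16_t offset_mid;
    uint32_t offset_high;
    uint32_t reserved;
} idt_gate64_t;

static_assert(sizeof(idt_gate64_t) == 16, "64-bit IDT gate is 16 bytes");

static idt_gate64_t make_gate64(uint64_t handler, uint16_t selector, unsigned ist_index) {
    idt_gate64_t g = {0};
    g.offset_low = (uint16_t)handler;
    g.selector = selector;
    g.ist = (uint8_t)(ist_index & 7u);
    g.type_attr = 0x8E;            /* present, DPL 0, 64-bit interrupt gate */
    g.offset_mid = (uint16_t)(handler >> 16);
    g.offset_high = (uint32_t)(handler >> 32);
    return g;
}
```

A common choice is to give the double-fault gate (vector 8) its own dedicated stack through IST1, so that even a corrupted rsp leaves the #DF handler a valid stack to report from.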