Page fault in scheduler

Posted: Sat Oct 27, 2012 4:08 am
by mariuszp
Hello,

I am writing an operating system and have set up paging, ELF executable loading, and multitasking. When I load a program and run it, it runs for a while, but after one or two reschedules, when the scheduler switches the page directory, execution jumps to address 0 for some reason. I tried debugging with Bochs, but the only thing I noticed is that it happens right after the page directory switch.

I will attach the scheduler and paging code, Bochs debugger output (part of it), and a screenshot of a page fault (caused by a random instruction at EIP 1).

I really cannot figure out what the problem is. All help will be appreciated. If you need any more code or info, please ask :)

Bochs debugger output (the relevant part):

Code:

(0).[105706000] [0x0000000000102117] 0008:00102117 (unk. ctxt): mov cr3, eax              ; 0f22d8
(0).[105706001] [0x000000000010211a] 0008:0010211a (unk. ctxt): mov eax, cr0              ; 0f20c0
(0).[105706002] [0x000000000010211d] 0008:0010211d (unk. ctxt): mov dword ptr ss:[ebp-4], eax ; 8945fc
(0).[105706003] [0x0000000000102120] 0008:00102120 (unk. ctxt): or dword ptr ss:[ebp-4], 0x80000000 ; 814dfc00000080
(0).[105706004] [0x0000000000102127] 0008:00102127 (unk. ctxt): mov eax, dword ptr ss:[ebp-4] ; 8b45fc
(0).[105706005] [0x000000000010212a] 0008:0010212a (unk. ctxt): mov cr0, eax              ; 0f22c0
(0).[105706006] [0x000000000010212d] 0008:0010212d (unk. ctxt): leave                     ; c9
(0).[105706007] [0x000000000010212e] 0008:0010212e (unk. ctxt): ret                       ; c3						; XXX
00105706008i[CPU0 ] LOCK prefix unallowed (op1=0x53, modrm=0x00)
(0).[105706008] [0x0000000000000000] 0008:00000000 (unk. ctxt): push ebx                  ; 53
00105706009i[CPU0 ] LOCK prefix unallowed (op1=0x53, modrm=0x00)
(0).[105706009] [0x0000000000000001] 0008:00000001 (unk. ctxt): inc dword ptr ds:[eax]    ; ff00
CPU 0: Exception 0x0e - (#PF) page fault occured (error_code=0x0002)
CPU 0: Interrupt 0x0e occured (error_code=0x0002)
P.S. "write|kernel" in the page fault message means the faulting instruction attempted a write while in kernel mode, which matches error_code=0x0002 above.
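For anyone reading along, the error_code=0x0002 in the Bochs output can be decoded bit by bit. Below is a minimal sketch in hosted C, assuming the standard x86 page-fault error code layout; the function name describe_page_fault() and the PF_* macro names are made up for illustration and are not taken from the poster's kernel.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* x86 page-fault error code bits (names are illustrative). */
#define PF_PRESENT 0x01u /* 0 = page not present, 1 = protection violation */
#define PF_WRITE   0x02u /* 0 = read access, 1 = write access */
#define PF_USER    0x04u /* 0 = CPU was in supervisor mode, 1 = user mode */
#define PF_RSVD    0x08u /* 1 = reserved bit set in a paging structure */
#define PF_IFETCH  0x10u /* 1 = fault was caused by an instruction fetch */

/* Formats the error code into buf, e.g. "write|kernel|not-present". */
const char *describe_page_fault(uint32_t err, char *buf, size_t len)
{
    snprintf(buf, len, "%s|%s|%s%s%s",
             (err & PF_WRITE)   ? "write" : "read",
             (err & PF_USER)    ? "user" : "kernel",
             (err & PF_PRESENT) ? "protection" : "not-present",
             (err & PF_RSVD)    ? "|reserved-bit" : "",
             (err & PF_IFETCH)  ? "|instruction-fetch" : "");
    return buf;
}
```

With err = 0x0002 this gives "write|kernel|not-present": a kernel-mode write to an unmapped page, which is what one would expect from `inc dword ptr ds:[eax]` executing with a garbage eax after the jump to 0.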

Re: Page fault in scheduler

Posted: Sat Oct 27, 2012 5:20 am
by jnc100
Your problem appears to be that esp does not point to a valid stack in the new address space after you switch, which causes the leave/ret pair to return to eip 0. Having had a further look at your code, it's dangerous to have switch_task() call set_kernel_stack() and switch_page_dir() as two separate functions. The problem is that after you switch the stack in set_kernel_stack(), you then execute a 'ret' at the end of that function to return to switch_task(), using the stack of the new process but the page directory of the old one. I'd recommend that the stack switch and the page directory change both happen within a single assembler function that does not use the stack for storage, e.g. (pseudo-asm):

Code:

; void do_switch(uintptr_t new_esp, uintptr_t *old_esp, uintptr_t new_cr3, uintptr_t *old_cr3)
do_switch:
    ; save all registers (add segment selectors if not flat mode, and
    ; mmx/xmm state if not using lazy-xmm switches)
    pushfd                  ; flags, needed if not triggered by an interrupt which saves flags
    pushad                  ; all general-purpose registers

    ; the offsets account for the pushes above:
    ; 32 (pushad) + 4 (pushfd) + 4 (return address) = 40 bytes before the first argument
    mov eax, [esp + 40]     ; new_esp
    mov ebx, [esp + 44]     ; old_esp
    mov ecx, [esp + 48]     ; new_cr3
    mov edx, [esp + 52]     ; old_cr3

    ; then do the change
    mov [ebx], esp          ; store old esp
    mov esp, eax            ; load new esp
    mov eax, cr3
    mov [edx], eax          ; store old cr3
    mov cr3, ecx            ; load new cr3

    ; restore all registers in the opposite order to which you saved them
    popad
    popfd
    ret
Note I'm not suggesting this will fix all your problems but a reliable task_switch function is a step in the right direction. I suggest you start debugging it by using co-operative multitasking first rather than relying on the timer interrupt (e.g. have a yield() syscall) as this makes it far easier to debug without the timer interrupt constantly going off whilst you're debugging.
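To illustrate the co-operative approach, here is a user-space analogy rather than kernel code: a minimal sketch using the POSIX ucontext functions, in which two tasks run only until they call yield(). The names yield(), task_body(), run_cooperative_demo() and the trace buffer are all invented for this example, and it assumes a libc that provides ucontext.h (glibc does). The point is that every switch happens at a known place, so the order of execution is fully deterministic.

```c
#include <stddef.h>
#include <ucontext.h>

#define NUM_TASKS  2
#define STACK_SIZE (64 * 1024)
#define TRACE_MAX  16

static ucontext_t scheduler_ctx;            /* context of the scheduler loop */
static ucontext_t task_ctx[NUM_TASKS];      /* one saved context per task */
static char task_stack[NUM_TASKS][STACK_SIZE];
static int task_done[NUM_TASKS];
static int current_task;

static char trace[TRACE_MAX + 1];           /* records which task ran, in order */
static size_t trace_len;

static void record(char c)
{
    if (trace_len < TRACE_MAX)
        trace[trace_len++] = c;
    trace[trace_len] = '\0';
}

/* Voluntarily hand control back to the scheduler. */
static void yield(void)
{
    swapcontext(&task_ctx[current_task], &scheduler_ctx);
}

/* Each task records its letter three times, yielding after each one. */
static void task_body(int id)
{
    for (int i = 0; i < 3; i++) {
        record((char)('A' + id));
        yield();
    }
    task_done[id] = 1;  /* returning resumes uc_link, i.e. the scheduler */
}

/* Round-robin over the tasks until all have finished; returns the trace. */
const char *run_cooperative_demo(void)
{
    trace_len = 0;
    trace[0] = '\0';
    for (int id = 0; id < NUM_TASKS; id++) {
        task_done[id] = 0;
        getcontext(&task_ctx[id]);
        task_ctx[id].uc_stack.ss_sp = task_stack[id];
        task_ctx[id].uc_stack.ss_size = STACK_SIZE;
        task_ctx[id].uc_link = &scheduler_ctx;
        makecontext(&task_ctx[id], (void (*)(void))task_body, 1, id);
    }

    int remaining = NUM_TASKS;
    while (remaining > 0) {
        for (int id = 0; id < NUM_TASKS; id++) {
            if (task_done[id])
                continue;
            current_task = id;
            swapcontext(&scheduler_ctx, &task_ctx[id]);
            if (task_done[id])
                remaining--;
        }
    }
    return trace;
}
```

Because nothing preempts the tasks, a failure in the switch code shows up at the same yield() call on every run, which is much easier to single-step in Bochs than a fault that depends on where the timer happened to fire.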

Regards,
John.

Re: Page fault in scheduler

Posted: Sat Oct 27, 2012 6:07 am
by mariuszp
I am sorry for not making this clear - the set_kernel_stack() function just changes the esp0 field of the TSS. Does that mean it is safe to call switch_page_dir() after it? Also, the process is sometimes able to run for a while (and even make some system calls!) before the problem occurs.

I will check the ESP though. Thank you.
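For reference, a set_kernel_stack() that only updates esp0, as described above, might look roughly like the sketch below. The struct follows the standard 32-bit TSS layout (104 bytes, esp0 at offset 4); the names tss_entry, tss and get_kernel_stack() are assumptions made for illustration, not the poster's actual code, and the packed attribute assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit Task State Segment; field names are illustrative. */
struct tss_entry {
    uint32_t prev_tss;
    uint32_t esp0;              /* stack pointer loaded on entry to ring 0 */
    uint32_t ss0;               /* stack segment loaded on entry to ring 0 */
    uint32_t esp1, ss1;
    uint32_t esp2, ss2;
    uint32_t cr3, eip, eflags;
    uint32_t eax, ecx, edx, ebx;
    uint32_t esp, ebp, esi, edi;
    uint32_t es, cs, ss, ds, fs, gs;
    uint32_t ldt;
    uint16_t trap;
    uint16_t iomap_base;
} __attribute__((packed));

static struct tss_entry tss;    /* hypothetical single system-wide TSS */

/* Only records where the next ring 0 entry should place its stack;
 * the register esp itself is not touched. */
void set_kernel_stack(uint32_t stack_top)
{
    tss.esp0 = stack_top;
}

uint32_t get_kernel_stack(void)
{
    return tss.esp0;
}
```

The CPU reads esp0 only when an interrupt or exception switches from a less privileged ring to ring 0, so a function like this does not move the stack the kernel is currently running on. That would make the ordering concern about the 'ret' apply only if the function also loaded esp directly.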

Re: Page fault in scheduler

Posted: Sat Oct 27, 2012 8:41 am
by mariuszp
I just looked at my execve() function and made it print the location of each segment (loaded from the program headers). It reports that there are three segments but prints only two, and then the page fault occurs. I then made it print "OK" on the screen when the loop finishes, and it never did. The interesting thing is that switch_page_dir() is called inside the loop, to make sure that the CPU actually uses the modified page directory instead of caching the old one and using that (I've heard this is possible).
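The kind of logging loop described here can be sketched as follows. This is hosted C for illustration only: the struct mirrors the standard ELF32 program header layout, but log_load_segments() and the use of a FILE stream are assumptions, since the real execve() and its print routine are not shown in this thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PT_LOAD 1u  /* program header type for a loadable segment */

/* ELF32 program header (standard field order). */
struct elf32_phdr {
    uint32_t p_type;
    uint32_t p_offset;
    uint32_t p_vaddr;
    uint32_t p_paddr;
    uint32_t p_filesz;
    uint32_t p_memsz;
    uint32_t p_flags;
    uint32_t p_align;
};

/* Logs every program header BEFORE anything would be mapped or copied,
 * then prints a completion marker. Returns the number of PT_LOAD entries. */
size_t log_load_segments(const struct elf32_phdr *ph, size_t phnum, FILE *out)
{
    size_t loadable = 0;

    for (size_t i = 0; i < phnum; i++) {
        if (ph[i].p_type != PT_LOAD) {
            fprintf(out, "segment %zu: type %u, skipped\n",
                    i, (unsigned)ph[i].p_type);
            continue;
        }
        fprintf(out, "segment %zu: vaddr=0x%08x filesz=0x%x memsz=0x%x\n",
                i, (unsigned)ph[i].p_vaddr,
                (unsigned)ph[i].p_filesz, (unsigned)ph[i].p_memsz);
        loadable++;
        /* mapping and copying of the segment would happen here */
    }

    fprintf(out, "OK: %zu of %zu headers are loadable\n", loadable, phnum);
    return loadable;
}
```

Printing each header before touching memory means the last line on screen identifies the segment whose mapping or copy was in progress when the fault hit, including its vaddr and memsz range.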

And when switch_page_dir() returns, it returns to address 0. I do not see what is wrong with it, especially since the first segment loads properly.

Any help?

[P.S. I call switch_page_dir() on the current directory, meaning that the switch itself cannot cause the problem]