Getting errors only when NOT debugging

Posted: Mon Jan 27, 2020 10:51 am
by Lagor
Hi all,
I started writing a simple OS and got to the point where I'm running user-mode code (loaded from a simple ELF file that I managed to get into physical memory).
Everything runs fine, except that _only_ when I'm NOT debugging, once the user program is running, every time I press a key and the keyboard handler is invoked, I get a page fault.
The PF is at the only instruction that my user code executes (jmp $) and has error code 5.
From what I understand, error code 5 is a 'user-mode read of a present page' type of PF.
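As a way to double-check that reading of error code 5, here is a minimal sketch in C that splits the error code into its three low bits. The names `pf_info`, `pf_decode` and `pf_describe` are hypothetical and are not taken from the kernel discussed in this thread; the sketch also ignores the higher bits (reserved-bit and instruction-fetch flags).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Low bits of the x86 page-fault error code pushed by the CPU. */
#define PF_PRESENT (1u << 0) /* 1 = protection violation on a present page */
#define PF_WRITE   (1u << 1) /* 1 = write access, 0 = read access */
#define PF_USER    (1u << 2) /* 1 = the access was made in user mode */

struct pf_info {
    bool present;
    bool write;
    bool user;
};

/* Hypothetical helper: split the error code into its three basic flags. */
struct pf_info pf_decode(uint32_t error_code)
{
    struct pf_info info = {
        .present = (error_code & PF_PRESENT) != 0,
        .write   = (error_code & PF_WRITE) != 0,
        .user    = (error_code & PF_USER) != 0,
    };
    return info;
}

/* Hypothetical helper: format the flags for a debug print. */
void pf_describe(uint32_t error_code, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    struct pf_info info = pf_decode(error_code);
    snprintf(buf, len, "%s %s of a %s page",
             info.user ? "user-mode" : "kernel-mode",
             info.write ? "write" : "read",
             info.present ? "present" : "non-present");
}
```

With an error code of 5 (binary 101), `pf_decode` reports present and user set and write clear. Note that without NX or SMEP enabled, an instruction fetch is reported as a read, which would fit a fault on `jmp $`.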
My PF handler literally does nothing to correct this; it just returns and the user code keeps running fine (if I didn't print the PF details, you wouldn't even notice that something is wrong).
I know that without the code there isn't much to go on, but I find it weird that I don't get the same results when debugging my OS (it runs on QEMU) with GDB.
I couldn't find much about this, but my feeling is that the problem is about receiving interrupts that mess up my code (because of my own inexperience) and that somehow don't get raised when GDB is attached.
I should also specify that my keyboard handler works fine in kernel mode, as does the 'scheduler' that gets called every x milliseconds by the PIT interrupt.
I'm posting this in the hope that someone more knowledgeable than me has run into something similar and can point me in the right direction.

Thank you,
FG

Re: Getting errors only when NOT debugging

Posted: Mon Jan 27, 2020 11:26 am
by Octocontrabass
Does invoking the keyboard handler involve any change to the paging structures or registers? That might include kernel memory allocations.

Are you running your OS on bare metal or in a virtual machine? If bare metal, what CPU? If a virtual machine, which one?

Re: Getting errors only when NOT debugging

Posted: Mon Jan 27, 2020 1:07 pm
by Lagor
Thanks for your reply.

The keyboard handler does switch from the page directory/page tables of the user process to the ones of the kernel.
My guess, though, is that the problem lies elsewhere.
If the issue were caused by a somehow invalid PD/PT, I think the user code would just crash and not execute any further.
In my case the code runs perfectly fine without my PF handler actually doing anything.
Also, with the debugger active and stepping line by line, it doesn't throw the PF.

I am running on a VM; I tried both QEMU and VirtualBox with the same results. On my HW the results are actually different, but I need to try again to actually diagnose the problem.

Re: Getting errors only when NOT debugging

Posted: Mon Jan 27, 2020 1:23 pm
by Octocontrabass
Lagor wrote:The keyboard handler does switch from the page directory/page tables of the user process to the ones of the kernel.
How do you invalidate the TLB when switching back to the user process?

Re: Getting errors only when NOT debugging

Posted: Mon Jan 27, 2020 1:43 pm
by Lagor
I do not... I feel like I need to look into this...? :?

I should add that I simply change the contents of CR3 to make it point to the user or kernel page directory.
I know I don't have all of the pieces set up for multitasking, but I just wanted to try some easy and kind of hardcoded examples to get a feeling for it.
I will definitely look into how the TLB works, because so far I thought it was just a cache to make things run faster, but I have the feeling that it might not be that simple.
It still haunts me why everything works as intended while debugging but not in real time; I'm afraid I'm missing something huge...

Re: Getting errors only when NOT debugging

Posted: Mon Jan 27, 2020 11:24 pm
by Octocontrabass
Assuming you're not using global pages or PCID, a MOV to CR3 is enough to invalidate the entire TLB. Invalidating the whole TLB can be an expensive operation, so there are various methods to avoid invalidating TLB entries for page mappings that haven't changed. You must invalidate TLB entries for page mappings that have changed.
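To illustrate why changed mappings must be invalidated, here is a toy model in C. It is only a sketch under loud assumptions: `toy_mmu` and its functions are hypothetical, the "TLB" is a plain array, and nothing here touches real hardware. `toy_invlpg` stands in for the INVLPG instruction and `toy_reload_cr3` for a MOV to CR3 without global pages or PCID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_PAGES 8u /* size of the toy address space, in pages */

/* Toy model: one "page table" and a TLB that caches its translations. */
struct toy_mmu {
    uint32_t page_table[NUM_PAGES]; /* virtual page -> physical frame */
    uint32_t tlb_frame[NUM_PAGES];  /* cached translations */
    bool     tlb_valid[NUM_PAGES];
};

/* Translate through the TLB, walking the page table only on a miss. */
uint32_t toy_translate(struct toy_mmu *mmu, uint32_t vpage)
{
    assert(vpage < NUM_PAGES);
    if (!mmu->tlb_valid[vpage]) {
        mmu->tlb_frame[vpage] = mmu->page_table[vpage];
        mmu->tlb_valid[vpage] = true;
    }
    return mmu->tlb_frame[vpage];
}

/* Models INVLPG: drop the cached entry for a single page. */
void toy_invlpg(struct toy_mmu *mmu, uint32_t vpage)
{
    assert(vpage < NUM_PAGES);
    mmu->tlb_valid[vpage] = false;
}

/* Models a MOV to CR3 without global pages or PCID: drop every entry. */
void toy_reload_cr3(struct toy_mmu *mmu)
{
    for (uint32_t i = 0; i < NUM_PAGES; i++)
        mmu->tlb_valid[i] = false;
}
```

In this model, changing `page_table[3]` after a translation has been cached leaves `toy_translate` returning the old frame until `toy_invlpg` or `toy_reload_cr3` is called, which is the stale-mapping behaviour the rule above guards against.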

Does your page fault handler also update CR3? Perhaps your keyboard handler isn't loading the correct value into CR3 when it returns, and your page fault handler is accidentally correcting the issue.

You've mentioned printing the page fault details, but did you print it in a way where you can see if it happens more than once? Since your page fault handler doesn't do anything to fix the fault, you could be trapped in an infinite loop of page faults.

It's a bit of a long shot, but if both Qemu and VirtualBox are using hardware virtualization, you might have stumbled across a CPU erratum. This could explain why the fault resolves itself without you doing anything.

Re: Getting errors only when NOT debugging

Posted: Tue Jan 28, 2020 3:10 am
by Lagor
Octocontrabass wrote: Does your page fault handler also update CR3? Perhaps your keyboard handler isn't loading the correct value into CR3 when it returns, and your page fault handler is accidentally correcting the issue.
It does, but in the exact same way that the other handlers do. I checked the structure of the PT and it's the same in every case.
I don't think this is the issue: if I had fucked up the PT, it would fail both with and without the debugger, instead of working when I'm debugging and failing when I'm not (which I think is the key to my problem somehow...)
Octocontrabass wrote: You've mentioned printing the page fault details, but did you print it in a way where you can see if it happens more than once? Since your page fault handler doesn't do anything to fix the fault, you could be trapped in an infinite loop of page faults.
It happens only once. My user program keeps running fine (prints numbers on screen) after the PF handler returns without doing anything.
Octocontrabass wrote: It's a bit of a long shot, but if both Qemu and VirtualBox are using hardware virtualization, you might have stumbled across a CPU erratum. This could explain why the fault resolves itself without you doing anything.
I need to study this more because I basically have no idea what you just said.

That being said, stuff works when debugging and (sort of) doesn't otherwise... I still can't explain this.

Re: Getting errors only when NOT debugging

Posted: Tue Jan 28, 2020 4:02 am
by Octocontrabass
Lagor wrote:It does, but in the exact same way that the other handlers do. I checked the structure of the PT and it's the same in every case.
I don't think this is the issue: if I had fucked up the PT, it would fail both with and without the debugger, instead of working when I'm debugging and failing when I'm not (which I think is the key to my problem somehow...)
Got a disk image I can try? At this point I'm not sure I can suggest anything else without taking a closer look.
Lagor wrote:
Octocontrabass wrote: It's a bit of a long shot, but if both Qemu and VirtualBox are using hardware virtualization, you might have stumbled across a CPU erratum. This could explain why the fault resolves itself without you doing anything.
I need to study this more because I basically have no idea what you just said.
What CPU are you using? I'd like to look up the errata for it. (More detailed information is better; something like /proc/cpuinfo is ideal.)

Re: Getting errors only when NOT debugging

Posted: Tue Jan 28, 2020 4:14 am
by Lagor
Hi, thanks so much for your help.
Here's a link to the iso (and the kernel.elf for the debug symbols):
https://filebin.net/g576f3opuehlpy1l

I run it with:

qemu-system-i386 -m 64M -cdrom os.iso
and debug it with:

qemu-system-i386 -m 64M -s -cdrom os.iso &
gdb -ex "target remote localhost:1234" -ex "symbol-file kernel.elf"

Re: Getting errors only when NOT debugging

Posted: Tue Jan 28, 2020 6:23 am
by Octocontrabass
It looks like something is corrupting your stack. I was able to somehow overwrite EIP with 0xB8000 while in kernel mode, which I'm pretty sure is not supposed to happen.

You do have separate stacks for user and kernel mode, right?

Re: Getting errors only when NOT debugging

Posted: Wed Jan 29, 2020 2:38 am
by Lagor
Can you explain this?
I'm not using the stack in my user-mode process, so I thought it wouldn't matter whether I set ESP to a meaningful value or not, but even when I did, I had the same problem.
How did you manage to corrupt the kernel stack?

Also, once again, if I systematically corrupt the stack, how come everything seems fine when debugging?

Thank you.

Re: Getting errors only when NOT debugging

Posted: Wed Jan 29, 2020 4:57 am
by Octocontrabass
Lagor wrote:I'm not using the stack in my user-mode process, so I thought it wouldn't matter whether I set ESP to a meaningful value or not, but even when I did, I had the same problem.
I was hoping it would be an easy solution. Unfortunately, this means something in your kernel is causing the stack corruption.
Lagor wrote:How did you manage to corrupt the kernel stack?
I ran it in Qemu and pressed keys until it crashed. Unfortunately I didn't have the debugger running at the time so I'm not sure exactly what went wrong with the stack.
Lagor wrote:Also, once again, if I systematically corrupt the stack, how come everything seems fine when debugging?
It seems to be affected by timing. Running the debugger changes the timing of interrupts being received by your kernel, and that hides the problem.

Your stack is very close to the rest of your kernel. Perhaps try moving it elsewhere, and inserting a guard page so if the stack overflows your kernel will halt or crash instead of overwriting itself.
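A guard page can be sketched as a page-table entry that is simply left not-present just below the stack. The following C fragment is a hedged illustration only: `map_stack_with_guard`, its parameters and the addresses used with it are hypothetical, and it assumes plain 32-bit paging with 4 KiB pages, where bit 0 of an entry is Present and bit 1 is Read/Write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE   4096u
#define PTE_PRESENT (1u << 0)
#define PTE_WRITE   (1u << 1)

/*
 * Hypothetical helper: fill stack_pages + 1 consecutive entries of a
 * 32-bit page table, starting at index first. The lowest entry is left
 * not-present (the guard page); the entries above it map the stack.
 * The stack grows downwards, so an overflow runs into the guard first.
 * Returns the index of the guard entry.
 */
size_t map_stack_with_guard(uint32_t *page_table, size_t entries,
                            size_t first, size_t stack_pages,
                            uint32_t phys_base)
{
    assert(first + stack_pages + 1 <= entries);
    assert(phys_base % PAGE_SIZE == 0);

    page_table[first] = 0; /* guard: any access raises a page fault */
    for (size_t i = 0; i < stack_pages; i++) {
        page_table[first + 1 + i] =
            (phys_base + (uint32_t)(i * PAGE_SIZE)) | PTE_PRESENT | PTE_WRITE;
    }
    return first;
}
```

For example, calling it with `first = 16` and `stack_pages = 4` leaves entry 16 unmapped and maps entries 17 to 20. The initial stack pointer would then go at the top of the highest mapped page, and the TLB entries for any previously mapped pages in that range would still need to be invalidated.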

I notice your startup code uses the stack before it's done setting up the stack. You shouldn't do that: GRUB does not guarantee a usable stack.

Re: Getting errors only when NOT debugging

Posted: Wed Jan 29, 2020 7:35 am
by Lagor
Octocontrabass wrote: I was hoping it would be an easy solution. Unfortunately, this means something in your kernel is causing the stack corruption.

I ran it in Qemu and pressed keys until it crashed. Unfortunately I didn't have the debugger running at the time so I'm not sure exactly what went wrong with the stack.

It seems to be affected by timing. Running the debugger changes the timing of interrupts being received by your kernel, and that hides the problem.
Yes, I was wondering how interrupts (like the PIT or RTC) work while debugging.
Can you point me towards some resources explaining this?
Octocontrabass wrote: Your stack is very close to the rest of your kernel. Perhaps try moving it elsewhere, and inserting a guard page so if the stack overflows your kernel will halt or crash instead of overwriting itself.

I notice your startup code uses the stack before it's done setting up the stack. You shouldn't do that: GRUB does not guarantee a usable stack.
Thank you for the suggestion.
Honestly, I was just getting my hands dirty, trying to get things working in a fast and simple way to get a feeling for them, which means that I didn't pay attention to a lot of details (such as stack placement and so forth).
My priority was to write some code that would let me figure out whether or not I understood the theoretical concepts. I will be more rigorous from now on; I was just very excited to get down to business... :D