Double fault TSS problems

nexos · Post by **nexos** » Sun Jan 31, 2021 9:23 am

Hello,
This is my first question for my new OS! First, I am taking more time to structure it, and I think it is structured well. Anyway, I have just finished my GDT and IDT code. This is the first time I have based it solely on the Intel SDM. I decided to have a separate TSS for double faults. Executing int $0x08 works: the handler gets called. To replicate the most common cause of double faults (an invalid kernel stack), I tried setting ESP to NULL and then executing an int. Logically, a double fault should be raised, and then a task switch to my TSS should occur. Instead, it triple faulted! Everything looks okay in Bochs, but I have no clue what's going on. It is acting as if the entry were not a task gate at all, even though it is a task gate in the IDT. The relevant code is at https://github.com/Nexware-Project/micr ... 6/cpu/i386
Thanks,
nexos

austanss · Post by **austanss** » Sun Jan 31, 2021 12:13 pm

Sounds like an invalid TSS.

[I have contributed all my knowledge to this answer.]

nexos · Post by **nexos** » Sun Jan 31, 2021 1:14 pm

No, the TSS isn't invalid: it works when I invoke it manually, just not when it is actually needed. Oh well, for now, I will just make it reboot on double fault and come back to that later.

Gigasoft · Post by **Gigasoft** » Sun Jan 31, 2021 1:18 pm

There are multiple problems. The first is that the DF TSS is a local variable. You also need one more TSS, which will be the normally active one.

nexos · Post by **nexos** » Sun Jan 31, 2021 1:22 pm

Gigasoft wrote:There are multiple problems. The first is that the DF TSS is a local variable. You also need one more TSS, which will be the normally active one.

Really? I didn't know that. I will go try this and see what happens. The local variable would be a problem as well; I didn't notice that.
Edit - after fixing that and fixing another bug where I didn't set CR3 in the TSS, it now works. Thanks for your help!
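A minimal sketch of the fixes discussed above, assuming the 32-bit TSS layout from the Intel SDM. The names `tss32`, `main_tss`, `df_tss`, and `df_tss_init` are hypothetical and are not taken from the linked repository; the entry point, stack top, CR3 value, and selectors are whatever the kernel's own GDT and memory map provide.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 32-bit TSS as laid out in the Intel SDM (104 bytes, no I/O bitmap). */
struct tss32 {
    uint16_t prev_task, rsvd0;      /* back link, written by the CPU on a task switch */
    uint32_t esp0; uint16_t ss0, rsvd1;
    uint32_t esp1; uint16_t ss1, rsvd2;
    uint32_t esp2; uint16_t ss2, rsvd3;
    uint32_t cr3, eip, eflags;
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
    uint16_t es, rsvd4, cs, rsvd5, ss, rsvd6;
    uint16_t ds, rsvd7, fs, rsvd8, gs, rsvd9;
    uint16_t ldt, rsvd10;
    uint16_t trap, iomap_base;
};
static_assert(sizeof(struct tss32) == 104, "TSS must be 104 bytes");

/* File-scope storage: both TSSes must outlive the function that sets them up.
 * A TSS that lives in a stack frame is gone by the time a fault needs it. */
struct tss32 main_tss;  /* the normally active TSS, referenced by TR */
struct tss32 df_tss;    /* the target of the double fault task gate */

/* Fill in the state the CPU loads when it switches to the double fault task. */
void df_tss_init(struct tss32 *tss, uint32_t entry, uint32_t stack_top,
                 uint32_t cr3, uint16_t code_sel, uint16_t data_sel)
{
    memset(tss, 0, sizeof *tss);
    tss->eip = entry;              /* double fault handler entry point */
    tss->esp = stack_top;          /* a known-good stack, separate from the kernel stack */
    tss->cr3 = cr3;                /* loaded on the switch when paging is on, so it must be valid */
    tss->eflags = 0x2u;            /* bit 1 is reserved and always set; IF stays clear */
    tss->cs = code_sel;
    tss->ss = tss->ds = tss->es = tss->fs = tss->gs = data_sel;
    tss->iomap_base = sizeof *tss; /* points past the limit: no I/O bitmap */
}
```

In a real kernel, `main_tss` would also get its own GDT descriptor and be loaded with `ltr` early on, since the CPU saves the outgoing task's state there during the switch.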

xeyes · Post by **xeyes** » Sun Jan 31, 2021 3:42 pm

I've been curious about whether the handler does anything to 'recover' from this?

Like trying to fix the back-linked TSS and IRET to it, or trying to kill the faulted user program and move on to something else.

Does double fault always mean that the kernel itself is buggy and that panic is the only way out?

sj95126 · Post by **sj95126** » Sun Jan 31, 2021 5:16 pm

xeyes wrote:Does double fault always mean that the kernel itself is buggy and that panic is the only way out?

Not necessarily - a double fault may be the expected result depending on how you've implemented your handlers.

For example, if you've swapped out your division-by-zero handler, then a division exception would trigger a page fault. You'd be expecting this and can recover as you designed it to. Note that I'm not necessarily advocating for swapping out your exception handlers, just that you could.

I don't recommend something like this, for the fairly simple reason that *usually* a double fault in the kernel is very bad, and often unrecoverable, and you're better off not making the double fault handler too complicated so it can do the important job of shutting down as responsibly as possible - crash dump, etc. etc. Once you're in a double fault, it's best not to push your luck too far.

nexos · Post by **nexos** » Sun Jan 31, 2021 5:26 pm

The reason is that double faults normally occur because the kernel stack is invalid, so a page fault triggers the next time a push or pop occurs. In kernel mode, the CPU tries to push the state onto the stack, which is invalid, hence triggering a double fault. The double fault handler needs a valid stack too, and without one the CPU triple faults. By using a task gate for double faults which points to a TSS, the double fault handler can cleanly trap these issues, hence making debugging a little simpler.
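To make the task gate idea concrete, here is a hedged sketch of how the IDT entry for vector 8 differs from an ordinary interrupt gate. The struct and function names are hypothetical, and the selector values in any call are placeholders for whatever the kernel's GDT actually uses.

```c
#include <assert.h>
#include <stdint.h>

/* One 8-byte protected-mode IDT entry. */
struct idt_entry {
    uint16_t offset_low;   /* handler offset bits 0..15 (unused for task gates) */
    uint16_t selector;     /* code segment selector, or TSS selector for a task gate */
    uint8_t  reserved;
    uint8_t  type_attr;    /* P | DPL | gate type */
    uint16_t offset_high;  /* handler offset bits 16..31 (unused for task gates) */
};
static_assert(sizeof(struct idt_entry) == 8, "IDT entries are 8 bytes");

#define GATE_PRESENT 0x80u
#define GATE_TASK    0x05u  /* task gate */
#define GATE_INT32   0x0Eu  /* 32-bit interrupt gate */

/* Task gate: only the TSS selector matters; the CPU switches tasks, so the
 * handler starts on the stack named in that TSS, not on the faulting stack. */
struct idt_entry make_task_gate(uint16_t tss_selector)
{
    struct idt_entry e = {0};
    e.selector = tss_selector;
    e.type_attr = (uint8_t)(GATE_PRESENT | GATE_TASK);
    return e;
}

/* Interrupt gate, for contrast: in ring 0 the CPU pushes the frame onto the
 * current stack, which is exactly what fails when that stack is invalid. */
struct idt_entry make_interrupt_gate(uint32_t handler, uint16_t code_sel)
{
    struct idt_entry e = {0};
    e.offset_low = (uint16_t)(handler & 0xFFFFu);
    e.offset_high = (uint16_t)(handler >> 16);
    e.selector = code_sel;
    e.type_attr = (uint8_t)(GATE_PRESENT | GATE_INT32);
    return e;
}
```

An entry built with `make_task_gate` would go into slot 8 of the IDT, with the selector pointing at the GDT descriptor of the double fault TSS.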

xeyes · Post by **xeyes** » Mon Feb 01, 2021 12:28 am

sj95126 wrote: For example, if you've swapped out your division-by-zero handler, then a division exception would trigger a page fault. You'd be expecting this and can recover as you designed it to.

Isn't this a page fault, not a double fault?

xeyes · Post by **xeyes** » Mon Feb 01, 2021 12:35 am

nexos wrote:The reason is that double faults normally occur because the kernel stack is invalid, so a page fault triggers the next time a push or pop occurs. In kernel mode, the CPU tries to push the state onto the stack, which is invalid, hence triggering a double fault. The double fault handler needs a valid stack too, and without one the CPU triple faults. By using a task gate for double faults which points to a TSS, the double fault handler can cleanly trap these issues, hence making debugging a little simpler.

In most DF occurrences I've seen, the page fault handler itself page faults and recursively uses the kernel stack until it reaches an unmapped page.

It doesn't make much sense to fix a stack trashed by a runaway PF (or other) handler as it's too hard to figure out what went wrong and what can be done when running as the DF handler.

That's why I'm curious whether DF is simply unrecoverable in all cases.

sj95126 · Post by **sj95126** » Mon Feb 01, 2021 10:02 am

xeyes wrote:
sj95126 wrote: For example, if you've swapped out your division-by-zero handler, then a division exception would trigger a page fault. You'd be expecting this and can recover as you designed it to.
Isn't this a page fault not double fault?

No, it would be a double fault. If your handler is swapped out, either you've cleared the present flag in the IDT entry (P=0), or the page holding the handler isn't present (P=0 in its page table entry), or both, but either way you're going to get a second exception because the handler cannot be executed.
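For reference, the Intel SDM decides whether a second exception escalates to a double fault by exception class (benign, contributory, page fault), not simply by the fact that a second exception occurred. The sketch below encodes that table as I read it; the function names are hypothetical, and newer vectors such as #VE and #CP are left out for brevity. By that table, a division error followed by #NP (IDT entry marked not present) does produce a double fault, while a division error followed by a page fault on the handler's code is delivered serially as an ordinary page fault.

```c
#include <assert.h>
#include <stdbool.h>

enum exc_class { EXC_BENIGN, EXC_CONTRIBUTORY, EXC_PAGE_FAULT };

/* Classify an exception vector per the SDM's double fault rules.
 * Vector 8 (#DF) itself is not handled here: a fault while delivering
 * a double fault leads to shutdown (triple fault) instead. */
enum exc_class classify(unsigned vector)
{
    switch (vector) {
    case 0:   /* #DE divide error */
    case 10:  /* #TS invalid TSS */
    case 11:  /* #NP segment not present */
    case 12:  /* #SS stack fault */
    case 13:  /* #GP general protection */
        return EXC_CONTRIBUTORY;
    case 14:  /* #PF page fault */
        return EXC_PAGE_FAULT;
    default:
        return EXC_BENIGN;
    }
}

/* True if `second`, raised while delivering `first`, becomes a double fault. */
bool is_double_fault(unsigned first, unsigned second)
{
    enum exc_class a = classify(first);
    enum exc_class b = classify(second);

    if (a == EXC_CONTRIBUTORY)
        return b == EXC_CONTRIBUTORY;  /* contributory + page fault is serial */
    if (a == EXC_PAGE_FAULT)
        return b != EXC_BENIGN;        /* page fault + (contributory or page fault) */
    return false;                      /* benign first: always handled serially */
}
```

For example, `is_double_fault(14, 14)` covers the invalid kernel stack case from earlier in the thread: the first page fault cannot push its frame, which raises a second page fault.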

nexos · Post by **nexos** » Mon Feb 01, 2021 10:21 am

IMO doing it this way allows double faults to be handled cleanly. Linux and ReactOS do this, so it must be considered good practice.

nullplan · Post by **nullplan** » Mon Feb 01, 2021 2:14 pm

sj95126 wrote:For example, if you've swapped out your division-by-zero handler, then a division exception would trigger a page fault. You'd be expecting this and can recover as you designed it to. Note that I'm not necessarily advocating for swapping out your exception handlers, just that you could.

Note that that is a terrible idea: The double fault is an abort-type exception, and therefore the return address given in the interrupt frame is invalid (unpredictable). Therefore, once the double fault handler is invoked, it must not return before setting that address to a known good value. The address of the faulting instruction in this case would be lost.

xeyes wrote:That's why I'm curious whether DF is simply unrecoverable in all cases.

No, only most of them. See below for details.

nexos wrote:IMO doing it this way allows double faults to be handled cleanly. Linux and ReactOS do this, so it must be considered good practice.

Linux only handles very specific double faults, and panics for all others. In particular, it handles double faults occurring while on an ESPFIX stack. Since interrupts are disabled when the read-only ESPFIX stack is loaded, pretty much the only way to fail is if the IRET itself fails. And that can only happen due to invalid addresses in the interrupt frame, which can only happen because the program did something stupid, and therefore those double faults just emulate a general protection fault due to bad IRET.

This is not a concern for those of us that don't allow 16-bit code to run at all in their OSes, and so this use case disappears. My kernel only panics on double fault.

sj95126 · Post by **sj95126** » Mon Feb 01, 2021 2:42 pm

nullplan wrote:
sj95126 wrote:For example, if you've swapped out your division-by-zero handler, then a division exception would trigger a page fault. You'd be expecting this and can recover as you designed it to. Note that I'm not necessarily advocating for swapping out your exception handlers, just that you could.
Note that that is a terrible idea: The double fault is an abort-type exception, and therefore the return address given in the interrupt frame is invalid (unpredictable). Therefore, once the double fault handler is invoked, it must not return before setting that address to a known good value. The address of the faulting instruction in this case would be lost.

Right, which is why I said I don't recommend it.

Theoretically, if you had a multithreaded kernel, and it was a worker thread that double faulted, and you've properly stored the intermediate results of that worker thread, and/or could restart its task, then you could simply abandon that thread and you wouldn't care about not having the return address.

But as I said, IMO, the safest thing to do with a double fault is shut down (panic) as gracefully as possible as quickly as possible. It might have been something very important that failed to happen as a result of the double fault, and you likely would only make things much much worse trying to investigate and/or repair it.

nexos · Post by **nexos** » Mon Feb 01, 2021 3:30 pm

Note that when I say handle, that could mean a panic. In all cases, my double fault handler will panic. It is just to prevent triple faults.
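As a rough illustration of a panic-only handler running behind a task gate: on the task switch the CPU saves the interrupted task's registers into the TSS that was active at the time and writes that TSS's selector into the back link of the double fault TSS. The sketch below gathers those values for a panic message. `df_report` and `df_collect` are hypothetical names, and how the interrupted TSS is located (here simply passed in) is an assumption; since #DF is an abort, the saved EIP should be treated as a hint, not a precise fault address.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit TSS as laid out in the Intel SDM (104 bytes). */
struct tss32 {
    uint16_t prev_task, rsvd0;      /* back link to the interrupted task's TSS */
    uint32_t esp0; uint16_t ss0, rsvd1;
    uint32_t esp1; uint16_t ss1, rsvd2;
    uint32_t esp2; uint16_t ss2, rsvd3;
    uint32_t cr3, eip, eflags;
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
    uint16_t es, rsvd4, cs, rsvd5, ss, rsvd6;
    uint16_t ds, rsvd7, fs, rsvd8, gs, rsvd9;
    uint16_t ldt, rsvd10;
    uint16_t trap, iomap_base;
};
static_assert(sizeof(struct tss32) == 104, "TSS must be 104 bytes");

/* What a panic screen might want to print. */
struct df_report {
    uint16_t prev_tss_sel;  /* selector of the TSS that was active at the fault */
    uint32_t eip;           /* saved EIP (imprecise for an abort) */
    uint32_t esp;           /* saved ESP, often the interesting value */
    uint32_t eflags;
};

/* Collect diagnostics from the double fault TSS and the interrupted TSS.
 * The handler would print these and then halt; it never returns with IRET. */
struct df_report df_collect(const struct tss32 *df, const struct tss32 *interrupted)
{
    struct df_report r;
    r.prev_tss_sel = df->prev_task;
    r.eip = interrupted->eip;
    r.esp = interrupted->esp;
    r.eflags = interrupted->eflags;
    return r;
}
```

A saved ESP of zero, or one pointing just below the kernel stack's lowest mapped page, would point at the invalid-stack scenario described earlier in the thread.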

OSDev.org
