
Self-modifying exception handlers.

Posted: Fri Oct 24, 2008 7:50 pm
by 01000101
Hey,
I've been working on some exception handlers for my OS (as an experiment) and plan to use these to catch exceptions before the computer reboots in order to hopefully correct the issue. I was wondering if anyone has worked on exception handlers that actually 'correct' the exception.

For example, say exception 0 (divide by zero) fires. Is the 'norm' to just notify the user/OS that this has happened and then halt or reboot, or do some of you actually go back and correct the error?

I've been working on (and for the most part finished) an exception 0 handler (div by zero) that goes back through to the interrupted RIP and either A: fixes the 'zero' variable/memory reference, or B: jumps over that particular code and continues as if that divide never happened. If the jump occurs, it leaves both variables/memory locations alone and will result in some wrong math (maybe), but no rebooting or halting.

Is this a good idea, or are there some extreme pitfalls that I have yet to run into?

Re: Self-modifying exception handlers.

Posted: Fri Oct 24, 2008 10:30 pm
by Brendan
Hi,
01000101 wrote:I've been working on (and for the most part finished) an exception 0 handler (div by zero) that goes back through to the interrupted RIP and either A: fixes the 'zero' variable/memory reference, or B: jumps over that particular code and continues as if that divide never happened. If the jump occurs, it leaves both variables/memory locations alone and will result in some wrong math (maybe), but no rebooting or halting.

Is this a good idea, or are there some extreme pitfalls that I have yet to run into?
It can work as intended, but silently "fixing" bugs just makes it harder to find the bugs (and making it harder to find the bugs is hopefully the last thing you want to do). If I stuff something up, I want to know where my mistake is and I want to know immediately. I don't want to spend hours tracing some sort of miscalculation back to a dodgy division that the kernel knew about and hid from me.

Usually you'd want to notify the user (e.g. "blue screen of death") and kill the task that caused the exception (or lock up the computer if the kernel caused the exception). Other possibilities include doing a core dump, starting a bug report form that the user can fill in and email to the people who wrote the code, suspending the task and launching a debugger, invoking a task's signal handler, etc.

For some exceptions it is normal for the exception handler to try to fix the problem and continue. For example, the invalid opcode exception handler (and the device not available exception) might emulate instructions that the CPU doesn't support. Certain exceptions used for debugging (debug exception, breakpoint exception, alignment check exception) might add details to a log and continue (or maybe halt the task and let a debugger know, so the debugger can "unhalt" the task later). The page fault handler might check if a page needs to be loaded from swap space, load the page from swap space and then continue.

For these exception handlers you might still end up with a "blue screen of death" (or something). For example, if a task tries to execute an instruction that has never existed then the invalid opcode handler won't be able to emulate the instruction.

Also, some conditions don't involve exceptions but you still might want a "blue screen of death" (or something). For example, if the OS completely runs out of (virtual) memory, then the kernel might do a "blue screen of death" (or something) and kill a task to free some memory. I normally use the same approach for the spinlock handling code in the kernel, so that if I mess up the kernel's locks (e.g. freeing a lock that's already free) I know immediately (which is a lot better than getting a deadlock or synchronization issues).
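
As a rough illustration of the spinlock check described above, here is a minimal sketch in C. Every name in it (spinlock_t, spin_release, report_lock_bug, lock_bugs) is hypothetical and only stands in for whatever the kernel really uses; report_lock_bug is a placeholder for passing control to the critical error path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical names throughout; a sketch, not any real kernel's lock code. */

static int lock_bugs;   /* number of lock misuse reports so far */

/* Stand-in for handing control to a critical error handler. */
static void report_lock_bug(const char *what)
{
    assert(what != NULL);
    lock_bugs++;
    fprintf(stderr, "critical error: %s\n", what);
}

typedef struct {
    atomic_bool locked;
} spinlock_t;

static void spin_init(spinlock_t *lock)
{
    atomic_init(&lock->locked, false);
}

static void spin_acquire(spinlock_t *lock)
{
    bool expected = false;
    /* Spin until we are the one who flips the flag from false to true. */
    while (!atomic_compare_exchange_weak(&lock->locked, &expected, true))
        expected = false;   /* a failed exchange overwrote 'expected' */
}

/* Returns true on a normal release, false if the lock was already free. */
static bool spin_release(spinlock_t *lock)
{
    /* atomic_exchange returns the value the flag held before the store. */
    if (!atomic_exchange(&lock->locked, false)) {
        report_lock_bug("released a spinlock that was already free");
        return false;
    }
    return true;
}
```

The point of the design is that the misuse is reported at the moment of the bad release, rather than showing up later as a deadlock somewhere else.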

I normally split things into 2 stages - there's exception handlers and there's critical error handlers, where the exception handlers either fix the problem or pass control to the critical error handler (and where the kernel can also pass control to the critical error handler directly). This allows the critical error handler to be configurable. For example, a "blue screen of death" is entirely useless on a headless computer (with no video card), so you might want to report the errors by appending them to a log file, or sending them over the network or over serial, where the administrator can configure which action/s should be taken.
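
The two-stage split above could be sketched roughly like this in C. This is only an illustration under assumptions: the sink flags, configured_sinks, handle_exception, critical_error and pretend_swap_in are all made-up names, and printing to stdout stands in for the real screen, serial and log file back ends.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical names throughout; a sketch of the idea, not a real kernel. */

enum report_sink {
    SINK_SCREEN   = 1 << 0,   /* "blue screen of death" */
    SINK_SERIAL   = 1 << 1,   /* serial port */
    SINK_LOG_FILE = 1 << 2,   /* append to a log file */
};

static unsigned configured_sinks = SINK_SCREEN;  /* chosen by the administrator */
static unsigned last_sinks_used;                 /* recorded for illustration only */

/* Stage 2: critical error handler. Reports through every configured sink. */
static void critical_error(const char *what)
{
    assert(what != NULL);
    last_sinks_used = 0;
    if (configured_sinks & SINK_SCREEN) {
        printf("[screen] %s\n", what);
        last_sinks_used |= SINK_SCREEN;
    }
    if (configured_sinks & SINK_SERIAL) {
        printf("[serial] %s\n", what);
        last_sinks_used |= SINK_SERIAL;
    }
    if (configured_sinks & SINK_LOG_FILE) {
        printf("[log file] %s\n", what);
        last_sinks_used |= SINK_LOG_FILE;
    }
}

/* Example fixer: pretends a page was loaded from swap successfully. */
static bool pretend_swap_in(int vector)
{
    (void)vector;
    return true;
}

/* Stage 1: exception handler. Either fixes the problem or escalates. */
static bool handle_exception(int vector, bool (*try_fix)(int))
{
    if (try_fix != NULL && try_fix(vector))
        return true;                      /* fixed, the task can continue */

    char msg[64];
    snprintf(msg, sizeof msg, "unrecoverable exception %d", vector);
    critical_error(msg);
    return false;
}
```

Because the kernel can also call critical_error directly, the same configuration covers problems that never raise an exception at all.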


Cheers,

Brendan

Re: Self-modifying exception handlers.

Posted: Sat Oct 25, 2008 1:45 am
by xyzzy
Brendan wrote:I normally split things into 2 stages - there's exception handlers and there's critical error handlers, where the exception handlers either fix the problem or pass control to the critical error handler (and where the kernel can also pass control to the critical error handler directly). This allows the critical error handler to be configurable. For example, a "blue screen of death" is entirely useless on a headless computer (with no video card), so you might want to report the errors by appending them to a log file, or sending them over the network or over serial, where the administrator can configure which action/s should be taken.
One thing I've always wondered about writing information about complete system failures (i.e. BSOD, kernel panic) to a log file: How do you know that it's safe to do it? The error could have occurred, for example, because some data structures in the VFS are corrupted. It could be potentially dangerous to attempt to write to the log file in this situation. If the error was caused by a bad pointer in the VFS or something, attempting to write could cause another exception.

How would you get around this? AFAIK Windows manages to do it, because you get events in the Event Log for BSODs.

Re: Self-modifying exception handlers.

Posted: Sat Oct 25, 2008 3:24 am
by Brendan
Hi,
AlexExtreme wrote:One thing I've always wondered about writing information about complete system failures (i.e. BSOD, kernel panic) to a log file: How do you know that it's safe to do it? The error could have occurred, for example, because some data structures in the VFS are corrupted. It could be potentially dangerous to attempt to write to the log file in this situation. If the error was caused by a bad pointer in the VFS or something, attempting to write could cause another exception.

How would you get around this? AFAIK Windows manages to do it, because you get events in the Event Log for BSODs.
It depends what died.

If the exception occurred inside the kernel then anything in the kernel or anything in the current address space may have been trashed; and because all other code (in different address spaces) relies on the kernel you're mostly screwed (nothing you attempt to do can be guaranteed to work reliably). The only thing you can do here is reduce the chance of this happening, by reducing the amount of code inside the kernel, doing lots of testing, etc.

If the exception occurred inside a normal process, then only that process would have been affected. As long as the process isn't relied on to write to your log file (e.g. the virtual file system, the file system code or the disk driver) then there's no problem.

If the exception occurred inside a process that is relied on, then you could maybe find an alternative - write to something on the network, or write to a file system on a different disk drive.
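
The three cases above boil down to a small decision, which might be sketched like this in C. The enum and function names are invented for the example; how the kernel actually works out where a fault came from is outside the sketch.

```c
#include <assert.h>

/* Hypothetical names; only the three-way decision itself comes from the text. */

enum fault_origin {
    FAULT_IN_KERNEL,            /* kernel code faulted */
    FAULT_IN_ORDINARY_PROCESS,  /* a process the log path does not rely on */
    FAULT_IN_LOGGING_PATH,      /* e.g. VFS, file system code or disk driver */
};

enum report_route {
    ROUTE_BEST_EFFORT,   /* nothing is guaranteed to work, try anyway */
    ROUTE_NORMAL_LOG,    /* the usual log file is still trustworthy */
    ROUTE_ALTERNATIVE,   /* network, or a file system on another disk */
};

static enum report_route choose_report_route(enum fault_origin origin)
{
    switch (origin) {
    case FAULT_IN_ORDINARY_PROCESS:
        return ROUTE_NORMAL_LOG;
    case FAULT_IN_LOGGING_PATH:
        return ROUTE_ALTERNATIVE;
    case FAULT_IN_KERNEL:
        return ROUTE_BEST_EFFORT;
    }
    assert(!"unknown fault origin");
    return ROUTE_BEST_EFFORT;
}
```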


Cheers,

Brendan