How do you use exceptions interrupts in your OS?
- eryjus
- Member
- Posts: 286
- Joined: Fri Oct 21, 2011 9:47 pm
- Libera.chat IRC: eryjus
- Location: Tustin, CA USA
How do you use exceptions interrupts in your OS?
Hi everyone,
I'm laying the foundation for my interrupt routines for the CPU exceptions. I have the Intel Guide Volume 3A, Chapter 6 at my side as I'm writing skeleton functions. As I read about each exception, I started thinking about what I could and what I should do in these exceptions. Some of them seem quite obvious (Page Fault, Breakpoint Trap), and others are somewhat elusive to me (NMI, Overflow Trap).
For anything in the exception class "Fault", it seems like some sort of recovery should be possible, especially since CS:RIP is pointing to the instruction causing the fault and therefore an iretq will effectively be a "try again" to the CPU. #PF (Page Fault) is a perfect example of this, where the page can be loaded from disk and the faulting instruction can be attempted again. However, #DE (Divide Error) seems less likely since there is no way to store a valid result in the result registers -- so, I believe most OSs will simply terminate the offending process. Even more elusive to me is what the heck to do with an NMI interrupt.
Traps should also allow a program to be restarted; they appear to me to serve more like exit points when certain conditions are met. A breakpoint seems reasonable for inserting a breakpoint in the code and then single-stepping from that point. But what might you do with an Overflow trap?
So, my question is: What, if any, special handling are you doing in your exception interrupts?
Adam
The name is fitting: Century Hobby OS -- At this rate, it's gonna take me that long!
Read about my mistakes and missteps with this iteration: Journal
"Sometimes things just don't make sense until you figure them out." -- Phil Stahlheber
Re: How do you use exceptions interrupts in your OS?
You appear to understand the different exceptions pretty well, so you just need to decide what you want your OS to do when it receives each of these events.
If you do nothing, and just IRET, your machine will simply lock up, or reboot on most exceptions.
If you are using stack frames, you can "find" an exception handler provided by the application that you can call.
If not, you will need to get creative, or you can simply kill the running application.
These are the only options that I'm aware of, but there are probably more.
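For illustration only (this is not taken from any poster's kernel), here is a minimal C sketch of how those three options could hang off a single exception entry point. Every name in it (trap_frame, current, user_handler_for, arrange_user_callback, kill_task, schedule) is an assumption made for the example.
Code:
#include <stdint.h>
#include <stddef.h>

struct trap_frame { uint64_t rip, cs, rflags, rsp, ss; /* GPRs omitted */ };
struct task;

extern struct task *current;                       /* task running on this CPU */
extern void panic(const char *fmt, ...);
extern void (*user_handler_for(struct task *t, int vector))(int);
extern void arrange_user_callback(struct trap_frame *tf, void (*h)(int), int vector);
extern void kill_task(struct task *t);
extern void schedule(void);

/* Common C entry point reached from the IDT stubs. "Do nothing and IRET"
 * is omitted because, for faults, it just re-executes the faulting
 * instruction forever. */
void exception_dispatch(int vector, struct trap_frame *tf)
{
    /* A fault taken in kernel mode has nothing sane to return to. */
    if ((tf->cs & 3) == 0)
        panic("exception %d at rip=%lx", vector, tf->rip);

    /* Option: the application registered its own handler; arrange to call it. */
    void (*handler)(int) = user_handler_for(current, vector);
    if (handler != NULL) {
        arrange_user_callback(tf, handler, vector);  /* re-enter at CPL=3 */
        return;
    }

    /* Option: no handler registered, so kill the offending application. */
    kill_task(current);
    schedule();
}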
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott
- mathematician
- Member
- Posts: 437
- Joined: Fri Dec 15, 2006 5:26 pm
- Location: Church Stretton Uk
Re: How do you use exceptions interrupts in your OS?
A non-maskable interrupt used to mean, and probably still does, that your memory chips are frying. The only thing you can do under those circumstances is to shut down the system, with a recommendation to the user that he buy some new RAM, or at least run a thorough memory check.
The continuous image of a connected set is connected.
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
Re: How do you use exceptions interrupts in your OS?
#OF is raised conditionally by the INTO instruction (when the overflow flag is set). You can use it to signal a numeric error to the executing code, you can kill the process, or you can ignore it.
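For context, a hedged illustration of how #OF gets raised in the first place. Note that INTO is only encodable in 16/32-bit code; in 64-bit mode it is an invalid opcode, so a long-mode kernel will normally never see #OF at all.
Code:
/* 32-bit user code (illustration only): raise #OF deliberately if a
 * signed add overflows. The kernel's #OF handler then decides whether
 * to notify the program, kill it, or ignore the trap. */
#include <stdint.h>

int32_t checked_add(int32_t a, int32_t b)
{
    int32_t r;
    __asm__ volatile(
        "addl %2, %0\n\t"
        "into"                 /* traps to vector 4 (#OF) if OF=1 */
        : "=r"(r)
        : "0"(a), "r"(b)
        : "cc");
    return r;
}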
Re: How do you use exceptions interrupts in your OS?
Hi,
eryjus wrote: So, my question is: What, if any, special handling are you doing in your exception interrupts?
The first thing any exception handler will do is determine if the problem is unrecoverable, or if there's something it can/should do.
How you handle unrecoverable things (e.g. blue screen of death, core dump, kernel panic, whatever) isn't "special handling" (it should be a generic "unrecoverable handler") and isn't necessarily only used by exceptions (e.g. kernel might use it when it detects security problems; like a process asking to free kernel space) and it can get complicated (e.g. if there's no video driver or the video driver crashed, then "blue screen of death" isn't an option - you probably want several alternatives for informing the user/administrator of the problem), so I won't go into it here.
There's also the whole issue of "signals". If your OS supports them (I think they're a major mistake, but they're required by a lot of legacy languages like C/C++); then sending a signal back to the process that crashed is another way of handling unrecoverable problems.
For recoverable problems only:
- 0x00 - Divide Error: Always unrecoverable?
- 0x01 - Debug Exception: Multiple causes (single-step, data breakpoint, etc). Options include halting the thread and informing a different process (debugger), passing control to a debugger within the same process, adding details to some sort of "debug log", etc. This is never unrecoverable. Note: this can be used by the OS for special purposes - e.g. see "0x11 - Alignment Check".
- 0x02 - NMI: If the kernel isn't using/abusing it for special purpose/s (e.g. "sampling profiler" that isn't affected by disabling IRQs); assume it indicates critical hardware failure and treat it the same as an unrecoverable problem in kernel space.
- 0x03 - Breakpoint Exception: Possibly similar to "0x01 - Debug Exception". If you want, you could use it for anything you like (e.g. I tend to use it in boot code for a "boot API calling mechanism" because "int3" is a 1-byte instruction).
- 0x04 - Overflow Exception: Always unrecoverable? Probably not used by most languages.
- 0x05 - Bound Range Exceeded Exception: Always unrecoverable? Probably not used by most languages.
- 0x06 - Invalid Opcode Exception: This is where you emulate instructions that the CPU doesn't support (so that applications designed for newer CPUs will still work correctly on older CPUs). If the OS can't/doesn't emulate the instruction that caused the exception, then it's unrecoverable.
- 0x07 - Device Not Available Exception: This is used as part of the "lazy FPU/MMX/SSE state saving on task switches" thing; where (in the hope of improving task switch performance where tasks don't use FPU/MMX/SSE) you can postpone loading the FPU/MMX/SSE state until it's actually necessary (see the first sketch after this list). In addition; this may also be used for emulation (e.g. if the CPU doesn't have an FPU at all, then you can use this exception to emulate the FPU in software). Note: Even if an FPU is present, you might still want to emulate it instead for some reason (e.g. the floating point division bug in the original Pentium). For another example, maybe your kernel could provide "insane precision big rational number support" and let processes switch between FPU and rational numbers whenever they like, where most code (e.g. in libraries, etc) simply doesn't need to know or care if it's using "limited precision FPU" or "insane precision rational numbers".
- 0x08 - Double Fault: Always unrecoverable?
- 0x0A - Invalid TSS Exception: Always unrecoverable?
- 0x0B - Segment Not Present: Probably always unrecoverable. If an OS uses segmentation and not paging, then it may be used for virtual memory management (e.g. when you get a "Segment Not Present" exception, you load the missing segment from disk and retry).
- 0x0C - Stack Fault: Always unrecoverable?
- 0x0D - General Protection Fault: May be used for special tricks (e.g. I've used it to reset GS to its correct value, so the kernel can always assume GS is correct even if user-space changes it). May also be used for emulation, including emulating instructions that access IO ports, emulating the TSC (when "RDTSC" at CPL=3 has been disabled for security reasons) and for virtual8086 mode.
- 0x0E - Page Fault: Typically a major part of virtual memory management (lots of tricks - allocate on write, copy on write, swap space, memory mapped files, etc); see the second sketch after this list.
- 0x10 - Floating Point Error: Multiple causes. In theory the OS can always recover from them regardless of actual cause (simply by emulating what the FPU would've done if the error was masked); but in practice that's a waste of time (if someone wanted that they would've just masked them in the first place) so it's best to treat them as unrecoverable instead. The "stack overflow/underflow" may be used to pretend there's more than 8 FPU registers (e.g. "infinite" FPU registers); which can allow software to avoid unnecessary FPU saves/loads and improve performance. Note: For the "cdecl" calling convention, ST0 to ST7 must be empty at function calls, which means that the caller has to store and then reload any FPU registers that are in use just in case the callee might use them and run out of FPU registers.
- 0x11 - Alignment Check: In theory, this is entirely recoverable. The problem is that if you're using it for profiling (e.g. creating a log of all misaligned accesses) you have to disable alignment checking to execute the instruction. To work around that you can use the debug exception - essentially, log the instruction that caused the misaligned access, disable alignment checking, enable the "instruction breakpoint" for the instruction and return (so that the CPU executes the instruction then triggers the "instruction breakpoint" trap); then (in the debug exception handler) re-enable alignment checking and disable the "instruction breakpoint" so that you can continue logging misaligned accesses.
- 0x12 - Machine Check: This requires special (CPU specific) handling to diagnose the cause of the hardware problem and report it to the user/administrator (before halting the computer). If your OS doesn't support the specific CPU, then it should leave "machine check" disabled so that it can't happen (and you get a reset instead).
- 0x13 - SIMD Floating Point Exception: Similar to "0x10 - Floating Point Error" (but for SSE instead of FPU). Multiple causes, but all should be treated as unrecoverable (and masked by software that wants to ignore them).
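To make the "0x07 - Device Not Available" item above more concrete, here is a hedged sketch of lazy FPU/SSE state switching (not taken from any poster's kernel; struct task, current and fpu_owner are assumed names). The task switch code sets CR0.TS instead of saving FPU state, and the first FPU/SSE instruction the next task executes lands here.
Code:
#include <stdint.h>

struct fxsave_area { uint8_t bytes[512]; } __attribute__((aligned(16)));

struct task {
    struct fxsave_area fpu;           /* FXSAVE/FXRSTOR area */
    int used_fpu;
};

extern struct task *current;          /* task now running on this CPU   */
static struct task *fpu_owner;        /* task whose state is in the FPU */

/* #NM handler: the task switch path is assumed to have set CR0.TS. */
void nm_device_not_available(void)
{
    __asm__ volatile("clts");         /* clear CR0.TS so FPU ops work   */

    if (fpu_owner == current)
        return;                       /* our state is already loaded    */

    if (fpu_owner)                    /* save the previous owner's state */
        __asm__ volatile("fxsave %0" : "=m"(fpu_owner->fpu));

    if (current->used_fpu) {
        __asm__ volatile("fxrstor %0" : : "m"(current->fpu));
    } else {
        __asm__ volatile("fninit");   /* first use: start with a clean state */
        current->used_fpu = 1;
    }
    fpu_owner = current;
}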
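And for the "0x0E - Page Fault" item, a hedged skeleton of the recoverable-vs-unrecoverable decision. Again this is purely illustrative: vma_lookup, vma_is_cow, page_in, cow_break and kill_or_panic are assumed helper names, not a real API.
Code:
#include <stdint.h>

#define PF_PRESENT 0x1   /* error code bit 0: protection violation (vs. not-present) */
#define PF_WRITE   0x2   /* error code bit 1: the access was a write                 */

struct task; struct vma;
extern struct task *current;
extern struct vma *vma_lookup(struct task *t, uint64_t addr);
extern int  vma_is_cow(struct vma *v);
extern void page_in(struct vma *v, uint64_t addr);
extern void cow_break(struct vma *v, uint64_t addr);
extern void kill_or_panic(uint64_t addr, uint64_t err);

static inline uint64_t read_cr2(void)
{
    uint64_t v;
    __asm__ volatile("mov %%cr2, %0" : "=r"(v));
    return v;
}

void page_fault_handler(uint64_t error_code)
{
    uint64_t addr = read_cr2();                /* faulting linear address */
    struct vma *vma = vma_lookup(current, addr);

    if (vma && !(error_code & PF_PRESENT)) {
        /* Not-present page inside a known region: allocate, load from
         * swap/file, map it, and let iretq retry the instruction. */
        page_in(vma, addr);
        return;
    }
    if (vma && (error_code & PF_WRITE) && vma_is_cow(vma)) {
        cow_break(vma, addr);                  /* copy-on-write */
        return;
    }
    kill_or_panic(addr, error_code);           /* everything else is unrecoverable */
}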
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
- eryjus
- Member
- Posts: 286
- Joined: Fri Oct 21, 2011 9:47 pm
- Libera.chat IRC: eryjus
- Location: Tustin, CA USA
Re: How do you use exceptions interrupts in your OS?
Brendan wrote: <snip...>
Thank you, once again, for your detailed reply. Lots to consider...
Cheers
Adam
The name is fitting: Century Hobby OS -- At this rate, it's gonna take me that long!
Read about my mistakes and missteps with this iteration: Journal
"Sometimes things just don't make sense until you figure them out." -- Phil Stahlheber
Re: How do you use exceptions interrupts in your OS?
Brendan wrote: There's also the whole issue of "signals". If your OS supports them (I think they're a major mistake, but they're required by a lot of legacy languages like C/C++); then sending a signal back to the process that crashed is another way of handling unrecoverable problems.
The "signals" support in the C standard requires no support from the kernel. No IPC is required by the C standard (since IPC implies multiple processes, multitasking, etc., none of which is required by the C standard). C does not even define a way to send signals to a different process ("raise()" is part of C, but not "kill()").
POSIX obviously goes further and defines "kill()", requiring IPC.
It's absolutely a good idea to provide some way of allowing an application to handle its own errors; the application is in a far better position to attempt to recover from those errors than the kernel is. Even if it can't "recover", it can at least try to save data. Whether you do this via "signals", "events", "messages", something like Windows' SEH, or some completely different method is up to you. (Although it's a good idea to at least consider how you might port POSIX applications. I'm going for a method based on events/messages.)
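To illustrate that point, a hedged sketch of how a freestanding C library could provide ISO C signal() and raise() with no kernel involvement at all: raise() just consults a per-process handler table. The constants and the default action are simplified for the example; this is not any particular libc's code.
Code:
typedef void (*sig_handler_t)(int);

#define SIG_DFL ((sig_handler_t)0)
#define SIG_IGN ((sig_handler_t)1)
#define SIG_ERR ((sig_handler_t)-1)
#define NSIG    32

extern void _Exit(int status);          /* provided elsewhere in the same library */

static sig_handler_t handlers[NSIG];    /* all SIG_DFL initially */

sig_handler_t signal(int sig, sig_handler_t func)
{
    if (sig <= 0 || sig >= NSIG)
        return SIG_ERR;
    sig_handler_t old = handlers[sig];
    handlers[sig] = func;
    return old;
}

int raise(int sig)
{
    if (sig <= 0 || sig >= NSIG)
        return -1;                      /* nonzero: nothing delivered */
    sig_handler_t h = handlers[sig];
    if (h == SIG_IGN)
        return 0;
    if (h == SIG_DFL) {
        _Exit(1);                       /* simplified default action  */
    }
    handlers[sig] = SIG_DFL;            /* traditional behaviour: reset, then call */
    h(sig);
    return 0;
}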
Re: How do you use exceptions interrupts in your OS?
Hi,
mallard wrote: The "signals" support in the C standard requires no support from the kernel. No IPC is required by the C standard (since IPC implies multiple processes, multitasking, etc., none of which is required by the C standard). C does not even define a way to send signals to a different process ("raise()" is part of C, but not "kill()").
How do you think SIGFPE, SIGILL or SIGSEGV are supposed to work without the kernel's exception handlers being involved?
The only 2 ways I can think of are running the entire program inside a virtual machine, or building a massive pile of run-time checks into the resulting binary; so that things like arithmetic errors, illegal instructions and memory access problems can be caught without using the CPU's exceptions. Obviously, this would be insanely slow (and not something the majority of C compilers will ever support).
Basically, the only sane way (that doesn't involve destroying performance) is for the kernel's exception handlers to diddle with the user-space stack, so that when the kernel returns from CPL=0 to CPL=3 it returns to the standard library's signal handler (and doesn't return to the instruction that caused the exception).
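A hedged sketch of that "diddle with the user-space stack" step. Every name here (trap_frame, copy_to_user, cur_thread, sig_trampoline registered earlier by the standard library via some syscall) is an assumption for illustration, not a description of any real kernel, and ABI alignment details are glossed over.
Code:
#include <stdint.h>

struct trap_frame {                   /* saved CPL=3 context, pushed on entry */
    uint64_t rdi, rsi;                /* ... other GPRs omitted ...           */
    uint64_t rip, cs, rflags, rsp, ss;
};

struct thread { uint64_t sig_trampoline; /* user-space handler entry point */ };
extern struct thread *cur_thread;
extern void copy_to_user(void *uaddr, const void *src, uint64_t len);

void deliver_user_signal(struct trap_frame *tf, int signum)
{
    uint64_t usp = tf->rsp;

    usp -= 128;                        /* skip the System V red zone      */
    usp -= sizeof(struct trap_frame);  /* room for the interrupted state  */
    usp &= ~0xFULL;                    /* keep 16-byte stack alignment    */

    /* Save the interrupted context on the user stack so a "sigreturn"-style
     * call can restore it later and resume the faulting code (or not). */
    copy_to_user((void *)usp, tf, sizeof(*tf));

    /* Rewrite the frame: the iretq at the end of the exception handler now
     * lands in the library's trampoline instead of re-running the faulting
     * instruction. */
    tf->rdi = (uint64_t)signum;        /* argument 1: signal number   */
    tf->rsi = usp;                     /* argument 2: saved context   */
    tf->rsp = usp;
    tf->rip = cur_thread->sig_trampoline;
}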
Note that the reason I think this is a pile of crud is that the code that caused the exception/signal could've been in the middle of anything at all at the time (e.g. including holding 123 different mutexes, freeing a dodgy area of memory, etc); which makes it extremely difficult for the signal handler to do anything.
mallard wrote: It's absolutely a good idea to provide some way of allowing an application to handle its own errors; the application is in a far better position to attempt to recover from those errors than the kernel is. Even if it can't "recover", it can at least try to save data. Whether you do this via "signals", "events", "messages", something like Windows' SEH, or some completely different method is up to you. (Although it's a good idea to at least consider how you might port POSIX applications. I'm going for a method based on events/messages.)
If your process can't be trusted to run without crashing, then it can't be trusted to "recover" after crashing either. It deserves to be wiped from the hard disk. No excuses.
Of course I also think that porting POSIX applications is a serious mistake too (unless you think you can run those POSIX applications better than Linux/FreeBSD/Solaris does).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: How do you use exceptions interrupts in your OS?
Brendan wrote: How do you think SIGFPE, SIGILL or SIGSEGV are supposed to work without the kernel's exception handlers being involved?
Easy: when you call "raise()", which is the only standard C way of "sending" a signal, the C runtime just looks up the handler in a table. If there is an entry for that signal, the handler is run; if there isn't, nothing happens (return nonzero from "raise()"). The C standard does not guarantee that those signals are actually generated when an error of the appropriate kind occurs, so it's just fine (according to the C standard) to not support them and simply terminate the process.
The C standard does not require any kernel support for signals or any other error-handling mechanism.
If your kernel does support error handling via a non-POSIX-signal method, the handler can simply call "raise()" to provide compatibility.
Brendan wrote: If your process can't be trusted to run without crashing, then it can't be trusted to "recover" after crashing either. It deserves to be wiped from the hard disk. No excuses.
"Trust" has nothing to do with it. In the real world, programs occasionally have bugs (even yours, I'm certain). It's simple pragmatism to have some method to enable those bugs to be handled as gracefully as possible. The kernel can't save the user's data, but maybe the program can, so it's worth giving it the possibility.
Brendan wrote: Of course I also think that porting POSIX applications is a serious mistake too (unless you think you can run those POSIX applications better than Linux/FreeBSD/Solaris does).
And you're perfectly free to think that. I respect your opinion, even though I don't share it.
Personally, I think that porting existing applications is a good way to add functionality to my OS without having to build absolutely everything from scratch. I lack the time, expertise and inclination to build (for example) a replacement C compiler that'll be anything like as good as GCC. Unless you've got the resources of Microsoft, you won't be able to build everything from scratch in one lifetime (and even Microsoft occasionally license/buy products/technologies from elsewhere).
- eryjus
- Member
- Posts: 286
- Joined: Fri Oct 21, 2011 9:47 pm
- Libera.chat IRC: eryjus
- Location: Tustin, CA USA
Re: How do you use exceptions interrupts in your OS?
mallard wrote: ... you won't be able to build everything from scratch in one lifetime ...
Funny! That's a large part of how I settled on the name CenturyOS.
Adam
The name is fitting: Century Hobby OS -- At this rate, it's gonna take me that long!
Read about my mistakes and missteps with this iteration: Journal
"Sometimes things just don't make sense until you figure them out." -- Phil Stahlheber
Re: How do you use exceptions interrupts in your OS?
Hi,
mallard wrote: Easy: when you call "raise()", which is the only standard C way of "sending" a signal, the C runtime just looks up the handler in a table.
Did you even try to think about this before saying it?
Imagine you've used a dodgy function pointer and now you're executing random trash. Who calls "raise()" to send the SIGILL signal (immediately before the CPU would have generated its illegal opcode exception)?
mallard wrote: The C standard does not guarantee that those signals are actually generated when an error of the appropriate kind occurs, so it's just fine (according to the C standard) to not support them and simply terminate the process.
So you're suggesting the kernel should ignore these signals completely (just like I did earlier)?
mallard wrote: "Trust" has nothing to do with it. In the real world, programs occasionally have bugs (even yours, I'm certain). It's simple pragmatism to have some method to enable those bugs to be handled as gracefully as possible. The kernel can't save the user's data, but maybe the program can, so it's worth giving it the possibility.
That's pure idiocy. If the data is important, then good software minimises the risk of data loss caused by things like hardware faults, power failures, etc (e.g. by writing a journal, or doing periodic "auto-saves", or whatever). If the software is crap and doesn't do this and the software is also crap that crashes, then data is lost because there was no journaling/auto-saves/whatever. Signals are not the answer.
mallard wrote: And you're perfectly free to think that. I respect your opinion, even though I don't share it. Personally, I think that porting existing applications is a good way to add functionality to my OS without having to build absolutely everything from scratch. I lack the time, expertise and inclination to build (for example) a replacement C compiler that'll be anything like as good as GCC. Unless you've got the resources of Microsoft, you won't be able to build everything from scratch in one lifetime (and even Microsoft occasionally license/buy products/technologies from elsewhere).
The problem is that it becomes too easy to simply port everything; and you quickly end up with so much ported software that the only way the end user can tell the difference between your OS and (e.g.) a Linux distribution is that your OS is slower, crashes more often and supports a lot less devices. There's no reason for applications developers to write applications specifically for the OS (they can just write portable software for "your OS and every other boring *nix work-alike" instead). None of your OS's unique features get used. There's no reason for the end user to use the OS for anything (instead of Linux/FreeBSD/Solaris). This is what I call "successful failure" - you've successfully created an OS, but the OS is a failure.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: How do you use exceptions interrupts in your OS?
Brendan wrote: Imagine you've used a dodgy function pointer and now you're executing random trash. Who calls "raise()" to send the SIGILL signal (immediately before the CPU would have generated its illegal opcode exception)?
In my OS, if the process has subscribed to the appropriate event/message, then the thread that caused the fault will be halted and (if there are other running threads in the process) the kernel will queue a message to the application with details of the fault. If/when another thread receives that message, it can then call "raise()" to invoke the C signal handler, if it wants to.
If the process has not subscribed to the event/message for illegal operations or there are no other threads running, the process will be killed.
This is, of course, completely additional to the C standard, which mandates none of this. As I pointed out (and you correctly repeated) it's entirely within that standard to simply kill the process if it faults, without implementing any signalling or error handling.
Brendan wrote: That's pure idiocy. If the data is important, then good software minimises the risk of data loss caused by things like hardware faults, power failures, etc (e.g. by writing a journal, or doing periodic "auto-saves", or whatever). If the software is crap and doesn't do this and the software is also crap that crashes, then data is lost because there was no journaling/auto-saves/whatever. Signals are not the answer.
According to you, just about every major OS ever is "pure idiocy". Even MS-DOS allowed applications to handle their own errors (by virtue of allowing them to do just about anything). Sure, perfect software would never crash and would always ensure user data is 100% safe. However, back on planet earth, that's not the case. There's no harm in having a belt-and-braces approach; having the possibility of userspace error handling doesn't prevent an application from having robust handling of user data.
Also, how do you expect to implement debugging on your OS? How would you run an interpreter/JIT without crippling performance by pre-checking all math operations? "Signals" (or another form of error handling, I'm no fan of how POSIX does it) are often used to allow the CPU to do those sorts of checks "for free" and still have good handling of them in your runtime.
Also, I intend to use the exact same mechanism to allow applications to handle page faults. That allows things such as user-mode memory-mapped I/O and other interesting possibilities.
Brendan wrote: The problem is that it becomes too easy to simply port everything; and you quickly end up with so much ported software that the only way the end user can tell the difference between your OS and (e.g.) a Linux distribution is that your OS is slower, crashes more often and supports a lot less devices. There's no reason for applications developers to write applications specifically for the OS. None of your OS's unique features get used. There's no reason for the end user to use the OS for anything. This is what I call "successful failure" - you've successfully created an OS, but the OS is a failure.
Again, I respect that opinion, even though I don't share it. It all depends on your objectives for your OS. I'm perfectly happy to build something that I like, as a hobby, with no particular desire to see it go "mainstream" or gain widespread adoption. To that end, if porting POSIX applications to my (decidedly non-POSIX) OS ever becomes "easy" then I'd consider it pretty darned successful.
If you want to build something revolutionary, entirely from scratch, that doesn't run existing applications, then feel free, power to you, just because that's not my objective doesn't mean it can't be yours.
"Success" and "failure" are entirely subjective opinions. Still, I'd rather build a "successful failure" than fail to build anything at all.
Re: How do you use exceptions interrupts in your OS?
Hi,
mallard wrote: In my OS, if the process has subscribed to the appropriate event/message, then the thread that caused the fault will be halted and (if there are other running threads in the process) the kernel will queue a message to the application with details of the fault. If/when another thread receives that message, it can then call "raise()" to invoke the C signal handler, if it wants to. If the process has not subscribed to the event/message for illegal operations or there are no other threads running, the process will be killed.
Let me see if I understand this right...
For a single-threaded process, where that thread is about to execute an illegal opcode; the kernel's illegal opcode exception handler (which you said isn't involved but obviously must be) halts the thread then sends a message to the thread that is now halted; and when the halted thread feels like handling that message (how??) it calls "raise()" to convert the message into a signal (where "raise()" diddles with the thread's instruction pointer and unhalts the thread).
Of course maybe you're forced to have minimum of 2 threads; where if thread1 crashes then thread2 handles the message and calls "raise()", and if thread2 crashes then thread1 handles the message and calls "raise()". I don't know (it sounds like an ugly mess, but it's the only thing that makes sense).
In any case; it's just a different way of doing it. It still involves the kernel's exception handlers.
mallard wrote: According to you, just about every major OS ever is "pure idiocy".
Of course (but not necessarily just for this reason alone).
mallard wrote: Even MS-DOS allowed applications to handle their own errors (by virtue of allowing them to do just about anything). Sure, perfect software would never crash and would always ensure user data is 100% safe. However, back on planet earth, that's not the case. There's no harm in having a belt-and-braces approach; having the possibility of userspace error handling doesn't prevent an application from having robust handling of user data.
On a scale from "0% safe" to "100% safe"; where "100% safe" is perfect software and "0% safe" is the minimum requirement (e.g. able to reduce the risk of data loss caused by power failure where necessary); what reason do I have to care about something so badly designed that it doesn't even come close to "0% safe"? For everything that actually matters ("0% safe or better"), signals are useless.
There *is* harm in the belt-and-braces approach - it's like a safety net with massive gaping holes in it. People see the safety net and assume they can do stupid things ("I'll just save the data in the signal handler when there's a power failure!"), then fall to their death. If there's no dodgy safety net it's far more likely that people are going to use a decent safety harness (or in other words; if there are no signals it's far more likely people are going to reduce the risk of data loss by using methods that actually do work properly).
mallard wrote: Also, how do you expect to implement debugging on your OS?
Mostly via remote control; where one process (the debugger) attaches itself to another process (the process being debugged, which may be already running on a completely different computer); where the kernel grants the first process special abilities. These special abilities include receiving notifications for various events (e.g. when the process spawns/terminates threads, allocates/frees virtual pages, etc); being sent copies of messages sent/received by the threads being debugged; asking for copies of arbitrary areas of RAM from the process being debugged; being able to stop/restart any or all of the process' threads, etc. If a process being debugged crashes while a debugger is attached, then the kernel's "generic unrecoverable error handler" puts the process into an "all threads stopped" state (which is the first part of terminating it anyway) and then sends a message notifying the debugger that the process is now "post mortem" and why.
None of this has anything at all to do with signals. The process being debugged doesn't even know it is being debugged (however there are security checks - the process being debugged must have an "allow debugger to attach" flag in the executable).
mallard wrote: How would you run an interpreter/JIT without crippling performance by pre-checking all math operations?
I wouldn't.
More specifically, I won't have any interpreter/JIT; but if I did I'd want it to check for things like type errors (e.g. accessing a float as an int), integer overflow and accessing the wrong memory that happens to exist; where the CPU isn't capable of detecting any of these things; and it would therefore have to do pre-checking regardless.
mallard wrote: Again, I respect that opinion, even though I don't share it. It all depends on your objectives for your OS. I'm perfectly happy to build something that I like, as a hobby, with no particular desire to see it go "mainstream" or gain widespread adoption. To that end, if porting POSIX applications to my (decidedly non-POSIX) OS ever becomes "easy" then I'd consider it pretty darned successful.
There's also something I call "failing successfully"; where the only thing preventing (potential) success is time. To be more specific, I see it as a state machine. Also note that "learning" doesn't really change anything (e.g. "learning to successfully fail" vs. "learning to succeed"). For a hobby, I guess there's nothing really wrong with claiming failure is your hobby.
If you want to build something revolutionary, entirely from scratch, that doesn't run existing applications, then feel free, power to you, just because that's not my objective doesn't mean it can't be yours.
"Success" and "failure" are entirely subjective opinions. Still, I'd rather build a "successful failure" than fail to build anything at all.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: How do you use exceptions interrupts in your OS?
Brendan wrote: Let me see if I understand this right...
Well, you don't. I quite clearly said "If [...] there are no other threads running, the process will be killed."
Brendan wrote: Of course maybe you're forced to have minimum of 2 threads; where if thread1 crashes then thread2 handles the message and calls "raise()", and if thread2 crashes then thread1 handles the message and calls "raise()". I don't know (it sounds like an ugly mess, but it's the only thing that makes sense).
Yes, I said that there has to be another thread to receive the message. That is quite rightly "the only thing that makes sense" in my system. A typical process will have a thread dedicated to receiving and handling messages, so that thread would call "raise()" if POSIX-like behaviour is desired by the programmer. There is the possible case of what happens when that thread crashes of course, which I do plan to account for. (In short, there will be 3 possible "message states" for a thread: waiting for message, processing message, or doing something else. There will be atomic transitions between the first two states. If at any time there is a "critical" message in the queue and no non-halted thread in one of the first two states, the process is killed.)
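Purely as an illustration of those atomic transitions (this is not mallard's actual design), a C11 sketch: a message is handed to a thread only if it can be moved atomically from "waiting for message" to "processing message"; if no thread can be claimed for a critical message, the caller kills the process.
Code:
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum msg_state { WAITING_FOR_MESSAGE, PROCESSING_MESSAGE, DOING_SOMETHING_ELSE };

struct thread_slot {
    _Atomic int msg_state;
};

/* Try to claim a thread for delivery: succeeds only if it was waiting. */
static bool try_claim(struct thread_slot *t)
{
    int expected = WAITING_FOR_MESSAGE;
    return atomic_compare_exchange_strong(&t->msg_state, &expected,
                                          PROCESSING_MESSAGE);
}

/* Pick a receiver for a critical message; a NULL result means no eligible
 * thread exists and the process should be killed (as described above). */
static struct thread_slot *pick_receiver(struct thread_slot *threads, int n)
{
    for (int i = 0; i < n; i++)
        if (try_claim(&threads[i]))
            return &threads[i];
    return NULL;
}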
Brendan wrote: In any case; it's just a different way of doing it. It still involves the kernel's exception handlers.
Sure, the kernel is going to need those exception handlers in order to do anything at all when an exception occurs (even if it's to kill the process and erase it from disk as you seem to be suggesting). However, as I said, this is far beyond the needs of the C standard. The only thing that C requires is that you can register a "signal handler" with "signal()" and call it with "raise()". That requires no kernel support.
Brendan wrote: Of course (but not necessarily just for this reason alone).
Well, that's your opinion I suppose. Personally, I'd rather trust the opinions of the developers of actual, successful (by my definition of the word, see below) OSs.
Brendan wrote: On a scale from "0% safe" to "100% safe"; where "100% safe" is perfect software and "0% safe" is the minimum requirement (e.g. able to reduce the risk of data loss caused by power failure where necessary); what reason do I have to care about something so badly designed that it doesn't even come close to "0% safe"? For everything that actually matters ("0% safe or better"), signals are useless.
And how will you enforce this "minimum requirement"? Considering that very little software in existence (and certainly nothing that a "normal user" might use) comes anywhere close to your "0%" standard, it seems that what you're talking about is not a general-purpose OS. Programmers are human. Humans make mistakes. Punishing the user with loss of data for a programmer's mistake is not good design.
Brendan wrote: There *is* harm in the belt-and-braces approach - it's like a safety net with massive gaping holes in it.
A safety net with "gaping holes" is still better than no net at all. You're not going to convince me that adding extra features to enhance reliability is a bad thing. Would you be happy to travel in a car with no safety features (crumple zones, airbags, seatbelts) because the driver told you that he's "a perfect driver"? Of course not. Same goes for software. No matter how good the programmer thinks his code is, the unexpected can and, over the course of time, will eventually happen. Even mathematically proven code can fail due to hardware glitches.
Brendan wrote: Mostly via remote control; where one process (the debugger) attaches itself to another process (the process being debugged, which may be already running on a completely different computer); where the kernel grants the first process special abilities. [...]
So, your OS will have all the same infrastructure for notifying processes of errors, just with a more limited "destination" (since only a second process could receive the notifications).
Brendan wrote: None of this has anything at all to do with signals. The process being debugged doesn't even know it is being debugged (however there are security checks - the process being debugged must have an "allow debugger to attach" flag in the executable).
You've still got "signals", you're just sending them somewhere else. Also, that "allow debugger to attach" flag seems unnecessary. Programs shouldn't have the ability to restrict the rights of the user; the ability to attach a debugger should be a per-user permission, not something that a program decides.
Brendan wrote: I wouldn't.
So, you're not designing a general-purpose OS. Fine. I am, so I have different design goals. Not too surprising that my design doesn't match yours, when we have different goals in mind.
Brendan wrote: Also note that "learning" doesn't really change anything (e.g. "learning to successfully fail" vs. "learning to succeed"). For a hobby, I guess there's nothing really wrong with claiming failure is your hobby.
As I said, "Success" and "failure" are entirely subjective opinions. What you consider "success" and "failure" is obviously different from what I consider "success" and "failure". I'm not building an OS to impress you, so your definitions of "success" and "failure" aren't really relevant to me.
Re: How do you use exceptions interrupts in your OS?
Hi,
mallard wrote: Sure, the kernel is going to need those exception handlers in order to do anything at all when an exception occurs (even if it's to kill the process and erase it from disk as you seem to be suggesting). However, as I said, this is far beyond the needs of the C standard. The only thing that C requires is that you can register a "signal handler" with "signal()" and call it with "raise()".
"Far beyond the needs of the C standard" is stretching the truth. The ISO C standard requires SIGABRT, SIGFPE, SIGILL, SIGINT, SIGSEGV, and SIGTERM to be defined; and (regardless of whether the C standard explicitly states it or not - I honestly don't know) programmers expect them to work (rather than merely being defined). If "works" means the standard library has a hidden thread handling messages then that's fine (as far as C programmer expectations go); but it solves none of the problems involved with actually doing anything useful in the program's signal handler (if the program was stupid enough to bother replacing the standard library's default handlers).
mallard wrote: That requires no kernel support.
It requires kernel support to notify the process' standard library that the event occurred in the first place; regardless of how this notification is achieved.
Please note that I don't care how you've implemented it. The question is about whether or not it makes sense for the kernel's exception handlers to rely on code that has already crashed to finish handling an exception. To me it's as stupid as watching a drunk driver smash into another car and then letting them drive the tow truck.
mallard wrote: And how will you enforce this "minimum requirement"? Considering that very little software in existence (and certainly nothing that a "normal user" might use) comes anywhere close to your "0%" standard, it seems that what you're talking about is not a general-purpose OS. Programmers are human. Humans make mistakes. Punishing the user with loss of data for a programmer's mistake is not good design.
Everything has its own standards. C's "abstract machine" is just one. POSIX is another. Java's virtual machine is another. Most software does comply with the corresponding standards for the environment it's designed for. There's no reason to suspect that, if those standards said "you're expected to implement an effective means of minimising data loss that may be caused by power failure, hardware faults or software crashes", the majority of programmers designing software for that environment wouldn't do so.
mallard wrote: A safety net with "gaping holes" is still better than no net at all.
Wrong. A misleading/false sense of safety is worse than nothing.
mallard wrote: So, your OS will have all the same infrastructure for notifying processes of errors, just with a more limited "destination" (since only a second process could receive the notifications).
The key difference is that I won't be expecting code that's already crashed to handle crash recovery - the debugger is a completely separate/isolated process that has not crashed.
mallard wrote: You've still got "signals", you're just sending them somewhere else.
My OS's messaging? That's just signals. D-Bus? Yeah, that's just signals too. TCP/IP? That must be signals too! Obviously, all forms of communication (even spoken English) are 100% identical to Unix signals.
mallard wrote: Also, that "allow debugger to attach" flag seems unnecessary. Programs shouldn't have the ability to restrict the rights of the user; the ability to attach a debugger should be a per-user permission, not something that a program decides.
Users have no rights (until/unless the owner of the software grants them rights, including the right to use the software in the first place).
mallard wrote: So, you're not designing a general-purpose OS. Fine. I am, so I have different design goals. Not too surprising that my design doesn't match yours, when we have different goals in mind.
The only thing required for a "general purpose OS" is that the user is able to install software that wasn't a built-in part of the OS. Nobody is stupid enough to assume that an interpreter/JIT is a requirement.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.