Hi,
XenOS wrote:Regarding the issue of NMI handling - how should an NMI be handled anyway? I always thought that NMIs indicate some critical hardware failure that cannot be dealt with in any sane way without shutting down the system (or letting it crash). (Except for IPIs with NMI delivery mode, of course - but these are always software generated.)
For critical hardware errors (both NMI and machine check), in my opinion the minimum requirement is telling the user that it occurred. If a user submits a bug report saying "Your OS resets the computer when I try to burn a CD" then you're going to assume it's a triple fault and spend a week searching for bugs that don't exist (and when you give up you'll never know whether there were bugs in your code or not). If a user submits a bug report saying "Your OS reports an NMI when I try to burn a CD" then it's a very different scenario.
Of course minimum requirements are only the minimum. The more information your software can provide the better (e.g. perhaps telling the user "NMI generated by SATA controller on AHCI bus" rather than "NMI occurred"). For NMI this does require something like a chipset driver, but for machine check it doesn't.
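As a rough illustration of why machine check needs no chipset driver: the flag bits in each bank's IA32_MCi_STATUS register are architectural, so a kernel can turn them into a readable report on its own. The sketch below only decodes a status value that has already been read; the helper name mc_describe and the message format are made up for this example, and snprintf stands in for whatever formatting routine the kernel actually has.

```c
#include <stdint.h>
#include <stdio.h>

/* Architectural flag bits of an IA32_MCi_STATUS value. */
#define MCI_STATUS_VAL   (1ULL << 63)  /* register holds a valid error      */
#define MCI_STATUS_OVER  (1ULL << 62)  /* an earlier error was overwritten  */
#define MCI_STATUS_UC    (1ULL << 61)  /* error was not corrected           */
#define MCI_STATUS_ADDRV (1ULL << 58)  /* IA32_MCi_ADDR holds an address    */
#define MCI_STATUS_PCC   (1ULL << 57)  /* processor context may be corrupt  */

/*
 * Hypothetical helper: write a one-line description of a bank's status
 * into buf. Returns 0 if the bank holds no valid error, otherwise the
 * value returned by snprintf.
 */
int mc_describe(unsigned bank, uint64_t status, char *buf, size_t len)
{
    if (!(status & MCI_STATUS_VAL)) {
        if (len > 0)
            buf[0] = '\0';
        return 0;
    }
    return snprintf(buf, len,
                    "Machine check, bank %u: %s%s%s%s, MCA code 0x%04x",
                    bank,
                    (status & MCI_STATUS_UC)    ? "uncorrected" : "corrected",
                    (status & MCI_STATUS_PCC)   ? ", context corrupt" : "",
                    (status & MCI_STATUS_OVER)  ? ", overflow" : "",
                    (status & MCI_STATUS_ADDRV) ? ", address valid" : "",
                    (unsigned)(status & 0xFFFF)); /* bits 15:0: MCA error code */
}
```

In a real handler the status value would come from reading the bank's MSR, and the resulting string would go to whatever "tell the user" path the OS has.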
At the other end of the scale ("maximum requirements") is fault tolerance - e.g. recovering from critical hardware errors and keeping the OS running with the least loss of functionality where possible. This is likely to be far beyond the scope of most OS projects; however, providing the framework needed for this may not be. For example, even if no actual motherboard drivers exist, your OS could support loading a "motherboard driver" and provide a way for software to disable specific PCI devices, take a CPU offline, mark an area of RAM as "faulty", etc. (so that if anyone does write a motherboard driver it can do something useful).
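A minimal sketch of what such a framework might look like, assuming a single optional driver and a small table of kernel-provided recovery services. Every name here (recovery_services, mb_driver, mb_driver_register, dispatch_hw_error, the stub services and the toy driver) is hypothetical and only meant to show the shape of the idea: with no driver loaded the kernel falls back to just reporting the error.

```c
#include <stddef.h>
#include <stdint.h>

/* Services the kernel offers to a motherboard driver (all hypothetical). */
struct recovery_services {
    int (*disable_pci_device)(uint8_t bus, uint8_t dev, uint8_t func);
    int (*offline_cpu)(unsigned cpu_id);
    int (*mark_ram_faulty)(uint64_t phys_base, uint64_t length);
};

/* What a loadable motherboard driver provides. */
struct mb_driver {
    const char *name;
    /* Returns 0 if the driver found the cause and contained it. */
    int (*on_hw_error)(const struct recovery_services *svc);
};

enum hw_error_result { HW_ERROR_RECOVERED, HW_ERROR_REPORT_ONLY };

static const struct mb_driver *loaded_driver = NULL;

/* Accept at most one driver; returns 0 on success, -1 otherwise. */
int mb_driver_register(const struct mb_driver *drv)
{
    if (drv == NULL || drv->on_hw_error == NULL || loaded_driver != NULL)
        return -1;
    loaded_driver = drv;
    return 0;
}

/* Called from the NMI / machine check path once it is safe to do so. */
enum hw_error_result dispatch_hw_error(const struct recovery_services *svc)
{
    if (loaded_driver != NULL && loaded_driver->on_hw_error(svc) == 0)
        return HW_ERROR_RECOVERED;
    return HW_ERROR_REPORT_ONLY;   /* no driver, or it could not help */
}

/* ---- Stub services and a toy driver, for illustration only ---- */
static unsigned faulty_regions = 0;

static int stub_disable_pci(uint8_t bus, uint8_t dev, uint8_t func)
{
    (void)bus; (void)dev; (void)func;
    return -1;                     /* not implemented in this sketch */
}

static int stub_offline_cpu(unsigned cpu_id)
{
    (void)cpu_id;
    return -1;                     /* not implemented in this sketch */
}

static int stub_mark_ram_faulty(uint64_t phys_base, uint64_t length)
{
    (void)phys_base; (void)length;
    faulty_regions++;              /* a real kernel would update its memory map */
    return 0;
}

const struct recovery_services stub_services = {
    stub_disable_pci, stub_offline_cpu, stub_mark_ram_faulty
};

/* Toy driver: pretends the chipset blamed one 4 KiB page at 1 MiB. */
static int toy_on_hw_error(const struct recovery_services *svc)
{
    return svc->mark_ram_faulty(0x100000, 0x1000);
}

const struct mb_driver toy_driver = { "toy", toy_on_hw_error };
```

The point of splitting it this way is that the kernel side (the services and the dispatch) can exist long before anyone writes a real driver for a specific chipset.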
Finally, NMI isn't necessarily limited to hardware errors. Your kernel may generate them deliberately for some reason (one example is the "NMI watchdog" in Linux).
Cognition wrote:I also get the impression that NMIs shouldn't really nest in the first place.
At the hardware level, NMI shouldn't nest (but "shouldn't" is not a guarantee that they won't nest).
At the software level, as soon as your NMI handler attempts to do anything useful the "NMI doesn't nest" theory becomes unworkable, especially for micro-kernels (as things like video drivers are in user space), and especially for "multi-CPU" (as an NMI on one CPU will not prevent a different CPU from receiving NMI). For example, imagine an OS that (when a hardware error occurs) tries to terminate/suspend all non-essential processes, tries to send a "hardware error occurred" message to all video drivers and tries to sync disks to avoid data loss. Now try to imagine an OS that does all that without executing a single IRET. It's a lot easier to just assume that NMI may nest and deal with it.
I'd be tempted to deliberately do a dummy "IRET to the following instruction" near the start of the NMI handler, just to make sure everyone understands that NMI can nest (regardless of what the hardware says or doesn't say). Heck, I'd probably set an "NMI occurred" flag somewhere, do the dummy IRET, send IPIs to other CPUs (to tell them not to do anything non-essential until further notice), then do STI (so I'm not failing to respond to IPIs from other CPUs); and then start worrying about what to do about handling the NMI after all that is done.
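The ordering described above can be sketched in portable C, with the hardware-specific steps (the dummy IRET and the STI) left as comments because they need assembly. This is only one possible way to "assume NMI may nest and deal with it": the "NMI occurred" flag becomes a counter, and a nested entry just bumps the counter and returns so the outer invocation picks it up. All names (nmi_entry, nmi_pending, nmi_active, the two stub hooks and the simulate_nested_nmi switch) are hypothetical, and a real kernel would keep this state per CPU.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint nmi_pending;   /* NMIs seen but not yet processed       */
static atomic_bool nmi_active;    /* true while an invocation is processing */

/* Stub hooks standing in for real kernel code (assumptions). */
static unsigned ipis_sent = 0;
static unsigned nmis_processed = 0;
static bool simulate_nested_nmi = false;  /* demo only: fake one nested NMI */

void nmi_entry(void);

static void send_quiesce_ipi_to_other_cpus(void)
{
    ipis_sent++;                  /* real code would program the local APIC */
}

static void process_one_nmi(void)
{
    nmis_processed++;             /* real code: report, try to recover, ... */
    if (simulate_nested_nmi) {
        simulate_nested_nmi = false;
        nmi_entry();              /* pretend a second NMI arrives right now */
    }
}

void nmi_entry(void)
{
    /* 1. Record that an NMI occurred (a counter instead of a plain flag). */
    atomic_fetch_add(&nmi_pending, 1);

    /* 2. On real hardware the dummy "IRET to the following instruction"
     *    would go here; from this point on, NMIs can nest. */

    /* 3. If another invocation is already processing, let it handle this. */
    if (atomic_exchange(&nmi_active, true))
        return;

    /* 4. Tell other CPUs to stop non-essential work until further notice. */
    send_quiesce_ipi_to_other_cpus();

    /* 5. On real hardware: STI here, so IPIs from other CPUs are serviced. */

    /* 6. Only now worry about the NMI itself, draining nested ones too. */
    do {
        while (atomic_load(&nmi_pending) > 0) {
            atomic_fetch_sub(&nmi_pending, 1);
            process_one_nmi();
        }
        atomic_store(&nmi_active, false);
        /* An NMI may have slipped in just before the store: re-check. */
    } while (atomic_load(&nmi_pending) > 0 &&
             !atomic_exchange(&nmi_active, true));
}
```

With simulate_nested_nmi set before calling nmi_entry(), the nested call returns early and the outer loop processes both events, sending the IPIs only once.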
Cheers,
Brendan