Brendan wrote:
Even for kpanic; the goal is to continue for long enough to provide the user with relevant information (e.g. the fact that the panic was caused by an NMI and some details of what the CPU was doing when it happened) and stop other CPUs from running (and making a mess of things); and crashing (e.g. due to NMI-SMI-NMI) before you've done these things would be undesirable.
Yes, but if you are not going to return from the NMI, then the solutions in the "NMI-SMI-NMI" thread should work without issues. Alternatively, one could try to make the NMI handler "idempotent", so that it works no matter how many times it is started, assuming at least one of those NMI handler invocations gets to run to completion.
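To make the "idempotent" idea a bit more concrete, here is a rough sketch of what I have in mind. Everything in it is made up for illustration (struct nmi_frame, nmi_panic_steps(), send_halt_ipi() and the counter standing in for a real IPI), and it only models re-entry on a single CPU. Each step is safe to repeat and is marked as done only after it has finished, so a handler that gets restarted simply redoes the step that was cut short and continues from there.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical interrupt frame; a real kernel would use its own layout. */
struct nmi_frame {
    uint64_t rip;
    uint64_t rsp;
};
static_assert(sizeof(struct nmi_frame) == 2 * sizeof(uint64_t),
              "frame is assumed to have no padding");

/* Progress flags: each is set only AFTER its step has fully completed. */
static atomic_bool snapshot_done;
static atomic_bool halt_ipi_done;

static struct nmi_frame saved_frame;  /* what the CPU was doing */
static int halt_ipi_count;            /* stand-in for a real "stop" IPI */

/* Hypothetical hook: sending the stop request twice is assumed harmless. */
static void send_halt_ipi(void)
{
    halt_ipi_count++;
}

/*
 * The work of the NMI panic path. It may be entered any number of times;
 * a step that was interrupted before its flag was set is simply redone.
 * Caveat: if the snapshot step itself is interrupted, the nested entry
 * stores its own frame, which then points into the interrupted handler.
 */
void nmi_panic_steps(const struct nmi_frame *frame)
{
    if (!atomic_load(&snapshot_done)) {
        saved_frame = *frame;
        atomic_store(&snapshot_done, true);
    }

    if (!atomic_load(&halt_ipi_done)) {
        send_halt_ipi();
        atomic_store(&halt_ipi_done, true);
    }

    /* A real handler would now print the saved details and halt forever. */
}
```

Calling nmi_panic_steps() a second time with a different frame leaves the first snapshot in place and does not request the halt again, which is the property I'm after.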
Brendan wrote:
Some servers come with an "NMI button"; so that if an administrator notices an OS has locked up (e.g. maybe something like a spinlock that's spinning forever with IRQs disabled) they can press the button to get some idea of where the problem is (or get some information that makes a bug report better than useless). The other common cause is watchdog timers; which is basically an automated "tell me what the OS was doing if it locks up" alternative to the NMI button (e.g. normally the OS updates a timer regularly to stop it from expiring; and when the OS locks up it doesn't update the timer, so the timer expires and sends NMI).
Note: For hardware errors (e.g. memory corruption), I think most of it has been shifted to the machine check system in modern 80x86 computers. Unfortunately, if you go digging in modern chipset datasheets you'll still find (e.g.) registers that control things like "if FOO happens; send NMI or #SERR or SCI" and won't have any idea how the firmware actually configured the chipset (and can't easily create a list of things that cause NMI for a specific chipset).
For testing; one thing OS developers could really benefit from is an emulator designed for fault emulation and fault injection. Even basic things (e.g. checking if your "software RAID" layer actually does recover from hard drive failures properly) are excessively hard to test.
At least for the time being I don't intend to add special support for an NMI button, but the generic NMI handler will eventually record a relevant error message. I don't have a WD timer yet, though I have thought about it. If it ever actually fires an interrupt, I will most likely want to kpanic: it should never happen, so something is terribly wrong on a logical level if the WD timer fires.
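For what it's worth, the watchdog logic I've been thinking about looks roughly like the sketch below. It is only a software model with invented names (struct watchdog, wd_init(), wd_kick(), wd_poll()); real watchdog hardware is programmed through chipset-specific registers that I'm not trying to show here. The OS kicks the timer regularly, and once the deadline passes the expiry is latched, which is the point where I would kpanic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a watchdog timer. */
struct watchdog {
    uint64_t timeout_ticks;  /* how long the OS may stay silent */
    uint64_t deadline;       /* tick at which the watchdog fires */
    bool     expired;        /* latched: stays set once it has fired */
};

void wd_init(struct watchdog *wd, uint64_t now, uint64_t timeout_ticks)
{
    assert(timeout_ticks > 0);
    wd->timeout_ticks = timeout_ticks;
    wd->deadline = now + timeout_ticks;
    wd->expired = false;
}

/* Called regularly by a healthy OS to push the deadline back. */
void wd_kick(struct watchdog *wd, uint64_t now)
{
    if (!wd->expired)
        wd->deadline = now + wd->timeout_ticks;
}

/*
 * Models the expiry check. Returns true once the deadline has passed;
 * in my case that would be the point where the NMI arrives and the
 * handler calls kpanic.
 */
bool wd_poll(struct watchdog *wd, uint64_t now)
{
    if (now >= wd->deadline)
        wd->expired = true;
    return wd->expired;
}
```

Latching the expiry is a deliberate choice in this sketch: a late kick must not be able to hide the fact that the OS was stuck for longer than the timeout.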
I actually meant hardware-generated NMIs that haven't been requested by the OS (e.g. WD timer). I was thinking that memory corruption is likely handled by MCE these days, so I wasn't sure whether anything besides the NMI button and the WD timer would actually trigger an NMI.
As for testing, I think one of the best ways would be to use unit/integration tests with mocks that generate the NMIs or, more generally, weird behavior. That probably gives you the best control, and you can run all your tests as part of your compile/build process. I'm not a huge fan of manual testing; it's too unreliable in practice.
There are things that I don't know how to test reasonably, though. I don't really want to recreate "virtual CPU" code as a mock object just so I can test against it. For a lot of stuff, however, I think testing is quite reasonable. Creating all the tests and mocks adds dev time, but it also decreases the time spent debugging and hopefully produces better code.
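As an example of the kind of test I mean, here is a minimal sketch in plain hosted C. The names are all hypothetical (struct platform_ops, generic_nmi_handler(), the mock_* functions), and it assumes the handler reaches the hardware only through a small table of function pointers, so a test can swap in a mock and "inject" an NMI by calling the handler directly.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical seam between the NMI handler and the real hardware. */
struct platform_ops {
    void (*log_error)(void *ctx, const char *msg);
    void (*stop_other_cpus)(void *ctx);
};

/* Code under test: records an error message, then stops the other CPUs. */
void generic_nmi_handler(const struct platform_ops *ops, void *ctx)
{
    ops->log_error(ctx, "unexpected NMI");
    ops->stop_other_cpus(ctx);
}

/* Mock platform: records what the handler asked for instead of doing it. */
struct mock_platform {
    char last_msg[64];
    int  log_calls;
    int  stop_calls;
    int  stop_seen_after_log;  /* 1 if the message was logged first */
};

static void mock_log_error(void *ctx, const char *msg)
{
    struct mock_platform *m = ctx;
    snprintf(m->last_msg, sizeof m->last_msg, "%s", msg);
    m->log_calls++;
}

static void mock_stop_other_cpus(void *ctx)
{
    struct mock_platform *m = ctx;
    m->stop_seen_after_log = (m->log_calls > 0);
    m->stop_calls++;
}

static const struct platform_ops mock_ops = {
    .log_error       = mock_log_error,
    .stop_other_cpus = mock_stop_other_cpus,
};

/* The test: "inject" an NMI by calling the handler against the mock. */
void test_nmi_is_logged_then_cpus_stopped(void)
{
    struct mock_platform m = {0};

    generic_nmi_handler(&mock_ops, &m);

    assert(m.log_calls == 1);
    assert(m.stop_calls == 1);
    assert(m.stop_seen_after_log == 1);
    assert(strcmp(m.last_msg, "unexpected NMI") == 0);
}
```

A test like this can run as part of the normal build on the development machine. It obviously says nothing about whether the real hardware delivers the NMI the way I expect, which is exactly the part I don't know how to test without a "virtual CPU".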