Hi,
Korona wrote:NMI cannot be nested unless you iret. This problem can be solved by just not ireting until the NMI handler finishes running (e.g. by implementing an exceptional "return via retf" path in the MCE and page fault handler if that is necessary in your OS design; I suspect for most microkernels that should not even be necessary as the NMI should not ever touch non-present pages there).
No. The relevant part of Intel's manual is in "34.8 NMI HANDLING WHILE IN SMM":
"
Although NMI requests are blocked when the processor enters SMM, they may be enabled through software by executing an IRET instruction. If the SMI handler requires the use of NMI interrupts, it should invoke a dummy interrupt service routine for the purpose of executing an IRET instruction. Once an IRET instruction is executed, NMI interrupt requests are serviced in the same “real mode” manner in which they are handled outside of SMM.
A special case can occur if an SMI handler nests inside an NMI handler and then another NMI occurs. During NMI interrupt handling, NMI interrupts are disabled, so normally NMI interrupts are serviced and completed with an IRET instruction one at a time. When the processor enters SMM while executing an NMI handler, the processor saves the SMRAM state save map but does not save the attribute to keep NMI interrupts disabled. Potentially, an NMI could be latched (while in SMM or upon exit) and serviced upon exit of SMM even though the previous NMI handler has still not completed. One or more NMIs could thus be nested inside the first NMI handler. The NMI interrupt handler should take this possibility into consideration."
Essentially: if your NMI handler is interrupted by an SMI, you end up with the possibility of NMI nesting — even if the firmware's SMI handler does not deliberately re-enable NMI (e.g. because it relies on triggering and receiving NMI internally itself), and even if the firmware's SMI handler does not use IRET anywhere.
Note that SMI is unpredictable, undetectable and unpreventable. An OS has no choice but to assume that (under certain conditions) NMI can nest at any point during the NMI handler (including before it can execute a single instruction).
It is impossible to use IST under these conditions (without accepting the risk of the OS crashing without warning before it is able to execute a single instruction).
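To make the failure mode concrete, here's a sketch (hypothetical labels, NASM-style syntax, not from any real kernel) of why an IST-based NMI handler is exposed — the IST mechanism always delivers the interrupt at the same fixed stack top, so a nested NMI overwrites the first NMI's saved return frame:

```asm
; The CPU pushes SS, RSP, RFLAGS, CS, RIP at the *fixed* top of the
; IST stack before the first instruction below ever runs.
nmi_handler:
    ; <-- if an SMI arrives here and NMI becomes unblocked, a second
    ;     NMI is delivered at the SAME fixed IST stack top, overwriting
    ;     the frame the CPU just pushed for the first NMI; the first
    ;     handler's eventual IRETQ then "returns" to a corrupted frame
    push rax
    ; ...handler body...
    pop rax
    iretq
```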
LtG wrote:By NMI nesting you mean NMI-SMI-NMI or something else?
Yes, NMI-SMI-NMI (or something else).
LtG wrote:Assuming you're not utilizing NMI's (watchdogs, profiling, etc) is there anything you could do anyway or just graceful shut down? If just graceful shut down then why the need for nesting?
If you look at the datasheets for different chipsets you'll find that there are multiple potential causes of NMI (and often that some events can be configured in the chipset to generate an NMI or SMI or something else, where the chipset's datasheet can't tell you what any specific motherboard's firmware happened to configure it as).
For various reasons (including a strong dislike of ACPI's AML) I have a "motherboard driver" (in user-space). The motherboard driver (during its initialisation) might tell the kernel "always shutdown gracefully if NMI happens", or it might not. If it doesn't, then when an NMI occurs my kernel expects the motherboard driver to handle it (and the motherboard driver might respond with "shutdown gracefully for this NMI", or it might not). However, even for "always shutdown gracefully" I refuse to knowingly allow my kernel to crash for a foreseeable and preventable reason instead of successfully shutting down gracefully.
Of course different OS projects have different goals, and if someone is aware of the problem and chooses to ignore it (and accept the very small risk of crashing) then that's fine (as long as it's a conscious decision and not just because they weren't aware of the problem).
LtG wrote:As for the issues with SYSCALL wrt NMI nesting, I assume your point is that with SYSENTER when you are in ring0 the stack is valid kernel stack and while in ring3 it doesn't matter because NMI will do a ring change and thus valid kernel stack will be loaded. With SYSCALL there's a short period when kernel stack is not loaded yet but we are already in ring0 and if that's the moment when NMI occurs then there's no valid stack?
Yes.
LtG wrote:Is there an alternative to using IST for the above scenario? If there isn't, then IST must be used..
If you do use IST for NMI then it's impossible to prevent the risk of the kernel crashing unexpectedly.
If you don't use IST for NMI and don't use SYSCALL either, then there's no problem.
If you don't use IST for NMI but do use SYSCALL, then it's impossible to guarantee the kernel is reliable or secure (e.g. user-space can set RSP to the address of a critical kernel data structure before doing SYSCALL, and the CPU will trash that kernel data structure if an NMI occurs before the kernel switches to a "known good" kernel stack).
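For illustration, a minimal sketch of the SYSCALL entry window (hypothetical per-CPU symbols, NASM-style syntax):

```asm
; SYSCALL loads RIP/CS/SS but does NOT load RSP; so on entry the kernel
; is in ring 0 while RSP still holds whatever value user-space left in it.
syscall_entry:
    swapgs                            ; reach per-CPU data via the GS base
    mov [gs:USER_RSP_SCRATCH], rsp    ; RSP is still attacker-controlled here
    mov rsp, [gs:KERNEL_RSP]          ; only now is the stack "known good"
    ; An NMI arriving before the line above, with no IST on the NMI
    ; gate, pushes its interrupt frame at the attacker-chosen RSP.
```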
Note: I've lost track of what Linux does to "fix" the issue, but their continued bumbling is entertaining.
LtG wrote:Is the first instruction in the NMI handler guaranteed to be executed before a SMI/MCE could occur? And SMI/MCE could only occur at earliest between NMI instructions 1 and 2?
No. Because starting an interrupt handler typically takes longer than a normal instruction, it's actually more likely that an SMI/MCE (one that occurred while the CPU was starting the interrupt handler) will arrive before the NMI handler executes its first instruction than between two of its instructions.
LtG wrote:If so, wouldn't a relatively simple solution be to just change the IST on the first instruction so the original return address is safe?
That would reduce the risk (but wouldn't eliminate all of the risk).
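As a sketch of that mitigation (hypothetical offsets; assumes the GS base already points at per-CPU data containing the TSS's IST7 slot):

```asm
; Re-point IST7 as the very first instruction, so a nested NMI that
; arrives later lands on a fresh area instead of clobbering this frame.
; The window between the CPU pushing the frame and this instruction
; retiring still exists - which is why the risk shrinks but isn't zero.
nmi_handler:
    sub qword [gs:TSS_IST7_SLOT], NMI_FRAME_AREA_SIZE
    ; ...handler body...
    add qword [gs:TSS_IST7_SLOT], NMI_FRAME_AREA_SIZE
    iretq
```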
LtG wrote:As for the performance, I don't have any numbers so I don't know what it will be like, but isn't the SYSCALL supposed to be faster than a regular CALL? What's the bloat needed? Switching stacks shouldn't be that slow and it happens with SYSENTER too (implicitly)..
Everything new is supposed to be "faster" (even when it's not) - that's how marketing works.
In practice, for most kernel APIs you end up with a single entry point and have to use something like "call [kernel_function_table + eax*8]" to figure out which function the caller actually wanted. This is a potential cache miss (and a potential TLB miss), and an extremely likely branch misprediction. This one instruction all by itself could cost anything from 4 cycles to 300 cycles or more; and you can expect that (given that the CPU has been running user-space code, so all of the kernel's data will be "least recently used") it'll probably cost about 100 cycles on average. That is where a call gate wins - you can have a call gate for each function (or for each function that is used often), and inline a copy of the function directly into its call gate handler, and now you don't need a function number and don't need that expensive "call [kernel_function_table + eax*8]".
If a call gate and its "retf" costs 80 cycles but avoids about 100 cycles of "call [kernel_function_table + eax*8]"; then SYSCALL needs to be faster than "negative 20 cycles" to beat it. SYSCALL is nice, but doesn't make the CPU travel through time into the past, and therefore it can't win (even if you ignore the stack switching, and the "cmp eax,MAX_FUNCTION_NUMBER" and whatever else).
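The difference can be sketched like this (hypothetical names, NASM-style syntax):

```asm
; Single SYSCALL entry point: every call pays for the range check plus
; an indirect call through a (probably cold) table - a likely cache/TLB
; miss and a likely branch misprediction.
syscall_entry:
    cmp eax, MAX_FUNCTION_NUMBER
    ja  .bad_function_number
    call [kernel_function_table + eax*8]

; One call gate per (frequently used) function: the caller's far call
; already selects the function, so its body can be inlined with no
; function number and no table lookup at all.
gate_get_thread_id:
    mov eax, [gs:CURRENT_THREAD_ID]
    retf
```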
LtG wrote:Btw, the wiki has a bit different issue explained related to the return from SYSCALL, maybe this one should be added to the wiki as well?
The wiki page does mention the "syscall doesn't switch stacks" issue, but doesn't mention the "NMI can nest" issue and provides bad advice ("For 64bit mode, the kernel must use Interrupt Stack Tables to safely move NMIs/MCEs onto a properly designated kernel stack").
It also doesn't seem to mention that SYSCALL doesn't exist on old 32-bit CPUs from AMD, or that SYSENTER doesn't exist on old 32-bit CPUs from Intel (or that there's an Intel CPU that doesn't support SYSENTER but CPUID is buggy and says that SYSENTER is supported).
Cheers,
Brendan