Brendan wrote:
The first trick to solve this problem would be to have a different NMI handler entry point for different CPUs. The first CPU's NMI handler does "lock add qword [address_of_IST_entry_for_first_CPU],1024", then second CPU's NMI handler does "lock add qword [address_of_IST_entry_for_second_CPU],1024", etc; and then they'd all do a "jmp" to a common NMI handler. This would also mean having a different IDT for each CPU.
I plan on treating each CPU/core independently anyway, so for me that's not an issue; the ISR code itself is of course shared through paging. I may have to add a few "hacks" for performance (not sure yet), but I'll try to avoid that if at all possible or reasonable.
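For my own notes, a rough sketch of what Brendan describes (NASM syntax; the label names, and the assumption that each CPU's IDT gate points at its own stub, are mine):

```nasm
; Hypothetical per-CPU NMI entry stubs. Each CPU has its own IDT whose
; NMI gate points at its own stub; cpu0_ist1 / cpu1_ist1 are the IST1
; slots inside each CPU's TSS (names made up for illustration).

align 16
nmi_entry_cpu0:
    lock add qword [cpu0_ist1], 1024   ; bump CPU 0's NMI stack pointer
    jmp  nmi_common

align 16
nmi_entry_cpu1:
    lock add qword [cpu1_ist1], 1024   ; bump CPU 1's NMI stack pointer
    jmp  nmi_common

nmi_common:
    ; shared handler body; the adjustment gets undone before IRETQ
```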
Brendan wrote:
Note: If the "CPL=3 code descriptor" is "read, execute" (and not "execute only"); then CPL=3 code can load that code descriptor into DS (or ES or ...). In this case I don't know if the CPU checks if DS can be used for writes while in 64-bit code. Intel's manual says the CPU does check (in general) and doesn't mention "but not in 64-bit code" anywhere (like it does for other checks), but that may just be an error/omission in the manual. In theory (if CPL=3 code loaded DS with a "read, execute" descriptor and is then interrupted by NMI) the NMI handler's write to the IST might or might not cause a GPF. If this actually is a problem; it would be fixable by using "execute only" for CPL=3 code descriptor.
Good point; I can't be bothered to check the manuals to see whether the ISR has to worry about a bogus DS loaded by malicious userland. I see no reason why CS should have anything but execute, especially given that today's compilers (GCC) don't support segmentation anyway. Which, as you said, makes it a moot point.
I liked the idea of segmentation (more granularity) but never really benchmarked it, and since AMD64 dropped it I guess I don't have much reason to care about it at this point, though I may check its performance if I ever get around to creating an x86_32 kernel.
Brendan wrote:
You can take "another NMI interrupts NMI handler somewhere" into consideration and minimise the chance that it can happen (e.g. become immune to the problem if the second NMI interrupts after you've managed to execute 10 or more instructions). The question is how much "ugly work-around" you're willing to have, and whether or not it's possible to be 100% immune (if it's possible for a second NMI to occur before you've executed the first instruction).
I don't think adjusting the NMI handler's IST entry is that ugly; at least it's quite simple and straightforward, and it either narrows the risk to a one-instruction timing window or gets rid of the issue completely.
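A minimal sketch of what I mean (NASM syntax; the GS-relative pointer to this CPU's IST1 slot and the 1024-byte step are illustrative assumptions, not tested code):

```nasm
; Hypothetical NMI entry/exit with the IST adjustment done as early
; and as late as possible. [gs:pcpu_tss_ist1] is assumed to address
; this CPU's IST1 slot via a kernel-managed GS base.

nmi_entry:
    ; Very first instruction: move IST1 up, so a nested NMI arriving
    ; any time after this instruction retires gets a fresh frame
    ; above ours instead of overwriting it.
    add qword [gs:pcpu_tss_ist1], 1024

    push rax
    ; ... save the remaining registers, do the actual NMI work ...
    pop rax

    ; Undo the adjustment. A nested NMI arriving between this SUB and
    ; the IRETQ would reuse (and trash) the frame we are about to
    ; return through; that is the residual one-instruction window.
    sub qword [gs:pcpu_tss_ist1], 1024
    iretq
```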
Brendan wrote:
It's a messy and infrequent corner-case with no easy solution; and a technical writer working for one of these companies wrote "something" (that another technical writer at the other company probably quietly copied). There's no real guarantee that it's correct, and it's certainly not as simple as "just use IST and don't worry!".
I can't remember AMD even mentioning the nested NMI/NMI-SMI-NMI issue at all, and if Intel decided to add a section specifically mentioning it, and also says the OS should prepare for it, that implies it can be dealt with. I can't really think of anything except IST to avoid it (other than not using SYSCALL).
Brendan wrote:
Intel manual 34.3.1 Entering SMM wrote:
An SMI has a greater priority than debug exceptions and external interrupts. Thus, if an NMI, maskable hardware interrupt, or a debug exception occurs at an instruction boundary along with an SMI, only the SMI is handled. Subsequent SMI requests are not acknowledged while the processor is in SMM. The first SMI interrupt request that occurs while the processor is in SMM (that is, after SMM has been acknowledged to external hardware) is latched and serviced when the processor exits SMM with the RSM instruction. The processor will latch only one SMI while in SMM.
To clarify the situation; I'd expect that this is entirely possible (and can't find anything in Intel's manual to convincingly prove or disprove it):
- An NMI occurs
- CPU begins starting the NMI handler (IDT lookup, checks, etc); and while this is happening an SMI is received causing "pending SMI"
- CPU finishes starting the NMI handler (RIP pointing to first instruction of NMI handler) and commits changes to visible state
- Immediately after "commit changes to visible state" CPU checks for pending interrupts, sees the pending SMI, and starts handling it. The "old RIP value" that the CPU stores in the SMM state save map is the address of the first instruction in the NMI handler.
- The CPU executes the firmware's SMM code; and while that is happening a second NMI is received causing "pending NMI" (because NMIs are blocked in SMM)
- The firmware's SMM code does "RSM"; the CPU executes this instruction (including loading "old RIP value" from SMM state save map that still points to the first instruction of the NMI handler) and commits changes to visible state
- Immediately after "commit changes to visible state" CPU checks for pending interrupts, sees the pending NMI, and starts handling it.
- CPU trashes the first NMI handler's stack when starting the second NMI handler.
Earlier I mentioned the two possibilities I could think of for how an "instruction boundary" could allow the issue: one where the instruction boundary spans the entire duration of "invoking" the NMI, and the other where each uop (or something smaller than an actual instruction) counts as an instruction, creating extra instruction boundaries between two "real" instructions.
The Intel quote above says that if an NMI and an SMI occur at the same instruction boundary, the SMI wins and the NMI gets forgotten (though if it's still pending after the SMI it would get taken care of). I think that means the "prolonged instruction boundary" case can be dismissed.
I can't point to anything in the manual that explicitly says "uop boundaries = instruction boundaries", which would imply that the sequence you made up might be possible. I would consider it a bit pathological of Intel/AMD, however: the "invoke NMI" has already been committed to, and switching to the SMI midway (before the first NMI handler instruction) doesn't seem reasonable to me.
Also worth noting: AFAIK all of this applies only if the SMI handler _intentionally_ enables NMIs, which was missing from your sequence. But do SMI handlers do that? And if they do, aren't they prone to race conditions? How can the SMI handler change the IDT back to the OS's IDT (I assume they change the IDT, or does the CPU restore it from the state it has already saved?) and RSM without allowing an NMI in between (assuming they enabled NMIs)?
Brendan wrote:
Due to risk of malware (rootkits, etc); it's "intentionally almost impossible" to modify firmware's SMM code on almost all motherboards. I'm not sure, but it might be possible to use hardware virtualization to bypass this (at least I vaguely remember something about SMM virtualization in AMD's manual), but if it is possible I'm also not sure if it'd influence results.
I'm not sure how often it still applies, but there was a "hack" that allowed access to SMM, and I think it was pretty simple and mostly universal. It might be fixed on newer systems, though.
IIRC the idea was:
- The LAPIC "hijacks" memory accesses that are directed at its MMIO range
- Relocate the LAPIC MMIO region so it overlays SMM memory
- An SMI occurs (generate one or wait for one)
- The SMI handler's references to SMM (data) memory are now hijacked by the LAPIC: writes are discarded and reads (for the most part) return zero
- The SMI handler jumps to the wrong location, where you've planted your code
- You now have ring -2 access
As I remember the paper, they suggested that all SMI handlers begin the same way, with only a couple of variations, so relocating the LAPIC memory works quite well.
I can try to find the paper if you like...
Brendan wrote:
For "is it worth it", if you care about the maximum reliability possible (e.g. those "mission critical high availability" servers), there's only 2 possible outcomes:
- You test some computers and (hopefully quickly) find out that one does have an "NMI-SMI-NMI" problem; and end up having to implement something to reduce or eliminate the "NMI-SMI-NMI" problem.
- You spend ages testing hundreds of computers without finding the problem, but still can't be sure that another different computer (or a future computer that doesn't exist yet) won't have an "NMI-SMI-NMI" problem; and end up having to implement something to reduce or eliminate the "NMI-SMI-NMI" problem.
I was thinking more along the lines that, the way I interpret the manuals, the trick we discussed (the NMI handler adjusting its IST entry in its first instruction) should work, but attempting to prove that it does seems near impossible. Given the impracticality of getting every CPU model and revision to test, and of being confident that the testing method is even reliable (I don't like to trust brute-force methods where I'm relying on sheer number of tests rather than determinism), it seems there's nothing more I can do about it.
Maybe I should get an Intel dev account and ask on their forums; looking into that also led me to this, though I'm not 100% sure I trust the answer:
https://software.intel.com/en-us/forums ... pic/305672
At some point I'll have to test the performance of call gates vs SYSCALL vs SYSENTER vs software interrupts, etc, and actually see if I should just avoid SYSCALL. I actually prefer SYSCALL to SYSENTER for its minimalism, but this NMI-SMI-NMI thing is really annoying, and I wish Intel and AMD had explicitly specified the safe way of handling it on all CPUs.