I've never really liked segmentation; but I do still like the idea of setting a "CPL=3 CS limit" during task switches in protected mode (especially for old CPUs that don't support "no execute" page protection), which is something that would work well for me because I don't have shared libraries.

LtG wrote: I liked the idea of segmentation (more granularity) but never really benchmarked it, and since AMD64 dropped it I guess I don't have much reason to care about it at this point, though I may check its performance if I ever get around to creating an x86_32 kernel.
For my purposes; I'd rather just not use IST (and not support SYSCALL either). For a micro-kernel (where there shouldn't be a large number of kernel API functions) it's probably just as fast to use "call gate per kernel API function" (and avoid the likely branch misprediction for "call [table+rax*8]" dispatch); and most CPUs (all Intel) support SYSENTER anyway.

LtG wrote: I don't think adjusting the NMI handler's IST entry is that ugly, and at least it's quite simple and straightforward, and it either minimizes the risk to one instruction "timing" or completely gets rid of the issue.

Brendan wrote: You can take "another NMI interrupts NMI handler somewhere" into consideration and minimise the chance that it can happen (e.g. become immune to the problem if the second NMI interrupts after you've managed to execute 10 or more instructions). The question is how much "ugly work-around" you're willing to have, and whether or not it's possible to be 100% immune (if it's possible for a second NMI to occur before you've executed the first instruction).
Note that SYSCALL also causes awkwardness for the machine check exception handler; and if you support "machine check error recovery" (and don't just halt everything when a machine check occurs) you can't (easily) use IST for machine check exceptions either. The problem here is that if the MCIP flag is set a second machine check exception causes a triple fault (which destroys "machine check error recovery"); and if you clear the MCIP flag as soon as possible (so that a second machine check exception won't cause a triple fault) then the machine check exception handler needs to be re-entrant.
I'd just assume most OSs reduce the risk but aren't immune (and that Intel's advice only covers "risk reduction" and not immunity).

LtG wrote: I can't remember AMD even mentioning the nested NMI/NMI-SMI-NMI issue at all, and if Intel decided to add a section specifically mentioning this and also says that an OS should prepare for it, that implies it can be dealt with. I can't really think of anything except IST to avoid it (or not using SYSCALL).

Brendan wrote: It's a messy and infrequent corner-case with no easy solution; and a technical writer working for one of these companies wrote "something" (that another technical writer at the other company probably quietly copied). There's no real guarantee that it's correct, and it's certainly not as simple as "just use IST and don't worry!".
I very much doubt Intel means "only the SMI is handled (and the NMI is forgotten forever)" and would assume "only the SMI is handled (at this instruction boundary)". Essentially; NMI isn't discarded but remains "pending".

LtG wrote: Earlier I mentioned the two possibilities I could think of for instruction boundary to allow for the issue: one where the instruction boundary is the entire duration of "invoking" NMI, and the other where each uop (or something smaller than an actual instruction) is considered to be an instruction, thus creating extra instruction boundaries between two "real" instructions.

Brendan wrote: To clarify the situation; I'd expect that this is entirely possible (and can't find anything in Intel's manual to convincingly prove or disprove it):

Intel manual 34.3.1 Entering SMM wrote: An SMI has a greater priority than debug exceptions and external interrupts. Thus, if an NMI, maskable hardware interrupt, or a debug exception occurs at an instruction boundary along with an SMI, only the SMI is handled. Subsequent SMI requests are not acknowledged while the processor is in SMM. The first SMI interrupt request that occurs while the processor is in SMM (that is, after SMM has been acknowledged to external hardware) is latched and serviced when the processor exits SMM with the RSM instruction. The processor will latch only one SMI while in SMM.
- An NMI occurs
- CPU begins starting the NMI handler (IDT lookup, checks, etc); and while this is happening an SMI is received causing "pending SMI"
- CPU finishes starting the NMI handler (RIP pointing to first instruction of NMI handler) and commits changes to visible state
- Immediately after "commit changes to visible state" CPU checks for pending interrupts, sees the pending SMI, and starts handling it. The "old RIP value" that the CPU stores in the SMM state save map is the address of the first instruction in the NMI handler.
- The CPU executes the firmware's SMM code; and while that is happening a second NMI is received causing "pending NMI" (because NMIs are blocked in SMM)
- The firmware's SMM code does "RSM"; the CPU executes this instruction (including loading "old RIP value" from SMM state save map that still points to the first instruction of the NMI handler) and commits changes to visible state
- Immediately after "commit changes to visible state" CPU checks for pending interrupts, sees the pending NMI, and starts handling it.
- CPU trashes the first NMI handler's stack when starting the second NMI handler.
The Intel quote above says that if NMI and SMI occur at the same instruction boundary then SMI wins and the NMI gets forgotten (though if it's still present after the SMI then it would get taken care of). I think that means that the "prolonged instruction boundary" case can be dismissed.
You definitely won't find "uop boundaries = instruction boundaries". If an instruction is split into 5 uops there will only be one instruction boundary, when the last uop reaches retirement and the entire instruction (all changes from all uops) is committed to visible state together.

LtG wrote: I can't point to anything in the manual that explicitly says "uop boundaries = instruction boundaries", which would imply that the sequence you made up is maybe possible. I would consider it a bit pathological of Intel/AMD however. The "invoke NMI" has already been committed to; changing to SMI midway (before the first NMI handler instruction) doesn't seem reasonable to me.
No, my sequence is for "SMI does not intentionally enable/unmask NMI". Please note that Intel like to use the word "disabled" when they actually mean "held pending" (e.g. the normal IRQ enable/disable flag, which does not enable/disable IRQs and only causes them to be held pending while IF is clear).

LtG wrote: Also worth noting, AFAIK all of this applies only if the SMI handler _intentionally_ enables NMIs, which was missing in your "sequence"..
I'd hope most SMI handlers don't intentionally enable NMI (in the same way that I hope software is never released with critical vulnerabilities, CPUs don't have errata, and unicorns actually exist).

LtG wrote: But do SMI handlers do that? And if they do, aren't they prone to race conditions?
I think (not entirely sure) that in 16-bit code with "32-bit operand size overrides" the SIDT and LIDT instructions only affect the lowest 32 bits of the IDT base address; and that (if the OS is 64-bit) SMM code can safely use SIDT to store the lowest 32 bits of the OS's IDT base, then do whatever it likes, then use LIDT to restore the lowest 32 bits of the OS's IDT base.

LtG wrote: How can the SMI change the IDT back to the OS's IDT (I assume they change the IDT, or does the CPU restore it from the state it's already saved) and RSM without allowing NMI in between (assuming they enabled NMI)?
I think I found it here. As far as vulnerabilities go, I'd rate this one as "extremely plausible in practice".

LtG wrote: I'm not sure how often it applies, but there was a "hack" that allowed access to SMM, and I think it was pretty simple and mostly universal. Might be fixed in newer systems though.

Brendan wrote: Due to risk of malware (rootkits, etc); it's "intentionally almost impossible" to modify firmware's SMM code on almost all motherboards. I'm not sure, but it might be possible to use hardware virtualization to bypass this (at least I vaguely remember something about SMM virtualization in AMD's manual), but if it is possible I'm also not sure if it'd influence results.
IIRC the idea was:
- LAPIC "hijacks" memory references that are directed at the LAPIC
- Relocate the LAPIC's memory range so that it overlays SMM memory
- SMI occurs (generate it or wait for it)
- SMI handler code references SMM (data) memory, but that is now hijacked by LAPIC and writes to it are discarded and reads from it (for the most part) return zero
- SMI handler jumps to wrong location where you've planted your code
- You now have ring -2 access
As I remember the paper, they suggested that all SMI handlers begin the same way, with a couple of variations, so relocating the LAPIC memory works quite well.
I can try to find the paper if you like...
I don't trust that answer at all - it completely ignores the part of the manual that the original poster quoted in their question.

LtG wrote: Maybe I should get an Intel dev account and ask at their forums, which also led me to this, though I'm not 100% sure if I trust the answer:
https://software.intel.com/en-us/forums ... pic/305672
Cheers,
Brendan