Clarify how x86 interrupts work

Post by **Brendan** » Fri Jun 23, 2017 8:57 am

Hi,

Korona wrote:
Brendan wrote:Using IST is broken (or at least impractical to work-around properly) for the "NMI-SMI-NMI" case. Not using IST is not broken for the "NMI-SMI-NMI" case, and as an extra added bonus you can also get a free "triple fault when no progress is possible due to NMI-SMI-NMI-SMI-NMI-SMI-NMI.... storm" advantage.
As I said, I'm not sure if this can actually happen (my reading of the SDM is that it cannot for sane firmware that doesn't do SMI-IRET to enable NMI-NMI-IRET from NMI-RSM) and even if it can, it is astronomically rare and the IST solution can just panic in this case.

If you don't think "NMI-SMI-NMI" can happen; how can you think that NMI-SMI-NMI-SMI-NMI-SMI-NMI.... can happen?

Note: Sane firmware is like unicorns - it's impossible to prove that they don't exist, and sometimes you'll see something that looks like it could be a unicorn out the corner of your eye (until you realise it's just a horse).

Korona wrote:
Brendan wrote:If the machine check exception handler is capable of recovery in some cases; IST is completely broken because it's impossible to avoid "second MCE trashes first MCE's stack after first MCE cleared MCIP but before first MCE did IRET". Also, for this case you do want to clear MCIP as soon as possible (after pulling information out of MSRs and storing it somewhere safe, and before you bother processing any of it) to reduce the risk of triple fault ruining your ability to recover.
Just self-IPI and clear MCIP inside the IPI.

So, like:

MCE_handler:
    ...
    send_IPI_to_self();   // Work-around for the fact that I shouldn't have used IST but did
    ...
    iretd

actual_MCE_handler:
    // We're good now because this interrupt (started by "self IPI") doesn't use IST
    ...
    clearMCIP();
    ...
    iretd

Korona wrote:Besides, if there is an MCE before the IRET your kernel memory is doomed anyways.

Are you saying that if one measly little bit of memory that's only used by one CPU's stack and nothing else has failed, then instead of just using different memory and terminating the process, and instead of just taking that one CPU offline, I should start roaming the streets screaming "The sky is falling!" at people?

Korona wrote:
Brendan wrote:How about; send an IPI to other CPUs (to tell them you're doing an "emergency soft-offline"), then send "INIT IPI" to yourself (to reset CPU and put it into a "wait-for-SIPI" state); then keep the OS running (including cleaning up any mess left behind from "emergency soft-offline") using all the remaining CPUs?
That won't work. If your MCE happens in kernel space (and all nested MCEs that we're talking about here happen in kernel space) those other CPUs will just MCE too. The main purpose of MCE is reporting broken RAM. The fix is to disable that RAM. You cannot reliably disable kernel RAM. Yes, MCE can also report SERR and similar errors but those are even more critical and you cannot recover from them.

Nonsense.

If the broken RAM is in a "per NUMA domain" area of kernel space (which contains kernel code, etc) then I can take that NUMA domain offline (or maybe just replace the affected physical page/s from identical copies used by other NUMA domains in some cases). If the broken RAM is in a "per CPU" area (including a CPU's kernel stack) I can just take that CPU offline. If the broken RAM is in the message buffer area I can terminate the process that owns the message buffer. If the broken RAM was used for thread data structures or process data structures I can terminate a process. If the broken RAM was used for kernel's event log, then that can be discarded and replaced with new physical pages. If the broken RAM is in a "free physical page" stack or bitmap that'd be the easiest possible case! If ...

It's a micro-kernel, there isn't much else in kernel space.

Maybe only 1% of all memory is used for kernel space and maybe there's 1% of memory used in kernel space where broken RAM can't be recovered from; so maybe there's a 0.01% chance that an MCE caused by broken RAM is going to cause downtime because kernel couldn't recover.

Korona wrote:
Brendan wrote:No, I don't need to accept that. What I do need to accept is that for a "peer to peer distributed" OS like mine, the failure of any one computer can affect many computers; and minimising the risk of failures as much as possible, and recovering from failures as much as possible; is a necessity.
If you're writing your OS with a distributed environment in mind you should be even better off: Failure of a single machine should easily be recovered from. This recovery cannot (in case of a kernel space MCE) save a single machine but should instead rely on the other machines replicating the failing machine's tasks.

It's not the tasks that matter; it's the lost data. 20 people using 20 applications on 10 computers, where each application is spread across 5 out of 10 computers - one computer dies and up to half of the users could lose data.

Yes; there's many ways to reduce/limit/mitigate that (redundant services, saving recovery information to disk in the background, etc); but preventing a computer from losing data just because someone burped is an important part of that.

Cheers,

Brendan

Post by **Rusky** » Fri Jun 23, 2017 6:31 pm

Brendan wrote:To clarify the situation; I'd expect that this is entirely possible (and can't find anything in Intel's manual to convincingly prove or disprove it):
An NMI occurs

CPU begins starting the NMI handler (IDT lookup, checks, etc); and while this is happening an SMI is received causing "pending SMI"

CPU finishes starting the NMI handler (RIP pointing to first instruction of NMI handler) and commits changes to visible state

Immediately after "commit changes to visible state" CPU checks for pending interrupts, sees the pending SMI, and starts handling it. The "old RIP value" that the CPU stores in the SMM state save map is the address of the first instruction in the NMI handler.

The CPU executes the firmware's SMM code; and while that is happening a second NMI is received causing "pending NMI" (because NMIs are blocked in SMM)

The firmware's SMM code does "RSM"; the CPU executes this instruction (including loading "old RIP value" from SMM state save map that still points to the first instruction of the NMI handler) and commits changes to visible state

Immediately after "commit changes to visible state" CPU checks for pending interrupts, sees the pending NMI, and starts handling it.

CPU trashes the first NMI handler's stack when starting the second NMI handler.

From what I understand, this can only happen if the SMM code somehow reenables NMIs (by executing an iret)? Otherwise the second NMI would stay blocked until the first NMI returned. If this is the case, it's not much help on consumer hardware, but for someone who controlled SMM it means things are fixable.

Post by **Brendan** » Sat Jun 24, 2017 1:22 am

Hi,

Rusky wrote:
Brendan wrote:To clarify the situation; I'd expect that this is entirely possible (and can't find anything in Intel's manual to convincingly prove or disprove it):
An NMI occurs

CPU begins starting the NMI handler (IDT lookup, checks, etc); and while this is happening an SMI is received causing "pending SMI"

CPU finishes starting the NMI handler (RIP pointing to first instruction of NMI handler) and commits changes to visible state

Immediately after "commit changes to visible state" CPU checks for pending interrupts, sees the pending SMI, and starts handling it. The "old RIP value" that the CPU stores in the SMM state save map is the address of the first instruction in the NMI handler.

The CPU executes the firmware's SMM code; and while that is happening a second NMI is received causing "pending NMI" (because NMIs are blocked in SMM)

The firmware's SMM code does "RSM"; the CPU executes this instruction (including loading "old RIP value" from SMM state save map that still points to the first instruction of the NMI handler) and commits changes to visible state

Immediately after "commit changes to visible state" CPU checks for pending interrupts, sees the pending NMI, and starts handling it.

CPU trashes the first NMI handler's stack when starting the second NMI handler.
From what I understand, this can only happen if the SMM code somehow reenables NMIs (by executing an iret)? Otherwise the second NMI would stay blocked until the first NMI returned. If this is the case, it's not much help on consumer hardware, but for someone who controlled SMM it means things are fixable.

Yes (maybe).

It can happen if:

The SMM code does IRET for any reason (whether it's to intentionally enable NMI in SMM or not). Documented by both Intel and AMD
The CPU is a Pentium processor and the SMM code invokes a trap or fault handler (without ever doing any IRET). Documented by Intel
The SMM code clears the "NMI Mask" bit in the SMM state save map on (some?) AMD CPUs for whatever reason (firmware bugs)

This doesn't include 80x86 CPUs from other vendors (VIA, SiS, etc) where there's no documentation and no way to guess what they thought "correct behaviour" should be; and any (documented or undocumented) CPU errata that I'm not aware of.

Of course it changes nothing - there is no way for an OS to guess what the SMM code does for any computer, so the OS has to rely on "safe assumptions" (which means the OS must assume SMM code does unblock NMI).
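
To make the stack-trashing step in the sequence above concrete, here is a minimal user-space sketch in C. It is not kernel code and not taken from anyone's OS: `cpu_model`, `deliver_nmi` and `first_nmi_frame_survives` are hypothetical names, the stack is a plain array, and the model simply assumes the worst case discussed here (the second NMI is dispatched right after RSM, before the first handler has executed a single instruction). With IST, every NMI entry reloads the stack pointer to the same fixed top, so the second frame lands on top of the first; without IST, the second frame is pushed below it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_WORDS 64
#define FRAME_WORDS 5 /* SS, RSP, RFLAGS, CS, RIP pushed on interrupt entry in long mode */

/* Toy model of one CPU's NMI stack (hypothetical, for illustration only). */
struct cpu_model {
    uint64_t stack[STACK_WORDS]; /* grows downward from index STACK_WORDS */
    size_t sp;                   /* index of the current top of stack */
    bool use_ist;                /* true: each NMI entry restarts at the fixed IST top */
};

/* Push an interrupt frame filled with `tag`; returns the index of its lowest word. */
static size_t deliver_nmi(struct cpu_model *cpu, uint64_t tag)
{
    if (cpu->use_ist)
        cpu->sp = STACK_WORDS; /* IST: stack pointer is reloaded unconditionally */
    assert(cpu->sp >= FRAME_WORDS);
    for (size_t i = 0; i < FRAME_WORDS; i++)
        cpu->stack[--cpu->sp] = tag;
    return cpu->sp;
}

/* Models NMI, SMI, RSM, NMI with no handler progress in between. */
bool first_nmi_frame_survives(bool use_ist)
{
    struct cpu_model cpu;
    memset(cpu.stack, 0, sizeof cpu.stack);
    cpu.sp = STACK_WORDS;
    cpu.use_ist = use_ist;

    size_t first = deliver_nmi(&cpu, 1);
    /* SMI and RSM happen here; SMM saves state in SMRAM, so this stack is untouched.
       Assumption: the SMM code left NMIs unblocked, so the pending NMI fires now. */
    deliver_nmi(&cpu, 2);

    for (size_t i = 0; i < FRAME_WORDS; i++)
        if (cpu.stack[first + i] != 1)
            return false; /* the first frame was overwritten */
    return true;
}
```

In this model `first_nmi_frame_survives(false)` holds and `first_nmi_frame_survives(true)` does not, which is the whole argument in two calls.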

Cheers,

Brendan

Post by **Korona** » Sat Jun 24, 2017 7:56 am

Brendan wrote:If you don't think "NMI-SMI-NMI" can happen; how can you think that NMI-SMI-NMI-SMI-NMI-SMI-NMI.... can happen?

Note: Sane firmware is like unicorns - it's impossible to prove that they don't exist, and sometimes you'll see something that looks like it could be a unicorn out the corner of your eye (until you realise it's just a horse).

There is a difference between broken firmware and batshit insane firmware. I'm not assuming the firmware is not broken, I'm just assuming that it is not batshit crazy i.e. that it does not re-enable NMI and then leave NMI enabled after RSM.

Of course NMI-SMI-NMI can happen iff NMI-SMI-NMI-SMI-... can happen. Note that NMI-SMI-NMI-SMI-... is not the only scenario that leads to extensive kernel stack nesting without IST. In complex kernels something like SYSCALL-#BP-IRQ-#PF-#BP-MCE-#BP-NMI-#BP might happen too. Sure, you can mitigate that by enforcing rules like "no #DB or #BP in the kernel", "no #DB or #BP in interrupt paths" or "no #PF in the kernel". But IMHO the sane option is not to ban these (otherwise completely valid) features but to assume that the firmware is not batshit crazy.

MCE and NMI on non-IST stacks is broken if you're using SYSCALL (because those handlers might actually execute on the user-mode stack). Besides that point, using the IST also gives you a safe way to access per-CPU data (by storing a pointer to the per-CPU context relative to the stack base) that works even in the presence of SWAPGS. Keep in mind that you do not even know if FS is sane in MCE and NMI handlers.
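
A minimal sketch of the "pointer to the per-CPU context relative to the stack base" idea, written as ordinary user-space C. Everything named here (`ist_stack`, `percpu`, `ist_stack_bind`, `percpu_from_sp`) is hypothetical, and it assumes each IST stack is a power-of-two size and aligned to that size, so that masking any stack pointer inside the stack yields its base. A real kernel could just as well keep the pointer at a fixed offset from the IST top.

```c
#include <assert.h>
#include <stdint.h>

#define IST_STACK_SIZE 4096u /* assumed: power of two, stack aligned to its size */
#define NUM_IST_STACKS 2u

struct percpu {
    unsigned cpu_id;
};

/* One IST stack; the owner pointer sits at the lowest address (the stack base). */
struct ist_stack {
    struct percpu *owner;
    unsigned char space[IST_STACK_SIZE - sizeof(struct percpu *)];
};

_Static_assert(sizeof(struct ist_stack) == IST_STACK_SIZE,
               "stack must fill its aligned slot exactly");

static _Alignas(IST_STACK_SIZE) struct ist_stack ist_stacks[NUM_IST_STACKS];

/* Record which CPU owns stack `idx`; returns the top address to program into the TSS. */
void *ist_stack_bind(unsigned idx, struct percpu *cpu)
{
    assert(idx < NUM_IST_STACKS);
    ist_stacks[idx].owner = cpu;
    return (unsigned char *)&ist_stacks[idx] + IST_STACK_SIZE;
}

/* Recover the per-CPU pointer from any stack pointer strictly inside an IST stack.
   No GS or FS access is needed, so the SWAPGS state does not matter. */
struct percpu *percpu_from_sp(const void *sp)
{
    uintptr_t base = (uintptr_t)sp & ~((uintptr_t)IST_STACK_SIZE - 1);
    return ((const struct ist_stack *)base)->owner;
}
```

The handler would call `percpu_from_sp` with its own stack pointer after entry; by then the CPU has already pushed the interrupt frame, so the pointer is below the top and the mask lands on the right stack.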

You're basically saying: Ban SYSCALL, ban #DB and #BP in kernel code, accept large kernel stacks, accept ugly and slow workarounds for FS in hot IRQ paths because there might be some batshit insane firmware (that none of the Linux guys have actually encountered yet) that leaves NMI active after SMI even though Intel actively warns against that. I'm not willing to make that trade-off in my OS.

Brendan wrote:So, like:

MCE_handler:
    ...
    send_IPI_to_self();   // Work-around for the fact that I shouldn't have used IST but did
    ...
    iretd

actual_MCE_handler:
    // We're good now because this interrupt (started by "self IPI") doesn't use IST
    ...
    clearMCIP();
    ...
    iretd

Yes. However the point is not that the actual_MCE_handler does not use IST but that it is controlled by a sane priority/nesting mechanism (the local APIC). The handler might as well run on an IST too. I acknowledge that it is ugly, but we as OS developers have no control over stupid CPU design. We have to accept that the MCIP (and NMI) mechanism is broken and work around that. Considering that this is a rare code path I'm willing to accept that ugly (but simple) code.

Brendan wrote:Are you saying that if one measly little bit of memory that's only used by one CPU's stack and nothing else has failed, then instead of just using different memory and terminating the process, and instead of just taking that one CPU offline, I should start roaming the streets screaming "The sky is falling!" at people?

Basically yes, if that little bit of memory is on the MCE IST stack. Note that your no-IST MCE stack approach just triple-faults if there is an MCE on the top of any kernel stack.

Post by **Brendan** » Sat Jun 24, 2017 11:14 am

Hi,

Korona wrote:
Brendan wrote:Note: Sane firmware is like unicorns - it's impossible to prove that they don't exist, and sometimes you'll see something that looks like it could be a unicorn out the corner of your eye (until you realise it's just a horse).
There is a difference between broken firmware and batshit insane firmware. I'm not assuming the firmware is not broken, I'm just assuming that it is not batshit crazy i.e. that it does not re-enable NMI and then leave NMI enabled after RSM.

Intel's manual (and AMD's) warn OS developers that firmware/SMM may do this, but don't discourage firmware developers from doing it (and instead provide information to help them do it). That is like an official blessing from Intel (and AMD). Maybe there are reasons for this to be necessary (I don't know enough about the SMM code in typical firmware to guess if there are valid reasons why firmware *must* enable NMI in SMM).

Korona wrote:
Brendan wrote:If you don't think "NMI-SMI-NMI" can happen; how can you think that NMI-SMI-NMI-SMI-NMI-SMI-NMI.... can happen?
Of course NMI-SMI-NMI can happen iff NMI-SMI-NMI-SMI-... can happen. Note that NMI-SMI-NMI-SMI-... is not the only scenario that leads to extensive kernel stack nesting without IST. In complex kernels something like SYSCALL-#BP-IRQ-#PF-#BP-MCE-#BP-NMI-#BP might happen too. Sure, you can mitigate that by enforcing rules like "no #DB or #BP in the kernel", "no #DB or #BP in interrupt paths" or "no #PF in the kernel". But IMHO the sane option is not to ban these (otherwise completely valid) features but to assume that the firmware is not batshit crazy.

Sure - most OSs end up having to deal with "many interrupt sources (that have nothing to do with NMI) can nest". My latest approach is "(relatively large) per CPU kernel stack". Linux uses "small per task kernel stack; plus large per CPU stack".

Given that an OS typically has to deal with "many interrupt sources (that have nothing to do with NMI) can nest", dealing with "many interrupt sources (that can include NMI) can nest" is virtually free.

Korona wrote:MCE and NMI on non-IST stacks is broken if you're using SYSCALL (because those handlers might actually execute on the user-mode stack). Besides that point, using the IST also gives you a safe way to access per-CPU data (by storing a pointer to the per-CPU context relative to the stack base) that works even in the presence of SWAPGS. Keep in mind that you do not even know if FS is sane in MCE and NMI handlers.

From my perspective it's the opposite: SYSCALL is broken (if you're using MCE and NMI on non-IST stacks).

For history:

In 1997, both Intel and AMD introduced their own "annoyingly different" fast system call instructions. Intel's SYSENTER was sane. AMD's SYSCALL was broken by design (because unlike every other thing that could cause a switch to a higher privilege level it didn't load a safe ESP) and caused extra work for the kernel (having to find and load ESP itself); either deliberately (to avoid being compatible with Intel) or for marketing reasons (to pretend their SYSCALL is faster by ignoring the added cost of kernel fixing up ESP itself).
Between 1997 and ~2000; both Intel and AMD were "annoyingly uncooperative" and refused to support each other's alternatives for the sake of compatibility, possibly because Intel didn't want to support something that's broken by design, and possibly because AMD didn't want to admit that SYSCALL was broken by design. Fortunately almost nobody ever used SYSCALL in protected mode.
Intel tried to force people that wanted 64-bit to use Itanium (possibly to prevent future competition with AMD), and in ~2000 AMD introduced long mode (which was smart and successful). AMD also (ab)used the introduction of long mode as a way to force software developers (Microsoft) into adopting their broken by design SYSCALL instruction.
Since then; Intel continues to refuse to support SYSCALL in 32-bit, and AMD continues to refuse to support SYSENTER in 64-bit.

This is why we are stuck with a broken by design SYSCALL instruction - due to marketing and petty rivalry between AMD and Intel, and not because of technical merit.

Korona wrote:You're basically saying: Ban SYSCALL, ban #DB and #BP in kernel code, accept large kernel stacks, accept ugly and slow workarounds for FS in hot IRQ paths because there might be some batshit insane firmware (that none of the Linux guys have actually encountered yet) that leaves NMI active after SMI even though Intel actively warns against that. I'm not willing to make that trade-off in my OS.

You're putting words in my mouth, then exaggerating fiction.

Is there any proof that none of the Linux guys have encountered an "NMI-SMI-NMI" problem? Is there any proof that the Linux guys would know if an "NMI-SMI-NMI" problem was encountered?

Korona wrote:
Brendan wrote:So, like:
    ...   iretd
Yes. However the point is not that the actual_MCE_handler does not use IST but that it is controlled by a sane priority/nesting mechanism (the local APIC). The handler might as well run on an IST too. I acknowledge that it is ugly, but we as OS developers have no control over stupid CPU design. We have to accept that the MCIP (and NMI) mechanism is broken and work around that. Considering that this is a rare code path I'm willing to accept that ugly (but simple) code.

I'd rather do "MCE_handler: { save stuff; clear MCIP; process critical stuff; enable IRQs; process non-critical stuff; IRET; }" without extra mess and without problems.
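
The "save stuff; clear MCIP" part of that ordering can be sketched in C as follows. This is only an illustration under stated assumptions: `rdmsr`/`wrmsr` are stand-ins backed by an array so the example runs in user space, and `mce_snapshot` and `mce_save_and_clear_mcip` are hypothetical names. The MSR numbers and the MCIP bit position are the architectural ones (IA32_MCG_CAP 0x179, IA32_MCG_STATUS 0x17A, bank status registers starting at 0x401 with a stride of 4, MCIP as bit 2). A real handler would also capture the ADDR and MISC registers and clear the bank status registers later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IA32_MCG_CAP    0x179u
#define IA32_MCG_STATUS 0x17Au
#define IA32_MC0_STATUS 0x401u        /* status of bank i is at 0x401 + 4 * i */
#define MCG_STATUS_MCIP (1ull << 2)   /* machine check in progress */
#define MAX_BANKS       32u

/* Stand-ins for the RDMSR/WRMSR instructions, simulated with a table (assumption). */
static uint64_t fake_msr[0x600];
static uint64_t rdmsr(uint32_t msr) { return fake_msr[msr]; }
static void wrmsr(uint32_t msr, uint64_t value) { fake_msr[msr] = value; }

struct mce_snapshot {
    uint64_t mcg_status;
    unsigned bank_count;
    uint64_t bank_status[MAX_BANKS];
};

/* Copy the volatile machine check state somewhere safe, then clear MCIP as early
   as possible so a second MCE is delivered normally instead of shutting down the CPU. */
void mce_save_and_clear_mcip(struct mce_snapshot *snap)
{
    assert(snap != NULL);

    snap->mcg_status = rdmsr(IA32_MCG_STATUS);
    snap->bank_count = (unsigned)(rdmsr(IA32_MCG_CAP) & 0xFF);
    if (snap->bank_count > MAX_BANKS)
        snap->bank_count = MAX_BANKS;

    for (unsigned i = 0; i < snap->bank_count; i++)
        snap->bank_status[i] = rdmsr(IA32_MC0_STATUS + 4 * i);

    /* Only now is it safe to let another machine check in. */
    wrmsr(IA32_MCG_STATUS, snap->mcg_status & ~MCG_STATUS_MCIP);
}
```

Processing of the snapshot (critical first, then non-critical with IRQs enabled) would follow this call and work purely from the saved copy.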

Korona wrote:
Brendan wrote:Are you saying that if one measly little bit of memory that's only used by one CPU's stack and nothing else has failed, then instead of just using different memory and terminating the process, and instead of just taking that one CPU offline, I should start roaming the streets screaming "The sky is falling!" at people?
Basically yes, if that little bit of memory is on the MCE IST stack. Note that your no-IST MCE stack approach just triple-faults if there is an MCE on the top of any kernel stack.

No, due to "write-back" it would take a faulty CPU cache for this to happen; and I doubt it's reasonable to expect any OS to recover from faulty CPU cache.

Cheers,

Brendan

Post by **Rusky** » Sat Jun 24, 2017 12:00 pm

Brendan wrote:Yes (maybe).

It can happen if:
The SMM code does IRET for any reason (whether it's to intentionally enable NMI in SMM or not). Documented by both Intel and AMD

The CPU is a Pentium processor and the SMM code invokes a trap or fault handler (without ever doing any IRET). Documented by Intel

The SMM code clears the "NMI Mask" bit in the SMM state save map on (some?) AMD CPUs for whatever reason (firmware bugs)
This doesn't include 80x86 CPUs from other vendors (VIA, SiS, etc) where there's no documentation and no way to guess what they thought "correct behaviour" should be; and any (documented or undocumented) CPU errata that I'm not aware of.

Of course it changes nothing - there is no way for an OS to guess what the SMM code does for any computer, so the OS has to rely on "safe assumptions" (which means the OS must assume SMM code does unblock NMI).

Ah, glad I read that correctly. The fact that it's possible for SMM code to be written to avoid this issue is still useful, in my opinion.

Someone in a position like Apple could easily take control over the firmware and write it themselves. Someone like Microsoft could include an SMM audit in their hardware verification program. Someone who's actually using x86 in a human-life-critical system (who is first of all probably slightly nuts) could of course do the same things.

It's unfortunate that the firmware has the power to cause this problem, but for a hobbyist OS it's not really an issue, and for real-world use it's at least manageable. (Whether it actually is managed is a different question.)

Post by **Korona** » Sat Jun 24, 2017 12:08 pm

Brendan wrote:Intel's manual (and AMD's) warn OS developers that firmware/SMM may do this, but don't discourage firmware developers from doing it (and instead provide information to help them do it). That is like an official blessing from Intel (and AMD). Maybe there are reasons for this to be necessary (I don't know enough about the SMM code in typical firmware to guess if there are valid reasons why firmware *must* enable NMI in SMM).

I looked it up in the manuals again. Intel clearly states

This assumes that NMIs were not blocked before the SMI occurred. If NMIs were blocked before the SMI occurred, they are blocked after execution of RSM.

AMD states

Once NMI is recognized within SMM, NMI recognition remains enabled until SMM is exited, at which point NMI masking is restored to the state it was in before entering SMM.

That is, both manuals state that NMI-SMI-NMI cannot happen, not even for broken firmware. Intel warns that a second NMI can be nested inside the first NMI but in my understanding that concerns the "effects" of the NMI and not their dispatching (as it would otherwise contradict the previous paragraph). The following can happen: NMI -> SMI -> NMI is latched -> RSM -> NMI handler reads NMI_SC (or an equivalent register on non-Intel chipsets), deasserts the NMI sources -> IRET -> NMI is dispatched, with no NMI sources asserted in NMI_SC. The following cannot happen: NMI -> SMI -> NMI is latched -> RSM -> NMI is dispatched.
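
The "NMI is dispatched, with no NMI sources asserted in NMI_SC" case means the handler has to tolerate an NMI that appears to have no cause. A minimal sketch of that check in C, assuming the Intel chipset layout of NMI_SC (I/O port 0x61, bit 7 for SERR# and bit 6 for IOCHK#); `nmi_stats` and `nmi_account` are hypothetical names, and the status byte is passed in rather than read from the port so the example runs anywhere. NMIs from other sources (for example a watchdog or an IPI with NMI delivery mode) do not show up in this register, so a real handler would check those too before calling an NMI spurious.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NMI_SC_SERR_STS  0x80u /* bit 7: SERR# NMI source status */
#define NMI_SC_IOCHK_STS 0x40u /* bit 6: IOCHK# NMI source status */

struct nmi_stats {
    unsigned handled;  /* NMIs with at least one legacy source asserted */
    unsigned spurious; /* NMIs that arrived with nothing asserted */
};

/* Classify one NMI from the NMI_SC value read at handler entry.
   Returns true if a legacy source was asserted and needs servicing. */
bool nmi_account(struct nmi_stats *stats, uint8_t nmi_sc)
{
    assert(stats != NULL);

    if (nmi_sc & (NMI_SC_SERR_STS | NMI_SC_IOCHK_STS)) {
        stats->handled++;
        return true;
    }
    /* Already serviced by the previous handler invocation: count it and return. */
    stats->spurious++;
    return false;
}
```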

Brendan wrote:Given that an OS typically has to deal with "many interrupt sources (that have nothing to do with NMI) can nest", dealing with "many interrupt sources (that can include NMI) can nest" is virtually free.

Basically this is about how much stack space you're willing to reserve.

Brendan wrote:From my perspective it's the opposite: SYSCALL is broken (if you're using MCE and NMI on non-IST stacks).

Yes. SYSCALL is broken. NMI is also broken: Because it is unblocked by IRET and not by some NMI_IRET instruction, it causes nesting problems with exceptions like PF or BP. MCE is broken for the same reason. In a perfect world there would be a special SIRET imm8 instruction with the imm8 encoding what exceptional interrupt sources are unblocked by the IRET. However we do not live in a perfect world.

Post by **Brendan** » Sat Jun 24, 2017 12:42 pm

Hi,

Korona wrote:
Brendan wrote:Intel's manual (and AMD's) warn OS developers that firmware/SMM may do this, but don't discourage firmware developers from doing it (and instead provide information to help them do it). That is like an official blessing from Intel (and AMD). Maybe there are reasons for this to be necessary (I don't know enough about the SMM code in typical firmware to guess if there are valid reasons why firmware *must* enable NMI in SMM).
I looked it up in the manuals again. Intel clearly states
This assumes that NMIs were not blocked before the SMI occurred. If NMIs were blocked before the SMI occurred, they are blocked after execution of RSM.

Yes, and Intel also clearly states:

Intel wrote:A special case can occur if an SMI handler nests inside an NMI handler and then another NMI occurs. During NMI interrupt handling, NMI interrupts are disabled, so normally NMI interrupts are serviced and completed with an IRET instruction one at a time. When the processor enters SMM while executing an NMI handler, the processor saves the SMRAM state save map but does not save the attribute to keep NMI interrupts disabled. Potentially, an NMI could be latched (while in SMM or upon exit) and serviced upon exit of SMM even though the previous NMI handler has still not completed. One or more NMIs could thus be nested inside the first NMI handler. The NMI interrupt handler should take this possibility into consideration.

If I say "I can fly (but in special cases including reality I can't fly)" and you ignore the special case, it sounds really nice.

Korona wrote: AMD states
Once NMI is recognized within SMM, NMI recognition remains enabled until SMM is exited, at which point NMI masking is restored to the state it was in before entering SMM.

In the virtualisation stuff ("15.20 Event Injection") AMD also says:

AMD wrote:An injected NMI does not block delivery of further NMIs.

...which means that even if (none? some? most? all? - AMD's manual is missing a lot of detail that Intel's manual provides) AMD CPUs don't have NMI-SMI-NMI, there's still potential for "nested NMI if you're running in a VM".

Korona wrote:
Brendan wrote:From my perspective it's the opposite: SYSCALL is broken (if you're using MCE and NMI on non-IST stacks).
Yes. SYSCALL is broken. NMI is also broken: Because it is unblocked by IRET and not by some NMI_IRET instruction, it causes nesting problems with exceptions like PF or BP. MCE is broken for the same reason. In a perfect world there would be a special SIRET imm8 instruction with the imm8 encoding what exceptional interrupt sources are unblocked by the IRET. However we do not live in a perfect world.

Yes (and I'd add SMM to that list of broken things).

Cheers,

Brendan

OSDev.org
