Clarify how x86 interrupts work

bluemoon · Post by **bluemoon** » Sat Apr 08, 2017 1:25 am

LtG wrote:Finally, if I see either of the following:
Code: Select all
mov eax, [10]
mov eax, [0xFF000000]
Why would I think that must be the cause of the issue I'm debugging? I don't think any issue you are debugging is easily confused to be caused by the above examples. The above examples might look strange, at which point you investigate it and find out that they are #PF syscalls, and then hopefully remember it.

You need to make sure the #PF is caused by exactly mov eax, [10] and not something like mov eax, [ebx+10h], otherwise programmers will curse you when the below code trigger a syscall:

Code: Select all

uint32_t* p = nullptr;
...
return p[4];

And we assume the compiler emit mov eax, [ebx+10h] here, while it's probably legit for it to emit mov eax, [10] as well. And if your syscall interface rely on certain compiler behavior not black&white on specification, you are asking for trouble.

Antti · Post by **Antti** » Sat Apr 08, 2017 1:52 am

Brendan, I will check the assembly code you posted ("pending comments"). There is one really interesting feature when the NMI is in the "scratch stack frame" state. If there are NMIs triggered very aggressively, the stack does not overflow. This was not the most interesting feature but an important detail anyway. The new idea is here: your NMI code can go forward even if there is an NMI chaos.

Code: Select all

align 16
NmiHandler:
.L1:	nop				; one instruction
	add dword [IdtNmiHandlerAddress], 16
align 16
.L2:	nop				; one instruction
	add dword [IdtNmiHandlerAddress], 16
align 16
.L3:	nop				; one instruction
	add dword [IdtNmiHandlerAddress], 16

Even if you have to give up (no return is possible), you can "try" to execute your code (critical) and it goes forward. Assuming that there are really short breaks. If you combine all these ideas and something not mentioned/invented yet, you can write the most theoretically fail safe operating system.

onlyonemac · Post by **onlyonemac** » Sat Apr 08, 2017 8:31 am

LtG wrote:Finally, if I see either of the following:
Code: Select all
mov eax, [10]
mov eax, [0xFF000000]
Why would I think that must be the cause of the issue I'm debugging? I don't think any issue you are debugging is easily confused to be caused by the above examples. The above examples might look strange, at which point you investigate it and find out that they are #PF syscalls, and then hopefully remember it. The only confusion would happen if you had a legit reason to believe the issue you are debugging is actually caused by the above two examples _and_ you don't know about #PF syscalls.

Can you come up with some example you might be debugging where confusion would be likely?

Suppose you're debugging a segfault (which is what I usually end up debugging). You're stepping through your code, looking for a mov involving a null or otherwise invalid pointer, and you see code like the two lines above. You assume that that's the cause of the segfault, and next thing you decide to try to figure out why your compiler's outputting those instructions. You fail to figure out why the instructions are being produced, so you file a bug report on the compiler.

Nobody would make the same mistake with call or int, because anybody who knows enough assembler to debug at the assembly level would know that those instructions are typically used for system calls.

Antti · Post by **Antti** » Sun Apr 09, 2017 5:37 am

Code: Select all

NmiHandler:
        mov rsp, NmiHandler.Iret                ; we can exceptionally use rsp as a normal register
        cmp rsp, [Tss_RIP]                      ; test a special corner case
        je short .Stck

        mov rsp, [Tss_RIP]
        mov [Iret_RIP], rsp                     ; build iretq frame (RIP)
        mov rsp, [Tss_CS]
        mov [Iret_CS], rsp                      ; build iretq frame (CS)
        mov rsp, [Tss_RFLAGS]
        mov [Iret_RFLAGS], rsp                  ; build iretq frame (RFLAGS)
        mov rsp, [Tss_RSP]
        mov [Iret_RSP], rsp                     ; build iretq frame (RSP)
        mov rsp, [Tss_SS]
        mov [Iret_SS], rsp                      ; build iretq frame (SS)

.Stck:  mov rsp, nmiStack                       ; set valid stack
        call GeneralNmiHandler                  ; call general NmiHandler

.End:   mov rsp, Iret_RIP                       ; rsp = original iretq frame
        mov dword [Tss_LowerIST7], Tss_SS + 8   ; activate the trick again
        ; -----------------------------------   ; this is the critical instruction window
.Iret:  iretq


GeneralNmiHandler:
        nop                                     ; extremely carefully written code
        nop                                     ; may jump to start at any point
        nop
        ret                                     ; normal return

"100% predictability" if this works like I thought it should. May contain bugs but the idea is here and should be extensible to many processors, i.e. detect the CPU and use the "local" buffers.

Antti · Post by **Antti** » Sun Apr 09, 2017 6:11 am

Hmm, one problem. Maybe there should be more tests. This could be possible:

Code: Select all

        ; User space application
        mov rax, NmiHandler.Iret                ; the special value
        jmp rax

        ; An NMI exactly so that RIP points to that special value
        ; although it will trigger a page fault later.

LtG · Post by **LtG** » Mon Jun 19, 2017 12:29 pm

Sorry for the late reply, but this has still been bugging me. Is this whole NMI-SMI issue hypothetical or are there actual cases where this happened?

Assuming we're starting from ring3, either the NMI and SMI occur very close to each other (during the same cycle) they won't be dealt with until the next instruction boundary. At that point SMI is higher priority and wins, therefore the SMI is not nested within the NMI and there's no issue.

Same assumptions as above, but the SMI occurs the next cycle from NMI, in this case the original ring3 instruction during which NMI occurred is "completed", we reach instruction boundary and NMI is dealt with, after this it shouldn't matter when SMI occurs, it will have to wait until next instruction boundary, which would have to be _after_ the first NMI handler instruction, since there's no other instructions in between.. Or am I missing something here?

If the above is true, then you would be guaranteed to get to execute the first NMI handler instruction and swap the IST_nmi into the "re-entrant" stack, so this should be relatively simple to handle properly.

So what would be the instruction and interrupts (NMI and SMI) sequence that would cause issues? AMD manuals also mention that SMI waits for currently-executing instructions to finish, write their results, all buffered memory writes to update caches/memory.

Also, at least AMD mentions using IST for NMI's specifically, so I would hope they've thought this thru, though that's not really a guarantee...

Brendan · Post by **Brendan** » Mon Jun 19, 2017 9:42 pm

Hi,

LtG wrote:Sorry for the late reply, but this has still been bugging me. Is this whole NMI-SMI issue hypothetical or are there actual cases where this happened?

An OS has been running for 123 days and crashes causing the computer to reset. Why? Was it a "theoretical" NMI-SMI issue, or a software bug in the kernel or a driver or something, or a temporary hardware glitch ("cosmic ray"), or a hardware flaw (CPU errata, etc)? Nobody knows, they just reboot the computer. We all just accept the fact that sometimes systems crash and put things like "no warranty" and "not fit for any specific purpose" in licences/EULAs to cover our butts.

LtG wrote:Assuming we're starting from ring3, either the NMI and SMI occur very close to each other (during the same cycle) they won't be dealt with until the next instruction boundary. At that point SMI is higher priority and wins, therefore the SMI is not nested within the NMI and there's no issue.

Same assumptions as above, but the SMI occurs the next cycle from NMI, in this case the original ring3 instruction during which NMI occurred is "completed", we reach instruction boundary and NMI is dealt with, after this it shouldn't matter when SMI occurs, it will have to wait until next instruction boundary, which would have to be _after_ the first NMI handler instruction, since there's no other instructions in between.. Or am I missing something here?

Assume that an NMI occurs, the CPU reaches the next instruction boundary, then spends 250 cycles starting the NMI handler (doing IDT and GDT lookups, sanity/protection checks, storing return info on the stack, and handling any TLB and cache misses involved) before the first instruction of the NMI handler is started. If an SMI occurs anywhere within those 250 cycles then the SMI waits until the end of the 250 cycles, then SMI interrupts the NMI handler before the NMI handler can execute its first instruction.

LtG wrote:So what would be the instruction and interrupts (NMI and SMI) sequence that would cause issues?

For a "pathological worst case"; assume CPL=3 code is running with RSP set to the address of a critical kernel data structure and then does SYSCALL (causing "RSP = address of a critical kernel data structure" while at CPL=0); then before the kernel can execute a single instruction at its SYSCALL entry point you get an NMI, and before the kernel can execute a single instruction of the NMI handler you get an SMI (which happens to do an IRET), and then a second NMI occurs after the SMI handler has done its IRET.

In this case; if you use the IST mechanism then the second NMI handler overwrites the "return SS:RSP and CS:RIP" information on the first NMI handler's stack; and when the first NMI handler tries to return it will probably "return" to the point where the second NMI handler interrupted, causing an "infinite loop of return to self" (computer locked up in NMI handler). If you don't use IST then (because SYSCALL switches to CPL=0 without changing RSP) when the first NMI occurs (and CPU pushes "return info" at RSP) a critical kernel data structure gets trashed.

Essentially, as an OS developer your choices are:

Use IST for NMI and be susceptible to "NMI-SMI-NMI".
Refuse to use IST for NMI and/or machine check and support SYSCALL; and be susceptible to "SYSCALL with dodgy RSP followed by NMI or machine check".
Refuse to use IST for NMI and machine check, and refuse to support SYSCALL; and be immune to "NMI-SMI-NMI" and be immune to "SYSCALL with dodgy RSP followed by NMI or machine check", but have "potentially slightly slower" system calls (especially on systems that don't support SYSENTER in 64-bit code).

For something like a games console; the chance of "NMI-SMI-NMI" happening is tiny and any data lost when a game crashes isn't likely to be "very important"; so using IST would be the best choice. For a mission critical high availability server I'd want to do everything possible to minimise risk.

LtG wrote:Also, at least AMD mentions using IST for NMI's specifically, so I would hope they've thought this thru, though that's not really a guarantee...

When companies think things through, often that thinking is like "if N computers per year crash because if this issue, how much will that effect our profits?" and then they decide to slap a "resolved: won't fix" on it because they couldn't be bothered (and because their legal team already put things like "not for use in anything involving a risk to human life" in the fine print).

Cheers,

Brendan

LtG · Post by **LtG** » Tue Jun 20, 2017 11:41 am

Brendan wrote: An OS has been running for 123 days and crashes causing the computer to reset. Why? Was it a "theoretical" NMI-SMI issue, or a software bug in the kernel or a driver or something, or a temporary hardware glitch ("cosmic ray"), or a hardware flaw (CPU errata, etc)? Nobody knows, they just reboot the computer. We all just accept the fact that sometimes systems crash and put things like "no warranty" and "not fit for any specific purpose" in licences/EULAs to cover our butts.

You can check the return address from NMI before IRET, and then know that you're screwed (NMI-SMI-NMI), so it could at least be reported. But I guess nobody does that, so we don't really know.

I was really asking to know whether it is actually possible or not, see below.

Brendan wrote: Assume that an NMI occurs, the CPU reaches the next instruction boundary, then spends 250 cycles starting the NMI handler (doing IDT and GDT lookups, sanity/protection checks, storing return info on the stack, and handling any TLB and cache misses involved) before the first instruction of the NMI handler is started. If an SMI occurs anywhere within those 250 cycles then the SMI waits until the end of the 250 cycles, then SMI interrupts the NMI handler before the NMI handler can execute its first instruction.

To clarify, by NMI I mean the actual NMI, and NMI_handler is then the instructions you want to execute in response.

So you're saying that either of the following is true;
After NMI the CPU reaches next instruction boundary, no SMI has arrived yet, and the CPU starts to do it's internal stuff for NMI. At some point during this time, before the first NMI_handler instruction, an SMI arrives and:
a) the NMI itself is considered an instruction so that between NMI and NMI_handler there is a "second" instruction boundary?
b) The entire NMI duration (between last instruction before NMI and the fist NMI_handler instruction) is considered to be an instruction boundary?

Is there something in the manuals supporting that? I'm not saying it's not possible, I just haven't put together all the pieces that would indicate that what you are suggesting is possible. I would consider the next instruction boundary to occur _after_ the _next_ _instruction_, and that would have to be _after_ the first NMI_handler instruction. Of course this might also be fully implementation dependent, like I said, I don't know.

I understand the point you are making, but my main issue is with the fact that it seems to me you are suggesting that there's an extra instruction boundary caused by the NMI itself, which isn't an instruction (even if it causes u-ops), or that the entire duration is considered to be an instruction boundary. Further the CPU starting NMI handling, completing all of the prep for it, but not actually even _starting_ the actual handler seems like dropping the ball half way thru which would lead to inconsistent state (wrt IST's) and that actually causes the issue you are describing.

I can't think of an easy way to test this, and of course testing all different CPUs is practically impossible for a hobbyist.

Brendan wrote: Essentially, as an OS developer your choices are:
Use IST for NMI and be susceptible to "NMI-SMI-NMI".

Refuse to use IST for NMI and/or machine check and support SYSCALL; and be susceptible to "SYSCALL with dodgy RSP followed by NMI or machine check".

Refuse to use IST for NMI and machine check, and refuse to support SYSCALL; and be immune to "NMI-SMI-NMI" and be immune to "SYSCALL with dodgy RSP followed by NMI or machine check", but have "potentially slightly slower" system calls (especially on systems that don't support SYSENTER in 64-bit code).

Regardless of SYSCALL, if you don't use IST for NMI then you can't "guarantee" that the NMI will get a valid stack. Of course you can make sure that you always have a valid stack but bugs are always around. Using IST you can ensure there's a valid stack and in the case that you are right, that NMI-SMI-NMI issue is real and actually occurs, you can still detect it and panic (or abort previously running process/thread).

Out of curiosity, would it actually be possible to test this? IIRC there's two "types" of NMI, the one raised thru the actual NMI pin, which gets the real NMI treatment and then the others (software) which don't get the NMI special treatment. Is there a way to intentionally trigger the NMI and the SMI? I remember there was a way (at least on some CPUs) to "hack" into the SMM code, thus creating SMM that intentionally enables NMI's..

Intel manual wrote: 34.8 NMI HANDLING WHILE IN SMM
NMI interrupts are blocked upon entry to the SMI handler. If an NMI request occurs during the SMI handler, it is latched and serviced after the processor exits SMM. Only one NMI request will be latched during the SMI handler. If an NMI request is pending when the processor executes the RSM instruction, the NMI is serviced before the next instruction of the interrupted code sequence. This assumes that NMIs were not blocked before the SMI occurred. If NMIs were blocked before the SMI occurred, they are blocked after execution of RSM.

Although NMI requests are blocked when the processor enters SMM, they may be enabled through software by executing an IRET instruction. If the SMI handler requires the use of NMI interrupts, it should invoke a dummy interrupt service routine for the purpose of executing an IRET instruction. Once an IRET instruction is executed, NMI interrupt requests are serviced in the same “real mode” manner in which they are handled outside of SMM.

A special case can occur if an SMI handler nests inside an NMI handler and then another NMI occurs. During NMI interrupt handling, NMI interrupts are disabled, so normally NMI interrupts are serviced and completed with an IRET instruction one at a time. When the processor enters SMM while executing an NMI handler, the processor saves the SMRAM state save map but does not save the attribute to keep NMI interrupts disabled. Potentially, an NMI could be latched (while in SMM or upon exit) and serviced upon exit of SMM even though the previous NMI handler has still not completed. One or more NMIs could thus be nested inside the first NMI handler. The NMI interrupt handler should take this possibility into consideration.

Also, for the Pentium processor, exceptions that invoke a trap or fault handler will enable NMI interrupts from inside of SMM. This behavior is implementation specific for the Pentium processor and is not part of the IA-32 architecture.

Aren't the two bold parts contradictory? I'm assuming because one says disabled and the other blocked that there's a difference between the two?

Brendan · Post by **Brendan** » Wed Jun 21, 2017 3:36 pm

Hi,

LtG wrote:
Brendan wrote: An OS has been running for 123 days and crashes causing the computer to reset. Why? Was it a "theoretical" NMI-SMI issue, or a software bug in the kernel or a driver or something, or a temporary hardware glitch ("cosmic ray"), or a hardware flaw (CPU errata, etc)? Nobody knows, they just reboot the computer. We all just accept the fact that sometimes systems crash and put things like "no warranty" and "not fit for any specific purpose" in licences/EULAs to cover our butts.
You can check the return address from NMI before IRET, and then know that you're screwed (NMI-SMI-NMI), so it could at least be reported. But I guess nobody does that, so we don't really know.

I was really asking to know whether it is actually possible or not, see below.

Brendan wrote: Assume that an NMI occurs, the CPU reaches the next instruction boundary, then spends 250 cycles starting the NMI handler (doing IDT and GDT lookups, sanity/protection checks, storing return info on the stack, and handling any TLB and cache misses involved) before the first instruction of the NMI handler is started. If an SMI occurs anywhere within those 250 cycles then the SMI waits until the end of the 250 cycles, then SMI interrupts the NMI handler before the NMI handler can execute its first instruction.
To clarify, by NMI I mean the actual NMI, and NMI_handler is then the instructions you want to execute in response.

So you're saying that either of the following is true;
After NMI the CPU reaches next instruction boundary, no SMI has arrived yet, and the CPU starts to do it's internal stuff for NMI. At some point during this time, before the first NMI_handler instruction, an SMI arrives and:
a) the NMI itself is considered an instruction so that between NMI and NMI_handler there is a "second" instruction boundary?
b) The entire NMI duration (between last instruction before NMI and the fist NMI_handler instruction) is considered to be an instruction boundary?

Yes.

LtG wrote:Is there something in the manuals supporting that? I'm not saying it's not possible, I just haven't put together all the pieces that would indicate that what you are suggesting is possible. I would consider the next instruction boundary to occur _after_ the _next_ _instruction_, and that would have to be _after_ the first NMI_handler instruction. Of course this might also be fully implementation dependent, like I said, I don't know.

I don't remember ever seeing anything in the manuals that would confirm or refute this (which doesn't necessarily mean there isn't something somewhere in the manuals). However, modern CPUs don't literally execute instructions; they break them into pieces (micro-ops), execute micro-ops, then sort out the mess at the end using a re-ordering buffer (to put micro-ops that were executed "out of order" back into order) and then (for "non-bogus" micro-ops) committing changes to "visible state". When an NMI occurs the CPU would dump a bunch of micro-ops from the "micro-code ROM" onto the pipeline and sort that out at the end of the pipeline too, which also has to cause changed to be committed changes to "visible state".

The key words here are "commit changes to visible state". This is what I believe Intel actually means when they say "at an instruction boundary", because Intel have to describe behaviour in abstract terms (without excessively low level internal details that ties the documented behaviour to any specific CPU's implementation), because it'd be better for performance (avoiding fetching and decoding ~4 instructions when you know only one will be "non-bogus" when it reaches retirement), and because it's less complex.

Also note that if the CPU did guarantee that SMI is postponed until after the NMI handler's first instruction is retired; it doesn't help (that one instruction would need to copy everything off of the original stack and store it at a "different for each nested NMI" address, then load RSP with the new stack address, all without corrupting any other registers).

LtG wrote:
Brendan wrote: Essentially, as an OS developer your choices are:
Use IST for NMI and be susceptible to "NMI-SMI-NMI".

Refuse to use IST for NMI and/or machine check and support SYSCALL; and be susceptible to "SYSCALL with dodgy RSP followed by NMI or machine check".

Refuse to use IST for NMI and machine check, and refuse to support SYSCALL; and be immune to "NMI-SMI-NMI" and be immune to "SYSCALL with dodgy RSP followed by NMI or machine check", but have "potentially slightly slower" system calls (especially on systems that don't support SYSENTER in 64-bit code).

Regardless of SYSCALL, if you don't use IST for NMI then you can't "guarantee" that the NMI will get a valid stack. Of course you can make sure that you always have a valid stack but bugs are always around.

If you can guarantee that the "SS0:RSP0" fields in the TSS are valid when the CPU is running CPL=3 code, and if you can guarantee that SS:RSP is valid when running CPL=0 code; then you can guarantee that (without IST) the NMI handler will get a valid stack. If you can't guarantee these things then NMI is the least of your worries (any IRQ at any time can send you to "double fault" or worse).

That is why SYSCALL matters - it breaks "CPU changes to a known good stack during a switch to more privileged code" which means that you can't guarantee that SS:RSP is valid when running CPL=0 code, which makes it impossible (without IST) to guarantee that the NMI handler (or machine check handler) gets a valid stack when CPL=0 code is interrupted.

LtG wrote:Using IST you can ensure there's a valid stack and in the case that you are right, that NMI-SMI-NMI issue is real and actually occurs, you can still detect it and panic (or abort previously running process/thread).

For "is it real", Intel thinks it's real enough to describe in their manual, and Linux developers think it's real enough to add an ugly work-around to reduce the chance of problems (but not avoid all chance of problems).

LtG wrote:Out of curiosity, would it actually be possible to test this? IIRC there's two "types" of NMI, the one raised thru the actual NMI pin, which gets the real NMI treatment and then the others (software) which don't get the NMI special treatment. Is there a way to intentionally trigger the NMI and the SMI? I remember there was a way (at least on some CPUs) to "hack" into the SMM code, thus creating SMM that intentionally enables NMI's..

You could probably use IPIs, performance monitoring counters and/or local APIC timer (in "TSC deadline" mode) to try to send NMIs and SMIs at the right times; but it'd be very easy for the timing to be "slightly not exactly right", and I have no idea how various motherboards would respond to "phantom SMI" arriving (I'd assume most would find no cause and ignore it safely, but...). Another option might be to use "PS/2 emulation for USB devices" (with a key on a keyboard jammed down for "key repeat forever") to cause frequent SMIs; then add an "IPI that triggers NMI" at the end of the NMI handler (to get NMIs as often as possible).

LtG wrote:
Intel manual wrote: 34.8 NMI HANDLING WHILE IN SMM
NMI interrupts are blocked upon entry to the SMI handler. If an NMI request occurs during the SMI handler, it is latched and serviced after the processor exits SMM. Only one NMI request will be latched during the SMI handler. If an NMI request is pending when the processor executes the RSM instruction, the NMI is serviced before the next instruction of the interrupted code sequence. This assumes that NMIs were not blocked before the SMI occurred. If NMIs were blocked before the SMI occurred, they are blocked after execution of RSM.

Although NMI requests are blocked when the processor enters SMM, they may be enabled through software by executing an IRET instruction. If the SMI handler requires the use of NMI interrupts, it should invoke a dummy interrupt service routine for the purpose of executing an IRET instruction. Once an IRET instruction is executed, NMI interrupt requests are serviced in the same “real mode” manner in which they are handled outside of SMM.

A special case can occur if an SMI handler nests inside an NMI handler and then another NMI occurs. During NMI interrupt handling, NMI interrupts are disabled, so normally NMI interrupts are serviced and completed with an IRET instruction one at a time. When the processor enters SMM while executing an NMI handler, the processor saves the SMRAM state save map but does not save the attribute to keep NMI interrupts disabled. Potentially, an NMI could be latched (while in SMM or upon exit) and serviced upon exit of SMM even though the previous NMI handler has still not completed. One or more NMIs could thus be nested inside the first NMI handler. The NMI interrupt handler should take this possibility into consideration.

Also, for the Pentium processor, exceptions that invoke a trap or fault handler will enable NMI interrupts from inside of SMM. This behavior is implementation specific for the Pentium processor and is not part of the IA-32 architecture.
Aren't the two bold parts contradictory? I'm assuming because one says disabled and the other blocked that there's a difference between the two?

I'd read those bold parts (combined) as "If NMIs were blocked before the SMI occurred, then NMI is not blocked while in SMM, but will be blocked again after execution of RSM".

Cheers,

Brendan

LtG · Post by **LtG** » Thu Jun 22, 2017 1:10 pm

Brendan wrote: Also note that if the CPU did guarantee that SMI is postponed until after the NMI handler's first instruction is retired; it doesn't help (that one instruction would need to copy everything off of the original stack and store it at a "different for each nested NMI" address, then load RSP with the new stack address, all without corrupting any other registers).

Using ISTx, the first NMI handler instruction overwrites ISTx with hardcoded value. The first NMI stack is safe, it's value is known, all future NMI's will trash each other (doesn't matter) on the secondary stack. Both primary and secondary stacks are known.

Wouldn't the above work? You run the last NMI to completion, restore the primary stack copying the return stuff someplace safe and then return to original "app".

Brendan wrote: For "is it real", Intel thinks it's real enough to describe in their manual, and Linux developers think it's real enough to add an ugly work-around to reduce the chance of problems (but not avoid all chance of problems).

Intel doesn't think it's real, they think it's something you have to take care of in the NMI handler, so presumably it can be taken care of? If it can't be taken care of, then why would Intel say "The NMI interrupt handler should take this possibility into consideration"? Yes, it's possible they are mean "you're screwed", but they don't actually say that. Plus both Intel and AMD specifically mention IST for NMI usage. Also found this in the manual:

Intel manual 34.3.1 Entering SMM wrote: An SMI has a greater priority than debug exceptions and external interrupts. Thus, if an NMI, maskable hardware interrupt, or a debug exception occurs at an instruction boundary along with an SMI, only the SMI is handled. Subsequent SMI requests are not acknowledged while the processor is in SMM. The first SMI interrupt request that occurs while the processor is in SMM (that is, after SMM has been acknowledged to external hardware) is latched and serviced when the processor exits SMM with the RSM instruction. The processor will latch only one SMI while in SMM.

So even if the CPU starts NMI but switches to SMI midway NMI (before first NMI handler instruction), then the SMI "eats" the NMI, and in this case doesn't the CPU have to preserve the visible state in such a fashion as if the NMI never happened at all? And therefore a new NMI would be issued after the SMI and there's no issue, or the SMI only gets acknowledge after the first NMI instruction.. I can't imagine a sequence where the first NMI handler instruction is not executed, but the NMI is already "committed" to and the CPU switches to SMI handler... Assuming the above quote from Intel is accurate can you come up with a sequence that's consistent with these manual quotes and would cause the first NMI handler instruction not to be executed?

Brendan wrote: You could probably use IPIs, performance monitoring counters and/or local APIC timer (in "TSC deadline" mode) to try to send NMIs and SMIs at the right times; but it'd be very easy for the timing to be "slightly not exactly right", and I have no idea how various motherboards would respond to "phantom SMI" arriving (I'd assume most would find no cause and ignore it safely, but...). Another option might be to use "PS/2 emulation for USB devices" (with a key on a keyboard jammed down for "key repeat forever") to cause frequent SMIs; then add an "IPI that triggers NMI" at the end of the NMI handler (to get NMIs as often as possible).

The timing would be difficult, however:
a) Brute force it, note sure how many cycles the whole thing takes, assuming low thousands, you'd be able to get 1M tests per second, running it for a couple of days should be pretty likely to have caused issues if there are any
b) Cause the timing, for instance the SMI handler checks how far the NMI handler got and make slight adjustments to get all "possible" cycle "differences", might need second CPU to do properly, but that shouldn't be an issue

Just not sure if any of that is really worth the effort though. Note also that the B option above would require custom SMI handler, but as said, it should be possible to break into the SMM mode which is required anyway since I don't know if most/any SMI's ever intentionally re-enable the NMI's (by issuing IRET), custom SMI would need to do that.

Brendan wrote:
LtG wrote:
Intel manual wrote: 34.8 NMI HANDLING WHILE IN SMM
...
If NMIs were blocked before the SMI occurred, they are blocked after execution of RSM.
...
When the processor enters SMM while executing an NMI handler, the processor saves the SMRAM state save map but does not save the attribute to keep NMI interrupts disabled. ...
Aren't the two bold parts contradictory? I'm assuming because one says disabled and the other blocked that there's a difference between the two?
I'd read those bold parts (combined) as "If NMIs were blocked before the SMI occurred, then NMI is not blocked while in SMM, but will be blocked again after execution of RSM".

Can you explain that further? Aren't NMI _always_ blocked in SMM (unless specifically re-enabled by the SMI handler)?

Also, ignoring the SMM part (since we don't care about it), your sentence reads:
"If NMIs were blocked before the SMI occurred, but will be blocked again after execution of RSM."

Or did your "then NMI is not blocked while in SMM" mean "then NMI is unblocked..."?

LtG · Post by **LtG** » Thu Jun 22, 2017 1:21 pm

Intel manual wrote: 34.8 NMI HANDLING WHILE IN SMM
...
If NMIs were blocked before the SMI occurred, they are blocked after execution of RSM.
...
When the processor enters SMM while executing an NMI handler, the processor saves the SMRAM state save map but does not save the attribute to keep NMI interrupts disabled. ...

After reading that a couple of more times I guess it's just a round about way of saying that the NMI's being blocked (due to current NMI) is preserved accross SMI handler, however it is not saved as part of the pre-SMI state, and thus if the SMI intentionally re-enables NMI's, then after RSM they are still re-enabled because the "block NMI's" state isn't saved and restored, but is just a "flag" in the CPU which the SMI "clobbered".

To be honest, I don't understand why this issue wasn't "fixed" here where it is caused, either by having the SMI issue some "block NMI's until next IRET" instruction or by having it be preserved. Why allow the "SMI never happened from ring0+ perspective" to be broken..? Just an oversight?

Brendan · Post by **Brendan** » Thu Jun 22, 2017 3:28 pm

Hi,

LtG wrote:
Brendan wrote:Also note that if the CPU did guarantee that SMI is postponed until after the NMI handler's first instruction is retired; it doesn't help (that one instruction would need to copy everything off of the original stack and store it at a "different for each nested NMI" address, then load RSP with the new stack address, all without corrupting any other registers).
Using ISTx, the first NMI handler instruction overwrites ISTx with hardcoded value. The first NMI stack is safe, it's value is known, all future NMI's will trash each other (doesn't matter) on the secondary stack. Both primary and secondary stacks are known.

Wouldn't the above work? You run the last NMI to completion, restore the primary stack copying the return stuff someplace safe and then return to original "app".

Short answer: Yes, I think that can work.

For multi-CPU (in case 2 or more CPUs get an NMI at the same time) each CPU would need a different value for its ISTx entry. This means that (without other tricks) you can't (e.g.) have a single NMI handler that does "lock add qword [address_of_IST_entry],1024" because "address_of_IST_entry" has to be different for different CPUs.

The first trick to solve this problem would be to have a different NMI handler entry point for different CPUs. The first CPU's NMI handler does "lock add qword [address_of_IST_entry_for_first_CPU],1024", then second CPU's NMI handler does "lock add qword [address_of_IST_entry_for_second_CPU],1024", etc; and then they'd all do a "jmp" to a common NMI handler. This would also mean having a different IDT for each CPU.

The second trick would be to use paging to create a "per CPU" area of kernel space, so that all CPUs use the same virtual addresses in their IST but those virtual addresses correspond to different physical addresses for each CPU.

With one of these tricks; if the CPU does guarantee that the first instruction is executed before a pending SMI is allowed (and if the kernel is 64-bit where "NULL segment checking" isn't done); then I can't think of a reason why it won't work.

Note: If the "CPL=3 code descriptor" is "read, execute" (and not "execute only"); then CPL=3 code can load that code descriptor into DS (or ES or ...). In this case I don't know if the CPU checks if DS can be used for writes while in 64-bit code. Intel's manual says the CPU does check (in general) and doesn't mention "but not in 64-bit code" anywhere (like it does for other checks), but that may just be an error/omission in the manual. In theory (if CPL=3 code loaded DS with a "read, execute" descriptor and is then interrupted by NMI) the NMI handler's write to the IST might or might not cause a GPF. If this actually is a problem; it would be fixable by using "execute only" for CPL=3 code descriptor.

LtG wrote:
Brendan wrote:For "is it real", Intel thinks it's real enough to describe in their manual, and Linux developers think it's real enough to add an ugly work-around to reduce the chance of problems (but not avoid all chance of problems).
Intel doesn't think it's real, they think it's something you have to take care of in the NMI handler, so presumably it can be taken care of? If it can't be taken care of, then why would Intel say "The NMI interrupt handler should take this possibility into consideration"? Yes, it's possible they are mean "you're screwed", but they don't actually say that.

You can take "another NMI interrupts NMI handler somewhere" into consideration and minimise the chance that it can happen (e.g. become immune to the problem if the second NMI interrupts after you've managed to execute 10 or more instructions). The question is how much "ugly work-around" you're willing to have, and whether or not it's possible to be 100% immune (if it's possible for a second NMI to occur before you've executed the first instruction).

LtG wrote:Plus both Intel and AMD specifically mention IST for NMI usage.

It's a messy and infrequent corner-case with no easy solution; and a technical writer working for one of these companies wrote "something" (that another technical writer at the other company probably quietly copied). There's no real guarantee that it's correct, and it's certainly not as simple as "just use IST and don't worry!".

LtG wrote:Also found this in the manual:
Intel manual 34.3.1 Entering SMM wrote: An SMI has a greater priority than debug exceptions and external interrupts. Thus, if an NMI, maskable hardware interrupt, or a debug exception occurs at an instruction boundary along with an SMI, only the SMI is handled. Subsequent SMI requests are not acknowledged while the processor is in SMM. The first SMI interrupt request that occurs while the processor is in SMM (that is, after SMM has been acknowledged to external hardware) is latched and serviced when the processor exits SMM with the RSM instruction. The processor will latch only one SMI while in SMM.
So even if the CPU starts NMI but switches to SMI midway NMI (before first NMI handler instruction), then the SMI "eats" the NMI, and in this case doesn't the CPU have to preserve the visible state in such a fashion as if the NMI never happened at all?

To clarify the situation; I'd expect that this is entirely possible (and can't find anything in Intel's manual to convincingly prove or disprove it):

An NMI occurs
CPU begins starting the NMI handler (IDT lookup, checks, etc); and while this is happening an SMI is received causing "pending SMI"
CPU finishes starting the NMI handler (RIP pointing to first instruction of NMI handler) and commits changes to visible state
Immediately after "commit changes to visible state" CPU checks for pending interrupts, sees the pending SMI, and starts handling it. The "old RIP value" that the CPU stores in the SMM state save map is the address of the first instruction in the NMI handler.
The CPU executes the firmware's SMM code; and while that is happening a second NMI is received causing "pending NMI" (because NMIs are blocked in SMM)
The firmware's SMM code does "RSM"; the CPU executes this instruction (including loading "old RIP value" from SMM state save map that still points to the first instruction of the NMI handler) and commits changes to visible state
Immediately after "commit changes to visible state" CPU checks for pending interrupts, sees the pending NMI, and starts handling it.
CPU trashes the first NMI handler's stack when starting the second NMI handler.

LtG wrote:
Brendan wrote:You could probably use IPIs, performance monitoring counters and/or local APIC timer (in "TSC deadline" mode) to try to send NMIs and SMIs at the right times; but it'd be very easy for the timing to be "slightly not exactly right", and I have no idea how various motherboards would respond to "phantom SMI" arriving (I'd assume most would find no cause and ignore it safely, but...). Another option might be to use "PS/2 emulation for USB devices" (with a key on a keyboard jammed down for "key repeat forever") to cause frequent SMIs; then add an "IPI that triggers NMI" at the end of the NMI handler (to get NMIs as often as possible).
The timing would be difficult, however:
a) Brute force it, note sure how many cycles the whole thing takes, assuming low thousands, you'd be able to get 1M tests per second, running it for a couple of days should be pretty likely to have caused issues if there are any
b) Cause the timing, for instance the SMI handler checks how far the NMI handler got and make slight adjustments to get all "possible" cycle "differences", might need second CPU to do properly, but that shouldn't be an issue

Just not sure if any of that is really worth the effort though. Note also that the B option above would require custom SMI handler, but as said, it should be possible to break into the SMM mode which is required anyway since I don't know if most/any SMI's ever intentionally re-enable the NMI's (by issuing IRET), custom SMI would need to do that.

Due to risk of malware (rootkits, etc); it's "intentionally almost impossible" to modify firmware's SMM code on almost all motherboards. I'm not sure, but it might be possible to use hardware virtualization to bypass this (at least I vaguely remember something about SMM virtualization in AMD's manual), but if it is possible I'm also not if it'd influence results.

For "is it worth it", if you care about the maximum reliability possible (e.g. those "mission critical high availability" servers), there's only 2 possible outcomes:

You test some computers and (hopefully quickly) find out that one does have an "NMI-SMI-NMI" problem; and end up having to implement something to reduce or eliminate the "NMI-SMI-NMI" problem.
You spend ages testing hundreds of computers without finding the problem, but still can't be sure that another different computer (or a future computer that doesn't exist yet) won't have an "NMI-SMI-NMI" problem; and end up having to implement something to reduce or eliminate the "NMI-SMI-NMI" problem.

Cheers,

Brendan

Brendan · Post by **Brendan** » Thu Jun 22, 2017 3:36 pm

Hi,

LtG wrote:
Intel manual wrote: 34.8 NMI HANDLING WHILE IN SMM
...
If NMIs were blocked before the SMI occurred, they are blocked after execution of RSM.
...
When the processor enters SMM while executing an NMI handler, the processor saves the SMRAM state save map but does not save the attribute to keep NMI interrupts disabled. ...
After reading that a couple of more times I guess it's just a round about way of saying that the NMI's being blocked (due to current NMI) is preserved accross SMI handler, however it is not saved as part of the pre-SMI state, and thus if the SMI intentionally re-enables NMI's, then after RSM they are still re-enabled because the "block NMI's" state isn't saved and restored, but is just a "flag" in the CPU which the SMI "clobbered".

That looks right to me!

LtG wrote:To be honest, I don't understand why this issue wasn't "fixed" here where it is caused, either by having the SMI issue some "block NMI's until next IRET" instruction or by having it be preserved. Why allow the "SMI never happened from ring0+ perspective" to be broken..? Just an oversight?

NMI was a horrible thing from the start (the "iret unblocks NMI" is a disaster - why not have a flag in EFLAGS?), then became a badly documented horrible thing (I'm still not entirely sure which parts of which chipsets might trigger NMI), then became a buggy badly documented horrible thing (when SMM was introduced), then became a potentially extra buggy badly documented horrible thing (when Pentium was released with what I'd consider errata), then it hasn't improved since.

Cheers,

Brendan

LtG · Post by **LtG** » Thu Jun 22, 2017 4:23 pm

Brendan wrote: The first trick to solve this problem would be to have a different NMI handler entry point for different CPUs. The first CPU's NMI handler does "lock add qword [address_of_IST_entry_for_first_CPU],1024", then second CPU's NMI handler does "lock add qword [address_of_IST_entry_for_second_CPU],1024", etc; and then they'd all do a "jmp" to a common NMI handler. This would also mean having a different IDT for each CPU.

I plan on treating each CPU/core independently anyway, so for me that's not an issue. Of course ISR code is shared thru paging. I may, not sure yet, have to add a few "hacks" for performance, but I'll try to avoid that if at all possible or reasonable.

Brendan wrote: Note: If the "CPL=3 code descriptor" is "read, execute" (and not "execute only"); then CPL=3 code can load that code descriptor into DS (or ES or ...). In this case I don't know if the CPU checks if DS can be used for writes while in 64-bit code. Intel's manual says the CPU does check (in general) and doesn't mention "but not in 64-bit code" anywhere (like it does for other checks), but that may just be an error/omission in the manual. In theory (if CPL=3 code loaded DS with a "read, execute" descriptor and is then interrupted by NMI) the NMI handler's write to the IST might or might not cause a GPF. If this actually is a problem; it would be fixable by using "execute only" for CPL=3 code descriptor.

Good point, can't be bothered to check the manuals to see if the ISR does have to worry about bogus DS by malicious userland. I see no reason why CS should have anything but execute, especially given today's compilers (gcc) that don't support segmentation anyway. Which as you said makes it a moot point.

I liked the idea of segmentation (more granularity) but never really benchmarked it, and since AMD64 dropped it I guess I don't have much reason to care about it at this point, though may check its performance if I ever get around to creating x86_32 kernel.

Brendan wrote: You can take "another NMI interrupts NMI handler somewhere" into consideration and minimise the chance that it can happen (e.g. become immune to the problem if the second NMI interrupts after you've managed to execute 10 or more instructions). The question is how much "ugly work-around" you're willing to have, and whether or not it's possible to be 100% immune (if it's possible for a second NMI to occur before you've executed the first instruction).

I don't think adjusting NMI handler IST entry is that ugly, and at least it's quite simple and straightforward, and either minimizes the risk to one instruction "timing" or completely gets rid of the issue.

Brendan wrote: It's a messy and infrequent corner-case with no easy solution; and a technical writer working for one of these companies wrote "something" (that another technical writer at the other company probably quietly copied). There's no real guarantee that it's correct, and it's certainly not as simple as "just use IST and don't worry!".

I can't remember AMD even mentioning the nested NMI/NMI-SMI-NMI issue at all, and if Intel decided to add a section specifically mentioning this and also says that OS should prepare for it, that implies it can be dealt with. I can't really think of anything except IST to avoid it (or not using SYSCALL).

Brendan wrote:
Intel manual 34.3.1 Entering SMM wrote: An SMI has a greater priority than debug exceptions and external interrupts. Thus, if an NMI, maskable hardware interrupt, or a debug exception occurs at an instruction boundary along with an SMI, only the SMI is handled. Subsequent SMI requests are not acknowledged while the processor is in SMM. The first SMI interrupt request that occurs while the processor is in SMM (that is, after SMM has been acknowledged to external hardware) is latched and serviced when the processor exits SMM with the RSM instruction. The processor will latch only one SMI while in SMM.
To clarify the situation; I'd expect that this is entirely possible (and can't find anything in Intel's manual to convincingly prove or disprove it):
An NMI occurs

CPU begins starting the NMI handler (IDT lookup, checks, etc); and while this is happening an SMI is received causing "pending SMI"

CPU finishes starting the NMI handler (RIP pointing to first instruction of NMI handler) and commits changes to visible state

Immediately after "commit changes to visible state" CPU checks for pending interrupts, sees the pending SMI, and starts handling it. The "old RIP value" that the CPU stores in the SMM state save map is the address of the first instruction in the NMI handler.

The CPU executes the firmware's SMM code; and while that is happening a second NMI is received causing "pending NMI" (because NMIs are blocked in SMM)

The firmware's SMM code does "RSM"; the CPU executes this instruction (including loading "old RIP value" from SMM state save map that still points to the first instruction of the NMI handler) and commits changes to visible state

Immediately after "commit changes to visible state" CPU checks for pending interrupts, sees the pending NMI, and starts handling it.

CPU trashes the first NMI handler's stack when starting the second NMI handler.

Earlier I mentioned the two possibilities I could think of for instruction boundary to allow for the issue, one where the instruction boundary is the entire duration of "invoking" NMI and the other where each uop (or something smaller than an actual instruction) is considered to be an instruction thus creating extra instruction boundaries between two "real" instructions.

The Intel quote above says that if NMI and SMI occur at same instruction boundary then SMI wins and NMI gets forgotten (though if still present after SMI then it would get taken care of). I think that means that the "prolonged instruction boundary" case can be dismissed.

I can't point to anything in the manual that explicitly says that the "uop boundaries = instruction boundaries", which would imply that the sequence you made up is maybe possible. I would consider it a bit pathological of Intel/AMD however. The "invoke NMI" has already been committed to, changing to SMI midway (before first NMI handler instruction) doesn't seem reasonable to me.

Also worth noting, AFAIK all of this applies only if the SMI handler _intentionally_ enables NMI's, which was missing in your "sequence".. But do SMI handlers do that? And if they do, aren't they prone to race conditions? How can the SMI change IDT back to OS IDT (I assume they change IDT, or does the CPU restore it from the state it's already saved) and RSM without allowing NMI in between (assuming they enabled NMI)?

Brendan wrote: Due to risk of malware (rootkits, etc); it's "intentionally almost impossible" to modify firmware's SMM code on almost all motherboards. I'm not sure, but it might be possible to use hardware virtualization to bypass this (at least I vaguely remember something about SMM virtualization in AMD's manual), but if it is possible I'm also not if it'd influence results.

I'm not sure how often it applies, but there was a "hack" that allowed access to SMM, and I think it was pretty simple and mostly universal. Might be fixed in newer systems though.

IIRC the idea was:
- LAPIC "hijacks" memory references that are directed at the LAPIC
- Relocating LAPIC memory to overlay SMM memory
- SMI occurs (generate it or wait for it)
- SMI handler code references SMM (data) memory, but that is now hijacked by LAPIC and writes to it are discarded and reads from it (for the most part) return zero
- SMI handler jumps to wrong location where you've planted your code
- You now have ring -2 access

As I remember the paper, they suggested that all SMI handlers begin the same, with a couple of variations, so relocating the LAPIC memory works quite well..

I can try to find the paper if you like...

Brendan wrote: For "is it worth it", if you care about the maximum reliability possible (e.g. those "mission critical high availability" servers), there's only 2 possible outcomes:
You test some computers and (hopefully quickly) find out that one does have an "NMI-SMI-NMI" problem; and end up having to implement something to reduce or eliminate the "NMI-SMI-NMI" problem.

You spend ages testing hundreds of computers without finding the problem, but still can't be sure that another different computer (or a future computer that doesn't exist yet) won't have an "NMI-SMI-NMI" problem; and end up having to implement something to reduce or eliminate the "NMI-SMI-NMI" problem.

I was more thinking along the lines that the way I interpret the manuals, the trick we discussed (NMI adjusts IST on first ISR instruction) should work. Attempting to prove that it does seems near impossible. The practicality of getting every CPU model and revision to test, and being confident enough that the testing method is even reliable (I don't like to trust brute force methods where I'm only relying on sheer number of tests and not "determinicity") it seems that there's nothing I can do about it.

Maybe I should get an Intel dev account and ask at their forums, which also lead me to this, though I'm not 100% sure if I trust the answer:
https://software.intel.com/en-us/forums ... pic/305672

At some point I'll have to test the performance of call gates vs SYSCALL vs SYSENTER vs soft int, etc and actually see if I should just avoid SYSCALL. I actually prefer SYSCALL to SYSENTER for its minimalism, but this NMI-SMI-NMI thing is really annoying and I wish Intel and AMD would have explicitly specified what is the safe way of handling it for all CPUs..

LtG · Post by **LtG** » Thu Jun 22, 2017 4:31 pm

Brendan wrote: NMI was a horrible thing from the start (the "iret unblocks NMI" is a disaster - why not have a flag in EFLAGS?), then became a badly documented horrible thing (I'm still not entirely sure which parts of which chipsets might trigger NMI), then became a buggy badly documented horrible thing (when SMM was introduced), then became a potentially extra buggy badly documented horrible thing (when Pentium was released with what I'd consider errata), then it hasn't improved since.

What's the Pentium thing you are referring to? In your list SMM is before it and isn't at least part of the issue we are discussing SYSCALL related, so what's the one you are referring to?

The way I see NMI it's just another priority of interrupts given that NMI's can be masked by external hardware. So the "correct" way of adding NMI's would have been to make the interrupt controller "better" from the beginning and allow proper interrupt prioritization so that instead of enabling/disabling all interrupts you could use more granularity and for example never disable the highest priority "NMI". I guess interfacing with the PIC was too slow, so they had to resort to two interrupt priority levels of which the "normal" level is further prioritized by the PICs.

One other thing to the actual NMI-SMI-NMI topic, for some instructions the CPU blocks all interrupts (IIRC SMI's included), like switching stacks. I would expect NMI to do the same, and hope that SYSCALL also blocks all interrupts for a single cycle, but can't remember ever reading about it so I guess it's at least not architectural.

OSDev.org

Clarify how x86 interrupts work

Re: Clarify how x86 interrupts work

Re: Clarify how x86 interrupts work

Re: Clarify how x86 interrupts work

Re: Clarify how x86 interrupts work

Re: Clarify how x86 interrupts work

Re: Clarify how x86 interrupts work

Re: Clarify how x86 interrupts work

Re: Clarify how x86 interrupts work

Re: Clarify how x86 interrupts work

Re: Clarify how x86 interrupts work

Re: Clarify how x86 interrupts work

Re: Clarify how x86 interrupts work

Re: Clarify how x86 interrupts work

Re: Clarify how x86 interrupts work

Re: Clarify how x86 interrupts work