Clarify how x86 interrupts work


Post by LtG »

onlyonemac wrote:
LtG wrote:While I don't have a strong opinion on the use of paging as a form of syscall, for the "magic":
- Anyone using assembly deserves it
- Anyone not commenting their code deserves it

Just in case you get the wrong impression, I'm not against assembly, but unless there's a good reason to use it, don't.

For the "normal" case, the above "page fault magic syscall" would be hidden in a C level syscall function so it shouldn't bother anyone. Just not sure if I see a benefit compared to using a SYSCALL. Antti, was there some benefit other than doing it in an unconventional way?
What about if you're looking at a disassembly in a debugger, and you see an invalid "mov" and think "that must be the problem"? Unless you know that it's a syscall, and keep this in mind whenever you're reading disassemblies, you're asking for confusion.
I agree that it will reduce readability, however I don't think it's that much of an issue.

Assuming you don't know the platform it will bite you, but then you learn and hopefully won't get bit a second time.

I do think the tools we have aren't very good and a good disassembler would annotate the code, similarly for the disassembly in the debugger, though generally speaking I don't think you should need to look at disassembly when debugging some higher level language.

Finally, if I see either of the following:

Code: Select all

mov eax, [10]
mov eax, [0xFF000000]
Why would I think that must be the cause of the issue I'm debugging? I don't think any issue you are debugging could easily be mistaken for one caused by the above examples. The above examples might look strange, at which point you investigate and find out that they are #PF syscalls, and then hopefully remember it. Confusion would only arise if you had a legitimate reason to believe the issue you are debugging is actually caused by the above two examples _and_ you didn't know about #PF syscalls.

Can you come up with some example you might be debugging where confusion would be likely?

So, I agree they do reduce readability due to bad tools but I don't think it would be that much of an issue in practice, but since there's no benefit I'm not planning on using #PF syscalls =)

Post by Brendan »

Hi,
LtG wrote:
Antti wrote:For example, what could be done with an Alignment Check (#AC) exception? It is always ring 3 only so kernel space could not trigger it. :D
I doubt you'll find anything with better performance than SYSCALL, but out of curiosity, is there some _real_ use for the #AC? For what purpose is it intended?
Some CPUs don't support misaligned accesses (e.g. MIPS CPUs); so if you invent a language that is supposed to be portable it would be tempting to say that misaligned accesses are illegal. For a language like that, #AC would indicate bugs (non-compliance with the language's specification or buggy compiler) rather than only indicating a potential performance issue.
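To make the idea concrete: a language that bans misaligned accesses is effectively asserting that every address is a multiple of the operand size, which is the condition #AC reports for ring 3 code when CR0.AM and EFLAGS.AC are both set. A minimal C sketch of that condition follows; the helper name `is_aligned` is made up for illustration and is not from any poster's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if addr is a multiple of size.
 * size must be a power of two (1, 2, 4, 8, ...). */
static bool is_aligned(uintptr_t addr, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr & (size - 1)) == 0;
}
```

A runtime for such a language could treat `!is_aligned(addr, sizeof(operand))` as a bug report rather than a performance hint, which is the role #AC would play in hardware.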
LtG wrote:Alternatively you could approach it from the other end, what needs are there, beyond SYSCALL..? For me everything is relatively simple since I'm planning a micro-kernel and purely messages (for now at least) so SYSCALL is the only one I need and AFAIK it has the best performance.
For 32-bit code (including 32-bit code running in "compatibility mode" under a 64-bit kernel) SYSCALL might not exist (but can be emulated in the "undefined opcode" exception handler, if you want to use exceptions for your system calls).

For 64-bit code, SYSCALL has "excessively difficult to solve" security/reliability concerns caused by switching to CPL=0 without changing to a "known good" kernel stack, combined with the default behaviour of interrupts not changing stack when there's no privilege level change, combined with the IST mechanism being unusable for anything where "nesting" is possible. For a worst case scenario, assume that a thread sets RSP to zero before using SYSCALL (and CPU is running kernel code at CPL=0 with an invalid stack) and before the kernel can execute a single instruction an NMI occurs. The NMI handler can be nested and therefore can't use IST and must use a normal interrupt gate where there's no privilege change so stack isn't changed, so the stack is still invalid, which means that the CPU sees a page fault when trying to start the NMI handler. Now you've got "trouble, several layers deep" in your page fault handler (or your general protection fault handler if RSP was set to "just below the non-canonical hole" instead of zero). The simplest solution is to refuse to support SYSCALL and only support SYSENTER (but sadly, SYSENTER might not exist for 64-bit code, so...).

Also note that I'm relatively skeptical about the performance of SYSCALL - if you include bloat for switching to a "known good kernel stack" (before/without touching the potentially invalid user-space stack) and bloat for restoring the caller's RSP afterwards, it's potentially slower than an old fashioned call gate (if related GDT entries and TSS.RSP0 are in cache), especially if it's a micro-kernel with very few kernel API functions where you can give each function its own call gate and avoid the "likely branch misprediction" caused by something like "call [functionTable+eax*4]".


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.

Post by LtG »

By NMI nesting you mean NMI-SMI-NMI or something else?

Assuming you're not utilizing NMI's (watchdogs, profiling, etc) is there anything you could do anyway or just graceful shut down? If just graceful shut down then why the need for nesting?

As for the issues with SYSCALL wrt NMI nesting, I assume your point is that with SYSENTER when you are in ring0 the stack is valid kernel stack and while in ring3 it doesn't matter because NMI will do a ring change and thus valid kernel stack will be loaded. With SYSCALL there's a short period when kernel stack is not loaded yet but we are already in ring0 and if that's the moment when NMI occurs then there's no valid stack?

Is there an alternative to using IST for the above scenario? If there isn't, then IST must be used..

Is the first instruction in the NMI handler guaranteed to be executed before a SMI/MCE could occur? And SMI/MCE could only occur at earliest between NMI instructions 1 and 2? If so, wouldn't a relatively simple solution be to just change the IST on the first instruction so the original return address is safe?

As for the performance, I don't have any numbers so I don't know what it will be like, but isn't the SYSCALL supposed to be faster than a regular CALL? What's the bloat needed? Switching stacks shouldn't be that slow and it happens with SYSENTER too (implicitly)..

Btw, the wiki has a bit different issue explained related to the return from SYSCALL, maybe this one should be added to the wiki as well?

Post by Korona »

NMI cannot be nested unless you iret. This problem can be solved by just not ireting until the NMI handler finishes running (e.g. by implementing an exceptional "return via retf" path in the MCE and page fault handler if that is necessary in your OS design; I suspect for most microkernels that should not even be necessary as the NMI should not ever touch non-present pages there).

MCEs never nest; the CPU resets instead (unless you clear an MCE-in-progress bit inside an MSR).

Both mechanisms can therefore utilize the IST. I guess most confusions about this "it's difficult to handle a bad stack after syscall" talk arose from a stupid and fragile implementation in previous versions of the Linux kernel.
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].

Post by Brendan »

Hi,
Korona wrote:NMI cannot be nested unless you iret. This problem can be solved by just not ireting until the NMI handler finishes running (e.g. by implementing an exceptional "return via retf" path in the MCE and page fault handler if that is necessary in your OS design; I suspect for most microkernels that should not even be necessary as the NMI should not ever touch non-present pages there).
No. The relevant part of Intel's manual is in "34.8 NMI HANDLING WHILE IN SMM":

"Although NMI requests are blocked when the processor enters SMM, they may be enabled through software by executing an IRET instruction. If the SMI handler requires the use of NMI interrupts, it should invoke a dummy interrupt service routine for the purpose of executing an IRET instruction. Once an IRET instruction is executed, NMI interrupt requests are serviced in the same “real mode” manner in which they are handled outside of SMM.

A special case can occur if an SMI handler nests inside an NMI handler and then another NMI occurs. During NMI interrupt handling, NMI interrupts are disabled, so normally NMI interrupts are serviced and completed with an IRET instruction one at a time. When the processor enters SMM while executing an NMI handler, the processor saves the SMRAM state save map but does not save the attribute to keep NMI interrupts disabled. Potentially, an NMI could be latched (while in SMM or upon exit) and serviced upon exit of SMM even though the previous NMI handler has still not completed. One or more NMIs could thus be nested inside the first NMI handler. The NMI interrupt handler should take this possibility into consideration.
"

Essentially; if your NMI handler is interrupted by an SMI you end up with the possibility of NMI nesting; even if the firmware's SMI handler does not deliberately re-enable NMI (because it relies on triggering and receiving NMI internally itself), and even if the firmware's SMI handler does not use IRET anywhere.

Note that SMI is unpredictable, undetectable and unpreventable. An OS has no choice but to assume that (under certain conditions) NMI can nest at any point during the NMI handler (including before it can execute a single instruction).

It is impossible to use IST under these conditions (without accepting the risk of the OS crashing without warning before it is able to execute a single instruction).
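The reason IST and nesting don't mix can be shown with a toy model: an IST entry always reloads RSP with the same value, so a second NMI that arrives before the first handler has saved anything writes its frame over the first one. This is only a sketch in C; the structure, sizes and function name are assumptions made for illustration, not a description of real hardware state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { STACK_QWORDS = 64 };

/* Toy model of the memory behind one IST entry. */
struct ist_stack {
    uint64_t mem[STACK_QWORDS];
};

/* Model the CPU starting a handler through an IST entry: RSP is always
 * reloaded with the same top-of-stack value, then the five-qword frame
 * (SS, RSP, RFLAGS, CS, RIP) is pushed. Returns the slot holding RIP. */
static int deliver_via_ist(struct ist_stack *s, uint64_t ss, uint64_t rsp,
                           uint64_t rflags, uint64_t cs, uint64_t rip)
{
    assert(s != NULL);
    int top = STACK_QWORDS;      /* same starting point every time */
    s->mem[--top] = ss;
    s->mem[--top] = rsp;
    s->mem[--top] = rflags;
    s->mem[--top] = cs;
    s->mem[--top] = rip;
    return top;
}
```

Calling `deliver_via_ist` twice in a row returns the same slot both times, and the RIP stored by the first call is gone after the second. That is the "NMI - SMI - NMI" case: the handler can no longer return to the code it originally interrupted.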
LtG wrote:By NMI nesting you mean NMI-SMI-NMI or something else?
Yes, NMI-SMI-NMI (or something else).
LtG wrote:Assuming you're not utilizing NMI's (watchdogs, profiling, etc) is there anything you could do anyway or just graceful shut down? If just graceful shut down then why the need for nesting?
If you look at the datasheets for different chipsets you'll find that there's multiple potential causes of NMI (and often that some events can be configured in the chipset to generate an NMI or SMI or something else, where the chipset's datasheet can't tell you what any specific motherboard's firmware happened to configure it as).

For various reasons (including a strong dislike of ACPI's AML) I have a "motherboard driver" (in user-space). The motherboard driver (during its initialisation) might tell the kernel "always shut down gracefully if NMI happens", or it might not. If it doesn't, then when an NMI occurs my kernel expects the motherboard driver to handle it (and the motherboard driver might respond with "shut down gracefully for this NMI", or it might not). However, even for "always shut down gracefully" I refuse to knowingly allow my kernel to crash for a foreseeable and preventable reason instead of successfully shutting down gracefully.

Of course different OS projects have different goals, and if someone is aware of the problem and chooses to ignore it (and accept the very small risk of crashing) then that's fine (as long as it's a conscious decision and not just because they weren't aware of the problem).
LtG wrote:As for the issues with SYSCALL wrt NMI nesting, I assume your point is that with SYSENTER when you are in ring0 the stack is valid kernel stack and while in ring3 it doesn't matter because NMI will do a ring change and thus valid kernel stack will be loaded. With SYSCALL there's a short period when kernel stack is not loaded yet but we are already in ring0 and if that's the moment when NMI occurs then there's no valid stack?
Yes.
LtG wrote:Is there an alternative to using IST for the above scenario? If there isn't, then IST must be used..
If you do use IST for NMI then it's impossible to prevent the risk of kernel crashing unexpectedly.

If you don't use IST for NMI and don't use SYSCALL either, then there's no problem.

If you don't use IST for NMI but do use SYSCALL, then it's impossible to guarantee the kernel is reliable or secure (e.g. user-space can set RSP to the address of a critical kernel data structure before doing SYSCALL, and the CPU will trash that kernel data structure if an NMI occurs before kernel switches to a "known good" kernel stack).
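The three cases above come down to which stack the CPU picks when it starts a handler. A simplified C model of that choice in long mode, assuming the handler runs in ring 0, might look like the following; all structure, field and function names here are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified CPU state; only what matters for choosing a stack. */
struct cpu_state {
    unsigned cpl;         /* current privilege level: 0 = kernel, 3 = user */
    uint64_t rsp;         /* current stack pointer */
    uint64_t tss_rsp0;    /* stack loaded on a ring 3 -> ring 0 transition */
    uint64_t tss_ist[8];  /* IST1..IST7 in slots 1..7; slot 0 unused */
};

/* Which stack a ring 0 interrupt handler starts on. */
static uint64_t interrupt_stack(const struct cpu_state *cpu, unsigned ist_index)
{
    assert(ist_index <= 7);
    if (ist_index != 0)
        return cpu->tss_ist[ist_index];  /* IST: unconditional switch */
    if (cpu->cpl != 0)
        return cpu->tss_rsp0;            /* privilege change: use RSP0 */
    return cpu->rsp;                     /* same privilege: keep current RSP */
}

/* SYSCALL enters ring 0 but leaves RSP exactly as user space set it. */
static void do_syscall(struct cpu_state *cpu)
{
    cpu->cpl = 0;
}
```

In this model, an NMI without IST that arrives right after `do_syscall` gets whatever RSP user space chose, which is the window described above. With IST the stack is always known, but every NMI then starts from the same address.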

Note: I've lost track of what Linux does to "fix" the issue; but their continued bumbling is entertaining. ;)
LtG wrote:Is the first instruction in the NMI handler guaranteed to be executed before a SMI/MCE could occur? And SMI/MCE could only occur at earliest between NMI instructions 1 and 2?
No. Because starting an interrupt handler typically takes longer than a normal instruction, it's actually more likely that an SMI/MCE (that occurred while CPU was starting an interrupt handler) will happen before the NMI handler executes its first instruction than it is for it to happen between instructions.
LtG wrote:If so, wouldn't a relatively simple solution be to just change the IST on the first instruction so the original return address is safe?
That would reduce the risk (but wouldn't eliminate all of the risk).
LtG wrote:As for the performance, I don't have any numbers so I don't know what it will be like, but isn't the SYSCALL supposed to be faster than a regular CALL? What's the bloat needed? Switching stacks shouldn't be that slow and it happens with SYSENTER too (implicitly)..
Everything new is supposed to be "faster" (even when it's not) - that's how marketing works.

In practice, for most kernel APIs you end up with a single entry point and have to use something like "call [kernel_function_table + eax*8]" to figure out which function the caller actually wanted. This is a potential cache miss (and a potential TLB miss), and an extremely likely branch misprediction. This one instruction all by itself could cost anything from 4 cycles to 300 cycles or more; and you can expect that (given that CPU has been running user-space code and that you can expect all of the kernel's data to be "least recently used") it'll probably cost about 100 cycles on average. That is where call gate wins - you can have a call gate for each function (or for each function that is used often), and inline a copy of the function directly into its call gate handler, and now you don't need a function number and don't need that expensive "call [kernel_function_table + eax*8]".
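For readers who think in C rather than assembly, the two designs can be sketched as follows. The table-based version has one entry point that validates a function number and then makes an indirect call (the C equivalent of "call [kernel_function_table + eax*8]"); the call-gate version has one entry point per function and no lookup. The two kernel functions, `FUNCTION_COUNT` and `ERR_BAD_FUNCTION` are invented purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t (*syscall_fn)(uint64_t arg);

/* Two made-up kernel API functions. */
static int64_t sys_double(uint64_t arg)    { return (int64_t)(arg * 2); }
static int64_t sys_increment(uint64_t arg) { return (int64_t)(arg + 1); }

static const syscall_fn kernel_function_table[] = { sys_double, sys_increment };

#define FUNCTION_COUNT \
    (sizeof kernel_function_table / sizeof kernel_function_table[0])
#define ERR_BAD_FUNCTION (-1)

/* Single entry point: range check, then an indirect call through the
 * table. The indirect call is the hard-to-predict branch. */
static int64_t syscall_dispatch(uint64_t number, uint64_t arg)
{
    if (number >= FUNCTION_COUNT)
        return ERR_BAD_FUNCTION;
    assert(kernel_function_table[number] != NULL);
    return kernel_function_table[number](arg);
}

/* Call-gate style: a dedicated entry point, so no number and no table. */
static int64_t gate_sys_double(uint64_t arg)
{
    return sys_double(arg);
}
```

Both paths return the same result for the same function; the difference is only whether the CPU has to load a pointer from a (probably cold) table and predict where it leads.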

If a call gate and its "retf" costs 80 cycles but avoids about 100 cycles of "call [kernel_function_table + eax*8]"; then SYSCALL needs to be faster than "negative 20 cycles" to beat it. SYSCALL is nice, but doesn't make the CPU travel through time into the past, and therefore it can't win (even if you ignore the stack switching, and the "cmp eax,MAX_FUNCTION_NUMBER" and whatever else).
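The arithmetic behind "negative 20 cycles" is just a subtraction, spelled out here in C. The figures are the rough estimates quoted in this post, not measurements, and the function name is made up.

```c
#include <assert.h>

/* Rough cycle estimates quoted in the post; not measurements. */
enum {
    CALL_GATE_ROUND_TRIP = 80,   /* call gate plus its "retf" */
    TABLE_DISPATCH       = 100   /* average cost of the indirect table call */
};

/* Cycles SYSCALL itself may cost before a single-entry-point design
 * becomes slower than one call gate per function. */
static int syscall_break_even_cycles(int call_gate_cycles, int dispatch_cycles)
{
    assert(call_gate_cycles >= 0 && dispatch_cycles >= 0);
    return call_gate_cycles - dispatch_cycles;
}
```

With the averages above the budget is negative, which is the point being made. With the best-case dispatch cost of 4 cycles the budget would be 76 cycles, so the conclusion depends heavily on how often the table lookup and branch actually miss.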
LtG wrote:Btw, the wiki has a bit different issue explained related to the return from SYSCALL, maybe this one should be added to the wiki as well?
The wiki page does mention the "syscall doesn't switch stacks" issue, but doesn't mention the "NMI can nest" issue and provides bad advice ("For 64bit mode, the kernel must use Interrupt Stack Tables to safely move NMIs/MCEs onto a properly designated kernel stack").

It also doesn't seem to mention that SYSCALL doesn't exist on old 32-bit CPUs from AMD, or that SYSENTER doesn't exist on old 32-bit CPUs from Intel (or that there's an Intel CPU that doesn't support SYSENTER but CPUID is buggy and says that SYSENTER is supported).
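A hedged sketch of how the detection could look in C is below. It takes the raw register values returned by CPUID as parameters instead of executing the instruction, so it stays portable. The function names are hypothetical, and the family/model/stepping thresholds are the ones I understand Intel to document alongside SYSENTER for the CPU with the misleading CPUID bit; verify them against the manual before relying on them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CPUID1_EDX_SEP         (1u << 11)  /* CPUID.01H:EDX bit 11 */
#define CPUID_EXT1_EDX_SYSCALL (1u << 11)  /* CPUID.80000001H:EDX bit 11 */

/* leaf1_eax and leaf1_edx are the EAX and EDX values returned by CPUID
 * leaf 1. Only the base family/model fields are decoded. */
static bool sysenter_usable(uint32_t leaf1_eax, uint32_t leaf1_edx)
{
    uint32_t stepping = leaf1_eax & 0xF;
    uint32_t model    = (leaf1_eax >> 4) & 0xF;
    uint32_t family   = (leaf1_eax >> 8) & 0xF;

    if (!(leaf1_edx & CPUID1_EDX_SEP))
        return false;
    /* Early family 6 parts set SEP although SYSENTER is not supported. */
    if (family == 6 && model < 3 && stepping < 3)
        return false;
    return true;
}

/* ext_leaf1_edx is the EDX value returned by CPUID leaf 0x80000001. */
static bool syscall_usable(uint32_t ext_leaf1_edx)
{
    return (ext_leaf1_edx & CPUID_EXT1_EDX_SYSCALL) != 0;
}
```

A kernel that wants to support both vendors' old 32-bit CPUs would check both functions at boot and fall back to a software interrupt or call gate when neither reports true.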


Cheers,

Brendan

Post by Korona »

Brendan wrote:No. The relevant part of Intel's manual is in "34.8 NMI HANDLING WHILE IN SMM":
[...]
Holy ****. I did not know that, thanks for pointing it out. But doesn't that mean that NMIs are broken beyond repair on x86? After all if you don't use an IST and there are arbitrarily nested NMIs then you cannot prevent a stack overflow due to nested NMI frames. Do chipsets actually generate new NMIs before you ack them (by writing some chipset-specific register)? If they didn't there could still be nested NMIs but not during the NMI prologue so you would be safe from this insanity by using a prologue that is carefully written to not corrupt its stack on reentry. Otherwise: Is there any architecturally safe way to handle that?

I do think that Linux handles NMIs via an IST entry. That makes me think that there are no chipsets that involve this level of stupidity because otherwise users would be complaining about NMI related lock ups. That is however a heuristic argument so I wouldn't count on it.

Post by Schol-R-LEA »

Looking at the Linux email link Brendan posted, the first sentence really sums it up:
Andy Lutomirski wrote:x86 has a woefully poorly designed NMI mechanism.
I don't know about you, but I would be hard pressed to find anything in the x86 architecture, or on the standard PC platform in general, that isn't "woefully poorly designed". However, that's sour grapes on my part (or anyone else's, at this stage).
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.

Post by Antti »

I have one idea. What if an IST entry in the TSS points to itself + 8 bytes? The stack frame would overwrite the IST entry with the value of SS and the next NMI would not overwrite the original one (that is the most important to be saved, not the nested ones).

Post by tom9876543 »

Schol-R-LEA wrote:Looking at the Linux email link Brendan posted, the first sentence really sums it up:
Andy Lutomirski wrote:x86 has a woefully poorly designed NMI mechanism.
I don't know about you, but I would be hard pressed to find anything in the x86 architecture, or on the standard PC platform in general, that isn't "woefully poorly designed". However, that's sour grapes on my part (or anyone else's, at this stage).
I agree Schol-R-LEA. Intel quickly designed the 8086 and didn't realise how popular it would be (that is forgivable). Then IBM made very quick and nasty decisions when cobbling together (I won't use the word designing) the original PC-XT.

Would anyone know how the ARM CPUs manage hardware errors? I will guess it is a much more logical, better designed implementation.

Post by Brendan »

Hi,
Schol-R-LEA wrote:Looking at the Linux email link Brendan posted, the first sentence really sums it up:
Andy Lutomirski wrote:x86 has a woefully poorly designed NMI mechanism.
When they decided to use IRET to "re-enable" NMI, it made perfect sense - there was no SMM or SYSCALL to worry about and operating systems were significantly different (and significantly simpler) than they are now. When they decided to implement SMM it made perfect sense - much cheaper (at the time) to use an existing CPU for things like power management than it was to add a whole new CPU to the system.

The problem is the interaction between 2 things that both made sense at the time they were designed; combined with some "hindsight is easier than foresight".

When AMD designed SYSCALL and decided not to switch stacks, that never made sense - it was a clear violation of the separation between kernel-space and user-space. When both Intel and AMD decided not to adhere to a common standard and refused to support each other's alternatives, that never made sense. When AMD used their design of long mode as a way to force their (inferior and broken) alternative onto everyone, that never made sense. I don't think any of these decisions were based on engineering; and I think they're caused by politics/competition.

On top of all of this, there's backward compatibility - the idea that it's better to keep things compatible instead of breaking compatibility to improve the design. The success of 80x86 (and "PC") despite its flaws (and the death of anything that tried to compete against 80x86 in the desktop and server space) is proof that "better but incompatible" is a disaster. The consequence is that OS developers need to deal with a few warts (in return for knowing their OS will still work next year). We all complain about the warts (but nobody complains that their software still works).


Cheers,

Brendan

Post by Antti »

Code: Select all

    0xFFFFFFFF_00000066 I/O Map Base Address
    0xFFFFFFFF_00000064 Reserved
    ..
    0xFFFFFFFF_00000058 IST7 (upper 32 bits)    = 0xFFFFFFFF
    0xFFFFFFFF_00000054 IST7 (lower 32 bits)    = 0x00000058
    ..
    0xFFFFFFFF_00000008 RSP0 (upper 32 bits)
    0xFFFFFFFF_00000004 RSP0 (lower 32 bits)
    0xFFFFFFFF_00000000 Reserved
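The offsets in the table above can be checked against a C declaration of the 64-bit TSS. This is a sketch: the struct and field names are my own, and the packed attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit TSS layout. Packed because the 64-bit fields sit at offsets
 * that are only 4-byte aligned. Attribute syntax assumes GCC or Clang. */
struct tss64 {
    uint32_t reserved0;
    uint64_t rsp[3];      /* RSP0..RSP2 */
    uint64_t reserved1;
    uint64_t ist[7];      /* IST1..IST7 */
    uint64_t reserved2;
    uint16_t reserved3;
    uint16_t iomap_base;  /* I/O Map Base Address */
} __attribute__((packed));

static_assert(sizeof(struct tss64) == 0x68, "64-bit TSS is 104 bytes");
static_assert(offsetof(struct tss64, iomap_base) == 0x66, "I/O map base offset");

/* Byte offset of ISTn (n = 1..7) from the start of the TSS. */
static size_t tss64_ist_offset(unsigned n)
{
    assert(n >= 1 && n <= 7);
    return offsetof(struct tss64, ist) + (n - 1) * sizeof(uint64_t);
}
```

`tss64_ist_offset(7)` gives 0x54, matching the "IST7 (lower 32 bits)" row, and shows why the entry is 4-byte but not 8-byte aligned when the TSS itself starts on an 8-byte boundary.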
First NMI (using IST7):

Code: Select all

NmiHandler:
    ; [0xFFFFFFFF_00000050] = original SS (upper dword is zero)
    ; [0xFFFFFFFF_00000048] = original RSP
    ; [0xFFFFFFFF_00000040] = original RFLAGS
    ; [0xFFFFFFFF_00000038] = original CS
    ; [0xFFFFFFFF_00000030] = original RIP

    ; RSP = 0xFFFFFFFF_00000030
    nop             ; never executed (second NMI is triggered immediately)
Second NMI (upper dword of SS cleared the IST7's lower 32 bits):

Code: Select all

NmiHandler:
    ; [0xFFFFFFFE_FFFFFFF8] = SS
    ; [0xFFFFFFFE_FFFFFFF0] = RSP
    ; [0xFFFFFFFE_FFFFFFE8] = RFLAGS
    ; [0xFFFFFFFE_FFFFFFE0] = CS
    ; [0xFFFFFFFE_FFFFFFD8] = RIP

    ; RSP = 0xFFFFFFFE_FFFFFFD8

    nop             ; nop
Have I understood something incorrectly? Does this solve the nesting NMI problem if using ISTs?

Post by Brendan »

Hi,
Antti wrote:Have I understood something incorrectly? Does this solve the nesting NMI problem if using ISTs?
In long mode, when any interrupt is started the CPU writes 40 bytes (or 48 bytes if there's an error code) and each item (including SS and CS) consumes 8 bytes. To get "SS in interrupt handler's stack frame" to overwrite IST7 it would look like:

Code: Select all

     IST7 = overwritten by SS (zero extended to 64-bits)
     IST6 = overwritten by RSP
     IST5 = overwritten by RFLAGS
     IST4 = overwritten by CS (zero extended to 64-bits)
     IST3 = overwritten by RIP
Because SS must be lower than 8192 (to be valid/within GDT), this means that for the second NMI's interrupt handler "RSP < 8192".

To guard against "two CPUs both get nested NMI at same time and both CPUs end up using RSP < 8192" you'd need to make sure that each CPU uses different values for SS, and that these values are 32 apart (e.g. one CPU uses "SS = 8", another "SS = 8 + 32", etc). That limits you to a max. of 255 CPUs. Alternatively, the area from 0x00000000 to 0x00002000 can be "per CPU", which means that every thread needs a different PML4 and the OS sets "page tables for 0x00000000 to 0x00002000" before switching virtual address space during task switches.

If you work around all of that; then there's still a (small) possibility of an "NMI - SMI - NMI - SMI - NMI" sequence where the third NMI trashes the second NMI. ;)


It's probably best to consider it a "performance vs. reliability vs. complexity" compromise. The "overwrite IST" technique is relatively complex and not 100% reliable.

Not using IST and not using SYSCALL solves all the problems, is 100% reliable and adds no complexity at all. For Intel CPUs (where you can just use SYSENTER) the performance is likely to be equal to SYSCALL and better than "SYSCALL plus manual stack switching". This means that the only disadvantage is performance on AMD CPUs (which don't support SYSENTER in long mode); which (assuming "one call gate for all kernel API functions") is likely to be small relative to the cost of all the other stuff a system call has to do.

Note that I still suspect that "different call gate for each kernel API function (or just each frequently used kernel API function)" could be better than both SYSENTER and SYSCALL (on all CPUs); and if that is true it would mean that "not using IST and not using SYSCALL" could be pure advantages with no disadvantages.


Cheers,

Brendan

Post by Antti »

The IST7 is not 8-byte aligned. The trick replaces only the lower 32-bits with zeros.
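The difference between the two readings is easy to see by modelling the TSS as a byte array. The C sketch below takes the addresses in the earlier diagram as given (SS stored at offset 0x50, IST7 at offset 0x54) and deliberately does not model anything else the CPU does while building the frame; the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

enum { TSS_SIZE = 0x68, IST7_OFFSET = 0x54 };

/* Little-endian store/load, matching x86 byte order on any host. */
static void store_le64(uint8_t *mem, unsigned off, uint64_t value)
{
    assert(off + 8 <= TSS_SIZE);
    for (unsigned i = 0; i < 8; i++)
        mem[off + i] = (uint8_t)(value >> (8 * i));
}

static uint64_t load_le64(const uint8_t *mem, unsigned off)
{
    assert(off + 8 <= TSS_SIZE);
    uint64_t value = 0;
    for (unsigned i = 0; i < 8; i++)
        value |= (uint64_t)mem[off + i] << (8 * i);
    return value;
}

/* IST7 value left behind after a zero-extended SS selector is written
 * as a qword at ss_offset inside the TSS. */
static uint64_t ist7_after_ss_write(uint64_t ist7, unsigned ss_offset,
                                    uint16_t ss)
{
    uint8_t tss[TSS_SIZE] = {0};
    store_le64(tss, IST7_OFFSET, ist7);
    store_le64(tss, ss_offset, (uint64_t)ss);
    return load_le64(tss, IST7_OFFSET);
}
```

Writing SS at 0x50 only lets its zero upper half land on the low dword of IST7, so 0xFFFFFFFF_00000058 becomes 0xFFFFFFFF_00000000. Writing it at 0x54 (the 8-byte-aligned reading) would replace the whole entry with the selector value instead.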

Post by Antti »

In short, it is possible to execute instructions and the original stack frame is safe.

Code: Select all

    [0x????????_??????50] = original SS (upper dword is zero)
    [0x????????_??????48] = original RSP
    [0x????????_??????40] = original RFLAGS
    [0x????????_??????38] = original CS
    [0x????????_??????30] = original RIP
The "scratch frame" is

Code: Select all

    [0x????????_FFFFFFF8] = SS          (beginning of NmiHandler)
    [0x????????_FFFFFFF0] = RSP         (beginning of NmiHandler)
    [0x????????_FFFFFFE8] = RFLAGS      (beginning of NmiHandler)
    [0x????????_FFFFFFE0] = CS          (beginning of NmiHandler)
    [0x????????_FFFFFFD8] = RIP         (beginning of NmiHandler)
Your NMI handler may disable the whole IST mechanism and it may loop ("infinite" number of NMIs) before it succeeds in doing that. You could lose "iret" frames but these point to the beginning of the NMI handler (before the IST mechanism is disabled) and are not important.

Post by Brendan »

Hi,
Antti wrote:The IST7 is not 8-byte aligned. The trick replaces only the lower 32-bits with zeros.
Ah - that would solve one of the problems (and trash IST2, leaving IST1 as the only thing left that can be used by both double fault and machine check).

There's another problem I overlooked earlier - you have to restore the IST somewhere, otherwise one NMI modifies the IST, and then 10 days later "NMI - SMI - NMI" breaks. You can't restore the IST atomically. For example, for "mov [IST7low],value; iret" a second NMI can occur between the "mov" and the "iret" and trash the stack that's being used by the first NMI handler.

To fix that you'd need to make the NMI handler switch to another stack and copy its return info onto that stack, then restore the IST. Possibly something like:

Code: Select all

NMI_handler:

;Subsequent NMI's will use the "alternative NMI stack" (even if this NMI is currently using it)

    cmp rsp,defaultISTstack          ;Is this nested?
    je .notNested                    ; no

    ;NMI was nested

    add word [NMIstackListTop],1024  ;WARNING: "16-bit wrapping" used to implement a ring buffer
                                     ;         (of 64 stacks) and avoid the need to restore after
    mov rsp,[NMIstackListTop]
    push qword [altNMIstack+32]      ;Copy the full 5-qword return frame: SS
    push qword [altNMIstack+24]      ; RSP
    push qword [altNMIstack+16]      ; RFLAGS
    push qword [altNMIstack+8]       ; CS
    push qword [altNMIstack]         ; RIP
    jmp .gotStack

    ;NMI was not nested

.notNested:
    add word [NMIstackListTop],1024  ;WARNING: "16-bit wrapping" used to implement a ring buffer
                                     ;         (of 64 stacks) and avoid the need to restore after
    mov rsp,[NMIstackListTop]
    push qword [defaultNMIstack+32]  ;Copy the full 5-qword return frame: SS
    push qword [defaultNMIstack+24]  ; RSP
    push qword [defaultNMIstack+16]  ; RFLAGS
    push qword [defaultNMIstack+8]   ; CS
    push qword [defaultNMIstack]     ; RIP

.gotStack:
    mov dword [IST7low],defaultISTstack    ;Subsequent NMI's will use the "default NMI stack" after this

    ...

    iretq
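The "16-bit wrapping" trick in the listing relies on unsigned overflow: adding 1024 to a 16-bit value returns to the starting point after 65536 / 1024 = 64 steps, so the list of stacks behaves as a ring without any explicit reset. A small C model of just that arithmetic is below. It assumes the stack area is 64 KiB aligned so that only the low 16 bits of the pointer ever change, and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

enum { NMI_STACK_SIZE = 1024 };

/* Advance the low 16 bits of the stack pointer by one stack. The cast
 * back to uint16_t performs the same wrap as "add word [...],1024". */
static uint16_t next_nmi_stack_offset(uint16_t offset)
{
    return (uint16_t)(offset + NMI_STACK_SIZE);
}

/* Number of steps before the offset returns to where it started. */
static unsigned steps_until_wrap(uint16_t start)
{
    unsigned steps = 0;
    uint16_t offset = start;
    do {
        offset = next_nmi_stack_offset(offset);
        steps++;
    } while (offset != start);
    return steps;
}
```

The design choice is that nothing ever has to be undone on the way out of the handler: each NMI simply moves on to the next stack, and a stack is only reused after 63 other NMIs have been started.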


Cheers,

Brendan