More NMI discussion

Cognition · Post by **Cognition** » Sun May 06, 2012 8:21 pm

Branching off from the discussion that was going in the syscall thread.

There's a fairly good writeup of some of the issues NMIs can cause here. There's also a writeup of how Linux decided to nest its NMIs. It seems like some other potential issues have popped up about how to handle NMIs safely on the x86-64 architecture.

Using SYSCALL/SYSRET can leave the kernel running on an unverified stack, which causes problems if an NMI (possibly an MCE too?) occurs before the syscall handler can swap in a valid stack. This requires the use of an IST for NMIs and MCEs.
SMIs can trigger during NMIs and possibly issue an IRET, disabling the processor's internal NMI masking logic.
MCEs might also be able to trigger in a fashion similar to the SMI and disable the NMI masking logic.
With the processor's internal NMI masking disabled, you could get a sequence of NMI-SMI-NMI early enough in the handler that the return values for the initial NMI will be overwritten before they can be reliably saved or mirrored elsewhere.
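
For reference, here is a rough C sketch of what "use an IST for NMIs and MCEs" means structurally. The struct layouts follow the 64-bit TSS and long-mode IDT gate formats from the Intel and AMD manuals. The slot assignment (IST1 for NMI, IST2 for MCE) and the helper names `tss_set_ist` and `make_gate` are made up for illustration and not taken from any kernel mentioned in this thread.

```c
#include <stdint.h>

/* 64-bit TSS (104 bytes). */
struct tss64 {
    uint32_t reserved0;
    uint64_t rsp[3];      /* RSP0..RSP2: stacks loaded on a privilege change */
    uint64_t reserved1;
    uint64_t ist[7];      /* IST1..IST7: stacks an IDT gate can force a switch to */
    uint64_t reserved2;
    uint16_t reserved3;
    uint16_t iomap_base;
} __attribute__((packed));

/* 16-byte long-mode IDT gate descriptor. */
struct idt_gate64 {
    uint16_t offset_low;
    uint16_t selector;
    uint8_t  ist;         /* bits 0..2: IST slot, 0 means "no IST switch" */
    uint8_t  type_attr;   /* 0x8E: present, DPL 0, 64-bit interrupt gate */
    uint16_t offset_mid;
    uint32_t offset_high;
    uint32_t reserved;
} __attribute__((packed));

_Static_assert(sizeof(struct tss64) == 104, "TSS must be 104 bytes");
_Static_assert(sizeof(struct idt_gate64) == 16, "IDT gate must be 16 bytes");

/* Hypothetical slot assignment, chosen only for this sketch. */
enum { IST_NMI = 1, IST_MCE = 2 };

/* Store the top-of-stack address for an IST slot (slots are numbered from 1). */
static void tss_set_ist(struct tss64 *tss, int slot, uint64_t stack_top)
{
    tss->ist[slot - 1] = stack_top;
}

/* Build an interrupt gate that always enters the handler on the given IST stack. */
static struct idt_gate64 make_gate(uint64_t handler, uint16_t cs, uint8_t ist_slot)
{
    struct idt_gate64 g = {0};
    g.offset_low  = (uint16_t)(handler & 0xFFFF);
    g.selector    = cs;
    g.ist         = (uint8_t)(ist_slot & 0x7);
    g.type_attr   = 0x8E;
    g.offset_mid  = (uint16_t)((handler >> 16) & 0xFFFF);
    g.offset_high = (uint32_t)(handler >> 32);
    return g;
}
```

Note that the CPU reloads the same IST stack top on every entry through the gate. That protects against the bad-RSP case after SYSCALL, but it is also why a second NMI arriving too early lands on top of the first NMI's return frame.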

As Brendan pointed out in the other thread, this would seem to indicate that the only bulletproof solution would be to use a single kernel stack and avoid using the SYSCALL instruction. Naturally this might have performance implications, as SYSCALL tends to be the only "fast system call" instruction that works reliably in long mode on both Intel and AMD processors that support the feature. On top of that, it seems that Linux, and from what I've seen FreeBSD as well, don't even concern themselves with some of these scenarios occurring. NMI nesting is usually enabled in these kernels at their own convenience, after they've copied any context data over to another good stack or some other save area. If past precedent is used as a standard, it would seem that some of these conditions are, for whatever reason, considered not worth trying to deal with.

Overall I get the feeling that this kind of begs the question: does this come down to a design decision? And likewise, is there some other chipset logic that perhaps guarantees a certain grace period to handle NMI nesting in a sane fashion?

FreeBSD handler is here.

I know this issue has come up a few times as of late, but it seems to be a big gray area if there ever was one: absolute fault tolerance might indicate one pattern of NMI handling while performance considerations would dictate another.

Brendan · Post by **Brendan** » Mon May 07, 2012 12:39 am

Hi,

Cognition wrote: Overall I get the feeling that this kind of begs the question: does this come down to a design decision?

In my opinion there are only 2 factors:

How much risk is acceptable?
How much risk is there?

How much risk is acceptable depends on what the OS is intended for. For an OS intended for something like a web kiosk or a thin client, if the OS crashes then the computer will reset (or a user will reset it), and because there wasn't any data or anything anyway it could just be a minor inconvenience. For a mission critical system being used for a stock exchange or air-traffic control at a major airport, you're less likely to want failures.

For how much risk there is, it's hard to estimate. For the way Linux and FreeBSD are handling nested NMI, there's a period of time starting from when the first NMI occurs until the NMI handler is able to handle a second NMI, where an SMI could unmask NMIs and a second NMI (which may have occurred before the SMI unmasked NMIs) could be received. The risk would depend on how often NMIs occur and how often SMIs occur - e.g. using NMI for a watchdog timer or profiling would increase the risk. The risk also increases with the number of CPUs involved - if there are 100 CPUs then the chance of an "NMI, SMI, NMI" sequence causing a problem is 100 times higher.

The other thing is testing. I have no idea how you'd test the OS to determine the actual chance of failures (on various computers, under various conditions).

If you want very high reliability (e.g. the sort of OS where you can have a cluster of 1000 computers running for several years without any crashes at all) then eliminating the chance of NMI related crashes is going to be easier than worrying about whether the risk is acceptable or not (and never really knowing).

Cheers,

Brendan

bluemoon · Post by **bluemoon** » Mon May 07, 2012 4:11 am

Cognition wrote:Naturally this might have performance implications as SYSCALL tends to be the only "fast system call".

Now I begin to doubt this, and it turns out SYSENTER or INT may be just as fast for the whole work required for a kernel interface. I.e. the full package of SYSCALL + stack switch + messing with thread storage / TSS entry / nesting stuff may be just as slow as the traditional way.

Cognition · Post by **Cognition** » Mon May 07, 2012 5:03 am

@Brendan:

Yeah, it just seems the risk is extremely hard to quantify for this particular issue. Linux and the BSDs are also generally known for pretty good reliability and uptime, so if they aren't worrying about it I have to wonder if it's applicable to the vast majority of applications out there. In general, if my software isn't keeping a plane in the air or someone's heart beating, perhaps software nesting would be 'good enough' under almost all other conditions.

@bluemoon:

What's necessary for a SYSCALL interface on top of the basic requirements of kernel entry in general should be pretty minimal. Really it's just the stack pointer swapping, which can be done in 3 move instructions. CPU-specific thread storage speeds things up a hell of a lot in the kernel versus actually having to look up and pass CPU-specific structures as a parameter at every turn, and is probably something you'd end up doing even through a software interrupt interface. Stuff generally shouldn't be poking around in the TSS either, the exception perhaps being some of the software nesting, but that's specifically done for NMIs.

Brendan · Post by **Brendan** » Mon May 07, 2012 4:24 pm

Hi,

Cognition wrote: Yeah, it just seems the risk is extremely hard to quantify for this particular issue. Linux and the BSDs are also generally known for pretty good reliability and uptime, so if they aren't worrying about it I have to wonder if it's applicable to the vast majority of applications out there. In general, if my software isn't keeping a plane in the air or someone's heart beating, perhaps software nesting would be 'good enough' under almost all other conditions.

Linux and FreeBSD are known for pretty good reliability and uptime. If you want "better than pretty good", there are a lot fewer options and you may need to move to something like NonStop OS.

I tried to estimate the chance of "NMI then SMI then NMI" causing problems yesterday (for the "nested NMI with IST" method that Linux and FreeBSD are using). I was tired and I suspect my calculation was very dodgy so I didn't post it. However, for a pool of about 200 computers each with 16 CPUs, if each CPU is having an average of 1 NMI per second for watchdog, 100 NMIs per second for profiling and 10 SMIs per second for who knows what; then you might expect a CPU on one of the computers to crash each year due to the second NMI occurring before the NMI handler is able to handle a second NMI. Of course this dodgy calculation assumed that SMI does do an IRET, and in practice you wouldn't be using NMI for profiling on a large pool of computers.
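
The calculation itself was not posted, so the following is only a crude back-of-envelope model in C and not a reconstruction of it. It assumes NMIs and SMIs arrive independently at fixed average rates, and that a crash needs both an SMI and a second NMI to land inside a short vulnerable window after the first NMI. The window length `window_s` and the function name `failures_per_year` are assumptions of this sketch.

```c
/*
 * Expected "NMI, SMI, NMI" failures per year across a pool of machines.
 * For a small window w, the chance of an SMI inside it is about smi_hz * w
 * and the chance of a second NMI inside it is about nmi_hz * w.
 */
static double failures_per_year(double machines, double cpus_per_machine,
                                double nmi_hz, double smi_hz, double window_s)
{
    const double seconds_per_year = 365.0 * 24.0 * 3600.0;
    double p_smi_in_window = smi_hz * window_s;
    double p_nmi_in_window = nmi_hz * window_s;
    /* Each first NMI opens one window; the rate is per CPU per second. */
    double per_cpu_hz = nmi_hz * p_smi_in_window * p_nmi_in_window;
    return machines * cpus_per_machine * per_cpu_hz * seconds_per_year;
}
```

Under this model the result scales linearly with the number of CPUs and with the square of the window length, so the answer is dominated by how long the handler stays vulnerable. With the rates quoted above (101 NMIs and 10 SMIs per second per CPU, 200 machines with 16 CPUs each), an assumed window of about 10 ns gives roughly one failure per year. The window value is picked for illustration and is not a measurement.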

bluemoon wrote:
Cognition wrote:Naturally this might have performance implications as SYSCALL tends to be the only "fast system call".
Now I begin to doubt this, and it turns out SYSENTER or INT may be just as fast for the whole work required for a kernel interface. I.e. the full package of SYSCALL + stack switch + messing with thread storage / TSS entry / nesting stuff may be just as slow as the traditional way.

In my opinion, the best possible kernel interface is SYSENTER - it has all the same advantages as SYSCALL (no GDT lookups or protection checks) but unlike SYSCALL it does guarantee that RSP is set to a safe value and avoids the "running at CPL=0 with a dodgy stack" problem. Unfortunately modern AMD CPUs don't support SYSENTER in long mode (even though they do support SYSENTER in protected mode).

Also note that the "CPL=3 to CPL=0 to CPL=3" switching overhead (regardless of how it's done) could be amortized. A thread could create a list of functions it wants the kernel to do then ask the kernel to do all the functions in the list. In that way you only pay the "CPL=3 to CPL=0 to CPL=3" switching overhead once for each group of kernel functions (rather than once for each kernel function).
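
A minimal user-space simulation of that batching idea, in C. Everything here is hypothetical: the `batch_entry` layout, the two toy kernel functions, and `kernel_run_batch` standing in for the single kernel entry point. The counter `mode_switches` only models the "CPL=3 to CPL=0 to CPL=3" round trip; no real privilege switch happens.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel function numbers. */
enum { FN_ADD = 0, FN_NEGATE = 1, FN_COUNT };

/* One request in the list a thread hands to the kernel. */
struct batch_entry {
    uint32_t fn;        /* which kernel function to run */
    int64_t  arg0;
    int64_t  arg1;
    int64_t  result;    /* filled in by the kernel */
    int32_t  status;    /* 0 = ok, -1 = unknown function */
};

static int64_t k_add(int64_t a, int64_t b)    { return a + b; }
static int64_t k_negate(int64_t a, int64_t b) { (void)b; return -a; }

typedef int64_t (*kfn_t)(int64_t, int64_t);
static const kfn_t kfn_table[FN_COUNT] = { k_add, k_negate };

/* Counts simulated user -> kernel -> user round trips. */
static unsigned long mode_switches;

/* Run every entry in the list during one simulated kernel entry.
 * Returns the number of entries that completed successfully. */
static size_t kernel_run_batch(struct batch_entry *list, size_t n)
{
    size_t ok = 0;
    mode_switches++;                  /* the switching cost is paid once here */
    for (size_t i = 0; i < n; i++) {
        if (list[i].fn >= FN_COUNT) { /* validate before dispatching */
            list[i].status = -1;
            continue;
        }
        list[i].result = kfn_table[list[i].fn](list[i].arg0, list[i].arg1);
        list[i].status = 0;
        ok++;
    }
    return ok;
}
```

A caller fills an array of `batch_entry`, calls `kernel_run_batch` once, then reads `result` and `status` per entry. A real kernel would also have to copy and validate the list from user memory, which this sketch leaves out.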

Cheers,

Brendan

Cognition · Post by **Cognition** » Tue May 08, 2012 11:54 am

Yeah, after reading Brendan's responses here and some discussion on IRC, I'm leaning towards defaulting to SYSENTER on systems that support it and having SYSCALL as the default on systems that don't, with the option to run through interrupts only via the kernel's argument string. That should cover all bases pretty well and require minimal in-kernel logic to implement.
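
That selection policy could look something like the C sketch below. The capability flags are passed in as plain booleans, since how they get detected is outside the scope of this post, and the `entry=int` argument string is a made-up option name, not one from any existing kernel.

```c
#include <stdbool.h>
#include <string.h>

enum entry_method { ENTRY_INT, ENTRY_SYSCALL, ENTRY_SYSENTER };

/*
 * Pick the system call entry mechanism.
 * sysenter_ok: SYSENTER is usable in the mode the kernel runs in.
 * syscall_ok:  SYSCALL is usable.
 * cmdline:     kernel argument string, may be NULL.
 */
static enum entry_method choose_entry(bool sysenter_ok, bool syscall_ok,
                                      const char *cmdline)
{
    /* Hypothetical override: force interrupt-only entry. */
    if (cmdline != NULL && strstr(cmdline, "entry=int") != NULL)
        return ENTRY_INT;
    if (sysenter_ok)
        return ENTRY_SYSENTER;   /* preferred: RSP is set to a safe value */
    if (syscall_ok)
        return ENTRY_SYSCALL;    /* fallback where SYSENTER is unavailable */
    return ENTRY_INT;            /* last resort: software interrupt */
}
```

The substring match is deliberately naive; a real argument parser would tokenize the string first.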

bluemoon · Post by **bluemoon** » Tue May 08, 2012 12:03 pm

Brendan wrote: Also note that the "CPL=3 to CPL=0 to CPL=3" switching overhead (regardless of how it's done) could be amortized. A thread could create a list of functions it wants the kernel to do then ask the kernel to do all the functions in the list. In that way you only pay the "CPL=3 to CPL=0 to CPL=3" switching overhead once for each group of kernel functions (rather than once for each kernel function).

This is an interesting idea. Traditionally, kernel calls tend to become bloated from doing a lot of work in one call; with Brendan's way the kernel API can be broken down into smaller pieces of functionality, letting the application (or a library framework) compile and streamline many syscalls at once, which IMO works somewhat like sending commands to OpenGL and then committing them.
This sure opens up new possibilities.

OSDev.org
