Brendan wrote: Using IST is broken (or at least impractical to work-around properly) for the "NMI-SMI-NMI" case. Not using IST is not broken for the "NMI-SMI-NMI" case, and as an extra added bonus you can also get a free "triple fault when no progress is possible due to NMI-SMI-NMI-SMI-NMI-SMI-NMI.... storm" advantage.
Korona wrote: As I said, I'm not sure if this can actually happen (my reading of the SDM is that it cannot for sane firmware that doesn't do SMI-IRET to enable NMI-NMI-IRET from NMI-RSM) and even if it can it is astronomically rare and the IST solution can just panic in this case.
If you don't think "NMI-SMI-NMI" can happen, how can you think that NMI-SMI-NMI-SMI-NMI-SMI-NMI.... can happen?
Note: Sane firmware is like unicorns - it's impossible to prove that they don't exist, and sometimes you'll see something that looks like it could be a unicorn out the corner of your eye (until you realise it's just a horse).
Brendan wrote: If the machine check exception handler is capable of recovery in some cases; IST is completely broken because it's impossible to avoid "second MCE trashes first MCE's stack after first MCE cleared MCIP but before first MCE did IRET". Also, for this case you do want to clear MCIP as soon as possible (after pulling information out of MSRs and storing it somewhere safe, and before you bother processing any of it) to reduce the risk of triple fault ruining your ability to recover.
Korona wrote: Just self-IPI and clear MCIP inside the IPI.
So, like:
Code:
MCE_handler:
    ...
    send_IPI_to_self();   // Work-around for the fact that I shouldn't have used IST but did
    ...
    iretq                 // IST only exists in long mode, so the return is iretq

actual_MCE_handler:
    // We're good now because this interrupt (started by "self IPI") doesn't use IST
    ...
    clearMCIP();
    ...
    iretq
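For comparison, the ordering described in the quoted text above (pull the information out of the MSRs, store it somewhere safe, then clear MCIP before processing anything) might look roughly like the C sketch below. The MSR numbers and bit positions are the architectural ones from the machine-check chapter of the SDM; everything else (the `fake_msr` array standing in for real `rdmsr`/`wrmsr`, the `mce_snapshot` structure, the function names, the 32-bank limit) is a hypothetical assumption made only so the sketch is self-contained. It says nothing about which stack the handler runs on, which is the actual point of disagreement.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural MSR numbers and bits (machine-check architecture). */
#define IA32_MCG_CAP      0x179u
#define IA32_MCG_STATUS   0x17Au
#define IA32_MC0_STATUS   0x401u          /* bank i: 0x401 + 4*i */
#define IA32_MC0_ADDR     0x402u          /* bank i: 0x402 + 4*i */
#define IA32_MC0_MISC     0x403u          /* bank i: 0x403 + 4*i */
#define MCG_STATUS_MCIP   (1ull << 2)
#define MCI_STATUS_VAL    (1ull << 63)
#define MCI_STATUS_MISCV  (1ull << 59)
#define MCI_STATUS_ADDRV  (1ull << 58)

#define MAX_BANKS 32u                     /* assumption: arbitrary cap */
#define FAKE_MSR_COUNT 0x500u

/* HYPOTHETICAL stand-in for the real rdmsr/wrmsr instructions, so the
 * sketch can be compiled and exercised outside a kernel. */
static uint64_t fake_msr[FAKE_MSR_COUNT];

static uint64_t rdmsr(uint32_t msr)
{
    assert(msr < FAKE_MSR_COUNT);
    return fake_msr[msr];
}

static void wrmsr(uint32_t msr, uint64_t value)
{
    assert(msr < FAKE_MSR_COUNT);
    fake_msr[msr] = value;
}

struct mce_bank_record { uint64_t status, addr, misc; };

struct mce_snapshot {
    uint64_t mcg_status;
    unsigned bank_count;
    struct mce_bank_record bank[MAX_BANKS];
};

/* Copy everything out of the MSRs first, then clear MCIP, and only
 * afterwards (elsewhere) start interpreting the snapshot. */
static void mce_snapshot_and_clear(struct mce_snapshot *out)
{
    out->mcg_status = rdmsr(IA32_MCG_STATUS);
    out->bank_count = (unsigned)(rdmsr(IA32_MCG_CAP) & 0xFF);
    if (out->bank_count > MAX_BANKS)
        out->bank_count = MAX_BANKS;

    for (unsigned i = 0; i < out->bank_count; i++) {
        uint64_t status = rdmsr(IA32_MC0_STATUS + 4 * i);
        struct mce_bank_record *rec = &out->bank[i];

        rec->status = status;
        rec->addr = 0;
        rec->misc = 0;
        if (!(status & MCI_STATUS_VAL))
            continue;                     /* nothing logged in this bank */
        if (status & MCI_STATUS_ADDRV)
            rec->addr = rdmsr(IA32_MC0_ADDR + 4 * i);
        if (status & MCI_STATUS_MISCV)
            rec->misc = rdmsr(IA32_MC0_MISC + 4 * i);
        wrmsr(IA32_MC0_STATUS + 4 * i, 0); /* release the bank */
    }

    /* From here on a second MCE raises a new exception instead of
     * shutting the CPU down. */
    wrmsr(IA32_MCG_STATUS, out->mcg_status & ~MCG_STATUS_MCIP);
}
```

A real handler would call something like `mce_snapshot_and_clear()` as its first step and defer all decoding of the snapshot until after MCIP is clear.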
Korona wrote: Besides, if there is an MCE before the IRET your kernel memory is doomed anyways.
Are you saying that if one measly little bit of memory that's only used by one CPU's stack and nothing else has failed, then instead of just using different memory and terminating the process, and instead of just taking that one CPU offline, I should start roaming the streets screaming "The sky is falling!" at people?
Brendan wrote: How about; send an IPI to other CPUs (to tell them you're doing an "emergency soft-offline"), then send "INIT IPI" to yourself (to reset CPU and put it into a "wait-for-SIPI" state); then keep the OS running (including cleaning up any mess left behind from "emergency soft-offline") using all the remaining CPUs?
Korona wrote: That won't work. If your MCE happens in kernel space (and all nested MCE that we're talking about here happen in kernel space) those other CPUs will just MCE too. The main purpose of MCE is reporting broken RAM. The fix is to disable that RAM. You cannot reliably disable kernel RAM. Yes, MCE can also report SERR and similar errors but those are even more critical and you cannot recover from them.
Nonsense.
If the broken RAM is in a "per NUMA domain" area of kernel space (which contains kernel code, etc) then I can take that NUMA domain offline (or maybe just replace the affected physical page/s from identical copies used by other NUMA domains in some cases). If the broken RAM is in a "per CPU" area (including a CPU's kernel stack) I can just take that CPU offline. If the broken RAM is in the message buffer area I can terminate the process that owns the message buffer. If the broken RAM was used for thread data structures or process data structures I can terminate a process. If the broken RAM was used for kernel's event log, then that can be discarded and replaced with new physical pages. If the broken RAM is in a "free physical page" stack or bitmap that'd be the easiest possible case! If ...
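As a rough illustration of that "which area was hit decides what gets sacrificed" policy, here is a minimal C sketch. All of the names (`region_kind`, `recovery_action`, `kernel_region`, `classify_and_pick_action`), the idea of a flat region table, and the "drop the page" action for the free-page case are hypothetical assumptions for illustration, not a description of the actual kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Kinds of kernel-space area named in the paragraph above. */
enum region_kind {
    REGION_PER_NUMA_DOMAIN,     /* kernel code etc, one copy per domain */
    REGION_PER_CPU,             /* includes a CPU's kernel stack */
    REGION_MESSAGE_BUFFER,
    REGION_TASK_STRUCTS,        /* thread/process data structures */
    REGION_EVENT_LOG,
    REGION_FREE_PAGE_TRACKING   /* free page stack or bitmap */
};

enum recovery_action {
    ACTION_OFFLINE_NUMA_DOMAIN, /* or re-copy page from another domain */
    ACTION_OFFLINE_CPU,
    ACTION_TERMINATE_PROCESS,
    ACTION_DISCARD_AND_REPLACE,
    ACTION_DROP_PAGE,           /* assumption: stop tracking the bad page */
    ACTION_UNRECOVERABLE        /* address not covered by any known case */
};

/* HYPOTHETICAL description of one kernel-space area: [start, end). */
struct kernel_region {
    uint64_t start;
    uint64_t end;
    enum region_kind kind;
};

static enum recovery_action action_for(enum region_kind kind)
{
    switch (kind) {
    case REGION_PER_NUMA_DOMAIN:    return ACTION_OFFLINE_NUMA_DOMAIN;
    case REGION_PER_CPU:            return ACTION_OFFLINE_CPU;
    case REGION_MESSAGE_BUFFER:     return ACTION_TERMINATE_PROCESS;
    case REGION_TASK_STRUCTS:       return ACTION_TERMINATE_PROCESS;
    case REGION_EVENT_LOG:          return ACTION_DISCARD_AND_REPLACE;
    case REGION_FREE_PAGE_TRACKING: return ACTION_DROP_PAGE;
    }
    return ACTION_UNRECOVERABLE;
}

/* Find the area containing the failing address and pick the action. */
static enum recovery_action
classify_and_pick_action(const struct kernel_region *map, size_t count,
                         uint64_t bad_addr)
{
    for (size_t i = 0; i < count; i++) {
        if (bad_addr >= map[i].start && bad_addr < map[i].end)
            return action_for(map[i].kind);
    }
    return ACTION_UNRECOVERABLE;
}
```

The only point of the table is that very few addresses should fall through to `ACTION_UNRECOVERABLE`.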
It's a micro-kernel, there isn't much else in kernel space.
Maybe only 1% of all memory is used for kernel space and maybe there's 1% of memory used in kernel space where broken RAM can't be recovered from; so maybe there's a 0.01% chance that an MCE caused by broken RAM is going to cause downtime because kernel couldn't recover.
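Spelled out, and assuming (as the "maybe" figures above do) that a failing bit is equally likely to be anywhere in RAM and that the two 1% guesses simply multiply:

```latex
P(\text{downtime} \mid \text{MCE from broken RAM})
  \approx \underbrace{0.01}_{\text{RAM used by kernel space}}
  \times \underbrace{0.01}_{\text{of that, unrecoverable}}
  = 10^{-4} = 0.01\%
```

Both inputs are rough guesses, so the result is an order-of-magnitude figure at best.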
Brendan wrote: No, I don't need to accept that. What I do need to accept is that for a "peer to peer distributed" OS like mine, the failure of any one computer can affect many computers; and minimising the risk of failures as much as possible, and recovering from failures as much as possible; is a necessity.
Korona wrote: If you're writing your OS with a distributed environment in mind you should be even better off: Failure of a single machine should easily be recovered from. This recovery cannot (in case of a kernel space MCE) save a single machine but should instead rely on the other machines replicating the failing machine's tasks.
It's not the tasks that matter; it's the lost data. 20 people using 20 applications on 10 computers, where each application is spread across 5 out of 10 computers - one computer dies and up to half of the users could lose data.
Yes; there are many ways to reduce/limit/mitigate that (redundant services, saving recovery information to disk in the background, etc); but preventing a computer from losing data just because someone burped is an important part of that.
Cheers,
Brendan