Issues moving to the "one kernel stack per CPU" approach

Brendan · Post by **Brendan** » Fri Apr 06, 2012 9:31 pm

Hi,

cyr1x wrote:
gerryg400 wrote: Is it possible to return from an NMI without enabling further NMIs ?
Yes, just don't use IRET, use RETF, etc...

As far as I can tell; after the CPU has started the NMI handler but before the CPU has had a chance to execute the NMI handler's very first instruction, a machine check exception or an SMI could occur and do IRET; and immediately after the IRET (or after a return from SMM) a second NMI can occur before you've managed to execute one instruction for the first NMI.

The only "bullet-proof" way to handle NMI is to cope with nested NMI. This implies that using a task gate (in protected mode) or using the "IST" (in long mode) for the NMI handler is a bad idea (as they can't handle nesting).
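One way "coping with nested NMI" could look is sketched below as a user-space C simulation. All names (`nmi_state`, `nmi_entry`, `handle_one_nmi`, `nested_nmi`) are hypothetical and not from this thread. The idea is that a nested entry only records that another NMI arrived, and the outermost invocation keeps looping until nothing is pending. It assumes each nested entry gets fresh stack space, which is exactly what a task gate or a fixed IST slot does not give you.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical per-CPU NMI bookkeeping (names are illustrative only). */
struct nmi_state {
    atomic_int pending;   /* NMIs recorded but not yet fully processed */
    int handled;          /* how many NMIs the real work ran for       */
    int depth;            /* current nesting depth (demo only)         */
    int max_depth;        /* deepest nesting seen (demo only)          */
};

/* Stand-in for the real work (reading chipset status, logging, ...). */
static void handle_one_nmi(struct nmi_state *s) { s->handled++; }

/* Called on every NMI entry, including entries that interrupt an NMI
 * handler already running on this CPU. "inject" lets the demo simulate
 * a second NMI arriving in the middle of the first one. */
void nmi_entry(struct nmi_state *s, void (*inject)(struct nmi_state *))
{
    s->depth++;
    if (s->depth > s->max_depth)
        s->max_depth = s->depth;

    /* An earlier invocation is still active: record the event and leave. */
    if (atomic_fetch_add(&s->pending, 1) > 0) {
        s->depth--;
        return;
    }

    /* Outermost invocation: drain everything recorded so far. */
    do {
        if (inject) {
            void (*once)(struct nmi_state *) = inject;
            inject = 0;
            once(s);              /* simulated nested NMI */
        }
        handle_one_nmi(s);
    } while (atomic_fetch_sub(&s->pending, 1) > 1);

    s->depth--;
}

/* Demo helper: a nested NMI that arrives while the first is in progress. */
void nested_nmi(struct nmi_state *s) { nmi_entry(s, 0); }
```

Calling `nmi_entry(&s, nested_nmi)` on a zeroed `struct nmi_state` runs the real work twice, once for each NMI, even though the nested entry returned immediately.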

turdus wrote:In long mode you can skip the CPL check, as CPU can switch stacks with consistent data regardless to CPL change.

For "one kernel stack per CPU" you can't store a thread's state on the thread's kernel stack (it'd be a massive nightmare). The easiest/best way to get around that is to store the thread's state in a "thread control block" (or TCB) when you enter the kernel, and then restore a thread's state (not necessarily the same thread) when you leave the kernel. Interrupt handlers have to check the interrupted code's CPL to determine if they have to store the interrupted thread's state.
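The CPL check described above can be sketched in C as follows. The struct layouts and names (`int_frame`, `tcb`, `kernel_entry_save`) are assumptions made for illustration, using a 32-bit frame; a real handler would do this in assembly before touching any registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of what the CPU pushed (32-bit protected mode). */
struct int_frame {
    uint32_t eip, cs, eflags;
    uint32_t user_esp, user_ss;   /* only pushed when the CPL changed */
};

/* Hypothetical thread control block holding the saved thread state. */
struct tcb {
    uint32_t eip, eflags, esp;
    bool state_saved;
};

/* The low two bits of the saved CS are the interrupted code's CPL. */
static bool interrupted_user_mode(const struct int_frame *f)
{
    return (f->cs & 3u) != 0;
}

/* On kernel entry: copy the thread's state into its TCB only when the
 * interrupt arrived from user code. Returns true if state was saved. */
bool kernel_entry_save(const struct int_frame *f, struct tcb *current)
{
    if (!interrupted_user_mode(f))
        return false;   /* interrupted the kernel: leave the frame on the per-CPU stack */

    current->eip = f->eip;
    current->eflags = f->eflags;
    current->esp = f->user_esp;
    current->state_saved = true;
    return true;
}
```

With a user-mode selector such as `0x1B` in `cs` the state is copied to the TCB; with a kernel selector such as `0x08` the TCB is left alone.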

turdus wrote:IST also allows separate stacks for NMI, exception handlers and irq handlers for example (up to 7 stacks).

In general, task gates (in protected mode) and "IST" (in long mode) just complicate things more without solving anything...

Cheers,

Brendan

gerryg400 · Post by **gerryg400** » Fri Apr 06, 2012 10:28 pm

Brendan wrote:
As far as I can tell; after the CPU has started the NMI handler but before the CPU has had a chance to execute the NMI handler's very first instruction, a machine check exception or an SMI could occur and do IRET; and immediately after the IRET (or after a return from SMM) a second NMI can occur before you've managed to execute one instruction for the first NMI.

The only "bullet-proof" way to handle NMI is to cope with nested NMI. This implies that using a task gate (in protected mode) or using the "IST" (in long mode) for the NMI handler is a bad idea (as they can't handle nesting).

I'm pretty much out of my depth here. Don't know much about SMIs or MC exceptions. I thought (perhaps hoped ?) that SMI code shouldn't do an IRET for this very reason. And I thought that MC exceptions weren't recoverable anyway so there was no need to return from them. Too simplistic ?

Bottom line is I've been assuming that I could prevent NMIs from nesting.

Brendan · Post by **Brendan** » Fri Apr 06, 2012 11:31 pm

Hi,

gerryg400 wrote:
As far as I can tell; after the CPU has started the NMI handler but before the CPU has had a chance to execute the NMI handler's very first instruction, a machine check exception or an SMI could occur and do IRET; and immediately after the IRET (or after a return from SMM) a second NMI can occur before you've managed to execute one instruction for the first NMI.

The only "bullet-proof" way to handle NMI is to cope with nested NMI. This implies that using a task gate (in protected mode) or using the "IST" (in long mode) for the NMI handler is a bad idea (as they can't handle nesting).
I'm pretty much out of my depth here. Don't know much about SMIs or MC exceptions. I thought (perhaps hoped ?) that SMI code shouldn't do an IRET for this very reason.

It's hard to tell what firmware's SMM might do in practice (and while I'd hope that most firmware doesn't touch any interrupts for any reason I wouldn't want to make assumptions involving the sanity of firmware vendors!). I'm using Intel's manual as a guide to what is theoretically possible (duplicated here under the assumption of "fair use, for educational purposes"):

Intel wrote:26.8 NMI HANDLING WHILE IN SMM

NMI interrupts are blocked upon entry to the SMI handler. If an NMI request occurs during the SMI handler, it is latched and serviced after the processor exits SMM. Only one NMI request will be latched during the SMI handler. If an NMI request is pending when the processor executes the RSM instruction, the NMI is serviced before the next instruction of the interrupted code sequence. This assumes that NMIs were not blocked before the SMI occurred. If NMIs were blocked before the SMI occurred, they are blocked after execution of RSM.

Although NMI requests are blocked when the processor enters SMM, they may be enabled through software by executing an IRET instruction. If the SMM handler requires the use of NMI interrupts, it should invoke a dummy interrupt service routine for the purpose of executing an IRET instruction. Once an IRET instruction is executed, NMI interrupt requests are serviced in the same “real mode” manner in which they are handled outside of SMM.

A special case can occur if an SMI handler nests inside an NMI handler and then another NMI occurs. During NMI interrupt handling, NMI interrupts are disabled, so normally NMI interrupts are serviced and completed with an IRET instruction one at a time. When the processor enters SMM while executing an NMI handler, the processor saves the SMRAM state save map but does not save the attribute to keep NMI interrupts disabled. Potentially, an NMI could be latched (while in SMM or upon exit) and serviced upon exit of SMM even though the previous NMI handler has still not completed. One or more NMIs could thus be nested inside the first NMI handler. The NMI interrupt handler should take this possibility into consideration.

Also, for the Pentium processor, exceptions that invoke a trap or fault handler will enable NMI interrupts from inside of SMM. This behaviour is implementation specific for the Pentium processor and is not part of the IA-32 architecture.

If you're concerned about "bullet-proof" NMI handling, there's plenty to worry about in that.

gerryg400 wrote:And I thought that MC exceptions weren't recoverable anyway so there was no need to return from them. Too simplistic ?

That depends on what caused the machine check exception and what the OS is capable of handling. Intel have designed it such that software can attempt to recover after a machine check exception (marking some RAM as "faulty" and killing any affected processes, or shutting down a faulty CPU and letting the other CPUs continue, etc). Even if recovery isn't possible, reporting the error to the user may be possible (e.g. a nice descriptive "blue screen of death" saying which piece of hardware is faulty rather than a random reset of unknown cause) and it might also be possible to sync disks to minimise data loss.

gerryg400 wrote:Bottom line is I've been assuming that I could prevent NMIs from nesting.

I don't think it's possible to prevent NMIs from nesting in all possible cases (although I should point out that machine check is optional - you could just let the system reset itself like it does for triple fault if you want).

I should probably also point out that "good enough" can sometimes be good enough. For example, if your OS is intended for games consoles and crashes once per year because of nested NMI (but crashes 20 times per week due to other problems), then I'd say "let it crash!". On the other hand, if you want people to use your OS for mission critical servers (99.9999% availability!) then "good enough" will probably never be good enough.

Cheers,

Brendan

xenos · Post by **xenos** » Sat Apr 07, 2012 2:54 am

@Brendan:

The second version of your interrupt handler code looks rather close to my original idea (before the "ring 0 stack points to TCB" approach was mentioned). Just a few remarks and questions I encountered when I looked through your interrupt handler code:

Brendan wrote:

Code:

interruptHandler:
    push dword 0                       ;Dummy error code (remove for some exception handlers)

    test byte [esp+4],3                ;Was interrupted code running at CPL=0?
    je .isCPL0                         ; yes

At the very beginning of the first version: I guess this must be [esp+8] since you pushed the dummy error code. In the second version you wrote [esp+8].
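The offset in question can be checked with a small C sketch of the stack layout at the `test` instruction. The struct name `entry_stack` and the helper `interrupted_cpl` are made up for illustration; the layout assumes a 32-bit handler that has just pushed the dummy error code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stack contents at the "test" instruction, lowest address first,
 * assuming a 32-bit handler that has just pushed a dummy error code. */
struct entry_stack {
    uint32_t error_code;  /* [esp+0]  pushed by the handler (or by the CPU) */
    uint32_t eip;         /* [esp+4]  return address                        */
    uint32_t cs;          /* [esp+8]  low 2 bits = interrupted code's CPL   */
    uint32_t eflags;      /* [esp+12] pushed by the CPU                     */
};

static_assert(offsetof(struct entry_stack, eip) == 4, "EIP is at [esp+4]");
static_assert(offsetof(struct entry_stack, cs) == 8, "CS is at [esp+8]");

/* C equivalent of "test byte [esp+8],3". */
int interrupted_cpl(const struct entry_stack *esp)
{
    return (int)(esp->cs & 3u);
}
```

Testing `[esp+4]` would look at the low bits of the return EIP instead, which says nothing about the privilege level.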

Code:

    popad
    add esp,4                          ;Remove error code
    iretd

At the return to user mode: Shouldn't DS and ES be restored to USER_DATA here as well (and also in the second version of your interrupt handler)?

Oh, and I guess EFLAGS needs to be saved / restored as well. But anyway, I guess the concept is very clear.

@gerryg400:

Would you mind posting your ret_to_user code? I guess I can imagine what it should look like, I just want to make sure that I don't miss anything important.

@all:

Regarding the issue of NMI handling - how should an NMI be handled anyway? I always thought that NMIs indicate some critical hardware failure that cannot be dealt with in any sane way without shutting down the system (or letting it crash). (Except for IPIs with NMI delivery mode, of course - but these are always software generated.)

One more minor issue with the "stack points to TCB" approach I (possibly) found is that in long mode one needs to make sure that the register save area pointed to by RSP0 is always aligned at a 16-byte boundary, because otherwise the pushed registers may end up at the wrong place:

Intel Vol. 3 Chap. 6.14.2 wrote:In legacy mode, the stack pointer may be at any alignment when an interrupt or exception causes a stack frame to be pushed. This causes the stack frame and succeeding pushes done by an interrupt handler to be at arbitrary alignments. In IA-32e mode, the RSP is aligned to a 16-byte boundary before pushing the stack frame. The stack frame itself is aligned on a 16-byte boundary when the interrupt handler is called. The processor can arbitrarily realign the new RSP on interrupts because the previous (possibly unaligned) RSP is unconditionally saved on the newly aligned stack. The previous RSP will be automatically restored by a subsequent IRET.
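The effect of that realignment on a register save area can be worked through in C. This is only arithmetic on addresses, assuming an interrupt without an error code (five 8-byte slots: SS, RSP, RFLAGS, CS, RIP); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FRAME_BYTES (5u * 8u)   /* SS, RSP, RFLAGS, CS, RIP */

/* Address of the saved RIP (the lowest frame slot) for a given RSP0,
 * following the quoted rule: align down to 16 bytes, then push the frame. */
uint64_t frame_base(uint64_t rsp0)
{
    uint64_t aligned = rsp0 & ~(uint64_t)0xF;
    return aligned - FRAME_BYTES;
}

/* How many bytes lower the frame lands compared with a model that
 * ignores the realignment. Non-zero means the TCB fields are hit at
 * the wrong offsets. */
uint64_t misplacement(uint64_t rsp0)
{
    return (rsp0 - FRAME_BYTES) - frame_base(rsp0);
}
```

For an RSP0 of `0x1008` the frame ends up 8 bytes lower than a naive layout expects, so every saved register would be off by one slot; for `0x1000` the two models agree.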

turdus · Post by **turdus** » Sat Apr 07, 2012 7:39 am

Brendan wrote:Interrupt handlers have to check the interrupted code's CPL

No, that's the point of IST. Read the manuals ("The IST mechanism provides a method for specific interrupts, such as NMI, double-fault, and machine-check, to always execute on a known-good stack."). I don't have such a check and I can assure you it works fine and fast without. And there's plenty of space on the TCB page for nested stacks (a thread struct is usually not bigger than 512 bytes).
Let's say you cannot imagine a handler without it, but it's not a "have to".

Brendan · Post by **Brendan** » Sat Apr 07, 2012 8:04 am

Hi,

turdus wrote:
Brendan wrote:For "one kernel stack per CPU".... Interrupt handlers have to check the interrupted code's CPL
No, that's the point of IST. Read the manuals ("The IST mechanism provides a method for specific interrupts, such as NMI, double-fault, and machine-check, to always execute on a known-good stack."). I don't have such a check and I can assure you it works fine and fast without. And there's plenty of space on the TCB page for nested stacks (a thread struct is usually not bigger than 512 bytes).
Let's say you cannot imagine a handler without it, but it's not a "have to".

You're using "one (small) kernel stack per thread plus a bunch more kernel stacks for various interrupt handlers" and not using "one kernel stack per CPU"; and because you're having trouble seeing the difference you're not comprehending what everyone else here is talking about.

Note: Ignoring space used for things like thread name, amount of CPU time used by the thread, etc; we'd be looking at about 512 bytes for FPU/MMX/SSE state alone; plus another 32 bytes (for protected mode) or 128 bytes (for long mode) to store the thread's general registers. There are no stacks in the TCB at all.
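The sizes quoted above can be checked with a few C declarations. The struct names are invented for this sketch, and it covers only the FPU/MMX/SSE image and the general registers, not any other TCB fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FXSAVE writes a 512-byte image and needs a 16-byte aligned buffer. */
struct fxsave_area { uint8_t bytes[512]; };

struct gprs32 { uint32_t r[8];  };   /* EAX..EDI in protected mode */
struct gprs64 { uint64_t r[16]; };   /* RAX..R15 in long mode      */

static_assert(sizeof(struct fxsave_area) == 512, "FXSAVE image size");
static_assert(sizeof(struct gprs32) == 32, "8 registers x 4 bytes");
static_assert(sizeof(struct gprs64) == 128, "16 registers x 8 bytes");

/* Hypothetical long-mode TCB holding only saved state, no stack. */
struct tcb64 {
    _Alignas(16) struct fxsave_area fpu;
    struct gprs64 regs;
};

/* Bytes of per-thread state in this sketch. */
size_t tcb64_state_bytes(void) { return sizeof(struct tcb64); }
```

That comes to 640 bytes of state per thread in long mode before any bookkeeping fields are added.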

Cheers,

Brendan

Cognition · Post by **Cognition** » Sat Apr 07, 2012 5:38 pm

XenOS wrote: @all:

Regarding the issue of NMI handling - how should an NMI be handled anyway? I always thought that NMIs indicate some critical hardware failure that cannot be dealt with in any sane way without shutting down the system (or letting it crash). (Except for IPIs with NMI delivery mode, of course - but these are always software generated.)

This was a question I had as well, after nosing around my chipset's data sheet some it seems like something a chipset driver would have to handle. On the Intel ICH10R it seems like NMI can be asserted primarily by the detection of bus parity errors, PCI devices asserting the #SERR signal, parity errors detected on the PCI Bus, and errors on the ISA bus through the LPC bridge. Ultimately part of the LPC package includes a system management function that has the actual registers in it to actually let the hardware know the NMI has been dealt with, it also might support some other remote platform management technologies from Intel. On top of that it seems like a lot of things that generate an NMI could be redirected through the hardware to generate an SMI instead. Long story short it does seem that on modern systems NMIs are still used to signal fairly serious chipset errors that are probably implementation specific and might not be recoverable. According to the PCI Bus specifications SERR is generally a pretty severe stop the world kind of error that doesn't really have a set handling routine and by design always generates either an NMI or Machine Check Exception and is 'therefore fatal'.

At best it seems you might be able to detect what generated the error, report it to an admin or the user, and attempt an orderly shutdown. I also get the impression that NMIs shouldn't really nest in the first place. Intel's SDM Volume 3 Section 6.7 states that upon receiving an NMI the processor hardware makes sure no other interrupts can be received, including NMIs. It does seem possible that perhaps a Machine Check Exception could occur, but even then an MCE is an abort exception that doesn't guarantee any clean state to return to (from what I gather if you do get a recoverable MCE the processor has to be restarted). So I'm not sure some of these scenarios laid out here are even possible.

Brendan · Post by **Brendan** » Sat Apr 07, 2012 8:16 pm

Hi,

XenOS wrote:Regarding the issue of NMI handling - how should an NMI be handled anyway? I always thought that NMIs indicate some critical hardware failure that cannot be dealt with in any sane way without shutting down the system (or letting it crash). (Except for IPIs with NMI delivery mode, of course - but these are always software generated.)

For critical hardware errors (both NMI and machine check), in my opinion the minimum requirement is telling the user that it occurred. If a user submits a bug report saying "Your OS resets the computer when I try to burn a CD" then you're going to assume it's a triple fault and spend a week searching for bugs that don't exist (and when you give up you'll never know whether there were bugs in your code or not). If a user submits a bug report saying "Your OS reports an NMI when I try to burn a CD" then it's an extremely different scenario.

Of course minimum requirements are only the minimum. The more information your software can provide the better (e.g. perhaps telling the user "NMI generated by SATA controller on AHCI bus" rather than "NMI occurred"). For NMI this does require something like a chipset driver, but for machine check it doesn't.

At the other end of the scale ("maximum requirements") is fault tolerance - e.g. recovering from critical hardware errors and keeping the OS running with the least loss of functionality where possible. This is likely to be far beyond the scope of most OS projects; however providing the framework needed for this may not be. For example, even if no actual motherboard drivers exist, your OS could support loading a "motherboard driver" and provide a way for software to disable specific PCI devices, take a CPU offline, mark an area of RAM as "faulty", etc (so that if anyone does write a motherboard driver it can do something useful).

Finally, NMI isn't necessarily limited to hardware errors. Your kernel may generate them deliberately for some reason (one example is the "NMI watchdog" in Linux).

Cognition wrote:I also get the impression that NMIs shouldn't really nest in the first place.

At the hardware level, NMI shouldn't nest (but "shouldn't" is not a guarantee that they won't nest).

At the software level, as soon as your NMI handler attempts to do anything useful the "NMI doesn't nest" theory becomes unworkable, especially for micro-kernels (as things like video drivers are in user space), and especially for "multi-CPU" (as an NMI on one CPU will not prevent a different CPU from receiving NMI). For example, imagine an OS that (when a hardware error occurs) tries to terminate/suspend all non-essential processes, tries to send a "hardware error occurred" message to all video drivers and tries to sync disks to avoid data loss. Now try to imagine an OS that does all that without executing a single IRET. It's a lot easier to just assume that NMI may nest and deal with it.

I'd be tempted to deliberately do a dummy "IRET to the following instruction" near the start of the NMI handler; just to make sure everyone understands that NMI can nest (regardless of what the hardware says or doesn't say). Heck, I'd probably set an "NMI occurred" flag somewhere, do the dummy IRET, send IPIs to other CPUs (to tell them not to do anything non-essential until further notice), then do STI (so I'm not failing to respond to IPIs from other CPUs); and then start worrying about what to do about handling the NMI after all that is done.

Cheers,

Brendan

gerryg400 · Post by **gerryg400** » Sun Apr 08, 2012 4:59 am

XenOS wrote:@gerryg400:
Would you mind posting your ret_to_user code? I guess I can imagine what it should look like, I just want to make sure that I don't miss anything important.

This code is by no means final and the ksysexit() function changes frequently. In particular, I should be able to handle reschedules caused by an interrupt right up to the __cli() but I currently don't. This __cli() is in effect the point of no return.

Code:

ret_to_user:

        /* put core_id in edi */
        GET_COREID %edi
        call    ksysexit
        /* ksysexit returns the new stack pointer */
        movq    %rax, %rsp

        /* Back to the new thread */
        POP_GP_REGS
        iretq

then

Code:

void *ksysexit(int core_id) {

    kobj_thread_t *newt;

    kassert(core_nest[core_id] == 1);

    ksignal_check();

    newt = curr_thread[core_id];

    /* Now run the continuation function */
    if (newt->continuation_func) {
        kdebug("POST PROCESSING\n");
        newt->reg.rax = newt->continuation_func(newt);
        newt->continuation_func = 0;
    }

    __cli();

    /* Set the current core state */
    core_state[core_id] = 0;

    /* Set the core nesting level to 0 */
    core_nest[core_id] = 0;

    /* Fix up the TLS */
    wr_msr(0xc0000100, (uintptr_t)curr_thread[core_id]->pptls);

    /* Fixup the TSS */
    tss[core_id]->rsp0 = (uint64_t)&curr_thread[core_id]->stk0top;

    /* And get the stack for the next thread */
    return &curr_thread[core_id]->reg.r15;
}

gerryg400 · Post by **gerryg400** » Sun Apr 08, 2012 5:03 am

XenOS wrote:One more minor issue with the "stack points to TCB" approach I (possibly) found is that in long mode one needs to make sure that the register save area pointed to by RSP0 is always aligned at a 16-byte boundary, because otherwise the pushed registers may end up at the wrong place:

This http://forum.osdev.org/viewtopic.php?f=1&t=22014 cost me a week of my life.

xenos · Post by **xenos** » Sun Apr 08, 2012 10:24 am

Indeed, your code looks more or less as I expected.

The only thing I would have missed is the TLS stuff / the fs base fixup since I (currently) don't have TLS support in my kernel. Well, and the continuation stuff is a bit new to me. I know about the theoretical concept, but I've never used continuations in practice.
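For readers equally new to continuations, here is a minimal C sketch of the pattern visible in `ksysexit()`: a blocked thread keeps no kernel stack, only a function pointer describing what to finish before it returns to user mode. The struct and function names below are hypothetical and much simpler than gerryg400's `kobj_thread_t`.

```c
#include <assert.h>
#include <stdint.h>

struct thread;
typedef uint64_t (*continuation_fn)(struct thread *);

/* Hypothetical, stripped-down thread object. */
struct thread {
    uint64_t saved_rax;            /* value the thread sees in RAX on return */
    continuation_fn continuation;  /* work to finish before returning, or 0  */
    uint64_t arg;                  /* data the continuation needs            */
};

/* Example continuation: complete a blocked "receive" by delivering arg. */
static uint64_t finish_receive(struct thread *t) { return t->arg; }

/* Blocking path: instead of sleeping on a kernel stack, record what to
 * do when the thread is next chosen to run. */
void block_on_receive(struct thread *t) { t->continuation = finish_receive; }

/* Exit path: run the pending continuation once and store its result
 * as the system call's return value. */
void run_continuation(struct thread *t)
{
    if (t->continuation) {
        t->saved_rax = t->continuation(t);
        t->continuation = 0;
    }
}
```

Because the only state carried across the block is in the thread object, the per-CPU kernel stack can be reused by whichever thread runs next.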

rdos · Post by **rdos** » Sun Apr 08, 2012 10:46 am

Brendan wrote:Hi,

turdus wrote:
Brendan wrote:For "one kernel stack per CPU".... Interrupt handlers have to check the interrupted code's CPL
No, that's the point of IST. Read the manuals ("The IST mechanism provides a method for specific interrupts, such as NMI, double-fault, and machine-check, to always execute on a known-good stack."). I don't have such a check and I can assure you it works fine and fast without. And there's plenty of space on the TCB page for nested stacks (a thread struct is usually not bigger than 512 bytes).
Let's say you cannot imagine a handler without it, but it's not a "have to".
You're using "one (small) kernel stack per thread plus a bunch more kernel stacks for various interrupt handlers" and not using "one kernel stack per CPU"; and because you're having trouble seeing the difference you're not comprehending what everyone else here is talking about.

I don't think that matters a lot. In a design that is tolerant of stack overflows in the kernel, either the stack-fault or the double-fault handler must use a task gate / IST. Otherwise, any such overflow will generate a triple fault, and you would have no idea what happened. This is unrelated to whether there is one kernel stack per thread or per CPU.

gerryg400 · Post by **gerryg400** » Sun Apr 08, 2012 3:50 pm

rdos wrote:I don't think that matters a lot. In a design that is tolerant of stack overflows in the kernel, either the stack-fault or the double-fault handler must use a task gate / IST. Otherwise, any such overflow will generate a triple fault, and you would have no idea what happened. This is unrelated to whether there is one kernel stack per thread or per CPU.

Perhaps a task gate or an IST could be used to find kernel bugs, but in this type of kernel design it is intended that 'normal' operation never use these CPU features. Remember that most work in a microkernel is done in ring 3. And in this particular type of microkernel the goal is for a very small, very flat (not very nested) kernel. My kernel is not tolerant in any way of kernel stack overflow. It doesn't need to be because (aside from bugs), the kernel stack is used in a very predictable way and will never overflow.

rdos · Post by **rdos** » Mon Apr 09, 2012 3:57 am

gerryg400 wrote:
rdos wrote:I don't think that matters a lot. In a design that is tolerant of stack overflows in the kernel, either the stack-fault or the double-fault handler must use a task gate / IST. Otherwise, any such overflow will generate a triple fault, and you would have no idea what happened. This is unrelated to whether there is one kernel stack per thread or per CPU.
Perhaps a task gate or an IST could be used to find kernel bugs, but in this type of kernel design it is intended that 'normal' operation never use these CPU features. Remember that most work in a microkernel is done in ring 3. And in this particular type of microkernel the goal is for a very small, very flat (not very nested) kernel. My kernel is not tolerant in any way of kernel stack overflow. It doesn't need to be because (aside from bugs), the kernel stack is used in a very predictable way and will never overflow.

OTOH, having a task / IST serve double fault won't interfere with any normal operation either. I can imagine that nested IRQs could cause stack overflows even in a microkernel, but that would depend on how IRQs are handled, and the size of the kernel stack. Another obvious problem that could cause stack faults is a corrupt stack pointer. But you are right that in a mature microkernel these issues should not occur.

turdus · Post by **turdus** » Tue Apr 10, 2012 3:18 am

Brendan wrote:You're using "one (small) kernel stack per thread plus a bunch more kernel stacks for various interrupt handlers" and not using "one kernel stack per CPU"; and because you're having trouble seeing the difference you're not comprehending what everyone else here is talking about.

Sorry to disappoint you, but you are wrong. I do not have per-thread IST values. IST pointers are in the TSS, and you have only one TSS per CPU (unless you use hardware task switching, which I don't). Whenever a handler is called, KERNELSTACKSIZE is subtracted from this pointer, and it is added back before the handler leaves. This way the "beginning of stack" is always set up so that the next handler does not interfere. If you choose a KERNELSTACKSIZE smaller than the actual stack consumption, then yes, you'll have trouble. But as long as you provide a sufficient amount of memory, you can have endless nested levels of handlers without problems.
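The adjustment scheme described above can be modelled in a few lines of C. This is a simulation of the pointer arithmetic only; `tss_sim`, `handler_enter` and `handler_leave` are invented names, and the value chosen for `KERNELSTACKSIZE` is an arbitrary assumption.

```c
#include <assert.h>
#include <stdint.h>

#define KERNELSTACKSIZE 1024u   /* assumed value, for illustration only */

/* Stand-in for the one IST slot of a per-CPU TSS. */
struct tss_sim { uint64_t ist1; };

/* First thing a handler does: move the IST pointer down so that a
 * nested handler using the same slot starts on fresh stack space. */
void handler_enter(struct tss_sim *tss) { tss->ist1 -= KERNELSTACKSIZE; }

/* Last thing a handler does before IRET: give the space back. */
void handler_leave(struct tss_sim *tss) { tss->ist1 += KERNELSTACKSIZE; }

/* Stack top that a handler entered at nesting depth "depth" starts on
 * (depth 0 = not nested), given the original IST value. */
uint64_t stack_top_at_depth(uint64_t ist_top, unsigned depth)
{
    return ist_top - (uint64_t)depth * KERNELSTACKSIZE;
}
```

Note that the CPU loads the IST pointer before the handler's first instruction runs, so an interrupt that uses the same slot and arrives before `handler_enter` has executed would start on the same stack top; that window is the case raised at the start of this thread.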

Brendan wrote:Note: Ignoring space used for things like thread name, amount of CPU time used by the thread, etc; we'd be looking at about 512 bytes for FPU/MMX/SSE state alone; plus another 32 bytes (for protected mode) or 128 bytes (for long mode) to store the thread's general registers. There are no stacks in the TCB at all.

The TCB contains a stack to store general purpose registers and return data for iret. Yes, you can use different memory for that, but most kernels place it in the TCB (512 bytes for those 32 or 128 bytes you wrote). And I don't ignore thread names and such; what makes you think that? FPU/MMX/SSE registers are usually not stored on the stack, but in a dedicated area of the TCB, since you save and restore them with a different method (only when needed, not on every task switch). So they don't belong to the subject.
