Issues moving to the "one kernel stack per CPU" approach


Re: Issues moving to the "one kernel stack per CPU" approach

Post by Jezze »

That's actually pretty clever.

Re: Issues moving to the "one kernel stack per CPU" approach

Post by xenos »

I agree, I like this idea. This may also provide the solution to another issue I was thinking about (not a serious one, but at least something to think about).

As I wrote before, there may be situations when a CPU has nothing to do and is put into a "sti; hlt;" state waiting for an interrupt and some work. During this time the CPU stays in ring 0 because of the "hlt", and it is not assigned any user thread. In this case the interrupt stub should not save any registers into the current thread's TCB, since there is no current thread at all. Instead, it should simply switch to a clean kernel stack and enter the interrupt handler. The interrupt handler does not make any assumptions about the register contents, so the values from the previous run can be completely discarded.
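
To make the idle case concrete, here is a minimal C sketch of what I have in mind (this_core() and current_tcb[] are just made-up names for the example):

Code: Select all

/* Idle loop for a CPU that currently has no user thread assigned.  The
 * interrupt stub can then see current_tcb[core] == NULL and skip saving
 * registers into a TCB. */
void cpu_idle(void)
{
    current_tcb[this_core()] = NULL;     /* no current thread on this CPU */
    for (;;)
        __asm__ volatile("sti; hlt");    /* wait in ring 0 for an interrupt */
}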

I was wondering how to detect this situation most efficiently, and I figured out that I could either check the "current TCB" pointer for being NULL or the return code segment (pushed when the interrupt occurs) for being ring 0 or ring 3. But in your approach it looks a bit different. Could you explain the role of "core_nest" and "core_state" in your code? I guess you allow interrupts in ring 0 as well - how do you handle them?

Re: Issues moving to the "one kernel stack per CPU" approach

Post by gerryg400 »

Yes, I have interrupts enabled nearly all the time. I only really disable them while saving and switching stacks. As you guessed, I use core_nest[] to keep track of interrupt nesting. My interrupt entry looks like this. You see it jumps over the stack switch if it's already on the kernel stack.

Code: Select all

.code64
.text
        .align  16

        .globl  intr_0x21
intr_0x21:
        pushq   %rdi
        pushq   %rsi
        mov     $0x21, %esi
        jmp     intr_handler

intr_handler:

        /* Push the remaining regs. rdi, rsi were already pushed */
        PUSH_GP_REGS_EXCEPT_RDI_RSI

        /* put core_id in edi */
        GET_COREID %edi

        /* If we came from the kernel our core_nest will be > 0 */
        xorl    %edx, %edx
        cmpl    $0, core_nest(, %rdi, 4)
        jne     intr_dont_switch
        movl    $1, %edx

        /* We came from user mode, switch to the kernel stack */
        movq    kstack(, %rdi, 8), %rsp

intr_dont_switch:
        incl    core_nest(, %rdi, 4)
        pushq   %rdx
        call    kintr_handler
        popq    %rdx

        /* If we came from user mode we ret_to_user() */
        cmpl    $1, %edx
        je      ret_to_user

        /* Else do a simple return */
        cli
        GET_COREID %edi
        decl    core_nest(, %rdi, 4)
        POP_GP_REGS
        iretq
The prototype for kintr_handler is:

Code: Select all

void kintr_handler(int core, int inum, int nested);
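
Just to sketch what a handler with that prototype might do (the helper names are made up for illustration, not my real code):

Code: Select all

void kintr_handler(int core, int inum, int nested)
{
    irq_dispatch(core, inum);   /* run the registered handler for this vector */
    send_eoi(inum);             /* acknowledge the interrupt controller */

    /* As the stub above passes it, the flag is 1 when the interrupt came in
     * from user mode, so it is safe to pick the next thread here before the
     * stub falls through to ret_to_user(). */
    if (nested)
        schedule(core);
}
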
core_state keeps track of what a core is doing as it performs a system call. This is mainly for fault handling. The core_state values are:

Code: Select all

enum {
    CORE_ST_USER,     /* Core is in user mode. A fault becomes a signal on the running thread. */
    CORE_ST_PARMCHK,  /* Core is in kernel mode but playing with user data e.g. copy_to_user(). A fault causes the system call to return an error message. */
    CORE_ST_SYSCALL,  /* In a syscall. A fault here means a kernel bug */
    CORE_ST_IPC,      /* A fault here means return an error to the calling thread(s) */
    CORE_ST_CONTIN,   /* Performing a continuation function. We discussed these once before on OSDEV. If you don't know what these are you soon will */
};
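
As an illustration of how a fault handler might use these states (again, the helper names are invented for the example):

Code: Select all

void kpage_fault(int core, void *addr, int errcode)
{
    switch (core_state[core]) {
    case CORE_ST_USER:
        /* User code faulted: it becomes a signal on the running thread. */
        signal_thread(current_tcb[core], SIGSEGV, addr);
        break;
    case CORE_ST_PARMCHK:
        /* Fault while touching user memory, e.g. in copy_to_user():
         * fail the system call instead of killing anything. */
        syscall_abort(core, EFAULT);
        break;
    case CORE_ST_IPC:
        /* Fault during message transfer: return an error to the
         * calling thread(s). */
        ipc_abort(core, EFAULT);
        break;
    default:
        /* CORE_ST_SYSCALL etc.: a fault here means a kernel bug. */
        panic("page fault on core %d, addr %p", core, addr);
    }
}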

Re: Issues moving to the "one kernel stack per CPU" approach

Post by xenos »

Very nice! This approach looks very clean to me. Probably I'll use something similar in my kernel.

Just one more thing: I guess your register save area (which rsp0 points to) also contains some space for the error code, in case of a PF or GPF?

Re: Issues moving to the "one kernel stack per CPU" approach

Post by gerryg400 »

XenOS wrote:Very nice! This approach looks very clean to me. Probably I'll use something similar in my kernel.
The basic idea comes from Minix.
XenOS wrote:Just one more thing: I guess your register save area (which rsp0 points to) also contains some space for the error code, in case of a PF or GPF?
No. It doesn't need to. A little fiddling with the stack and it can be passed as a parameter to the C code that handles traps.

Code: Select all

.code64
.text
        .globl  intr_trap13
        .align  16
intr_trap13:
        /* Error code is already on stack */
        pushq   $13
        jmp     trap

trap:
        PUSH_GP_REGS_EXCEPT_RDI_RSI

        /* Save error code and trapno temporarily */
        movq    14*8(%rsp), %rdx
        movq    13*8(%rsp), %rcx

        /* Save rdi and rsi into stack */
        movq    %rdi, 14*8(%rsp)
        movq    %rsi, 13*8(%rsp)

        /* Put trapno in rsi (errcode is already in rdx) */
        movq    %rcx, %rsi

        /* Put a pointer to the reg struct in rcx */
        leaq    (%rsp), %rcx

        /* put core_id in edi */
        GET_COREID %edi

        /* If we came from the kernel our core_nest will be > 0 */
        cmpl    $0, core_nest(, %rdi, 4)
        jne     trap_dont_switch

        /* We came from user mode, switch to the kernel stack */
        incl    core_nest(, %rdi, 4)
        movq    kstack(, %rdi, 8), %rsp
        call    ktrap_user
        jmp     ret_to_user

trap_dont_switch:
        incl    core_nest(, %rdi, 4)
        call    ktrap_kernel

        /* Do a simple return */
        cli
        GET_COREID %edi
        decl    core_nest(, %rdi, 4)
        POP_GP_REGS
        iretq
The prototype for ktrap_kernel looks like this.

Code: Select all

void ktrap_kernel(int corenum, int trapno, int errcode, regpack64_t *regs);
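
For reference, regpack64_t is basically just the register dump that the stub leaves on the stack; roughly this, where the order of the first 13 slots has to match whatever order PUSH_GP_REGS_EXCEPT_RDI_RSI pushes in:

Code: Select all

typedef struct regpack64 {
    uint64_t gp[13];   /* the 13 registers pushed by PUSH_GP_REGS_EXCEPT_RDI_RSI,
                          in whatever order that macro uses */
    uint64_t rsi;      /* stored by the stub over the saved trap number */
    uint64_t rdi;      /* stored by the stub over the CPU-pushed error code */
} regpack64_t;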

Re: Issues moving to the "one kernel stack per CPU" approach

Post by xenos »

gerryg400 wrote:The basic idea comes from Minix.
So maybe I should have a closer look at Minix from time to time ;) I have one of Tanenbaum's older books somewhere around here with an early version of the Minix source code, but I always had the impression that it was a bit hard to read. I guess I'd better have a look at a more recent version...
gerryg400 wrote:No. It doesn't need to. A little fiddling with the stack and it can be passed as a parameter to the C code that handles traps.
I see. Probably something similar will work on 32-bit systems as well, simply by putting the error code onto the kernel stack instead of rdx.

Re: Issues moving to the "one kernel stack per CPU" approach

Post by Brendan »

Hi,
gerryg400 wrote:My interrupt entry looks like this. You see it jumps over the stack switch if it's already on the kernel stack.
gerryg400 wrote:

Code: Select all

        /* put core_id in edi */
        GET_COREID %edi

        /* If we came from the kernel our core_nest will be > 0 */
        xorl    %edx, %edx
        cmpl    $0, core_nest(, %rdi, 4)
        jne     intr_dont_switch
The fastest/easiest way for an interrupt handler to test if the interrupted code was running at CPL=0 or not would be:

Code: Select all

        test byte [rsp+something],3          ;Does "return CS" on the stack have CPL bits set to zero?
        je intr_dont_switch                  ; yes, skip the thread state save

Cheers,

Brendan

Re: Issues moving to the "one kernel stack per CPU" approach

Post by Brendan »

Hi,
gerryg400 wrote:I point the ring 0 stack in the TSS at the thread control block before I run a thread.
I like this idea too, and have been thinking of "borrowing" it for my OS. There are two things people should be careful about, though.

First, if you support virtual8086 you'd want to be careful because the CPU does 4 extra pushes (for segment registers, before the normal SS/ESP/EFLAGS/CS/EIP pushes) when virtual8086 code is interrupted. This isn't a problem if you're in long mode and can't use virtual8086, or if you're in protected mode and don't use virtual8086 for other reasons.
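
For reference, the frame a 32-bit stub sees looks roughly like this, sketched as a C struct with the lowest address first; the last four fields only exist when virtual8086 code was interrupted, which is exactly why a fixed-size save area gets overrun:

Code: Select all

#include <stdint.h>

struct intr_frame32 {
    /* (a CPU-pushed or dummy error code may sit below this) */
    uint32_t eip;
    uint32_t cs;
    uint32_t eflags;
    uint32_t esp;    /* only pushed on a privilege change (CPL=3 or virtual8086) */
    uint32_t ss;
    uint32_t es;     /* these four are only pushed for virtual8086 */
    uint32_t ds;
    uint32_t fs;
    uint32_t gs;
};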

The other thing that worries me is race conditions caused by an NMI that occurs at exactly the wrong time (after CPL=3 code was interrupted by any other interrupt handler but before the first interrupt handler has finished saving the interrupted thread's state and switching to the kernel's stack). Without adequate work-arounds, the consequences of the race conditions may range from "severe" to "worse than severe", depending on the specific code being used for normal interrupt handlers and the NMI handler. ;)


Cheers,

Brendan

Re: Issues moving to the "one kernel stack per CPU" approach

Post by xenos »

Brendan wrote:The fastest/easiest way for an interrupt handler to test if the interrupted code was running at CPL=0 or not would be:

Code: Select all

        test byte [rsp+something],3          ;Does "return CS" on the stack have CPL bits set to zero?
        je intr_dont_switch                  ; yes, skip the thread state save
That's exactly what I implemented yesterday :D
Brendan wrote:The other thing that worries me is race conditions caused by an NMI that occurs at exactly the wrong time (after CPL=3 code was interrupted by any other interrupt handler but before the first interrupt handler has finished saving the interrupted thread's state and switching to the kernel's stack). Without adequate work-arounds, the consequences of the race conditions may range from "severe" to "worse than severe", depending on the specific code being used for normal interrupt handlers and the NMI handler. ;)
How could one solve this kind of problem? If I see this correctly, the problem is that the NMI handler would save some registers on the stack, which is in fact a small portion of the TCB, and it would thus trash some parts of the TCB which follow the register save area. So the problem is mainly caused by putting the stack into the TCB, where it becomes fatal if it grows more than expected. I guess the NMI handler needs to switch to its own stack immediately or at least detect that the current stack pointer is not safe for extensive pushes...

Re: Issues moving to the "one kernel stack per CPU" approach

Post by Owen »

Brendan wrote:The other thing that worries me is race conditions caused by an NMI that occurs at exactly the wrong time (after CPL=3 code was interrupted by any other interrupt handler but before the first interrupt handler has finished saving the interrupted thread's state and switching to the kernel's stack). Without adequate work-arounds, the consequences of the race conditions may range from "severe" to "worse than severe", depending on the specific code being used for normal interrupt handlers and the NMI handler. ;)
I think the conclusion to this is that, in protected mode, NMIs should always be routed through a task gate (Is it a task gate? My PM memory is weak), or in long mode, through the Interrupt Stack Table.
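
In long mode that just means setting the IST field in the NMI's IDT entry; a rough sketch (the selector value, the idt[] array and the stub name are placeholders):

Code: Select all

#include <stdint.h>

/* 64-bit IDT gate descriptor (16 bytes). */
struct idt_gate64 {
    uint16_t offset_low;
    uint16_t selector;
    uint8_t  ist;         /* bits 0-2: IST index, 0 = normal RSP0 switching */
    uint8_t  type_attr;   /* present, DPL, gate type */
    uint16_t offset_mid;
    uint32_t offset_high;
    uint32_t reserved;
} __attribute__((packed));

extern struct idt_gate64 idt[256];   /* placeholder IDT */
extern void nmi_stub(void);          /* placeholder NMI entry point */

static void route_nmi_through_ist(void)
{
    uint64_t addr = (uintptr_t)nmi_stub;
    struct idt_gate64 *g = &idt[2];          /* NMI is vector 2 */

    g->offset_low  = addr & 0xFFFF;
    g->selector    = 0x08;                   /* kernel code selector (placeholder) */
    g->ist         = 1;                      /* IST1 must be set up in the TSS */
    g->type_attr   = 0x8E;                   /* present, DPL=0, 64-bit interrupt gate */
    g->offset_mid  = (addr >> 16) & 0xFFFF;
    g->offset_high = (uint32_t)(addr >> 32);
    g->reserved    = 0;
}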

Re: Issues moving to the "one kernel stack per CPU" approach

Post by gerryg400 »

I don't have any great solutions to NMI. Aside from Owen's solution, which seems reasonable, there is the manual way, which would involve unwinding and rebuilding the stack if the NMI happened during kernel entry or exit. I've played around with this a little and it's certainly possible. The core_nest variable tells me whether the kernel has a good stack or not.

The biggest issue for me is that NMI allows you to re-enter the kernel even when interrupts are disabled and a spinlock is held or being waited on. This can cause deadlock on the local core or unnecessary waiting by other cores. The only way around this that I have is to set a bit in core_state to indicate that an NMI has occurred and return to what was previously being done. Because it's very difficult to detect exactly where the kernel is when an NMI happens, it would probably be best to assume that if interrupts are disabled (check IF in the RFLAGS on the stack) then it's not safe to do anything. The NMI bit in core_state can be processed later, at a time when the kernel is in a clean state.
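
Something along these lines is what I mean; the names are made up for the example, and the stub is assumed to pass in the RFLAGS image the CPU pushed:

Code: Select all

#include <stdint.h>

#define RFLAGS_IF          (1u << 9)
#define CORE_ST_NMI_PEND   0x80      /* e.g. a spare bit in core_state */

void knmi_handler(int core, uint64_t saved_rflags)
{
    if (!(saved_rflags & RFLAGS_IF)) {
        /* Interrupts were disabled: we may hold (or be spinning on) a lock,
         * or be in the middle of kernel entry/exit.  Just note the NMI and
         * deal with it later, when the kernel is in a clean state. */
        core_state[core] |= CORE_ST_NMI_PEND;
        return;
    }

    nmi_do_work(core);   /* otherwise it's safe to deal with it right now */
}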

A further complication is that an IRET will potentially unmask further NMIs. This problem is not limited to our particular type of microkernel so it's probable that there are solutions to this floating about.

Is it possible to return from an NMI without enabling further NMIs ?

Re: Issues moving to the "one kernel stack per CPU" approach

Post by cyr1x »

gerryg400 wrote: Is it possible to return from an NMI without enabling further NMIs ?
Yes, just don't use IRET; use RETF, etc...
On the other hand, one might consider doing an IRET to the 'real' NMI handler and putting some NMI-reentrancy-protection code around it. This allows for further exceptions/interrupts, as you may want to set breakpoints for debugging the NMI code or handle other interrupts while in an NMI.

Re: Issues moving to the "one kernel stack per CPU" approach

Post by gerryg400 »

cyr1x wrote:
gerryg400 wrote: Is it possible to return from an NMI without enabling further NMIs ?
Yes, just don't use IRET; use RETF, etc...
On the other hand, one might consider doing an IRET to the 'real' NMI handler and putting some NMI-reentrancy-protection code around it. This allows for further exceptions/interrupts, as you may want to set breakpoints for debugging the NMI code or handle other interrupts while in an NMI.
Hmm, of course. Fix up the stack frame and just do a RET. Then later on, after the NMI is processed, rebuild the stack frame and do an IRET.

Re: Issues moving to the "one kernel stack per CPU" approach

Post by turdus »

In long mode you can skip the CPL check, as the CPU can switch stacks with consistent data regardless of CPL changes. See AMD64_vol2 page 246 or Intel_vol3a section 5.14.5.
The IST also allows separate stacks for NMI, exception handlers and IRQ handlers, for example (up to 7 stacks).

Use it this way: for clarity, let's assume we use only a single IST entry. When you enter a handler, add KERNELSTACKSIZE to its value; subtract it before leaving, both unconditionally. You'll end up like this:

Code: Select all

stack+0	came from userspace
stack+KSIZE	kernel nested level 0
stack+2*KSIZE	kernel nested level 1
stack+3*KSIZE	kernel nested level 2
stack+4*KSIZE	kernel nested level 3
...nested level n
(Since I do not allow re-entrant IRQ handlers, most of the time I have "came from userspace". If an IRQ happens during a syscall, I have nested level 0. If a page fault occurs in that IRQ handler, an exception will be raised, so I would have nested level 1. If the page fault handler crashes (it won't), another exception is raised, which means nested level 2. The double fault exception handler never fails, so I have never seen nested level 3. If you are so unlucky that an NMI happens then, you'll have the theoretical nested level 4.)
For KSIZE you can use some small value, like 256 bytes (of which 40+104 bytes, plus the error code, are used by the time the C handler takes over; the C handler itself takes no more than another 32-64 bytes).
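
In C the entry/exit adjustment is just a pair of one-liners; the TSS struct below is only a sketch of the standard 64-bit layout, and the field names are mine:

Code: Select all

#include <stdint.h>

#define KSIZE 256                   /* per-nesting-level slice, as above */

struct tss64 {                      /* 64-bit TSS, 104 bytes */
    uint32_t reserved0;
    uint64_t rsp[3];                /* RSP0-RSP2 */
    uint64_t reserved1;
    uint64_t ist[7];                /* IST1-IST7 */
    uint64_t reserved2;
    uint16_t reserved3;
    uint16_t iomap_base;
} __attribute__((packed));

static inline void ist_enter(struct tss64 *tss)
{
    tss->ist[0] += KSIZE;           /* the next interrupt lands on a fresh slice */
}

static inline void ist_leave(struct tss64 *tss)
{
    tss->ist[0] -= KSIZE;           /* undo it just before the iretq */
}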

Avoiding re-entrant IRQ handlers has many benefits.
1. Because IRQ handlers are small and fast routines, it's unlikely that another IRQ will arrive while a handler is running anyway.
2. If this very rare event does happen, the second IRQ only suffers a small delay.
3. Eliminating the conditional branch gains performance, as the code stays sequential and does not flush the CPU pipeline.
4. In return the TCB uses a little more memory (by storing the full state between kernel levels), but the TCB page is cached most of the time anyway (the timeslice counter is on that page).
5. There is a further speedup from not having to do bitmask magic on the PIC.
6. Handlers (even send+recv syscalls) are guaranteed to be atomic, which simplifies things even more at higher levels.

Re: Issues moving to the "one kernel stack per CPU" approach

Post by Brendan »

Hi,
XenOS wrote:
Brendan wrote:The other thing that worries me is race conditions caused by an NMI that occurs at exactly the wrong time (after CPL=3 code was interrupted by any other interrupt handler but before the first interrupt handler has finished saving the interrupted thread's state and switching to the kernel's stack). Without adequate work-arounds, the consequences of the race conditions may range from "severe" to "worse than severe", depending on the specific code being used for normal interrupt handlers and the NMI handler. ;)
How could one solve this kind of problem? If I see this correctly, the problem is that the NMI handler would save some registers on the stack, which is in fact a small portion of the TCB, and it would thus trash some parts of the TCB which follow the register save area. So the problem is mainly caused by putting the stack into the TCB, where it becomes fatal if it grows more than expected. I guess the NMI handler needs to switch to its own stack immediately or at least detect that the current stack pointer is not safe for extensive pushes...
I've been thinking about it. I've decided/realised that the problems don't just affect NMI - they affect any exception that could occur at exactly the wrong time, which could include the machine check exception and possibly others.

For one example, a kernel might assume GS always points to the CPU's "per CPU data area" and then arrange things such that, if (malicious) CPL=3 code changed GS to something else, a general protection fault or page fault will occur when the kernel attempts to use GS, and the exception handler corrects GS and returns (such that GS is auto-corrected if/when necessary). Note: While this sounds complex, it is a viable optimisation in some cases (it avoids the need for the kernel to ever load GS when there's no malicious code running, and simplifies things like "mov esi,[gs:CPUdata.currentTCB]" in interrupt handlers).
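
As a rough sketch of that fix-up (the selector value and helper names are invented for the example), the general protection fault handler can do something like this when the fault came from CPL=0:

Code: Select all

#include <stdint.h>

#define KERNEL_PERCPU_SEL 0x30              /* invented per-CPU data selector */

static inline uint16_t read_gs(void)
{
    uint16_t sel;
    __asm__ volatile("mov %%gs, %0" : "=r"(sel));
    return sel;
}

static inline void write_gs(uint16_t sel)
{
    __asm__ volatile("mov %0, %%gs" : : "r"(sel));
}

/* Returns 1 if the fault was only a clobbered GS and has been corrected,
 * so the faulting instruction can simply be retried. */
int try_fix_gs(void)
{
    if (read_gs() != KERNEL_PERCPU_SEL) {
        write_gs(KERNEL_PERCPU_SEL);
        return 1;
    }
    return 0;   /* GS was fine, so it's a real fault */
}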

In addition; I've realised that there may also be problems returning from an interrupt; where an NMI (or other exception) could occur while you're in the middle of returning to CPL=3.

The exact number of problems and their effects depends on the specific code being used for any interrupt handlers involved; and therefore it's a little hard to find solutions without having specific code to analyse. With this in mind, here's some (untested) example code I made up. To make it more fun I'm relying on the GS trick mentioned above and showing code for protected mode (where you can't just use "swapgs"). Also note that I don't bother saving/restoring CPL=3 DS and ES because (for my OS) user space code shouldn't modify them (and for all I care user space code can/should crash if it does modify them).

Code: Select all

interruptHandler:
    push dword 0                       ;Dummy error code (remove for some exception handlers)

    test byte [esp+8],3                ;Was interrupted code running at CPL=0?
    je .isCPL0                         ; yes

    ;Save remaining CPL=3 state

    pushad

    ;Make sure kernel data segments are being used

    mov eax,KERNEL_DATA
    mov ds,eax
    mov es,eax
    
    ;Switch to the kernel's stack

    mov esp,[gs:CPUinfo.kernelStack]

    ;Enable interrupts (assume interrupt handler uses an "interrupt gate")

    sti

    ;Call the real interrupt handler, C calling conventions for "void realInterruptHandler(uint32_t interruptNumber)"
    
    push 123                           ;Assume this interrupt handler is for interrupt number 123 (!)
    call realInterruptHandler
    add esp,4

    ;Disable interrupts

    cli
    
    ;Restore CPL3 state (may be returning to a different thread)
    
    mov edx,cr3
    mov eax,[gs:CPUinfo.currentTCB]    ;eax = address of TCB for this thread
    cmp edx,[eax+TCB.userCR3]          ;Is virtual address space switch needed?
    je .l1                             ; no, skip it
    mov edx,[eax+TCB.userCR3]
    mov cr3,edx
.l1:
    mov edi,[gs:CPUinfo.TSSaddress]    ;edi = address of TSS for this CPU
    lea ebx,[eax+offsetForESP0]        ;ebx = value to put in TSS for this thread
    mov [edi+TSS.ESP0],ebx             ;Set ESP0 in TSS
    lea esp,[eax+offsetForReturnData]  ;Switch to TCB stack to POP return values
   
    popad
    add esp,4                          ;Remove error code
    iretd

.isCPL0:
    push dword [esp+12]                ;Push "return EFLAGS" on the stack again
    popfd                              ;Restore the interrupted code's "interrupt enable" flag
    pushad
    push 123                           ;Assume this interrupt handler is for interrupt number 123 (!)
    call realInterruptHandler
    add esp,4
    popad
    cli
    iretd

There are 4 problems here:
  1. If an NMI (or some other exception) occurs at the wrong time at the start of an interrupt handler, that exception (and *all* exceptions that interrupt that exception) can trash data in the TCB and maybe beyond the end of the TCB too.
  2. If an NMI (or some other exception) occurs at the wrong time at the start of an interrupt handler, then the exception handler would think it interrupted CPL=0 code but can't assume that DS or ES are sane.
  3. If an NMI (or some other exception) occurs at the wrong time at the start of an interrupt handler, then the exception handler would think it interrupted CPL=0 code but can't assume its stack is sane.
  4. If an NMI (or some other exception) occurs at the wrong time at the end of an interrupt handler (e.g. when a normal interrupt handler is in the middle of returning to CPL=3), then the exception handler would think it interrupted CPL=0 code but can't assume its stack is sane.
The solution to problem 2) is that the exception handler for any exception that could occur at the wrong time needs to load DS and ES regardless of whether the exception occurred when kernel code is running or not.

The solution to problems 1), 3) and 4) is simple: don't use the TCB for a stack to begin with.

The code above would be rewritten as:

Code: Select all

interruptHandler:
    push dword 0                       ;Dummy error code (remove for some exception handlers)

    test byte [esp+8],3                ;Was interrupted code running at CPL=0?
    je .isCPL0                         ; yes

    mov [gs:CPUinfo.tempUserEAX],eax   ;Store EAX somewhere safe

    mov eax,KERNEL_DATA                ;Make sure DS and ES are correct
    mov ds,eax
    mov es,eax

    mov [gs:CPUinfo.tempUserEBX],ebx   ;Store EBX somewhere safe
    
    mov eax,[gs:CPUinfo.currentTCB]    ;eax = address of interrupted thread's TCB
    mov ebx,[esp+4]                    ;ebx = "return EIP"
    mov [eax+TCB.userEIP],ebx
    mov ebx,[esp+16]                   ;ebx = "return ESP"
    mov [eax+TCB.userESP],ebx

    mov ebx,[gs:CPUinfo.tempUserEAX]   ;ebx = user EAX
    mov [eax+TCB.userEAX],ebx
    mov ebx,[gs:CPUinfo.tempUserEBX]   ;ebx = user EBX
    
    mov [eax+TCB.userEBX],ebx
    mov [eax+TCB.userECX],ecx
    mov [eax+TCB.userEDX],edx
    mov [eax+TCB.userEBP],ebp
    mov [eax+TCB.userESI],esi
    mov [eax+TCB.userEDI],edi

    ;Enable interrupts (assume interrupt handler uses an "interrupt gate")

    sti

    ;Call the real interrupt handler, C calling conventions for "void realInterruptHandler(uint32_t interruptNumber)"
    
    push 123                           ;Assume this interrupt handler is for interrupt number 123 (!)
    call realInterruptHandler
    add esp,4

    ;Disable interrupts

    cli
    
    ;Restore CPL3 state (may be returning to a different thread)
    
    mov edx,cr3
    mov eax,[gs:CPUinfo.currentTCB]    ;eax = address of TCB for this thread
    cmp edx,[eax+TCB.userCR3]          ;Is virtual address space switch needed?
    je .l1                             ; no, skip it
    mov edx,[eax+TCB.userCR3]
    mov cr3,edx
.l1:
    mov ebx,[eax+TCB.userEIP]
    mov ecx,[eax+TCB.userESP]
    mov [esp+4],ebx                    ;Set "return EIP"
    mov [esp+16],ecx                   ;Set "return ESP"
    mov ebx,[eax+TCB.userEBX]
    mov ecx,[eax+TCB.userECX]
    mov edx,[eax+TCB.userEDX]
    mov ebp,[eax+TCB.userEBP]
    mov esi,[eax+TCB.userESI]
    mov edi,[eax+TCB.userEDI]
    mov eax,[eax+TCB.userEAX]
    add esp,4                          ;Remove error code
    iretd


.isCPL0:
    push dword [esp+12]                ;Push "return EFLAGS" on the stack again
    popfd                              ;Restore the interrupted code's "interrupt enable" flag
    pushad
    mov eax,KERNEL_DATA
    mov ds,eax
    mov es,eax

    push 123                           ;Assume this interrupt handler is for interrupt number 123 (!)
    call realInterruptHandler
    add esp,4

    popad
    iretd
Cheers,

Brendan