Re: Issues moving to the "one kernel stack per CPU" approach
Posted: Wed Apr 04, 2012 4:55 pm
That's actually pretty clever.
Code: Select all
.code64
.text
.align 16
.globl intr_0x21
intr_0x21:
pushq %rdi
pushq %rsi
mov $0x21, %esi
jmp intr_handler
intr_handler:
/* Push the remaining regs. rdi, rsi were already pushed */
PUSH_GP_REGS_EXCEPT_RDI_RSI
/* put core_id in edi */
GET_COREID %edi
/* If we came from the kernel our core_nest will be > 0 */
xorl %edx, %edx
cmpl $0, core_nest(, %rdi, 4)
jne intr_dont_switch
movl $1, %edx
/* We came from user mode, switch to the kernel stack */
movq kstack(, %rdi, 8), %rsp
intr_dont_switch:
incl core_nest(, %rdi, 4)
pushq %rdx
call kintr_handler
popq %rdx
/* If we came from user mode we ret_to_user() */
cmpl $1, %edx
je ret_to_user
/* Else do a simple return */
cli
GET_COREID %edi
decl core_nest(, %rdi, 4)
POP_GP_REGS
iretq
Code: Select all
void kintr_handler(int core, int inum, int nested);
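The stack-switch decision made by the entry code above can be modelled in C. This is only a sketch: `MAX_CORES`, `intr_enter` and `intr_leave` are hypothetical names, and the array simply mirrors the `core_nest` counter used in the assembly.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CORES 4               /* hypothetical core count for this sketch */

static int core_nest[MAX_CORES];  /* per-core interrupt nesting depth */

/* Models interrupt entry. Returns true when the core was in user mode
 * (nesting depth 0), i.e. when the entry code must switch to the
 * per-CPU kernel stack before calling into C. */
static bool intr_enter(int core)
{
    assert(core >= 0 && core < MAX_CORES);
    bool from_user = (core_nest[core] == 0);
    core_nest[core]++;
    return from_user;
}

/* Models the simple (nested) return path: the decl just before iretq. */
static void intr_leave(int core)
{
    assert(core >= 0 && core < MAX_CORES && core_nest[core] > 0);
    core_nest[core]--;
}
```

The first entry on a core reports "from user" and every entry nested inside it does not, which is the same test the `cmpl $0, core_nest(, %rdi, 4)` instruction performs.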
Code: Select all
enum {
CORE_ST_USER, /* Core is in user mode. A fault becomes a signal on the running thread. */
CORE_ST_PARMCHK, /* Core is in kernel mode but playing with user data e.g. copy_to_user(). A fault causes the system call to return an error message. */
CORE_ST_SYSCALL, /* In a syscall. A fault here means a kernel bug */
CORE_ST_IPC, /* A fault here means return an error to the calling thread(s) */
CORE_ST_CONTIN, /* Performing a continuation function. We discussed these once before on OSDEV. If you don't know what these are you soon will */
};
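A fault handler could use these states to pick a policy. The sketch below is an assumption about how that might look: the `core_state` tag, the `fault_action` enum and `fault_policy` are hypothetical names, and the post does not say what a fault during a continuation should do, so treating it as a kernel bug is a guess.

```c
#include <assert.h>

enum core_state {
    CORE_ST_USER, CORE_ST_PARMCHK, CORE_ST_SYSCALL, CORE_ST_IPC, CORE_ST_CONTIN
};

/* Hypothetical outcomes, one per case described in the enum comments. */
enum fault_action {
    FAULT_SIGNAL_THREAD,   /* deliver a signal to the running thread */
    FAULT_SYSCALL_ERROR,   /* make the system call return an error */
    FAULT_IPC_ERROR,       /* return an error to the calling thread(s) */
    FAULT_KERNEL_BUG       /* nothing sensible to do: panic */
};

static enum fault_action fault_policy(enum core_state st)
{
    switch (st) {
    case CORE_ST_USER:    return FAULT_SIGNAL_THREAD;
    case CORE_ST_PARMCHK: return FAULT_SYSCALL_ERROR;
    case CORE_ST_SYSCALL: return FAULT_KERNEL_BUG;
    case CORE_ST_IPC:     return FAULT_IPC_ERROR;
    case CORE_ST_CONTIN:  return FAULT_KERNEL_BUG;  /* assumption, not from the post */
    }
    assert(!"unknown core state");
    return FAULT_KERNEL_BUG;
}
```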
XenOS wrote: Very nice! This approach looks very clean to me. Probably I'll use something similar in my kernel.
The basic idea comes from Minix.
XenOS wrote: Just one more thing: I guess your register save area (which rsp0 points to) also contains some space for the error code, in case of a PF or GPF?
No. It doesn't need to. A little fiddling with the stack and it can be passed as a parameter to the C code that handles traps.
Code: Select all
.code64
.text
.globl intr_trap13
.align 16
intr_trap13:
/* Error code is already on stack */
pushq $13
jmp trap
trap:
PUSH_GP_REGS_EXCEPT_RDI_RSI
/* Save error code and trapno temporarily */
movq 14*8(%rsp), %rdx
movq 13*8(%rsp), %rcx
/* Save rdi and rsi into stack */
movq %rdi, 14*8(%rsp)
movq %rsi, 13*8(%rsp)
/* Put trapno in rsi (errcode is already in rdx) */
movq %rcx, %rsi
/* Put a pointer to the reg struct in rcx */
leaq (%rsp), %rcx
/* put core_id in edi */
GET_COREID %edi
/* If we came from the kernel our core_nest will be > 0 */
cmpl $0, core_nest(, %rdi, 4)
jne trap_dont_switch
/* We came from user mode, switch to the kernel stack */
incl core_nest(, %rdi, 4)
movq kstack(, %rdi, 8), %rsp
call ktrap_user
jmp ret_to_user
trap_dont_switch:
incl core_nest(, %rdi, 4)
call ktrap_kernel
/* Do a simple return */
cli
GET_COREID %edi
decl core_nest(, %rdi, 4)
POP_GP_REGS
iretq
Code: Select all
void ktrap_kernel(int corenum, int trapno, int errcode, regpack64_t *regs);
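The "fiddling with the stack" in the trap entry can be modelled in C by treating the stack as an array of 64-bit slots. This is a sketch under one assumption taken from the `13*8` and `14*8` offsets in the assembly: that the push macro saves 13 registers, so slot 13 holds the trap number and slot 14 the error code. `fiddle_frame` and `trap_args` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SAVED_GP 13   /* assumption: the macro pushes 13 registers */

struct trap_args {
    uint64_t trapno;
    uint64_t errcode;
};

/* frame[0..12] : registers pushed by the macro
 * frame[13]    : trap number pushed by the stub
 * frame[14]    : error code pushed by the CPU
 * Pulls the two values out so they can be passed as parameters, then
 * reuses their slots to hold rsi and rdi so the frame becomes a
 * complete register save area. */
static struct trap_args fiddle_frame(uint64_t *frame, uint64_t rdi, uint64_t rsi)
{
    assert(frame != NULL);
    struct trap_args a = { frame[SAVED_GP], frame[SAVED_GP + 1] };
    frame[SAVED_GP + 1] = rdi;
    frame[SAVED_GP]     = rsi;
    return a;
}
```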
gerryg400 wrote: The basic idea comes from Minix.
So maybe I should have a closer look at Minix from time to time. I have one of Tanenbaum's older books somewhere around here with an early version of the Minix source code, but I always had the impression that it was a bit hard to read. I guess I'd better have a look at a more recent version...
gerryg400 wrote: No. It doesn't need to. A little fiddling with the stack and it can be passed as a parameter to the C code that handles traps.
I see. Probably something similar will work on 32-bit systems as well, simply by putting the error code onto the kernel stack instead of into rdx.
gerryg400 wrote: My interrupt entry looks like this. You see it jumps over the stack switch if it's already on the kernel stack.
Code: Select all
/* put core_id in edi */
GET_COREID %edi
/* If we came from the kernel our core_nest will be > 0 */
xorl %edx, %edx
cmpl $0, core_nest(, %rdi, 4)
jne intr_dont_switch
The fastest/easiest way for an interrupt handler to test if the interrupted code was running at CPL=0 or not would be:
Code: Select all
test byte [rsp+something],3 ;Does "return CS" on the stack have CPL bits set to zero?
je intr_dont_switch ; yes, skip the thread state save
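The same check can be written in C once the handler has a pointer to the frame the CPU pushed. A minimal sketch, assuming a 64-bit interrupt with no error code; `iret_frame64` and `came_from_user` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* What the CPU pushes on a 64-bit interrupt, lowest address first. */
struct iret_frame64 {
    uint64_t rip;
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;
    uint64_t ss;
};

static_assert(sizeof(struct iret_frame64) == 40, "five 8-byte slots");

/* The low two bits of the saved CS are the privilege level of the
 * interrupted code: zero means it was running at CPL=0. */
static bool came_from_user(const struct iret_frame64 *f)
{
    return (f->cs & 3) != 0;
}
```

This needs no per-core counter, but it only answers "was it CPL=0?", not how deeply the kernel is nested.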
gerryg400 wrote: I point the ring 0 stack in the TSS at the thread control block before I run a thread.
I like this idea too, and have been thinking of "borrowing" it for my OS. There are 2 things people should be careful about though.
Brendan wrote: The fastest/easiest way for an interrupt handler to test if the interrupted code was running at CPL=0 or not would be:
Code: Select all
test byte [rsp+something],3 ;Does "return CS" on the stack have CPL bits set to zero?
je intr_dont_switch ; yes, skip the thread state save
That's exactly what I implemented yesterday.
Brendan wrote: The other thing that worries me is race conditions caused by an NMI that occurs at exactly the wrong time (after CPL=3 code was interrupted by any other interrupt handler but before the first interrupt handler has finished saving the interrupted thread's state and switching to the kernel's stack). Without adequate work-arounds, the consequences of the race conditions may range from "severe" to "worse than severe", depending on the specific code being used for normal interrupt handlers and the NMI handler.
How could one solve this kind of problem? If I see this correctly, the problem is that the NMI handler would save some registers on the stack, which is in fact a small portion of the TCB, and it would thus trash some parts of the TCB which follow the register save area. So the problem is mainly caused by putting the stack into the TCB, where it becomes fatal if it grows more than expected. I guess the NMI handler needs to switch to its own stack immediately or at least detect that the current stack pointer is not safe for extensive pushes...
Brendan wrote: The other thing that worries me is race conditions caused by an NMI that occurs at exactly the wrong time (after CPL=3 code was interrupted by any other interrupt handler but before the first interrupt handler has finished saving the interrupted thread's state and switching to the kernel's stack). Without adequate work-arounds, the consequences of the race conditions may range from "severe" to "worse than severe", depending on the specific code being used for normal interrupt handlers and the NMI handler.
I think the conclusion to this is that, in protected mode, NMIs should always be routed through a task gate (Is it a task gate? My PM memory is weak), or in long mode, through the Interrupt Stack Table.
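For the long mode case, the stack is selected by the IST field of the IDT gate. The following is a sketch of building such a gate in C, assuming the standard 16-byte long mode gate layout; `idt_gate64` and `make_gate` are hypothetical names, and the code selector and handler address in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* 16-byte long mode interrupt gate descriptor. */
struct idt_gate64 {
    uint16_t offset_low;    /* handler address bits 0..15 */
    uint16_t selector;      /* kernel code segment selector */
    uint8_t  ist;           /* bits 0..2: IST index, 0 = no stack switch via IST */
    uint8_t  type_attr;     /* present bit, DPL, gate type */
    uint16_t offset_mid;    /* handler address bits 16..31 */
    uint32_t offset_high;   /* handler address bits 32..63 */
    uint32_t reserved;
};

static struct idt_gate64 make_gate(uint64_t handler, uint16_t selector, uint8_t ist)
{
    assert(ist <= 7);
    struct idt_gate64 g = {
        .offset_low  = (uint16_t)(handler & 0xFFFF),
        .selector    = selector,
        .ist         = ist,
        .type_attr   = 0x8E,   /* present, DPL=0, 64-bit interrupt gate */
        .offset_mid  = (uint16_t)((handler >> 16) & 0xFFFF),
        .offset_high = (uint32_t)(handler >> 32),
        .reserved    = 0,
    };
    return g;
}
```

With something like `idt[2] = make_gate(nmi_entry_address, 0x08, 1)` the CPU would load RSP from IST1 in the TSS on every NMI, whatever stack was active when it arrived.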
gerryg400 wrote: Is it possible to return from an NMI without enabling further NMIs?
Yes, just don't use IRET, use RETF, etc...
gerryg400 wrote: Is it possible to return from an NMI without enabling further NMIs?
cyr1x wrote: Yes, just don't use IRET, use RETF, etc...
Hmm, of course. Fix up the stack frame and just do a RET. Then later on, after the NMI is processed, rebuild the stack frame and do an IRET.
On the other hand, one might consider doing an iret to the 'real' NMI handler and putting some NMI-reentrancy-protection code around it. This allows for further exceptions/interrupts, as you may want to put breakpoints for debugging the NMI code or handle other interrupts while in an NMI.
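One possible shape for that reentrancy protection, sketched with C11 atomics. Everything here is an assumption rather than anyone's actual kernel code: `nmi_try_enter` and `nmi_leave` are hypothetical names, and a real kernel would keep the two flags per CPU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag nmi_active = ATOMIC_FLAG_INIT;  /* handler currently running */
static atomic_bool nmi_pending;                    /* an NMI arrived meanwhile */

/* Returns true if the caller may run the real NMI handler. If a handler
 * is already running, the new NMI is only recorded as pending. */
static bool nmi_try_enter(void)
{
    if (atomic_flag_test_and_set(&nmi_active)) {
        atomic_store(&nmi_pending, true);
        return false;
    }
    return true;
}

/* Releases the guard. Returns true if another NMI arrived while the
 * handler was running, so the caller should try to enter again. */
static bool nmi_leave(void)
{
    bool was_active = atomic_flag_test_and_set(&nmi_active);
    assert(was_active);            /* leaving without entering is a bug */
    (void)was_active;
    atomic_flag_clear(&nmi_active);
    return atomic_exchange(&nmi_pending, false);
}
```

The 'real' handler would then run in a loop of the form "enter, handle, leave, and repeat while leave reports a pending NMI and enter succeeds".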
Code: Select all
stack+0 came from userspace
stack+KSIZE kernel nested level 0
stack+2*KSIZE kernel nested level 1
stack+3*KSIZE kernel nested level 2
stack+4*KSIZE kernel nested level 3
...nested level n
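Read as a formula, the layout above gives each nesting level its own `KSIZE`-sized region. A small sketch, where the value of `KSIZE`, the limit `MAX_NEST` and the function name are assumptions; the post does not say whether each offset marks the base or the top of its region, so the function only returns the offset listed in the table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define KSIZE    4096u   /* hypothetical size of one region */
#define MAX_NEST 8u      /* hypothetical limit on kernel nesting */

/* Offset from `stack` of the region used for an interrupt, following the
 * table: region 0 when arriving from userspace, region level+1 when the
 * kernel was already at nesting level `level`. */
static size_t stack_region_offset(bool from_user, unsigned level)
{
    assert(level < MAX_NEST);
    return from_user ? 0 : ((size_t)level + 1) * KSIZE;
}
```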
Brendan wrote: The other thing that worries me is race conditions caused by an NMI that occurs at exactly the wrong time (after CPL=3 code was interrupted by any other interrupt handler but before the first interrupt handler has finished saving the interrupted thread's state and switching to the kernel's stack). Without adequate work-arounds, the consequences of the race conditions may range from "severe" to "worse than severe", depending on the specific code being used for normal interrupt handlers and the NMI handler.
XenOS wrote: How could one solve this kind of problem? If I see this correctly, the problem is that the NMI handler would save some registers on the stack, which is in fact a small portion of the TCB, and it would thus trash some parts of the TCB which follow the register save area. So the problem is mainly caused by putting the stack into the TCB, where it becomes fatal if it grows more than expected. I guess the NMI handler needs to switch to its own stack immediately or at least detect that the current stack pointer is not safe for extensive pushes...
I've been thinking about it. I've decided/realised that the problems don't just affect NMI - they affect any exception that could occur at exactly the wrong time, which could include the machine check exception and possibly others.
Code: Select all
interruptHandler:
push dword 0 ;Dummy error code (remove for some exception handlers)
test byte [esp+8],3 ;Was interrupted code running at CPL=0? ("return CS" is at +8 after the error code push)
je .isCPL0 ; yes
;Save remaining CPL=3 state
pushad
;Make sure kernel data segments are being used
mov eax,KERNEL_DATA
mov ds,eax
mov es,eax
;Switch to the kernel's stack
mov esp,[gs:CPUinfo.kernelStack]
;Enable interrupts (assume interrupt handler uses an "interrupt gate")
sti
;Call the real interrupt handler, C calling conventions for "void realInterruptHandler(uint32_t interruptNumber)"
push 123 ;Assume this interrupt handler is for interrupt number 123 (!)
call realInterruptHandler
add esp,4
;Disable interrupts
cli
;Restore CPL3 state (may be returning to a different thread)
mov edx,cr3
mov eax,[gs:CPUinfo.currentTCB] ;eax = address of TCB for this thread
cmp edx,[eax+TCB.userCR3] ;Is virtual address space switch needed?
je .l1 ; no, skip it
mov edx,[eax+TCB.userCR3]
mov cr3,edx
.l1:
mov edi,[gs:CPUinfo.TSSaddress] ;edi = address of TSS for this CPU
lea ebx,[eax+offsetForESP0] ;ebx = value to put in TSS for this thread
mov [edi+TSS.ESP0],ebx ;Set ESP0 in TSS
lea esp,[eax+offsetForReturnData] ;Switch to TCB stack to POP return values
popad
add esp,4 ;Remove error code
iretd
.isCPL0:
push dword [esp+12] ;Push "return EFLAGS" on the stack again (error code, EIP, CS are below it)
popfd ;Restore the interrupted code's "interrupt enable" flag
pushad
push 123 ;Assume this interrupt handler is for interrupt number 123 (!)
call realInterruptHandler
add esp,4
popad
cli
add esp,4 ;Remove error code
iretd
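The handler above relies on ESP0 pointing just past a register save area inside the TCB, so that the CPU's own pushes, the error code and `pushad` fill it from the top down. The layout can be sketched in C; the struct and field names are hypothetical, and only the two offsets correspond to `offsetForReturnData` and `offsetForESP0` in the assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Save area as it sits in memory, lowest address first. */
struct cpl3_save_area {
    /* written by pushad (EDI ends up at the lowest address) */
    uint32_t edi, esi, ebp, esp_ignored, ebx, edx, ecx, eax;
    /* real or dummy error code */
    uint32_t error_code;
    /* pushed by the CPU on an interrupt from CPL=3 */
    uint32_t eip, cs, eflags, user_esp, user_ss;
};

static_assert(sizeof(struct cpl3_save_area) == 14 * 4, "14 dwords, no padding");

struct tcb {
    uint32_t user_cr3;            /* hypothetical other TCB fields */
    struct cpl3_save_area save;
};

enum {
    /* Where ESP must point before "popad / add esp,4 / iretd". */
    OFFSET_FOR_RETURN_DATA = offsetof(struct tcb, save),
    /* Value for TSS.ESP0: one past the end of the save area. */
    OFFSET_FOR_ESP0 = offsetof(struct tcb, save) + sizeof(struct cpl3_save_area)
};
```

Anything that pushes more than these 14 dwords while ESP still points into the TCB writes below `save`, which is the corruption risk discussed earlier in the thread.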
Code: Select all
interruptHandler:
push dword 0 ;Dummy error code (remove for some exception handlers)
test byte [esp+8],3 ;Was interrupted code running at CPL=0?
je .isCPL0 ; yes
mov [gs:CPUinfo.tempUserEAX],eax ;Store EAX somewhere safe
mov eax,KERNEL_DATA ;Make sure DS and ES are correct
mov ds,eax
mov es,eax
mov [gs:CPUinfo.tempUserEBX],ebx ;Store EBX somewhere safe
mov eax,[gs:CPUinfo.currentTCB] ;eax = address of interrupted thread's TCB
mov ebx,[esp+4] ;ebx = "return EIP"
mov [eax+TCB.userEIP],ebx
mov ebx,[esp+16] ;ebx = "return ESP" (+12 is "return EFLAGS")
mov [eax+TCB.userESP],ebx
mov ebx,[gs:CPUinfo.tempUserEAX] ;ebx = user EAX
mov [eax+TCB.userEAX],ebx
mov ebx,[gs:CPUinfo.tempUserEBX] ;ebx = user EBX
mov [eax+TCB.userEBX],ebx
mov [eax+TCB.userECX],ecx
mov [eax+TCB.userEDX],edx
mov [eax+TCB.userEBP],ebp
mov [eax+TCB.userESI],esi
mov [eax+TCB.userEDI],edi
;Enable interrupts (assume interrupt handler uses an "interrupt gate")
sti
;Call the real interrupt handler, C calling conventions for "void realInterruptHandler(uint32_t interruptNumber)"
push 123 ;Assume this interrupt handler is for interrupt number 123 (!)
call realInterruptHandler
add esp,4
;Disable interrupts
cli
;Restore CPL3 state (may be returning to a different thread)
mov edx,cr3
mov eax,[gs:CPUinfo.currentTCB] ;eax = address of TCB for this thread
cmp edx,[eax+TCB.userCR3] ;Is virtual address space switch needed?
je .l1 ; no, skip it
mov edx,[eax+TCB.userCR3]
mov cr3,edx
.l1:
mov ebx,[eax+TCB.userEIP]
mov ecx,[eax+TCB.userESP]
mov [esp+4],ebx ;Set "return EIP"
mov [esp+16],ecx ;Set "return ESP"
mov ebx,[eax+TCB.userEBX]
mov ecx,[eax+TCB.userECX]
mov edx,[eax+TCB.userEDX]
mov ebp,[eax+TCB.userEBP]
mov esi,[eax+TCB.userESI]
mov edi,[eax+TCB.userEDI]
mov eax,[eax+TCB.userEAX]
add esp,4 ;Remove error code
iretd
.isCPL0:
push dword [esp+12] ;Push "return EFLAGS" on the stack again (error code, EIP, CS are below it)
popfd ;Restore the interrupted code's "interrupt enable" flag
pushad
mov eax,KERNEL_DATA
mov ds,eax
mov es,eax
push 123 ;Assume this interrupt handler is for interrupt number 123 (!)
call realInterruptHandler
add esp,4
popad
add esp,4 ;Remove error code
iretd