Page 1 of 4

Issues moving to the "one kernel stack per CPU" approach

Posted: Wed Apr 04, 2012 9:40 am
by xenos
After going through this discussion for a while I decided to implement the "one kernel stack per CPU" approach in my kernel. (Currently I use "one kernel stack per thread".) The main reason is to avoid the necessity of stack switches in the kernel, and to leave remnants of stack traces from nested function calls on a thread's kernel stack. It appears much clearer to me if a thread completely surrenders its identity whenever the kernel is entered. The thread's state is saved to some kind of thread control block (TCB) and "the kernel" is executed, nothing else. When the kernel has finished its job and returns, either the same or a diffent thread is chosen for execution, and its state is restored from the TCB. This makes things like IPC very simple: The sending thread invokes some IPC system call, and within this call control is transferred to the receiving thread (which may just have been unblocked).

Currently I'm having a few issues with the transition to the "one kernel stack per CPU" approach, so I'd like to hear some ideas and opinions:

CPU idling
Currently I have a few kernel threads for CPU idling, one for each CPU. Whenever the thread queue is empty and there is no work to do, a CPU executes its idle thread. The only thing the idle thread does is "sti; hlt;" in an endless loop. The latter is the reason why this thread is running in ring 0, since "hlt" is a privileged instruction. However, the "one kernel stack per CPU" approach does not allow for kernel threads, or at least it needs some hacking like per-thread stacks for kernel threads, and per-cpu stacks for user threads, but I'm trying to avoid this. So my current idea if there is no available thread is to simply not return to any thread, and directly run into a "sti; hlt;" sequence, which will be interrupted (or rather aborted, since I won't return to whatever comes after the "hlt") by the next timer tick or IRQ.

Thread state saving
Whenever the kernel is entered, at least some part of the threads state (i.e., registers) must be saved in the TCB. In my current per-thread stack approach I simply save the general purpose registers on the stack (in an ASM stub), and the FPU / MMX / SSE / AVX state is saved in the TCB only if necessary (using some inline ASM). In the new per-cpu stack approach I would prefer to save (at least some of) the general purpose registers to the TCB in my ASM stub. The TCB is defined as a rather complicated C++ class - the x86 TCB containing the complete x86 register set is derived from a generic TCB containing things like thread ID, time consumption and some other stuff, which is likely to change in the future. This means that also the offset of the x86 registers within my TCB is likely to change. The puzzling question is: How can I tell my ASM stub where to save the general purpose registers, when all I have is a pointer to the current TCB? I try to avoid hardcoding the offset into my ASM stub ("movl $curTCB, %eax; movl %edx, $0x48(%eax);" etc.) because if I change the layout of my TCB, I have to change all these hardcoded offsets as well.

Re: Issues moving to the "one kernel stack per CPU" approach

Posted: Wed Apr 04, 2012 10:36 am
by bluemoon
XenOS wrote:I try to avoid hardcoding the offset into my ASM stub ("movl $curTCB, %eax; movl %edx, $0x48(%eax);" etc.)
You can use struct in inline assembly or plain assembly.

Re: Issues moving to the "one kernel stack per CPU" approach

Posted: Wed Apr 04, 2012 11:18 am
by xenos
bluemoon wrote:You can use struct in inline assembly or plain assembly.
That's what I was thinking about, but somehow I couldn't figure out how to access the members of a C struct (or maybe a C++ class?) by their names from plain assembly... This is certainly one of the "stupid questions", though ;)

Re: Issues moving to the "one kernel stack per CPU" approach

Posted: Wed Apr 04, 2012 11:28 am
by bluemoon
You need to write in syntax that is recognized by your assembler. Do it by hand, or get some "header file convertor" and tweak the Makefile for automation.

For NASM, this is my FAT boot code:

Code: Select all

; ----------------------------------------------
struc BPB                 ; BIOS Paramater Block (At Boot Record)
    resb  3               ; JMP
    BPB_name    resb  8   ; Volume OEM Name eg. "mkdosfs"
    BPB_bps     resw  1   ; Bytes per Sector
    BPB_spc     resb  1   ; Sectors per Cluster
    BPB_res     resw  1   ; Reserved sector before 1st FAT
    BPB_nfat    resb  1   ; number of FATs
    BPB_maxroot resw  1   ; max number of root entry
    BPB_total1  resw  1   ; total sectors, if 0 use total2
    BPB_media   resb  1   ; media descriptor, F8 for HDD
    BPB_spf     resw  1   ; sectors per FAT
    BPB_spt     resw  1   ; sectors per track (63)
    BPB_head    resw  1   ; num of head
    BPB_lba     resd  1   ; starting sector of this vol
    BPB_total2  resd  1   ; total sectors if total1 == 0
endstruc

vbr:
    jmp	short start
    nop
    TIMES   0x3B  db  0     ; BIOS PARA BLOCK
start:
    ...
    mov     eax, [vbr + BPB_lba]    ; eax = LBA of volume

Re: Issues moving to the "one kernel stack per CPU" approach

Posted: Wed Apr 04, 2012 11:31 am
by Synon
XenOS wrote:
bluemoon wrote:You can use struct in inline assembly or plain assembly.
That's what I was thinking about, but somehow I couldn't figure out how to access the members of a C struct (or maybe a C++ class?) by their names from plain assembly... This is certainly one of the "stupid questions", though ;)
Push the struct onto the stack and then use hardcoded offsets to access the members (that's how gcc does it, at least).

For example, the code

Code: Select all

struct vector3 {
	int x, y, z;
};

int main()
{
	struct vector3 my_vector;
	my_vector.x = my_vector.y = my_vector.z = 0;
	return 0;
}
becomes

Code: Select all

	.file	"tmp.c"
	.def	___main;	.scl	2;	.type	32;	.endef
	.text
.globl _main
	.def	_main;	.scl	2;	.type	32;	.endef
_main:
	pushl	%ebp
	movl	%esp, %ebp
	andl	$-16, %esp
	subl	$16, %esp
	call	___main
	movl	$0, 12(%esp) # my_vector.z = 0
	movl	12(%esp), %eax
	movl	%eax, 8(%esp) # my_vector.y = my_vector.z
	movl	8(%esp), %eax
	movl	%eax, 4(%esp) # my_vector.x = my_vector.y
	movl	$0, %eax
	leave
	ret

Re: Issues moving to the "one kernel stack per CPU" approach

Posted: Wed Apr 04, 2012 11:42 am
by xenos
bluemoon wrote:You need to write in syntax that is recognized by your assembler. Do it by hand, or get some "header file convertor" and tweak the Makefile for automation.
Of course, that's a possibility. The main disadvantage I see here at the moment is the fact that I need to define this ASM struct for each supported target architecture, and if I change something in the generic TCB struct, I need to apply this change in all ASM struct definitions. And of course, I need to know the exact offsets to "clone" this struct, which may depend on the compiler's choice of alignment (unless I pack all structures, which may have disadvantages on architectures like ARM or MIPS).
Synon wrote:Push the struct onto the stack and then use hardcoded offsets to access the members (that's how gcc does it, at least).
Hardcoding the offsets is exactly what I'm trying to avoid, since these offsets are likely to change in the future - see my first post.

Re: Issues moving to the "one kernel stack per CPU" approach

Posted: Wed Apr 04, 2012 11:42 am
by bluemoon
I never needed to access structure in inline assembly yet for my OS project, but this compile nicely with gcc:

Code: Select all

struct FOO {
    int x, y;
};

void bar(struct FOO *f) {
    __asm ( "mov eax, %0\n"
            "inc eax\n"
            "mov %1, eax"
           : "=m"(f->y)
           : "m"(f->x)
           : "eax"
    );
}

Re: Issues moving to the "one kernel stack per CPU" approach

Posted: Wed Apr 04, 2012 11:47 am
by Synon
XenOS wrote:
Synon wrote:Push the struct onto the stack and then use hardcoded offsets to access the members (that's how gcc does it, at least).
Hardcoding the offsets is exactly what I'm trying to avoid, since these offsets are likely to change in the future - see my first post.
Well, bluemoon's solution works, but you said you wanted to have it in pure assembly rather than inline... the only other things I can think of are to use macros, pass the struct as its constituent parts (so in my struct vector3 example, you would pass x, y and z as separate parameters) or you could define the struct in C and then create the object in assembly*. Sadly all three require you to change your code in more than one place when you change the contents of the structure.

* like this:

Code: Select all

struct foo { int a, b, c; };

int foobar(struct foo* bar);
{
        bar->c = bar->a + bar->b;
        return 0;
}

Code: Select all

struct_foo_bar:
        .a:        dd        0
        .b:        dd        0
        .c:        dd        0

barfoo:
        mov        ebx,       struct_foo_bar
        push       ebx
        call       foobar
        ret

Re: Issues moving to the "one kernel stack per CPU" approach

Posted: Wed Apr 04, 2012 11:49 am
by bluemoon
Synon wrote:Sadly all three require you to change your code in more than one place when you change the contents of the structure.
That's why I mentioned Makefile and automation :o

Re: Issues moving to the "one kernel stack per CPU" approach

Posted: Wed Apr 04, 2012 12:53 pm
by xenos
Indeed, the inline assembly example works fine. The reason why I'd rather use pure assembly is because this should be an interrupt stub, so I cannot simply wrap my code in a C function.

Another thing that came to my mind is creating something like a "struct RegisterSet", which contains fixed offsets for all registers, and to save a pointer to this struct within the TCB instead of a pointer to the whole TCB. This would allow me to save the registers in some assembly stub, jump to the C / C++ code and compute the pointer to the start of the TCB using something like the offsetof operator.

Thanks for all suggestions!

Re: Issues moving to the "one kernel stack per CPU" approach

Posted: Wed Apr 04, 2012 1:11 pm
by bluemoon
If you keep your class as plain old data, you can just use header convertor to translate the class into assembly include file with the structure and use it directly, without extra pointer lookup.

Re: Issues moving to the "one kernel stack per CPU" approach

Posted: Wed Apr 04, 2012 1:25 pm
by xenos
Well, unfortunately my x86 TCB class is derived from a generic TCB base class, so it's not a POD class...

Re: Issues moving to the "one kernel stack per CPU" approach

Posted: Wed Apr 04, 2012 1:39 pm
by Brendan
Hi,

For CPU idling, do you need to use a thread at all? When the kernel returns to user space it'd determine which thread to return to and then load that thread's state. If it determines there is no thread to return to, can it just wait until there is?
bluemoon wrote:If you keep your class as plain old data, you can just use header convertor to translate the class into assembly include file with the structure and use it directly, without extra pointer lookup.
Sounds like a good idea for cache locality too. Maybe split the complicated C++ class into several separate Plain_old_data_structures according to how and when the data is used - e.g. so you don't get extra cache misses because someone decided to store the (hardly ever used) thread name string in the middle of the (frequently used) TCB.
XenOS wrote:Well, unfortunately my x86 TCB class is derived from a generic TCB base class, so it's not a POD class...
I'm fairly sure C++ can handle something like:

Code: Select all

struct THREAD_STATE {
#ifdef TARGET_80x86
    uint32_t regEAX;
    uint32_t regEBX;
    uint32_t regECX;
#elseifdef TARGET_ARM
    uint32_t reg0;
    uint32_t reg1;
#endif
}

class TCB {
    private:
        struct THREAD_STATE *threadState;
}
Do you need to derive from the generic TCB base class at all?


Cheers,

Brendan

Re: Issues moving to the "one kernel stack per CPU" approach

Posted: Wed Apr 04, 2012 2:22 pm
by xenos
Brendan wrote:For CPU idling, do you need to use a thread at all? When the kernel returns to user space it'd determine which thread to return to and then load that thread's state. If it determines there is no thread to return to, can it just wait until there is?
Indeed, that's what I ended up with. If there is no thread to run, the kernel will simply wait for some (timer or other hardware) interrupt, which may in turn unblock some waiting thread.
Sounds like a good idea for cache locality too. Maybe split the complicated C++ class into several separate Plain_old_data_structures according to how and when the data is used - e.g. so you don't get extra cache misses because someone decided to store the (hardly ever used) thread name string in the middle of the (frequently used) TCB.
Good point. I think I'll try something like that.
Do you need to derive from the generic TCB base class at all?
Probably not. I guess I will use some variation of your suggestion:

kernel/ThreadState.h

Code: Select all

struct THREAD_STATE;

class TCB {
    private:
        struct THREAD_STATE *threadState;
};
kernel/arch/x86/x32/ThreadState.h

Code: Select all

struct THREAD_STATE {
    uint32_t eax;
    uint32_t ebx;
};
kernel/arch/x86/x64/ThreadState.h

Code: Select all

struct THREAD_STATE {
    uint64_t rax;
    uint64_t rbx;
};
The latter two files will be included only by architecture specific source files that actually deal with the contents of a THREAD_STATE. All architecture independent source code doesn't need to know what a THREAD_STATE looks like and includes only the first header file.

Re: Issues moving to the "one kernel stack per CPU" approach

Posted: Wed Apr 04, 2012 3:59 pm
by gerryg400
Xenos, I have one kernel stack per core and a very simple solution to the problem (I think).

I point the ring 0 stack in the TSS at the thread control block before I run a thread.

Like this

Code: Select all

    /* Fixup the TSS */
    tss[core_id]->rsp0 = (uint64_t)&curr_thread[core_id]->stk0top;
That way, the first 5 registers are pushed directly into the TCB by the processor. Then my asm stub pushes the rest of the regs directly into the TCB. Finally when all that's done I manually switch to the kernel stack.

Here's a asm stub

Code: Select all

intr_0x20:
        /* Move all the regs into the thread structure */
        pushq   %rdi
        pushq   %rsi
        PUSH_GP_REGS_EXCEPT_RDI_RSI

        /* put core_id in edi */
        GET_COREID %edi

        /* Switch to kernel stack */
        movq    kstack(, %rdi, 8), %rsp
        incl    core_nest(, %rdi, 4)

        /* Set the current core state */
        movq    $1, core_state(, %rdi, 4)

        /* Go to the c code now */
        cld
        sti
        call    ksyscall

        jmp     ret_to_user