Page 1 of 4
Issues moving to the "one kernel stack per CPU" approach
Posted: Wed Apr 04, 2012 9:40 am
by xenos
After going through
this discussion for a while I decided to implement the "one kernel stack per CPU" approach in my kernel. (Currently I use "one kernel stack per thread".) The main reason is to avoid the necessity of stack switches in the kernel, and to leave remnants of stack traces from nested function calls on a thread's kernel stack. It appears much clearer to me if a thread completely surrenders its identity whenever the kernel is entered. The thread's state is saved to some kind of thread control block (TCB) and "the kernel" is executed, nothing else. When the kernel has finished its job and returns, either the same or a diffent thread is chosen for execution, and its state is restored from the TCB. This makes things like IPC very simple: The sending thread invokes some IPC system call, and within this call control is transferred to the receiving thread (which may just have been unblocked).
Currently I'm having a few issues with the transition to the "one kernel stack per CPU" approach, so I'd like to hear some ideas and opinions:
CPU idling
Currently I have a few kernel threads for CPU idling, one for each CPU. Whenever the thread queue is empty and there is no work to do, a CPU executes its idle thread. The only thing the idle thread does is "sti; hlt;" in an endless loop. The latter is the reason why this thread is running in ring 0, since "hlt" is a privileged instruction. However, the "one kernel stack per CPU" approach does not allow for kernel threads, or at least it needs some hacking like per-thread stacks for kernel threads, and per-cpu stacks for user threads, but I'm trying to avoid this. So my current idea if there is no available thread is to simply not return to any thread, and directly run into a "sti; hlt;" sequence, which will be interrupted (or rather aborted, since I won't return to whatever comes after the "hlt") by the next timer tick or IRQ.
Thread state saving
Whenever the kernel is entered, at least some part of the threads state (i.e., registers) must be saved in the TCB. In my current per-thread stack approach I simply save the general purpose registers on the stack (in an ASM stub), and the FPU / MMX / SSE / AVX state is saved in the TCB only if necessary (using some inline ASM). In the new per-cpu stack approach I would prefer to save (at least some of) the general purpose registers to the TCB in my ASM stub. The TCB is defined as a rather complicated C++ class - the x86 TCB containing the complete x86 register set is derived from a generic TCB containing things like thread ID, time consumption and some other stuff, which is likely to change in the future. This means that also the offset of the x86 registers within my TCB is likely to change. The puzzling question is: How can I tell my ASM stub where to save the general purpose registers, when all I have is a pointer to the current TCB? I try to avoid hardcoding the offset into my ASM stub ("movl $curTCB, %eax; movl %edx, $0x48(%eax);" etc.) because if I change the layout of my TCB, I have to change all these hardcoded offsets as well.
Re: Issues moving to the "one kernel stack per CPU" approach
Posted: Wed Apr 04, 2012 10:36 am
by bluemoon
XenOS wrote:I try to avoid hardcoding the offset into my ASM stub ("movl $curTCB, %eax; movl %edx, $0x48(%eax);" etc.)
You can use struct in inline assembly or plain assembly.
Re: Issues moving to the "one kernel stack per CPU" approach
Posted: Wed Apr 04, 2012 11:18 am
by xenos
bluemoon wrote:You can use struct in inline assembly or plain assembly.
That's what I was thinking about, but somehow I couldn't figure out how to access the members of a C struct (or maybe a C++ class?) by their names from plain assembly... This is certainly one of the "stupid questions", though
Re: Issues moving to the "one kernel stack per CPU" approach
Posted: Wed Apr 04, 2012 11:28 am
by bluemoon
You need to write in syntax that is recognized by your assembler. Do it by hand, or get some "header file convertor" and tweak the Makefile for automation.
For NASM, this is my FAT boot code:
Code: Select all
; ----------------------------------------------
struc BPB ; BIOS Paramater Block (At Boot Record)
resb 3 ; JMP
BPB_name resb 8 ; Volume OEM Name eg. "mkdosfs"
BPB_bps resw 1 ; Bytes per Sector
BPB_spc resb 1 ; Sectors per Cluster
BPB_res resw 1 ; Reserved sector before 1st FAT
BPB_nfat resb 1 ; number of FATs
BPB_maxroot resw 1 ; max number of root entry
BPB_total1 resw 1 ; total sectors, if 0 use total2
BPB_media resb 1 ; media descriptor, F8 for HDD
BPB_spf resw 1 ; sectors per FAT
BPB_spt resw 1 ; sectors per track (63)
BPB_head resw 1 ; num of head
BPB_lba resd 1 ; starting sector of this vol
BPB_total2 resd 1 ; total sectors if total1 == 0
endstruc
vbr:
jmp short start
nop
TIMES 0x3B db 0 ; BIOS PARA BLOCK
start:
...
mov eax, [vbr + BPB_lba] ; eax = LBA of volume
Re: Issues moving to the "one kernel stack per CPU" approach
Posted: Wed Apr 04, 2012 11:31 am
by Synon
XenOS wrote:bluemoon wrote:You can use struct in inline assembly or plain assembly.
That's what I was thinking about, but somehow I couldn't figure out how to access the members of a C struct (or maybe a C++ class?) by their names from plain assembly... This is certainly one of the "stupid questions", though
Push the struct onto the stack and then use hardcoded offsets to access the members (that's how gcc does it, at least).
For example, the code
Code: Select all
struct vector3 {
int x, y, z;
};
int main()
{
struct vector3 my_vector;
my_vector.x = my_vector.y = my_vector.z = 0;
return 0;
}
becomes
Code: Select all
.file "tmp.c"
.def ___main; .scl 2; .type 32; .endef
.text
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $16, %esp
call ___main
movl $0, 12(%esp) # my_vector.z = 0
movl 12(%esp), %eax
movl %eax, 8(%esp) # my_vector.y = my_vector.z
movl 8(%esp), %eax
movl %eax, 4(%esp) # my_vector.x = my_vector.y
movl $0, %eax
leave
ret
Re: Issues moving to the "one kernel stack per CPU" approach
Posted: Wed Apr 04, 2012 11:42 am
by xenos
bluemoon wrote:You need to write in syntax that is recognized by your assembler. Do it by hand, or get some "header file convertor" and tweak the Makefile for automation.
Of course, that's a possibility. The main disadvantage I see here at the moment is the fact that I need to define this ASM struct for each supported target architecture, and if I change something in the generic TCB struct, I need to apply this change in all ASM struct definitions. And of course, I need to know the exact offsets to "clone" this struct, which may depend on the compiler's choice of alignment (unless I pack all structures, which may have disadvantages on architectures like ARM or MIPS).
Synon wrote:Push the struct onto the stack and then use hardcoded offsets to access the members (that's how gcc does it, at least).
Hardcoding the offsets is exactly what I'm trying to avoid, since these offsets are likely to change in the future - see my first post.
Re: Issues moving to the "one kernel stack per CPU" approach
Posted: Wed Apr 04, 2012 11:42 am
by bluemoon
I never needed to access structure in inline assembly yet for my OS project, but this compile nicely with gcc:
Code: Select all
struct FOO {
int x, y;
};
void bar(struct FOO *f) {
__asm ( "mov eax, %0\n"
"inc eax\n"
"mov %1, eax"
: "=m"(f->y)
: "m"(f->x)
: "eax"
);
}
Re: Issues moving to the "one kernel stack per CPU" approach
Posted: Wed Apr 04, 2012 11:47 am
by Synon
XenOS wrote:Synon wrote:Push the struct onto the stack and then use hardcoded offsets to access the members (that's how gcc does it, at least).
Hardcoding the offsets is exactly what I'm trying to avoid, since these offsets are likely to change in the future - see my first post.
Well, bluemoon's solution works, but you said you wanted to have it in pure assembly rather than inline... the only other things I can think of are to use macros, pass the struct as its constituent parts (so in my struct vector3 example, you would pass x, y and z as separate parameters) or you could define the struct in C and then create the object in assembly*. Sadly all three require you to change your code in more than one place when you change the contents of the structure.
* like this:
Code: Select all
struct foo { int a, b, c; };
int foobar(struct foo* bar);
{
bar->c = bar->a + bar->b;
return 0;
}
Code: Select all
struct_foo_bar:
.a: dd 0
.b: dd 0
.c: dd 0
barfoo:
mov ebx, struct_foo_bar
push ebx
call foobar
ret
Re: Issues moving to the "one kernel stack per CPU" approach
Posted: Wed Apr 04, 2012 11:49 am
by bluemoon
Synon wrote:Sadly all three require you to change your code in more than one place when you change the contents of the structure.
That's why I mentioned Makefile and automation
Re: Issues moving to the "one kernel stack per CPU" approach
Posted: Wed Apr 04, 2012 12:53 pm
by xenos
Indeed, the inline assembly example works fine. The reason why I'd rather use pure assembly is because this should be an interrupt stub, so I cannot simply wrap my code in a C function.
Another thing that came to my mind is creating something like a "struct RegisterSet", which contains fixed offsets for all registers, and to save a pointer to this struct within the TCB instead of a pointer to the whole TCB. This would allow me to save the registers in some assembly stub, jump to the C / C++ code and compute the pointer to the start of the TCB using something like the offsetof operator.
Thanks for all suggestions!
Re: Issues moving to the "one kernel stack per CPU" approach
Posted: Wed Apr 04, 2012 1:11 pm
by bluemoon
If you keep your class as
plain old data, you can just use header convertor to translate the class into assembly include file with the structure and use it directly, without extra pointer lookup.
Re: Issues moving to the "one kernel stack per CPU" approach
Posted: Wed Apr 04, 2012 1:25 pm
by xenos
Well, unfortunately my x86 TCB class is derived from a generic TCB base class, so it's not a POD class...
Re: Issues moving to the "one kernel stack per CPU" approach
Posted: Wed Apr 04, 2012 1:39 pm
by Brendan
Hi,
For CPU idling, do you need to use a thread at all? When the kernel returns to user space it'd determine which thread to return to and then load that thread's state. If it determines there is no thread to return to, can it just wait until there is?
bluemoon wrote:If you keep your class as
plain old data, you can just use header convertor to translate the class into assembly include file with the structure and use it directly, without extra pointer lookup.
Sounds like a good idea for cache locality too. Maybe split the complicated C++ class into several separate Plain_old_data_structures according to how and when the data is used - e.g. so you don't get extra cache misses because someone decided to store the (hardly ever used) thread name string in the middle of the (frequently used) TCB.
XenOS wrote:Well, unfortunately my x86 TCB class is derived from a generic TCB base class, so it's not a POD class...
I'm fairly sure C++ can handle something like:
Code: Select all
struct THREAD_STATE {
#ifdef TARGET_80x86
uint32_t regEAX;
uint32_t regEBX;
uint32_t regECX;
#elseifdef TARGET_ARM
uint32_t reg0;
uint32_t reg1;
#endif
}
class TCB {
private:
struct THREAD_STATE *threadState;
}
Do you need to derive from the generic TCB base class at all?
Cheers,
Brendan
Re: Issues moving to the "one kernel stack per CPU" approach
Posted: Wed Apr 04, 2012 2:22 pm
by xenos
Brendan wrote:For CPU idling, do you need to use a thread at all? When the kernel returns to user space it'd determine which thread to return to and then load that thread's state. If it determines there is no thread to return to, can it just wait until there is?
Indeed, that's what I ended up with. If there is no thread to run, the kernel will simply wait for some (timer or other hardware) interrupt, which may in turn unblock some waiting thread.
Sounds like a good idea for cache locality too. Maybe split the complicated C++ class into several separate Plain_old_data_structures according to how and when the data is used - e.g. so you don't get extra cache misses because someone decided to store the (hardly ever used) thread name string in the middle of the (frequently used) TCB.
Good point. I think I'll try something like that.
Do you need to derive from the generic TCB base class at all?
Probably not. I guess I will use some variation of your suggestion:
kernel/ThreadState.h
Code: Select all
struct THREAD_STATE;
class TCB {
private:
struct THREAD_STATE *threadState;
};
kernel/arch/x86/x32/ThreadState.h
Code: Select all
struct THREAD_STATE {
uint32_t eax;
uint32_t ebx;
};
kernel/arch/x86/x64/ThreadState.h
Code: Select all
struct THREAD_STATE {
uint64_t rax;
uint64_t rbx;
};
The latter two files will be included only by architecture specific source files that actually deal with the contents of a THREAD_STATE. All architecture independent source code doesn't need to know what a THREAD_STATE looks like and includes only the first header file.
Re: Issues moving to the "one kernel stack per CPU" approach
Posted: Wed Apr 04, 2012 3:59 pm
by gerryg400
Xenos, I have one kernel stack per core and a very simple solution to the problem (I think).
I point the ring 0 stack in the TSS at the thread control block before I run a thread.
Like this
Code: Select all
/* Fixup the TSS */
tss[core_id]->rsp0 = (uint64_t)&curr_thread[core_id]->stk0top;
That way, the first 5 registers are pushed directly into the TCB by the processor. Then my asm stub pushes the rest of the regs directly into the TCB. Finally when all that's done I manually switch to the kernel stack.
Here's a asm stub
Code: Select all
intr_0x20:
/* Move all the regs into the thread structure */
pushq %rdi
pushq %rsi
PUSH_GP_REGS_EXCEPT_RDI_RSI
/* put core_id in edi */
GET_COREID %edi
/* Switch to kernel stack */
movq kstack(, %rdi, 8), %rsp
incl core_nest(, %rdi, 4)
/* Set the current core state */
movq $1, core_state(, %rdi, 4)
/* Go to the c code now */
cld
sti
call ksyscall
jmp ret_to_user