Software based task switching - Patching esp0
Hi,
I have a question on software based task switching. I intend to use a single TSS for all threads, and I understand that I need to patch esp0 in the TSS to the correct value for the current thread. But what value should I use?
Some of the tutorials suggest using KERNEL_STACK_END. Will that always work? Let us say I get a syscall and switch to kernel mode. The CPU will take esp0 (which is set to KERNEL_STACK_END). The timer IRQ fires in the meantime. Since the CPL is already 0, another stack switch will not occur, so the CPU will not need TSS.esp0 until I return from the syscall. Is there any other scenario I need to consider? Will I overwrite stack content if I set it to a fixed value?
Thanks,
Raghu
Re: Software based task switching - Patching esp0
Most likely that scenario won't happen, because interrupts are disabled during the syscall (e.g. if you enter through an interrupt gate, the CPU clears IF automatically).
Fudge - Simplicity, clarity and speed.
http://github.com/Jezze/fudge/
Re: Software based task switching - Patching esp0
Hi,
raghuk wrote: Some of the tutorials suggest using KERNEL_STACK_END. Will that always work? Let us say I get a syscall and switch to kernel mode. The CPU will take esp0 (which is set to KERNEL_STACK_END). The timer IRQ fires in the meantime. Since the CPL is already 0, another stack switch will not occur, so the CPU will not need TSS.esp0 until I return from the syscall. Is there any other scenario I need to consider? Will I overwrite stack content if I set it to a fixed value?
There's (at least) 3 different possibilities:
- Each thread has its own kernel stack, mapped at different virtual addresses in kernel space. In this case using "KERNEL_STACK_END" will only work for one thread. When you create a new thread you'd allocate memory (in kernel space) for that new thread's kernel stack (e.g. using some sort of "kmalloc()" function). During a task switch you change the ESP0 field in the CPU's TSS (see the sketch after this list).
- Each thread has its own kernel stack, mapped at the *same* addresses in kernel space. Here you'd use paging to have a (slightly) different virtual address space for each task, so that the same virtual addresses used for each thread's kernel stack actually refer to completely different physical pages. In this case using "KERNEL_STACK_END" will work for as many threads as you like, and during task switches you change virtual address spaces but don't need to touch the ESP0 field in the TSS at all. The downside here is that different threads that belong to the same process have to use different virtual address spaces (e.g. different page directories, etc).
- Instead of having one kernel stack per task/thread, you could have one kernel stack per CPU, or even one kernel stack for the entire system (for all CPUs). The "one kernel stack per CPU" idea is a little tricky to get right though, because it means only one task/thread can be running in the kernel at a time (but it can save a lot of memory if it's done right). The "one kernel stack for the entire system" idea is probably a very bad idea (the more CPUs there are, the more time you waste fighting to share that same kernel stack).
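To make option 1 concrete, here's a minimal sketch in C of a software task switch that patches ESP0. Everything here (struct tss, struct thread, switch_stacks) is a hypothetical name chosen for illustration, not code from any particular kernel:
Code:
#include <stdint.h>

/* Hypothetical TSS layout - only the fields this sketch touches. */
struct tss {
    uint32_t prev_task;
    uint32_t esp0;   /* loaded into ESP on a ring3 -> ring0 transition */
    uint32_t ss0;
    /* ... the remaining hardware TSS fields ... */
};

struct thread {
    uint32_t esp;               /* saved kernel-mode ESP of this thread */
    uint32_t kernel_stack_top;  /* top of this thread's kernel stack */
};

extern struct tss cpu_tss;   /* the single, shared TSS */
extern void switch_stacks(uint32_t *save_esp, uint32_t load_esp);  /* asm stub */

void task_switch(struct thread *from, struct thread *to)
{
    /* Make the next ring3 -> ring0 transition land on the new thread's
       kernel stack; nothing is pushed there until that transition happens. */
    cpu_tss.esp0 = to->kernel_stack_top;
    switch_stacks(&from->esp, to->esp);
}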
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Software based task switching - Patching esp0
There is a fourth alternative: use one TSS per thread, but save/load registers with software. Instead of keeping the kernel stack at the same location (and needing to reload CR3, which flushes the TLB), it is only necessary to load a new TR with LTR. LTR automatically changes ESP0 to the value relevant for the thread. Additionally, the I/O permission bitmap is also switched automatically and thus can be per-thread.
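A rough sketch of what that switch might look like (all names are hypothetical; note that LTR raises #GP if the destination TSS descriptor is still marked busy, so its busy bit must be cleared first):
Code:
#include <stdint.h>

struct thread {
    uint16_t tss_selector;   /* GDT selector of this thread's own TSS */
};

extern void gdt_clear_tss_busy(uint16_t selector);  /* hypothetical helper */

static inline void load_task_register(uint16_t selector)
{
    __asm__ volatile("ltr %0" : : "r"(selector));
}

void switch_tss(struct thread *to)
{
    gdt_clear_tss_busy(to->tss_selector);  /* LTR #GPs on a busy TSS */
    load_task_register(to->tss_selector);  /* ESP0 and the I/O permission
                                              bitmap now come from the new
                                              thread's TSS */
    /* ... software register save/restore continues as usual ... */
}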
Re: Software based task switching - Patching esp0
rdos & Brendan,
Thanks a lot for your responses.
I was planning to use option 2; personally I find it clean to have the kernel stack always located at the same linear address. I will have to change and invalidate one PTE (4k stack) with each task switch.
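Something like this, I think (a sketch only - KERNEL_STACK_VADDR, pte_for() and the PTE flags are placeholder names I made up):
Code:
#include <stdint.h>

#define KERNEL_STACK_VADDR 0xFFBFF000u   /* example fixed linear address */
#define PTE_PRESENT  0x001u
#define PTE_WRITABLE 0x002u

extern uint32_t *pte_for(uint32_t vaddr);   /* hypothetical PTE lookup */

static inline void invlpg(uint32_t vaddr)
{
    __asm__ volatile("invlpg (%0)" : : "r"(vaddr) : "memory");
}

/* Point the fixed stack address at the incoming thread's physical page.
   NOTE: must not be executing on the stack being remapped - e.g. run this
   from a per-CPU scratch stack during the switch. */
void map_kernel_stack(uint32_t stack_phys)
{
    uint32_t *pte = pte_for(KERNEL_STACK_VADDR);
    *pte = (stack_phys & ~0xFFFu) | PTE_PRESENT | PTE_WRITABLE;
    invlpg(KERNEL_STACK_VADDR);          /* drop the stale TLB entry */
}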
Option 4 might be useful as well. My OS will have support for v8086 programs, so having a separate I/O permission bitmap and interrupt redirection bitmap could be handy. But I haven't reached that stage of design and will have to revisit this later.
Re: Software based task switching - Patching esp0
Well, I'm all for Brendan's option 1 (the traditional approach - one TSS per CPU, one stack per thread). I believe it offers the most flexibility. Every thread can modify another thread's state. And I believe ring-0 code is well behaved, because I checked the input parameters earlier (yes, nobody's perfect, and I believe there are plenty of bugs in my code too).
Even if your interrupt handler gets interrupted by another interrupt, a well-behaved handler should restore everything to the state where the interrupt 'never happened'. As a result, your syscall will continue without knowing whether there was an interrupt or not. You might send some APIC EOIs in the wrong order, but that can be prevented by disabling interrupts for short moments.
So, there's something I'm baffled about:
Brendan wrote: In my opinion, the third option ("one kernel stack per CPU") is technically superior.
Could you describe it in more detail? Is it offering some performance advantages? Or maybe some advantages in protection? Or something else?
(From my signature You should know, I like simple things )
If something looks overcomplicated, most likely it is.
Re: Software based task switching - Patching esp0
Brendan wrote: In my opinion, the third option ("one kernel stack per CPU") is technically superior.
Velko wrote: Could you describe it in more detail? Is it offering some performance advantages?
Occam's Razor. Option 3 requires less memory and no TLB flush, while still being sufficient for task switching. Consider: if you have 5000 threads and each has a 4K kernel stack, that uses up 20 MB of non-swappable memory for virtually nothing. Then you need page-table entries for those stacks, which may introduce up to another 4K of overhead each, for a grand total of 40 MB of non-swappable memory.
Method 3, on the other hand, requires only a few kilobytes (you may need a somewhat larger stack) no matter how many threads there are.
However, it is more complex to handle preemptive kernel code.
Re: Software based task switching - Patching esp0
Velko wrote: Well, I'm all for Brendan's option 1 (the traditional approach - one TSS per CPU, one stack per thread). I believe it offers the most flexibility. Every thread can modify another thread's state. And I believe ring-0 code is well behaved, because I checked the input parameters earlier (yes, nobody's perfect, and I believe there are plenty of bugs in my code too). So, there's something I'm baffled about:
Brendan wrote: In my opinion, the third option ("one kernel stack per CPU") is technically superior.
Could you describe it in more detail? Is it offering some performance advantages? Or maybe some advantages in protection? Or something else? (From my signature you should know, I like simple things.)
For the traditional approach (my "option 1" above), it's like there's 1 context (running a thread) with 1 sub-context (running the kernel, in the context of a thread), which has 1 sub-sub-context (running an interrupt handler, while running the kernel, in the context of a thread).
For the alternative approach (my "option 3" above), it's like there's 2 contexts: running a thread, or running the kernel. The "running the kernel" context has a sub-context (running an interrupt handler, while running the kernel). The key point is that when you enter the kernel (for any reason) you save the thread's state, and then you're not running that thread anymore (you're not running any thread at all, you're running the kernel instead). When you switch from the kernel to a thread you load the thread's state, but it might not be the same thread that entered the kernel (e.g. thread#1 might call the kernel API, the kernel might do some stuff, then the kernel might return/switch to thread#5).
There's 2 advantages. The first is memory usage. For example, if you've got 128 processes with 128 threads each, and each thread costs you 4 KiB for its own kernel stack, then that adds up to 64 MiB just for kernel stacks, even though you can't really use all of them at the same time. With one kernel stack per CPU, it might only cost you 4 KiB per CPU (regardless of how many processes or threads are running). The other advantage is cache locality - you can't keep 64 MiB of "kernel stack" data in a CPU's cache. This means that for "option 1", each time you do a task switch you can expect cache misses as soon as you access the next task's kernel stack. For "option 3" you avoid those cache misses because you're constantly reusing the same stack (and the same cache lines).
Then there's disadvantages. The first (and probably the biggest) problem with "option 3" is that you can't have kernel threads. This mostly means that for a monolithic kernel (where you really need kernel threads for device drivers, etc) it's probably not viable at all, and it can make some things fairly tricky even for a micro-kernel.
The other thing that makes it a bit tricky is finding ways to avoid saving all of a thread's state when you enter the kernel, and avoid loading all of the thread's state when you return from the kernel. The simplest way around this is to postpone as much as possible until you return from the kernel in the hope that you'll be returning to the same task (and therefore won't need to save or load most things). Of course the simplest way doesn't work when the kernel gets more complex - for example, if a CPU was running thread#1 before the CPU entered the kernel, and the kernel decides to sleep for a while to cool the CPU down (or gets busy doing other things), then no other CPU can run thread#1 because its state hasn't been saved. To work around this you need, um, work-arounds. For example, if the kernel knows that it won't be returning to any thread for a while it could save the previous thread's state as soon as possible, and remember that it has done this already so that it doesn't try to do it again when it finally does return to a thread.
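As a very rough sketch of the "save everything on kernel entry" idea (all names here are hypothetical, and a real entry stub would be in assembly):
Code:
#include <stdint.h>

/* Register frame the (hypothetical) assembly entry stub pushes. */
struct user_regs {
    uint32_t edi, esi, ebp, ebx, edx, ecx, eax;
    uint32_t eip, cs, eflags, esp, ss;
};

struct thread {
    struct user_regs saved_regs;   /* full state lives here, not on a
                                      per-thread kernel stack */
};

struct cpu {
    struct thread *current;        /* NULL while "running the kernel" */
};

/* Called from the entry stub once the per-CPU kernel stack is active. */
void kernel_entry(struct cpu *cpu, const struct user_regs *frame)
{
    cpu->current->saved_regs = *frame;   /* save the interrupted thread */
    cpu->current = 0;                    /* no thread is running any more */
    /* ... do kernel work; the scheduler later picks some thread (maybe a
       different one), loads its saved_regs, and returns to user space ... */
}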
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Software based task switching - Patching esp0
raghuk wrote: I was planning to use option 2; personally I find it clean to have the kernel stack always located at the same linear address. I will have to change and invalidate one PTE (4k stack) with each task switch.
You should have all thread structures mapped into kernel space, and an active-thread link, anyway. With option 1 you can store the thread structure in the stack region and use TSS.SP0 as the active-thread link. Look at my example here for details.
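For illustration, here is one common shape of that trick (placeholder names, not code from my kernel): put the thread structure at the base of a power-of-two-sized, aligned stack region, and the current thread falls out of a simple mask of ESP:
Code:
#include <stdint.h>

#define KSTACK_SIZE 4096u   /* power of two; each region aligned to this */

struct thread {
    uint32_t id;
    /* ... rest of the thread state, at the base of its stack region ... */
};

static inline struct thread *current_thread(void)
{
    uint32_t esp;
    __asm__ volatile("mov %%esp, %0" : "=r"(esp));
    return (struct thread *)(esp & ~(KSTACK_SIZE - 1));
}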
If you have seen bad English in my words, tell me what's wrong, please.
Re: Software based task switching - Patching esp0
Brendan wrote: The first (and probably the biggest) problem with "option 3" is that you can't have kernel threads.
What do you mean by kernel threads? If kernel threads are just threads running in kernel space, why can't you have them? Or do you have something else in mind?
If a trainstation is where trains stop, what is a workstation ?
Re: Software based task switching - Patching esp0
Hi,
For "option 3" (where there's one kernel stack per CPU) you can't really have several kernel threads all sharing the same kernel stack. You could have a mixture though I guess (e.g. kernel threads have their own stacks and don't use the "per CPU" kernel stack), but I can imagine that getting messy (you'd want to have explicit stack switches when a kernel thread calls a kernel API function or is interrupted by an IRQ or IPI; just in case the kernel does a task switch and then uses the wrong stack to return to user space).
Cheers,
Brendan
"Kernel threads" are threads that are owned by the kernel itself, and don't belong to any (user-space) process.gerryg400 wrote:What do you mean by kernel threads ? If kernel threads are just threads running in kernel space, why can't you have them ? Or do you have something else in mind ?Brendan wrote:The first (and probably the biggest) problem with "option 3" is that you can't have kernel threads.
For "option 3" (where there's one kernel stack per CPU) you can't really have several kernel threads all sharing the same kernel stack. You could have a mixture though I guess (e.g. kernel threads have their own stacks and don't use the "per CPU" kernel stack), but I can imagine that getting messy (you'd want to have explicit stack switches when a kernel thread calls a kernel API function or is interrupted by an IRQ or IPI; just in case the kernel does a task switch and then uses the wrong stack to return to user space).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Software based task switching - Patching esp0
Hmm, I run my 'kernel' threads in ring1. They have a stack just like any other thread. The thing that makes them different from user threads is that they have access to the kernel data structures. I use them for object disposal and memory zeroing.
If a trainstation is where trains stop, what is a workstation ?
Re: Software based task switching - Patching esp0
Usually we have only two rings. A single-kernel-stack architecture is more suitable for microkernels, a variant of which you have, I think. You can run system threads in ring 3 as well. They are server threads, not kernel threads.
If you have seen bad English in my words, tell me what's wrong, please.
Re: Software based task switching - Patching esp0
All my drivers run in ring 3 and don't have access to the kernel (except by system call). These threads are definitely part of the kernel. They access kernel data structures. I do have a single kernel stack per core and 'my kernel threads' each have their own stack.
Anyway, there's no need to get bogged down in terminology. Each of us might have slightly different definitions. The point is that I thought there was no issue using Brendan's scheme #3 and having kernel threads (at least with my definition of kernel threads).
If a trainstation is where trains stop, what is a workstation ?
Re: Software based task switching - Patching esp0
egos wrote: raghuk wrote: I was planning to use option 2; personally I find it clean to have the kernel stack always located at the same linear address. I will have to change and invalidate one PTE (4k stack) with each task switch.
You should have all thread structures mapped into kernel space, and an active-thread link, anyway. With option 1 you can store the thread structure in the stack region and use TSS.SP0 as the active-thread link. Look at my example here for details.
I didn't understand that. Is there an advantage in storing the thread structures in the kernel stack rather than in the kernel heap? Right now I kmalloc() the space required for a thread structure and free it when the thread terminates. So I just need one pointer to the currently running thread and another pointer to a queue of ready threads.
I understand option 1 will perform better because each thread switch does not involve invalidating PTEs/PDEs. But if I switch to a different process, I need to load a new value into CR3 anyway.
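So I'd probably end up with something like this - a sketch with made-up structures, just to illustrate skipping the reload for same-process switches:
Code:
#include <stdint.h>

struct process { uint32_t page_dir_phys; };
struct thread  { struct process *proc;   };

extern struct thread *current;   /* the currently running thread */

static inline void load_cr3(uint32_t phys)
{
    __asm__ volatile("mov %0, %%cr3" : : "r"(phys) : "memory");
}

void maybe_switch_address_space(struct thread *next)
{
    /* Reloading CR3 flushes the TLB, so skip it for same-process switches. */
    if (next->proc->page_dir_phys != current->proc->page_dir_phys)
        load_cr3(next->proc->page_dir_phys);
}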