Dynamic kernel stack allocation

OSwhatever
Member
Posts: 595
Joined: Mon Jul 05, 2010 4:15 pm

Dynamic kernel stack allocation

Post by OSwhatever »

The traditional way of doing things is to allocate the kernel stack during thread creation and let the thread keep that stack for its entire life. However, since a trip into the kernel is always supposed to be a short vacation from ordinary user life, with no memory of the previous trips, a kernel stack could instead be assigned when the thread enters the kernel and reused afterwards. The thread doesn't care which kernel stack it uses. This might reduce memory usage, but another thing that comes to mind is user mode scheduling. The problem with user mode threads is that the kernel has no idea about them, and there are no kernel stacks associated with them. With dynamic kernel stack allocation the kernel doesn't care how many threads visit the kernel; it just gives them a kernel stack upon entry, and this would also solve kernel call parallelism for user mode threads. Why isn't this method more widely used? Are there any drawbacks I'm not aware of?
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: Dynamic kernel stack allocation

Post by gerryg400 »

Firstly you need to decide how many threads can be in the kernel at once. In a simple OS without the possibility of pre-emption during a system call it may be just one. Or it may be one per core. Or it may be one at each priority (possibly per core). Or it may be that your OS allows every thread to be in a system call (possibly sleeping, spinning etc.) at the same time.

In the normal implementation you need a kernel stack for each of those threads. Whether you pre-allocate all those stacks or allocate them at kernel entry is a question of execution speed vs. memory use.
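
For the pre-allocation case, a minimal sketch might look something like this (KSTACK_SIZE, alloc_pages() and tss_set_kernel_stack() are made-up names, not any real kernel's API):

Code: Select all

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helpers assumed to exist elsewhere in the kernel. */
    extern void *alloc_pages(size_t bytes);
    extern void  tss_set_kernel_stack(unsigned cpu, void *stack_top);

    #define MAX_CPUS    64
    #define KSTACK_SIZE (8 * 1024)              /* 8 KiB per core */

    static void *kstack_top[MAX_CPUS];

    /* At boot: carve out one kernel stack per core, once and for all. */
    void kstacks_init(unsigned ncpus)
    {
        for (unsigned i = 0; i < ncpus; i++) {
            uint8_t *base = alloc_pages(KSTACK_SIZE);
            kstack_top[i] = base + KSTACK_SIZE; /* x86 stacks grow downwards */
        }
    }

    /* Per core during bring-up: the stack the CPU switches to on a
     * ring 3 -> ring 0 transition (TSS.ESP0 / TSS.RSP0). */
    void kstack_install(unsigned cpu)
    {
        tss_set_kernel_stack(cpu, kstack_top[cpu]);
    }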
If a trainstation is where trains stop, what is a workstation ?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Dynamic kernel stack allocation

Post by Brendan »

Hi,
OSwhatever wrote:The traditional way of doing things is to allocate the kernel stack during thread creation and let the thread keep that stack for its entire life. However, since a trip into the kernel is always supposed to be a short vacation from ordinary user life, with no memory of the previous trips, a kernel stack could instead be assigned when the thread enters the kernel and reused afterwards.
I've been thinking about this for a while. You'd need one kernel stack per CPU (and could pre-allocate them rather than dynamically allocate them). On kernel entry you'd save the user state and on kernel exit you'd restore the user state. When something happens that causes a task switch, the kernel only changes a "return task ID" value, so that it restores a different task's state when it returns to user space. For things like FPU/MMX/SSE state, on kernel exit you'd do something like "if(currentTask != returnTask) { saveRemainingState(); currentTask = returnTask; }". In the same way you can postpone scheduling - for e.g. "if(returnTask == UNKNOWN) returnTask = scheduler_chooseTask();".

There may be performance advantages, including fewer cache misses when accessing kernel stacks (because each CPU reuses its own stack, the RAM for that stack will remain in cache) and avoiding some task switching (e.g. if several things occur that cause task switches at almost the same time, then the "return task ID" variable may be changed several times before the kernel returns to user space, but only one task switch would actually occur rather than several of them). Of course the reason I've been thinking about it is RAM usage - if you've got 8 CPUs running 12345 threads and kernel stacks are 8 KiB each; then "per CPU" kernel stacks would cost 64 KiB of RAM and "per thread" kernel stacks would cost 98760 KiB of RAM.

For disadvantages; it will make simple kernel API services (where very little state actually needs to be saved/restored) slower; partly because user state would have to be stored in some sort of "thread info structure" and not on the kernel stack, and you'd need to shift some things that the CPU puts on the stack (e.g. return SS:ESP and return CS:EIP would need to be shifted to/from the "thread info structure"). The other disadvantage is that you couldn't have "kernel threads" with this method; and I like having kernel threads (mostly for cleaning up various pieces of kernel data in idle time).
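
To make the "thread info structure" part concrete, here's a rough sketch (the layout and names are just an assumption; 32-bit protected mode registers shown):

Code: Select all

    #include <stdint.h>

    /* Per-thread structure that replaces the per-thread kernel stack as the
     * place where user state lives between kernel entry and kernel exit. */
    struct thread_info {
        /* What the CPU pushes on a ring 3 -> ring 0 interrupt/syscall entry,
         * copied off the per-CPU kernel stack by the entry stub. */
        uint32_t eip, cs, eflags, esp, ss;
        /* General purpose registers, saved by the entry stub itself. */
        uint32_t eax, ebx, ecx, edx, esi, edi, ebp;
        /* FPU/MMX/SSE state, only filled in lazily as described above. */
        uint8_t  fxsave_area[512] __attribute__((aligned(16)));
    };

    /* Entry stub helper: shift the iret frame from the per-CPU kernel stack
     * into the thread's own structure. */
    static void save_user_frame(struct thread_info *ti, const uint32_t *frame)
    {
        ti->eip    = frame[0];
        ti->cs     = frame[1];
        ti->eflags = frame[2];
        ti->esp    = frame[3];
        ti->ss     = frame[4];
    }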
OSwhatever wrote:The thread doesn't care which kernel stack it uses. This might reduce memory usage, but another thing that comes to mind is user mode scheduling. The problem with user mode threads is that the kernel has no idea about them, and there are no kernel stacks associated with them. With dynamic kernel stack allocation the kernel doesn't care how many threads visit the kernel; it just gives them a kernel stack upon entry, and this would also solve kernel call parallelism for user mode threads. Why isn't this method more widely used? Are there any drawbacks I'm not aware of?
In theory user space threads have less overhead because the cost of switching to/from CPL=0 is avoided sometimes. In practice this overhead is negligible (around 30 cycles, depending on a wide variety of things). The problem with user space threads is that it's impossible to have "global thread priorities". Each process doesn't know if there's higher priority threads in other processes that should be running instead of its own threads; and the kernel doesn't know the priority of any user space threads (and isn't doing most of the scheduling anyway); so you end up executing less important threads when you should be executing more important threads. This can make the OS feel slow/sluggish - e.g. one application wasting time calculating the first million prime numbers while the user is sitting there waiting for another application to respond to a keypress. If you have a look at some of the performance numbers for various OSs, you'll see that OSs with very slow "kernel task switching" (e.g. Windows) tend to benefit from things like fibres, while for better OSs (with fast "kernel task switching", like Linux) it's not worth bothering with. Basically, user space threads are only good when the kernel is crappy, and I'd suggest writing a kernel that isn't crappy instead of bothering with user space threads. ;)


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: Dynamic kernel stack allocation

Post by gerryg400 »

I've been thinking about this for a while. You'd need one kernel stack per CPU (and could pre-allocate them rather than dynamically allocate them). On kernel entry you'd save the user state and on kernel exit you'd restore the user state. When something happens that causes a task switch, the kernel only changes a "return task ID" value, so that it restores a different task's state when it returns to user space.
This works well and is very easy to implement if your kernel design doesn't allow threads to sleep (i.e. no IO waiting in the kernel) and doesn't allow threads in kernel mode to be pre-empted. Many microkernels don't need IO wait because that sort of thing happens in userspace. But almost all kernels have at least some long system calls and will need some sort of kernel pre-emption.

BTW, I think I misunderstood OSwhatever's original point. I thought that by "Dynamic kernel stack allocation" he meant allocating kernel stacks as required using kmalloc.
If a trainstation is where trains stop, what is a workstation ?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Dynamic kernel stack allocation

Post by Brendan »

Hi,
gerryg400 wrote:
I've been thinking about this for a while. You'd need one kernel stack per CPU (and could pre-allocate them rather than dynamically allocate them). On kernel entry you'd save the user state and on kernel exit you'd restore the user state. When something happens that causes a task switch, the kernel only changes a "return task ID" value, so that it restores a different task's state when it returns to user space.
This works well and is very easy to implement if your kernel design doesn't allow threads to sleep (i.e. no IO waiting in the kernel) and doesn't allow threads in kernel mode to be pre-empted. Many microkernels don't need IO wait because that sort of thing happens in userspace.
When you switch from user space to kernel space you stop running any task; when you're inside the kernel no tasks are running at all; and when you return to user space you start running a task again. Because no task is running when the kernel is running, it should be easy for the kernel to change a task's state to "blocked" and put a task onto a queue or something.

For a very basic implementation, the scheduler would always decide which task to return to immediately before the kernel returns to user space. For my previous post, the "currentTask" and "returnTask" variables are there to reduce overhead, by avoiding the need for the scheduler to decide which task to return to if/when possible.

I'd also assume that when "returnTask == UNKNOWN" there may be no tasks that are ready to run; and instead of returning to user space the scheduler might put the CPU into some sort of sleep state (until a task does become ready to run).

The pseudo-code for kernel exit might look something like:

Code: Select all

    if(currentTask != returnTask) {
        while(returnTask == UNKNOWN) {
            returnTask = scheduler_findTask();
            if(returnTask == UNKNOWN) {
                put_CPU_into_sleep_state();
            }
        }
        if(currentTask != UNKNOWN) {
           /* For "on demand" FPU/MMX/SSE state loading using the "device not available" exception */
            save_FPU_MMX_SSE_state(currentTask);
            set_FPU_MMX_SSE_state_owner(UNKNOWN);
        }
        currentTask = returnTask;
    }
    loadState(currentTask);
    return to user space
gerryg400 wrote:But almost all kernels have at least some long system calls and will need some sort of kernel pre-emption.
Worst case for a micro-kernel is creating a new process - creating some sort of "process info" structure, creating a new virtual address space and creating an initial thread for the process. I think I can do it fast enough, given that I don't do "fork()" (and don't need to clone an existing address space or setup "copy on write") and that the "process loader" (e.g. code to parse command line arguments, build environment strings, parse executable headers, load the executable, etc) would be running in user space.

For a monolithic kernel, I wouldn't be considering something like this.. ;)
gerryg400 wrote:BTW, I think I misunderstood OSwhatever's original point. I thought that by "Dynamic kernel stack allocation" he meant allocating kernel stacks as required using kmalloc.
I think I may have overlooked OSwhatever's original point too, and gone off on my own slightly different tangent.. ;)


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: Dynamic kernel stack allocation

Post by gerryg400 »

Brendan wrote:Worst case for a micro-kernel is creating a new process - creating some sort of "process info" structure, creating a new virtual address space and creating an initial thread for the process. I think I can do it fast enough, given that I don't do "fork()" (and don't need to clone an existing address space or setup "copy on write") and that the "process loader" (e.g. code to parse command line arguments, build environment strings, parse executable headers, load the executable, etc) would be running in user space.
True, most of that stuff is quick or can be done by the new process itself. In fact most xxxx_create() syscalls are short and sweet.

But aren't the longest kernel calls process_destroy(), mailbox_destroy etc. where you need to tear down and dispose of all the accumulated stuff that is linked to the process or mailbox ?

I've been trying to find nice ways to pre-empt this sort of stuff. I'm currently playing around with an idea where, when a process is deleted, it's just marked as deleted (which hides all its resources) and then torn down as time permits. I really only want one kstack per core. You're correct that with tens of thousands of threads it just burns too much memory. Oh, and currently my kernel stacks are fully-committed (with phys-mem allocated) kernel memory, and that's not an unlimited resource. If your kernel stacks are virtual memory or are swappable then perhaps that's not so bad.
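
A rough sketch of that "mark it deleted, reap it later" idea (all names here are made up):

Code: Select all

    /* Hypothetical types/helpers, assumed to exist elsewhere in the kernel. */
    struct process;
    extern void mark_zombie(struct process *p);            /* hides its resources */
    extern void remove_from_scheduler(struct process *p);
    extern void reap_queue_push(struct process *p);
    extern struct process *reap_queue_peek(void);
    extern void reap_queue_pop(void);
    extern int  teardown_one_resource(struct process *p);  /* 0 when nothing left */
    extern void free_process(struct process *p);

    /* The syscall itself stays short: just hide the process and queue it. */
    void process_destroy(struct process *p)
    {
        mark_zombie(p);
        remove_from_scheduler(p);
        reap_queue_push(p);
    }

    /* Called from the idle path: tear down one resource at a time, so the
     * work can always be interrupted by something more important. */
    void idle_reap_one(void)
    {
        struct process *p = reap_queue_peek();
        if (p == 0)
            return;
        if (!teardown_one_resource(p)) {
            reap_queue_pop();
            free_process(p);
        }
    }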
If a trainstation is where trains stop, what is a workstation ?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Dynamic kernel stack allocation

Post by Brendan »

Hi,
gerryg400 wrote:
Brendan wrote:Worst case for a micro-kernel is creating a new process - creating some sort of "process info" structure, creating a new virtual address space and creating an initial thread for the process. I think I can do it fast enough, given that I don't do "fork()" (and don't need to clone an existing address space or setup "copy on write") and that the "process loader" (e.g. code to parse command line arguments, build environment strings, parse executable headers, load the executable, etc) would be running in user space.
True, most of that stuff is quick or can be done by the new process itself.

But aren't the longest kernel calls process_destroy(), mailbox_destroy etc. where you need to tear down and dispose of all the accumulated stuff that is linked to the process or mailbox ?
Hrm. Usually I have kernel thread/s that do the tear down work in idle time. Without the ability to use kernel thread/s I'd have to have some sort of helper code in user space. One approach may be to replace a page of the process' code with a "tear-down" stub (and adjust the priority of the process' last remaining thread to suit). That way when a process goes from "terminated" to "actually gone" there wouldn't be much left to do in the kernel itself.

There is an alternative I probably should mention. Some micro-kernels (I can't remember which ones) have "pre-emption points" at strategic places in long-running kernel services, where part-way through doing something the kernel can check if there's something more important to do and postpone whatever it's doing until later. In this case the "return to user space" code would have to check if the task being returned to was at one of the kernel's pre-emption points, and if it was, it would return to the kernel's pre-emption point and not return to user space at all.
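
A rough sketch of such a pre-emption point (all names are made up):

Code: Select all

    struct address_space;                   /* opaque, defined elsewhere */
    extern int  free_one_chunk(struct address_space *as);  /* returns 0 when done */
    extern void yield_to_scheduler(void);   /* hypothetical: save state, reschedule */

    /* Set from the IRQ path when something more important becomes runnable. */
    static volatile int preempt_requested;

    /* Dropped into long-running kernel services at safe places. */
    static inline void preemption_point(void)
    {
        if (preempt_requested) {
            preempt_requested = 0;
            yield_to_scheduler();           /* we resume right here afterwards */
        }
    }

    /* Example of a lengthy service with a pre-emption point per iteration. */
    void destroy_address_space(struct address_space *as)
    {
        while (free_one_chunk(as))
            preemption_point();
    }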

I prefer the "helper code in user space" idea, but I'm not sure why (seems cleaner and more "micro-kernel-like").
gerryg400 wrote:I've been trying to find nice ways to pre-empt this sort of stuff. I'm currently playing around with an idea where, when a process is deleted, it's just marked as deleted (which hides all its resources) and then torn down as time permits. I really only want one kstack per core. You're correct that with tens of thousands of threads it just burns too much memory. Oh, and currently my kernel stacks are fully-committed (with phys-mem allocated) kernel memory, and that's not an unlimited resource. If your kernel stacks are virtual memory or are swappable then perhaps that's not so bad.
If kernel stacks are swappable, then it'd complicate things like page fault handling a lot (you'd have to use hardware task switching in protected mode or an extra stack and IST in long mode, just to handle "page fault while attempting to access kernel stack" properly)...


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: Dynamic kernel stack allocation

Post by gerryg400 »

I've been thinking about this for a while. You'd need one kernel stack per CPU (and could pre-allocate them rather than dynamically allocate them).... etc.
Brendan, have you actually decided to go this way or are you still thinking about it ? I've decided. It's one kstack per core for me.
If a trainstation is where trains stop, what is a workstation ?
OSwhatever
Member
Posts: 595
Joined: Mon Jul 05, 2010 4:15 pm

Re: Dynamic kernel stack allocation

Post by OSwhatever »

Brendan wrote:For disadvantages; it will make simple kernel API services (where very little state actually needs to be saved/restored) slower; partly because user state would have to be stored in some sort of "thread info structure" and not on the kernel stack, and you'd need to shift some things that the CPU puts on the stack (e.g. return SS:ESP and return CS:EIP would need to be shifted to/from the "thread info structure"). The other disadvantage is that you couldn't have "kernel threads" with this method; and I like having kernel threads (mostly for cleaning up various pieces of kernel data in idle time).
I was thinking that you simply use a lock-free stack storing kernel stack candidates; if that stack is empty you allocate a new kernel stack from the kernel heap. Upon kernel entry you pluck a kernel stack from the lock-free stack and push it back when leaving. This will be more code for a system call, but still feasible performance-wise I think. If the thread is preempted in the kernel it simply keeps its kernel stack for the time it is blocked.

Kernel threads would work too, because they remain in the kernel and thus never return the stack. Kernel threads can get a kernel stack directly when created.
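
A rough sketch of that lock-free free list of kernel stacks (kheap_alloc() and KSTACK_SIZE are assumptions, and the usual ABA caveats for compare-and-swap free lists apply):

Code: Select all

    #include <stdatomic.h>
    #include <stddef.h>

    #define KSTACK_SIZE (8 * 1024)
    extern void *kheap_alloc(size_t bytes);     /* assumed kernel heap allocator */

    /* Each free stack's lowest bytes double as the free-list link. */
    struct kstack { struct kstack *next; };

    static _Atomic(struct kstack *) free_stacks;

    /* Kernel entry: pop a stack, or allocate a fresh one if the list is empty. */
    void *kstack_acquire(void)
    {
        struct kstack *top = atomic_load(&free_stacks);
        while (top &&
               !atomic_compare_exchange_weak(&free_stacks, &top, top->next))
            ;                                   /* CAS failed: 'top' was refreshed */
        if (top == NULL)
            top = kheap_alloc(KSTACK_SIZE);
        return top;
    }

    /* Kernel exit: push the stack back for the next visitor. */
    void kstack_release(void *stack)
    {
        struct kstack *node = stack;
        node->next = atomic_load(&free_stacks);
        while (!atomic_compare_exchange_weak(&free_stacks, &node->next, node))
            ;                                   /* CAS failed: 'next' was refreshed */
    }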
In theory user space threads have less overhead because the cost of switching to/from CPL=0 is avoided sometimes. In practice this overhead is negligible (around 30 cycles, depending on a wide variety of things). The problem with user space threads is that it's impossible to have "global thread priorities". Each process doesn't know if there's higher priority threads in other processes that should be running instead of its own threads; and the kernel doesn't know the priority of any user space threads (and isn't doing most of the scheduling anyway); so you end up executing less important threads when you should be executing more important threads. This can make the OS feel slow/sluggish - e.g. one application wasting time calculating the first million prime numbers while the user is sitting there waiting for another application to respond to a keypress. If you have a look at some of the performance numbers for various OSs, you'll see that OSs with very slow "kernel task switching" (e.g. Windows) tend to benefit from things like fibres, while for better OSs (with fast "kernel task switching", like Linux) it's not worth bothering with. Basically, user space threads are only good when the kernel is crappy, and I'd suggest writing a kernel that isn't crappy instead of bothering with user space threads. ;)
You can simulate this by letting the user mode scheduler adjust its own priority - basically setting the host kernel thread to the same priority as the desired priority of the user mode thread. This involves the kernel, however, which is something we didn't want to begin with.
FlashBurn
Member
Posts: 313
Joined: Fri Oct 20, 2006 10:14 am

Re: Dynamic kernel stack allocation

Post by FlashBurn »

I also like the idea of having just one kernel stack per core, but I don't like the idea that my kernel is not preemptive.

Just imagine you have lots of cores and all threads on these cores go into the kernel at the same time and all want the same lock. So you either use a mutex/semaphore and need to save the CPU state (because you can't just push all regs to the stack), or you use a spinlock, and it can then take some time for all CPUs to get it.
OSwhatever
Member
Posts: 595
Joined: Mon Jul 05, 2010 4:15 pm

Re: Dynamic kernel stack allocation

Post by OSwhatever »

FlashBurn wrote:I also like the idea of having just one kernel stack per core, but I don't like the idea that my kernel is not preemptive.

Just imagine you have lots of cores and all threads on these cores go into the kernel at the same time and all want the same lock. So you either use a mutex/semaphore and need to save the CPU state (because you can't just push all regs to the stack), or you use a spinlock, and it can then take some time for all CPUs to get it.
The "global kernel lock" was something that was ridiculed in the earlier days of Linux. I'm not sure if you can seriously sell a non re-entrant kernel today. With the current multicore trend it really motivates it even more.
Last edited by OSwhatever on Tue Sep 27, 2011 3:08 pm, edited 1 time in total.
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: Dynamic kernel stack allocation

Post by gerryg400 »

FlashBurn wrote:I also like the idea of having just one kernel stack per core, but I don't like the idea that my kernel is not preemptive.

Just imagine you have lots of cores and all threads on these cores go into the kernel at the same time and all want the same lock. So you either use a mutex/semaphore and need to save the CPU state (because you can't just push all regs to the stack), or you use a spinlock, and it can then take some time for all CPUs to get it.
There's a distinction between a kernel being preemptive and a kernel allowing concurrency. It's entirely possible to design a kernel that allows any or all cores to enter the kernel at the same time but does not allow those cores to be preempted once they enter.

In my opinion
- concurrency is a must in multicore kernels. To scale at all on multicore machines your kernel must be able to use all those cores concurrently with minimal locking.
- preemption is preferred (over no preemption) in monolithic kernels. Monolithic kernels tend to have long code paths through the kernel. Having said this, there was quite some debate on the Linux mailing lists; some listers believed that the kernel would have been more responsive if it had been preempted only voluntarily.
- voluntary preemption, as described by Brendan, is better than forced preemption in microkernels. I believe it makes concurrency easier and would provide a more responsive system.

Just my opinion.
If a trainstation is where trains stop, what is a workstation ?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Dynamic kernel stack allocation

Post by Brendan »

Hi,
gerryg400 wrote:
I've been thinking about this for a while. You'd need one kernel stack per CPU (and could pre-allocate them rather than dynamically allocate them).... etc.
Brendan, have you actually decided to go this way or are you still thinking about it ? I've decided. It's one kstack per core for me.
I've decided to implement something like this for the next version of my kernel. Once I've implemented it I'll decide if I'll keep it or not. :)
OSwhatever wrote:I was thinking that you simply use a lock-free stack storing kernel stack candidates; if that stack is empty you allocate a new kernel stack from the kernel heap. Upon kernel entry you pluck a kernel stack from the lock-free stack and push it back when leaving. This will be more code for a system call, but still feasible performance-wise I think. If the thread is preempted in the kernel it simply keeps its kernel stack for the time it is blocked.
That would work too. It'd have slightly higher overhead and use more RAM, but it'd make it easier to support kernel threads and pre-emption.
FlashBurn wrote:I also like the idea of having just one kernel stack per core, but I don't like the idea that my kernel is not preemptive.
OSwhatever wrote:The "global kernel lock" was something that was ridiculed in the earlier days of Linux. I'm not sure if you can seriously sell a non re-entrant kernel today. With the current multicore trend it really motivates it even more.
You're both getting confused. ;)

For "kernel stack per CPU", all CPUs can be running kernel code at the same time. There is no "big kernel lock" and it's not a "non re-entrant kernel".

The "no pre-emption" disadvantage would only mean that task switches/pre-emption that occurs while a thread is running in kernel space would be postponed until the kernel has finished what it's doing (and is ready to return to user-space). For example, imagine a thread calls the kernel to allocate some memory. For a "fully pre-emptable" kernel, an IRQ can occur while the kernel is allocating the memory the thread wanted and the task switch can happen immediately; while for "kernel stack per CPU" the task switch would be postponed until after the memory has been allocated (and the kernel is ready to return to user space).

In my experience, it's not quite like that though. For a "fully pre-emptable" kernel that supports SMP, when it's allocating memory the kernel would have to acquire some sort of lock (more likely is acquiring multiple locks - e.g. one for the process' address space and one for a "free page stack"). The locks would be some type of "busy waiting" lock, because (if there's contention) waiting/spinning is going to be a lot faster and more efficient than expensive task switches. In this case, allowing task switches while the lock is held is a very bad idea - instead, if an IRQ occurs while the "fully pre-emptable" kernel is allocating memory, the task switch is postponed until after the kernel releases the lock.

For my previous kernels I made this an implicit part of acquiring any lock. Acquiring a lock would cause a "locks held" variable to be increased, and if a task switch occurs while "locks held != 0" the scheduler would set a "task switches postponed" flag and won't do the task switch. Whenever a lock is released it decrements the "locks held" variable, and if the variable became zero it checks the "task switches postponed" flag and does the postponed task switch then.
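
In code the idea looks something like this (per-CPU data, hypothetical names, and the actual spinlock is left out):

Code: Select all

    /* Per-CPU bookkeeping; only ever touched by code running on that CPU. */
    struct percpu {
        int locks_held;
        int switch_postponed;
    };

    extern struct percpu *this_cpu(void);   /* hypothetical per-CPU accessor */
    extern void do_task_switch(void);       /* the real switch, defined elsewhere */

    void lock_note_acquired(void)           /* call after taking any kernel lock */
    {
        this_cpu()->locks_held++;
    }

    void lock_note_released(void)           /* call after dropping any kernel lock */
    {
        struct percpu *cpu = this_cpu();
        if (--cpu->locks_held == 0 && cpu->switch_postponed) {
            cpu->switch_postponed = 0;
            do_task_switch();               /* the switch we postponed earlier */
        }
    }

    /* Scheduler side: switch now if it's safe, otherwise remember to do it
     * when the last lock is released. */
    void request_task_switch(void)
    {
        struct percpu *cpu = this_cpu();
        if (cpu->locks_held != 0)
            cpu->switch_postponed = 1;
        else
            do_task_switch();
    }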

Of course "when the kernel releases the lock" is often when the kernel is finished doing what it's doing and is about to return to user space. In this case, "fully pre-emptable" is very similar to "no pre-emption" anyway.

Not all things are like that though. There's about 3 different groups of kernel API functions:
  • functions that are so simple that they can be completed without acquiring any lock (including things like "get_process_ID()" and things that can be done with lock-free or wait-free code). These are typically so fast that nobody cares anyway.
  • functions that acquire one or more locks at the same time (where in practice "fully pre-emptable" is very similar to "no pre-emption" anyway)
  • functions that are lengthy/expensive (e.g. creating and destroying processes). In previous kernels I've been especially careful and split these up into pieces to ensure that there's "gaps" where no locks are held (and pre-emption can occur). I've even deliberately released a lock and then acquired the exact same lock immediately after releasing it in a few places. For "no pre-emption" you'd do the same thing - split lengthy/expensive operations into pieces (you just split things in a different way - e.g. explicit "pre-emption points" or some sort of "kernel helper" in user space).
Basically, for a micro-kernel, in practice "fully pre-emptable" ends up being very similar to a "no pre-emption" anyway - the "no pre-emption" version would only be slightly less pre-emptable.

Again, for a monolithic kernel (where lots of kernel functions involve messing about with device drivers, etc and a lot of them end up being lengthy/expensive), I wouldn't be considering something like this. For a micro-kernel, the only lengthy/expensive functions are process creation and destruction, thread creation and destruction, and the part of virtual memory management that scans the virtual address space (and decides which pages are good candidates to send to swap space).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
FlashBurn
Member
Posts: 313
Joined: Fri Oct 20, 2006 10:14 am

Re: Dynamic kernel stack allocation

Post by FlashBurn »

I'm not talking about a big kernel lock, but about any lock that all CPUs could want at the same time. If the kernel is not preemptive, that means the last CPU to get the lock will have to wait for a while, and that wait grows with every CPU you add to the system.

I don't really like the idea of preemption points. I mean, I wouldn't know where I need to put them, and it also looks like cooperative multitasking.

The longest function I have in my kernel is the vmm, because it needs some locks and could take some time. Task creation is not that long, because I only create an empty task and start a new kernel thread for doing more init stuff (this is also needed because of my vmm).
This means that I would have kernel threads which need to be preemptible. I also need them for things like thread/task killing.

The problem I see is this: if I have a non-preemptive kernel I also don't need to disable interrupts for every lock, but this becomes a problem when I have kernel threads which are preemptible, because then I would need to manually disable and enable interrupts at some points (where I call functions which need to take a lock).

So would it be a good idea to mix both?
OSwhatever
Member
Posts: 595
Joined: Mon Jul 05, 2010 4:15 pm

Re: Dynamic kernel stack allocation

Post by OSwhatever »

When we're talking about locks, isn't the "spinlock and disable IRQs" method mostly suitable for very short operations, like inserting into a structure? If you're going to use only a spinlock without disabling IRQs, you have to promise that you never take the same lock or touch the same objects in an interrupt handler.
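
A rough sketch of that "spinlock and disable IRQs" pattern (save_irq_flags_cli() and restore_irq_flags() stand in for pushf+cli / popf and are made-up names; the function names just follow the usual irqsave/irqrestore convention):

Code: Select all

    typedef struct { volatile int locked; } spinlock_t;

    /* Hypothetical wrappers around pushf+cli / popf, assumed to exist. */
    extern unsigned long save_irq_flags_cli(void);
    extern void restore_irq_flags(unsigned long flags);

    /* Take the lock with IRQs off, so an interrupt handler on this CPU can
     * never deadlock against us; only use this for very short sections. */
    unsigned long spin_lock_irqsave(spinlock_t *l)
    {
        unsigned long flags = save_irq_flags_cli();
        while (__sync_lock_test_and_set(&l->locked, 1))
            ;                           /* spin: holders are expected to be brief */
        return flags;
    }

    void spin_unlock_irqrestore(spinlock_t *l, unsigned long flags)
    {
        __sync_lock_release(&l->locked);
        restore_irq_flags(flags);       /* IRQs back to whatever they were */
    }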

Even for smaller jobs I tend to use mutexes. You often have nested locks, and in that case mutexes do better. Often something might need memory from the page allocator, but you don't really know for sure.