Hi all,
I've just started a unit at university on Operating Systems. Before anyone jumps on me, I'm not after homework or assignment help - just other people's thoughts on threads.
The text and course material are currently discussing user space and kernel space threads, along with their downsides and benefits. Some of this material has led me to a question which, so far, no one has been able to clarify for me.
From what I understand of the material provided to us, user space threads are only better than kernel space threads in their efficiency. On the other hand, they're reliant on each thread being considerate and voluntarily yielding, meaning other threads in the process could be starved. Further to this, since the kernel has no concept of their existence, one blocking I/O request would cause the process to be taken out of the "Runnable" queue until such a time as the I/O completes, which kind of defeats the purpose of multithreading in the first place.
On the other hand, if threads are implemented in the kernel, they could be scheduled independently. This, of course, means that if the scheduler does its job properly, no thread would starve due to one thread not voluntarily yielding. Further, when one thread in a process blocks for whatever reason, the others could continue being executed. Clearly this seems to be the better option.
The argument against kernel threads seems to be based on the additional overhead incurred by increased interrupts and context switches. This doesn't make much sense to me. The assumption seems to be made that if threads are handled in kernel space, each thread would (normally) be offered 1/n (where n = number of threads in process) of the process' allocated execution time and that a timer would be used to interrupt each thread to switch to the next. More overhead.
My question is this: why could each thread not be given the full amount of the process' allocated time? Instead of trying to run every thread in process X every time the scheduler chooses to allow that process to run, which requires many more interrupts and more overhead, why could the scheduler not just select one thread from the process each time and give it the full amount? Then the next time that process is scheduled to run, a different thread from that process could be selected (or not, depending on priority and status). This would mean that kernel space threads would have no more overhead than a kernel which only offered multiprogramming through processes, since it would only be interrupting when it would have interrupted to switch to a new process anyway.
What that would mean, obviously, is that each thread is still offered the same total amount of run time it would otherwise be offered, but it would receive a larger amount of time less frequently. For example, in a process with four threads and a total execution time of 100ms, instead of being offered 25ms every time the process is scheduled to run (1/n), each thread would receive 100ms every nth (here, fourth) time the process is scheduled to run.
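Just to be concrete about what I mean, here's a rough C sketch (purely illustrative - not any real kernel's code, and ignoring priorities, blocking, SMP and so on). The outer round-robin is the same one a process-only kernel would need; the only extra work is advancing a per-process index, and no extra timer interrupts are introduced.
Code:
/* Purely illustrative - not any real kernel's scheduler.  One timer tick ->
 * one call to pick_thread(), exactly as in a process-only kernel; the chosen
 * thread then keeps the CPU for the process' whole quantum. */
#include <stddef.h>

#define MAX_THREADS 16

struct thread  { int id; int runnable; };

struct process {
    struct thread threads[MAX_THREADS];
    size_t nthreads;
    size_t next;    /* which thread gets the whole quantum next time around */
};

struct thread *pick_thread(struct process *procs, size_t nprocs,
                           size_t *rr)           /* round-robin position    */
{
    for (size_t tried = 0; tried < nprocs; tried++) {
        struct process *p = &procs[*rr];
        *rr = (*rr + 1) % nprocs;

        /* Give the whole quantum to this process' next runnable thread.    */
        for (size_t i = 0; i < p->nthreads; i++) {
            struct thread *t = &p->threads[(p->next + i) % p->nthreads];
            if (t->runnable) {
                p->next = (p->next + i + 1) % p->nthreads;
                return t;
            }
        }
        /* No runnable thread in this process - try the next process.       */
    }
    return NULL;    /* nothing runnable anywhere: run the idle task          */
}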
Would that dramatically affect apparent responsiveness? That's the only downside I could think of to doing this.
Cheers,
Lee
Re: User vs Kernel threads
Google for "Scheduler Activations".
Re: User vs Kernel threads
Hi,
zerosum wrote: From what I understand of the material provided to us, user space threads are only better than kernel space threads in their efficiency. On the other hand, they're reliant on each thread being considerate and voluntarily yielding, meaning other threads in the process could be starved. Further to this, since the kernel has no concept of their existence, one blocking I/O request would cause the process to be taken out of the "Runnable" queue until such a time as the I/O completes, which kind of defeats the purpose of multithreading in the first place.
For user-level threads, the threading library (e.g. pthreads) could use some sort of timer function (e.g. SIGALRM) to do its own pre-emption, so that user-level threads don't need to voluntarily yield.
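On a Unix-like system it could look roughly like this (a deliberately minimal toy sketch with a made-up two-thread pool - not how any real threading library is implemented):
Code:
/* Toy sketch of timer-driven pre-emption for user-level threads.  The
 * library arms an interval timer and switches threads from the SIGALRM
 * handler, so no thread has to yield voluntarily.  Strictly speaking
 * swapcontext() is not async-signal-safe, which is one reason real
 * libraries are far more careful than this. */
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <ucontext.h>

#define NTHREADS 2                       /* hypothetical fixed pool size     */

static ucontext_t uthread[NTHREADS];     /* set up elsewhere with
                                            getcontext()/makecontext()       */
static volatile sig_atomic_t current = 0;

static void preempt(int sig)
{
    (void)sig;
    int prev = current;
    current = (current + 1) % NTHREADS;  /* round-robin to the next thread   */
    swapcontext(&uthread[prev], &uthread[current]);
}

static void start_preemption(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = preempt;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sigaction(SIGALRM, &sa, NULL);       /* the library's private "tick"     */

    struct itimerval slice = { { 0, 10000 }, { 0, 10000 } };  /* 10 ms slices */
    setitimer(ITIMER_REAL, &slice, NULL);
}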
User-level threads on their own are mostly useless (e.g. they can't make use of multiple CPUs), so most threading libraries map several user-level threads onto several kernel-level threads. This helps to reduce the negative effects of the "multiple CPUs" problem and the "blocking I/O" problem (and may even avoid these problems entirely with a suitably complex threading library) while still reducing the overhead of thread switches (as some/most thread switches could still be done in user space without the kernel's involvement).
The other problem with user-level threads is thread priorities, in that a thread's priority is often "per process" and can't easily be system-wide because the user-level thread scheduler doesn't know about threads that belong to other processes. To see why this is a problem, imagine two processes, each with a high-priority thread for the user interface and a low-priority thread for heavy processing. The kernel's scheduler only sees two similar processes and doesn't know about the user-level threads' priorities, and each user-level scheduler only sees its own threads and doesn't know about other processes, so you can get a situation where a low-priority thread belonging to one process gets more CPU time than a high-priority thread belonging to a different process.
zerosum wrote: On the other hand, if threads are implemented in the kernel, they could be scheduled independently. This, of course, means that if the scheduler does its job properly, no thread would starve due to one thread not voluntarily yielding. Further, when one thread in a process blocks for whatever reason, the others could continue being executed. Clearly this seems to be the better option. The argument against kernel threads seems to be based on the additional overhead incurred by increased interrupts and context switches. This doesn't make much sense to me. The assumption seems to be made that if threads are handled in kernel space, each thread would (normally) be offered 1/n (where n = number of threads in process) of the process' allocated execution time and that a timer would be used to interrupt each thread to switch to the next. More overhead.
The overhead of switching from one thread to another can be broken into several parts - saving the old thread's state, deciding which thread to run next, loading the new thread's state, and (for kernel-level threads only) the cost of switching from user-space to kernel-space and back to user-space. It's this switching from user-space to kernel-space and back that increases overhead (as in theory, everything else can be almost identical).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: User vs Kernel threads
Hi Brendan,
Thanks for your post
Brendan wrote: For user-level threads, the threading library (e.g. pthreads) could use some sort of timer function (e.g. SIGALRM) to do its own pre-emption, so that user-level threads don't need to voluntarily yield.
That was mentioned in the text too. I can't offhand recall the exact argument against such an implementation, but I believe it had something to do with signal conflicts between the thread library and the user application. That doesn't make a whole lot of sense, however, since the OS could, if nothing else, define a specific signal for use by thread libraries for the purposes of pre-emption.
Brendan wrote: The other problem with user-level threads is thread priorities, in that a thread's priority is often "per process" and can't easily be system-wide because the user-level thread scheduler doesn't know about threads that belong to other processes.
Very good point.
Brendan wrote: It's this switching from user-space to kernel-space and back that increases overhead (as in theory, everything else can be almost identical).
Fair enough. That was pretty much the point of my post: a (very basic and obvious) theory of how one could implement kernel threads that perform no more kernel -> user switches than if kernel threads were not implemented at all.
I would also be interested to know exactly how much overhead that kernel -> user switch (and vice versa) costs, but I guess that is a "How long is a piece of string" question.
As for scheduler activations, that was mentioned also, but I think I need to go over that part again. It seemed to gloss over it and not provide much in the way of practical information, other than saying that this method assigns "virtual CPUs," which in itself tells me very little. I think I'll re-read that part again and possibly Google it depending on what I can learn from re-reading the text.
Thanks again,
Lee
Re: User vs Kernel threads
Hi,
zerosum wrote: I would also be interested to know exactly how much overhead that kernel -> user switch (and vice versa) costs, but I guess that is a "How long is a piece of string" question.
It depends a lot on specific circumstances (exact code used, contents of various caches, CPU type, etc). For an extremely rough idea you could estimate about 25 cycles for the actual user-level to kernel-level switch, then another 32 cycles to reload data segment registers (DS and ES only, 16 cycles each), then about the same for the other direction (i.e. about 114 cycles from user-level back to user-level). This ignores things like TLB misses, cache misses, pipeline stalls, etc - you could easily add another 300 cycles on average.
However, this (extremely rough) estimate only matters when the kernel's scheduler switches between threads that belong to the same process. When the kernel's scheduler switches between threads that belong to different processes, you've got a full process switch (which can't be done by user-level code anyway). To get a full idea of how much overhead kernel-level threads cost, you'd need to work out how many thread switches are actually process switches and ignore them, then multiply the remainder (i.e. the thread switches that could have been done at user level) by the "user -> kernel -> user" overhead to get the overall overhead.
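For example (with completely made-up numbers): if a workload does 10,000 thread switches per second and 4,000 of those are between threads of the same process, then at roughly 400 cycles per avoidable user -> kernel -> user round trip the kernel-level approach adds about 4,000 * 400 = 1,600,000 extra cycles per second.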
zerosum wrote: As for scheduler activations, that was mentioned also, but I think I need to go over that part again. It seemed to gloss over it and not provide much in the way of practical information, other than saying that this method assigns "virtual CPUs," which in itself tells me very little. I think I'll re-read that part again and possibly Google it depending on what I can learn from re-reading the text.
Wikipedia has the simplest description of Scheduler Activations I've seen, although to be honest I really don't understand the difference between a "virtual processor" and a normal process (it seems like a case of redundant terminology to me). Hopefully someone who knows more about it can explain it to both of us.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: User vs Kernel threads
Hi Brendan,
Brendan wrote: It depends a lot on specific circumstances (exact code used, contents of various caches, CPU type, etc). For an extremely rough idea you could estimate about 25 cycles for the actual user-level to kernel-level switch, then another 32 cycles to reload data segment registers (DS and ES only, 16 cycles each), then about the same for the other direction (i.e. about 114 cycles from user-level back to user-level). This ignores things like TLB misses, cache misses, pipeline stalls, etc - you could easily add another 300 cycles on average.
As far as I can tell you've underestimated how costly the user -> kernel -> user switch can be. On the IA32 architecture the saving of registers to the kernel stack isn't a major problem - the pusha instruction takes care of most of them, and in the best case it can be handled internally with register renaming and be pretty darned quick. However, on other architectures (I'm thinking more of RISC ones like MIPS or POWER, but the same could possibly apply to AMD64 - I don't know enough about that architecture to comment) the large register file makes setting up the kernel state costly. In my PowerPC trap handler I save 164 bytes worth of registers to the stack - 32 GPRs and a load of special-purpose registers that can only be accessed via dedicated instructions. This is massively time consuming.
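To give an idea of the scale, a frame along these lines (purely illustrative - not my actual layout, and the exact set of SPRs varies from kernel to kernel) already comes to 164 bytes:
Code:
#include <stdint.h>

/* Illustrative only - roughly what a 32-bit PowerPC trap frame has to hold.
 * The GPRs alone are 128 bytes, and each of the SPRs below can only be read
 * with a dedicated mfspr/mflr/mfctr/mfcr instruction before being stored.  */
struct trap_frame {
    uint32_t gpr[32];   /* 32 general purpose registers = 128 bytes */
    uint32_t srr0;      /* interrupted program counter              */
    uint32_t srr1;      /* interrupted MSR                          */
    uint32_t lr;        /* link register                            */
    uint32_t ctr;       /* count register                           */
    uint32_t cr;        /* condition register                       */
    uint32_t xer;       /* fixed-point exception register           */
    uint32_t dar;       /* data address register                    */
    uint32_t dsisr;     /* data storage interrupt status register   */
    uint32_t trap;      /* which exception vector was taken         */
};                      /* 41 x 4 bytes = 164 bytes                 */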
Then again, on other architectures (here comes the PPC example again - get ready!) address space switching isn't as much of a killer as it is on x86. Systems like MIPS and PPC are designed with a kernel in mind - the design assumes that a kernel will stay mapped at the same location across all address spaces, and because of that there are fewer pipeline stalls and TLB flushes when switching address spaces (that, along with the general design of the MMU, which in both cases does much less work than on x86).
What I'm trying ham-handedly to say is that while user -> kernel switches may be relatively cheap on x86 and process switches extremely costly, the inverse can be true on other architectures, so the tradeoff between user-level and kernel-level threads can't be made universally across all platforms - it's architecture dependent.
James
Re: User vs Kernel threads
Thanks Brendan and James, I appreciate your informative responses.