Can scheduling be done in userspace?

Synon · Post by **Synon** » Mon Oct 03, 2011 10:02 am

I'm not writing it yet, but I've been planning out a microkernel as of today. Just getting down how it might work and things like that, and eventually I will try and write it (after I've managed to write a basic monolithic kernel - once I can get more than one userspace process running at once it I'll consider it a success; so that's probably going to be a couple of years from now).

Essentially I want to take the microkernel concept of "everything in userspace" as far as possible. As such, I've been trying to figure out what the absolute bare minimum is that the kernel itself has to do. So far; memory allocation (not management), memory-mapped-I/O and port I/O (abstracted away) and task scheduling. But I'm wondering, does the scheduler really need to be in the kernel? My way around it would be to have the kernel handle the actual switch between tasks, but have a scheduler running in userspace decide when to switch and what to switch to. The kernel could start all the servers and just run the scheduling server. The scheduling server would set an interval and a process, then the kernel would set the PIT and start running that process. When the PIT generates an interrupt the kernel would switch back to the scheduling server, and the process starts again, ad exitum.

My reason for wanting to do this is because I want to be able to very easily change the scheduling algorithm used. If the scheduler resides in kernel space, you either have to recompile the kernel when you write a new scheduler, or implement the scheduler as a module. Both of them seem less easy than simply having a userspace process that can load userspace modules, and each one can be a different scheduling algorithm. Or it could even have its own scripting language to describe a scheduling algorithm (but I'd probably go with the module loading method; there are enough scripting languages already).

NickJohnson · Post by **NickJohnson** » Mon Oct 03, 2011 10:08 am

That is actually the more "proper" way to make a scheduler for a microkernel, and I'm pretty sure there are at least a few modern microkernels that do that. The issue is that it causes a lot of extra context switches for relatively little gain in functionality or flexibility (there isn't really a scheduler policy too complex to implement in the kernel, and with loadable modules, you could even make an in-kernel scheduler replaceable.) Basically, whatever small performance gains you get from being able to swap out/heavily adjust the scheduler at runtime will likely be eclipsed by the increase in multitasking overhead. I think most of the people here who claim to write microkernels still have the scheduler in-kernel (I do, and I think Brendan does, anyway.)

This is just why people tend to make the other decision. I'm not trying to stop you from removing the scheduler from your microkernel; it is in fact possible, and has been done successfully.

Synon · Post by **Synon** » Mon Oct 03, 2011 10:17 am

NickJohnson wrote:That is actually the more "proper" way to make a scheduler for a microkernel, and I'm pretty sure there are at least a few modern microkernels that do that. The issue is that it causes a lot of extra context switches for relatively little gain in functionality or flexibility (there isn't really a scheduler policy too complex to implement in the kernel, and with loadable modules, you could even make an in-kernel scheduler replaceable.) Basically, whatever small performance gains you get from being able to swap out/heavily adjust the scheduler at runtime will likely be eclipsed by the increase in multitasking overhead. I think most of the people here who claim to write microkernels still have the scheduler in-kernel (I do, and I think Brendan does, anyway.)

This is just why people tend to make the other decision. I'm not trying to stop you from removing the scheduler from your microkernel; it is in fact possible, and has been done successfully.

Are there any extra considerations that need to be... considered (other than what you've already mentioned)? I understand the performance hit, but as CPUs get faster and faster, IMO it matters less and less as time goes on (Moore's law). Performance has never been the biggest issue for me anyway - I've always cared more about code clarity and conciseness, which, IMO, is easier to achieve in a microkernel (since the kernel itself is/should be smaller).

NickJohnson · Post by **NickJohnson** » Mon Oct 03, 2011 10:26 am

Synon wrote:I understand the performance hit, but as CPUs get faster and faster, IMO it matters less and less as time goes on (Moore's law). Performance has never been the biggest issue for me anyway - I've always cared more about code clarity and conciseness, which, IMO, is easier to achieve in a microkernel (since the kernel itself is/should be smaller).

CPUs are not really getting any faster. Moore's law states simply that the density of transistors we can achieve will double every 1.5 years, which does directly impact memory sizes, but not processor speeds. Currently, the speed of any single CPU has begun to plateau, but since we can pack more transistors in the same space, we're instead moving to more cores. Speed isn't everything, but the overhead of multitasking is felt everywhere in the system, and it can be significant. If you don't care about speed, why not just use a round-robin scheduler in-kernel?

In a microkernel, the kernel itself is smaller. However, once you add the complexity of exporting the scheduler to userspace and the complexity of implementing it in userspace, you actually end up with significantly more code. This is why monolithic kernels are easier to write: they're shorter and simpler. Microkernels are at least more flexible, and for drivers at least, much more clear and reliable, but they are anything but simple and concise from a whole-system perspective.

Synon · Post by **Synon** » Mon Oct 03, 2011 10:45 am

NickJohnson wrote:CPUs are not really getting any faster. Moore's law states simply that the density of transistors we can achieve will double every 1.5 years, which does directly impact memory sizes, but not processor speeds. Currently, the speed of any single CPU has begun to plateau, but since we can pack more transistors in the same space, we're instead moving to more cores.

But with more transistors, can't you increase the size and amount of registers, and the size of cache? That must have an effect on speed (not on the clock speed, but on the amount of time it takes for the CPU to pull data out of RAM or cache).

NickJohnson wrote:Speed isn't everything, but the overhead of multitasking is felt everywhere in the system, and it can be significant.

I suppose so; although as quoted in my previous post, Tanenbaum says the hit is 5-10% (although I would have thought that it would increase depending on the amount of processes there are; although I guess that relies on the time complexity of the scheduling algorithm).

NickJohnson wrote:If you don't care about speed, why not just use a round-robin scheduler in-kernel?

Two reasons

I want to be able to change the scheduler when I want to and to me it seems easier to do that in userspace (though I accept that you can replace the scheduler in a monolithic kernel via kernel modules).
I want the scheduler running in userland (in user-terram?) pretty much for the sake of it

NickJohnson wrote:In a microkernel, the kernel itself is smaller. However, once you add the complexity of exporting the scheduler to userspace and the complexity of implementing it in userspace, you actually end up with significantly more code. This is why monolithic kernels are easier to write: they're shorter and simpler. Microkernels are at least more flexible, and for drivers at least, much more clear and reliable, but they are anything but simple and concise from a whole-system perspective.

No, however each individual component can be more concise than in a pure monolithic kernel (since there is really only one component, ignoring modularity). OOP, for example, uses this principle - if you separate code into logical blocks, it is easier to conceptualise how the program as a whole will work, and each block can be easier to understand as well. Obviously that isn't necessarily the case - its just as possible to write a beautiful monolithic kernel as it is to write a spaghetti microkernel - but IMO a microkernel is more conducive to concise, well-organised code (I'm pretty much reciting the propaganda that resides throughout Tanenbaum's book; but I do actually believe it

).

NickJohnson · Post by **NickJohnson** » Mon Oct 03, 2011 11:20 am

Synon wrote:
NickJohnson wrote:CPUs are not really getting any faster. Moore's law states simply that the density of transistors we can achieve will double every 1.5 years, which does directly impact memory sizes, but not processor speeds. Currently, the speed of any single CPU has begun to plateau, but since we can pack more transistors in the same space, we're instead moving to more cores.
But with more transistors, can't you increase the size and amount of registers, and the size of cache? That must have an effect on speed (not on the clock speed, but on the amount of time it takes for the CPU to pull data out of RAM or cache).

This is true, but the extra operations you're doing are context switches, which cause TLB flushes, etc., which are not going to benefit from this that much. Also, you're on a specific architecture, so the number and size of the registers aren't going to change on you if you get a faster processor.

Synon wrote:

I want to be able to change the scheduler when I want to and to me it seems easier to do that in userspace (though I accept that you can replace the scheduler in a monolithic kernel via kernel modules).

I want the scheduler running in userland (in user-terram?) pretty much for the sake of it

Okay, sure. I'm not going to claim that I haven't made plenty of design decisions simply for the hell of it; it makes life more interesting.

Also, maybe "terram userum"? I don't think the Romans used hyphens.

Synon wrote:
NickJohnson wrote:In a microkernel, the kernel itself is smaller. However, once you add the complexity of exporting the scheduler to userspace and the complexity of implementing it in userspace, you actually end up with significantly more code. This is why monolithic kernels are easier to write: they're shorter and simpler. Microkernels are at least more flexible, and for drivers at least, much more clear and reliable, but they are anything but simple and concise from a whole-system perspective.
No, however each individual component can be more concise than in a pure monolithic kernel (since there is really only one component, ignoring modularity). OOP, for example, uses this principle - if you separate code into logical blocks, it is easier to conceptualise how the program as a whole will work, and each block can be easier to understand as well. Obviously that isn't necessarily the case - its just as possible to write a beautiful monolithic kernel as it is to write a spaghetti microkernel - but IMO a microkernel is more conducive to concise, well-organised code (I'm pretty much reciting the propaganda that resides throughout Tanenbaum's book; but I do actually believe it ).

There's nothing stopping you from abstracting the scheduler as an object in the kernel. Moving the scheduler into userspace makes it easily replaceable (but not more modular per se), harder for it to break things (more reliable, but not more modular), and saves address space (still not more modular.) From a modularity standpoint, you just turn the scheduler into an object that is difficult and expensive to interact with.

Synon · Post by **Synon** » Mon Oct 03, 2011 11:34 am

NickJohnson wrote:There's nothing stopping you from abstracting the scheduler as an object in the kernel. Moving the scheduler into userspace makes it easily replaceable (but not more modular per se), harder for it to break things (more reliable, but not more modular), and saves address space (still not more modular.) From a modularity standpoint, you just turn the scheduler into an object that is difficult and expensive to interact with.

Those three benefits you listed are good enough. Thanks for this conversation, it was interesting and helpful.

OSwhatever · Post by **OSwhatever** » Mon Oct 03, 2011 11:53 am

How does user mode scheduling work with SMP? Let's say you have two threads and two CPUs. How is the second CPU going to be involved if the kernel has no idea that the second thread was created.

Synon · Post by **Synon** » Mon Oct 03, 2011 12:44 pm

OSwhatever wrote:How does user mode scheduling work with SMP? Let's say you have two threads and two CPUs. How is the second CPU going to be involved if the kernel has no idea that the second thread was created.

I haven't though about that yet, but maybe it could work like this:

The kernel has its own process list, each node of which looks something like this:

Code: Select all

struct process {
    int pid;
    struct logical_cpu cpu;
    struct cpu_state state;
    /* etc. */
};

while the scheduler would just have something along the lines of

Code: Select all

struct process {
    int pid;
    int priority;
    /* etc. */
};

The scheduler doesn't need to know about the CPU or anything like that, all it needs to do is decide when to switch process and which process to switch to.

Hopefully I'll work something out at some point...

gerryg400 · Post by **gerryg400** » Mon Oct 03, 2011 1:47 pm

The main problem I see with user-mode scheduling is that in a microkernel (well in my microkernel at least) the scheduler ends up being distributed all through the kernel.

For example I use message-passing and asynchronous events as my IPC. In both those modules and in other places there are bits of the scheduler that change the priority of the current or target process, switch to another process depending on the priority of the current message or event or schedule the target process to run on the same or another core.

In fact most of my scheduler is like this. A lot of scheduling is done without ever really using the scheduler proper. I guess I could never figure out how to combine that with user-mode scheduling.

juzam · Post by **juzam** » Thu Nov 10, 2011 9:21 am

The best solution might be a blend of kernel-level thread and user-level thread. Check the Solaris architecture about thread and process (LWP, doorbell facilities to notice the kernel when a user-level thread is blocking or waiting for an object, ...).

http://en.wikipedia.org/wiki/Light-weight_process
http://developers.sun.com/solaris/artic ... ocess.html

turdus · Post by **turdus** » Thu Nov 10, 2011 3:02 pm

NickJohnson wrote:If you don't care about speed, why not just use a round-robin scheduler in-kernel?

I'm not sure what you mean by that, round-rubin is the fastest scheduling algorithm possible, and also scales well. It's simple as

Code: Select all

current=current->next;

any other method would take more cpu cycles, and it's absolutely Θ(1) (the time to get the next thread does not depend on number of threads). K.I.S.S. you know.

As for the OP's question: I've a two level scheduling model, the low level one in kernel uses 32 priority queues with round-rubin in each level. When the last queue is wrapped around, a userspace high level scheduler is called, that can reorganize threads in queues (by changing next pointers). This has several benefits: high performance, since userspace scheduler called not so often, yet modular and changeable on-the-fly. Simplest userspace scheduler could be a one that put all threads in the same queue, thus making a transparent round-rubin scheduling. More complex implementation would examine the number of blockings (whether a thread is cpu oriented or io oriented) and separate threads into different queues according to that or whatever. Because userspace scheduler organize threads it can calculate the next time of it's own call (in worst case when no thread blocks). This can be used as a deadline which is essential to real-time scheduling. In extreme situation userspace scheduler would put only one thread in a queue, gaining what you originally wanted to.

I do not suggest to use userspace scheduler only, because your kernel has to
1. stop running a thread, go to kernel land via interrupt
2. switch context to scheduler's userspace
3. wait until scheduler calculates and decides which thread to wake (uninterruptible)
4. switch from userspace to kernel land via syscall
5. switch context again to selected thread's userspace
on each timer irq, which seem more than inefficient.

But, do as you like, writing an OS is all about testing your ideas.

NickJohnson · Post by **NickJohnson** » Thu Nov 10, 2011 3:08 pm

turdus wrote:
NickJohnson wrote:If you don't care about speed, why not just use a round-robin scheduler in-kernel?
I'm not sure what you mean by that, round-rubin is the fastest scheduling algorithm possible, and also scales well. It's simple as
Code: Select all
current=current->next;
any other method would take more cpu cycles, and it's absolutely Θ(1) (the time to get the next thread does not depend on number of threads). K.I.S.S. you know.

By speed, I don't mean speed of the scheduling algorithm itself. Scheduler overhead is negligible in comparison to context switching overhead, so having a smart algorithm for your scheduler can improve system performance by reducing context switches and more fairly allocating time to processes, even though it takes a few more instructions to actually run the scheduler. Because it ignores all of those things, round-robin is one of the worst scheduling algorithms, although in practice it usually does well enough.

Brendan · Post by **Brendan** » Thu Nov 10, 2011 6:55 pm

Hi,

NickJohnson wrote:
turdus wrote:
NickJohnson wrote:If you don't care about speed, why not just use a round-robin scheduler in-kernel?
I'm not sure what you mean by that, round-rubin is the fastest scheduling algorithm possible, and also scales well. It's simple as
Code: Select all
current=current->next;
any other method would take more cpu cycles, and it's absolutely Θ(1) (the time to get the next thread does not depend on number of threads). K.I.S.S. you know.
By speed, I don't mean speed of the scheduling algorithm itself. Scheduler overhead is negligible in comparison to context switching overhead, so having a smart algorithm for your scheduler can improve system performance by reducing context switches and more fairly allocating time to processes, even though it takes a few more instructions to actually run the scheduler. Because it ignores all of those things, round-robin is one of the worst scheduling algorithms, although in practice it usually does well enough.

There's about 5 different things worth considering:

Raw speed: How fast task switches are when running on one CPU. This depends on the overhead of the task switch itself (e.g. deciding what to switch to, saving previous task's state and loading new task's state), plus indirect overheads (like cache and TLB misses). This may not be a single number (e.g. speed may get worse the more threads there are). It also tends to get worse when you add more features. For example, adding support for FPU/MMX/SSE/AVX will effect raw speed, and so will tracking how much CPU time each task actually used, supporting "per task" debugging and "per task" performance monitoring, etc.
Scalability: How fast task switches are when running on multiple CPUs (including direct overheads and indirect overheads). This isn't a single number either (and not one "ratio"), but more like a graph. For example, a scheduler design might handle up to 8 CPUs really well, but have severe problems when there's more than 8 CPUs. For multi-CPU, there's also a lot of techniques/optimisations that can be done that increase the direct overheads but decrease the indirect overheads (e.g. spending more time deciding which task to switch to, in an attempt to reduce the chance of cache misses or the cost of cache misses, or to distribute load across CPUs better).
Latency: How quickly it responds to changes. For example, if a high priority thread is unblocked, how long until that high priority thread is given CPU time. For things like how fast the OS feels to the end-user, latency can be far more important than raw speed or scalability (and it's especially important for micro-kernels, where scheduling latency effects how quickly device device divers can respond to things like IRQs).
Frequency: How often task switches occur. For example, if task switching itself is very fast (and scales very well), then the OS can still suck because task switches occur very often.
Fairness: How well the scheduler meets someone's definition of "fair" (where "someone's definition of fair" mostly depends on the OS's design goals, and can be a political nightmare). For example, if one user has 5 tasks and another user has 50 tasks, then should the first user's tasks get 10 times as much CPU time as the second user's tasks, so that both users end up with an equal share of CPU time (or should the tasks be treated equally, so that the second user ends up hogging most of the CPU time)?

Note: You could probably combine "raw speed" and "scalability", and represent both with a single 3-dimensional graph, with "number of CPUs" along one axis, "number of threads" along another axis, and "overhead" along the third axis.

For round-robin implemented with one linked list, the overhead of deciding which task to switch to might be very minimal; but the indirect overhead would be the worst case (e.g. it maximises the chance that a task's data won't be in cache/s next time the task is given CPU time and screws up any techniques/optimisations to reduce indirect overheads on multi-CPU) and scalability will suck (e.g. lots of CPUs fighting to access the same linked list as tasks are inserted/removed from the linked list). Latency will also suck (no task priorities, an important task may need to wait while 10000 other tasks have their turn before it gets CPU time) and the only way to make latency suck slightly less would be to increase frequency (and make everything suck). For "fairness" it depends how you define "fair" - round robin would only work if "fair" means all tasks are treated the same (regardless of how important the task/s are or who they belong to).

Cheers,

Brendan

rdos · Post by **rdos** » Fri Nov 11, 2011 2:12 am

There is nothing wrong with round-robin, at least not when round-robin is per scheduler queue. In my solution, I have a number of round-robin queues. For ready-to-run threads, there is one queue per priority-level (I have 256 levels, but only use about 16). Each core has it's own set of priority-based queues, and then there are per-system priority queues for threads that have been "kicked" from their core.

OSDev.org

Can scheduling be done in userspace?

Can scheduling be done in userspace?

Re: Can scheduling be done in userspace?

Re: Can scheduling be done in userspace?

Re: Can scheduling be done in userspace?

Re: Can scheduling be done in userspace?

Re: Can scheduling be done in userspace?

Re: Can scheduling be done in userspace?

Re: Can scheduling be done in userspace?

Re: Can scheduling be done in userspace?

Re: Can scheduling be done in userspace?

Re: Can scheduling be done in userspace?

Re: Can scheduling be done in userspace?

Re: Can scheduling be done in userspace?

Re: Can scheduling be done in userspace?

Re: Can scheduling be done in userspace?