OS design

beyondsociety

OS design

Post by beyondsociety »

I'm in the process of deciding whether a monolithic or micro kernel is best for my operating system's needs. There are numerous opinions on which is best to use, but rehashing that debate is not the point of this post.

With a monolithic kernel, all system drivers are located in kernel space (ring 0). The performance of context switches and messaging is better, but if one driver crashes, the whole system goes down.

Would a monolithic kernel with loadable modules for the drivers and preemptive multitasking fix the problem of "if one driver crashes, the whole system goes down"?

My other idea is to set up a separate Page Directory and Page Tables inside the kernel for the drivers. I'm wondering if this will add protection to the kernel from the drivers or just make it worse. I'm looking at using one address space per process with paging, though it would be slow to switch address spaces twice on every message pass.

This might solve the problem, but I'm not sure:
"Improved Address-Space Switching on Pentium Processors by Transparently Multiplexing User Address Spaces" covers using segments to allow a few small processes of 4 to 64MB in size to reside in a global part of larger 3GB processes, making task switching and IPC between these processes a bit quicker. These smaller address-space processes could be used for device drivers, for example.

With a micro kernel, all system drivers are located in user space (ring 3) and the kernel only contains: memory allocation, scheduling, and messaging. Since the drivers are located in user space, this causes more context switches and messaging to occur.

I was thinking of putting my system drivers in either ring 1 or 2. Is this a good idea, and would it solve the performance issue?

Thanks in advance for the input.
mystran

Re:OS design

Post by mystran »

Actually, this has nothing to do with "rings". It's a virtual memory management issue instead.

Usually each process is given its own virtual memory context, that is, each process has its own page tables. The kernel is then mapped into the top portion of each of these, so you don't need to switch the virtual memory context when you go from a process to the kernel and back. But if you need to switch to another process, you need to switch the page tables, which flushes the so-called TLBs, and that makes the following memory accesses very slow, since the processor has to go and read the actual page table entries again. Normally this is done only once: the entry is then stored in a TLB, and the check can be made from there. Sure, if you touch too much memory you cause the TLB to fill up, in which case some of it is discarded in favour of newly relevant entries, but the real pain is that switching the page directory means (almost) ALL of the TLB entries need to be emptied.
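
To make the usual arrangement concrete, here's a rough sketch (my example, not mystran's code; 32-bit x86 without PAE, and kernel_page_dir / KERNEL_PDE_START are illustrative names) of sharing the kernel's page tables into every process so that entering the kernel never needs a page-directory switch:

Code: Select all

/* Rough sketch: the kernel's page tables are shared into the top of every
 * new page directory, so a system call never reloads CR3 -- only switching
 * to another process does, which is where the TLB flush cost lives. */

#include <stdint.h>
#include <string.h>

#define PDE_COUNT        1024
#define KERNEL_PDE_START 768   /* 768 * 4MB = 3GB: kernel mapped in the top 1GB */

extern uint32_t kernel_page_dir[PDE_COUNT];   /* master directory set up at boot */

void init_process_page_dir(uint32_t *new_page_dir)
{
    /* The user half starts empty; pages get mapped in on demand. */
    memset(new_page_dir, 0, KERNEL_PDE_START * sizeof(uint32_t));

    /* Share the kernel's page tables with this process. */
    memcpy(&new_page_dir[KERNEL_PDE_START],
           &kernel_page_dir[KERNEL_PDE_START],
           (PDE_COUNT - KERNEL_PDE_START) * sizeof(uint32_t));
}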

In a monolithic kernel this is not such a huge issue, since most often we only need to go from a process to the kernel and back. If we need to switch processes, we take the hit, but since this is a relatively rare activity, the hit can be taken.

The problem is that in a microkernel, each driver is in a different process. If you needed to read some part of a file in a monolithic kernel, you'd switch from the application to the kernel, perform the read, and switch back. (Actually you'd probably start the read, do something else, then resume when the read is finished, but...)

In a microkernel, the application sends a message to a filesystem process, so we switch once. The filesystem process then needs some stuff from the harddrive device driver, so we switch processes a second time. Then the reply comes, we switch back to the filesystem, which might need some more stuff from the drive, so we need to switch a few more times, and finally we switch back to the application. A lot of switches, and if we need to flush all the TLBs on every switch, we'll do a lot of extra work.

So the idea is to keep a few small address spaces, and map them in every context, just like the kernel. Now, we can put the most commonly needed services and device drivers into those small address spaces, so that if we only need to switch between a single application and those "small address spaces", we won't have to switch the page directory -> no TLB flushes -> works faster.
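
One x86 mechanism that fits this idea (my sketch, not something mystran spelled out -- the paper mentioned above actually uses segmentation) is the global page flag: mappings that are identical in every address space can survive a CR3 reload. The PTE_* names are illustrative.

Code: Select all

/* Sketch: mark the shared "small space" / kernel mappings global (PTE bit 8)
 * so their TLB entries are not flushed on a page-directory switch.
 * Requires CR4.PGE (bit 7). */

#include <stdint.h>

#define PTE_PRESENT  (1u << 0)
#define PTE_WRITABLE (1u << 1)
#define PTE_GLOBAL   (1u << 8)

static inline void enable_global_pages(void)
{
    uint32_t cr4;
    __asm__ volatile ("mov %%cr4, %0" : "=r"(cr4));
    cr4 |= (1u << 7);                               /* CR4.PGE */
    __asm__ volatile ("mov %0, %%cr4" : : "r"(cr4));
}

uint32_t make_shared_pte(uint32_t phys_addr)
{
    /* A PTE like this stays in the TLB across process switches, so calls
     * into the shared services/drivers stay cheap. */
    return (phys_addr & 0xFFFFF000u) | PTE_PRESENT | PTE_WRITABLE | PTE_GLOBAL;
}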
Colonel Kernel
Member
Posts: 1437
Joined: Tue Oct 17, 2006 6:06 pm
Location: Vancouver, BC, Canada

Re:OS design

Post by Colonel Kernel »

In a microkernel, the application sends a message to a filesystem process, so we switch once. The filesystem process then needs some stuff from the harddrive device driver, so we switch processes a second time. Then the reply comes, we switch back to the filesystem, which might need some more stuff from the drive, so we need to switch a few more times, and finally we switch back to the application. A lot of switches, and if we need to flush all the TLBs on every switch, we'll do a lot of extra work.
I've probably brought up this nitpick somewhere before, but here it goes again... All that this example shows is why poorly-designed microkernels are slow. It's pretty pathological... A well-designed microkernel would do something more intelligent like implementing some drivers as processes and others as shared libraries so they can be "pooled". Component pooling has been around in middleware for years -- it makes sense for OS dev too IMO.

Imagine this instead -- the disk driver is a process, and it has loaded a shared-library filesystem driver for each partition on the disk. There obviously needs to be a way for the application to find the right server process to send a message to based on the path to the file it's accessing, but there is no reason why this would need to be determined for every access to the file (so it's out of scope for this example).

The application sends a message to the server process. The server process filters the request through the filesystem driver lib, which calls the disk driver (in the same address space, and possibly even the same thread). The disk driver returns to the filesystem driver. The filesystem driver calls the disk driver... ad nauseam -- it doesn't matter, they're both in the same process. Finally, the filesystem driver replies to the application process. Only two address space switches. Not as good as zero, but better. ;)
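
A minimal sketch of that pooled layout (all the names here -- fs_ops, disk_ops, handle_read_request -- are invented for illustration): the filesystem driver is just a table of function pointers loaded into the disk-driver process, so filesystem-to-disk calls never leave the address space.

Code: Select all

#include <stddef.h>
#include <stdint.h>

struct disk_ops {
    int (*read_sectors)(uint64_t lba, uint32_t count, void *buf);
};

struct fs_ops {
    int (*read_file)(struct disk_ops *disk, const char *path,
                     void *buf, size_t len, uint64_t offset);
};

/* Inside the disk-driver server process: service one "read file" request.
 * Only the request and the reply cross an address space boundary; the
 * fs<->disk traffic is ordinary in-process function calls. */
int handle_read_request(struct fs_ops *fs, struct disk_ops *disk,
                        const char *path, void *reply_buf,
                        size_t len, uint64_t offset)
{
    return fs->read_file(disk, path, reply_buf, len, offset);
}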
Top three reasons why my OS project died:
  1. Too much overtime at work
  2. Got married
  3. My brain got stuck in an infinite loop while trying to design the memory manager
Don't let this happen to you!
distantvoices
Member
Posts: 1600
Joined: Wed Oct 18, 2006 11:59 am
Location: Vienna/Austria

Re:OS design

Post by distantvoices »

I'd have the *drivers* - the threads which do the actual talking-to-hardware and the real system-related nitty-gritty - reside in kernel space.

Have a top-down approach: user processes do not have to talk to drivers directly (well, except for some crucial services which might fetch crucial system structures from kernel land - like the ifconfig command).

I do it like this:

User process -> service process (userprocess suspended)
service process ->driver (service process (thread eventually) suspended)
driver talks to hardware, is blocked till interrupt comes along (or not - see NIC programming)
driver -> service process
service process -> user process.

There are only two address space switches in the optimal case.
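
A minimal sketch of that synchronous path, assuming a hypothetical Minix-style send_receive() kernel call that suspends the caller until the reply overwrites the message buffer:

Code: Select all

struct message {
    int  type;
    char payload[56];
};

int send_receive(int endpoint, struct message *m);   /* assumed primitive */

int user_read(int fs_service, struct message *m)
{
    /* user process -> service process: we are suspended here. The service
     * does the same towards the driver, the driver blocks on its interrupt,
     * and the replies unwind back the same way. */
    return send_receive(fs_service, m);
}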

Now, consider address space management - or, as Dreamsmith likes to call it, process management. I, being Minix-inspired in the very beginning, created my *own process* for process/memory management, where the virtual address allocations also take place. It is a bottleneck if you do demand paging: all the time the pager asks the mm service "is this address valid?" - and eventually gets the command to map it in. This is... well, in the sense of a pure microkernel, correct operation, but in the sense of performance it is deeeeaad slooow.

So, if going the microkernel way, I advise you to put at least the virtual address space management into kernel space too. I think it is reasonable to have mmap/munmap calls go directly to the kernel. Virtual address space queries can take place directly in kernel land. If doing threads, have their virtual address tree roots point to the same tree if they belong to one process. (@Dreamsmith: yes, the message has sunk in. I'm planning how to get rid of this complicated thing.)
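
As a sketch of the difference: with the region tree in the kernel, the page fault handler can answer the "is this address valid?" question with a plain lookup instead of an IPC round trip. All of the names here (struct process, vm_find_region, map_anonymous_page) are invented for illustration.

Code: Select all

#include <stdbool.h>
#include <stdint.h>

struct process;
struct vm_region;                       /* node of a per-process region tree */

struct vm_region *vm_find_region(struct process *p, uintptr_t addr); /* assumed */
bool map_anonymous_page(struct process *p, uintptr_t addr);          /* assumed */

bool handle_page_fault(struct process *p, uintptr_t fault_addr)
{
    struct vm_region *r = vm_find_region(p, fault_addr);
    if (r == NULL)
        return false;   /* not a valid address: deliver a fault to the process */

    /* Valid demand-paged address: map a fresh page right here, in kernel land. */
    return map_anonymous_page(p, fault_addr);
}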
... the osdever formerly known as beyond infinity ...
BlueillusionOS iso image
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re:OS design

Post by Brendan »

Hi,
beyondsociety wrote: I'm in the process of deciding whether a monolithic or micro kernel is best for my operating system's needs. There are numerous opinions on which is best to use, but rehashing that debate is not the point of this post.
The reason you're getting numerous opinions is because whether a monolithic or micro kernel is the best depends heavily on what the OS will be used for, what the OS's goals are and what each person considers "best".
beyondsociety wrote: With a monolithic kernel, all system drivers are located in kernel space (ring 0). The performance of context switches and messaging is better, but if one driver crashes, the whole system goes down.
The performance of context switches and messaging would be the same, but (depending on implementation) the number of context switches and the number of messages sent/received would/could increase for micro-kernels.
beyondsociety wrote: Would a monolithic kernel with loadable modules for the drivers and preemptive multitasking fix the problem of "if one driver crashes, the whole system goes down"?
This also depends on implementation. For example, if there's a micro-kernel with loadable CPL=0 drivers the only difference would be that drivers couldn't access memory owned by other drivers or processes (they could still trash the kernel's memory and there still wouldn't be any IO port protection). If the device drivers ran at CPL=3 then the drivers are less likely to be able to trash something else. Then there's the problem of what the OS does when <something needed> dies (where <something needed> could be any device driver, the virtual filesystem code, the GUI, etc). For example, if the video driver dies will the OS continue without video, restart the video driver (which may crash again), or gracefully shutdown? What about a disk driver that's being used for swap space?
beyondsociety wrote: My other idea is to set up a separate Page Directory and Page Tables inside the kernel for the drivers. I'm wondering if this will add protection to the kernel from the drivers or just make it worse. I'm looking at using one address space per process with paging, though it would be slow to switch address spaces twice on every message pass.
This wouldn't make any difference to performance or security/stability (it's mostly the same as monolithic except it'd be easier to implement loadable device drivers)...
"Improved Address-Space Switching on Pentium Processors by Transparently Multiplexing User Address Spaces" Covers using segments to allow a few processes of 4 to 64MB in size to reside in a global part of larger 3GB process's. Making task switching and IPC between these processes a bit quicker. These smaller address space processes could be used for device drivers for example.


Multiplexing user address spaces could (depending on the design of the messaging and scheduling code) reduce the overhead of messaging and context switches between processes in the same address space (especially for CPUs and/or OSs that don't support/use the "global" page flag). If there's 48 processes squeezed into each address space, but the OS is running a total of 1000 processes, it's not going to help much. It would also mean limiting each process to a fraction of the address space, or having code to shift a process to its own (non-shared) address space if it starts using more than a fraction of the address space.
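
For concreteness, the figure of 48 simply comes from packing the largest small spaces the paper allows into the shared 3GB region:

3 GB / 64 MB = 3072 MB / 64 MB = 48 small address spaces per shared region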
beyondsociety wrote: I was thinking of putting my system drivers in either ring 1 or 2. Is this a good idea, and would it solve the performance issue?
Unfortunately unless you use segmentation for additional protection, anything that's not running at CPL=3 can trash the kernel's data (with paging there's only "user" and "system" protection levels).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re:OS design

Post by Brendan »

Hi again,
mystran wrote: In a monolithic kernel this is not such a huge issue, since most often we only need to go from a process to the kernel and back. If we need to switch processes, we take the hit, but since this is a relatively rare activity, the hit can be taken.

The problem is that in a microkernel, each driver is in a different process. If you needed to read some part of a file in a monolithic kernel, you'd switch from the application to the kernel, perform the read, and switch back. (Actually you'd probably start the read, do something else, then resume when the read is finished, but...)

In a microkernel, the application sends a message to a filesystem process, so we switch once. The filesystem process then needs some stuff from the harddrive device driver, so we switch processes a second time. Then the reply comes, we switch back to the filesystem, which might need some more stuff from the drive, so we need to switch a few more times, and finally we switch back to the application. A lot of switches, and if we need to flush all the TLBs on every switch, we'll do a lot of extra work.
Interestingly, for computers with several CPUs this can be all wrong. With a monolithic kernel (where device drivers do not have their own contexts) you can have one process very busy trying to use several device drivers, etc., all on one CPU, while all other CPUs are idle.

For a micro-kernel, an application/process running on CPU 1 sends a message to a file system process running on CPU 2, which sends and receives more messages to and from a process running on CPU 3. There don't have to be any context switches (until there are more processes running than CPUs) and all CPUs would be working to make the application process run better. This would be especially useful if messaging is asynchronous (i.e. if the processes don't block waiting for the result of a request, but instead continue doing useful work while the request is being handled).
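
A rough sketch of that asynchronous style (async_send, poll_reply and the message layout are hypothetical primitives, not anything specified above): the sender queues the request and keeps working instead of blocking.

Code: Select all

#include <stdbool.h>
#include <stdint.h>

struct message {
    int      type;
    uint64_t tag;            /* matches the reply to the request */
    char     payload[48];
};

int  async_send(int endpoint, const struct message *m);   /* returns immediately */
bool poll_reply(uint64_t tag, struct message *reply);     /* non-blocking check  */

void do_useful_work(void);

void read_file_async(int fs_service, struct message *req)
{
    struct message reply;

    async_send(fs_service, req);           /* filesystem may run on another CPU */

    while (!poll_reply(req->tag, &reply))  /* no context switch on this CPU ...  */
        do_useful_work();                  /* ... as long as there is other work */
}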


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Dreamsmith

Re:OS design

Post by Dreamsmith »

beyondsociety wrote: I'm in the process of deciding whether a monolithic or micro kernel is best for my operating system's needs.
If the question is, "Which is better, a monolithic or microkernel design?", my answer would be, "Yes."

If my answer seems nonsensical, it accurately reflects my opinion of the sensibility of the question.

I'm reminded of the great debate, years ago, about whether RISC or CISC design was better for making CPUs and which of these two ways would future CPUs be made. While the absolutists argued, the pragmatists made the current generation of CPUs, discarding the silly notion that you had to go one way or the other and implementing the best features of both.

I'm also reminded of "pure" languages, programming languages that take some concept and explore it to the extreme, thus giving us pure functional languages, or pure OO languages (where literally everything, even a constant like 2, is an object) -- they're great ways to explore the benefits and limits of an idea, but they never result in practical, general purpose languages. But the research provides wonderful insights into how to improve the general purpose languages we actually use.

Concepts like microkernel or exokernel are like this -- there's been some great research done taking these ideas to the extreme, and the lessons learned inform us on how to improve our OS designs. But unless you're doing research, it's never a good idea to take any idea too far.

Most modern operating systems, at least the popular ones, show traits of both monolithic and microkernel design, just as most modern processors show traits of both RISC and CISC design. If you're starting an OS project with the question, "Should I be designing a monolithic kernel or a microkernel?", you're starting your project with a commitment you ought not to be making. Barring specialized requirements, if you've committed to doing things one way or the other, you've already made a mistake in my opinion, regardless of which way you decided to go.
beyondsociety

Re:OS design

Post by beyondsociety »

For the most part, I've been adding things on the fly. I basically have a basic working kernel set up, and now I'm stuck on the design part. I've implemented paging, but no physical/virtual memory managers, and I am trying to work out a design before I go any further.

I was thinking of combining both of the different kernel models, but I'm not sure what to actually include. Does anybody have any suggestions?

Thanks in advance.
mystran

Re:OS design

Post by mystran »

Brendan wrote: Interestingly, for computers with several CPUs this can be all wrong. With a monolithic kernel (where device drivers do not have their own contexts) you can have one process very busy trying to use several device drivers, etc., all on one CPU, while all other CPUs are idle.
Actually I don't see what you are trying to say. Even in a monolithic kernel you can use separate kernel threads for any and all of your device drivers, in which case they can run on different CPUs. That said, if you only have one process (with one thread) to run, the other CPUs are going to be mostly idle anyway, unless your device drivers are really CPU heavy.
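
A minimal sketch of what that could look like (kthread_spawn and the driver table are invented names, not a real kernel API): each driver gets its own kernel thread, so the scheduler is free to spread them across CPUs even in a monolithic kernel.

Code: Select all

struct driver {
    const char *name;
    void (*thread_main)(void *arg);   /* loops on the driver's request queue */
    void *arg;
};

int kthread_spawn(void (*entry)(void *), void *arg);   /* assumed primitive */

void start_driver_threads(struct driver *drivers, int count)
{
    for (int i = 0; i < count; i++)
        kthread_spawn(drivers[i].thread_main, drivers[i].arg);
}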

Besides, it can be non-trivial to schedule for multiple processors in a way that keeps them all busy but avoids transferring tasks from one processor to another too often. It only gets worse when you add priorities into the mix: a lower-priority thread gets pre-empted by a higher-priority thread, but there's an idle CPU. Should we transfer one of the tasks to that idle CPU, when in fact we might have enough of both tasks' data in the first CPU's cache that it's actually faster to handle the higher-priority thread in place and keep the other CPU idle?

Finally, while I have no real SMP experience, I've understood that sending messages from one CPU to another isn't exactly free either. Could someone give me some idea of how much penalty sending a message from one CPU to another as part of IPC costs? I mean, there's the IPI, then there are cache sync issues... is there something more?
Pype.Clicker
Member
Posts: 5964
Joined: Wed Oct 18, 2006 2:31 am
Location: In a galaxy, far, far away

Re:OS design

Post by Pype.Clicker »

Well, the major improvement I can envision with SMP is that the scheduler could try to find another free thread that's running in the same space as the suspending thread (or that your API has been written for asynchronous calls and you've got something useful to do in the current thread while waiting).

Thus you can potentially save one context switch overhead (imho)

But you still have to serialize the message, or copy it into the target space (or map it to the target space), etc.
You still have one extra user-level-to-kernel context switch cycle, etc.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re:OS design

Post by Brendan »

Hi,
mystran wrote:
Brendan wrote: Interestingly, for computers with several CPUs this can be all wrong. With a monolithic kernel (where device drivers do not have their own contexts) you can have one process very busy trying to use several device drivers, etc., all on one CPU, while all other CPUs are idle.
Actually I don't see what you are trying to say. Even in a monolithic kernel you can use separate kernel threads for any and all of your device drivers, in which case they can run on different CPUs. That said, if you only have one process (with one thread) to run, the other CPUs are going to be mostly idle anyway, unless your device drivers are really CPU heavy.
It's common for a monolithic kernel not to use separate threads for each device driver to improve performance - a thread switch is never as fast as a near call. It was this type of monolithic kernel I was referring to when I said "a monolithic kernel (where device drivers do not have their own contexts)". While I did write "context" I actually meant "job and/or process and/or task and/or thread and/or whatever_else_you_call_it". Perhaps "the things that the scheduler schedules" would be less ambiguous?

IMHO all software designed for SMP/MP/NUMA should use as many threads as is logical to reduce uneven CPU load (for e.g. http://www.users.bigpond.com/sacabling/ ... erview.htm).

mystran wrote: Finally, while I have no real SMP experience, I've understood that sending messages from one CPU to another isn't exactly free either. Could someone give me some idea of how much penalty sending a message from one CPU to another as part of IPC costs? I mean, there's the IPI, then there are cache sync issues... is there something more?

Sending a message from a thread running on one CPU to a thread running on another involves much more overhead than a near call, but only the additional cost of an IPI (if necessary) when compared to sending a message between threads on the same CPU (depending on specific implementation).
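
To illustrate where the IPI cost comes in (a sketch only: the ICR offsets are the standard memory-mapped xAPIC layout, but the function and the chosen vector are invented): after putting the message on the target thread's queue, the sender kicks the CPU it runs on through the local APIC's interrupt command register.

Code: Select all

#include <stdint.h>

#define LAPIC_BASE   0xFEE00000u
#define LAPIC_ICR_LO (LAPIC_BASE + 0x300)
#define LAPIC_ICR_HI (LAPIC_BASE + 0x310)

static void send_wakeup_ipi(uint8_t target_apic_id, uint8_t vector)
{
    volatile uint32_t *icr_hi = (volatile uint32_t *)LAPIC_ICR_HI;
    volatile uint32_t *icr_lo = (volatile uint32_t *)LAPIC_ICR_LO;

    *icr_hi = (uint32_t)target_apic_id << 24;    /* destination APIC ID    */
    *icr_lo = (uint32_t)vector | (1u << 14);     /* fixed delivery, assert */
    /* Writing the low half sends the IPI; this write plus the receiver's
     * interrupt handling is the extra cost compared to same-CPU messaging. */
}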


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany

Re:OS design

Post by Solar »

Brendan wrote: It was this type of monolithic kernel I was referring to when I said "a monolithic kernel (where device drivers do not have their own contexts)". While I did write "context" I actually meant "job and/or process and/or task and/or thread and/or whatever_else_you_call_it". Perhaps "the things that the scheduler schedules" would be less ambiguous?
I previously proposed the term "task" for this, the (not further specified) control flow in general, as opposed to the more specific "process" (own address space) or "thread" (shared address space).

I felt that a proprietary definition would still help communication more than the unspecified mess of "I mean thread but I'll have to explain exactly what I mean anyway".

That proposal was turned down; perhaps it's time to put it on the plate again? ;)
Every good solution is obvious once you've found it.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re:OS design

Post by Brendan »

Hi,
Solar wrote: I previously proposed the term "task" for this, the (not further specified) control flow in general, as opposed to the more specific "process" (own address space) or "thread" (shared address space).

I felt that a proprietary definition would still help communication more than the unspecified mess of "I mean thread but I'll have to explain exactly what I mean anyway".

That proposal was turned down; perhaps it's time to put it on the plate again? ;)
Something needs to be done :).

The term "task" means too many different things and (IMHO) should be defined before use. For e.g. Intel use it to describe the hardware multi-tasking, but also use it with a completely different meaning when describing the local APIC (task priority register) - all in the same document. Different OS's have used "task" to mean whatever they felt like...

The term "context" could relate to something that is scheduled, or an "IRQ handler's context", or a different CPU mode (e.g. "SMM context"), and it's meaning really does depend on the context it's used.

A "process" in *nix (AFAIK) is a thing that can be scheduled that has it's own address space.

A "thread" in *nix (AFAIK) is a user-level thing that is ignored by the kernel (but scheduled by user-level code in the process).

Now, my OS has "things" and "sub-things", where each "thing" consists of one or more "sub-things" and part of the address space (e.g. the lowest 1 GB). Each "sub-thing" is independently scheduled by the kernel, but also owns part of the address space (e.g. the middle 2 GB).

With the *nix terminology, my "things" provide/own (part of) the address space that is shared by all "sub-things" and therefore it could be argued that my "things" are processes. However, because my "sub-things" are scheduled by the kernel and also have a unique (part of) the address space they can also be considered "processes". Obviously this is a mess.

To complicate things further, it would be conceivable to have a "job" that consists of one or more "processes", where each "process" consists of one or more "tasks", where each "task" consists of one or more "threads". Any of these levels could control any part/s of an address space, and any of these levels could be scheduled by code within the kernel or in the next highest level.

More generally, an OS may have N levels of things, with different characteristics for each level. Characteristics would include address space ownership, kernel scheduling or user level scheduling, file handle ownership, IRQ ownership, IO port ownership, etc.

IMHO this makes it almost impossible for any related terminology to have any specific meaning. Instead we could define a naming scheme that takes into account all possible levels and all possible characteristics (for e.g. "lev0AKFQI" could describe a *nix process, while "lev1-L---" could describe a *nix thread, and I could use "lev0A----" and "lev1AKFQI").

Alternatively we could all include a complete (descriptive) glossary with each post ::).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany

Re:OS design

Post by Solar »

Brendan wrote:
A "thread" in *nix (AFAIK) is a user-level thing that is ignored by the kernel (but scheduled by user-level code in the process).
Ahh, but then there are numerous papers on kernel-level threads (i.e. multiple control flows in a single address space scheduled by the kernel) vs. user-level threads (i.e. the kernel schedules the process, which in turn schedules multiple threads in its address space which the kernel doesn't see)... ;)

So we have "job", "task", "thread", "process"... any more contenders? ::) 8)
Every good solution is obvious once you've found it.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re:OS design

Post by Brendan »

Hi,
Solar wrote: Ahh, but then there are numerous papers on kernel-level threads (i.e. multiple control flows in a single address space scheduled by the kernel) vs. user-level threads (i.e. the kernel schedules the process, which in turn schedules multiple threads in its address space which the kernel doesn't see)... ;)
Yes, I think the original version of Unix didn't support threads in any way, so people started adding multi-threading in user-level libraries. The user-level threading wasn't too good because it was outside of the kernel's control - there was no way to tell the kernel the priority of your threads, and in some cases (with really old round-robin kernel schedulers) you couldn't even tell the kernel that all of the threads within the process were blocked/waiting.

Eventually different versions of *nix built multi-threading support into the kernel to get rid of the problems with user-level threads.
Solar wrote: So we have "job", "task", "thread", "process"... any more contenders? ::) 8)
What was wrong with "thing" and "sub-thing"!! :P


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.