
MMU support for microkernels

Posted: Thu Sep 26, 2013 5:36 pm
by linguofreak
I've been pondering the advantages and disadvantages of microkernels, and one of the things that strikes me is that many of the services traditionally provided by a kernel are best implemented as libraries. A monolithic kernel does indeed function as a big library (or set of libraries), whereas a microkernel, in offloading services into userspace, tends to make full processes out of them. That strikes me as inelegant, and it creates the context switch overhead that is generally cited (rightly or wrongly) as the primary disadvantage of microkernels.

Now, the reason microkernels tend to implement userspace services as full processes is that most (all?) current MMUs don't provide the capability to isolate services to the degree desirable for a microkernel architecture in any other way. So the question becomes: what hardware features does a microkernel need?

To begin with, I can see at least three types of addressing resources that an OS should differentiate between, and where hardware support for such differentiation would be incredibly helpful:

1) Thread state (stacks, etc)
2) Code
3) Data sets shared between threads (files, process state, IPC buffers, etc)

The x86 MMU is the only one I know of that allows for differentiation between all three types, but the implementation of the Intel segmentation model leaves much to be desired, and the ability to do non-flat address spaces has been abandoned with x86-64. Still, a microkernel would need something like x86 segmentation or z/Architecture access registers.

One difference from the Intel implementation would be that instead of a segment being determined by a base and a limit in an underlying paged address space, each segment descriptor would point to a page directory for that segment. This would allow page table lookups to be done immediately, rather than requiring that they be deferred until the proper segment base had been added to the offset. Ideally, the descriptor for a segment wouldn't even have to be checked if translation information for a segment:offset pair was cached in the TLB.
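
A minimal sketch of what such a descriptor might look like (the layout and field names are mine, purely illustrative of the idea, not any real architecture):

Code:

#include <stdint.h>

/* Hypothetical segment descriptor: instead of a base and limit over one
 * flat paged address space, each segment carries its own page directory,
 * so translation can start immediately from the segment:offset pair. */
struct seg_descriptor {
    uint64_t page_dir_phys; /* physical address of this segment's page directory */
    uint32_t access;        /* read/write/execute and privilege bits */
    uint32_t type;          /* code, thread state, or shared data segment */
};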

Building on this, a really critical feature for microkernels would be a means of preventing unprivileged code from accessing segments it doesn't have access to, without leaving those segments completely unmapped and forcing context switches when switching between code that does or doesn't have access to a segment. The way I see this being done is this: the kernel establishes a system-wide segment table, roughly equivalent to the Intel GDT, containing descriptors for every segment on the system. The kernel can load segment registers directly from this table, but user-space programs cannot. Instead, each code or thread state segment descriptor contains a pointer to a "virtual segment table". When a user program performs a segment load, the segment selector it uses is translated via the VST into a selector that points into the global segment table, from which the corresponding descriptor is then loaded. The high bit of the virtual selector used by the program would determine whether the VST used would be that of the current code segment or that of the current thread state segment.

Once a segment had been loaded, the translated selector value would be stored in the segment register instead of the untranslated value, allowing different programs to use different selector values to load a given segment without invalidating TLB entries corresponding to that segment, and allowing a caller to grant a callee one-time access to one of the caller's segments simply by leaving that segment loaded in a segment register while making the call.
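
To make the selector translation concrete, here's a rough C sketch of what the hardware (or a software emulation of it) would do on a user-mode segment load; the names and the exact bit layout are mine, purely illustrative:

Code:

#include <stdint.h>

#define VST_THREAD_BIT 0x8000u /* high bit selects which VST to translate through */

/* Translate a program's virtual selector through the VST of the current
 * code segment or the current thread state segment into a selector for
 * the global segment table. The returned (translated) value is what gets
 * stored in the segment register, so TLB entries stay keyed to the
 * global selector no matter which virtual selector a program used. */
uint16_t translate_selector(uint16_t virt_sel,
                            const uint16_t *code_vst,
                            const uint16_t *thread_vst)
{
    const uint16_t *vst = (virt_sel & VST_THREAD_BIT) ? thread_vst : code_vst;
    return vst[virt_sel & ~VST_THREAD_BIT];
}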

The goal is to make it so that device drivers and other traditional kernel services can be implemented in userspace and given access to only what they need to do their jobs, as in a microkernel, while being called directly by code that needs to use them without any kind of message passing or process or thread switch, as in a monolithic kernel, and without the involvement of the kernel to the extent possible.

Can anybody think of other hardware features that would be useful for the implementation of a microkernel?

Re: MMU support for microkernels

Posted: Thu Sep 26, 2013 9:57 pm
by AndrewAPrice
I've asked myself the same question. Context switching incurs a huge overhead, and that is a big motivation for me to implement isolation in software, rather than in hardware.

Re: MMU support for microkernels

Posted: Fri Sep 27, 2013 12:56 am
by Brendan
Hi,
linguofreak wrote:Can anybody think of other hardware features that would be useful for the implementation of a microkernel?
For all kernels, the goal would be to ensure that no normal process can access anything belonging to any other normal process; including (for e.g.) ensuring that one application can't read sensitive data (e.g. passwords) from a different application. This level of isolation requires that different pieces run in their own separate virtual address spaces (or to put it another way, if a hardware feature provides the isolation required, then you could call the isolated contexts "virtual address spaces").

For micro-kernels, the goal would be to ensure that no piece can access anything belonging to any other piece; including (for e.g.) ensuring that one device driver can't read sensitive data (e.g. passwords) from a different device driver. This level of isolation requires that different pieces run in their own separate virtual address spaces (or to put it another way, if a hardware feature provides the isolation required, then you could call the isolated contexts "virtual address spaces").

Any CPU that provides the isolation required for normal applications already has the hardware feature/s needed for a micro-kernel.

For performance, the main problem is the cost of communication between isolated pieces (regardless of whether they're normal processes/applications communicating via something like pipes or messages on a monolithic system, or device drivers communicating via something like pipes or messages on a micro-kernel). The communication costs can be broken down into 3 things:
  • The overhead of privilege level changes (e.g. CPL=3 -> CPL=0 -> CPL=3). For 80x86 these costs were slashed about 15 years ago with the introduction of SYSENTER/SYSEXIT and SYSCALL/SYSRET.
  • The overhead of virtual address space switching. For 80x86 the cost of these was dramatically reduced 5 years ago when Intel introduced support for tagged TLBs and "Address Space IDs" in Nehalem CPUs.
  • The impact on caches caused by changing from one piece of code's working set to another piece of code's working set (cache misses, for both normal caches and TLB). This has nothing to do with micro-kernel vs. monolithic (it affects monolithic kernels just as much). There are only two ways to solve this - increase cache size (so the working set for all pieces of code will fit in caches all the time), and change working sets less often.
The key point here is "change working sets less often". For communication between pieces of code, the only way you can change working sets less often is to make use of asynchronous communication. Basically; rather than immediately changing from one piece of code's working set to another piece of code's working set; add information that the other piece of code will need to some sort of buffer/queue and change working sets when convenient (not immediately).

For an example, imagine a process that wants to open 10 files. It could call the virtual file system's "open file" function immediately (first working set change) and let the VFS do whatever it has to do to open the file and return (second working set change), and do this 10 times. Alternatively the process could add 10 "open file commands" to some sort of queue, then switch to the VFS's working set, then let the VFS open all 10 files, then switch back to the process' working set. Because the working set is changed twice instead of 20 times, the number of cache misses should be significantly reduced.
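
As a sketch of what that might look like (my own names - nothing here is a real VFS interface), batching the opens could be as simple as:

Code:

#include <stddef.h>

/* Hypothetical command queue shared between a process and the VFS. */
struct open_cmd { const char *path; int flags; int result; };

struct cmd_queue {
    struct open_cmd cmds[32];
    size_t count;
};

void queue_opens(struct cmd_queue *q, const char **paths, size_t n)
{
    for (size_t i = 0; i < n && q->count < 32; i++) {
        q->cmds[q->count].path   = paths[i];
        q->cmds[q->count].flags  = 0;
        q->cmds[q->count].result = -1; /* filled in by the VFS */
        q->count++;
    }
    /* One working set change happens here: the VFS drains the queue,
     * opens all n files, fills in the results, and control comes back -
     * two switches total instead of 20. */
}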

In addition to this; for multi-CPU systems, asynchronous communication means that the sender and receiver can be running on different CPUs at the same time. Apart from the obvious benefits (parallelism), this can severely reduce the cost of working set changes. For example (same example as before), a process that wants to open 10 files can be running on one CPU with one set of caches, while the VFS is running on a different CPU with different caches, where working set changes (and cache misses) are avoided entirely.

With all of the above in mind; for modern 80x86 (e.g. Nehalem or later, and possibly even Pentium III and later) a micro-kernel that uses asynchronous communication may be faster and more scalable than a monolithic kernel that uses direct function calls despite the extra overhead caused by additional isolation. In practice, the problem is that existing software/applications (and things like the standard C library) aren't designed for asynchronous communication, so to get the benefits you need to write an OS from scratch (rather than merely writing a kernel and recycling existing generic/POSIX applications).

Sadly; almost all existing micro-kernels (L4, Minix, QNX, etc) have used synchronous communication, and the OSs that most of these micro-kernels have been used in have recycled generic/POSIX applications. This is why people assume all micro-kernels have worse performance even though it's not necessarily true.


Cheers,

Brendan

Re: MMU support for microkernels

Posted: Fri Sep 27, 2013 11:07 am
by Brynet-Inc
Intel is working on something called SGX or "Software Guard Extensions", but it's pretty obvious it was motivated by the media industry's DRM fetishes.

Re: MMU support for microkernels

Posted: Fri Sep 27, 2013 11:49 am
by linguofreak
Brendan wrote:Hi,
linguofreak wrote:Can anybody think of other hardware features that would be useful for the implementation of a microkernel?
For all kernels, the goal would be to ensure that no normal process can access anything belonging to any other normal process; including (for e.g.) ensuring that one application can't read sensitive data (e.g. passwords) from a different application. This level of isolation requires that different pieces run in their own separate virtual address spaces (or to put it another way, if a hardware feature provides the isolation required, then you could call the isolated contexts "virtual address spaces").

For micro-kernels, the goal would be to ensure that no piece can access anything belonging to any other piece; including (for e.g.) ensuring that one device driver can't read sensitive data (e.g. passwords) from a different device driver. This level of isolation requires that different pieces run in their own separate virtual address spaces (or to put it another way, if a hardware feature provides the isolation required, then you could call the isolated contexts "virtual address spaces").
The goal for a microkernel is to make sure no piece can access anything belonging to another piece without permission from the owner. If a process asks the filesystem driver to write a memory mapped file to disk, for example, the filesystem driver needs to have access to the file buffer.
Any CPU that provides the isolation required for normal applications already has the hardware feature/s needed for a micro-kernel.
Any Turing-complete CPU provides all the features *needed* for a microkernel even without an MMU. You can just run all your user-space code in a VM that does all of its protection in software. It'll just be slow.

The question is what features are *useful*.
For performance, the main problem is the cost of communication between isolated pieces (regardless of whether they're normal processes/applications communicating via something like pipes or messages on a monolithic system, or device drivers communicating via something like pipes or messages on a micro-kernel). The communication costs can be broken down into 3 things:
  • The overhead of privilege level changes (e.g. CPL=3 -> CPL=0 -> CPL=3). For 80x86 these costs were slashed about 15 years ago with the introduction of SYSENTER/SYSEXIT and SYSCALL/SYSRET.
  • The overhead of virtual address space switching. For 80x86 the cost of these was dramatically reduced 5 years ago when Intel introduced support for tagged TLBs and "Address Space IDs" in Nehalem CPUs.
My scheme is basically an ASID scheme with the addition that a program can access multiple address spaces at the same time and that the ASID mechanism is userspace-visible, allowing programs to switch to the address space of a service they want to call without getting the kernel involved, pass address spaces as parameters, etc.
  • The impact on caches caused by changing from one piece of code's working set to another piece of code's working set (cache misses, for both normal caches and TLB). This has nothing to do with micro-kernel vs. monolithic (it affects monolithic kernels just as much). There are only two ways to solve this - increase cache size (so the working set for all pieces of code will fit in caches all the time), and change working sets less often.

The key point here is "change working sets less often". For communication between pieces of code, the only way you can change working sets less often is to make use of asynchronous communication. Basically; rather than immediately changing from one piece of code's working set to another piece of code's working set; add information that the other piece of code will need to some sort of buffer/queue and change working sets when convenient (not immediately).

For an example, imagine a process that wants to open 10 files. It could call the virtual file system's "open file" function immediately (first working set change) and let the VFS do whatever it has to do to open the file and return (second working set change), and do this 10 times. Alternatively the process could add 10 "open file commands" to some sort of queue, then switch to the VFS's working set, then let the VFS open all 10 files, then switch back to the process' working set. Because the working set is changed twice instead of 20 times, the number of cache misses should be significantly reduced.
But wouldn't it be neat to have hardware features that allow you to move the mechanism that adds stuff to the VFS queue out of the kernel and into userspace in the VFS?

Re: MMU support for microkernels

Posted: Fri Sep 27, 2013 2:46 pm
by Brendan
Hi,
linguofreak wrote:
Brendan wrote:For micro-kernels, the goal would be to ensure that no piece can access anything belonging to any other piece; including (for e.g.) ensuring that one device driver can't read sensitive data (e.g. passwords) from a different device driver. This level of isolation requires that different pieces run in their own separate virtual address spaces (or to put it another way, if a hardware feature provides the isolation required, then you could call the isolated contexts "virtual address spaces").
The goal for a microkernel is to make sure no piece can access anything belonging to another piece without permission from the owner. If a process asks the filesystem driver to write a memory mapped file to disk, for example, the filesystem driver needs to have access to the file buffer.
Sure, but plain old shared memory is enough to allow one process to give another process restricted (and direct) access to specific pieces of its data. You don't need anything special (beyond normal paging) built into a CPU for this.
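
For instance, on a POSIX-ish system the owner can export just the file buffer and nothing else (a minimal sketch; the object name "/file_buffer" is made up, and error handling is omitted):

Code:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Owner: create a named shared object sized to the file buffer. */
    int fd = shm_open("/file_buffer", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 4096);
    char *owner_view = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);

    /* The filesystem driver, in its own separate address space, maps the
     * same object (read-only here) and sees this buffer and nothing else
     * of the owner's memory. */
    char *driver_view = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);

    (void)owner_view;
    (void)driver_view;
    return 0;
}
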
linguofreak wrote:
For performance, the main problem is the cost of communication between isolated pieces (regardless of whether they're normal processes/applications communicating via something like pipes or messages on a monolithic system, or device drivers communicating via something like pipes or messages on a micro-kernel). The communication costs can be broken down into 3 things:
  • The overhead of privilege level changes (e.g. CPL=3 -> CPL=0 -> CPL=3). For 80x86 these costs were slashed about 15 years ago with the introduction of SYSENTER/SYSEXIT and SYSCALL/SYSRET.
  • The overhead of virtual address space switching. For 80x86 the cost of these was dramatically reduced 5 years ago when Intel introduced support for tagged TLBs and "Address Space IDs" in Nehalem CPUs.
My scheme is basically an ASID scheme with the addition that a program can access multiple address spaces at the same time and that the ASID mechanism is userspace-visible, allowing programs to switch to the address space of a service they want to call without getting the kernel involved, pass address spaces as parameters, etc.
Sounds like a massive security problem (e.g. processes deciding to access other processes' address spaces).
linguofreak wrote:But wouldn't it be neat to have hardware features that allow you to move the mechanism that adds stuff to the VFS queue out of the kernel and into userspace in the VFS?
Not really.

It's relatively easy (using lock-free FIFO queues and shared memory) to shift the queues themselves into user-space. However, for it to work in a sane way the kernel's scheduler has to know when to do task/thread switches; and if you have to tell the kernel's scheduler when you're waiting for a message (need to be blocked) or that you've sent a message (so the scheduler can determine if the receiver should be unblocked and if it should pre-empt or run on a different CPU); then you still have to do a kernel API call (and privilege level switching) and haven't actually gained anything by shifting the queues to user-space. What you will do is create the opportunity for user-space to screw things up for no benefit whatsoever. ;)
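
To illustrate, the queue half really is simple - something like this single-producer/single-consumer ring (C11 atomics; the layout is just an example) could live in a shared mapping, and it's exactly the blocking/waking half that drags the kernel back in:

Code:

#include <stdatomic.h>
#include <stdint.h>

#define QSIZE 64 /* power of two, so the free-running indices wrap cleanly */

/* Lock-free single-producer/single-consumer FIFO. The producer only
 * writes 'tail', the consumer only writes 'head'; acquire/release on
 * the index each side doesn't own is all the synchronization needed. */
struct spsc_queue {
    _Atomic uint32_t head, tail; /* free-running counters */
    uint64_t slots[QSIZE];
};

int spsc_push(struct spsc_queue *q, uint64_t msg)
{
    uint32_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint32_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QSIZE) return 0; /* full */
    q->slots[t % QSIZE] = msg;
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return 1;
}

int spsc_pop(struct spsc_queue *q, uint64_t *msg)
{
    uint32_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h == t) return 0; /* empty */
    *msg = q->slots[h % QSIZE];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return 1;
}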


Cheers,

Brendan

Re: MMU support for microkernels

Posted: Fri Sep 27, 2013 3:28 pm
by Owen
Notes to add to Brendan's first post:
  • Context switch overhead is very much CPU-dependent
  • x86 ASIDs only work with Intel VT and AMD SVM
  • x86 is the anomaly with regards to context switch overhead - it is especially bad
The last comes as a result of (a) the proliferation of monolithic kernels (especially on x86), and (b) the fact that the usage of x86 is completely different from its initial design.

x86 operating systems generally assume high context switch overhead and hence optimize for minimal context switches.

Re: MMU support for microkernels

Posted: Fri Sep 27, 2013 10:53 pm
by linguofreak
Brendan wrote:Hi,
linguofreak wrote:My scheme is basically an ASID scheme with the addition that a program can access multiple address spaces at the same time and that the ASID mechanism is userspace-visible, allowing programs to switch to the address space of a service they want to call without getting the kernel involved, pass address spaces as parameters, etc.
Sounds like a massive security problem (e.g. processes deciding to access other processes' address spaces).
Note the word "service" instead of "process". The idea here is that the service is something more akin to a shared library (or at least exports a shared library if it is a full process) and makes available an address space or segment (depending on what terminology you want to use) containing the functions that make up its interface. A program wishing to make use of these functions then carries out the equivalent of an Intel far jump into that address space.

A segment / address space belonging to a process or service isn't made available to other processes unless that program explicitly tells the kernel that it wishes to make it available, or unless it leaves its selector in a segment/access register in order to pass it to another program (and in that case it's only good until the program returns from the function call or loads the register with another selector).
linguofreak wrote:But wouldn't it be neat to have hardware features that allow you to move the mechanism that adds stuff to the VFS queue out of the kernel and into userspace in the VFS?
Not really.

It's relatively easy (using lock-free FIFO queues and shared memory) to shift the queues themselves into user-space. However, for it to work in a sane way the kernel's scheduler has to know when to do task/thread switches; and if you have to tell the kernel's scheduler when you're waiting for a message (need to be blocked) or that you've sent a message (so the scheduler can determine if the receiver should be unblocked and if it should pre-empt or run on a different CPU);
Here "sending a message" is making a function call to a library exported by the receiver. If it's a synchronous function call, it just runs as part of your thread and returns. There's no need for a thread corresponding to the receiver to be unblocked because your thread is doing the work, and there's no need for your thread to be blocked while you wait for a "done" message, because the "done" message is when your thread returns from the library.

If it's an asynchronous function call, then your thread, running the receiver's code, puts a request on a queue, either (A) calls the kernel to unblock one of the receiver's threads or (B) sets a user-space wakeup flag (or maybe increments a message count) that the kernel scheduler checks, and returns. Your thread then continues working until it runs out of work, then calls the kernel to block it while it waits for a "done" message.
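
Option (B)'s send side might look something like this (hypothetical names; the point is just that nothing on the send path traps to the kernel):

Code:

#include <stdatomic.h>

/* Hypothetical per-service mailbox in shared memory. The kernel's
 * scheduler checks 'wakeup' the next time it runs, so the sender never
 * needs a system call just to send. */
struct mailbox {
    _Atomic unsigned pending; /* count of queued requests */
    _Atomic int      wakeup;  /* hint for the kernel scheduler */
};

void send_async(struct mailbox *mb)
{
    /* ...the request itself was already placed on the shared queue... */
    atomic_fetch_add_explicit(&mb->pending, 1, memory_order_release);
    atomic_store_explicit(&mb->wakeup, 1, memory_order_release);
    /* Return and keep doing our own work; we only call the kernel later,
     * to block, once we run out of work. */
}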

If the receiver is a device driver, the asynchronous call might look more like this: Your thread, running the driver's code, sends a command to the appropriate device, does some driver paperwork, and returns. It then continues working until it runs out of work and calls the kernel to wait. When the device is done with the request, it raises an interrupt, which vectors to a handler in the driver's address space. The handler does any necessary work, sends a "done" message to your thread (or perhaps sends it to a thread belonging to the driver which then messages your thread when it next runs), and irets.
Brendan wrote:...then you still have to do a kernel API call (and privilege level switching) and haven't actually gained anything by shifting the queues to user-space. What you will do is create the opportunity for user-space to screw things up for no benefit whatsoever.
For the synchronous case you don't end up calling the kernel at all (unless the receiver needs to for its own reasons).

If the system is set up to use option (A) in the "asynchronous non-driver" paragraph, the "done" message (for either the asynchronous non-driver or the asynchronous driver case) is sent by calling the kernel; if it's set up to use option (B), the message is sent by calling a "requestDone(int requestID)" function in a library exported by your process. requestDone() records that the "done" message has arrived and does other necessary paperwork, sets the wakeup flag for your thread for the scheduler, and returns.

I prefer option (B) because it's more consistent with the "all messages are function calls to a library exported by the appropriate process" idea.

In the "asynchronous non-driver, option (B)" case, the only switches to kernel mode made are when your thread calls the kernel to wait (or, if it has enough work, when the timer interrupt starts the scheduler), and when the receivers thread waits or runs out of time itself (assuming no other threads on the system).

In the "asynchronous driver, option (B)" case, if your thread has enough work to do, it may not run out before the interrupt arrives from the device, in which case it may be that no switch to kernel mode is ever made (assuming the hardware is designed so that interrupts can be handled entirely in user space). Your thread finds the "done" message waiting in its queue when the interrupt returns. Actually, it's not going to be quite that rosy, as you wouldn't want to call requestDone() directly from an interrupt handler, seeing as the following code would cause quite a mess, but you get the point:

Code:

void requestDone(int requestID)
{
    for (;;); /* spins forever - called from an interrupt handler, this hangs the CPU */
}
Offloading the actual message sending to a driver thread would leave you with the same number of kernel mode transitions as the non-driver case (that is, waits and timer interrupts).