So, I think you have some misconceptions about what an exokernel is. In the MIT papers, they implement a UNIX interface on top of their exokernel and then simply re-link existing applications. These applications keep the same safety guarantees and often get better performance. They then implement some applications that bypass the traditional all-things-to-all-people abstractions, including the Cheetah webserver, which improves performance by a factor of 8.
Brendan wrote:I'd just send a message to all threads (that have indicated that they want it) when memory is getting low (it's mostly required for micro-kernels when VFS/disk caches are run in user-space). It doesn't require an exo-kernel approach; is probably done by quite a few micro-kernels already, and is also easily done by a monolithic kernel. The only real problem is that it's "non POSIX" (regardless of kernel type).
That is the exokernel approach. That is one way in which those kernels are "exokernely," much like DRM in Linux is "exokernely," and much like Wayland's client-allocated buffers are "exokernely." The idea is closely related to "mechanism, not policy": the trusted, non-replaceable module (in this case, the kernel) securely multiplexes hardware resources, while everything else is relegated to libraries. This way, low-level "/dev/sda" access can still be secure, different apps can safely run different libraries at the same time, and abstractions can be fully or partially bypassed when they're inconvenient.
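To make that split concrete, here's a minimal sketch in C. The two kernel calls (exo_alloc_page, exo_map_page) are names I'm inventing for illustration, not real Xok/Aegis syscalls: the kernel only grants and maps physical pages, and the allocation policy lives entirely in the application's library.

Code:
#include <stddef.h>

/* Hypothetical kernel mechanism: grant a physical page and map it at a
 * caller-chosen virtual address. Names invented for this sketch. */
extern int exo_alloc_page(void);                      /* physical page number, or -1 */
extern int exo_map_page(int ppn, void *va, int prot); /* caller picks the virtual address */

#define PAGE_SIZE 4096
#define HEAP_BASE ((char *)0x40000000)  /* arbitrary user address for the example */

static char *heap_cur = HEAP_BASE;
static char *heap_end = HEAP_BASE;

/* Library policy: a trivial bump allocator. Another application linked
 * against the same two kernel calls could use a completely different
 * allocator, a garbage collector, a pool per subsystem, whatever. */
void *lib_malloc(size_t n)
{
    while (heap_cur + n > heap_end) {
        int ppn = exo_alloc_page();
        if (ppn < 0)
            return NULL;
        if (exo_map_page(ppn, heap_end, /* read|write */ 3) < 0)
            return NULL;
        heap_end += PAGE_SIZE;
    }
    void *p = heap_cur;
    heap_cur += n;
    return p;
}

The only thing the kernel has to get right is ownership and protection of pages; everything above that is replaceable per application.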
Brendan wrote:Effective scheduling requires global knowledge (e.g. "is thread 1 in process 1 more important than both thread 2 in process 2"), changes rapidly (e.g. if you need to task switch to decide which task to switch to then you've got a performance disaster) and is better served by "hints" (e.g. threads that tell you their desired scheduling policy and priority).
These issues are exactly the same in an exokernel as anywhere else! Exokernels don't have the microkernel dogma of pushing everything out of the kernel; they just provide a lower-level interface that multiplexes hardware resources rather than building abstractions on top of them. An exokernel scheduler would thus schedule simpler, lighter-weight entities than processes (say, scheduler activations?), with many traditional features securely implemented in libOSes.
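As a very rough sketch of what that could look like (the upcall hook and its name are my own assumptions, not the actual Aegis/Xok interface): the kernel only hands out and revokes CPU time slices, and the library decides which of its own threads runs within each slice.

Code:
#include <stddef.h>

/* User-level threads: the kernel never sees these. */
struct uthread {
    int runnable;
    int priority;
    /* saved registers, stack pointer, ... */
};

#define NTHREADS 8
static struct uthread threads[NTHREADS];

/* Assumed interface for this sketch only: an upcall delivered at the
 * start of each CPU time slice the process owns, plus a user-level
 * context switch implemented inside the library. */
extern void exo_set_slice_handler(void (*fn)(void));
extern void uthread_switch_to(struct uthread *t);

/* Runs in user space at every slice boundary: plain priority scheduling
 * here, but it could just as well be EDF, gang scheduling, coroutines... */
static void on_slice(void)
{
    struct uthread *best = NULL;
    for (int i = 0; i < NTHREADS; i++)
        if (threads[i].runnable &&
            (best == NULL || threads[i].priority > best->priority))
            best = &threads[i];
    if (best != NULL)
        uthread_switch_to(best);
}

void libsched_init(void)
{
    exo_set_slice_handler(on_slice);
}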
Brendan wrote:Most decent OSs support IO priorities (were you can issue a very low priority "read" to prefetch) and IO cancellation (where you can cancel a low priority read and issue a higher priority read if the data wasn't prefetched before you need it). Of course most of the time read requests only get as far as the VFS and are satisfied from the VFS's global file cache (there's no sane reason for file systems and disk drivers to be involved in a "VFS cache hit"); and not having a global file cache would suck (e.g. per process file cache with "cold cache" every time a process starts and the same data cached by multiple processes; or no file cache at all and only sector caches where you fail to avoid the overhead of file system layer for the "cache hit" case).
Again, there is no reason an exokernel couldn't have IO priorities. There is also no reason for an exokernel not to have a global (unified!) disk block cache; Xok and Aegis do exactly this. The difference is that the kernel only keeps the mapping of disk blocks to pages, while applications are the ones that decide when and where to load those pages. What makes it an exokernel is that the libOS directly asks the kernel for a block rather than reading from a file (it figured out which block by reading the filesystem metadata, which it also requested on its own).
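Roughly, the read path could look like this (call names are invented for the sketch; the real Xok interface differs in detail): the libFS walks its own on-disk metadata to find the block number, then asks the kernel either for the cached page holding that block or for the block to be read in.

Code:
#include <stdint.h>
#include <stddef.h>

/* Hypothetical kernel calls; not the literal Xok API. The kernel owns
 * one global block-to-page mapping (the unified cache) and has no idea
 * which file, if any, a given block belongs to. */
extern void *exo_lookup_block(uint32_t blkno);                /* NULL if not cached */
extern int   exo_read_block(uint32_t blkno, void *dest_page); /* read + insert into cache */
extern void *exo_alloc_cache_page(void);                      /* page to read into */

/* libFS side: it already parsed the inode, so it knows the block number. */
void *libfs_get_block(uint32_t blkno)
{
    void *page = exo_lookup_block(blkno);   /* hit in the global, shared cache? */
    if (page != NULL)
        return page;

    page = exo_alloc_cache_page();          /* the libFS decides when/where to load */
    if (page == NULL || exo_read_block(blkno, page) < 0)
        return NULL;
    return page;
}

Because the cache is keyed by disk block rather than by file, two processes running completely different libFSes still share cached blocks instead of each keeping a cold private copy.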
Brendan wrote:If you find that the exo-kernel is inefficient (e.g. you want to bypass the "meta-file system in kernel" thing in Xok) then you're still stuck with doing kernel patches and/or using lower level "/dev/sda" style access. Also note that patching a (monolithic) kernel is more likely to benefit other processes (even processes that were written 10 years ago and nobody has touched since) than to cause design conflicts and/or slower kernels.
No, if you want to bypass the file system, you just allocate yourself some disk blocks and don't put them in any file system. You'd need an on-disk structure to track that allocation, and the "meta-file system" is what lets the kernel understand it, but after that you've got no more overhead than straight "/dev/sda" (this could still work with a "raw disk" fs kmod). Yes, performance patches to monolithic kernels benefit everyone, but so do performance patches to libOSes (thanks to dynamic linking). Feature patches, however, push kernel abstractions further toward the all-things-to-all-people problem. With libOSes, on the other hand, an application using those features doesn't have to go through a one-kernel-to-rule-them-all that is also trying to handle every feature that application isn't using.
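A sketch of that no-file-system path, again with invented call names: grab some raw blocks from the kernel's allocator, record them in your own tiny on-disk structure (the thing the meta-file-system machinery would be taught to understand), and from then on do plain block I/O with nothing else in the way.

Code:
#include <stdint.h>

/* Invented kernel calls for illustration: the kernel only tracks which
 * principal owns each disk block. */
extern int exo_alloc_block(uint32_t placement_hint);          /* block number, or -1 */
extern int exo_write_block(uint32_t blkno, const void *buf);  /* whole-block write */

#define NBLOCKS 127   /* sized so the table fills one 512-byte block */

/* Our own on-disk bookkeeping: one block listing the blocks we own. */
struct extent_table {
    uint32_t count;
    uint32_t blocks[NBLOCKS];
};

int setup_raw_area(uint32_t table_blkno)
{
    struct extent_table tbl = { 0 };

    for (int i = 0; i < NBLOCKS; i++) {
        int b = exo_alloc_block(/* placement_hint = */ 0);
        if (b < 0)
            break;
        tbl.blocks[tbl.count++] = (uint32_t)b;
    }
    /* Persist the allocation record; after this, using the blocks is just
     * "read/write block N" with no file-system layer involved at all. */
    return exo_write_block(table_blkno, &tbl);
}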
Brendan wrote:Most of the file system's code is maintaining file/directory information and figuring out which blocks correspond to which files/directories. It doesn't matter if your kernel is only using part of the file/directory information (e.g. the permissions and not the other metadata) you're still doing a large amount of the work that a file system has to do.
File system logic is not duplicated; it is moved. The libFS does everything it can until it needs to ask the kernel for permission, and then the kernel uses libFS-provided, deterministic bytecode functions to enforce access control. To create a file in a directory, for example, the libFS chooses a free disk block (using a kernel-managed free list) in an ideal location and asks the kernel to modify the directory metadata block accordingly. The kernel then runs the deterministic function to check that the requested modification does indeed allocate the requested block, and performs the modification. The libFS then maps in the new block, writes to it, and eventually decides when to write back both the data block and the metadata block. This is all described straightforwardly in the Xok paper...
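Sketched out, with invented names (the real Xok interface and its deterministic-function machinery differ in detail), that create path might look like this: the untrusted libFS proposes the exact metadata edit, and the kernel only commits it after the FS-supplied check approves it.

Code:
#include <stdint.h>

/* Invented interface for this sketch; not the literal Xok API. */

/* The libFS's proposal: "apply this edit to metadata block meta_blkno,
 * which should end up allocating new_blkno to me". */
struct meta_edit {
    uint32_t meta_blkno;   /* directory metadata block to modify         */
    uint32_t new_blkno;    /* free block the libFS picked for the file   */
    uint32_t offset;       /* where in the metadata block the edit lands */
    uint32_t len;
    uint8_t  bytes[64];    /* the new contents (a directory entry, etc.) */
};

/* Hypothetical kernel calls. exo_apply_meta_edit runs the libFS-supplied
 * deterministic function over the proposed edit before committing it. */
extern int exo_pick_free_block(uint32_t placement_hint);
extern int exo_apply_meta_edit(const struct meta_edit *e);

/* libFS-internal helper (assumed): encode a directory entry into buf. */
extern uint32_t build_dirent(uint8_t *buf, const char *name, uint32_t blkno);

int libfs_create(uint32_t dir_meta_blkno, const char *name)
{
    struct meta_edit e = { 0 };

    /* 1. libFS picks a free block, ideally near the directory. */
    int blk = exo_pick_free_block(dir_meta_blkno);
    if (blk < 0)
        return -1;

    /* 2. libFS builds the exact directory modification it wants. */
    e.meta_blkno = dir_meta_blkno;
    e.new_blkno  = (uint32_t)blk;
    e.offset     = 0;   /* free slot the libFS found; hardcoded here */
    e.len        = build_dirent(e.bytes, name, (uint32_t)blk);

    /* 3. The kernel checks, via the deterministic function, that the edit
     * really does allocate new_blkno and nothing it shouldn't, then applies it. */
    if (exo_apply_meta_edit(&e) < 0)
        return -1;

    /* 4. From here the libFS maps the new block, writes the file data, and
     * decides on its own when both blocks get written back to disk. */
    return 0;
}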
Brendan wrote:Agreed. Of course "trusted loadable modules" is what most micro-kernels and monolithic kernels use to support file systems. I'd say the single largest problem here is the design of old interfaces - e.g. processes not being able to use "standard hints" to effect things like disk block placement, IO priorities, etc. For a simple example, to open a file for writing you should probably have to provide an "expected resulting file size", plus some flags (if you expect the file to be appended to later, if the file will be read often, if the file is unlikely to be modified after its created, etc); so that the file system code can make better decisions about things like disk placement.
The difference is that rather than supporting a "crusty old" all-things-to-all-people API, or having to create a new all-things-to-all-people API with hints for all possible eventualities, the exokernel just lets the libFS make all those decisions. Then when you see a need to change the interface, you just use a different version of the libFS, rather than making a breaking change to the kernel/userspace API. With proper library versioning, you could have applications linked against different versions of different libFSes, with no overhead from the kernel supporting multiple interfaces, no "minimum kernel version" requirements, and no need to trust new programs that use these new APIs.
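For example, nothing stops one libFS version from exposing exactly the kind of hinted interface Brendan describes, as a pure library API the kernel never sees; a hypothetical sketch:

Code:
#include <stdint.h>

/* Hypothetical libFS-level API: because it's only a library interface, a
 * new version can add or change hints without touching the kernel. */
struct open_hints {
    uint64_t expected_size;    /* for initial block placement          */
    unsigned will_append : 1;  /* leave slack after the last extent    */
    unsigned read_mostly : 1;  /* place near other hot data            */
    unsigned write_once  : 1;  /* allow tight packing                  */
};

struct libfs_file;  /* opaque handle managed by the libFS */

/* v2 of the interface takes hints... */
struct libfs_file *libfs_create2(const char *path, const struct open_hints *h);

/* ...while an application linked against v1 keeps calling this wrapper,
 * which just supplies defaults. Both versions coexist as libraries; the
 * kernel only ever sees block allocations and protected metadata edits. */
static inline struct libfs_file *libfs_create1(const char *path)
{
    static const struct open_hints defaults = { 0 };
    return libfs_create2(path, &defaults);
}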
Brendan wrote:Actually; that's probably the single largest benefit of exo-kernels - avoiding problems caused by "crusty old APIs".
Exactly this. You avoid APIs that try to be all things to all people, supporting every possible use case from high-performance servers to games to productivity software, while keeping all the safety and isolation guarantees of a regular monolithic kernel.