L4's hidden secret

Posted: Wed Dec 24, 2008 1:16 pm
by kiwipresse
Hi,

I'm just wondering why L4 is considered to be so efficient compared to other microkernels? I read the Wikipedia article, which suggests that they use some fancy IPC and optimize here and there. Does somebody know something more concrete before I dig into the sources :?

Re: L4's hidden secret

Posted: Wed Dec 24, 2008 7:32 pm
by Brendan
Hi,
kiwipresse wrote:I'm just wondering why L4 is considered to be so efficient compared to other microkernels? I read the Wikipedia article, which suggests that they use some fancy IPC and optimize here and there. Does somebody know something more concrete before I dig into the sources :?
There are at least two techniques that L4 uses that I'm aware of...

The first one is called "small address spaces". The idea here is to use segmentation to split a large address space into a medium address space and several small address spaces, so you can switch from one process to another by changing segment registers if both processes are in the same large address space. This reduces TLB misses because there's no need to flush any TLB entries in that case. For example, the large 4 GiB address space might be split into one 2 GiB medium address space and eight 128 MiB small address spaces (with the remaining 1 GiB used by the kernel). Also, if a process running in a small address space grows too big, it'll be shifted into its own medium address space, so that processes aren't limited to the size of a small address space.
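
For example, here's a rough sketch of what the per-small-space segment descriptors could look like on 32-bit x86 (the names, layout and helper below are mine, not taken from L4's sources):

Code:
/* Rough sketch of "small address spaces" on 32-bit x86 - my own names and
 * layout, not taken from L4.  Eight 128 MiB small spaces are packed between
 * 2 GiB and 3 GiB, and each process gets segment descriptors whose base/limit
 * expose only its own window.  Switching between two processes that share the
 * large address space then means reloading segment registers instead of CR3,
 * so no TLB entries get flushed. */

#include <stdint.h>

#define GiB                  (1024u * 1024u * 1024u)
#define SMALL_SPACE_SIZE     (128u * 1024u * 1024u)              /* 128 MiB */
#define SMALL_SPACE_BASE(n)  (2u * GiB + (n) * SMALL_SPACE_SIZE)

struct gdt_entry {                /* standard 8-byte x86 segment descriptor */
    uint16_t limit_low, base_low;
    uint8_t  base_mid, access, flags_limit_high, base_high;
};

/* Fill in one descriptor that covers small space 'n' and nothing else. */
void make_small_space_descriptor(struct gdt_entry *e, unsigned n, uint8_t access)
{
    uint32_t base  = SMALL_SPACE_BASE(n);
    uint32_t limit = (SMALL_SPACE_SIZE >> 12) - 1;       /* 4 KiB granularity */

    e->limit_low        = limit & 0xFFFF;
    e->base_low         = base  & 0xFFFF;
    e->base_mid         = (base >> 16) & 0xFF;
    e->access           = access;                        /* e.g. 0xFA code, 0xF2 data */
    e->flags_limit_high = 0xC0 | ((limit >> 16) & 0x0F); /* 4 KiB gran., 32-bit */
    e->base_high        = (base >> 24) & 0xFF;
}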

The next one is the way they do messaging - they use synchronous/rendezvous messaging where the message data is passed in registers, so that sending a message from one process to another involves switching address spaces and returning to the new process and almost nothing else (no need to save/load registers, no need to store message data anywhere, etc).
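
As a rough illustration (the trap number and register convention below are made up, not L4's real ABI), the user-side stub for a register-only IPC could look something like this:

Code:
/* Hypothetical sketch of "message data in registers": the caller loads a few
 * machine words into registers, traps into the kernel, and the kernel returns
 * straight into the receiver - the payload never touches memory. */

#include <stdint.h>

struct ipc_result { uint32_t w0, w1, w2; };   /* reply comes back in registers too */

static inline struct ipc_result ipc_call(uint32_t dest_tid,
                                         uint32_t w0, uint32_t w1, uint32_t w2)
{
    struct ipc_result r;

    /* Made-up trap gate 0x30: destination in EAX, payload in EBX/ESI/EDI. */
    __asm__ volatile ("int $0x30"
                      : "=a"(r.w0), "=b"(r.w1), "=S"(r.w2)
                      : "a"(dest_tid), "b"(w0), "S"(w1), "D"(w2)
                      : "memory", "cc");
    return r;
}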

If you combine both of these things, sending a message from one process to another may involve a CPL=3 -> CPL=0 switch, segment register reloads, and a CPL=0 -> CPL=3 switch and almost nothing else, which is what makes L4's "best case messaging" extremely fast; and benchmarks for this special case are what they use as the basis for claiming that L4 is extremely fast.

However, I'm skeptical. Best case/special case micro-benchmarks usually have little effect on overall real world performance. They don't mention L4's "worst case messaging" (e.g. where the processes are in different large address spaces and the message data is too large to fit in registers). They also don't implement/include certain features that increase process switch times (e.g. accounting for the time used by each process). They also tend not to consider SMP systems, where rendezvous messaging (which really does behave like an inter-process function call) often means that only one CPU does all the work while many more CPUs are idle (it has poor scalability and it's hard to optimize for NUMA). Basically, what I'm suggesting is that claims that L4 is extremely fast may be based on "best case" micro-benchmarks, and that the performance you actually get in practice (e.g. if you compared the overall performance of applications running on a modern multi-core computer on different OSs) might be very different.



Cheers,

Brendan

Re: L4's hidden secret

Posted: Thu Dec 25, 2008 9:04 am
by Colonel Kernel
Also, fast IPC does not guarantee good performance. Even fast IPCs really add up over time if there are enough of them.

Re: L4's hidden secret

Posted: Fri Dec 26, 2008 2:12 pm
by Love4Boobies
Has anyone bothered to check? I know I didn't...

Re: L4's hidden secret

Posted: Fri Dec 26, 2008 5:24 pm
by Combuster
Colonel Kernel wrote:Even fast IPCs really add up over time if there are enough of them.
It's not that macrokernels have no overhead. Instead of communicating with another process via IPC, you communicate via a write() syscall of some sort. Lots of IPC in a microkernel tends to imply a corresponding number of system calls to a certain driver when run on your favorite *nix.
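
Roughly, the two call shapes look like this (the ipc_send()/DISK_WRITE names below are made up for the sketch, not any real microkernel's API):

Code:
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Monolithic *nix: the "message to the driver" is a write() syscall,
 * one user->kernel->user round trip. */
ssize_t monolithic_disk_write(int disk_fd, const void *buf, size_t len)
{
    return write(disk_fd, buf, len);
}

/* Microkernel: the same request becomes one IPC to the driver process,
 * one user->kernel->user round trip plus a switch to the driver's space. */
struct disk_msg { uint32_t op; const void *buf; size_t len; };
#define DISK_WRITE 1
extern int ipc_send(int driver_tid, const struct disk_msg *m);   /* hypothetical */

int micro_disk_write(int driver_tid, const void *buf, size_t len)
{
    struct disk_msg m = { DISK_WRITE, buf, len };
    return ipc_send(driver_tid, &m);
}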

Re: L4's hidden secret

Posted: Fri Dec 26, 2008 11:39 pm
by Brendan
Hi,
Combuster wrote:It's not that macrokernels have no overhead. Instead of communicating with another process via IPC, you communicate via a write() syscall of some sort. Lots of IPC in a microkernel tends to imply a corresponding number of system calls to a certain driver when run on your favorite *nix.
Yes, but I think for some micro-kernels part of the problem is "excessively reduced kernel scope". For example, imagine implementing the linear memory manager and the physical memory manager as 2 separate processes, so that instead of doing a syscall and a few near calls to allocate or free some RAM you end up with at least 4 syscalls and 4 task switches ("caller -> kernel -> linear memory manager -> kernel -> physical memory manager -> kernel -> linear memory manager -> kernel -> user"). In this case you can easily add 5000 cycles of IPC overhead to an operation that could have cost only about 500 cycles; and while you might gain some theoretical advantage from the extra flexibility/modularity, in practice no software really benefits from these theoretical advantages.
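
As a sketch of what that difference looks like (the helper names below are mine, and the cycle numbers are just the ballpark figures above):

Code:
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-kernel helpers standing in for "a few near calls". */
extern void     *linear_mm_reserve(size_t pages);
extern uintptr_t physical_mm_alloc_frame(void);
extern void      map_page(void *virt, uintptr_t frame);

/* In-kernel path: one syscall from the caller, then near calls only -
 * on the order of 500 cycles. */
void *sys_alloc_pages(size_t pages)
{
    char *virt = linear_mm_reserve(pages);
    for (size_t i = 0; i < pages; i++)
        map_page(virt + i * 4096, physical_mm_alloc_frame());
    return virt;
}

/* With the linear and physical memory managers as separate processes, the
 * same request becomes at least 4 syscalls and 4 task switches:
 *   caller -> kernel -> linear MM -> kernel -> physical MM
 *          -> kernel -> linear MM -> kernel -> caller
 * i.e. thousands of cycles of IPC overhead wrapped around the same work. */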

The other problem (IMHO) is that a lot of micro-kernels rely on synchronous/rendezvous messaging, where every message sent costs an immediate task switch. With asynchronous/buffered messaging, the messages can be queued and task switches can be avoided. For example, for synchronous/rendezvous messaging if you send 10 requests to a server and get 10 replies back then you need 20 task switches; but for asynchronous/buffered messaging you can send 10 requests, do one task switch to the server, the server can send 10 replies, and then you do another task switch back to the client who receives the 10 replies (20 task switches vs. 2 task switches); and on multi-CPU systems you can have the client and server running on different CPUs and end up with no task switches at all.
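
A minimal sketch of the "20 task switches vs. 2" arithmetic (the queue/yield primitives below are hypothetical):

Code:
#include <stddef.h>

#define NREQ 10
struct msg { int opcode; int arg; };

extern void queue_push(int dest_tid, const struct msg *m); /* buffered send, no switch */
extern void yield_to(int tid);                             /* one task switch          */
extern int  queue_pop(int my_tid, struct msg *out);        /* drain own mailbox        */
extern void handle_reply(const struct msg *m);

void client(int my_tid, int server_tid)
{
    struct msg req = { 0, 0 }, reply;

    /* 10 sends, zero task switches - the requests just pile up in a FIFO. */
    for (int i = 0; i < NREQ; i++) {
        req.arg = i;
        queue_push(server_tid, &req);
    }

    /* Task switch #1: the server drains its queue, queues 10 replies, and
     * yields back to us (task switch #2).  Rendezvous messaging would have
     * needed 20 switches for the same exchange. */
    yield_to(server_tid);

    while (queue_pop(my_tid, &reply))
        handle_reply(&reply);
}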

For monolithic kernels, the IPC tends to be similar to asynchronous/buffered messaging (pipes, file I/O, sockets, etc, all implemented with FIFO buffers).

Synchronous/rendezvous messaging is easier to use because you can hide the messaging in library functions, and a standard C library function (e.g. "fopen()") looks like a standard C library function should. Asynchronous/buffered messaging can't really do this, and some standard C library functions can't be implemented easily (e.g. "fopen()" becomes "request_fopen()" plus either a callback or code shifted into a message handling loop). For a micro-kernel, this basically means you've got a choice - have poor performance and make it easy to port existing software to your OS, or have better performance and make it very difficult to port existing software to your OS. Note: there are ways to make asynchronous/buffered messaging simulate synchronous/rendezvous messaging using client-side buffering and callbacks, etc.; but this removes all the performance advantages of asynchronous/buffered messaging and adds some extra overhead of its own.
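
For example (all of the asynchronous names below are hypothetical, just to show the shape of the API):

Code:
#include <stdio.h>

/* Synchronous flavour: the blocking IPC hides inside fopen(), so the standard
 * C library keeps its normal shape. */
void sync_style(void)
{
    FILE *f = fopen("config.txt", "r");
    /* ...use f right here... */
    if (f) fclose(f);
}

/* Asynchronous flavour: issue the request now, get the answer later via a
 * callback (or in the program's message handling loop). */
typedef void (*fopen_cb)(FILE *f, void *ctx);
extern int request_fopen(const char *path, const char *mode,
                         fopen_cb done, void *ctx);

static void on_open(FILE *f, void *ctx)
{
    (void)ctx;
    /* ...everything sync_style() did after fopen() has to move here... */
    if (f) fclose(f);
}

void async_style(void)
{
    request_fopen("config.txt", "r", on_open, NULL);
    /* control returns immediately; the reply arrives later */
}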

Of course for most existing micro-kernel projects (L4, Mach, Minix, etc) the first thing the developers do is slap a Unix compatibility layer on top, so that everyone can port normal software (designed for monolithic systems) to it and compare it to existing monolithic OSs, and decide that it's slow. That's just plain stupid IMHO - it'd be better to design the micro-kernel without caring about "application portability", then write applications specifically designed for the micro-kernel, then port these applications to a monolithic OS and then claim that the monolithic OS is slow... ;)


Cheers,

Brendan

Re: L4's hidden secret

Posted: Sat Dec 27, 2008 9:00 am
by Love4Boobies
Brendan wrote:then port these applications to a monolithic OS and then claim that the monolithic OS is slow...
You mean "microkernel OS is slow" :P

Re: L4's hidden secret

Posted: Sat Dec 27, 2008 10:35 am
by Brendan
Hi,
Love4Boobies wrote:
Brendan wrote:then port these applications to a monolithic OS and then claim that the monolithic OS is slow...
You mean "microkernel OS is slow" :P
If an expert made a saddle for a camel and then put that saddle on an elephant, would it be a surprise if it didn't fit well? If the same expert made a saddle for an elephant and put that saddle on a camel, it still wouldn't fit right. Obviously you'd want to use the saddle for a camel on a camel, and the saddle for an elephant on an elephant. Anything else would be completely stupid.

If an expert made an application for a monolithic kernel and then put that application on a micro-kernel, it wouldn't fit well. If the same expert made an application for a micro-kernel and put that application on a monolithic kernel, it wouldn't fit well either. Obviously you'd want to use an application designed for a monolithic kernel on a monolithic kernel, and an application designed for a micro-kernel on a micro-kernel. Anything else would be completely stupid.

If you try to port an application designed for a micro-kernel over to a monolithic kernel, you'd need to emulate the original OS's messaging/IPC using pipes or sockets or something (and emulate/hide any other differences under some sort of abstraction layer), and the overheads are likely to be far worse. I've actually tried this - using datagram socket operations to emulate asynchronous messaging on Linux, with separate threads used as interfaces to things like the file system and user interface to make synchronous operations behave like asynchronous ones. The end result was extremely ugly, and performance was poor because things that were meant to be asynchronous (i.e. done in the background) actually weren't.
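
Something along these lines (a reconstruction for illustration, not the code I actually wrote) - an AF_UNIX datagram socket pair standing in for an asynchronous message queue between two threads:

Code:
/* Reconstruction for illustration only.  Build with -pthread. */

#include <pthread.h>
#include <sys/socket.h>

struct msg { int opcode; char payload[56]; };

static int chan[2];     /* chan[0] = "client" end, chan[1] = "server" end */

static void *server_thread(void *arg)
{
    struct msg m;
    (void)arg;
    while (recv(chan[1], &m, sizeof m, 0) == (ssize_t)sizeof m)
        send(chan[1], &m, sizeof m, 0);   /* handle request, "reply" with a datagram */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    struct msg m = { 1, "hello" };

    socketpair(AF_UNIX, SOCK_DGRAM, 0, chan);
    pthread_create(&tid, NULL, server_thread, NULL);

    send(chan[0], &m, sizeof m, 0);   /* the send is "asynchronous"...             */
    recv(chan[0], &m, sizeof m, 0);   /* ...but the reply is awaited synchronously,
                                         which is exactly the problem               */
    return 0;
}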


Cheers,

Brendan

Re: L4's hidden secret

Posted: Sun Dec 28, 2008 7:55 am
by Love4Boobies
I'm sorry, I didn't read your last sentence right and thought you meant something else #-o . Yeah, I agree. Oh, and btw, I didn't read your last message at all :lol: A lot of text and the word "saddle" is all I saw but I can guess what it was all about :wink:

Oh, and another thing --- L4 couldn't be a 64-bit kernel on x86_64 systems, right? Since it uses segmentation, I guess a different microkernel approach is needed...

Re: L4's hidden secret

Posted: Sun Dec 28, 2008 8:02 am
by Brendan
Hi,
Love4Boobies wrote:Oh, and another thing --- L4 couldn't be a 64-bit kernel on x86_64 systems, right? Since it uses segmentation, I guess a different microkernel approach is needed...
There's no reason a 64-bit kernel can't run *lots* of 32-bit processes in the same (huge) address space... ;)

However, each 64-bit process would need its own address space, because in long mode segmentation is only supported for 16-bit and 32-bit code.


Cheers,

Brendan

Re: L4's hidden secret

Posted: Sun Dec 28, 2008 8:04 am
by Love4Boobies
Managed code sounds like a good idea in this case.