
Re: Is it Microkernel?

Posted: Mon Apr 19, 2010 2:35 am
by Love4Boobies
It would make sense to compare a synchronous monolithic kernel with a synchronous microkernel. It's indeed a poor idea to have a synchronous microkernel... unless you use a managed design (ofc, no hardware protection domains), where IPC is even faster than that of a traditional monolithic kernel.

Re: Is it Microkernel?

Posted: Mon Apr 19, 2010 4:30 am
by Owen
Love4Boobies wrote:It would make sense to compare a synchronous monolithic kernel with a synchronous microkernel. It's indeed a poor idea to have a synchronous microkernel... unless you use a managed design (ofc, no hardware protection domains), where IPC is even faster than that of a traditional monolithic kernel.
The fastest microkernels are all purely synchronous native-code designs.

As for IPC in a monolithic kernel... it tends to be slow for the same reasons Mach was slow: too many checks.

Re: Is it Microkernel?

Posted: Mon Apr 19, 2010 4:36 am
by Love4Boobies
Owen wrote:The fastest microkernels are all purely synchronous native-code designs.
Not really. Synchronous (in a non-bytecoded microkernel design) means you have to do a lot of work on every system call...
As for IPC in a monolithic kernel... it tends to be slow for the same reasons Mach was slow: too many checks.
Mach is a microkernel, so it's neither a fair nor an accurate comparison.

Re: Is it Microkernel?

Posted: Mon Apr 19, 2010 5:24 am
by gerryg400
Love4Boobies,

What does synchronous mean in this sense? Are you talking about synchronous message passing? Or something else?

- gerryg400

Re: Is it Microkernel?

Posted: Mon Apr 19, 2010 10:33 am
by NickJohnson
I think he means synchronous as in the system call is handled immediately instead of being (potentially) put in a message box for the kernel to handle later. The second "asynchronous" option is only possible if all system calls are actually messages to the kernel, like in MINIX for example.
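
Roughly, the difference looks like this - a made-up C sketch, not MINIX's actual interface (syscall_sync(), syscall_async() and the mailbox are all invented names):

    /* Made-up illustration only - not any real kernel's interface. */
    #include <stdio.h>

    typedef struct { int type; long arg; } msg_t;

    static msg_t mailbox[64];            /* kernel-side message box */
    static int   q_head, q_tail;

    static long handle(int type, long arg) { return arg + type; /* the actual work */ }

    /* Synchronous: the work happens inside the trap; the result comes back now. */
    long syscall_sync(int type, long arg) {
        return handle(type, arg);
    }

    /* Asynchronous: the trap only enqueues a message; the kernel (or a server)
     * drains the mailbox later and replies with another message. */
    void syscall_async(int type, long arg) {
        mailbox[q_tail++ % 64] = (msg_t){ type, arg };
    }

    void kernel_drain(void) {
        while (q_head != q_tail) {
            msg_t m = mailbox[q_head++ % 64];
            printf("async reply: %ld\n", handle(m.type, m.arg));
        }
    }

    int main(void) {
        printf("sync reply: %ld\n", syscall_sync(1, 41));
        syscall_async(1, 41);
        kernel_drain();
        return 0;
    }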

Re: Is it Microkernel?

Posted: Mon Apr 19, 2010 10:37 am
by Owen
Love4Boobies wrote:
Owen wrote:The fastest microkernels are all purely synchronous native-code designs.
Not really. Synchronous (in a non-bytecoded microkernel design) means you have to do a lot of work on every system call...
Please give your reasoning here. In L4, an IPC results in the message registers being transferred to the receiver and then a context switch. Nothing more. If the receiver is not expecting a message [from this process], then the sender blocks.

The first is the common case and not particularly expensive.
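
Roughly, that fast path looks like this - a toy C sketch of the idea, not actual L4 code; tcb_t and ipc_send() are invented for illustration:

    /* Toy sketch of the idea only - not actual L4 code; all names are invented. */
    #include <stdio.h>
    #include <string.h>

    enum state { RUNNING, RECV_WAITING, SEND_BLOCKED };

    typedef struct {
        enum state state;
        long       mr[8];               /* virtual "message registers" */
    } tcb_t;

    /* Fast path: if the receiver is already waiting, copy the message registers
     * and switch to it; otherwise the sender blocks until the receiver is ready. */
    void ipc_send(tcb_t *sender, tcb_t *receiver) {
        if (receiver->state == RECV_WAITING) {
            memcpy(receiver->mr, sender->mr, sizeof sender->mr);
            receiver->state = RUNNING;       /* context switch to the receiver  */
            sender->state   = RECV_WAITING;  /* e.g. a call: now wait for reply */
        } else {
            sender->state = SEND_BLOCKED;    /* slow path: enqueue and wait     */
        }
    }

    int main(void) {
        tcb_t client = { RUNNING,      { 42 } };
        tcb_t server = { RECV_WAITING, { 0 } };
        ipc_send(&client, &server);
        printf("server received %ld\n", server.mr[0]);
        return 0;
    }
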
As for IPC in a monolithic kernel... it tends to be slow for the same reasons Mach was slow: too many checks.
Mach is a microkernel, so it's neither a fair nor an accurate comparison.
You said "where IPC is even faster than that of a traditional monolithic kernel." - but no monolithic kernel I know is renowned for its IPC performance!

Re: Is it Microkernel?

Posted: Mon Apr 19, 2010 11:41 am
by Combuster
The point is that in an asynchronous system you can handle messages in batches - even if you have a system call for each message, you don't have a context switch for each message (which costs more than a trap alone)
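
A toy sketch of that point (invented names, no real kernel API): many cheap post() calls accumulate messages, and the receiver pays for a single switch when it finally drains them.

    /* Illustrative only - shows how batching amortizes the switch cost. */
    #include <stdio.h>

    #define QSIZE 128

    typedef struct { int sender; long payload; } msg_t;

    static msg_t inbox[QSIZE];
    static int head, tail;

    /* Called from many send() system calls: cheap, no context switch yet. */
    int post(int sender, long payload) {
        if ((tail + 1) % QSIZE == head) return -1;    /* queue full */
        inbox[tail] = (msg_t){ sender, payload };
        tail = (tail + 1) % QSIZE;
        return 0;
    }

    /* Called once the receiver is finally scheduled: one switch, many messages. */
    void drain(void) {
        while (head != tail) {
            msg_t m = inbox[head];
            head = (head + 1) % QSIZE;
            printf("message from %d: %ld\n", m.sender, m.payload);
        }
    }

    int main(void) {
        for (int i = 0; i < 5; i++) post(i, i * 10);
        drain();                                      /* the single "switch" */
        return 0;
    }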

Re: Is it Microkernel?

Posted: Mon Apr 19, 2010 11:46 am
by Love4Boobies
Owen wrote:
Love4Boobies wrote:
Owen wrote:The fastest microkernels are all purely synchronous native-code designs.
Not really. Synchronous (in a non-bytecoded microkernel design) means you have to do a lot of work on every system call...
Please give your reasoning here. In L4, an IPC results in the message registers being transferred to the receiver and then a context switch. Nothing more. If the receiver is not expecting a message [from this process], then the sender blocks.

The first is the common case and not particularly expensive.
The general reasoning is given by Combuster in the previous post. L4 can handle IPC in a variety of ways; what you mention is a special case - but don't be fooled. IPC via registers is very fast, but you can only send so much data at once. What's the gain if you can have very fast IPC but need to do it 10,000 times for each message you want to send?
As for IPC in a monolithic kernel... it tends to be slow for the same reasons Mach was slow: too many checks.
Mach is a microkernel, so it's neither a fair nor an accurate comparison.
You said "where IPC is even faster than that of a traditional monolithic kernel." - but no monolithic kernel I know is renowned for its IPC performance!
No one cares about it because IPC behaves much better (and is needed a whole lot less) in monolithic systems. IPC is usually a bottleneck of microkernels and that's where it actually becomes a problem.

Re: Is it Microkernel?

Posted: Mon Apr 19, 2010 12:43 pm
by Owen
Love4Boobies wrote: The general reasoning is given by Combuster in the previous post. L4 can handle IPC in a variety of ways; what you mention is a special case - but don't be fooled. IPC via registers is very fast, but you can only send so much data at once. What's the gain if you can have very fast IPC but need to do it 10,000 times for each message you want to send?
And you fail to understand that L4's IPC is for delivering notifications. You wouldn't send a single message in 10k calls - you would write it to a shared memory page and then deliver a notification.

Arguments that "You need to do it 10,000 times" are flawed because they assume idiot usage.
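
Sketched with invented names - the "shared page" is just a static buffer here, and a real system would map one page into both address spaces:

    /* Sketch of bulk data in shared memory plus a small notification.
     * Invented names; a real system would map one page into both processes. */
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    static char shared_page[PAGE_SIZE];       /* stand-in for a mapped page */

    /* The only thing that crosses the kernel: a tiny notification message. */
    void notify(int receiver, size_t length) {
        printf("IPC to %d: %zu bytes ready in the shared page\n", receiver, length);
    }

    void send_bulk(int receiver, const void *data, size_t length) {
        memcpy(shared_page, data, length);    /* no kernel involvement     */
        notify(receiver, length);             /* one short IPC, not 10,000 */
    }

    int main(void) {
        char big[1000];
        memset(big, 'x', sizeof big);
        send_bulk(7, big, sizeof big);
        return 0;
    }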

As the benchmark Brendan quoted said - "under AIM benchmark, L4 Linux is 8% slower than native linux on average". Consider this - an 8% performance decrease for doing things in a way completely unoptimized for the kernel. Native code microkernels can perform very well.

The general argument against native microkernels seems to be that "You waste lots of CPU time doing message passing". To which I have to ask: How often is a process *both* IO and CPU bound? Very rarely. That overhead doesn't matter much in practice (as long as it's not horrid, like Mach's), but there is the other question: Will a managed code OS's JIT ever produce assembly as well optimized as a human's?

If ahead-of-time compilers, with years of development and free rein of the CPU to run optimizers, can't - then what hope does a JIT have? And we are not talking trivial speed-ups here. We are talking optimizations that can sometimes cut a program's running time to a quarter.

Re: Is it Microkernel?

Posted: Mon Apr 19, 2010 1:12 pm
by Love4Boobies
Owen wrote:
Love4Boobies wrote: The general reasoning is given by Combuster in the previous post. L4 can handle IPC in a variety of ways; what you mention is a special case - but don't be fooled. IPC via registers is very fast, but you can only send so much data at once. What's the gain if you can have very fast IPC but need to do it 10,000 times for each message you want to send?
And you fail to understand that L4's IPC is for delivering notifications. You wouldn't send a single message in 10k calls - you would write it to a shared memory page and then deliver a notification.

Arguments that "You need to do it 10,000 times" are flawed because they assume idiot usage.
Indeed. Too bad there's nothing worse than shared memory. Well, okay, register passing, you've got me there. :lol:
As the benchmark Brendan quoted said - "under AIM benchmark, L4 Linux is 8% slower than native linux on average". Consider this - an 8% performance decrease for doing things in a way completely unoptimized for the kernel. Native code microkernels can perform very well.
They can indeed perform well - usually not as well as monolithic kernels, though. Brendan and I have certain disagreements on how a proper microkernel should be implemented: he is for asynchronous message passing with a proper API on top, while I'm for a synchronous managed design. But if I had to rely on hardware protection instead of software, I'd have to agree with him.
The general argument against native microkernels seems to be that "You waste lots of CPU time doing message passing". To which I have to ask: How often is a process *both* IO and CPU bound? Very rarely. That overhead doesn't matter much in practice (as long as it's not horrid, like Mach's), but there is the other question: Will a managed code OS's JIT ever produce assembly as well optimized as a human's?
I can see no reason why they couldn't - in fact, they could in theory produce better code than humans, because the code is optimized dynamically. Two common examples are processor/model-specific optimizations and proper register allocation across shared libraries.
If ahead-of-time compilers, with years of development and free rein of the CPU to run optimizers, can't - then what hope does a JIT have? And we are not talking trivial speed-ups here. We are talking optimizations that can sometimes cut a program's running time to a quarter.
The JIT strategy in fact produces much better code than AOT compilers; the only problem is the overhead. The question to be asked is: Considering the overhead introduced by the JIT + runtime (e.g., bounds checking, garbage collection), is a managed design able to compete with a true microkernel? The benchmarks we have are pretty promising (and I'm not just talking about numbers in research papers). Also, don't forget that JIT compilation is not an infinite process - once the compilation is done, the code runs very fast. And everything you JIT (completely or not) can be cached locally on disk for future use.

Bytecoded designs have two additional advantages: portability across CPU architectures (heck, they can even work on MMU-less systems) and the ability to use certain techniques the processor doesn't natively support. JamesM is currently writing his thesis on using microthreading with the x86.

Re: Is it Microkernel?

Posted: Mon Apr 19, 2010 4:36 pm
by Owen
Love4Boobies wrote:Indeed. Too bad there's nothing worse than shared memory. Well, okay, register passing, you've got me there. :lol:
Your argument against shared memory is?
The general argument against native microkernels seems to be that "You waste lots of CPU time doing message passing". To which I have to ask: How often is a process *both* IO and CPU bound? Very rarely. That overhead doesn't matter much in practice (as long as it's not horrid, like Mach's), but there is the other question: Will a managed code OS's JIT ever produce assembly as well optimized as a human's?
I can see no reason why they couldn't - in fact, they could in theory produce better code than humans, because the code is optimized dynamically. Two common examples are processor/model-specific optimizations and proper register allocation across shared libraries.
So are my reference projects. x264, for example, looks at CPUID at runtime to determine which optimizations to enable (if you compile it with support for multiple machines).
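
The same model-specific dispatch can be done in native code at startup; a minimal GCC-on-x86 sketch of the technique (not x264's actual code, and sum_sse2() is only a placeholder):

    /* Minimal runtime-dispatch sketch (GCC on x86) - the general technique,
     * not x264's actual code; sum_sse2() is only a placeholder here. */
    #include <stdio.h>
    #include <cpuid.h>                 /* GCC's __get_cpuid() and bit_SSE2 */

    static void sum_scalar(const float *a, const float *b, float *out, int n) {
        for (int i = 0; i < n; i++) out[i] = a[i] + b[i];
    }

    /* In a real project this would be a hand-written SSE2 routine. */
    static void sum_sse2(const float *a, const float *b, float *out, int n) {
        for (int i = 0; i < n; i++) out[i] = a[i] + b[i];
    }

    typedef void (*sum_fn)(const float *, const float *, float *, int);

    static sum_fn pick_sum(void) {
        unsigned eax, ebx, ecx, edx;
        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (edx & bit_SSE2))
            return sum_sse2;           /* the CPU advertises SSE2 */
        return sum_scalar;
    }

    int main(void) {
        float a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1}, out[4];
        pick_sum()(a, b, out, 4);
        printf("%f\n", out[0]);
        return 0;
    }
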
If ahead-of-time compilers, with years of development and free rein of the CPU to run optimizers, can't - then what hope does a JIT have? And we are not talking trivial speed-ups here. We are talking optimizations that can sometimes cut a program's running time to a quarter.
The JIT strategy in fact produces much better code than AOT compilers; the only problem is the overhead. The question to be asked is: Considering the overhead introduced by the JIT + runtime (e.g., bounds checking, garbage collection), is a managed design able to compete with a true microkernel? The benchmarks we have are pretty promising (and I'm not just talking about numbers in research papers). Also, don't forget that JIT compilation is not an infinite process - once the compilation is done, the code runs very fast. And everything you JIT (completely or not) can be cached locally on disk for future use.

Bytecoded designs have two additional advantages: portability across CPU architectures (heck, they can even work on MMU-less systems) and the ability to use certain techniques the processor doesn't natively support. JamesM is currently writing his thesis on using microthreading with the x86.
JITs produce better code than AOT compilers for branchy code where the developer hasn't profiled it properly. If you take the code, run it through a profiler, then inform the compiler accordingly (For example, use GCC's __builtin_expect to tell it whether a branch is regularly taken or not), it can generate very good code; in fact, in this case, GCC is able to smoke most JITs. Of course, it is extra work for the programmer, but if your software is CPU intensive, then it should be done. And, of course, much math intensive code is not branchy...
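
For example, the classic likely/unlikely wrappers around __builtin_expect (checked_sum() is just a made-up stand-in):

    /* __builtin_expect is a real GCC builtin; checked_sum() is just a stand-in. */
    #include <stdio.h>

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    long checked_sum(const int *data, long n) {
        long total = 0;
        for (long i = 0; i < n; i++) {
            if (likely(data[i] >= 0))   /* hot path: laid out as the fall-through   */
                total += data[i];
            else
                total -= data[i];       /* cold path: rarely taken in this workload */
        }
        return total;
    }

    int main(void) {
        int v[] = { 1, 2, 3, -4 };
        printf("%ld\n", checked_sum(v, 4));
        return 0;
    }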

However, some optimizations, and even crucial parts of a compiler - register scheduling, for example - are rather expensive processes, and JITs have to use less CPU-intensive methods of assigning registers. The other thing that all compilers are poor at - but JITs in particular - is vectorization. Most languages don't have the vector intrinsics provided to C programmers by processor developers, and both AOT and JIT compilers are horrid at autovectorization.
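
For reference, this is the kind of vector intrinsic I mean (x86 SSE in C; add_arrays() is only an illustration):

    /* Example of the C vector intrinsics being referred to (x86 SSE);
     * add_arrays() itself is only an illustration. */
    #include <stdio.h>
    #include <xmmintrin.h>              /* __m128, _mm_add_ps, ... */

    /* Adds two float arrays four lanes at a time (n is assumed to be a
     * multiple of 4 in this simplified sketch). */
    void add_arrays(const float *a, const float *b, float *out, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&out[i], _mm_add_ps(va, vb));
        }
    }

    int main(void) {
        float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, out[4];
        add_arrays(a, b, out, 4);
        printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
        return 0;
    }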

However, even if you turn the optimizer right up on your vector code, the hand assembler can always beat the compiler. Always.

Re: Is it Microkernel?

Posted: Mon Apr 19, 2010 11:41 pm
by Colonel Kernel
Managed OS != JIT. For example, Singularity uses AOT compilation.

Re: Is it Microkernel?

Posted: Tue Apr 20, 2010 1:07 am
by Love4Boobies
Owen wrote:
Love4Boobies wrote:Indeed. Too bad there's nothing worse than shared memory. Well, okay, register passing, you've got me there. :lol:
Your argument against shared memory is?
I was hoping this wouldn't turn into a shared memory vs. message-passing discussion, but here goes... :)

There was quite a fuss some years ago when people were trying to figure out what the one true way of doing IPC is. The main problem is that shared memory is not scalable - you have to keep using locking mechanisms in order to properly synchronize data accesses, and even then, you might not be able to enforce this. It gets even worse on MP systems. Another problem is that it is difficult for a programmer to get right (except... see below). Last but not least, it doesn't synergize with networking (see below).
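
As a trivial illustration of that locking burden (pthreads here, with an invented data structure; note that nothing forces other code to take the lock):

    /* Trivial illustration of the locking burden (compile with -pthread):
     * every reader and writer must remember to take the lock - nothing enforces it. */
    #include <pthread.h>
    #include <stdio.h>

    static struct { long head; long count; } shared;   /* imagine this in a shared mapping */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *producer(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* forget this anywhere and the data races */
            shared.head++;
            shared.count++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, producer, NULL);
        pthread_create(&t2, NULL, producer, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("count = %ld\n", shared.count);   /* 200000 only because both took the lock */
        return 0;
    }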

Tuple spaces, transactional memory, they're all very nice and easy (yes, I went there, heh) but none of these scale.

I find it even worse as a distributed paradigm and I'm not alone. Even as early as 1992, here's what the Plan 9 people had to say about it:
Rob Pike, Ken Thompson, Dave Presotto and Phil Winterbottom wrote: This one baffles us: distributed shared memory is a lousy model for building systems, yet everyone seems to be doing it. (Try to find a PhD this year on a different topic.)
Most high-end multiprocessor operating systems implement message passing today. Clicky!
Owen wrote:JITs produce better code than AOT compilers for branchy code where the developer hasn't profiled it properly. If you take the code, run it through a profiler, then inform the compiler accordingly (For example, use GCC's __builtin_expect to tell it whether a branch is regularly taken or not), it can generate very good code; in fact, in this case, GCC is able to smoke most JITs. Of course, it is extra work for the programmer, but if your software is CPU intensive, then it should be done. And, of course, much math intensive code is not branchy...
It's not really the same thing, but ok. That is something that will give static optimization a good boost.
However, some optimizations, and even crucial parts of a compiler - register scheduling, for example - are rather expensive processes, and JITs have to use less CPU-intensive methods of assigning registers.
That is indeed true. The LLVM folks have found an algorithm that uses puzzle solving to get near-optimal register allocation in real time. I read that paper some time ago; it's very possible that someone has come up with something even better by now.
The other thing that all compilers are poor at - but JITs in particular - is vectorization. Most languages don't have the vector intrinsics provided to C programmers by processor developers, and both AOT and JIT compilers are horrid at autovectorization.
Lolwut?
However, even if you turn the optimizer right up on your vector code, the hand assembler can always beat the compiler. Always.
The main problem is overhead. The other problem is that compilers aren't perfect - hand-written assembly can always tune performance or decrease size further. Perhaps some day we will figure out how to make compilers that always find the best way to generate code (if humans can do it, then so can computers - it's just a matter of "how?").

Regarding your statement, we usually use a lot of mid- and high-level languages and find their performance acceptable. If the benefits outweigh the overhead we should go for it.
Colonel Kernel wrote:Managed OS != JIT. For example, Singularity uses AOT compilation.
Indeed, that's true - it's just that this discussion seems to lean towards the (non-)benefits of the JIT approach. In fact, AOT is what people usually use today; JIT has not been explored much as far as managed OSes are concerned (although there is some interest in it).

Re: Is it Microkernel?

Posted: Tue Apr 20, 2010 12:31 pm
by Owen
Love4Boobies wrote:
Owen wrote:
Love4Boobies wrote:Indeed. Too bad there's nothing worse than shared memory. Well, okay, register passing, you've got me there. :lol:
Your argument against shared memory is?
I was hoping this wouldn't turn into a shared memory vs. message-passing discussion, but here goes... :)

There was quite a fuss some years ago when people were trying to figure out what the one true way of doing IPC is. The main problem is that shared memory is not scalable - you have to keep using locking mechanisms in order to properly synchronize data accesses, and even then, you might not be able to enforce this. It gets even worse on MP systems. Another problem is that it is difficult for a programmer to get right (except... see below). Last but not least, it doesn't synergize with networking (see below).

Tuple spaces, transactional memory, they're all very nice and easy (yes, I went there, heh) but none of these scale.
And my design doesn't use them.

On the local machine, it uses shared memory primarily as a conduit for message passing. Messages are placed into the memory, then a notification is sent to the receiver. Each socket has one or two ring buffers, into which requests and responses are placed. Each buffer has an established writer and reader, and protocols to efficiently control the processing of messages. All of this logic is implemented in userspace and without copying.
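
A bare-bones sketch of such a single-writer/single-reader ring (illustration only; the real buffer lives in a shared mapping, uses memory barriers, and wakes the reader with a kernel notification):

    /* Bare-bones single-writer/single-reader ring - illustration only. The real
     * buffer would live in a shared mapping, use memory barriers, and wake the
     * reader with a kernel notification instead of being polled like this. */
    #include <stdio.h>
    #include <string.h>

    #define RING_SLOTS 64
    #define MSG_BYTES  120

    struct ring {
        unsigned head;                           /* advanced only by the reader */
        unsigned tail;                           /* advanced only by the writer */
        char     slot[RING_SLOTS][MSG_BYTES];
    };

    int ring_put(struct ring *r, const char *msg) {
        if ((r->tail + 1) % RING_SLOTS == r->head)
            return -1;                           /* full: writer backs off      */
        strncpy(r->slot[r->tail], msg, MSG_BYTES - 1);
        r->slot[r->tail][MSG_BYTES - 1] = '\0';
        r->tail = (r->tail + 1) % RING_SLOTS;    /* publish, then notify reader */
        return 0;
    }

    int ring_get(struct ring *r, char *out) {
        if (r->head == r->tail)
            return -1;                           /* empty: reader goes to sleep */
        memcpy(out, r->slot[r->head], MSG_BYTES);
        r->head = (r->head + 1) % RING_SLOTS;
        return 0;
    }

    int main(void) {
        static struct ring r;
        char buf[MSG_BYTES];
        ring_put(&r, "open /dev/fb0");
        if (ring_get(&r, buf) == 0)
            printf("request: %s\n", buf);
        return 0;
    }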

This does not preclude a distributed system: the RPC daemon connects to one of these sockets (either as a client or a server) and behaves as a traditional client, except that it sends the messages between machines (my preference is to do this over SCTP).