IDEA: Asynchronous message-passing with single copy + reduce
Posted: Fri Jul 08, 2005 7:47 pm
by mystran
Old wisdom is that any problem in CS can be solved by adding a level of indirection. But where should that level be?
That's more or less what I thought, before I got this interesting (?) idea. It has to do with the code that Brendan posted: writing/reading a single character at a time. In a sense, if we assume every message send/receive needs a system call, then buffering characters before writing is the same trick -- we send one larger message instead of many single-character messages.
Sending multiple messages at a time isn't very hard: just copy the messages somewhere, and hand them all to the system call at once. Whether this is useful is questionable, since if the userspace API is to be kept sane, we need to copy the messages into some "outbox" in the process before giving them all to the kernel. It can be done, but I'd do something else first.
Optimizations to the receiving side are much more interesting: if we can place the message directly into userspace, we don't need to store it in the kernel at all. Further, we only have to visit the kernel on the receiving side when we wait for a message to arrive; we need the kernel to put us to sleep, but we need the kernel to schedule something else anyway, so nothing is really lost there.
The simple solution emerges when we realize that to receive a message, we give the kernel a buffer to copy it into. Now, it doesn't matter if we give that buffer before the message arrives. And instead of a single buffer, we can give the kernel a bigger buffer to fill with as many messages as will fit. But that's obvious. To receive a message, one checks whether some buffer end-point marker has advanced, and if not, calls into the kernel to wait.
What was less obvious to me was how to handle out-of-order receiving: waiting for a certain type of message while letting other messages remain in the queue and keep their relative order. As long as we stayed with perfect FIFO order, everything would be simple.
Actually, if we simply fill a contiguous buffer with messages, each prefixed with its size, we can still remove messages from the middle "lazily": just overwrite the relevant message with a "dummy" message of the same size. Alternatively, we can use a single bit in the size field (or somewhere) to mark the message as "already delivered." In this case we can keep the message in the buffer for later processing (although this might not be useful without a copying garbage collector).
The last thing is what to do when the buffer fills. And the obvious answer is not to let the buffer fill: switch to another buffer when the original buffer has less than some threshold of free space. If the receiver can't switch the buffer, then it probably isn't processing the messages, and we'd just run out of memory if we tried to queue an infinite number of incoming messages. As long as there is enough buffering space for normal operation, we are fine. If we run out of buffering space, then any further sends should fail, and the process will eventually be killed as "unresponsive".
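The buffer layout described above can be sketched in C. This is a minimal, hypothetical illustration, not any real kernel's code: the names msg_put and msg_next, and the choice of the high bit of the size prefix as the "delivered" mark, are my own assumptions.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>
#include <assert.h>

#define DELIVERED_BIT 0x80000000u  /* high bit of the size prefix marks a consumed message */

/* Append a message; returns 0 on success, -1 if it would not fit.
 * Layout: [u32 size][payload] ... [u32 0] sentinel marking free space. */
static int msg_put(uint8_t *buf, size_t *begin, size_t end,
                   const void *payload, uint32_t size)
{
    /* need room for prefix + payload + trailing zero sentinel */
    if (*begin + 4 + size + 4 > end)
        return -1;
    memcpy(buf + *begin, &size, 4);
    memcpy(buf + *begin + 4, payload, size);
    *begin += 4 + size;
    uint32_t zero = 0;
    memcpy(buf + *begin, &zero, 4);   /* receiver sees where free space starts */
    return 0;
}

/* Advance a read cursor to the next undelivered message; returns its size,
 * or 0 when the sentinel (end of queued messages) is reached. */
static uint32_t msg_next(const uint8_t *buf, size_t *pos)
{
    for (;;) {
        uint32_t prefix;
        memcpy(&prefix, buf + *pos, 4);
        if (prefix == 0)
            return 0;                          /* sentinel: no more messages */
        if (!(prefix & DELIVERED_BIT))
            return prefix;                     /* live message body at *pos + 4 */
        *pos += 4 + (prefix & ~DELIVERED_BIT); /* skip already-delivered message */
    }
}
```

Consuming a message out of order then amounts to OR-ing DELIVERED_BIT into its prefix; later scans step over it while the messages around it keep their relative order.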
Re:IDEA: Asynchronous message-passing with single copy + red
Posted: Fri Jul 08, 2005 8:08 pm
by mystran
The specific implementation I am considering for a future version of Neon:
Each thread has two attributes: buffer_begin and buffer_end.
Each time a message is sent, we first check whether it fits into the receiver's buffer. If not, the send fails. If it fits, we copy it to buffer_begin (with its size as prefix) and advance buffer_begin. We keep a zero size-prefix after the last message, so that the receiver can see where the remaining buffer space begins.
The receiving side can read incoming messages without visiting the kernel. It can also call the kernel to wait for any message, or for a specific one. If it's waiting for a specific message, it can also be awakened if the buffer space runs too low. A separate buffer can also be given for the specific type of message, in order to simplify the buffer-space management a bit (this is just an optimization).
Finally, any thread can, at any time, replace its buffer with another one. From that point on, any new messages arrive in the new buffer.
To avoid race conditions, when a thread waits, it tells the kernel where it thinks the free buffer space starts. If the kernel disagrees, then the thread got a message between the last time it checked for new messages (in userspace) and the moment it arrived in the kernel.
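That race check can be sketched in a few lines. This is a hypothetical illustration; tcb_t and wait_should_retry are my own names, standing in for whatever the kernel's wait path would actually look like.

```c
#include <stddef.h>
#include <assert.h>

/* The waiting thread passes the buffer_begin value it last observed in
 * userspace. If the kernel's value differs, a message arrived between the
 * userspace check and the system call, so the thread must rescan its
 * buffer instead of going to sleep. */
typedef struct {
    size_t buffer_begin;   /* kernel's view: where the next message will go */
    size_t buffer_end;
} tcb_t;

/* Returns 1 if the caller should return to userspace and rescan,
 * 0 if it is safe to block. */
static int wait_should_retry(const tcb_t *tcb, size_t seen_begin)
{
    return tcb->buffer_begin != seen_begin;
}
```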
A single buffer can be used:
Allocate a buffer large enough. Keep track of the last position that contains messages that haven't been handled yet. When the buffer starts to fill, replace it with the space that has been freed at the beginning of the buffer. Enlarge that area every time messages are handled. The effect is an approximation of a circular buffer: a moving window into a bigger buffer space.
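The window-switching step above can be sketched as follows. All names here (window_t, maybe_switch_window) are illustrative assumptions of mine, not part of any stated design.

```c
#include <stddef.h>
#include <assert.h>

/* win_begin..win_end is the arena region currently handed to the kernel;
 * everything before 'drained' has already been handled by the receiver
 * and may be reused as the next window. */
typedef struct {
    size_t win_begin;   /* current window given to the kernel */
    size_t win_end;
    size_t fill;        /* bytes of the window already filled with messages */
    size_t drained;     /* arena prefix already handled, free for reuse */
} window_t;

/* When free space in the current window drops below 'threshold', point the
 * window at the drained prefix of the arena. Returns 1 if switched. */
static int maybe_switch_window(window_t *w, size_t threshold)
{
    if (w->win_end - (w->win_begin + w->fill) >= threshold)
        return 0;                    /* still enough room: keep the window */
    w->win_begin = 0;                /* reuse the freed space at the front */
    w->win_end   = w->drained;
    w->fill      = 0;
    return 1;
}
```

The window thus slides forward through the arena and wraps back to the start, approximating a circular buffer without ever splitting a message across the wrap point.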
Multiple buffers can be used:
A simple modification of the basic stop-and-copy semispace garbage collector works nicely here: trigger a collection of the buffer area when the buffer fills. Before the collector starts on the old buffer area, it points the buffer at a new area. Once it's done, it can swap the roles of the two areas, and collect again the next time the buffer fills.
Re:IDEA: Asynchronous message-passing with single copy + red
Posted: Fri Jul 08, 2005 8:14 pm
by mystran
Summary:
There need not be any kernel buffers. The buffers are directly in userspace. Policy is mostly in userspace as well.
Only a single copy is made, unless the receiver wants to keep the message but reuse the buffer area. I believe not needing to keep the message is the normal case.
The system is simple.
Receiving only needs a system call if it needs to wait.
Re:IDEA: Asynchronous message-passing with single copy + red
Posted: Sun Jul 10, 2005 10:12 am
by JoeKayzA
Hi!
Interesting concept...
But a few questions remain in my head:
First of all, is this intended to work in one single address space only, or also cross-address-space/cross-process? For this you'll need to map the whole buffer region into every involved address space, and so multiple senders could read (and manipulate) messages sent by others (privacy?).
Each thread has two attributes: buffer_begin and buffer_end.
Where do these 'attributes' reside? Are they accessible from userspace directly? If so, how would you handle this with multiple address spaces? The buffer is unlikely to be mapped at the same base address in every address space.
Each time a message is sent, we first check whether it fits into the receiver's buffer. If not, the send fails. If it fits, we copy it to buffer_begin (with its size as prefix) and advance buffer_begin.
How do you ensure that the sender isn't preempted between reading those values, checking the remaining buffer space, and placing its message? Once it advances buffer_begin (and is then preempted), the kernel will wake up the receiver (thinking that a new message has arrived), which then tries to interpret the message, even though it's still incomplete.
You need mutex protection here (I don't know if this was obvious to you, but you didn't mention it at all). But using mutexes probably needs more system calls...
This sounds more like a high-level messaging system built on top of shared memory, but then you should rather build it with high-level 'message ports' than thread-to-thread. Each buffer would then be associated with one port, and a thread could wait on that port for a message to arrive; but since every involved process (senders and receivers) has full access to the buffer, restricting a thread to sending or receiving only is pointless.
My thoughts on this. And, by the way, my first post!
cheers
Joe
Re:IDEA: Asynchronous message-passing with single copy + red
Posted: Sun Jul 10, 2005 10:43 am
by mystran
JoeKayzA wrote:
First of all, is this intended to work in one single address space only, or also cross-address-space/cross-process? For this you'll need to map the whole buffer region into every involved address space, and so multiple senders could read (and manipulate) messages sent by others (privacy?).
Well, the kernel will be doing the "send" operation, so userspace need not be able to access remote address spaces. The kernel copies the message into the receiver's buffer on behalf of the sender, but the receiver need not involve the kernel.
Each thread has two attributes: buffer_begin and buffer_end.
Where do these 'attributes' reside? Are they accessible from userspace directly?
In the TCB, and not directly accessible from userspace. But userspace can reconstruct these attributes pretty cheaply, because it knows what it set them to originally.
How do you ensure that the sender isn't preempted between reading those values, checking the remaining buffer space, and placing its message? Once it advances buffer_begin (and is then preempted), the kernel will wake up the receiver (thinking that a new message has arrived), which then tries to interpret the message, even though it's still incomplete.
The kernel does the sending, so we can make the whole send operation "atomic" if we want. I don't see the problem though: the real trouble only appears when you have multiple concurrent senders. If you have only one, you can always copy the message first, then advance the begin pointer, and then wake up the receiver. You only need the mutex to prevent multiple concurrent senders from trying to use the same part of the buffer. And since sending is done in the kernel anyway for security reasons, one mutex adds little overhead.
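A sketch of that serialized send path, with a pthread mutex standing in for whatever lock the kernel would actually use. The names inbox_t and kernel_send are mine, and the wakeup of a blocked receiver is only noted in a comment.

```c
#include <pthread.h>
#include <stdint.h>
#include <string.h>
#include <stddef.h>
#include <assert.h>

/* One lock per receive buffer serializes concurrent senders, and the begin
 * pointer only advances after the copy completes, so the receiver never
 * observes a half-written message. */
typedef struct {
    pthread_mutex_t lock;
    uint8_t *buf;
    size_t   begin, end;
} inbox_t;

static int kernel_send(inbox_t *in, const void *payload, uint32_t size)
{
    pthread_mutex_lock(&in->lock);
    if (in->begin + 4 + size + 4 > in->end) {        /* room incl. sentinel? */
        pthread_mutex_unlock(&in->lock);
        return -1;                                   /* send fails cleanly */
    }
    memcpy(in->buf + in->begin + 4, payload, size);  /* copy the body first */
    memcpy(in->buf + in->begin, &size, 4);           /* then publish the prefix */
    in->begin += 4 + size;
    uint32_t zero = 0;
    memcpy(in->buf + in->begin, &zero, 4);           /* fresh end-sentinel */
    pthread_mutex_unlock(&in->lock);
    /* a real kernel would wake the receiver here if it is blocked */
    return 0;
}
```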
You need mutex protection here (I don't know if this was obvious to you, but you didn't mention it at all). But using mutexes probably needs more system calls...
Well, mutex protection is one solution. But I was trying to avoid overly low-level details.
This sounds more like a high-level messaging system built on top of shared memory, but then you should rather build it with high-level 'message ports' than thread-to-thread. Each buffer would then be associated with one port, and a thread could wait on that port for a message to arrive; but since every involved process (senders and receivers) has full access to the buffer, restricting a thread to sending or receiving only is pointless.
I'm sure you can (by now) see that I am just optimizing a normal asynchronous message-send system. In fact, in my specific system I'm sending messages to objects, which are owned by threads. Messages are delivered to thread-specific message queues, and then dispatched to the server-side objects.
Re:IDEA: Asynchronous message-passing with single copy + red
Posted: Sun Jul 10, 2005 12:48 pm
by JoeKayzA
Well, back here..
Kernel will be doing the "send" operation
OK, this more or less answers all my questions ;D. Sorry, I didn't see that clearly at first; I thought you wanted to completely avoid entering the kernel except for waiting on the receiving side. But with this you would still need a system call for every single message sent, true?
Thanks, now it looks much more clear to me! When the kernel does the sending stuff, things like synchronization and security can be implemented quite naturally. I misunderstood the concept at first.
The L4 kernel also uses a messaging system without in-kernel buffers, though a synchronous one. The only real difference from yours is that an L4 thread can only receive one message each time it calls the kernel to wait, and any thread that tries to send after that is simply blocked until the receiver waits again (I guess this is actually what motivated you to reduce the number of system calls on the receiver side).
You should consider, however, providing a maximum number of accepted messages when a thread goes to wait for messages to arrive (in addition to the buffer-space limitation). This way, the synchronous model could be emulated quite easily (wait for one message only). Synchronous messaging allows you to build further synchronization primitives on top of it (as is done on L4, for example). You'd end up with an even more flexible system.
cheers
Joe
Re:IDEA: Asynchronous message-passing with single copy + red
Posted: Sun Jul 10, 2005 3:36 pm
by mystran
Actually, both synchronous and asynchronous messaging can be used to emulate the other one:
1. synchronous to asynchronous: have a manager thread/service that waits for messages and queues them, and then hands them to the receiver when it requests one. If you think about it, this is more or less how (say) UDP in Unix works.
2. asynchronous to synchronous: again, have a manager thread/service. This time we wait for a request to send and a matching request to receive. Then we transfer the message, and reply to both threads that the operation has completed.
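The second emulation boils down to a single matching step in the manager. A minimal sketch, where try_rendezvous and the two pending structs are my own hypothetical names:

```c
#include <string.h>
#include <stddef.h>
#include <assert.h>

/* The manager holds at most one pending send and one pending receive per
 * rendezvous point, and completes both in one step when they match. */
typedef struct { const char *data; size_t len; } pending_send;
typedef struct { char *buf; size_t cap; size_t got; int done; } pending_recv;

/* Pair a pending send with a pending receive: transfer the message and mark
 * the receive complete. Returns 1 on a match, 0 if one side is missing. */
static int try_rendezvous(const pending_send *s, int have_send,
                          pending_recv *r, int have_recv)
{
    if (!have_send || !have_recv)
        return 0;                      /* one side still waiting: no transfer */
    size_t n = s->len < r->cap ? s->len : r->cap;
    memcpy(r->buf, s->data, n);        /* the actual (single) message copy */
    r->got  = n;
    r->done = 1;                       /* both sides now get completion replies */
    return 1;
}
```

In the full scheme the manager would then send an asynchronous completion message back to each of the two blocked threads, which is what makes the exchange look synchronous to them.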
As long as you have a kernel that implements both synchronous and asynchronous services, it is of little importance which model it uses internally. Actually, one can build the kernel on asynchronous messaging, and still export only a fully synchronous model to userspace; this might not be a bad solution for certain designs.
If, on the other hand, you want to move most of the stuff out of the kernel, then it is natural to export one of the models, and build the other on top of it in userspace.
L4 chose the synchronous route. Its advantage is that if applications are built with synchronous messaging in mind, we can send small (in-register) messages VERY fast; if the fast path is written in assembler, all you need to do is check that no waiting is necessary, swap in a new ESP/EIP, possibly load a new page directory, and return to userspace. (OK, some bookkeeping is needed as well, to perform the rest of the context switch lazily if needed later.)
The trouble with the L4 model is that asynchronous messages are often easier to deal with: both graphical user interfaces and multi-client services are natural to write with event-driven architectures. Asynchronous messaging also makes it easier to avoid deadlocks, and if you want mutually untrusting processes to talk to each other, you need some form of asynchrony (or at least some kind of abort mechanism) anyway.
L4 has timeouts for this purpose, but now we have added a realtime constraint: we are violating Dijkstra's principle that hard realtime constraints (essential urgency) should be avoided in favour of soft constraints (moral urgency) where possible. Once you have a hard timeout, there is little you can do about it.
Say a client times out just as the server finishes: there is little the server can do (except cache the result). In an asynchronous design, the client would simply send a cancellation to the server. The server could then either drop the request, or reply anyway. If the server replied, the client could decide that it's happy after all, or it could drop the reply if it's too late to be useful. This models real life, which is highly asynchronous; decisions can be changed at any point.
So basically, after I got fed up with adding special cases to my old, fully synchronous system, I designed the simplest possible asynchronous messaging API that could possibly work. In fact, it works quite nicely (see my post about the new Neon IPC system a few weeks ago).
What I'm trying to do here, is to take that simple API, and find an implementation that remains simple, but avoids kernel buffering, and unnecessary context switching. In other words: I'm trying to apply some of the L4 spirit into the asynchronous world.
Oh, and my hypothesis is that if an asynchronous messaging facility is made simple (and efficient) enough at the kernel level, then at the application level we might reach better real-life performance with less work than we can get from building on top of an L4-style synchronous system.
The L4 fast path might be fast, but one has to remember that when you add things like security and queueing on top of it, you are doing much more messaging than you would need with a native solution.
Re:IDEA: Asynchronous message-passing with single copy + red
Posted: Mon Jul 11, 2005 6:38 am
by Pype.Clicker
Sounds interesting. Will investigate when I have more time available...
Re:IDEA: Asynchronous message-passing with single copy + red
Posted: Mon Jul 11, 2005 12:36 pm
by distantvoices
Me too, need more time for that kind of stuff. Too busy atm with sports, work & cooking :-)
Re:IDEA: Asynchronous message-passing with single copy + red
Posted: Wed Jul 27, 2005 9:14 am
by Pype.Clicker
finally took the time to read through the proposal ...
Optimizations to the receiving side are much more interesting: if we can place the message directly into userspace, we don't need to store it in the kernel at all. Further, we only have to visit the kernel on the receiving side when we wait for a message to arrive; we need the kernel to put us to sleep, but we need the kernel to schedule something else anyway, so nothing is really lost there.
So if I understand correctly: when a thread is willing to receive messages, it gives the kernel a collection of pages that will be used as an "inbox", and processes that want to send messages simply write to that area (presumably mapped somehow from the receiver's space into the sender's space). 1) how do you ensure that the mapping is available when you're trying to write the message, and 2) how do you ensure that process C, which wants to write to process A, cannot modify/delete a message written to A's buffer by process B earlier?
What was less obvious to me was how to handle out-of-order receiving: waiting for a certain type of message while letting other messages remain in the queue and keep their relative order. As long as we stayed with perfect FIFO order, everything would be simple.
Wouldn't it be way clearer to simply state that messages from a queue are processed FIFO, and that if some special messages should be processed first, they should arrive on another 'message queue'?
All in all, honestly, if I were facing the same design issues, I'd be tempted to say "aren't we simply trying to implement something like a socket/pipe/stream rather than message switching here?"
Re:IDEA: Asynchronous message-passing with single copy + red
Posted: Wed Jul 27, 2005 9:50 am
by JoeKayzA
Pype.Clicker wrote:
1) how do you ensure that the mapping is available when you're trying to write the message, and 2) how do you ensure that process C, which wants to write to process A, cannot modify/delete a message written to A's buffer by process B earlier?
It wasn't clear to me either, at first. The point is that sending a message is carried out entirely in the kernel, so the receiver's buffer doesn't need to be directly accessible to any sender. (And this prevents process C from becoming impolite and reading other people's mail.)
All in all, honestly, if I were facing the same design issues, I'd be tempted to say "aren't we simply trying to implement something like a socket/pipe/stream rather than message switching here?"
Packets of data arriving on a stream don't necessarily include a message header (in Linux at least; I don't know about your specific implementation). When the kernel takes care of message sending, it will reliably insert a full message header into the receiver's buffer. But aside from that, IMHO, there are no real differences between buffered stream IPC and buffered message IPC (with variable message sizes).
cheers Joe