
How do you handle messaging?

Posted: Thu Aug 13, 2015 7:45 pm
by LtG
Hi,

I've been thinking about message passing lately for my micro-kernel and thought I'd ask how others have implemented it or are planning to do so. Also, what criteria do you think are important?

Criteria that come to mind:
- Simpler is better
- Performance/Efficiency
- Reliable vs unreliable
- Fixed size vs variable sized
- Synchronous vs asynchronous
- Local vs distributed
- Addressing, using PID (or thread ID, task ID, etc) as sender and recipient address or something else?
- Queuing or not

Assuming virtual address spaces are used, the message has to be moved between address spaces to be delivered. Two ways of doing that:
- The message router has access to all address spaces
- Every address space (task) has (limited) access to the message router's address space

So far I've been thinking of using fixed-size messages, probably very small. The smallest message I can think of would contain only the recipient, which I'm pretty sure can't be avoided. The next smallest would add some message data, which should be 32 bits (address-sized on IA-32). There would also be a standardized way of sending larger messages using this recipient+data format, for example the data being a pointer to some type of structure in a page frame that can be moved to the receiving task's memory.
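As a sketch, such a minimal fixed-size message might look like this in C (the layout and field names are my assumption, not a settled design):

```c
#include <stdint.h>

/* Hypothetical minimal fixed-size message: just a recipient ID plus
 * one address-sized data word (32 bits on IA-32). The data word can
 * carry an immediate value, or point into a page frame that the
 * kernel remaps into the receiving task's address space for larger
 * payloads. */
typedef struct {
    uint32_t recipient; /* task/thread ID; the one unavoidable field */
    uint32_t data;      /* immediate value or payload pointer */
} msg_t;
```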

One other thing that comes to mind is whether tasks should have only one incoming message queue or be allowed to have multiple. Even with only one, they can still create multiple in user land by multiplexing everything through that one inbox: the code managing the inbox takes the messages and, based on their priority (or requested activity), places them in multiple inboxes that the application's main logic then waits on. Having just one would move the complexity from kernel to user land, and the overhead is only paid when it is needed. On the other hand, if the kernel has to do some lookup anyway (to find out where to put the message in the first place), it might be more efficient to put the message in the recipient's proper inbox immediately and not have the application recreate part of the logic.
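The user-land demultiplexer described above could be sketched like this (the inbox is simulated with an array here; in a real system it would be a blocking kernel receive call, and all names are hypothetical):

```c
#include <stdint.h>

/* Sketch: code managing the single kernel inbox drains it and sorts
 * messages into per-priority queues that the application's main
 * logic then waits on. */

#define NUM_PRIOS 4
#define QDEPTH    16

typedef struct { uint32_t prio; uint32_t data; } umsg_t;

static umsg_t queues[NUM_PRIOS][QDEPTH];
static int    qlen[NUM_PRIOS];

static void demux(const umsg_t *inbox, int n)
{
    for (int i = 0; i < n; i++) {
        int p = inbox[i].prio % NUM_PRIOS; /* clamp bogus priorities */
        if (qlen[p] < QDEPTH)              /* drop on overflow */
            queues[p][qlen[p]++] = inbox[i];
    }
}
```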

One reason to have multiple inboxes relates to the point above about the minimum message containing only a recipient: this could work similarly to Linux signals, where receiving a message in a specific inbox triggers a specific action, like terminating the task.

How do you see the message sending semantics? Once a message is sent, should it be available to the sending side at all?

Re: How do you handle messaging?

Posted: Thu Aug 13, 2015 9:22 pm
by Brendan
Hi,

Every OS does this differently, and there's no standard and/or "right" way. It's mostly just a matter of figuring out what suits your OS the most.

In general; asynchronous is better for scalability and reducing the number of task switches, but requires different programming practices (threads tend to end up looking like state machines and not like procedural code) and costs more RAM (for message queues). Synchronous is easier to use because it has similar semantics to normal function calls. This mostly implies that asynchronous is better suited to larger systems (e.g. modern multi-CPU 80x86) that have no compatibility constraints; while synchronous is better suited to smaller things (e.g. embedded systems) and/or systems where you do have compatibility constraints (e.g. POSIX).

For my OS; I use asynchronous messaging with variable sized messages (up to 2 MiB). Messages that were successfully sent aren't guaranteed to be successfully received (mostly because it's a distributed OS and networking may be involved). Each thread has 2 message buffers (at fixed virtual addresses in the thread's private "thread space"). ThreadIDs are used for message addresses (a thread sends a message to another thread, not to an arbitrary mailbox). When a message is sent it's moved from the sending thread's message buffer to the receiver's message queue in kernel space (or a networking queue if the receiver is remote); and when a message is received it's moved from the receiver's message queue in kernel space to the receiving thread's message buffer.

In old versions of my OS I only had one message buffer per thread. I found this to be awkward because it means that if you need to send a message while handling a received message you need to shift the received message out of the way first. I also provide a kind of "batch kernel API" feature where you can ask the kernel to do a list of things (instead of doing one syscall at a time) to reduce "CPL=3 -> CPL=0 -> CPL=3" switching costs; and having 2 message buffers means that you can send or receive 2 messages at a time (or send and receive at the same time); which helps to improve throughput when you're transferring a lot of data (e.g. file IO).
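A "batch kernel API" of this shape could be sketched as follows; the enum values, the entry layout and the simulated kernel-side interpreter are all my invention, purely to illustrate the idea of paying the ring-transition cost once per list rather than once per operation:

```c
#include <stdint.h>

/* Sketch: user space builds a list of operations and submits the
 * whole list with a single syscall. */

enum batch_op { OP_SEND, OP_RECEIVE, OP_YIELD };

typedef struct {
    enum batch_op op;
    uint32_t      buffer; /* which of the thread's message buffers */
} batch_entry_t;

/* Simulated kernel side: process the list, return ops completed.
 * A real kernel would stop early on errors and report per-entry
 * status back to user space. */
static int batch_submit(const batch_entry_t *list, int n)
{
    int done = 0;
    for (int i = 0; i < n; i++) {
        switch (list[i].op) {
        case OP_SEND: case OP_RECEIVE: case OP_YIELD:
            done++;
            break;
        }
    }
    return done;
}
```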

Also note that I use messaging for all communication. With small fixed size messages you can't do that and end up needing something else (shared memory) to transfer larger amounts of data just to keep the overhead sane (e.g. if you want to send 100 MiB of data you really don't want to split it up into 6.5 million tiny little 16-byte messages). I don't use shared memory (because it's very painful/inefficient to emulate when separate computers are involved), which means I have to support large messages, which means they need to be variable sized (to avoid sending 2 MiB when you only need to send 2 bytes).
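A variable-sized message then needs a length field up front; a header along these lines (the layout is my guess, not Brendan's actual format) also makes the arithmetic above easy to check: 100 MiB split into 16-byte messages is 6,553,600 of them, the ~6.5 million mentioned, versus 50 messages at 2 MiB each.

```c
#include <stdint.h>

/* Hypothetical header for a variable-sized message (up to 2 MiB):
 * the length field avoids transferring a full 2 MiB buffer when the
 * payload is only 2 bytes. */
typedef struct {
    uint32_t sender;    /* filled in by the kernel, unforgeable */
    uint32_t type;      /* message type, interpreted by receiver */
    uint32_t length;    /* payload bytes that follow */
    uint8_t  payload[]; /* flexible array member, C99 */
} var_msg_t;

#define MIB          (1024u * 1024u)
#define MAX_MSG_LEN  (2u * MIB)

/* how many messages to move `total` bytes at `per_msg` bytes each */
static uint32_t msgs_needed(uint32_t total, uint32_t per_msg)
{
    return (total + per_msg - 1) / per_msg; /* round up */
}
```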

For security; the kernel guarantees that a message's "sender ID" can't be forged; and it's the receiver's responsibility to implement its own permission checks (if/when different senders have different permissions). I've seen some kernels (e.g. Minix) where the kernel has a built in permission system, and these systems always seem both inadequate and excessively expensive to me (e.g. a massive bitfield for every thread saying which other threads it will accept messages from; where you're stuck with a global limit of 1 million threads and it adds 128 KiB of metadata per thread, and where a thread can't say "these threads can send some types of messages but not other types of messages" and has to have its own secondary permission system half the time anyway).
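Since the kernel guarantees the sender ID, the receiver-side check can be as simple as a small policy table keyed by sender and message type, instead of a kernel-wide bitfield. A sketch (the table layout and names are hypothetical, not how any mentioned kernel does it):

```c
#include <stdint.h>
#include <stdbool.h>

/* Per-receiver access list: which senders may send which message
 * types. Unknown senders get nothing by default. */
typedef struct {
    uint32_t sender;
    uint32_t type_mask; /* bit n set => message type n allowed */
} acl_entry_t;

static bool sender_allowed(const acl_entry_t *acl, int n,
                           uint32_t sender, unsigned type)
{
    for (int i = 0; i < n; i++)
        if (acl[i].sender == sender)
            return (acl[i].type_mask >> type) & 1u;
    return false;
}
```

This is exactly the kind of per-type policy the quoted Minix-style bitfield can't express without a secondary permission system.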


Cheers,

Brendan

Re: How do you handle messaging?

Posted: Fri Aug 14, 2015 7:08 am
by LtG
I'm trying to figure out what would be the best(tm) way of handling the message passing. One of the factors is where it is located:
- Message router can be either ring0 or ring3
- It can be either in its own memory space, in the sending task's memory space, or in all tasks' memory spaces (i.e. sending and receiving)

I'm not sure there's really a distinction between it being in the sending task's memory space or in all memory spaces, so that leaves us with four combinations (ring0 vs ring3, and own memory space vs all memory spaces). Its own memory space of course has higher overhead due to having to switch virtual memory, and in that case it would probably make sense to make the message router a full task of its own (for consistency), making it even heavier. Besides, it needs some access to both the sending and receiving memory spaces anyway, so I don't see much benefit in it being in its own memory space.

As for rings, I think it gets a bit trickier. I'd prefer as much as possible in ring3 for obvious reasons, but in the end it will need a ring0 transition to get the message across. One possible ring3 implementation would be to have the message router prepare the needed buffers (pages in the sending task's memory space); the app writes to those and calls the message router, which then calls the pager to move the pages across. The pager can allow only the message router to make such calls, so it should be just as safe. Ring0 would be pretty much the same, except the message router itself is also in ring0 and can thus do the memory mapping itself or ask the pager to do it; there's no extra ring0 transition if it uses the pager, and it might keep responsibilities cleaner.

In both cases I'm also thinking of doing a capability check (access rights to send the messages across) before the first message (or, for ease of use, at the beginning of the first message) and then allowing all further messages through. The capabilities need to be kept track of so they can be revoked later if needed. This obviously minimizes overhead. One benefit of ring3 would be that once set up, everything could remain in ring3, though then it effectively becomes shared memory, which can just as well be accomplished with the ring0 solution and a separate shared memory "service" to which you send an initial message to set up a shared memory area between tasks, and thus has essentially the same overhead.
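The "check once, then let messages through" idea above could look like this: the router does the (expensive) capability check only on a cache miss for a (sender, receiver) pair, and revocation just clears the cached grant. A sketch only; a real router would want a hash table, and all names are hypothetical:

```c
#include <stdint.h>
#include <stdbool.h>

#define CACHE_SLOTS 64

typedef struct { uint32_t sender, receiver; bool granted; } grant_t;
static grant_t cache[CACHE_SLOTS];
static int     cached;

/* fast path: is this pair already known-good? */
static bool grant_lookup(uint32_t s, uint32_t r)
{
    for (int i = 0; i < cached; i++)
        if (cache[i].sender == s && cache[i].receiver == r)
            return cache[i].granted;
    return false; /* miss: caller must do the full capability check */
}

/* slow path succeeded: remember the grant */
static void grant_add(uint32_t s, uint32_t r)
{
    if (cached < CACHE_SLOTS)
        cache[cached++] = (grant_t){ s, r, true };
}

/* revocation just flips the cached entry */
static void grant_revoke(uint32_t s, uint32_t r)
{
    for (int i = 0; i < cached; i++)
        if (cache[i].sender == s && cache[i].receiver == r)
            cache[i].granted = false;
}
```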

I omitted the ring1/2 options; they might add a bit of extra security by having the message router keep "fixed" access to all the extra mapped memory regions used for sending messages while operating in ring1 or ring2. That way the message router doesn't have to be in ring0, yet the application doesn't have direct access to the other task's memory space. Assuming the message router is small/simple enough, the extra security gained would be minimal.

Any other solutions or improvements you can think of? Or other "locations" where the message router could reside? By locations I mean rings, memory spaces, segments, anything really that could provide cleaner separation, extra security, better performance, or some other benefit if you can think of it =)

At the end of the day, the message router really only needs to provide the minimum, and the rest can be extended from that, even if inefficiently, since the efficiency can be improved later with special cases, shared memory (with the same semantics as messages), etc.

Out of curiosity, does anyone have statistics on how many messages your systems send, for example per second or per minute? During idle and during load (what kind of load? more or less full GUI?)..

Re: How do you handle messaging?

Posted: Fri Aug 14, 2015 7:44 am
by LtG
Brendan wrote: In general; asynchronous is better for scalability and reducing the number of task switches, but requires different programming practices (threads tend to end up looking like state machines and not like procedural code) and costs more RAM (for message queues). Synchronous is easier to use because it has similar semantics to normal function calls. This mostly implies that asynchronous is better suited to larger systems (e.g. modern multi-CPU 80x86) that have no compatibility constraints; while synchronous is better suited to smaller things (e.g. embedded systems) and/or systems where you do have compatibility constraints (e.g. POSIX).
I'm planning on only supporting async, since async is pretty much going to be needed anyway and sync should be avoided if/when possible. If it becomes necessary (it shouldn't), then I can build it on top of async anyway; shouldn't be too difficult. Also, it's relatively likely that we'll see even more cores added to CPUs, especially if software starts to support them better. In recent years I think games have supported multithreading better and better.

Brendan wrote: Each thread has 2 message buffers (at fixed virtual addresses in the thread's private "thread space"). ThreadIDs are used for message addresses (a thread sends a message to another thread, not to an arbitrary mailbox). When a message is sent it's moved from the sending thread's message buffer to the receiver's message queue in kernel space (or a networking queue if the receiver is remote); and when a message is received it's moved from the receiver's message queue in kernel space to the receiving thread's message buffer.

By fixed address do you mean a fixed offset from the thread-specific part of the address space, or fully fixed? I mean, if you have multiple threads in one process and each thread has a private "thread space" within the process's virtual memory space, then it can't be fully fixed, right?

So an app allocates buffer space (or it might already be a fixed-size buffer, fixed to 2 MiB max) and writes its message there. Next the kernel copies that message to a kernel-space buffer in the receiver's queue, and finally it's copied again from kernel space to the receiver's message queue in the receiver's address space? Haven't there been any performance issues with the two extra copy operations?
Brendan wrote: In old versions of my OS I only had one message buffer per thread. I found this to be awkward because it means that if you need to send a message while handling a received message you need to shift the received message out of the way first. I also provide a kind of "batch kernel API" feature where you can ask the kernel to do a list of things (instead of doing one syscall at a time) to reduce "CPL=3 -> CPL=0 -> CPL=3" switching costs; and having 2 message buffers means that you can send or receive 2 messages at a time (or send and receive at the same time); which helps to improve throughput when you're transferring a lot of data (e.g. file IO).
Are those two mailboxes (message buffers) general-purpose, or essentially an inbox and an outbox? I was planning on having an inbox and an outbox, which could hold an arbitrary number of messages, for the same reason as you: preparing multiple messages in one go and paying the ring transition cost only once to send all of them, possibly to many different recipients. Just wondering if you've noticed some benefit in having two general-purpose mailboxes that can be used for both sending and receiving, so you can send two at a time. Wouldn't it then be reasonable to send more than just two, or am I missing something?
Brendan wrote: Also note that I use messaging for all communication. With small fixed size messages you can't do that and end up needing something else (shared memory) to transfer larger amounts of data just to keep the overhead sane (e.g. if you want to send 100 MiB of data you really don't want to split it up into 6.5 million tiny little 16-byte messages). I don't use shared memory (because it's very painful/inefficient to emulate when separate computers are involved), which means I have to support large messages, which means they need to be variable sized (to avoid sending 2 MiB when you only need to send 2 bytes).
I also intend to use messaging for all communication; my idea was to keep the message router as simple as possible and use whatever method is available to transfer larger chunks (i.e. moving pages/page tables, etc). I don't consider moving pages from one address space to another to be shared memory, as the memory is not actually shared: it's removed from the sender and then mapped to the receiver, and both never have access to it simultaneously. At that point I don't see any practical difference from copying manually, byte (or word/dword/etc) by byte, from one address space to another and then yet to another.

Essentially you will need to send 50x2 MiB for 100 MiB, where I can just move a few pages or tables from one directory to another and make it appear. I think getting shared memory to work over a network would be relatively difficult and possibly quite inefficient; I haven't decided yet whether I'm going to make it possible for tasks to request shared memory (due to not being sure I can guarantee it over a network efficiently enough). However, within the machine I don't see how networking comes into the picture.

In your case (assuming at least Ethernet for the network, maybe also TCP/IP or UDP/IP) you need to do 50x2 MiB messages from the sender to your networking stack, which then needs to split them into ~1.5 KiB chunks and send them over the network; the receiving network stack needs to piece them back together into 50x2 MiB messages and then do 50x2 MiB message transfers on the receiving side to get to the consuming process/thread.

Using page moves it might only mean one 100 MiB message to the network stack, the networking being the same (possibly a little simpler, as the 2 MiB messages might not fit into 1.5 KiB frames nicely and you might have to either leave some IP packets not full or add extra logic to concatenate the last part of one message with the first part of the next), and on the receiving side there's also only one 100 MiB message from the network stack to the consuming process/thread.
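The rough numbers behind this comparison, assuming 4 KiB pages and roughly 1500-byte Ethernet payloads (both are my assumptions, not stated in the thread):

```c
#include <stdint.h>

/* Back-of-envelope counts for moving 100 MiB between tasks. */
#define MIB        (1024ull * 1024ull)
#define PAGE_SIZE  4096ull
#define ETH_MTU    1500ull /* approximate payload per frame */

static uint64_t div_up(uint64_t a, uint64_t b) { return (a + b - 1) / b; }

/* message-based: 100 MiB / 2 MiB  = 50 kernel messages per hop
 * page-move:     100 MiB / 4 KiB  = 25600 page mappings moved,
 *                                   but one logical message per hop
 * on the wire either way:         ~69906 Ethernet frames          */
```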
Brendan wrote: For security; the kernel guarantees that a message's "sender ID" can't be forged; and it's the receiver's responsibility to implement its own permission checks (if/when different senders have different permissions). I've seen some kernels (e.g. Minix) where the kernel has a built in permission system, and these systems always seem both inadequate and excessively expensive to me (e.g. a massive bitfield for every thread saying which other threads it will accept messages from; where you're stuck with a global limit of 1 million threads and it adds 128 KiB of metadata per thread, and where a thread can't say "these threads can send some types of messages but not other types of messages" and has to have its own secondary permission system half the time anyway).
I was actually planning on checking how Minix does it; I can't remember anymore, it's been such a long time since I read the book. I'm also planning on having the kernel (or rather the message router) guarantee the sender ID, but how are you doing that over the network? Or do you only guarantee that it came from the local network stack, with no guarantee of where it actually originated?

I don't really like the 1M max threads (though unlikely to be exceeded easily, it's still an arbitrary limitation and I don't like those if I can help it), or the extra 128 KiB, but the biggest issue is that it probably won't suit many cases anyway. The only benefit I can see is that the message router can short-circuit some logic and not even attempt delivery if it knows the recipient doesn't want it. I would expect such cases to be the vast minority, since what's the point of spamming another thread if it's not interested in your messages?

I'd prefer to have message filtering (aka a firewall) on the kernel/message router side, as it would allow easy manipulation, for example NAT'ing certain traffic to go to the other side of the planet without the app even knowing it, though I think I can get the same result in other ways too, so it's not that important right now. The reason I'd like it in user space (part of the app) is that the app is really the only one who knows what messages it wants: if it wants all the messages then there's no overhead, and having the message router do firewalling would just be an unnecessary cost; apps with simple filtering needs again don't need a very complex firewall and can save on overhead, etc. I was thinking of providing the functionality, though, for example either as user-space libraries or as a user-space ring3 service. The service would have the benefit of letting the app keep the simplest setup that fits its requirements while allowing the user to add to it (the NAT'ing I mentioned, for example), without the extra cost of ring transitions, address space changes, or context switches.

Re: How do you handle messaging?

Posted: Fri Aug 14, 2015 8:19 am
by gerryg400
LtG wrote: Out of curiosity, does anyone have any statistics of how many messages your systems send for example every second or minute or so? During idle and during load (what kind of load? more or less full GUI?)..
That's an interesting question. Interesting enough that I thought I'd add some counters to my kernel code to see. It's kind of difficult to measure messages in a time period, so I thought I'd do some common tasks and see how many messages get passed.

From power-on until the system becomes idle I see:

    Msgs received: 4143

Running 'gcc' to compile and link a single file program:

    Msgs received: 1487

Running 'ls' in my home folder containing just 2 files:

    Msgs received: 74

Running 'ls -al' in /usr/bin with 163 files:

    Msgs received: 1049

I'm not sure what these stats mean (if anything).

Re: How do you handle messaging?

Posted: Mon Aug 17, 2015 3:12 am
by Artlav
Interesting.
Assuming async messaging, how would it work, exactly?

I.e. a process wants to open a file.
It sends a message to the VFS with a file name.
Some task switches later the VFS processes this message...

Then what?
How does the file handle get back to the requester?

If the VFS sends a reply message, then the requester process will have to be able to view all messages in its inbox to find this reply, which might come after some unrelated messages.
If a reply mechanism is embedded in the messaging system, then it becomes synchronous and quite a bit more complex, since you need to track the messages.

How is it done, in practice?
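One common answer (a generic pattern, not claimed to be any particular poster's design) is tag-based correlation: the client stamps each request with a locally unique tag, the server echoes the tag in its reply, and the client's inbox code routes each reply to whoever is waiting on that tag. A sketch with hypothetical names:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical pending-request table for async request/reply. */
#define MAX_PENDING 32

typedef struct { uint32_t tag; int result; bool done; } pending_t;
static pending_t pending[MAX_PENDING];
static uint32_t  next_tag = 1;

/* client side: allocate a tag before sending the request */
static uint32_t request_begin(void)
{
    for (int i = 0; i < MAX_PENDING; i++) {
        if (pending[i].tag == 0) {
            pending[i] = (pending_t){ next_tag++, 0, false };
            return pending[i].tag;
        }
    }
    return 0; /* table full */
}

/* inbox side: a reply arrived carrying the echoed tag */
static bool reply_deliver(uint32_t tag, int result)
{
    for (int i = 0; i < MAX_PENDING; i++) {
        if (pending[i].tag == tag) {
            pending[i].result = result;
            pending[i].done = true;
            return true;
        }
    }
    return false; /* stale/unknown reply: drop it */
}

/* client side: check for completion, free the slot when done */
static bool request_poll(uint32_t tag, int *result)
{
    for (int i = 0; i < MAX_PENDING; i++) {
        if (pending[i].tag == tag && pending[i].done) {
            *result = pending[i].result;
            pending[i].tag = 0;
            return true;
        }
    }
    return false;
}
```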

Re: How do you handle messaging?

Posted: Mon Aug 17, 2015 4:31 am
by Owen
I'm going to disagree with Brendan here and say sync vs async doesn't really affect scalability. Or, to put it another way: if you want to scale with sync message passing, that just involves threads. Additionally, there are lots of things which don't need to scale - consider the unix "ls" command - its performance is completely dominated by the overheads of printing its output or by disk speed, and no amount of asynchronous work is going to noticeably speed it up.

My message passing system is derived from those of seL4 and QNX. It's synchronous rendezvous by default (that is, a message is only transferred when somebody is waiting for one), with asynchronous as an option. It's built around fixed-size messages consisting of an attribute register (which contains a 16-bit message ID and some other metadata), a "badge ID" (more on that later) and 7 message registers (which are 32 bits each), plus the ability to attach two additional buffers to a message (a transmit buffer and a receive buffer) and one or more capabilities (which might be used for passing other message ports, for example). The buffers may be sparse (they use the same iovec structures that the POSIX readv & writev functions do).

Messages are sent to a port and arrive through an endpoint. A port is derived from an endpoint by "badging" it; a server would generally use the badge to point to some per-client structure. The badge is unforgeable - the kernel copies it out of the port data into the message.
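A message shaped like the one described might look as follows; the field names and exact packing are my guesses, not the actual API:

```c
#include <stdint.h>

/* Fixed-size message: attribute register (low 16 bits hold the
 * message ID, the rest metadata), the badge, and 7 32-bit message
 * registers. The badge is copied out of the port data by the
 * kernel, so the sender cannot forge it. */
typedef struct {
    uint32_t attr;   /* low 16 bits: message ID; rest: metadata */
    uint32_t badge;  /* kernel-copied, unforgeable */
    uint32_t mr[7];  /* message registers */
} ipc_msg_t;

static inline uint16_t msg_id(const ipc_msg_t *m)
{
    return (uint16_t)(m->attr & 0xffffu);
}
```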

The message passing primitives are
  • Send - One way message send to a port
  • Call - Send message to a port and wait for response
  • Reply - Send a response back to the caller
  • (Reply)Wait - If there is a caller, send a response back (a response is always sent - even if it's zero length); then wait for a new message on the specified endpoint
The "caller" information is stored inside the kernel. Calls are reliable; Sends are semi-reliable (they'll always arrive at the recipient in order, but if a connection error happens then some number of sends since the last call may be lost). Replies are unreliable - if there is no caller, or the caller has dropped, or similar, no error is reported and the message is just dropped on the floor (in other words, network issues are considered the caller's problem).
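The typical server built from these primitives is a reply-wait loop. Below is a sketch with the kernel side simulated by arrays, purely to show the shape of the loop; none of the names are the real API, and the doubling "handler" stands in for real per-client work selected via the badge:

```c
#include <stddef.h>

typedef struct { unsigned badge; int value; } rmsg_t;

#define NREQ 3
static rmsg_t inbox[NREQ] = { {1, 10}, {2, 20}, {1, 30} };
static int    replies[NREQ];
static size_t in_pos, out_pos;

/* simulated ReplyWait: deliver previous reply (if any), then
 * fetch the next message; returns 0 when the inbox is empty */
static int reply_wait(const int *reply, rmsg_t *out)
{
    if (reply && out_pos < NREQ)
        replies[out_pos++] = *reply;
    if (in_pos == NREQ)
        return 0;
    *out = inbox[in_pos++];
    return 1;
}

static void server_loop(void)
{
    rmsg_t m;
    int reply = 0, have_reply = 0;
    while (reply_wait(have_reply ? &reply : NULL, &m)) {
        reply = m.value * 2; /* handle(); badge picks client state */
        have_reply = 1;
    }
}
```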

Read and Write functions are used for copying things to/from the caller's message buffers. For example, the FS server wouldn't want to have to allocate a 1GB buffer in case somebody did a 1GB write. Instead, it'd use the Read & Write functions to copy chunks of the message into its own buffers*.
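The chunked-copy idea can be sketched as follows; msg_read() here simulates a Read-style primitive (in reality it would copy from the caller's iovecs), and the names and chunk size are arbitrary:

```c
#include <stdint.h>
#include <string.h>

#define CHUNK 64

/* simulated caller-side transmit buffer */
static uint8_t caller_buf[1000];

/* stand-in for the kernel Read primitive: copy up to len bytes of
 * the caller's buffer starting at off, return bytes copied */
static size_t msg_read(size_t off, uint8_t *dst, size_t len)
{
    if (off >= sizeof caller_buf)
        return 0;
    if (len > sizeof caller_buf - off)
        len = sizeof caller_buf - off;
    memcpy(dst, caller_buf + off, len);
    return len;
}

/* server side: consume the whole message CHUNK bytes at a time,
 * never allocating a buffer as large as the caller's write */
static size_t consume_message(void)
{
    uint8_t chunk[CHUNK];
    size_t off = 0, got;
    while ((got = msg_read(off, chunk, sizeof chunk)) > 0)
        off += got; /* a real server would process the chunk here */
    return off;     /* total bytes consumed */
}
```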

For synchronous message passing, there are a couple of other utility functions: one allows "suspending" and "resuming" the handling of a call - this copies the caller context into a single-use object with a handle, if there is a need to temporarily switch to handling something else. There's also a "forward" function - this allows delegation of a call to another process (the message registers are replaced, but the message buffers are passed through).

The important thing is that so far none of the functions need to do any memory allocation except "suspend." This is important to my goal of minimum overhead message passing.

Finally, there is support for asynchronous message passing. A process can "bind" a port to an endpoint and some queue space. This permits asynchronous calls and sends through the port - replies are delivered as messages into the endpoint. Again, this keeps allocations out of the main message passing interface. This is important for minimizing overheads there and for enforcing policy on processes' resource use.

* There is one aspect of network non-transparency here - the maximum message buffer size on the local machine is bounded only by the size of your address space. When sending remotely, the limit is lower. Upon connection establishment, a "system message**" is sent to the server which can give the ID of a proxy server it would like the client machine to use. If this proxy can be started on the client machine, it's activated and might, for example, break up a 1GB write into lots of 64kB writes

** One bit of the message attributes is used to identify system messages. These are divided into a couple of groups, some of which are special unforgeable messages for things like resource management.