Question about a design of syscalls for microkernels

Ethin · Post by **Ethin** » Fri Jun 18, 2021 6:52 pm

So, I've been mulling over this idea in my mind but thought I'd bring it to you guys.
I know that, typically, a microkernel uses message passing/IPC to communicate between processes and the kernel. My idea is to change this in a couple ways:

The kernel would only contain the absolute minimum syscalls. As much code as possible would run in userspace. The kernel would contain syscalls for threading, processes, memory allocation and such, as well as PCI device access, but that would be it.
Device access would run through userspace servers via shared memory. The server would initialize and access the PCI/CXL bus through the kernel and get the info it needed, but then it would allocate a shared memory buffer and communicate with the device that way.
System calls would go through a similar mechanism. For the majority of tasks in libc, for example (printing to the console, accessing the network, communicating with the filesystem, ...) applications would send request packets to the server in question. The request and response communication mechanism would occur "under the hood" via libc, so applications wouldn't be aware that this was happening. When libc was initialized for a given process it would allocate a shared buffer to all of the servers that it needed access to or could use a central dispatch server that would handle the communication. To the application, it would just be calling fread/fopen/fclose/... and would have no idea that the underlying interface was using this method. This, in turn, would make porting apps simpler.

What do you guys think of this method? It could occur via a space-consuming format like JSON-RPC, or it could use my idea of a tightly-packed data structure using strong typing. What would be the (theoretical) performance drawbacks/improvements? How could this be made better? What would be its advantages/disadvantages?
One advantage I could see would be that I could (possibly) bypass traditional IPC entirely. IPC could be handled via shared buffers instead of going through the kernel or VFS layer as is done on Linux/BSD. That would eliminate the overhead of syscalls (I think) as well as the need to allocate VFS objects for pipes and such. It might also be riskier too; for example, it might not be as secure. But I thought I'd post it here and see what you guys thought.

nullplan · Post by **nullplan** » Sat Jun 19, 2021 12:09 am

I always wonder if the microkernel guys aren't overthinking it. You could just do all system calls the same way (system call instruction, kernel transition), and then decide on the kernel side which ones to handle immediately and which ones to package up and send to an external server. That way, userspace doesn't have to know how to contact the relevant processes, you can change the handling dynamically (even after release of the kernel), and importantly, there is no way for userspace to duck up the communication. Shared memory with privileged processes is always a hard proposition from a security perspective.

Next problem: Your I/O server (or VFS server or whatever you call the thing that handles open()) must talk to all other processes. That is a 1:n relationship, limiting the number of processes in the system, just because the I/O server can only take on so many clients at a time. In your case, the shared memory mapping at least consumes one page, so you can only have as many processes as free pages in the I/O server. That probably is not the biggest limiting factor here, so let's continue with the next problem: How to handle notification? All methods I can think of consume either linear auxiliary memory or linear time, meaning your I/O performance gets worse the more processes there are.

Ethin wrote:One advantage I could see would be that I could (possibly) bypass traditional IPC entirely. IPC could be handled via shared buffers instead of going through the kernel or VFS layer as is done on Linux/BSD. That would eliminate the overhead of syscalls (I think) as well as the need to allocate VFS objects for pipes and such. It might also be riskier too; for example, it might not be as secure. But I thought I'd post it here and see what you guys thought.

You still need the kernel to handle notifications, unless you want the I/O server to constantly poll for updates, in which case laptop users will hate you, since that will drain the battery very quickly.

Korona · Post by **Korona** » Sat Jun 19, 2021 3:19 am

nullplan wrote:I always wonder if the microkernel guys aren't overthinking it. You could just do all system calls the same way (system call instruction, kernel transition), and then decide on the kernel side which ones to handle immediately and which ones to package up and send to an external server.

This is how Redox does it. I don't like that design, because if you move that functionality to the kernel, then the kernel needs to know about all kinds of requests, right? (For example, it needs to know about POSIX, about device-specific ioctls(), and so on.)

nullplan wrote:Next problem: Your I/O server (or VFS server or whatever you call the thing that handles open()) must talk to all other processes. That is a 1:n relationship, [...]

I don't think micro- and monolithic kernels differ in this regard. The kernel is in this position (1:n) anyway, so this is only a problem if your VFS server can handle significantly fewer requests than your kernel. Throughput of IPC is not that hard to optimize (via batching of wakeups at the VFS server's side; it's not necessary to batch at the client side for this to work), so it's unlikely that this becomes a bottleneck.
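A minimal sketch of server-side wakeup batching, assuming the request queue behaves like a standard Rust `mpsc` channel. The `Request` type and the `next_batch` function are hypothetical names for illustration. The server blocks once, then drains whatever else is already queued before going back to sleep:

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

/// Hypothetical request type; a real server would carry an opcode and arguments.
#[derive(Debug, PartialEq)]
pub struct Request(pub u32);

/// Block until at least one request arrives, then drain everything else that
/// is already queued, so that a single wakeup services a whole batch.
/// Returns an empty batch once all clients have disconnected.
/// `max` is assumed to be at least 1.
pub fn next_batch(rx: &Receiver<Request>, max: usize) -> Vec<Request> {
    let mut batch = Vec::new();
    // The only blocking wait: this is where the server sleeps.
    match rx.recv() {
        Ok(first) => batch.push(first),
        Err(_) => return batch,
    }
    // Non-blocking drain of whatever queued up while the server was asleep.
    while batch.len() < max {
        match rx.try_recv() {
            Ok(req) => batch.push(req),
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
        }
    }
    batch
}
```

With this shape, the cost of one wakeup is spread over every request in the batch, and clients do not need to cooperate.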

I agree with nullplan about his points on shared memory queues though: you need at least one additional IPC mechanism to handle notifications, and you have to ensure that shared memory queues can be operated safely even if one side decides to trash the data structures and to write garbage to the memory area.

Ethin · Post by **Ethin** » Sat Jun 19, 2021 12:02 pm

What would you guys suggest for handling notifications and memory queues safely and such? I originally thought of avoiding data structure trashing via a serialization/deserialization library. Though that might have security implications, there's no way to ensure a system is absolutely secure.
And yeah, I don't want my kernel to "know about everything" as Korona explained. If I do that I might as well just write a monolithic kernel.

nullplan · Post by **nullplan** » Sat Jun 19, 2021 12:40 pm

Korona wrote:I don't like that design, because if you move that functionality to the kernel, then the kernel needs to know about all kinds of requests, right?

Not really. It only needs to know about all requests it can handle itself, and what to do with all the other ones. OS-9 uses a model where it has divided all system calls up into F calls and I calls. All F calls are handled by the kernel; all I calls are packaged up and then sent to be handled by IOMAN. And how that one decides to pass on the request is anyone's guess.
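The split can be sketched as a single routing function in the kernel's syscall entry path. The call names and the `Route` type below are illustrative assumptions, not OS-9's actual numbering:

```rust
/// Hypothetical syscall set, loosely modelled on an F-call / I-call split.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Syscall {
    // "F calls": handled inside the kernel.
    Fork,
    Sleep,
    AllocMem,
    // "I calls": packaged up and forwarded to an I/O manager.
    Open,
    Read,
    Write,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Route {
    HandleInKernel,
    ForwardToIoManager,
}

/// The kernel only needs to know which side of the split a call is on,
/// not what the I/O manager will eventually do with it.
pub fn route(call: Syscall) -> Route {
    match call {
        Syscall::Fork | Syscall::Sleep | Syscall::AllocMem => Route::HandleInKernel,
        Syscall::Open | Syscall::Read | Syscall::Write => Route::ForwardToIoManager,
    }
}
```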

Korona wrote:it needs to know about POSIX

Well, you design the system calls, I should hope they are POSIX compatible. Doesn't really matter who handles the system calls, the design ought to be POSIX compatible. Unless you are building one of those "no forced abstraction" kind of kernels.

Ethin wrote:What would you guys suggest for handling notifications and memory queues safely and such?

Futexes. Futexes for everything. Mutexes get a futex, and conditions get a futex, and semaphores get a futex, ...

Yeah, I'm a fan of futexes. So simple yet versatile. Waiting for multiple futexes is a bit of trouble, though, so you might want to use the "one thread per client" approach.
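As a rough illustration of the wait/wake contract, here is a sketch that emulates a futex word in portable Rust. An atomic holds the word, and a `Mutex`/`Condvar` pair stands in for the kernel's wait queue. `FutexLike`, `wait`, and `store_and_wake` are made-up names. A real futex is a kernel system call, and a real wait may return spuriously where this version loops until the value changes:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Condvar, Mutex};

/// User-space stand-in for a futex word plus its kernel wait queue.
pub struct FutexLike {
    word: AtomicU32,
    lock: Mutex<()>,
    queue: Condvar,
}

impl FutexLike {
    pub fn new(initial: u32) -> Self {
        FutexLike {
            word: AtomicU32::new(initial),
            lock: Mutex::new(()),
            queue: Condvar::new(),
        }
    }

    /// Sleep only while the word still holds `expected`.
    /// If the value has already changed, return immediately.
    pub fn wait(&self, expected: u32) {
        let mut guard = self.lock.lock().unwrap();
        while self.word.load(Ordering::Acquire) == expected {
            guard = self.queue.wait(guard).unwrap();
        }
    }

    /// Publish a new value and wake every sleeper. The store happens under
    /// the lock so that a waiter cannot miss the wakeup.
    pub fn store_and_wake(&self, value: u32) {
        let _guard = self.lock.lock().unwrap();
        self.word.store(value, Ordering::Release);
        self.queue.notify_all();
    }

    pub fn load(&self) -> u32 {
        self.word.load(Ordering::Acquire)
    }
}
```

The check-then-sleep step is the important part: the waiter never sleeps on a value that has already changed, which is what makes the same primitive usable for mutexes, conditions, and semaphores.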

Ethin wrote:And yeah, I don't want my kernel to "know about everything" as Korona explained. If I do that I might as well just write a monolithic kernel.

No clue what he means by that. I'm planning on writing a monolithic kernel that doesn't even know how to render characters into a framebuffer (I've explained before: Essentially Linux, but I'm taking a hacksaw to it, and see what can be removed and still retain a functioning system). Coming along quite nicely so far.

Korona · Post by **Korona** » Sat Jun 19, 2021 1:52 pm

nullplan wrote:Well, you design the system calls, I should hope they are POSIX compatible. Doesn't really matter who handles the system calls, the design ought to be POSIX compatible. Unless you are building one of those "no forced abstraction" kind of kernels.

Not sure if it would fall into your "no forced abstraction" category, but Managarm implements POSIX entirely in userspace, on top of the [kernel's API](https://docs.managarm.org/hel-api/hel_8h.html). Instead of using POSIX-like syscalls, the C library sends an IPC request to a POSIX server to handle things like open(). Once the file is open()ed, the process directly communicates with the FS driver (using the same IPC mechanism). There is no need to have a POSIX-compatible syscall API since instead of having the kernel forward requests, we can just as well directly send the request to the target server.

Interestingly, we have a mechanism similar (but not identical) to the one that you describe for OS-9: syscalls that have the highest bit set are considered "supercalls"; instead of handling them in the kernel, they are forwarded to the parent process. These are used to obtain initial information (such as the IPC handle that is used to communicate with POSIX) from the parent process.
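A sketch of the high-bit test, assuming 64-bit syscall numbers. The constant, the `Target` type, and the choice to strip the bit before forwarding are assumptions for illustration, not Managarm's actual code:

```rust
/// Assumed: syscall numbers are 64 bits wide and the top bit marks a supercall.
pub const SUPERCALL_BIT: u64 = 1 << 63;

#[derive(Debug, PartialEq, Eq)]
pub enum Target {
    /// Handled by the kernel itself.
    Kernel(u64),
    /// Forwarded to the parent process, with the marker bit removed.
    Parent(u64),
}

pub fn classify(number: u64) -> Target {
    if number & SUPERCALL_BIT != 0 {
        Target::Parent(number & !SUPERCALL_BIT)
    } else {
        Target::Kernel(number)
    }
}
```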

Ethin · Post by **Ethin** » Sat Jun 19, 2021 2:52 pm

Korona wrote:
nullplan wrote:Well, you design the system calls, I should hope they are POSIX compatible. Doesn't really matter who handles the system calls, the design ought to be POSIX compatible. Unless you are building one of those "no forced abstraction" kind of kernels.
Not sure if it would fall into your "no forced abstraction" category, but Managarm implements POSIX entirely in userspace, on top of the kernel's API. Instead of using POSIX-like syscalls, the C library sends an IPC request to a POSIX server to handle things like open(). Once the file is open()ed, the process directly communicates with the FS driver (using the same IPC mechanism). There is no need to have a POSIX-compatible syscall API since instead of having the kernel forward requests, we can just as well directly send the request to the target server.

This is similar to what I was suggesting in the OP. Get rid of the overhead of syscalls and just directly send requests to the server via a bidirectional communication channel of some kind. Then you only have the overhead of DMA accesses, which, I think, is far less than the overhead of syscalls. But I might be wrong on that one.
The integrity of requests and responses could be handled in lots of ways; one way is this:

Use a serialization library like serde or rkyv which allow for custom serialization/deserialization implementations but, more importantly, allow data formats to have strong typing guarantees. This means that I can define a structure format like so:
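```
use serde::{Deserialize, Serialize};

// Request opcodes with a fixed 64-bit representation.
#[repr(u64)]
#[derive(Clone, Copy, Debug, Default, Eq, PartialEq, Ord, PartialOrd, Hash, Serialize, Deserialize)]
pub enum FsRequestType {
    #[default] // needed so that FsRequest can derive Default
    FileOpen,
    FileClose,
    FileRead,
    FileWrite,
    // ...
}

// StackVec and StackString are placeholders for fixed-capacity,
// stack-allocated container types; they are not defined here.
#[derive(Clone, Copy, Debug, Default, Eq, PartialEq, Ord, PartialOrd, Serialize, Deserialize)]
pub struct FsRequest {
    pub rtype: FsRequestType,
    // Stack-allocated array of stack-allocated strings of size 32 and 1024, respectively
    pub args: StackVec<StackString<1024>, 32>,
}
```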
This would have a couple benefits:
- Since the size of both the stack-allocated string and array are constant, this means that the compiler can do more optimizations to the code. Minor benefit, however.
- The major benefit is that all of this uses Rust's type system. If someone tries to trash the structures before the server receives them, deserialization will fail. They could trash it afterwards, but it wouldn't do anything since the deserialized data would be in the server's memory space by that point, so we could probably get away with zeroing the received request once it's been received.

This is just one way of doing this, but it's also one of the fastest. The NX bit could be set on the entire memory buffer since you're (probably) never going to need to be able to execute code in that area. I'm sure other paging/hardware protections could also be set.

Korona · Post by **Korona** » Sat Jun 19, 2021 4:07 pm

Serialization libraries like serde won't save you (at least if you allow running foreign memory-unsafe code on your OS). Malicious programs could always access the shared buffer without going through the serialization lib. (And I don't think serde protects against concurrent in-place modification of the data buffer, does it?)

Ethin · Post by **Ethin** » Sat Jun 19, 2021 6:19 pm

Korona wrote:Serialization libraries like serde won't save you (at least if you allow running foreign memory-unsafe code on your OS). Malicious programs could always access the shared buffer without going through the serialization lib. (And I don't think serde protects against concurrent in-place modification of the data buffer, does it?)

A normal program wouldn't have access to the buffer, only libc. But yes, I get your point, and nefarious programs could access the buffer if they dig around enough.
I don't think there really is a really good way of securing the buffer from tampering. I could use X25519 and sign each request or response, or use hashes (e.g. BLAKE3), but even that might not be enough.
In general the idea was to eliminate syscalls from the equation as much as possible. There's no point in using a syscall if you're just going to send something to another userspace process from the syscall.

nullplan · Post by **nullplan** » Sun Jun 20, 2021 12:22 am

Korona wrote:Instead of using POSIX-like syscalls, the C library sends an IPC request to a POSIX server to handle things like open(). Once the file is open()ed, the process directly communicates with the FS driver (using the same IPC mechanism). There is no need to have a POSIX-compatible syscall API since instead of having the kernel forward requests, we can just as well directly send the request to the target server.

This is basically the exact same misunderstanding of POSIX I keep hearing from such luminaries as Linus Torvalds: POSIX does not know about system calls. It does not prescribe system calls. It prescribes other concepts, like file names, the file tree, etc., but not system calls. It does define an open() function, certainly, but that can be implemented in any number of ways, and even if the OS offers a SYS_open system call, that does not mean that that call is the only possible implementation of the open() function.

I should probably clarify (or correct myself) that it is the OS in its entirety that needs to be POSIX compatible. The syscalls need to be such that it is possible to build a POSIX library on top, that is what I meant. In your case, you implement open() not with a system call that opens a file, but with a system call that sends a request to the POSIX server, for it to open the file. Cygwin is another attempt at a POSIX library, this time on top of NT. And there, open() is more complicated because they are implementing pseudo file systems in a library. And it does work, it's just slow.

Ethin wrote:Get rid of the overhead of syscalls and just directly send requests to the server via a bidirectional communication channel of some kind.

That does not work. If the channel were some kernel-side channel like a pipe or socket, you still need to call send and receive system calls. If it is shared memory, you need to notify the receiving process, using, you guessed it, a system call. And if it is shared memory, but the receiving process polls for updates, Greta Thunberg will burn your house down. In that case you would be exchanging the system call for massive amounts of needless work.

Ethin wrote:Use a serialization library like serde or rkyv which allow for custom serialization/deserialization implementations but, more importantly, allow data formats to have strong typing guarantees. This means that I can define a structure format like so:

If my C program interfaces with your Rust OS, I can just have it fill the SHM with whatever I want and send the receiving process into confusion. Complexity and security are enemies! I would suggest using a simple system with simple binary formats, so that the receiving process does not have to parse text, and validation is very simple. Decide ahead of time which calls a given server handles and what the arguments look like. Choose a communication channel that preserves datagram boundaries.

One such system is implemented in FUSE. Each request starts with a common header and a request-specific body. The header contains an opcode and a unique 64-bit number. The response also contains a header and a request specific body. And the response header also contains the unique number. This allows the OS to send several requests to the same FUSE server. And almost no strings are sent, if at all possible. Lookup is the only request that contains a string, and ReadDir is the only reply that contains strings. And all strings are Pascal strings, their length is given as a number somewhere else, so you don't have to rely on NUL termination. Of course, libfuse still adds NUL termination everywhere.
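A sketch of that style of framing, using a simplified 16-byte header. The layout, field names, and the `encode`/`decode` helpers are assumptions for illustration and are not the actual FUSE wire format. The sketch also assumes that the transport delivers exactly one request per datagram:

```rust
/// Assumed layout: opcode (4 bytes), body length (4 bytes), unique id (8 bytes).
pub const HEADER_LEN: usize = 16;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Header {
    pub opcode: u32,
    pub body_len: u32,
    pub unique: u64,
}

/// Build one datagram: fixed header followed by an opaque, length-prefixed body.
pub fn encode(opcode: u32, unique: u64, body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN + body.len());
    out.extend_from_slice(&opcode.to_le_bytes());
    out.extend_from_slice(&(body.len() as u32).to_le_bytes());
    out.extend_from_slice(&unique.to_le_bytes());
    out.extend_from_slice(body);
    out
}

/// Validate and split one datagram. Returns `None` for anything malformed.
pub fn decode(datagram: &[u8]) -> Option<(Header, &[u8])> {
    if datagram.len() < HEADER_LEN {
        return None;
    }
    let mut word = [0u8; 4];
    let mut quad = [0u8; 8];
    word.copy_from_slice(&datagram[0..4]);
    let opcode = u32::from_le_bytes(word);
    word.copy_from_slice(&datagram[4..8]);
    let body_len = u32::from_le_bytes(word);
    quad.copy_from_slice(&datagram[8..16]);
    let unique = u64::from_le_bytes(quad);
    let body = &datagram[HEADER_LEN..];
    // The declared length must match what actually arrived.
    if body.len() != body_len as usize {
        return None;
    }
    Some((Header { opcode, body_len, unique }, body))
}
```

Validation here is two length checks, and the body length is carried explicitly, so nothing depends on NUL termination. The `unique` value is what lets a reply be matched to its request when several are in flight.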

That system is not very flexible. Which is a good thing. It allows diverse servers to work on diverse OSes.

Ethin wrote:A normal program wouldn't have access to the buffer, only libc.

That is not a boundary your OS can enforce. Both are userspace. What one can access, the other can access.

Ethin wrote:I don't think there really is a really good way of securing the buffer from tampering.

Yes there is: Make the buffer inaccessible after sending. The simplest idea would be to allocate a page for the data in the sending process, fill it with data, then transfer the page to the receiving process. Then the receiving process can do what it wants and send the page back to the sender with the result. No actual data has to be copied, it's just the page mapping is moved.
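Rust's move semantics give a loose analogy for this hand-off. In the sketch below, a heap-allocated `Page` is moved through channels between two threads, so the sender cannot touch it after sending. This is only an analogy: in a real kernel the hand-off would be a page-table change, and `Page` and `round_trip` are hypothetical names:

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for one 4 KiB page of request/response data.
pub struct Page(pub Box<[u8; 4096]>);

/// The client fills a page, gives it away, and later gets it back with the result.
pub fn round_trip(input: u8) -> u8 {
    let (to_server, server_rx) = mpsc::channel::<Page>();
    let (to_client, client_rx) = mpsc::channel::<Page>();

    let server = thread::spawn(move || {
        let mut page = server_rx.recv().unwrap();
        // The server is now the only owner, so checking the contents and
        // then using them cannot race with the client.
        let result = page.0[0].wrapping_add(1);
        page.0[1] = result;
        to_client.send(page).unwrap();
    });

    let mut page = Page(Box::new([0u8; 4096]));
    page.0[0] = input;
    to_server.send(page).unwrap();
    // `page` has been moved; using it here would be a compile error,
    // which mirrors the page being unmapped from the sender.
    let reply = client_rx.recv().unwrap();
    server.join().unwrap();
    reply.0[1]
}
```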

If you allow the sender to still tamper with the data after sending it, you will never be able to verify anything. Anything you check could be changed immediately after checking. The receiver would have to copy the request from SHM into its private memory to look at it, which defeats the purpose of SHM.
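To make the copy-before-check point concrete, here is a sketch in which the receiver snapshots the request into private memory and validates only the snapshot. The `SharedRequest` layout and the `snapshot` function are assumptions for illustration, and the shared words are modelled as atomics because the other side may write them at any time:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical shared region: a length word followed by a payload area,
/// both writable by an untrusted client at any moment.
pub struct SharedRequest {
    pub len: AtomicU32,
    pub payload: [AtomicU32; 8],
}

impl SharedRequest {
    pub fn new() -> Self {
        SharedRequest {
            len: AtomicU32::new(0),
            payload: Default::default(),
        }
    }
}

/// Copy the request into private memory, validating as we go.
/// Returns `None` if the claimed length does not fit the payload area.
pub fn snapshot(shared: &SharedRequest) -> Option<Vec<u32>> {
    // Read the length exactly once; reading it again later would reopen the race.
    let len = shared.len.load(Ordering::Acquire) as usize;
    if len > shared.payload.len() {
        return None;
    }
    Some(
        shared.payload[..len]
            .iter()
            .map(|word| word.load(Ordering::Relaxed))
            .collect(),
    )
}
```

Everything after `snapshot` works on the private `Vec`, so later writes by the client have no effect on what was checked. The cost is exactly the copy described above.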

Ethin wrote:In general the idea was to eliminate syscalls from the equation as much as possible. There's no point in using a syscall if you're just going to send something to another userspace process from the syscall.

Well, you won't be able to make do without one without incurring heavy resource costs as outlined above. So might as well bite the bullet.

You might build something like io_uring, but even that uses system calls to notify the kernel of changes and it also has to notify the receiver somehow.

Korona · Post by **Korona** » Sun Jun 20, 2021 2:26 am

It's not clear to me what the misunderstanding is that you mention in your last post, nullplan.

Re making the buffer read-only/inaccessible: that has high overhead due to TLB shootdown (and the need to send and wait for IPIs). You could use memory protection keys but they are only available on new Intel CPUs.

I concur with the recommendation to look at io_uring. For the data transport itself, I think a shared memory queue (with a sufficiently robust parser) and futexes are good primitives (but I don't think existing parsers protect against malicious concurrent modification of the input buffer, so you'll have to write that yourself).
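A minimal sketch of such a queue: a single-producer, single-consumer ring whose consumer refuses to trust the producer's index. The `Ring` layout and names are assumptions, notification (e.g. a futex on `tail`) is left out, and a real server would keep its authoritative `head` in private memory instead of reading it back from the shared region:

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Slot count; a power of two so that indices can wrap with a mask.
pub const SLOTS: u32 = 8;

/// Hypothetical layout of a queue living in shared memory.
pub struct Ring {
    pub head: AtomicU32, // advanced by the consumer (server)
    pub tail: AtomicU32, // advanced by the producer (client); untrusted
    pub slots: [AtomicU64; SLOTS as usize],
}

impl Ring {
    pub fn new() -> Self {
        Ring {
            head: AtomicU32::new(0),
            tail: AtomicU32::new(0),
            slots: Default::default(),
        }
    }

    /// Producer side: returns false when the queue is full.
    pub fn push(&self, value: u64) -> bool {
        let head = self.head.load(Ordering::Acquire);
        let tail = self.tail.load(Ordering::Relaxed);
        if tail.wrapping_sub(head) >= SLOTS {
            return false;
        }
        self.slots[(tail & (SLOTS - 1)) as usize].store(value, Ordering::Relaxed);
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        true
    }

    /// Consumer side: `Err` means the producer published an impossible tail.
    pub fn pop(&self) -> Result<Option<u64>, &'static str> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        let pending = tail.wrapping_sub(head);
        if pending > SLOTS {
            return Err("corrupt tail index");
        }
        if pending == 0 {
            return Ok(None);
        }
        // Masking keeps the slot index in bounds whatever the indices hold.
        let value = self.slots[(head & (SLOTS - 1)) as usize].load(Ordering::Relaxed);
        self.head.store(head.wrapping_add(1), Ordering::Release);
        Ok(Some(value))
    }
}
```

The slot contents are still untrusted data, so the value returned by `pop` needs the same validation as any other request.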

AndrewAPrice · Post by **AndrewAPrice** » Mon Jun 21, 2021 12:07 pm

For microkernels, the kernel tends to handle things like scheduling, memory management, and IPC (although some purists try to move these things to user space too).

I think the most useful thing is to develop an IDL/IPC/RPC framework (e.g. Gratch, gRPC) that gives you a consistent interface for defining services and their message formats.

I built an IDL called Permebuf. Here are some examples:

I have a code generator that turns Permebuf->C++, and each service has 2 C++ classes: one for calling the service, and another that you can inherit from if you want to implement a service. When you create an instance of a service, it registers with the kernel using a fully qualified name, e.g. perception.devices.GraphicsDriver, and there can be multiple implementations; anyone can iterate over them or call any of them (although you do have to worry about permissions). Shared memory and other services can be embedded as fields in messages since they're just IDs.

So, say for reading a file off a disk, you can implement a "Read" operation and pass in the shared memory buffer to write into, as well as the file handle, offset, and length. The VFS and FS driver might have to do some work to find the file and find all the file fragments, but they instruct the disk driver to write directly into the shared memory buffer.

nexos · Post by **nexos** » Mon Jun 21, 2021 12:14 pm

The best method would be to keep the kernel to just scheduling, IPC, timers, interrupts, and memory management. Whenever an app needs to read a file, it will send an IPC to the VFS server. I plan on having large requests done via shared memory. For example, if an app must read a large file, then libc turns the file buffer into shared memory, and the VFS just writes file data there. This would limit the amount of copying needed. As you can see, a microkernel needs many performance optimizations in order to work well.

rdos · Post by **rdos** » Tue Jun 22, 2021 1:50 pm

I don't think safe user-user process interaction can be implemented. I would not expose things like message queues to user space, and under normal circumstances you also need to wake up the server/client, and you cannot move the scheduler to user space.

I also think that different servers have different optimal communication mechanisms. If you want to implement readdir, it's probably fine to just let the server copy the entries to a shared buffer, but if you want to read/write files, this is clearly not optimal. Instead, file-IO is best done by memory mapping buffers in user space (where the IO takes place), and transfer physical page buffers between server and client. Of course, you cannot let user space build physical page lists, and so this must be done in kernel space, just as the memory mapping in user space.

The case of FUSE is interesting, but I still fail to see why buffers for read/write need to be memory mapped in the VFS server. The VFS should deliver an array of sectors, and then the physical addresses for those are pulled out in kernel space (from the disc cache) and sent back to the user process where they are memory mapped.

My "IPC" buffers for the VFS "microkernel" server are preallocated 4k sized. There is a header which contains opcodes and register state and then the rest can be used for request & reply data, including physical page buffers. The same buffer is used both for the request and the reply. This way, I don't need to dynamically allocate buffers for IPC, nor do I need to memory map them.
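A buffer of that shape might be declared roughly as follows. The header fields, the size of the register set, and all names are assumptions for illustration; only the fixed 4k total and the header-plus-data split come from the description above:

```rust
use std::mem::size_of;

pub const BUF_SIZE: usize = 4096;

/// Hypothetical header: an opcode, a status word, and a saved register set.
#[repr(C)]
pub struct IpcHeader {
    pub opcode: u32,
    pub status: u32,
    pub regs: [u64; 8],
}

/// One preallocated buffer, reused for both the request and its reply.
#[repr(C)]
pub struct IpcBuffer {
    pub header: IpcHeader,
    /// Whatever the header does not use is available for request and reply data.
    pub data: [u8; BUF_SIZE - size_of::<IpcHeader>()],
}

/// Number of data bytes available after the header.
pub fn data_capacity() -> usize {
    BUF_SIZE - size_of::<IpcHeader>()
}
```

Deriving the data length from the header size keeps the whole buffer at exactly one page even if the header grows.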

andrew_w · Post by **andrew_w** » Sat Jun 26, 2021 8:02 pm

UX/RT, the OS I'm writing, will do things a bit differently than most other microkernel OSes. Rather than implementing the filesystem API on top of a structured RPC transport, the lowest-level user-visible IPC API will implement Unix-like read/write functions itself directly on top of raw kernel IPC.

In addition to the traditional copying read() and write() functions there will also be corresponding functions that allow accessing the underlying kernel message registers (except for a few reserved by the IPC transport library for internal use) and shared message buffer; all of these functions will interoperate. This should eliminate the need for raw kernel IPC to be exposed to user processes directly.

Services that require structured RPC will use a library implemented on top of the read/write API; this library will use a "message special" file type that preserves message boundaries (like a SEQPACKET socket). The non-read/write/seek-type filesystem API functions will be implemented on top of this RPC library over a permanently open file descriptor present at process creation and connected to the VFS component of the process server (reads and writes for filesystems implemented outside the process server will bypass the VFS completely). This way, the overhead of structured RPC and intermediary servers will be eliminated for services like disk filesystems that deal in bulk opaque data.

OSDev.org
