network drivers - using exo- and micro-kernel-like concepts
Posted: Sat Jan 16, 2021 8:37 pm
In designing an IPC system, I'm thinking through how various drivers would really work. Networking is an ideal case, and one that I really want to get right. Please share any thoughts, feedback, or flaming criticism - that would all be very helpful!
The background here is that I want to write a microkernel, with lockless queues in shared memory as the main IPC mechanism. Throughput of asynchronous IO is the main design goal; low latency is good, but secondary.
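To make that concrete, here is roughly the kind of queue I have in mind - just a C11 sketch of the single-producer/single-consumer case, with the slot count and size picked arbitrarily:

    /* Sketch of a lockless single-producer/single-consumer queue living in shared
     * memory. Slot count and size are arbitrary; real packets would probably
     * carry a length header too. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    #define QUEUE_SLOTS 256   /* power of two, so we can mask instead of mod */
    #define SLOT_SIZE   2048  /* big enough for one ethernet frame           */

    struct spsc_queue {
        _Atomic size_t head;  /* only ever written by the consumer */
        _Atomic size_t tail;  /* only ever written by the producer */
        unsigned char  slots[QUEUE_SLOTS][SLOT_SIZE];
    };

    /* Producer side: returns false if the queue is full or the message too big. */
    static bool queue_push(struct spsc_queue *q, const void *msg, size_t len)
    {
        size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
        if (tail - head == QUEUE_SLOTS || len > SLOT_SIZE)
            return false;
        memcpy(q->slots[tail & (QUEUE_SLOTS - 1)], msg, len);
        atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
        return true;
    }

    /* Consumer side: returns false if the queue is empty. */
    static bool queue_pop(struct spsc_queue *q, void *msg, size_t len)
    {
        size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
        if (head == tail || len > SLOT_SIZE)
            return false;
        memcpy(msg, q->slots[head & (QUEUE_SLOTS - 1)], len);
        atomic_store_explicit(&q->head, head + 1, memory_order_release);
        return true;
    }

Each side only ever writes its own counter, which is what makes it safe without locks; the acquire/release pairs are what publish the copied data between the two processes.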
I really like the thinking behind exokernels, and I want to use elements of that in my driver APIs: stripping out as many abstractions as possible, and instead using userspace libraries to smooth over the differences.
For networking, that means getting rid of the whole concept of sockets and ports, and of any middleware: collapsing the OSI stack, passing full packets directly to the driver (which copies them to the network card), and passing full packets back up to applications. How is it possible to do this securely and quickly? Berkeley Packet Filters. I'm sure I first saw this idea in an article on exokernel design, but I rediscovered it more recently while writing a network security tool using libnet. They are a really neat concept: the user program sends BPF bytecode to the kernel, the kernel checks and compiles the BPF, and then runs the filter on each new packet to determine whether to pass it to the program or not. The flexibility is great for writing asynchronous scanners (one process sends probes, a second process receives and parses the replies). The problem is that on Linux they can only be used as root.
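For anyone who hasn't played with them, this is roughly what classic BPF looks like on Linux today. The bytecode is what `tcpdump -dd "ip and tcp dst port 80"` emits; opening the raw AF_PACKET socket is the step that needs root:

    /* Attach a classic BPF filter for "ip and tcp dst port 80" to a raw packet
     * socket on Linux. Requires root / CAP_NET_RAW. */
    #include <arpa/inet.h>
    #include <linux/filter.h>
    #include <linux/if_ether.h>
    #include <stdio.h>
    #include <sys/socket.h>

    int main(void)
    {
        struct sock_filter code[] = {
            { 0x28, 0, 0, 0x0000000c }, /* ldh [12]          ; EtherType        */
            { 0x15, 0, 8, 0x00000800 }, /* jeq #0x800        ; IPv4? else drop  */
            { 0x30, 0, 0, 0x00000017 }, /* ldb [23]          ; IP protocol      */
            { 0x15, 0, 6, 0x00000006 }, /* jeq #6            ; TCP? else drop   */
            { 0x28, 0, 0, 0x00000014 }, /* ldh [20]          ; fragment field   */
            { 0x45, 4, 0, 0x00001fff }, /* jset #0x1fff      ; drop fragments   */
            { 0xb1, 0, 0, 0x0000000e }, /* ldxb 4*([14]&0xf) ; X = IP hdr len   */
            { 0x48, 0, 0, 0x00000010 }, /* ldh [x + 16]      ; TCP dest port    */
            { 0x15, 0, 1, 0x00000050 }, /* jeq #80           ; port 80?         */
            { 0x06, 0, 0, 0x00040000 }, /* ret #262144       ; accept packet    */
            { 0x06, 0, 0, 0x00000000 }, /* ret #0            ; drop packet      */
        };
        struct sock_fprog prog = { .len = sizeof(code) / sizeof(code[0]), .filter = code };

        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); /* root only */
        if (fd < 0) { perror("socket"); return 1; }
        if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0) {
            perror("SO_ATTACH_FILTER"); return 1;
        }
        /* From here, recv() on fd only ever sees packets that match the filter. */
        return 0;
    }

In my design the same bytecode would be handed to the network driver instead of the kernel, and the accept/drop decision would choose which task's queue the packet gets copied into.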
Sending the packets to a driver would itself be done via shared memory. At first I thought I'd use a simple producer-consumer ringbuffer per task (they are easy to do locklessly), but looking at the hardware, it turns out some cards no longer use ringbuffers? It would be good to mirror that with something more flexible in the interface too. For example, the RTL8169 uses an interesting system, and I wonder if it would be possible to map the areas of memory used for tx directly into a user program? That would save on copying overhead. Either way, here's what I imagine will happen to the packets themselves (with a rough code sketch after the list). For transmission:
- Copy the packet into network card memory, but do not mark it ready to send [**]
- Apply all filters to the packet; unless it passes them all, reject it. It is anticipated there will be several layers of filters - for example per task, per user, per system (a firewall), etc.
- Instruct the hardware to send the packet
- Pass an acknowledgement message to the sender
And for receiving:
- Run through the receive filters for every task registered with the driver. Where a packet matches, copy it into that task's queue
- Instruct the hardware that the packet has been read
[**] The early copy is done to prevent the sender from modifying the packet AFTER the filters are applied, creating a race condition. It is copied directly into card memory to avoid the cost of an unnecessary intermediate copy.
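In code, the transmit path might look something like this - all of the helper names are purely hypothetical, it's just to pin down the ordering of the steps:

    /* Hypothetical driver-side sketch of the transmit path above. None of this is
     * an existing API. */
    #include <errno.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct nic;   /* per-card driver state                 */
    struct task;  /* one client registered with the driver */

    int  tx_slot_reserve(struct nic *nic);                                     /* pick a free descriptor  */
    void tx_slot_copy(struct nic *nic, int slot, const void *pkt, size_t len); /* copy into card memory   */
    void tx_slot_release(struct nic *nic, int slot);                           /* give the slot back      */
    void tx_slot_submit(struct nic *nic, int slot);                            /* mark ready / kick HW    */
    bool filters_pass(struct task *sender, struct nic *nic, int slot, size_t len);
    void ack_sender(struct task *sender, int slot);                            /* message over its queue  */

    int driver_transmit(struct nic *nic, struct task *sender, const void *pkt, size_t len)
    {
        /* 1. Copy into card memory but do NOT mark it ready, so the sender cannot
         *    change the bytes after the filters have seen them. */
        int slot = tx_slot_reserve(nic);
        if (slot < 0)
            return -EAGAIN;
        tx_slot_copy(nic, slot, pkt, len);

        /* 2. Apply every layer of filters (per task, per user, system firewall). */
        if (!filters_pass(sender, nic, slot, len)) {
            tx_slot_release(nic, slot);
            return -EPERM;
        }

        /* 3. Instruct the hardware to send the packet. */
        tx_slot_submit(nic, slot);

        /* 4. Acknowledge the sender over its shared-memory queue. */
        ack_sender(sender, slot);
        return 0;
    }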
The interesting part is security. We can't let each network driver have its own unique permission system, but the design above is flexible, so there are different ways around this! An authorised "socket driver" could instruct the network drivers which BPF filters to set up, based on unix-style permissions and limits. At the other extreme, the network drivers could be set to accept new filters that are signed with a public key. This way, a program that wanted to receive connections on port 80 would ask a permissions server for a signed copy of the relevant BPF, which would be provided only so long as the program has the required capabilities. The program would then forward it to the network driver itself. Such a permissions server could even be run remotely, managing permissions across an entire site or cluster. The main difficulties with key-based security are revocation (maybe a complex system of timestamps and expiry?) and validating the signatures (good crypto libraries ported to a hobby OS, and reviewed to make sure no new bugs are introduced, could be a lot of work).
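A signed filter message might look something like this - the layout, the expiry scheme and the signature_verify() call are all assumptions on my part, not any existing format or library:

    /* What a signed filter might look like on the wire (hypothetical). */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct signed_filter {
        uint8_t  signature[64];  /* e.g. Ed25519, over everything after this field  */
        uint64_t expires_at;     /* expiry timestamp, to make revocation tractable  */
        uint32_t filter_len;     /* number of BPF instructions that follow          */
        /* filter_len BPF instructions follow here */
    };

    /* Hypothetical: verify sig over msg with the permission server's public key. */
    bool signature_verify(const uint8_t sig[64], const void *msg, size_t len,
                          const uint8_t pubkey[32]);

    bool driver_install_filter(const struct signed_filter *f, size_t total_len,
                               const uint8_t trusted_key[32], uint64_t now)
    {
        const uint8_t *signed_part = (const uint8_t *)f + sizeof(f->signature);

        if (total_len < sizeof(*f))
            return false;
        if (!signature_verify(f->signature, signed_part,
                              total_len - sizeof(f->signature), trusted_key))
            return false;
        if (now >= f->expires_at)  /* expired grants are simply refused */
            return false;
        /* ...then validate the BPF program itself and install it... */
        return true;
    }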
With the system as a whole, the first major problem is that adding any complexity to a network driver means fewer people will want to write them, and there will be huge issues with buggy and outdated code. So there will need to be really nice libraries for both the user and driver sides, with a *stable* API too, since breaking changes would cause too much chaos (how do you test modifications to the driver of some obscure and expensive network card on a hobby budget?). I think good libraries are achievable though. Best of all, most parts can be tested without booting a new OS (use libnet to make a simulated network driver, and Linux shared memory to test the IPC).
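For example, the IPC layer could be exercised on Linux along these lines - assuming the queue sketch from earlier lives in a shared header (hypothetical name), and with the shared-memory object name made up:

    /* Two Linux processes map the same POSIX shared-memory object and run the
     * queue over it, one pushing fake "packets", the other popping and checking
     * them. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include "spsc_queue.h"  /* the spsc_queue sketch from earlier (hypothetical header) */

    int main(void)
    {
        int fd = shm_open("/netdrv_test_queue", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        if (ftruncate(fd, sizeof(struct spsc_queue)) < 0) { perror("ftruncate"); return 1; }

        struct spsc_queue *q = mmap(NULL, sizeof(*q), PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, 0);
        if (q == MAP_FAILED) { perror("mmap"); return 1; }

        /* A second process maps the same object and plays the other side. */
        unsigned char frame[64] = { 0 };
        queue_push(q, frame, sizeof(frame));
        return 0;
    }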
Secondly, stateful connection tracking (as seen in firewall rules) could be difficult. Is there any application protocol that would be hard to use with BPF, i.e. where "please pass me all traffic for port x" isn't enough?
Then of course there is the issue of speed and overhead. Modern kernels are already doing very similar processing (e.g. ipchains), on top of any overhead for tracking and managing sockets, so whether overhead ends up lower or higher will depend on the elegance of the implementation: having filters that fail early, ruling out groups of tasks in one go without running a separate set of filters for each one, and so on. It's hard to make it flexible and fast while also avoiding complications - too complicated, and no one will want to use it. Overall though, pushing the job of parsing packets into userspace should actually speed things up, at least for specialised applications. The exokernel research suggests it's possible, but I'd like to see a comparison with modern Linux, as I'm sure there have been improvements since the old research I was reading.
So what do you think - any glaring gaps or obvious things I've missed? All comments gratefully taken on board
OScoder