network drivers - using exo- and micro-kernel-like concepts

oscoder · Post by **oscoder** » Sat Jan 16, 2021 8:37 pm

In designing an IPC system, I'm thinking through how various drivers would really work. Networking is an ideal case, and one that I really want to get right. Please share any thoughts, feedback, or flaming criticism - that would all be very helpful!

For background: I want to write a microkernel whose main IPC method is lockless queues over shared memory. Throughput of asynchronous IO is the main design goal - low latency is good, but secondary.
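To make the queue idea concrete, here is a rough single-producer/single-consumer ring over one flat buffer, written in Python so it can be prototyped in userspace. The layout, the class name `SpscRing` and the slot format are all my own assumptions, not a finished design. Python also gives no memory-ordering guarantees, so a real implementation would need atomic loads and stores with acquire/release ordering on the two indices.

```python
import struct

MASK = 0xFFFFFFFF  # indices are free-running 32-bit counters

class SpscRing:
    """Hypothetical layout: [head:u32][tail:u32][slot 0][slot 1]...

    Each slot is [length:u16][payload]. Only the producer writes head,
    only the consumer writes tail, so no lock is needed.
    """
    HEADER = struct.Struct("<II")
    LEN = struct.Struct("<H")

    def __init__(self, buf, slots, slot_size):
        if slots <= 0 or (slots & (slots - 1)) != 0:
            raise ValueError("slots must be a power of two")
        if len(buf) < self.HEADER.size + slots * slot_size:
            raise ValueError("buffer too small")
        self.buf, self.slots, self.slot_size = buf, slots, slot_size

    def _slot_offset(self, index):
        return self.HEADER.size + (index % self.slots) * self.slot_size

    def push(self, payload):
        """Producer side. Returns False if the ring is full or payload too big."""
        head, tail = self.HEADER.unpack_from(self.buf, 0)
        if ((head - tail) & MASK) == self.slots:
            return False
        if len(payload) > self.slot_size - self.LEN.size:
            return False
        off = self._slot_offset(head)
        self.LEN.pack_into(self.buf, off, len(payload))
        start = off + self.LEN.size
        self.buf[start:start + len(payload)] = payload
        # Publish last, so the consumer never sees a half-written slot.
        struct.pack_into("<I", self.buf, 0, (head + 1) & MASK)
        return True

    def pop(self):
        """Consumer side. Returns the next payload, or None if empty."""
        head, tail = self.HEADER.unpack_from(self.buf, 0)
        if head == tail:
            return None
        off = self._slot_offset(tail)
        (length,) = self.LEN.unpack_from(self.buf, off)
        start = off + self.LEN.size
        payload = bytes(self.buf[start:start + length])
        # Release the slot only after the payload has been copied out.
        struct.pack_into("<I", self.buf, 4, (tail + 1) & MASK)
        return payload
```

The buffer here can be a plain `bytearray` for testing; the same code should work over a shared-memory mapping between two processes, with the ordering caveat above.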

I really like the thinking behind exokernels, and I want to use elements of that in my driver APIs: stripping out as many abstractions as possible, and instead using userspace libraries to smooth over differences.

For networking, that means getting rid of all concept of sockets and ports, and any middleware: collapsing the OSI stack and passing full packets directly to the driver, which copies them to the network card, and passing full packets back to applications. How is it possible to do this securely and quickly? Berkeley Packet Filters. I'm sure I first saw this idea in an article on exokernel design, but I rediscovered it more recently when writing a network security tool using libnet. They are a really neat concept: the user program sends BPF bytecode to the kernel, the kernel checks and compiles the BPF, and then runs the filter on each new packet to determine whether to pass it to the program or not. The flexibility is great for writing asynchronous scanners (one process sends probes, a second process receives and parses the replies). The problem is that on Linux, capturing raw packets this way normally requires root (or the CAP_NET_RAW capability).
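As an illustration of the check-then-run idea, here is a tiny interpreter for a subset of classic BPF, in Python. The opcode values are the classic BPF encoding (the same numbers `tcpdump -dd` prints); the function names `check` and `run`, and the choice of which opcodes to support, are my own assumptions. It interprets rather than compiles, and only covers the handful of instructions the example filter needs.

```python
import struct

# Classic BPF opcodes (subset).
LD_H_ABS, LD_B_ABS = 0x28, 0x30       # A = half-word / byte at pkt[k]
LDX_B_MSH, LD_H_IND = 0xB1, 0x48      # X = 4*(pkt[k]&0xf) ; A = half-word at pkt[X+k]
JEQ_K, JSET_K, RET_K = 0x15, 0x45, 0x06

KNOWN = {LD_H_ABS, LD_B_ABS, LDX_B_MSH, LD_H_IND, JEQ_K, JSET_K, RET_K}

def check(prog):
    """Reject unknown opcodes and jumps that would leave the program.

    Jumps are forward-only and the last instruction must be a RET,
    so any program that passes is guaranteed to terminate.
    """
    for pc, (code, jt, jf, k) in enumerate(prog):
        if code not in KNOWN:
            return False
        if code in (JEQ_K, JSET_K):
            if min(jt, jf) < 0 or pc + 1 + max(jt, jf) >= len(prog):
                return False
    return bool(prog) and prog[-1][0] == RET_K

def run(prog, pkt):
    """Run a checked program; returns bytes to accept (0 means drop)."""
    a = x = pc = 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == RET_K:
            return k
        if code in (JEQ_K, JSET_K):
            taken = (a == k) if code == JEQ_K else bool(a & k)
            pc += jt if taken else jf
            continue
        # Loads: reading past the end of the packet drops it.
        off = k + x if code == LD_H_IND else k
        size = 2 if code in (LD_H_ABS, LD_H_IND) else 1
        if off + size > len(pkt):
            return 0
        if code == LDX_B_MSH:
            x = 4 * (pkt[off] & 0x0F)
        elif size == 2:
            (a,) = struct.unpack_from(">H", pkt, off)
        else:
            a = pkt[off]

# "IPv4, TCP, destination port 80" on an Ethernet frame.
TCP_DST_PORT_80 = [
    (LD_H_ABS, 0, 0, 12),      # A = EtherType
    (JEQ_K, 0, 8, 0x0800),     # not IPv4 -> drop
    (LD_B_ABS, 0, 0, 23),      # A = IP protocol
    (JEQ_K, 0, 6, 6),          # not TCP -> drop
    (LD_H_ABS, 0, 0, 20),      # A = flags + fragment offset
    (JSET_K, 4, 0, 0x1FFF),    # non-first fragment -> drop
    (LDX_B_MSH, 0, 0, 14),     # X = IP header length in bytes
    (LD_H_IND, 0, 0, 16),      # A = TCP destination port
    (JEQ_K, 0, 1, 80),         # port 80 -> accept, else drop
    (RET_K, 0, 0, 0x40000),    # accept
    (RET_K, 0, 0, 0),          # drop
]
```

A driver would call `check` once when a task registers the filter, and `run` per packet afterwards.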

Sending the packets to a driver would itself be done via shared memory. At first I'd thought to use a simple producer-consumer ringbuffer per task (they are easy to do lockless), but looking at the hardware it turns out some hardware no longer uses ringbuffers? It would be good to mirror that with something more flexible in the interface too. For example, the RTL8169 uses an interesting system, and I wonder if it would be possible to map areas of memory used for tx, to a user program directly? That would save on copying overhead. Anyway, either way, here's what I imagine will happen to the packets themselves. For transmission:

Copy the packet into network card memory, but do not mark it ready to send
[**] (This is done to prevent the sender from modifying the packet AFTER filters are applied, creating a race condition. It is copied directly, to avoid the cost of making an unnecessary intermediate copy)
Apply all filters to the packet. Unless it passes, reject it. It is anticipated there will be several layers of filters - for example per task, per user, per system (a firewall), etc
Instruct the hardware to send the packet
Pass an acknowledgement message to the sender
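The transmit steps above could be modelled roughly like this in Python. `TxSlot` is a hypothetical stand-in for one transmit descriptor and its buffer, and the filters are plain callables rather than real BPF; both are assumptions for the sake of a sketch. The point it tries to show is the ordering: copy first, filter the driver's own copy, and only then hand the slot to the hardware.

```python
from dataclasses import dataclass, field

@dataclass
class TxSlot:
    """Hypothetical model of one transmit descriptor plus its buffer."""
    data: bytearray = field(default_factory=lambda: bytearray(1514))
    length: int = 0
    ready: bool = False   # on real hardware: an "owned by NIC" descriptor bit

def transmit(shared_buf, length, slot, filter_layers):
    """Return True if the packet was queued for sending, False if rejected.

    shared_buf is memory the sending task can still write to.
    filter_layers is a list of lists of callables, run in the given order.
    """
    if slot.ready or length > len(slot.data) or length > len(shared_buf):
        return False
    # 1. Copy out of sender-writable memory, without marking the slot ready.
    slot.data[:length] = shared_buf[:length]
    slot.length = length
    view = memoryview(slot.data)[:length]
    # 2. Filters only see the driver's copy, so the sender cannot change
    #    the packet between the check and the send.
    for layer in filter_layers:
        if not all(accept(view) for accept in layer):
            slot.length = 0
            return False
    # 3. Hand the slot to the hardware.
    slot.ready = True
    # 4. The caller can now acknowledge to the sender.
    return True
```

A rejected packet costs one wasted copy, which seems an acceptable price for not needing an intermediate buffer on the common (accepted) path.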

For receiving:

Run through the receive filters for every task registered with the driver. Where a packet matches, copy it into their queue
Instruct hardware that the packet has been read

To speed up sorting packets, I think using several levels and groups would be helpful. There's no point testing a TCP SYN packet against every task and filter, only against those that accept new connections. 99% of tasks will want nothing to do with SCTP packets. Any suggestions for how to organise this well? For receive, perhaps a couple of general layers to rule out tasks quickly (ie if a packet matches tcp-syn or tcp-ack, filter out all tasks that didn't explicitly ask for it). For transmit, system-wide rules would be run first (eg firewall rules), before the per-task ones are reached. What I'm very unsure about is whether to make it possible to change and re-route packets entirely within the driver. This would save on time wasted by passing them back and forth between processes, but add a lot of complexity - the network drivers would all have to be aware of and communicate with each other.
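One possible shape for the "rule out tasks quickly" layer on receive, sketched in Python: index the registered filters by a coarse key (here just the IP protocol number) so that only the tasks in the matching group have their filters run. The class name `RxDispatcher`, the choice of key, and the use of a plain `deque` in place of each task's shared-memory queue are all assumptions for illustration.

```python
from collections import defaultdict, deque

ETHERTYPE_IPV4 = b"\x08\x00"

class RxDispatcher:
    """Two-level receive dispatch: coarse group lookup, then per-task filters."""

    def __init__(self):
        # IP protocol number -> list of (filter callable, task queue)
        self.by_proto = defaultdict(list)

    def register(self, ip_proto, accept, queue):
        self.by_proto[ip_proto].append((accept, queue))

    def deliver(self, frame):
        """Copy the frame to every matching task; return how many matched."""
        # Level 1: one cheap check picks the group, ruling out everyone else.
        if len(frame) < 34 or frame[12:14] != ETHERTYPE_IPV4:
            return 0
        group = self.by_proto.get(frame[23], ())
        # Level 2: run only the filters of tasks in that group.
        matched = 0
        for accept, queue in group:
            if accept(frame):
                queue.append(bytes(frame))
                matched += 1
        return matched
```

With this layout an SCTP packet costs one dictionary lookup when no task has registered for SCTP. A finer key (for example protocol plus TCP flags) would be needed to separate listeners from established connections.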

The interesting part is security. We can't let each network driver have its own unique permission system. But the design above is flexible, so there are different ways around this! An authorised "socket driver" could instruct the network drivers which BPF filters to set up, based on unix-style permissions and limits. At the other extreme, the network drivers could be set to accept new filters that carry a signature from a trusted key (which the driver verifies against the matching public key). This way, a program that wanted to receive connections on port 80 would ask a permissions server for a signed copy of the relevant BPF, which would be provided only so long as the program has the capabilities required. The program would then forward it to the network driver itself. Such a permissions server could even be run remotely, managing permissions across an entire site or cluster. The main difficulties with key-based security are revocation (maybe a complex system of timestamps and expiry?), and validating the signatures (good crypto libraries ported to a hobby OS, reviewed to make sure no new bugs are introduced, could be a lot of work).
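A rough sketch of the signed-filter idea with an expiry timestamp, in Python. Note the swap: this uses HMAC-SHA256 with a shared secret as a stand-in for a public-key signature, because the Python standard library has no asymmetric signing. With a shared secret the driver could forge filters itself, so a real design would want a proper signature scheme. The blob layout and the function names `sign_filter` and `verify_filter` are assumptions.

```python
import hashlib
import hmac
import struct
import time

HEADER = struct.Struct(">IQ")   # task id (u32), expiry as Unix time (u64)
TAG_LEN = 32                    # SHA-256 digest size

def sign_filter(key, task_id, filter_bytes, expires_at):
    """Permissions-server side: bind a filter to one task and an expiry time."""
    header = HEADER.pack(task_id, expires_at)
    tag = hmac.new(key, header + filter_bytes, hashlib.sha256).digest()
    return header + tag + filter_bytes

def verify_filter(key, task_id, blob, now=None):
    """Driver side: return the filter bytes if the blob is valid, else None."""
    if len(blob) < HEADER.size + TAG_LEN:
        return None
    header = blob[:HEADER.size]
    tag = blob[HEADER.size:HEADER.size + TAG_LEN]
    body = blob[HEADER.size + TAG_LEN:]
    expected = hmac.new(key, header + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None                      # tampered, or signed with another key
    signed_task, expires_at = HEADER.unpack(header)
    if signed_task != task_id:
        return None                      # filter was issued to a different task
    if (time.time() if now is None else now) >= expires_at:
        return None                      # expired: the task must ask again
    return body
```

Short expiry times give a crude form of revocation: the permissions server simply stops renewing. The cost is that the driver and the server need roughly synchronised clocks.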

With the system as a whole, the first major problem is that adding any complexity to a network driver means fewer people will want to write them, and there will be huge issues with buggy and outdated code. So there will need to be really nice libraries for user and driver sides, with a *stable* API too, since breaking changes would cause too much chaos. (How do you test modifications to the driver of some obscure and expensive network card, on a hobby budget?) I think good libraries are achievable though. Best of all, most parts can be tested without booting a new OS (use libnet to make a simulated network driver, Linux shared memory to test the IPC).

Secondly, stateful connection tracking (as seen in firewall rules) could be difficult. Is there any application protocol that would be difficult to use with BPF? Ie where "please pass me all traffic for port x" isn't enough?

Then of course there is the issue of speed and overhead. Modern kernels are already doing very similar processing (eg netfilter/iptables), on top of any overhead for tracking and managing sockets. Whether overhead is lower or higher will depend on the elegance of the implementation: having filters that fail early, ruling out groups of tasks in one go without running a set of filters for each one, and so on. It's hard to make it flexible and fast while also avoiding complications. Too complicated, and no one will want to use it. Overall though, pushing the job of parsing packets into userspace should actually speed things up, at least for specialised applications. Exokernel research confirms that it's possible. I'd like to see a comparison with modern Linux though, as I'm sure there have been improvements since the old research I was reading.

So what do you think - any glaring gaps or obvious things I've missed? All comments gratefully taken on board

OScoder

eekee · Post by **eekee** » Sun Jan 17, 2021 9:10 pm

This made me think.

I realised I know more about IP with its ports than anything beneath it. Ports are an extremely fast way for the kernel to decide which packet goes to which process, although I don't want to say they're the only way. I shot down a lot of my own ideas, actually, realising there are workarounds for almost every concern I could raise.

I do have one concern I can't quite get around: A powerful server may need to install a very large number of filters to work properly. How would all these filters be managed? Would a sysadmin need to load all of them into the authorised socket driver? Perhaps they could be loaded from a secure file installed with the server. This file needn't even be readable by the server itself. That sounds good until you think about changing the port(s) the server listens to. How would that work? Substitution codes in the file to be replaced by values from... what? It needs some thought.

OSDev.org
