Design of hybrid kernel -- Tellur -- RFC

tellur · Post by **tellur** » Sun Dec 30, 2018 12:06 pm

Hi,

For a few years I have been developing my OS (codename: 'tellur') on and off in my free time. I even got to writing my own (working) FAT16 x86_64 bootloader and a kernel capable of booting up all SMP processors/cores and executing scheduled kernel code (also very simple ELF loading of userspace code, but no syscalls yet). That said, I have been considering scrapping some of this code again, but more on that later.

I would like to continue developing the kernel, but I think this time I would like to ask for feedback on the architecture first, especially since a rewrite is under consideration. I already owe a lot to this forum in terms of information, so maybe I can contribute something back in the future. So let's start with the design ...

The design of my kernel has changed somewhat over the years, but the core principles and goals have stayed the same. Those are (in no particular order):

- It should be somewhat of a "spiritual successor" to Unix, especially in that the kernel API is very simple and that everything is a file. It should not emulate Unix though (that's what userspace servers are for)
- Not only is everything a file, everything is also a capability and a stream
- It should provide a cryptographically secure capability system, which makes it possible to contain the effects of both malicious code and userspace software vulnerabilities (again, more on that later)
- It should not prevent the user from using closed-source drivers, but it should offer functionality to contain them
- It should be hot-swappable
- It should be fast and have the potential for production use, especially for developers
- Reliance on libraries should be reduced in favor of a client-server architecture (correcting the IMO most crucial mistake in the path Unix took in the late 90s)
- All APIs should be asynchronous
- The kernel should neither act as a mere hardware multiplexer/scheduler, nor as a Linux-like monolith with filesystems and a firewall built-in. It should act as the hardware abstraction layer, and in conjunction with the other requirements, this requires memory management, a scheduler, *a VFS* (with hooks for userspace filesystems and protocol handlers, see below), and capability management. Probably other things as well that I forgot.
- Compatibility is more important than beauty (or integrated-ness into the ecosystem). This is one of the major reasons Microsoft operating systems have been so successful. Architecturally, this will be achieved by adding a variety of layers (POSIX, Linux, Windows, etc.) as userspace servers, about which the kernel has no knowledge.
- The system should preserve its state across reboots, just like a suspend to disk. This should even work when power is interrupted (although recent changes that haven't been written to disk might get lost).
There's no rush to implement all this, but I really would like to try!

Capabilities

With those core principles out of the way, let's very briefly discuss capability security. There's a lot of material on the web, for instance on http://cap-lore.com/, so I won't reiterate everything, but suffice it to say that:

... capability security is an /alternative/ to ACLs or permission-based systems like Unix or Windows, where user permissions for files and devices are set, and the latter can be opened without the user's consent simply by virtue of the correct permissions being set.

Now, a capability represents authority (a 'claim') to a certain resource (a file, an IO port, etc. etc.). It can be thought of as an object (as in OOP), or as a more general form of a Unix file descriptor. In tellur, capabilities are represented in userspace as a cryptographic public key unique to that capability (implementation details pending). The kernel issues the keys and maintains a map between private keys and capabilities, allowing it to unilaterally revoke any authority if that is desired.

An example: let's say your browser has been given access to the audio device and you open up a website playing annoying music, you can just cut off access to the audio device. Or you have a rogue app filling up your home directory with garbage files. Just cut the rogue app off from the filesystem by revoking all of its capabilities to files.
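A minimal sketch of the kernel-side bookkeeping this implies, assuming a small fixed table and plain integer tokens in place of real cryptographic keys. All names here (`cap_issue`, `cap_check`, `cap_revoke`) are made up for illustration and are not part of tellur's API:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; tokens are simple counters here,
   whereas a real kernel would hand out unguessable keys. */
#define CAP_TABLE_SIZE 16

typedef uint64_t cap_token;

struct cap_entry {
	cap_token token;   /* 0 means the slot is free */
	int resource_id;   /* e.g. an index meaning "audio device" */
	bool revoked;
};

static struct cap_entry cap_table[CAP_TABLE_SIZE];
static cap_token next_token = 1;

/* Issue a new capability for a resource; returns 0 if the table is full. */
cap_token cap_issue(int resource_id) {
	for (size_t i = 0; i < CAP_TABLE_SIZE; i++) {
		if (cap_table[i].token == 0) {
			cap_table[i].token = next_token++;
			cap_table[i].resource_id = resource_id;
			cap_table[i].revoked = false;
			return cap_table[i].token;
		}
	}
	return 0;
}

/* Returns the resource id, or -1 if the token is unknown or revoked. */
int cap_check(cap_token token) {
	if (token == 0)
		return -1;
	for (size_t i = 0; i < CAP_TABLE_SIZE; i++) {
		if (cap_table[i].token == token)
			return cap_table[i].revoked ? -1 : cap_table[i].resource_id;
	}
	return -1;
}

/* Revocation happens on the kernel side only: the holder keeps its
   token, but every later check fails. */
bool cap_revoke(cap_token token) {
	if (token == 0)
		return false;
	for (size_t i = 0; i < CAP_TABLE_SIZE; i++) {
		if (cap_table[i].token == token) {
			cap_table[i].revoked = true;
			return true;
		}
	}
	return false;
}
```

In the browser example, the audio server would call something like `cap_check()` on every request, so cutting off the music is a single `cap_revoke()` with no cooperation needed from the browser.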

It should be obvious how useful such a system is and how permissions can't deliver on that. But the question is how userspace programs would attain such capabilities. After all, they can't just allow the kernel to give out a capability to a certain file (let's say /etc/shadow) without the user consenting. Otherwise, we'd be back to a system without any kind of security (like DOS), or we'd have to add permissions back in.

So we must have the user's consent in order to open files and access devices. With almost everyone now on Android 6+, there's a handy example available on how that would work: When you first open an app on newer Android devices, you get asked about whether you want to allow camera access, filesystem access, calendar access, etc.

Capability security UI works in a similar way, just more granularly. You don't simply allow apps access to your entire filesystem, you allow access to single files or directory trees. This can work through file selection dialogs (which we need anyway). For devices, an approach similar to Android is suitable however. All those dialogs and confirmation messages require a special GUI process handling them (kind of like 'system mode' dialogs in Windows work).

On console it's even easier. Consider:

Code:

[test@machine]: cat file1 file2
error: 'file1' not a capability!
[test@machine]: cat [open file1] [open file2]
...
[test@machine]: # or with syntactic sugar:
[test@machine]: cat 'file1' 'file2'
...

(if you hadn't noticed, the syntax for the shell is inspired by Tcl, but it should be clear: brackets [] are function calls)

In the shell case, the command "open", also represented by single quotes, is a special privileged function that can ask the kernel directly for capabilities, but only if the shell is interactive. Otherwise we would run into the consent problem after all. This is achieved by the kernel supplying each login shell with a capability to the whole filesystem, which allows "opening" (obtaining the capability to) arbitrary files and directories. The login shell never passes this capability on to arbitrary processes, only to its trusted internal functions, and to other interactive shell-like programs, such as the desktop environment.

Of course, you could also start the desktop environment with a restricted version of the "login shell capability" that denies access to system files. And BAM, you have a user system that is already more powerful than Unix's. You can get a restricted version of every capability (for example create a read-only capability from a read-write one, or exclude subdirectories from a directory capability)
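Restriction could be sketched as follows, assuming rights are a simple bitmask. The `struct capab` layout and the `CAP_*` flags are invented for this example, and excluding subdirectories would need more than a bitmask, so it is not covered:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical rights flags, not tellur's real ones. */
enum {
	CAP_READ  = 1 << 0,
	CAP_WRITE = 1 << 1,
	CAP_LIST  = 1 << 2
};

struct capab {
	int resource_id;
	uint32_t rights;
};

/* Derive a weaker capability from a parent. The request fails if it
   asks for any right the parent does not hold, so authority can only
   shrink along a chain of derivations. */
bool capab_restrict(const struct capab *parent, uint32_t wanted,
                    struct capab *out) {
	if ((wanted & ~parent->rights) != 0)
		return false;
	out->resource_id = parent->resource_id;
	out->rights = wanted;
	return true;
}

/* True if the capability holds every right in `needed`. */
bool capab_allows(const struct capab *c, uint32_t needed) {
	return (c->rights & needed) == needed;
}
```

A read-only capability derived this way cannot be turned back into a read-write one, which is the property the restricted desktop environment relies on.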

I would love to hear what you think about the capability system so far. Moving on...

Everything is a file

This is pretty straightforward. Capability systems have traditionally eschewed filesystems, but with tellur, I have decided to go into the opposite direction, since it's obvious tree-like filesystems are not going anywhere, and URLs have even expanded on that scheme. So there will be a special /url/ directory that is multiplexed by the kernel. You could do something like this:

Code:

[test@machine]: cat [open /url/https://google.com/]

... and get the source code of Google's home page back, provided that a handler for https is registered and that the current shell has the proper capabilities.

Needless to say, device files and something like the proc filesystem will also exist.

Hot Swapping

In my opinion, it should be a basic requirement for a modern kernel (as opposed to dated NT and Linux) to be able to update itself without rebooting. The current environment and contents of RAM should also be kept (just like suspend to disk). By considering hot swapping early in development, we lay the foundation for robust patching technology for kernel data structures. This finally allows always-on servers without the security risk of running REALLY old kernels with only occasional backported patches (like Debian Stable servers do in a lot of cases).

Thinking about hot-swapping brought up the idea that perhaps it would be better to actually implement the kernel in architecture-specific assembly, and only the userspace programs in C or C++. This would make the locations of all kernel code and data deterministic, at the cost of slowing down development, breaking portability and having to start over (though the last point is not really a big deal). I'm not sure, but maybe the same could be achieved using linker scripts? Perhaps someone knows more about this than me? Another possibility is to use some kind of "deterministic heap", and store all information likely to change between kernel versions there. Does anyone have experience with that sort of solution? The current kernel is largely written in C++ by the way.
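One possible reading of the "deterministic heap" idea, sketched under heavy assumptions: every state block that survives a swap starts with a magic number and a layout version, and the incoming kernel upgrades old layouts instead of relying on fixed code and data addresses. The structs and the `sched_state_adopt` function are invented for illustration:

```c
#include <stdint.h>
#include <string.h>

#define STATE_MAGIC 0x54454C4Cu /* arbitrary marker, "TELL" in ASCII */

struct state_header {
	uint32_t magic;
	uint32_t version;
};

/* Layout written by a hypothetical old kernel. */
struct sched_state_v1 {
	struct state_header hdr;
	uint32_t nr_tasks;
};

/* Layout expected by the hypothetical new kernel. */
struct sched_state_v2 {
	struct state_header hdr;
	uint32_t nr_tasks;
	uint32_t nr_cpus; /* field added in version 2 */
};

/* Called by the incoming kernel on a block left by the outgoing one.
   Returns 0 on success, -1 if the block is not recognized. */
int sched_state_adopt(const void *old, struct sched_state_v2 *out,
                      uint32_t default_cpus) {
	struct state_header hdr;

	memcpy(&hdr, old, sizeof hdr);
	if (hdr.magic != STATE_MAGIC)
		return -1;
	if (hdr.version == 2) {
		memcpy(out, old, sizeof *out);
		return 0;
	}
	if (hdr.version == 1) {
		struct sched_state_v1 v1;

		memcpy(&v1, old, sizeof v1);
		out->hdr.magic = STATE_MAGIC;
		out->hdr.version = 2;
		out->nr_tasks = v1.nr_tasks;
		out->nr_cpus = default_cpus; /* old kernel did not track this */
		return 0;
	}
	return -1;
}
```

With this approach only the location of the state blocks has to be agreed on between kernel versions, which a linker script section or a fixed virtual address could provide.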

Client-Server Architecture and Kernel Responsibilities

In tellur, the role of libraries should be massively reduced. Libraries are neither network-transparent nor easy to get right in terms of ABI/API compatibility. If dynamically linked, ABI changes break your code. If statically linked, code size blows up and security bugs in a library can only be fixed through recompiling all programs using this library.

Servers that run as extra processes and communicate with clients through network sockets avoid all of these problems. The GUI would be provided by such a server (a bit like X11, just modernized by a lot).

Servers can also be registered with the VFS. In this case, opening a certain file path redirects the stream to a server socket. For instance, opening /url/https://google.com/ opens a new stream with the local HTTP streaming service, which connects to the web on the user's behalf. Opening /mnt/fat32_disk/ opens a new stream with the FAT32 filesystem service etc. etc.
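The redirection could be sketched as a longest-prefix lookup over registered path prefixes. The table contents and service names below are placeholders, and a real VFS would hand back a stream or capability instead of a string:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical registration table: path prefix -> service name. */
struct vfs_mount {
	const char *prefix;
	const char *service;
};

static const struct vfs_mount mounts[] = {
	{ "/url/https://",    "https-service" },
	{ "/url/",            "url-multiplexer" },
	{ "/mnt/fat32_disk/", "fat32-service" },
	{ "/",                "root-fs" },
};

/* Find the service with the longest prefix matching `path`.
   On success, *rest (if non-NULL) points at the part of the path
   that the service itself has to interpret. Returns NULL if no
   prefix matches. */
const char *vfs_route(const char *path, const char **rest) {
	const struct vfs_mount *best = NULL;
	size_t best_len = 0;

	for (size_t i = 0; i < sizeof mounts / sizeof mounts[0]; i++) {
		size_t len = strlen(mounts[i].prefix);

		if (strncmp(path, mounts[i].prefix, len) != 0)
			continue;
		if (best == NULL || len > best_len) {
			best = &mounts[i];
			best_len = len;
		}
	}
	if (best == NULL)
		return NULL;
	if (rest != NULL)
		*rest = path + best_len;
	return best->service;
}
```

For `/url/https://google.com/` this picks the https entry and passes `google.com/` on to it, while an unregistered scheme under `/url/` would fall through to the multiplexer entry.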

Services (and processes in general) can be dynamically started and suspended, similar to what Android does with apps. There is no swap partition/file, there are simply image files on disk that mirror ram contents specific to each process. RAM and disk are synchronized when the disk is not being accessed. When such a sync happens, RAM is pruned of all processes that have not been used recently.

The kernel itself should not use modules at all. If any sort of optional code is included, it must be recompiled and then hot-swapped, which I'm sure is an acceptable substitute for module loading.

System Calls

All system calls should be asynchronous. This will eliminate almost all performance problems traditionally associated with keeping stuff like filesystems outside the kernel. All file operations should be done on streams. Those work without copying the data, instead leveraging the huge 64-bit virtual address spaces we have nowadays. This means that tellur will NOT support 32-bit systems, sorry. It's probably best to show you the API in the form of a coding sample (API still pending, of course):

Code:

/*
  The dir_capab comes from a call to [open] in the shell or from a file chooser dialog.
  This is a bit of an artificial example, because most programs would get the opened file
  passed directly to them. But consider cases such as the emulated POSIX environment,
  which will simply be passed a directory capability that it can go crazy with.
*/
int read_file_synchronous(tellur_capab dir_capab, char **dest) {
	tellur_capab file;
	tellur_wait_token token;
	int ret;

	/* asynchronous call, this is simply put on a queue */
	token = tellur_open(dir_capab, &file, TELLUR_RDONLY);
	/*
	  Block until file open operation finished (this switches context).
	  If you wanted, you could replace this with a loop that calls other
	  tasks in the meantime (cooperative multitasking).
	  If the token was invalid to begin with (a syntax error on calling
	  tellur_open for example), the error is just returned back immediately.
	*/
	ret = tellur_wait(&token); 
	if (ret)
		goto error;
	/* This reads in an entire file and places the string pointer to the contents in *dest.
	   The *dest pointer should be freed with tellur_stream_free() */
	token = tellur_read(dest, file, -1, TELLUR_WHOLEPAGES | TELLUR_STREAM);
	/* Again, in this simple example, we just wait until everything is read, but we could
	   simply call tellur_stream_next() (see next example) and read the stream piecemeal */
	ret = tellur_wait(&token); 
	if (ret)
		goto close_file;

	return 0;

	close_file:
		/* Asynchronous, but guaranteed to succeed, so no need to wait */
		tellur_close(file);
	error:
		tellur_perror(ret);
		return ret;
}

/* This is the asynchronous version with a callback */
int read_file_asynchronous(tellur_capab dir_capab, int (callback)(char *, size_t)) {
	tellur_capab file;
	tellur_wait_token token;
	int ret;
	char *dest;
	size_t len;

	/* asynchronous call, this is simply put on a queue */
	token = tellur_open(dir_capab, &file, TELLUR_RDONLY);
	/* Now actually do the cooperative loop thing */
	do {
		ret = tellur_error(&token); 
		if (ret)
			goto error;
		/* This function should call tellur_sleep(-1)
		   (i.e. put the current process to sleep until something changes),
		   if there are no elements on the queue left. */
		coop_queue_next();
	} while (!tellur_ready(&token));

	/* No string pointer initially given, TELLUR_CONSECUTIVE guarantees no memory is overwritten,
	   so the whole file will be in memory consecutively after the stream is read.
	   Note that in most cases, this is not needed, because stream data does not have to be kept around. */
	token = tellur_read(NULL, file, -1, TELLUR_WHOLEPAGES | TELLUR_CONSECUTIVE | TELLUR_STREAM);
	do {
		ret = tellur_error(&token); 
		if (ret)
			goto close_file;

		/* Read next piece of stream (typically a memory page if TELLUR_WHOLEPAGES is set).
		   The dest pointer should be freed with tellur_stream_free() */
		token = tellur_stream_next(&dest, &len);
		/* Call the callback as soon as the next piece is there
		   The callback could be part of a parser engine that advances the state of
		   an internal state machine for example */
		ret = tellur_stream_wait(&token, callback);
		if (ret)
			goto close_file;
	} while (!tellur_stream_eof(token));

	return 0;

	close_file:
		/* Asynchronous, but guaranteed to succeed, so no need to wait */
		tellur_close(file);
	error:
		tellur_perror(ret);
		return ret;
}

This no-copy, completely asynchronous architecture allows for efficient data transport between services. An app that opens a file connects to the filesystem server (the socket is redirected there by the kernel) and streams its stuff from there. The filesystem server again gets its data streamed from a device. If everything runs locally and filesystem clusters/blocks are 4k or larger, the kernel can simply juggle pointers and remap physical memory. The data has to be read only once (by the filesystem server). Because everything can be made to work asynchronously (with two userspace queues, one inbox and one outbox that cause no context switches except if they fill up), the number of context switches is reduced to a very low level (only once per timeslice if stream throughput is optimized).
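One of the two queues could be sketched as a fixed-size ring buffer like this. The `struct request` fields and the slot count are assumptions, and the sketch is single-threaded: a queue actually shared between userspace and the kernel would additionally need atomic index updates and memory barriers.

```c
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SLOTS 8 /* must be a power of two */

/* Hypothetical request record; the real format is not specified. */
struct request {
	uint32_t opcode;
	uint64_t arg;
};

struct ring {
	struct request slots[QUEUE_SLOTS];
	uint32_t head; /* next slot the consumer reads */
	uint32_t tail; /* next slot the producer writes */
};

/* Producer side. Returns false when the ring is full, which is the
   one case where the producer would have to block or switch context. */
bool ring_push(struct ring *r, struct request req) {
	if (r->tail - r->head == QUEUE_SLOTS)
		return false;
	r->slots[r->tail % QUEUE_SLOTS] = req;
	r->tail++;
	return true;
}

/* Consumer side. Returns false when there is nothing to process. */
bool ring_pop(struct ring *r, struct request *out) {
	if (r->head == r->tail)
		return false;
	*out = r->slots[r->head % QUEUE_SLOTS];
	r->head++;
	return true;
}
```

As long as neither side sees a full or empty ring, requests and completions move without either party entering the kernel, which is where the reduction in context switches would come from.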

This is the architecture in a nutshell. I hope you like it and I also hope you find something to critique and can suggest improvements. All questions I have for you are in bold.

Additional questions:

- How would you go about implementing graphics drivers? I'm not sure which portions exactly to put into the kernel and which not to. Also, which driver (apart from the trivial Bochs graphics interface, the only one I have so far) should I be working on first for x86_64. VESA requires real-mode code (either native or emulated) AFAIK, which would mean either switching down to real mode when switching graphics mode (a huge pain, especially with SMP), or emulating everything (again a huge pain). Or is there some other alternative?
- Are there any performance or portability issues that could possibly arise from my approach? My initial goal is to have x86_64 and later armv8 support. No 32-bit support is planned as that would clash with the memory model. (Or maybe not?)

Thanks in advance!

max · Post by **max** » Mon Dec 31, 2018 3:49 pm

Hi Tellur!

I didn't have much time to read all of your post but I can say something to...

- How would you go about implementing graphics drivers? I'm not sure which portions exactly to put into the kernel and which not to. Also, which driver (apart from the trivial Bochs graphics interface, the only one I have so far) should I be working on first for x86_64. VESA requires real-mode code (either native or emulated) AFAIK, which would mean either switching down to real mode when switching graphics mode (a huge pain, especially with SMP), or emulating everything (again a huge pain). Or is there some other alternative?

Unfortunately you can't use the VM86 mode on x86_64, so you're forced to either write drivers for some specific video cards, or you use VESA to set up the graphics mode in your kernel loader already and provide the framebuffer to your kernel (I assume you're booting your OS through BIOS). Writing real drivers might be the "proper" way but I'm not sure how feasible it is. Setting up a specific graphics mode in VBE is not a big deal so it won't bloat your kernel loader much. Also, if you decide to switch to UEFI one day it would work the same way, providing a framebuffer to your kernel.

Greets & happy new year!

tellur · Post by **tellur** » Tue Jan 01, 2019 4:17 am

Happy new year to all!

Hi Max,

Thanks for the suggestion! I think when I last tried to approach the graphics problem, I also considered putting the logic into the loader. The only downside is that it is not possible to switch resolutions after boot, but that's good enough for the medium term or as a fallback driver. It's a bit like the old-school Linux vga= command line options.

I also only support BIOS at the moment (quite correct), because it's easier and because all my personal machines are still BIOS-based and I definitely want to support those. But EFI should also be kept in mind. My loader (written in pure assembly) is structured like this:

MBR code (called tload16) (loads tload64 [edit: and the config file] from known disk address)
|
v
tload64:
  1. switches directly to long mode from real mode, reads its own config file (using ATA PIO, only FAT16/32 ATM), displays text-mode menu
  2. sets up higher-half mapping (and a small identity mapping in low memory regions for kernel loading), loads kernel from disk (same current restrictions as config file), parses ELF header for entry point
  3. jumps into kernel
|
v
Kernel

So the main question is, where would I put VESA setup and VBE mode switching in this diagram? Either I'd have to put it at 1, i.e. do it before switching to long mode, but that would mean the loader would have to deal with all sorts of graphics modes, meaning I'd have to write font rendering and other graphics code for the loader (which I should probably code in such a way as to reuse it in the userspace console server, which I haven't even designed yet ...). Another downside is that the graphics mode would either be set in the config file (edit: which has been loaded by tload16), or I'd have to support changing the graphics mode for the loader itself, not just for the kernel to be loaded (which I'd rather do without). I could also put it at 2, meaning I'd switch down to real mode from long mode, with pure identity mapping to start with, do my VBE stuff, then return to long mode and only then set up higher-half and load the kernel. But if something goes wrong during VBE setup, I couldn't display any error message to the user afterwards. Option 3 is perhaps the best option, because it allows error messages right through the final step. I could also delay the higher-half setup until VBE has been initialized.

Option 3 could also be used for a multiboot stub (switch from protected mode to real mode, do 3, switch to long mode, set up identity mapping, copy kernel to the right place, set up higher half, jump to kernel), which would be nice to have as an alternative method perhaps.

I think if I decide to do it in the loader, I'd go with option 3, but I'd like to know what you think.

I don't know much about UEFI as of yet, but does it also provide a simple framebuffer (It sounds like it does by your description)? If so, that would simplify the kernel graphics 'driver' by a lot, because I could use the exact same code for both VESA and UEFI graphics. Could I also just link my kernel to a PE executable for UEFI and then boot it directly in long mode? Or is it still necessary to have some kind of loader? Having an extra FAT partition for the kernel/loader doesn't bother me; everything but the kernel is eventually supposed to be on an ext2/ext3/ext4 partition.

BTW I will put the current code on github shortly, I just have to make sure it compiles and runs properly on my new distro install. I haven't touched it in almost two years.

Schol-R-LEA · Post by **Schol-R-LEA** » Tue Jan 01, 2019 1:18 pm

I noted one minor thing, which I am bringing up mainly to forestall others from criticizing it: you seem to be using goto a lot, primarily to jump out to exit conditions. It looks as if the intent was either to enforce RAII, or perhaps to avoid multiple exit points, I'm not certain.

While I don't personally see the latter as necessary (I find using unstructured jumps for this to be worse than multiple exit points, but the matter is arguable), the former is (or at least can be) a legitimate use... for deeply nested loops. In this case, though, I would think that a more structured escape mechanism such as break; should be sufficient.

OTOH, if the goal is RAII, I can see that you would want to be consistent in how you apply it, and in C at least, there is no easy way to escape a deeply nested loop without gotos.

I only mention this to ask if this is your intent, and because I know there are those who would argue that goto is never acceptable, a position I personally see as excessively restrictive. The construct itself is not inherently flawed; it is the uses it is sometimes put to that are the problem, and in this instance, I can see why it would be justified. But it is a topic one might want to clear the air about.
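For readers unfamiliar with the pattern under discussion, here is a small self-contained sketch of goto-based cleanup in C. The allocation counter and the `fail_at` parameter exist only to make the unwinding observable; they are not taken from the code in the first post.

```c
#include <stddef.h>
#include <stdlib.h>

int live_allocs; /* number of buffers currently allocated */

static void *tracked_alloc(size_t n, int fail) {
	void *p = fail ? NULL : malloc(n);

	if (p != NULL)
		live_allocs++;
	return p;
}

static void tracked_free(void *p) {
	if (p != NULL) {
		free(p);
		live_allocs--;
	}
}

/* Acquires two resources and releases them in reverse order.
   fail_at: 0 = no failure, 1 = first allocation fails,
   2 = second allocation fails. Returns 0 on success, -1 on error. */
int build_pair(int fail_at) {
	char *a = NULL;
	char *b = NULL;
	int ret = -1;

	a = tracked_alloc(64, fail_at == 1);
	if (a == NULL)
		goto out;
	b = tracked_alloc(64, fail_at == 2);
	if (b == NULL)
		goto free_a;

	a[0] = b[0] = '\0'; /* stand-in for real work on both buffers */
	ret = 0;

	tracked_free(b);
free_a:
	tracked_free(a);
out:
	return ret;
}
```

Each label undoes exactly the steps that succeeded before the jump, so every exit path releases what was acquired without duplicating the cleanup code at each return.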

bzt · Post by **bzt** » Tue Jan 01, 2019 2:10 pm

tellur wrote:So the main question is, where would I put VESA setup and VBE mode switching in this diagram? Either I'd have to put it at 1, i.e. do it before switching to long mode, but that would mean the loader would have to deal with all sorts of graphics modes, meaning I'd have to write font rendering and other graphics code code for the loader (which I should probably code in such a way as to reuse it in the userspace console server, which I haven't even designed yet ...). Another downside is that the graphics mode would either be set in the config file (edit: which has been loaded by tload16), or I'd have to support changing the graphics mode for the loader itself, not just for the kernel to be loaded (which I'd rather do without). I could also put it at 2, meaning I'd switch down to real mode from long mode, with pure identity mapping to start with, do my VBE stuff, then return to long mode and only then set up higher-half and load the kernel. But if something goes wrong during VBE setup, I couldn't display any error message to the user after. Option 3 is perhaps the best option, because it allows error messages right through the final step. I could also delay the higher-half setup until VBE has been initialized.

Happy New Year to you too!

I think you have collected your options pretty well. I have just a few things to add:
- no need to reuse the loader's font renderer in your console. It's likely you want a simple font (probably a bitmap font with no more than 127 glyphs) in the loader, because you must optimize for size and you display only static strings and most probably in English only; while your console should support Unicode because you don't know in advance what needs to be displayed (think about cat-ing a UTF-8 text file on the console for example) and you could use an antialiased font too.
- if I were you, I wouldn't care about displaying error messages. This is because reported VBE modes should work, and you can report errors on the serial console too. But if you desperately want to display an error message on the screen, you can still switch back to text mode in your error reporting function (that works regardless of whether VBE setmode worked or not).
- for future compatibility, you probably want graphics mode anyway, as EFI does not have any other. Text mode is a very special display mode, only supported on outdated PC display cards (CGA, EGA, VGA etc.), which is going to be obsoleted pretty soon. Also if you plan to support multiple architectures, a linear frame buffer is the way to go.

tellur wrote:I don't know much about UEFI as of yet, but does it also provide a simple framebuffer (It sounds like it does by your description)? If so, that would simplify the kernel graphics 'driver' by a lot, because I could use the exact same code for both VESA and UEFI graphics.

Yes. Take a look at my boot loader for an example: https://gitlab.com/bztsrc/bootboot. I set up the same framebuffer with VBE on BIOS and with GOP on UEFI machines (and with VideoCore mailbox calls on RPi). My kernel does not need to know how the console was initialized; it has a common console driver for all platforms. I've included a very minimal PSF2 font renderer (one function only) in the example kernel, tested on BIOS, UEFI and RPi machines. Also it parses a config file to get the requested resolution, just as you wanted. (For the record, my loader switches back to real mode to get VBE fw. on BIOS machines.)
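To illustrate why a loader-provided linear framebuffer keeps the kernel side small, here is a minimal sketch. The `struct boot_framebuffer` layout is hypothetical and is not the structure BOOTBOOT or any other real loader passes; it also assumes 32 bits per pixel.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hand-off structure filled in by the loader, whether
   the mode was set through VBE or through UEFI GOP. */
struct boot_framebuffer {
	uint8_t *base;   /* start of the linear framebuffer */
	uint32_t width;  /* visible pixels per row */
	uint32_t height; /* number of rows */
	uint32_t pitch;  /* bytes per row, may be larger than width * 4 */
};

/* Write one 32-bit pixel. Returns 0 on success, -1 if the
   coordinates are outside the visible area. */
int fb_put_pixel(const struct boot_framebuffer *fb,
                 uint32_t x, uint32_t y, uint32_t color) {
	uint32_t *row;

	if (x >= fb->width || y >= fb->height)
		return -1;
	/* Rows are addressed by pitch, not by width, because the
	   hardware may pad each scanline. */
	row = (uint32_t *)(fb->base + (size_t)y * fb->pitch);
	row[x] = color;
	return 0;
}
```

Everything above this function (font rendering, scrolling, the console) only needs these four fields, so the same code would run unchanged on both BIOS and UEFI boots.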

tellur wrote:Could I also just link my kernel to a PE executable for UEFI and then boot it directly in long mode? Or is it still necessary to have some kind of loader? Having an extra FAT partition for the kernel/loader doesn't bother me; everything but the kernel is eventually supposed to be on an ext2/ext3/ext4 partition.

Yes, that's perfectly doable; as a matter of fact, you can compile the Linux kernel as an EFI PE executable. Interestingly, Linux does not use any PE-specific toolchain for that. The kernel is compiled into a flat binary as always, and the PE header is just assembled as data bytes into it.

Cheers,
bzt

eekee · Post by **eekee** » Mon Jan 28, 2019 7:09 am

Wow! Reading your first list of bullet points, I think Plan 9 meets most of them. You might want to study it. It's certainly possible to improve on it, but study it first, there are some mind-blowing things about it!

The really big architectural difference is where you say, "All APIs should be asynchronous". In Plan 9, all (or almost all) APIs are synchronous, but it does very well that way, getting an impressive amount of functionality and flexibility out of remarkably little code. It uses many threads where other systems use callbacks. There's a paper claiming this reduces complexity here:
http://doc.cat-v.org/bell_labs/concurre ... ow_system/

Plan 9 uses both coroutines and lightweight processes as threads, the latter with and without shared memory. (Its process model is very flexible.) There are a bunch of other things which increase the simplicity and flexibility of the system, like private namespaces (for files) which it uses all over the place. They can also be used for security, of course, and any process can prevent its children or even itself from changing its namespace (RFNOMNT). All devices are represented as files, so anything may be locked down this way. (Note: the code hasn't been audited; at least one dev says there are security holes.)

Pipes (including virtual file interfaces) preserve message boundaries (up to 8KB limit), which also simplifies communication when compared to Unix pipes.

Down-sides include poor documentation, (especially for the authentication system,) and the fact that you have to be extremely smart to understand some of the code.

Personally, I also wished some of the interfaces were structured in a way as to be easier to access from the shell. They're already very accessible, I got a lot done with Plan 9's excellent shell, things which would have required a lot more program support in Unix if they were possible at all, but there were times when the nested quoting became unbearable. Also, interfacing the many different little languages got tiresome.

As a Plan 9 user, I'll say synchronous communication does reduce the complexity of many tasks a great deal. Requiring multi-threading is more of a mixed bag, Plan 9's thread library does have a simpler interface than pthreads, using channels more than mutexes, but I got the impression many programmers struggled with it anyway. Everyone needed help with their first userspace filesystem.

If I seem to be going on about Plan 9 a bit much, it's a bit like Lisp: Mind-blowing when you 'get' it, quite out of this world! Despite the deficiencies, the possibilities seem immense. My long-term goal is to take the good ideas of Plan 9 and make an even more expressive system.

Anyway, there are a lot more Plan 9 papers at the above link, and the same team went on to design Google Go (golang), so that may be of interest too.

Reading more of your post, I saw this: "This can work through file selection dialogs (which we need anyway)." I know it's fun imitating the big boys and redesigning is hard, but for my part there is nothing in GUI I find myself loathing more regularly, or have hated so consistently for such a long time, as the file selector. I already have my project directories open in some other file access medium, or can open them very quickly, so why should I have to open and WAIT FOR this cramped clumsy interface just so The Application (capitalization intentional) can receive my choices? And WHY do I have to WAIT in the SECOND DECADE of the TWENTY-FIRST CENTURY? File selectors are seriously not faster than they were in 1990. And why, when I drag a file to the file manager, do I have to be very careful not to move the file to whatever random directory happens to be open in this box that just popped up?

These other file access media include traditional file managers as more or less the least option. I also use Acme SAC which can send messages to the host OS to open things. Acme and the associated plumber can do very powerful things with plain text.

Also, Eagle Mode taught me how powerful ZUI can be when applied to file management. It was my primary file manager under Linux and OS X. I don't use it much any more because it stalls waiting for hard disks to spin up. Bad design, but the concept holds. It's still worth waiting for sometimes. Huh... Eagle mode's front page now warns of intuitivity problems, but I found it extremely intuitive, just like many image viewers and, in my muscle memory, Transport Tycoon.

Isn't hot-swapping the kernel achievable with suspend-to-disk without an assembly-language kernel? Granted, suspend-to-disk requires considerable driver support as you shut every device down, then try to restore its former state when you bring it back up. Maybe the same technique could be applied to hot-swapping the kernel?

I hope your URL filesystem is workable. Plan 9's webfs doesn't seem so easy to use.

Zero-copy is not a magic pill for performance. I don't fully understand the details, but I'm told the necessary messing about with the page tables alone can be a serious performance hit, sometimes more so than copying! (I think it hits latency more than throughput.) If everything ran in ring 0 then maybe it would always be a good idea, but even then certain hardware devices are limited in where they can DMA to. There was a recent discussion on 9fans about this, several driver developers weighed in. Thread linked below, although it started under a different subject, this is most of it. Note that Steven Stallion at one point claimed 9p only supports one outstanding operation, which is not actually true but is how most Plan 9 programs use it, unfortunately. (Efforts are occasionally made toward "streaming 9p" which would make it easier.) Watch for cinap_lenrek calling Linux out for a horrible problem with software RAID related to zero-copy!

https://marc.info/?t=153921043300003&r=1&w=2

Also note Erik Quanstrom's one-liner, "zero copy is also the source of the dreaded 'D' state." The D state is "uninterruptible IO." It may or may not be Linux-specific, I thought I once saw it in FreeBSD, but can't clearly remember. Years ago, I had a few Linux installations sharing a filesystem over NFS. When I accidentally unplugged the wrong network cable, processes using those filesystems went into D state instantly. It really means "uninterruptible": the process can neither be killed nor run. You can't get your data out of it, except perhaps with a debugger. It sits there holding resources until the machine is rebooted. Every time I've thought about setting up a network filesystem with Linux since then, especially over WiFi, I remember this. The only one I would set up is AndrewFS, which can resume operations after disconnection.

On the other hand, there's a note that some operating systems get zero-copy right: SunOS and, of all things, Win32!

I wonder how VMS does, because the WinNT kernel was designed by the same guy who led VMS kernel design.

OSDev.org
