Design of hybrid kernel -- Tellur -- RFC
Posted: Sun Dec 30, 2018 12:06 pm
Hi,
For a few years I have been developing my OS (codename: 'tellur') on and off in my free time. I even got as far as writing my own (working) FAT16 x86_64 bootloader and a kernel capable of booting up all SMP processors/cores and executing scheduled kernel code (also very simple ELF loading of userspace code, but no syscalls yet). That said, I have been considering scrapping some of this code again, but more on that later.
I would like to continue developing the kernel, but I think this time I would like to ask for feedback on the architecture first, especially since a rewrite is under consideration. I already owe a lot to this forum in terms of information, so maybe I can contribute something back in the future. So let's start with the design ...
The design of my kernel has changed somewhat over the years, but the core principles and goals have stayed the same. Those are (in no particular order):
- It should be somewhat of a "spiritual successor" to Unix, especially in that the kernel API is very simple and that everything is a file. It should not emulate Unix though (that's what userspace servers are for)
- Not only is everything a file, everything is also a capability and a stream
- It should provide a cryptographically secure capability system, which makes it possible to contain the effects of both malicious code and userspace software vulnerabilities (again, more on that later)
- It should not prevent the user from using closed-source drivers, but it should offer functionality to contain them
- It should be hot-swappable
- It should be fast and have the potential for production use, especially for developers
- Reliance on libraries should be reduced in favor of a client-server architecture (correcting the IMO most crucial mistake in the path Unix took in the late 90s)
- All APIs should be asynchronous
- The kernel should neither act as a mere hardware multiplexer/scheduler, nor as a Linux-like monolith with filesystems and a firewall built-in. It should act as the hardware abstraction layer, and in conjunction with the other requirements, this requires memory management, a scheduler, *a VFS* (with hooks for userspace filesystems and protocol handlers, see below), and capability management. Probably other things as well that I forgot.
- Compatibility is more important than beauty (or integrated-ness into the ecosystem). This is one of the major reasons Microsoft operating systems have been so successful. Architecturally, this will be achieved by adding a variety of layers (POSIX, Linux, Windows, etc.) as userspace servers, about which the kernel has no knowledge.
- The system should preserve its state across reboots, just like a suspend to disk. This should even work when power is interrupted (although recent changes that haven't been written to disk might get lost).
- There's no rush to implement all this, but I really would like to try!
With those core principles out of the way, let's very briefly discuss capability security. There's a lot of material on the web, for instance on http://cap-lore.com/, so I won't reiterate everything, but suffice it to say that:
... capability security is an /alternative/ to ACLs or permission-based systems like Unix or Windows, where user permissions for files and devices are set, and the latter can be opened without the user's consent simply by virtue of the correct permissions being set.
Now, a capability represents authority (a 'claim') to a certain resource (a file, an IO port, etc.). It can be thought of as an object (as in OOP), or as a more general form of a Unix file descriptor. In tellur, capabilities are represented in userspace as a cryptographic public key unique to that capability (implementation details pending). The kernel issues the keys and maintains a map between private keys and capabilities, allowing it to unilaterally revoke any authority if that is desired.
An example: let's say your browser has been given access to the audio device and you open up a website playing annoying music. You can just cut off the browser's access to the audio device. Or you have a rogue app filling up your home directory with garbage files. Just cut the rogue app off from the filesystem by revoking all of its capabilities to files.
It should be obvious how useful such a system is and how permissions can't deliver on that. But the question is how userspace programs would obtain such capabilities. After all, we can't just allow the kernel to give out a capability to a certain file (let's say /etc/shadow) without the user consenting. Otherwise, we'd be back to a system without any kind of security (like DOS), or we'd have to add permissions back in.
So we must have the user's consent in order to open files and access devices. With almost everyone now on Android 6+, there's a handy example available on how that would work: When you first open an app on newer Android devices, you get asked about whether you want to allow camera access, filesystem access, calendar access, etc.
Capability security UI works in a similar way, just more granularly. You don't simply allow apps access to your entire filesystem, you allow access to single files or directory trees. This can work through file selection dialogs (which we need anyway). For devices, however, an approach similar to Android's is suitable. All those dialogs and confirmation messages require a special GUI process handling them (kind of like how 'system modal' dialogs work in Windows).
On console it's even easier. Consider:
Code:
[test@machine]: cat file1 file2
error: 'file1' not a capability!
[test@machine]: cat [open file1] [open file2]
...
[test@machine]: # or with syntactic sugar:
[test@machine]: cat 'file1' 'file2'
...
(if you hadn't noticed, the syntax for the shell is inspired by Tcl, but it should be clear: brackets [] are function calls)
In the shell case, the command "open", also represented by single quotes, is a special privileged function that can ask the kernel directly for capabilities, but only if the shell is interactive. Otherwise we would run into the consent problem after all. This is achieved by the kernel supplying each login shell with a capability to the whole filesystem, which allows "opening" (obtaining the capability to) arbitrary files and directories. The login shell never passes this capability on to arbitrary processes, only to its trusted internal functions, and to other interactive shell-like programs, such as the desktop environment.
Of course, you could also start the desktop environment with a restricted version of the "login shell capability" that denies access to system files. And BAM, you have a user system that is already more powerful than Unix's. You can get a restricted version of every capability (for example, create a read-only capability from a read-write one, or exclude subdirectories from a directory capability).
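To make the restriction and revocation ideas a bit more concrete, here is a minimal sketch in C of a kernel-side capability table. Everything in it (cap_create, cap_restrict, cap_revoke, cap_check, the rights bits, the fixed-size table) is a hypothetical name made up for illustration, and it uses plain slot indices where tellur would hand out cryptographic keys. It only tries to show two properties: a derived capability can never carry more rights than its parent, and revoking a capability also kills everything derived from it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical rights bits; tellur's real rights model is not specified. */
#define CAP_READ  (1u << 0)
#define CAP_WRITE (1u << 1)

#define CAP_SLOTS 64
#define CAP_NONE  (-1)

struct cap_entry {
    bool     live;      /* false if never allocated or already revoked */
    uint32_t rights;    /* always a subset of the parent's rights */
    int      parent;    /* slot of the capability this was derived from */
    int      resource;  /* opaque resource id (file, device, ...) */
};

static struct cap_entry cap_table[CAP_SLOTS];

static bool cap_live(int slot)
{
    return slot >= 0 && slot < CAP_SLOTS && cap_table[slot].live;
}

static int cap_alloc(int resource, uint32_t rights, int parent)
{
    for (int i = 0; i < CAP_SLOTS; i++) {
        if (!cap_table[i].live) {
            cap_table[i] = (struct cap_entry){ true, rights, parent, resource };
            return i;
        }
    }
    return CAP_NONE; /* table full */
}

/* Create a root capability (assumed to be done by trusted code only). */
int cap_create(int resource, uint32_t rights)
{
    return cap_alloc(resource, rights, CAP_NONE);
}

/* Derive a weaker capability; asking for extra rights fails. */
int cap_restrict(int parent, uint32_t rights)
{
    if (!cap_live(parent) || (rights & ~cap_table[parent].rights) != 0)
        return CAP_NONE;
    int child = cap_alloc(cap_table[parent].resource, rights, parent);
    /* Attenuation invariant: a child never exceeds its parent. */
    assert(child == CAP_NONE ||
           (cap_table[child].rights & ~cap_table[parent].rights) == 0);
    return child;
}

/* Revoke a capability and, recursively, everything derived from it. */
void cap_revoke(int slot)
{
    if (!cap_live(slot))
        return;
    cap_table[slot].live = false;
    for (int i = 0; i < CAP_SLOTS; i++)
        if (cap_table[i].live && cap_table[i].parent == slot)
            cap_revoke(i);
}

/* Check whether a capability is valid and grants all requested rights. */
bool cap_check(int slot, uint32_t needed)
{
    return cap_live(slot) && (cap_table[slot].rights & needed) == needed;
}
```

In this sketch, cap_restrict(rw, CAP_READ) would produce the read-only capability from the example, and cap_revoke(rw) would invalidate both the original and the read-only copy. Reusing slot numbers after revocation is only safe here because there are no long-lived handles; with real keys the kernel would simply never reissue a key.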
I would love to hear what you think about the capability system so far. Moving on...
Everything is a file
This is pretty straightforward. Capability systems have traditionally eschewed filesystems, but with tellur, I have decided to go in the opposite direction, since it's obvious tree-like filesystems are not going anywhere, and URLs have even expanded on that scheme. So there will be a special /url/ directory that is multiplexed by the kernel. You could do something like this:
Code:
[test@machine]: cat [open /url/https://google.com/]
... and get the source code of Google's home page back, provided that a handler for https is registered and that the current shell has the proper capabilities.
Needless to say, device files and something like the proc filesystem will also exist.
Hot Swapping
In my opinion, it should be a basic requirement for a modern kernel (as opposed to dated NT and Linux) to be able to update itself without rebooting. The current environment and contents of RAM should also be kept (just like suspend to disk). By considering hot swapping early in development, we lay the foundation for robust patching technology for kernel data structures. This finally allows always-on servers without the security risk of running REALLY old kernels with only occasional backported patches (like Debian Stable servers do in a lot of cases).
Thinking about hot-swapping brought up the idea that perhaps it would be better to actually implement the kernel in architecture-specific assembly, and only the userspace programs in C or C++. This would make the locations of all kernel code and data deterministic, at the cost of slowing down development, breaking portability and having to start over (though the last point is not really a big deal). I'm not sure, but maybe the same could be achieved using linker scripts? Perhaps someone knows more about this than me? Another possibility is to use some kind of "deterministic heap", and store all information likely to change between kernel versions there. Does anyone have experience with that sort of solution? The current kernel is largely written in C++ by the way.
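One possible shape for the "store everything that may change between versions in a known place" idea is to give every persistent kernel record a small header with a layout version, and let the new kernel migrate old records on takeover. The sketch below is only an illustration under that assumption; the record (sched_state), its fields, the magic number and the function sched_state_migrate are all invented for this example and are not part of the current tellur code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Every persistent record starts with this header, so a freshly swapped-in
   kernel can tell what the previous kernel left behind. */
struct state_header {
    uint32_t magic;    /* identifies the record type */
    uint32_t version;  /* layout version written by the old kernel */
};

#define SCHED_MAGIC 0x53434844u /* arbitrary value chosen for this sketch */

struct sched_state_v1 {
    struct state_header hdr;
    uint32_t nr_tasks;
};

struct sched_state_v2 {
    struct state_header hdr;
    uint32_t nr_tasks;
    uint32_t nr_cpus;  /* field added in layout version 2 */
};

/* Fields shared by both layouts must not move between versions. */
static_assert(offsetof(struct sched_state_v1, nr_tasks) ==
              offsetof(struct sched_state_v2, nr_tasks),
              "common fields must keep their offsets");

/* Convert whatever the old kernel stored into the current (v2) layout.
   Returns 0 on success, -1 if the record is not recognized. */
int sched_state_migrate(const void *old, struct sched_state_v2 *out,
                        uint32_t detected_cpus)
{
    struct state_header hdr;
    memcpy(&hdr, old, sizeof hdr);
    if (hdr.magic != SCHED_MAGIC)
        return -1;

    switch (hdr.version) {
    case 1: {
        struct sched_state_v1 v1;
        memcpy(&v1, old, sizeof v1);
        out->hdr = (struct state_header){ SCHED_MAGIC, 2 };
        out->nr_tasks = v1.nr_tasks;
        out->nr_cpus = detected_cpus; /* new field gets a fresh value */
        return 0;
    }
    case 2:
        memcpy(out, old, sizeof *out); /* already current, just copy */
        return 0;
    default:
        return -1; /* written by a newer kernel than this one */
    }
}
```

With something like this, the code locations would not need to be deterministic at all; only the address of the record area and the header format would have to stay fixed across kernel versions.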
Client-Server Architecture and Kernel Responsibilities
In tellur, the role of libraries should be massively reduced. Libraries are neither network-transparent nor easy to get right in terms of ABI/API compatibility. If dynamically linked, ABI changes break your code. If statically linked, code size blows up and security bugs in a library can only be fixed through recompiling all programs using this library.
Servers that run as extra processes and communicate with clients through network sockets avoid all of these problems. The GUI would be provided by such a server (a bit like X11, just modernized by a lot).
Servers can also be registered with the VFS. In this case, opening a certain file path redirects the stream to a server socket. For instance, opening /url/https://google.com/ opens a new stream with the local HTTP streaming service, which connects to the web on the user's behalf. Opening /mnt/fat32_disk/ opens a new stream with the FAT32 filesystem service etc. etc.
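As a rough illustration of how the kernel could pick the server for a path, here is a minimal sketch in C of a registration table with longest-prefix matching. The names vfs_register and vfs_resolve, the integer server ids and the fixed table size are assumptions made for this example only; the real registration interface is still undecided.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VFS_MAX_MOUNTS 16

struct vfs_mount {
    const char *prefix; /* e.g. "/url/https:"; must outlive the table */
    int server;         /* hypothetical id of the server's socket, >= 0 */
};

static struct vfs_mount mounts[VFS_MAX_MOUNTS];
static size_t mount_count;

/* Register a server for a path prefix. Returns 0 on success, -1 on error. */
int vfs_register(const char *prefix, int server)
{
    if (mount_count == VFS_MAX_MOUNTS || server < 0 ||
        prefix == NULL || prefix[0] != '/')
        return -1;
    mounts[mount_count++] = (struct vfs_mount){ prefix, server };
    return 0;
}

/* Find the server with the longest registered prefix of `path`.
   A prefix only matches at a path component boundary.
   On success returns the server id and stores the remainder of the
   path in *rest; returns -1 if no server is registered. */
int vfs_resolve(const char *path, const char **rest)
{
    assert(path != NULL);
    size_t best_len = 0;
    int best = -1;

    for (size_t i = 0; i < mount_count; i++) {
        size_t len = strlen(mounts[i].prefix);
        if (len <= best_len || strncmp(path, mounts[i].prefix, len) != 0)
            continue;
        if (path[len] != '/' && path[len] != '\0')
            continue; /* "/mnt/disk" must not match "/mnt/diskette" */
        best_len = len;
        best = mounts[i].server;
    }
    if (best >= 0 && rest != NULL)
        *rest = path + best_len;
    return best;
}
```

With "/url/https:" registered for the HTTP service, resolving /url/https://google.com/ would yield that service and hand it the remainder "//google.com/" as the stream target.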
Services (and processes in general) can be dynamically started and suspended, similar to what Android does with apps. There is no swap partition/file; there are simply image files on disk that mirror the RAM contents specific to each process. RAM and disk are synchronized when the disk is not being accessed. When such a sync happens, RAM is pruned of all processes that have not been used recently.
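The sync-then-prune step could look roughly like the following C sketch. The struct proc_image, the tick-based last_used field, the max_idle threshold and the stand-in image_write_back function are all assumptions for illustration; the real bookkeeping would of course work on pages rather than whole processes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct proc_image {
    bool     resident;  /* process pages are currently in RAM */
    bool     dirty;     /* RAM contents differ from the on-disk image */
    uint64_t last_used; /* tick at which the process last ran */
};

/* Stand-in for the real disk write; here it only clears the flag. */
static void image_write_back(struct proc_image *p)
{
    p->dirty = false;
}

/* Called while the disk is idle: bring every on-disk image up to date,
   then drop processes from RAM that have been idle for too long.
   Returns the number of processes evicted. */
size_t sync_and_prune(struct proc_image *procs, size_t n,
                      uint64_t now, uint64_t max_idle)
{
    size_t evicted = 0;

    for (size_t i = 0; i < n; i++) {
        struct proc_image *p = &procs[i];
        if (!p->resident)
            continue;
        if (p->dirty)
            image_write_back(p);
        assert(!p->dirty); /* only clean images may be dropped */
        if (now >= p->last_used && now - p->last_used > max_idle) {
            p->resident = false;
            evicted++;
        }
    }
    return evicted;
}
```

Because every resident process is written back before anything is evicted, dropping a process from RAM never loses data, which is also what makes the on-disk images usable as the state that survives a reboot.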
The kernel itself should not use modules at all. If any sort of optional code is included, it must be recompiled and then hot-swapped, which I'm sure is an acceptable substitute for module loading.
System Calls
All system calls should be asynchronous. This will eliminate almost all performance problems traditionally associated with keeping stuff like filesystems outside the kernel. All file operations should be done on streams. Those work without copying the data, instead leveraging the huge 64-bit virtual address spaces we have nowadays. This means that tellur will NOT support 32-bit systems, sorry. It's probably best to show you the API in the form of a coding sample (API still pending, of course):
Code:
/*
   The dir_capab comes from a call to [open] in the shell or from a file chooser dialog.
   This is a bit of an artificial example, because most programs would get the opened file
   passed directly to them. But consider cases such as the emulated POSIX environment,
   which will simply be passed a directory capability that it can go crazy with.
*/
int read_file_synchronous(tellur_capab dir_capab, char **dest) {
    tellur_capab file;
    tellur_wait_token token;
    int ret;

    /* asynchronous call, this is simply put on a queue */
    token = tellur_open(dir_capab, &file, TELLUR_RDONLY);

    /*
       Block until file open operation finished (this switches context).
       If you wanted, you could replace this with a loop that calls other
       tasks in the meantime (cooperative multitasking).
       If the token was invalid to begin with (a syntax error on calling
       tellur_open for example), the error is just returned back immediately.
    */
    ret = tellur_wait(&token);
    if (ret)
        goto error;

    /* This reads in an entire file and places the string pointer to the contents in *dest.
       The pointer stored in *dest should be freed with tellur_stream_free() */
    token = tellur_read(dest, file, -1, TELLUR_WHOLEPAGES | TELLUR_STREAM);

    /* Again, in this simple example, we just wait until everything is read, but we could
       simply call tellur_stream_next() (see next example) and read the stream piecemeal */
    ret = tellur_wait(&token);
    if (ret)
        goto close_file;

    /* Everything has been read, so the file capability is no longer needed */
    tellur_close(file);
    return 0;

close_file:
    /* Asynchronous, but guaranteed to succeed, so no need to wait */
    tellur_close(file);
error:
    tellur_perror(ret);
    return ret;
}
/* This is the asynchronous version with a callback */
int read_file_asynchronous(tellur_capab dir_capab, int (*callback)(char *, size_t)) {
    tellur_capab file;
    tellur_wait_token token;
    int ret;
    char *dest;
    size_t len;

    /* asynchronous call, this is simply put on a queue */
    token = tellur_open(dir_capab, &file, TELLUR_RDONLY);

    /* Now actually do the cooperative loop thing */
    do {
        ret = tellur_error(&token);
        if (ret)
            goto error;
        /* This function should call tellur_sleep(-1)
           (i.e. put the current process to sleep until something changes),
           if there are no elements on the queue left. */
        coop_queue_next();
    } while (!tellur_ready(&token));

    /* No string pointer initially given, TELLUR_CONSECUTIVE guarantees no memory is overwritten,
       so the whole file will be in memory consecutively after the stream is read.
       Note that in most cases, this is not needed, because stream data does not have to be kept around. */
    token = tellur_read(NULL, file, -1, TELLUR_WHOLEPAGES | TELLUR_CONSECUTIVE | TELLUR_STREAM);

    do {
        ret = tellur_error(&token);
        if (ret)
            goto close_file;
        /* Read next piece of stream (typically a memory page if TELLUR_WHOLEPAGES is set).
           The dest pointer should be freed with tellur_stream_free() */
        token = tellur_stream_next(&dest, &len);
        /* Call the callback as soon as the next piece is there.
           The callback could be part of a parser engine that advances the state of
           an internal state machine for example */
        ret = tellur_stream_wait(&token, callback);
        if (ret)
            goto close_file;
    } while (!tellur_stream_eof(&token));

    /* Everything has been read, so the file capability is no longer needed */
    tellur_close(file);
    return 0;

close_file:
    /* Asynchronous, but guaranteed to succeed, so no need to wait */
    tellur_close(file);
error:
    tellur_perror(ret);
    return ret;
}
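The comments in the sample say that an asynchronous call "is simply put on a queue". A minimal sketch of what such a queue could look like is a single-producer/single-consumer ring buffer in memory shared between the process and the kernel, written here with C11 atomics. The struct request layout, the ring size and the names ring_push and ring_pop are assumptions for illustration only. The point is that submitting a request is just a couple of memory writes, and only a full (or empty) ring would force the caller to actually enter the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u /* number of slots; must be a power of two */

static_assert((RING_SIZE & (RING_SIZE - 1u)) == 0,
              "RING_SIZE must be a power of two");

/* Hypothetical request record; the real syscall encoding is undecided. */
struct request {
    uint32_t op;
    uint64_t arg;
};

struct ring {
    _Atomic uint32_t head; /* next slot to read; written by the consumer only */
    _Atomic uint32_t tail; /* next slot to write; written by the producer only */
    struct request slots[RING_SIZE];
};

/* Producer side (e.g. the process submitting a call).
   Returns false if the ring is full. */
bool ring_push(struct ring *r, struct request req)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_SIZE)
        return false; /* full: this is where a real syscall would be needed */
    r->slots[tail % RING_SIZE] = req;
    /* Publish the slot only after its contents have been written. */
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return true;
}

/* Consumer side (e.g. the kernel draining the outbox).
   Returns false if the ring is empty. */
bool ring_pop(struct ring *r, struct request *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return false; /* empty */
    *out = r->slots[head % RING_SIZE];
    /* Release the slot only after its contents have been copied out. */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return true;
}
```

Two such rings per process, one in each direction, would need no locks as long as each ring has exactly one writer and one reader. A wait token could then simply be an index or sequence number that the completion ring echoes back.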
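This no-copy, completely asynchronous architecture allows for efficient data transport between services. An app that opens a file connects to the filesystem server (the socket is redirected there by the kernel) and streams its stuff from there. The filesystem server in turn gets its data streamed from a device. If everything runs locally and filesystem clusters/blocks are 4k or larger, the kernel can simply juggle pointers and remap physical memory. The data has to be read only once (by the filesystem server). Because everything can be made to work asynchronously (with two userspace queues, one inbox and one outbox, that cause no context switches unless they fill up), the number of context switches is reduced to a very low level (only once per timeslice if stream throughput is optimized).
This is the architecture in a nutshell. I hope you like it and I also hope you find something to critique and can suggest improvements. All questions I have for you are in bold.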
Additional questions:
- How would you go about implementing graphics drivers? I'm not sure which portions exactly to put into the kernel and which not to. Also, which driver (apart from the trivial Bochs graphics interface, the only one I have so far) should I be working on first for x86_64? VESA requires real-mode code (either native or emulated) AFAIK, which would mean either switching down to real mode when switching graphics mode (a huge pain, especially with SMP), or emulating everything (again a huge pain). Or is there some other alternative?
- Are there any performance or portability issues that could possibly arise from my approach? My initial goal is to have x86_64 and later armv8 support. No 32-bit support is planned as that would clash with the memory model. (Or maybe not?)
Thanks in advance!