Re: To POSIX or not to POSIX
Posted: Thu Jan 23, 2020 4:48 am
I have a few comments on the article's analysis of fork(); for tonight I'll cover this:

nyc wrote:
There are also features to consider removing outright, such as fork() as per https://www.microsoft.com/en-us/researc ... otos19.pdf or, perhaps, much of the tty/pty infrastructure in UNIX and POSIX beyond just devising better-working features instead of POSIX threading and asynchronous IO.
tl;dr of my comments below:

article wrote:
Fork conflates the abstraction of a process with the hardware address space that contains it.
What does "process" mean, independent of the protection architecture? Is it even a meaningful term on a system where memory protection and/or memory mapping state changes automatically as control transfers are made within the same privilege level? In my opinion, "process" is not meaningful in the context of such an architecture. It is primarily meaningful in the context of current architectures, where it means "a single memory mapping / protection context"; in that case a process is already conflated with a hardware address space before we even decide how we want to spawn processes. In such an environment, I think fork() makes sense.
In more detail:
Really, I think the concept of a "process" is an artifact of the fact that almost all current hardware makes all memory protection and memory mapping state static in user mode. Consider a hypothetical architecture similar to the x86 architecture, with the following changes:
1) Instead of one CR3, you have a CR3 for each segment register: CR3CS, CR3DS, etc.
2) Each segment register has a corresponding Virtual Descriptor Table Register (VDTR) pointing to a Virtual Descriptor Table (VDT) for the segment loaded into that register.
3) When a segment register is loaded, the upper bits of the selector are used to select a segment register, and the lower bits are used as an index into the VDT designated by the VDTR for that segment register.
4) The primary element of a VDT entry is a Real Segment Selector (RSS), which is used as an index into a Real Descriptor Table (RDT), which replaces both the GDT and LDT of the real-life x86 architecture.
5) The primary elements of an RDT entry are a CR3 value for the segment (rather than the base within a single address space that a real-life x86 segment descriptor holds) and a field designating a VDT for the segment. The CR3 value is loaded into the CR3 for the segment register being loaded, and the VDT field is loaded into the corresponding VDTR.
6) A program can load any segment that is referenced by the VDT of any currently loaded segment.
7) Loaded segments remain loaded regardless of whether they are in the VDT of any loaded segment (the VDT entry used to load a segment just finds the RDT entry for the segment; it's the RDT entry that's actually loaded into the segment register, CR3, and VDTR). A program can unload any segments it needs to protect before transferring control to untrusted code.
It may be necessary to amend 6 and 7 to get adequate security with good performance, but the description provided is a good starting point for discussion about such architectures.
8) The kernel can load segments as described above, or can use special instructions to load segments directly with their Real Selectors. (Depending on how I/O was done and protected on such an architecture, it might even be possible to do away with classical privilege levels entirely.)
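To make the moving parts concrete, here's a rough C model of the structures described above. Every name here (vdt_entry, rdt_entry, seg_reg, load_seg, and so on) is invented for illustration; this is a sketch of the semantics, not a real architecture's layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Software model of the hypothetical architecture sketched above.
 * All names are invented for illustration. */

struct vdt_entry {
    uint32_t rss;              /* Real Segment Selector: index into the RDT (point 4) */
};

struct rdt_entry {
    uint64_t cr3;              /* page-table root for this segment (point 5) */
    struct vdt_entry *vdt;     /* this segment's own VDT (point 5) */
    size_t vdt_len;
};

struct seg_reg {
    uint32_t rss;              /* which RDT entry is currently loaded */
    uint64_t cr3;              /* per-segment-register CR3 (point 1) */
    struct vdt_entry *vdtr;    /* per-segment-register VDTR (point 2) */
    size_t vdt_len;
    int loaded;                /* registers can be explicitly unloaded (point 7) */
};

extern struct rdt_entry rdt[];  /* single system-wide RDT, replacing GDT/LDT (point 4) */
extern struct seg_reg segs[8];  /* CS, DS, SS, ES, ... */

/* Loading a segment register from a virtual selector (points 3 and 6).
 * The upper selector bits name the segment register whose VDT we go
 * through; the lower bits index that VDT.  Note that what lands in the
 * target register comes from the RDT entry the VDT points at, so the
 * loaded segment stays valid even if the VDT entry later goes away
 * (point 7). */
int load_seg(unsigned target, uint32_t selector)
{
    unsigned src = selector >> 16;        /* upper bits: which loaded segment */
    uint32_t idx = selector & 0xFFFFu;    /* lower bits: VDT index */
    struct seg_reg *s = &segs[src];

    if (!s->loaded || idx >= s->vdt_len)
        return -1;  /* only segments reachable from a loaded one can be loaded (point 6) */

    struct rdt_entry *re = &rdt[s->vdtr[idx].rss];
    segs[target] = (struct seg_reg){
        .rss = s->vdtr[idx].rss,
        .cr3 = re->cr3,        /* the address space switches with the register... */
        .vdtr = re->vdt,       /* ...and brings its own accessibility graph along */
        .vdt_len = re->vdt_len,
        .loaded = 1,
    };
    return 0;
}
```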
The association of each segment with its own VDT basically creates a directed accessibility graph of segments, i.e., "Segment A has privileges to see and access segments X, Y, and Z. As long as Segment A is already loaded, segments X, Y, and Z can be loaded." It's basically a coarse-grained capability system (a lot of the capability systems I've seen described seem too fine-grained to have a chance of being performant).
A microkernel implemented on such an architecture could implement message passing to/from servers in terms of function calls and returns, without needing the kernel as an intermediary: A program makes a far call to the library implementing the file system server, passing as an argument an empty segment waiting to be filled with data read from a file. The VDT for the segment containing the library has an entry for the file system server's global data area, which the library then loads. It services the read call and returns to the program, and if the kernel gets called at all, it's for things the server actually needs the kernel to do for it, not for message passing.
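Here's a minimal sketch of that client-side read path in C. All of the primitives (blank_seg(), vdt_insert(), vdt_disown(), my_ds()) and the fs_read() entry point are invented names for the operations just described, not a real API; the point is that nothing on the call path traps into the kernel.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t seg_t;   /* a virtual selector, as in points 3-5 above */

/* Invented primitives; none of these are a real API. */
extern seg_t blank_seg(size_t pages);           /* empty segment, no pages present */
extern void  vdt_insert(seg_t owner, seg_t s);  /* reference s from owner's VDT */
extern void  vdt_disown(seg_t owner, seg_t s);
extern seg_t my_ds(void);                       /* a data segment we keep loaded */

/* This prototype stands in for a far call into the file-system
 * server's code segment.  Calling it loads the server's code segment
 * through our VDT; the server then loads its global data segment
 * through *its* VDT, services the read, and returns.  The kernel
 * never mediates the "message". */
extern long fs_read(const char *path, seg_t dest, size_t off, size_t len);

long read_file(const char *path, size_t len)
{
    /* An empty segment, waiting to be filled with data read from the file. */
    seg_t buf = blank_seg((len + 4095) / 4096);

    /* Reference buf from a segment that stays loaded across the call,
     * so the server is allowed to load it (point 6). */
    vdt_insert(my_ds(), buf);

    /* The server runs on our timeslice, possibly on our own stack. */
    long n = fs_read(path, buf, 0, len);

    /* ... consume the data in buf ... */
    vdt_disown(my_ds(), buf);
    return n;
}
```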
With such an architecture, you could basically just have "user threads" (stack segments containing a thread's stack and thread-local storage), "kernel threads" (address spaces containing data for a kernel scheduling entity), "executables" (address spaces containing code for executables and libraries, with a VDT entry for each external library needed), and "files" (address spaces containing any other data: memory-mapped files, anonymous shared memory, heaps, etc.).

In such a setup, threads might very well cooperate on a task and share the same files (including one or more heaps), which would correspond somewhat with the traditional concept of a "process", but it doesn't seem to me, with this kind of architecture, that such cooperation would need to be enshrined in a kernel object / tied to a particular address space, as it is on traditional flat-memory architectures. For instance, many microkernel servers might not need a dedicated kernel thread (or even user thread) at all; they might just function as libraries with system-global data areas not accessible to anything but their own code (after all, that's basically what a monolithic kernel is). Programs would call the server, and the server would execute their request on the calling program's timeslice, and possibly the calling program's own stack, so unlike microkernel servers on a traditional architecture, the server wouldn't much resemble (let alone be) a traditional "process"; a process has at least one thread!

It is also imaginable that something like a word processor might be implemented with a kernel thread servicing a user thread for each open file, with each user thread having a VDT with an entry for the corresponding file, as well as a heap for that user thread. Does this count as one process or two? There's only one kernel scheduling entity, so we might say one, but the file and corresponding heap for each editing session are only accessible when the stack segment for the corresponding user thread is loaded, which would require two separate processes on traditional architectures.
In such a case, you wouldn't have fork() and exec(); you'd have primitives to create a new segment copy-on-write from an existing one, map a new data segment from a file, insert a code segment into a VDT (possibly creating a new RDT entry for it if no other program has loaded that executable/library already), create an empty segment with no pages present, create a new thread using a designated segment as the stack, etc. Something roughly like spawning or fork()-exec()-ing a new process on a traditional architecture would look something like this (a rough code sketch follows the steps):
Create a blank segment and call a language-runtime function to initialize it as a stack.
Then, putting the needed segments into the new stack segment's VDT (or the VDT of some segment reachable from the stack segment, or one that will be kept loaded across the thread switch):
cow() the segments containing any resources that need to be brought over to the new thread as-is, but won't be shared
Set up any shared memory with the new thread
Set up new blank segments for any resources that will not be taken from or shared with the existing thread, and call the appropriate runtime functions to initialize them.
mmap() any files required by the new thread that aren't being used by the old thread, including the executable for the new thread.
Call the kernel function newThread() with the new stack segment and a stack pointer (obtained from the runtime function that initialized the stack) as arguments. This will create a new kernel scheduling entity and, like fork(), will return in both the new and old threads; but unlike Unix fork(), it will not copy anything that has not already been copied, and it will have the stack set to the SS:SP designated, rather than to a COW copy of the original thread's stack.
In the new thread:
Unload the segment register that was being used by the existing thread to set up the stack for the new thread (since that segment is now loaded as our stack segment).
Unload any other segment registers designating segments that should not be accessible to the new thread once it jumps into the new program.
Far jump to the code segment for the new program.
In the existing thread:
Unload the segment register that was being used by the existing thread to set up the stack for the new thread.
Unload any other segment registers that were being used by the existing thread to set up segments for the new thread.
If any segments used in setting up the new thread remain in the VDT of a currently loaded segment, and are not used by the existing thread, disown them from the relevant VDT.
Continue with whatever you were doing before you spawned the new thread.
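Pulling those steps together, here's a rough C sketch. Every primitive (blank_seg(), cow(), mmap_seg(), vdt_insert(), unload(), init_stack(), newThread(), far_jump()) and the example resource my_args_seg are invented names for the operations described above, not a real API.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t seg_t;

/* Invented primitives, named after the steps above; none are a real API. */
extern seg_t blank_seg(size_t pages);               /* empty segment, no pages present   */
extern seg_t cow(seg_t src);                        /* copy-on-write clone of a segment  */
extern seg_t mmap_seg(const char *path);            /* map a file as a segment           */
extern void  vdt_insert(seg_t owner, seg_t s);      /* reference s from owner's VDT      */
extern void  unload(seg_t s);                       /* drop s from our segment registers */
extern uintptr_t init_stack(seg_t stack);           /* language-runtime stack setup      */
extern int   newThread(seg_t stack, uintptr_t sp);  /* returns nonzero in the new thread */
extern void  far_jump(seg_t code, uintptr_t entry); /* does not return                   */

extern seg_t my_args_seg;  /* hypothetical: a resource the new thread takes as-is */

void spawn(const char *exe_path, seg_t shared_heap)
{
    /* Create the new thread's stack segment and have the language
     * runtime initialize it. */
    seg_t stack = blank_seg(16);
    uintptr_t sp = init_stack(stack);

    /* Put everything the new thread will need into the stack
     * segment's VDT: COW'd copies of unshared resources, shared
     * memory, and the new executable. */
    seg_t args = cow(my_args_seg);        /* carried over as-is, not shared */
    seg_t code = mmap_seg(exe_path);      /* the new thread's executable    */
    vdt_insert(stack, args);
    vdt_insert(stack, shared_heap);       /* genuinely shared with us       */
    vdt_insert(stack, code);

    if (newThread(stack, sp)) {
        /* New thread, running on the freshly initialized stack.  The
         * steps under "In the new thread" above go here: unload any
         * segment registers the new program shouldn't see, then jump
         * into its code segment. */
        far_jump(code, 0 /* entry point */);
    }

    /* Existing thread: unload the segment registers we were using to
     * set the new thread up, then continue with whatever we were
     * doing before. */
    unload(stack);
    unload(args);
    unload(code);
}
```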
There could also be a forkStack() call for instances in which it is desirable to have the new thread have a COW copy of the existing thread's stack. In this case, rather than initializing the new stack with a language runtime function, you would leave it blank, but fill in its VDT as described above. When ready to spawn the new thread, you'd call forkStack(), instead of newThread(), with the blank stack segment for the new thread as a parameter, which would COW the current stack into the blank segment and return in both threads. Like newThread(), and unlike Unix fork(), it would not copy anything (other than the stack) that had not already been copied.
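Continuing with the same invented names from the sketch above, the forkStack() variant might look like this:

```c
/* Variant using forkStack(): the new thread gets a COW copy of our
 * current stack rather than a freshly initialized one.  forkStack()
 * is an invented name, per the description above. */
extern int forkStack(seg_t blank_stack);  /* returns nonzero in the new thread */

void spawn_cow(seg_t shared_heap)
{
    seg_t stack = blank_seg(16);     /* left blank: no init_stack() call */
    vdt_insert(stack, shared_heap);  /* fill in its VDT as before        */

    if (forkStack(stack)) {
        /* New thread, on a COW copy of the parent's stack.  Nothing
         * else has been copied that wasn't already copied. */
        return;
    }
    /* Existing thread continues here. */
    unload(stack);
}
```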
On a traditional architecture, on the other hand, the abstraction of a process, as a monolithic address space with one or more associated threads that all need that address space to run, is forced upon us (unless we're using a single address space system with protection in software), and I think fork() is one of the better ways of dealing with that abstraction when that's what the hardware allows. Now, there may be a bit of a chicken-and-egg effect here: the traditional memory/protection model makes fork() optimal, and fork() makes the traditional memory/protection model optimal, so other avenues go unexplored. But it's not just an issue of eeeeevil fork() holding us back.