Replacing fork() with egg byte codes

nullplan · Post by **nullplan** » Sat Nov 20, 2021 8:41 am

Hi all,

I was writing a lengthy article on the difficulties of replacing fork() in a unixoid OS, when an idea popped into my head. But let me start at the beginning.

fork() has many conceptual and architectural problems. The general use of any fork() like API is something like:

Code: Select all

pid = fork();
if (!pid) { /* child */
  do_something();
  exit();
}

Anything that does not follow this pattern tends to be an abuse of the API. But that means that we are leaving correctness to convention. It also means that we are duplicating the entire stack of the parent in the child, and thus the entire chain of activation records all the way down to the root function of the thread, but we only need the highest layer. That's just conceptually wrong. It would be like buying an entire apple and then only taking a single bite out of it. You just don't do that.

Usually the alternative to fork() is given as a kind of spawn() system call; something that creates a fully formed process with a new program. These APIs generally tend to have tons of arguments to set specific things about the new process, and tend to be cumbersome to use (fork() has no arguments, CreateProcess() on Windows has around a dozen), and they are never comprehensive (some inheritable thing is inevitably forgotten), and never extensible (if a new property to be set comes along, you need a new version of the API).

So someone brought up "eggs" in the past: Some kind of operating system object a process can allocate, then use system calls to set certain attributes of the process, before hatching the egg into a full new process. The idea has merit, as it gets around most of the problems of fork() and spawn(), but it does mean adding a whole new set of system calls to set all those attributes. I also don't know if you really want to do a system call for all of those things. Another problem is that "eggs" don't fit into UNIX's idea of resources. UNIX knows processes and files. Memory only exists in form of processes using it up. When a process dies, all of its still-open files will be closed, thus cleaning them up, but all of its still-running child processes are simply re-parented to the init process. This is because presumably these processes are still running and will find some kind of exit sooner or later. An egg would fall into the category of "process", but would be unable to run, since it is not hatched yet. So re-parenting them is pointless, since they can never run. So I would have to change the process exit code to ensure all unhatched eggs of a process are cleaned up. Or I could make them files, but then they would be inheritable, and suddenly that whole thing gets a lot more complicated.

I was about to despair and ask for advice when I noticed that essentially, what I wanted to have was some kind of fork() I could hand a code pointer to. Then a small program could run to set various things about the child process and then call execve(). But the code would have to be restricted. It would have to somehow always ensure termination. And I also really don't like the idea of two processes running in the same address space while not being threads. So a real code pointer is out of the question; any subroutine above two bytes in size can potentially loop infinitely. And then it hit me: I need a byte code. I can set all the things a child process could ever want to set in the byte code, and if something is forgotten, I can add support in the byte code later. Maybe add a system call to query whether a given number is a valid opcode, so processes can negotiate what kind of byte code programs they need to create, and I have an extensible API that ought to be easy to use, feature complete, and be able to make do with five parameters (program file, argument list, environment list, byte code program pointer, byte code length).

That also gets rid of another problem I have with spawn(): Sometimes the order of operations is important. For example, a terminal emulator might want to spawn their child process with the slave console being the controlling terminal. The simplest way to do so is to call setsid() in the child, then open the slave console. setsid() removes the controlling terminal (that is otherwise inherited from the parent process), and opening the slave console sets it again, since a terminal is opened while the process has no controlling terminal. But this requires the setsid() to be done before the open(). posix_spawn() likely makes such a promise, but the interface really does not make it clear. But with a byte code the programmer can set the order of operations however they want.

So, what do you guys think, is that a sensible way to go?

P.S.: Regarding thread creation, I would just create another system call for that job. One that takes a new code and stack pointer and argument. While threads and processes are similar objects to the operating system, they are very different for the userspace creating them.

linguofreak · Post by **linguofreak** » Sun Nov 21, 2021 8:38 am

nullplan wrote:Hi all,

I was writing a lengthy article on the difficulties of replacing fork() in a unixoid OS, when an idea popped into my head. But let me start at the beginning.

fork() has many conceptual and architectural problems. The general use of any fork() like API is something like:
Code: Select all
pid = fork();
if (!pid) { /* child */
  do_something();
  exit();
}
Anything that does not follow this pattern tends to be an abuse of the API. But that means that we are leaving correctness to convention. It also means that we are duplicating the entire stack of the parent in the child, and thus the entire chain of activation records all the way down to the root function of the thread, but we only need the highest layer.

We're not duplicating it, we're COWing it. If one process returns all the way back to the root function before the other calls exec, then, yeah, you end up duplicating the whole stack, but I wonder how often that actually happens.

So someone brought up "eggs" in the past: Some kind of operating system object a process can allocate, then use system calls to set certain attributes of the process, before hatching the egg into a full new process. The idea has merit, as it gets around most of the problems of fork() and spawn(), but it does mean adding a whole new set of system calls to set all those attributes. I also don't know if you really want to do a system call for all of those things. Another problem is that "eggs" don't fit into UNIX's idea of resources. UNIX knows processes and files. Memory only exists in form of processes using it up. When a process dies, all of its still-open files will be closed, thus cleaning them up, but all of its still-running child processes are simply re-parented to the init process. This is because presumably these processes are still running and will find some kind of exit sooner or later. An egg would fall into the category of "process", but would be unable to run, since it is not hatched yet. So re-parenting them is pointless, since they can never run. So I would have to change the process exit code to ensure all unhatched eggs of a process are cleaned up. Or I could make them files, but then they would be inheritable, and suddenly that whole thing gets a lot more complicated.

You're overthinking it. Don't make it an object the process can allocate, make it an object the process can construct. Something like this:

1) Process calls mmap(), gets memory.
2) Process constructs an egg in the mmap()ed region.
3) Process calls hatch(egg)
4) OS checks that it has been given a valid egg, and uses it to create a new process.

codyd51 · Post by **codyd51** » Tue Nov 23, 2021 12:59 pm

Hello!

This is because presumably these processes are still running and will find some kind of exit sooner or later. An egg would fall into the category of "process", but would be unable to run, since it is not hatched yet. So re-parenting them is pointless, since they can never run. So I would have to change the process exit code to ensure all unhatched eggs of a process are cleaned up. Or I could make them files, but then they would be inheritable, and suddenly that whole thing gets a lot more complicated.

I think I'm in agreement with linguofreak here, unless there's a detail or implication that I'm missing. It sounds as though you're following through a line of thought in which eggs are a special kind of non-runnable process-like resource, but instead you could model them as an in-memory structure that's successively built up over a number of operations. I don't see why these operations can't be in userspace, while you're at it. You might even fancy doing something like sticking a version number as the struct's first field, then tacking on new fields you think of to the end of the struct. Pass a pointer to the kernel, whamo, you've got a process.

I wouldn't deny you the fun of building a bytecode interpreter, though!

vvaltchev · Post by **vvaltchev** » Tue Nov 23, 2021 4:03 pm

Mmm, honestly, @nullplan if you want to avoid a fork-like interface, I'm more for the approach suggested by @codyd51: a variable-sized, versioned, extensible structure which defines data and a series of operations on it. Clearly, it's less flexible than a full-blown byte-code interpreter, but by far simpler. Even if there are plenty of things a process might want to do between fork() and execve(), the whole Windows operating system survived with just CreateProcess() and CreateProcessEx(), using just a variable-sized struct that got extended over time and a bunch of other parameters. By allowing the struct to contain an array of IDs of operations with parameters as well, it should be reasonably enough.

Clearly, a byte-code interpreter is more powerful, but seems to me an overkill to have it just for the spawn case. If, instead, you design the whole kernel to have plenty of customization points using byte-code, that would make more sense to me. Linux has already something like that, called eBPF, and uses it for packet filtering, tracing and security features. So, you might take a look at that to get inspired about how to make a byte-code interpreter that's limited enough to run safely in the kernel.

Also, about fork(): today this syscall can be made very efficient by making not only the pages themselves CoW, but the page tables CoW as well: this makes it significantly faster. Finally, if you really want to avoid fork() without doing anything fancy, you could just implement vfork(): it's much more limited and there's no CoW (it works on no-mmu systems as well).

Korona · Post by **Korona** » Wed Nov 24, 2021 7:27 am

In principle, I'm a fan of vfork()s approach: create a new process that initially shares the same address space, and later change the address space.

However, vfork() is quite limiting if you need to do non-trivial setup work since you can basically only change some local variables without corrupting the parent's state, and you can't communicate with the parent (since the parent is suspended until execve()). Exploring some extension of vfork() that fixes these two issues would be great.

nullplan · Post by **nullplan** » Wed Nov 24, 2021 9:58 am

codyd51 wrote:I think I'm in agreement with linguofreak here, unless there's a detail or implication that I'm missing. It sounds as though you're following through a line of thought in which eggs are a special kind of non-runnable process-like resource, but instead you could model them as an in-memory structure that's successively built up over a number of operations. I don't see why these operations can't be in userspace, while you're at it. You might even fancy doing something like sticking a version number as the struct's first field, then tacking on new fields you think of to the end of the struct. Pass a pointer to the kernel, whamo, you've got a process.

OK, representing an egg as a memory structure was something that did not occur to me. It would solve the problem, except the order of operations is implicit (and sometimes important).

vvaltchev wrote:the whole Windows operating system survived with just CreateProcess() and CreateProcessEx(), using just a variable-sized struct that got extended over time and a bunch of other parameters. By allowing the struct to contain an array of IDs of operations with parameters as well, it should be reasonably enough.

Oh, lots of things "work". Vanilla fork() and execve() should also work. But I am not content with my OS merely working, I want it to excel. Part of the reason I'm writing it is to do away with lots of old bodgy cruft inherited from half a century's UNIX legacy.

But yes, an extensible struct should do the job as well. Although the byte code I have in mind is just a souped up version of an extensible struct. Haven't figured out versioning quite yet, but I think I will go with a system call to query whether a given number is a valid opcode.

vvaltchev wrote:Clearly, a byte-code interpreter is more powerful, but seems to me an overkill to have it just for the spawn case. If, instead, you design the whole kernel to have plenty of customization points using byte-code, that would make more sense to me. Linux has already something like that, called eBPF, and uses it for packet filtering, tracing and security features. So, you might take a look at that to get inspired about how to make a byte-code interpreter that's limited enough to run safely in the kernel.

I have considered several alternatives, and a byte code looks to be the most general solution, while still being feasible to implement. What I'm planning right now will only allow a sequence of operations. No jumps, no loops, no alternatives. If any of the calls fails, the entire fork() fails.

I haven't decided yet on packet filtering. I am pretty sure, however, that I want no part of seccomp, because it breaks the interface to the kernel. I would rather implement something like pledge() for the purpose of limiting the syscall set, with predefined subsets.

If I end up implementing packet filters, I will probably use eBPF for that as well. The byte code I am planning here cannot be used for that because it cannot make decisions. However, bear in mind that eBPF only exists to reduce the bandwidth Wireshark has to swallow. So it is purely an optimization, that I should think is expendable for my use cases (I'm not exactly making the next big Switch OS).

vvaltchev wrote:Also, about fork(): today this syscall can be made very efficient by making not only the pages themselves CoW, but the page tables CoW as well: this makes it significantly faster.

In order to set all pages to CoW, I have to set them read-only in the CPU. This requires dumping the entire user space TLB for the calling process. Even if I set them back to being writable as soon as the child process exits or execs, it is still a major performance impact on the parent. And all just for a call to fork().

I am unsure what you mean by "page tables CoW". But at this time I don't want to implement page table sharing between user space processes (except for the kernel pages, of course). The reason being that I would need to somehow keep track of reference counts for all page tables, whereas right now, every address space is being used by exactly one process.

Korona wrote:In principle, I'm a fan of vfork()s approach:

I'm not. Two processes in one address space, potentially with different credentials, is just asking for trouble. Plus, the vfork() child inherits the signal handlers from the parent, with hilarious results if the user presses CTRL-C at the wrong moment.

Korona wrote:However, vfork() is quite limiting if you need to do non-trivial setup work since you can basically only change some local variables without corrupting the parent's state, and you can't communicate with the parent (since the parent is suspended until execve()). Exploring some extension of vfork() that fixes these two issues would be great.

The historical definition of vfork() essentially said that the child process can only immediately exec or exit, nothing else. Even so much as calling setsid() is beyond spec. To address those limitations, we would need some kind of code, but more limited than machine code, that is self contained and only allows safe operations to be made in the child. If only there was something like that...

So I think I'm still going with the byte code idea. If you really need something complicated to happen in the child, and it cannot be done by adding one more thing to the byte code program, then I guess you will need a helper binary on my OS. Honestly, I'm having trouble imagining such a thing, however. Just do the preparations in the parent.

nexos · Post by **nexos** » Wed Nov 24, 2021 11:06 am

What I plan on doing is writing a spawn function that well accept a linked list of attributes. Each attribute will have a type associated with it, specifying various parameters. This is kind of themed after Windows' CreateProcess() function. IMO, fork is just a relic of the long-forgotten days before multithreading.

A byte code interpreter would be very complex, a variable sized structure or a linked list like what I'm doing would be simpler and easier to maintain.

vvaltchev · Post by **vvaltchev** » Wed Nov 24, 2021 1:56 pm

nullplan wrote:What I'm planning right now will only allow a sequence of operations. No jumps, no loops, no alternatives. If any of the calls fails, the entire fork() fails.

OK, just let me point out that you don't need a byte-code interpreter for a plain sequence of operations: an array of ops would be enough. A byte-code interpreter would be useful if you wanna support a Turing-complete language, with maybe some ad-hoc limitations like limit iterations to 10, limit the memory it can access etc.

nullplan wrote:If I end up implementing packet filters, I will probably use eBPF for that as well. The byte code I am planning here cannot be used for that because it cannot make decisions.

As above, if it cannot make decisions, it's not worth having a custom byte-code "interpreter".

nullplan wrote:In order to set all pages to CoW, I have to set them read-only in the CPU. This requires dumping the entire user space TLB for the calling process. Even if I set them back to being writable as soon as the child process exits or execs, it is still a major performance impact on the parent. And all just for a call to fork().

Well, it requires creating a dedicated page directory and switching to it. That will invalidate the TLB entries, of course.

nullplan wrote:I am unsure what you mean by "page tables CoW".

In the classic fork(), CoW is used only for pages. That means that all the page tables have to be copied in the new page directory for the child. Page tables CoW means that only the 1st level page directory is copied, while the other page tables are shared by making the pages containing them to be CoW. So, when the child writes to a CoW page, the page is copied and, when the kernel code tries to update the page table with the new page at the same vaddr, another CoW page fault is triggered the first time. All of this makes fork() just a bit slower than vfork(), while in the past the difference was more significant. Linux implemented CoW pages tables around 2010.

nullplan · Post by **nullplan** » Wed Nov 24, 2021 1:58 pm

nexos wrote:What I plan on doing is writing a spawn function that well accept a linked list of attributes.

That is basically what I'm intending to do, only I would serialize it into an array instead of a linked list, so I don't need to faff about with "next" pointers and can copy the whole thing in one go.

nexos wrote:A byte code interpreter would be very complex, a variable sized structure or a linked list like what I'm doing would be simpler and easier to maintain.

I'm getting the feeling we're not all on the same page, because everyone keeps telling me byte code interpreters would be complicated when I envision this one to be extremely simple. Idea is this: All numbers are ULEB128, all strings are the string length in ULEB128 followed by the string data, and the program is just a series of opcodes followed by the operands specified for them. End of program is implicit in the end of the array (array size is an argument).

Say, for example, you want to spawn a process in its own process group redirecting stdin from /dev/null, and the opcode for "setpgid" is 5 and takes the PGID as argument, and the opcode for "open" is 1 and takes the target FD, the file name, the open flags and the creation mode (let's just make that last one mandatory), then you can generate the program as: "05 00 01 00 09 /dev/null 02 00".

Specify a string longer than the remaining program or containing null bytes, and you get EINVAL back, as you will if insufficient arguments remain.

Maybe calling it a byte code interpreter was the wrong term for this idea?

linguofreak · Post by **linguofreak** » Wed Nov 24, 2021 3:07 pm

vvaltchev wrote:Also, about fork(): today this syscall can be made very efficient by making not only the pages themselves CoW, but the page tables CoW as well: this makes it significantly faster. Finally, if you really want to avoid fork() without doing anything fancy, you could just implement vfork(): it's much more limited and there's no CoW (it works on no-mmu systems as well).

I think fork() gets a bad rap from Unix's middle age: In the early PDP-11 days, when only one process was in core at a time and all others were swapped out to disk, fork() was quite an elegant solution: just swap the parent process out, but then, instead of loading another process, use the memory image already in RAM as the basis for the new process. As PDP-11 Unix developed and an outgoing process could be held in memory if there was room, it got to be a less elegant solution, but at least the most that would be duplicated on fork() was 64k. Then in the early VAX BSD days, you were duplicating an address space that could be quite large, and fork() was even less well suited as when you were swapping, you weren't even swapping out whole processes anymore, just individual pages, and that's where vfork() was invented. But then things came full circle when CoW was implemented in VAX BSD: a CoW implementation of fork() is once again very elegant.

Now, on hardware with something like z/Architectures access register mechanism, a process could have multiple different address spaces for different things, and could make system calls saying "give me an address space with this executable loaded", "give me an address space with each of these data files loaded", "give me a copy-on-write duplicate of these address spaces", "give me an empty address space". It could then write an initial stack for the language in use into the empty address space, and then make a system call saying "create a process with access to the listed address spaces. Place the stack pointer at this address in this address space, and the instruction pointer at that address in that address space". It could then, at its option, release its own access to the address spaces it used to create the child process, and then, at its option, either continue executing, or make a system call to yield to the new process.

Solar · Post by **Solar** » Thu Nov 25, 2021 4:49 am

Oh, fork() got bad rep from other sources than that. To be implemented, it requires full MMU support by the OS, which was a problem e.g. for Amiga's otherwise excellent ixemul.library: You could emulate virtually all of POSIX' API on the (non-memory-protected) AmigaOS, but not fork(). Back then lots of effort went into convincing various upstreams to replace fork() with less troublesome API calls where possible... it's a hammer, where most often a chisel would do.

nexos · Post by **nexos** » Thu Nov 25, 2021 5:50 am

Solar wrote:To be implemented, it requires full MMU support by the OS

fork() in of itself doesn't require an MMU. The function predated systems like VAX! Of course, without an MMU, fork must copy everything, that is extremely slow and wasteful.

As a side note, the earliest Unices with MMU support (3BSD IIRC) didn't even use CoW mappings for fork! Hence, fork can be implemented just fine without MMU support - it just might take a decade to run.

nullplan wrote:I'm getting the feeling we're not all on the same page, because everyone keeps telling me byte code interpreters would be complicated when I envision this one to be extremely simple.

My linked list idea would be simpler still; one can set fields in a structure instead of having to fiddle with endless magic bytes. Of course, you could use macros, but that's kind of messy still. A linked list would be cleaner.

linguofreak wrote:But then things came full circle when CoW was implemented in VAX BSD: a CoW implementation of fork() is once again very elegant.

IMO, fork is quite inelegant. Spawning a new process is much cleaner and more effective. fork is one of the greatest blunders of Unix and is a relic of the past.

thewrongchristian · Post by **thewrongchristian** » Thu Nov 25, 2021 7:45 pm

nullplan wrote:
Korona wrote:In principle, I'm a fan of vfork()s approach:
I'm not. Two processes in one address space, potentially with different credentials, is just asking for trouble. Plus, the vfork() child inherits the signal handlers from the parent, with hilarious results if the user presses CTRL-C at the wrong moment.

Why would vfork'ed processes have potentially different credentials? The child process is using all the same code as the parent process, so is no more or less vulnerable than a single process in that address space.

And if signals can be a problem in the child, just block signals while you're vfork-ing, no?

nullplan · Post by **nullplan** » Thu Nov 25, 2021 10:44 pm

thewrongchristian wrote:Why would vfork'ed processes have potentially different credentials?

Because they can call setuid(). Separate processes means they can have separate UIDs.

thewrongchristian wrote:And if signals can be a problem in the child, just block signals while you're vfork-ing, no?

Yes, but I can't do that in vfork() itself, it has to be done by the caller in order to transport the prior signal mask to the place where the child has exited. Which the application can of course do, but I have rarely seen programs do it in the wild. It is one more pitfall with the API; one more thing that shows it is a bad API. Using it the right way is hard, and using it the wrong way is easy. That is the opposite of what it should be.

Korona · Post by **Korona** » Fri Nov 26, 2021 2:29 am

Regarding what you're allowed to do after vfork(): due to how __attribute__((returns_twice)) works, you can actually do a lot more than what the man page allows. Basically, you can do everything that you can also do in a if(setjmp()) block.

I agree that the vfork() API is bad / hard to use correctly. I was more talking about its general approach (create process with same address space -> change to a different address space). My suggestion is not to use vfork as-is but to design a better API around the same concept.

OSDev.org

Replacing fork() with egg byte codes

Replacing fork() with egg byte codes

Re: Replacing fork() with egg byte codes

Re: Replacing fork() with egg byte codes

Re: Replacing fork() with egg byte codes

Re: Replacing fork() with egg byte codes

Re: Replacing fork() with egg byte codes

Re: Replacing fork() with egg byte codes

Re: Replacing fork() with egg byte codes

Re: Replacing fork() with egg byte codes

Re: Replacing fork() with egg byte codes

Re: Replacing fork() with egg byte codes

Re: Replacing fork() with egg byte codes

Re: Replacing fork() with egg byte codes

Re: Replacing fork() with egg byte codes

Re: Replacing fork() with egg byte codes