codyd51 wrote:
I think I'm in agreement with linguofreak here, unless there's a detail or implication that I'm missing. It sounds as though you're following through a line of thought in which eggs are a special kind of non-runnable process-like resource, but instead you could model them as an in-memory structure that's successively built up over a number of operations. I don't see why these operations can't be in userspace, while you're at it. You might even fancy doing something like sticking a version number as the struct's first field, then tacking on new fields you think of to the end of the struct. Pass a pointer to the kernel, whamo, you've got a process.
OK, representing an egg as a memory structure was something that did not occur to me. It would solve the problem, except that the order of operations would then be implicit, and that order is sometimes important.
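For reference, the versioned struct codyd51 describes could look roughly like the sketch below. Every name in it (spawn_args_v1, spawn_args_v2, spawn_args_normalize, the field set) is made up for illustration and is not taken from any real kernel. It assumes the version number comes first and new fields are only ever appended, so an old caller's struct is a prefix of the new one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: nothing here comes from a real kernel ABI. */
struct spawn_args_v1 {
    uint32_t version;       /* always the first field */
    uint32_t size;          /* bytes the caller actually filled in */
    const char *path;       /* program to execute */
    int32_t stdio[3];       /* fds for stdin, stdout, stderr */
};

struct spawn_args_v2 {
    struct spawn_args_v1 v1;    /* old fields keep their offsets */
    uint32_t new_session;       /* appended in version 2 */
};

/* Kernel-side view: copy what the caller supplied, zero the rest.
 * Returns 0 on success, -1 for a version/size pair the kernel rejects. */
int spawn_args_normalize(const void *user, struct spawn_args_v2 *out)
{
    uint32_t hdr[2];    /* hdr[0] = version, hdr[1] = size */
    size_t expected;

    assert(user != NULL && out != NULL);
    memcpy(hdr, user, sizeof hdr);

    switch (hdr[0]) {
    case 1: expected = sizeof(struct spawn_args_v1); break;
    case 2: expected = sizeof(struct spawn_args_v2); break;
    default: return -1;             /* newer than this kernel knows */
    }
    if (hdr[1] != expected)
        return -1;

    memset(out, 0, sizeof *out);    /* fields the caller lacks default to 0 */
    memcpy(out, user, expected);
    return 0;
}
```

Note that nothing in such a struct says whether the new session is created before or after the descriptors are set up, which is exactly the implicit ordering I mean.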
vvaltchev wrote:
the whole Windows operating system survived with just CreateProcess() and CreateProcessEx(), using just a variable-sized struct that got extended over time and a bunch of other parameters. By allowing the struct to contain an array of IDs of operations with parameters as well, it should be reasonable enough.
Oh, lots of things "work". Vanilla fork() and execve() should also work. But I am not content with my OS merely working, I want it to excel. Part of the reason I'm writing it is to do away with lots of old bodgy cruft inherited from half a century's UNIX legacy.
But yes, an extensible struct should do the job as well, although the byte code I have in mind is just a souped-up version of an extensible struct. I haven't figured out versioning quite yet, but I think I will go with a system call that queries whether a given number is a valid opcode.
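A rough sketch of how that opcode query could work, with the caveat that the opcode names and numbers below are invented for the example and the real set is not decided. egg_opcode_valid stands in for what the system call would answer inside the kernel, and egg_program_supported is a hypothetical userland helper that probes every opcode before submitting a program.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode numbers; none of these are settled. */
enum egg_opcode {
    EGG_OP_DUP_FD = 1,
    EGG_OP_CHDIR  = 2,
    EGG_OP_SETSID = 3,
    EGG_OP_END      /* one past the highest opcode this kernel knows */
};

/* What the query system call could answer on the kernel side. */
bool egg_opcode_valid(uint32_t op)
{
    return op >= 1 && op < EGG_OP_END;
}

/* Userland helper: true only if the running kernel knows every opcode. */
bool egg_program_supported(const uint32_t *ops, size_t n)
{
    assert(ops != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (!egg_opcode_valid(ops[i]))
            return false;
    return true;
}
```

A program built against a newer header can then fall back to something else when the kernel it runs on is older, without a global version number.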
vvaltchev wrote:
Clearly, a byte-code interpreter is more powerful, but it seems like overkill to me to have it just for the spawn case. If, instead, you design the whole kernel to have plenty of customization points using byte-code, that would make more sense to me. Linux already has something like that, called eBPF, and uses it for packet filtering, tracing and security features. So, you might take a look at that to get inspired about how to make a byte-code interpreter that's limited enough to run safely in the kernel.
I have considered several alternatives, and a byte code looks to be the most general solution, while still being feasible to implement. What I'm planning right now will only allow a sequence of operations. No jumps, no loops, no alternatives. If any of the calls fails, the entire fork() fails.
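To make that concrete, here is a minimal sketch of such a straight-line interpreter. The opcodes, the egg_insn encoding and the egg_state struct (a toy stand-in for the half-built child) are all assumptions made up for this post. The one design point it shows is the all-or-nothing rule: the program runs against a scratch copy, and the child state is only replaced if every instruction succeeded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { OP_DUP_FD = 1, OP_CLOSE_FD = 2, OP_SETSID = 3 };  /* hypothetical */

struct egg_insn { uint32_t op; int32_t a, b; };

#define EGG_MAX_FD 8
struct egg_state {              /* toy stand-in for the half-built child */
    int fd_open[EGG_MAX_FD];
    int new_session;
};

static int fd_ok(int32_t fd) { return fd >= 0 && fd < EGG_MAX_FD; }

/* Execute one instruction; returns 0 on success, -1 on failure. */
int egg_step(struct egg_state *st, const struct egg_insn *in)
{
    switch (in->op) {
    case OP_DUP_FD:             /* a = source fd, b = target fd */
        if (!fd_ok(in->a) || !fd_ok(in->b) || !st->fd_open[in->a])
            return -1;
        st->fd_open[in->b] = 1;
        return 0;
    case OP_CLOSE_FD:           /* a = fd to close */
        if (!fd_ok(in->a) || !st->fd_open[in->a])
            return -1;
        st->fd_open[in->a] = 0;
        return 0;
    case OP_SETSID:
        st->new_session = 1;
        return 0;
    default:
        return -1;              /* unknown opcode */
    }
}

/* No jumps, no loops: run front to back on a scratch copy and
 * commit only if every step succeeded. */
int egg_run(struct egg_state *child, const struct egg_insn *prog, size_t n)
{
    struct egg_state scratch;

    assert(child != NULL && (prog != NULL || n == 0));
    scratch = *child;
    for (size_t i = 0; i < n; i++)
        if (egg_step(&scratch, &prog[i]) != 0)
            return -1;          /* child state is left untouched */
    *child = scratch;
    return 0;
}
```

Since the instructions are executed in the order they appear, the ordering problem of the plain struct goes away for free.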
I haven't decided yet on packet filtering. I am pretty sure, however, that I want no part of seccomp, because it breaks the interface to the kernel. I would rather implement something like pledge() for the purpose of limiting the syscall set, with predefined subsets.
If I end up implementing packet filters, I will probably use eBPF for that. The byte code I am planning here cannot be used for it because it cannot make decisions. However, bear in mind that for packet capture, BPF mainly exists to reduce the volume of traffic a tool like Wireshark has to swallow. For that use it is purely an optimization, which I should think is expendable for my use cases (I'm not exactly making the next big Switch OS).
vvaltchev wrote:
Also, about fork(): today this syscall can be made very efficient by making not only the pages themselves CoW, but the page tables CoW as well: this makes it significantly faster.
In order to set all pages to CoW, I have to mark them read-only in the page tables. This requires flushing every user space TLB entry for the calling process. Even if I set them back to being writable as soon as the child process exits or execs, it is still a major performance impact on the parent. And all just for a call to fork().
I am unsure what you mean by "page tables CoW". But at this time I don't want to implement page table sharing between user space processes (except for the kernel pages, of course). The reason is that I would need to somehow keep track of reference counts for all page tables, whereas right now, every address space is used by exactly one process.
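If I read the suggestion correctly, it would mean something like the sketch below, which is my guess at the idea and not how Linux or any other kernel actually does it. The structs are toy userspace stand-ins (a four-entry table, malloc() instead of a page allocator), and page-level CoW of the mapped pages themselves is left out entirely. It does show the bookkeeping I would be signing up for: a reference count per page table.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PT_ENTRIES 4            /* tiny, for illustration only */

struct page_table {
    unsigned refcount;          /* address spaces pointing at this table */
    unsigned long pte[PT_ENTRIES];
};

struct pde {                    /* one directory slot of one address space */
    struct page_table *pt;
    int writable;
};

/* fork(): the child's slot points at the parent's table, nothing is copied. */
void pde_share(struct pde *parent, struct pde *child)
{
    assert(parent != NULL && child != NULL && parent->pt != NULL);
    parent->pt->refcount++;
    parent->writable = 0;       /* both sides now fault on the first write */
    child->pt = parent->pt;
    child->writable = 0;
}

/* Write fault below this slot: take a private table if it is still shared.
 * Returns 0 on success, -1 if no memory is available for the copy. */
int pde_write_fault(struct pde *e)
{
    if (e->pt->refcount > 1) {
        struct page_table *copy = malloc(sizeof *copy);
        if (copy == NULL)
            return -1;
        memcpy(copy->pte, e->pt->pte, sizeof copy->pte);
        copy->refcount = 1;
        e->pt->refcount--;
        e->pt = copy;
    }
    e->writable = 1;
    return 0;
}
```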
Korona wrote:
In principle, I'm a fan of vfork()'s approach:
I'm not. Two processes in one address space, potentially with different credentials, is just asking for trouble. Plus, the vfork() child inherits the signal handlers from the parent, with hilarious results if the user presses CTRL-C at the wrong moment.
Korona wrote:
However, vfork() is quite limiting if you need to do non-trivial setup work since you can basically only change some local variables without corrupting the parent's state, and you can't communicate with the parent (since the parent is suspended until execve()). Exploring some extension of vfork() that fixes these two issues would be great.
The historical definition of vfork() essentially said that the child process can only immediately exec or exit, nothing else. Even so much as calling setsid() is beyond spec. To address those limitations, we would need some kind of code, but more limited than machine code, that is self-contained and only allows safe operations to be made in the child. If only there were something like that...
So I think I'm still going with the byte code idea. If you really need something complicated to happen in the child, and it cannot be done by adding one more thing to the byte code program, then I guess you will need a helper binary on my OS. Honestly, though, I'm having trouble imagining such a thing. Just do the preparations in the parent.