Memory Management in Micro-Kernels (mmap and execve)

Barry · Post by **Barry** » Sun May 08, 2022 4:59 pm

Hey everyone,

I've been thinking quite a bit about this lately, so I thought I'd ask how others have tackled the issue. Apologies if this fits better in General Ramblings, it's meant to be about design.

I am writing a new micro-kernel and I've just reached the section on memory management. Previously, I just had a very simple manager in the ukernel which just allocated page frames to a process' address spaces. I never supported mmap(), because I don't want the kernel to have to send calls to user-space directly. In my head only user programs should IPC to other user programs, and the kernel should just facilitate this. To load a file into memory (for an execve()), I was literally calling read() on the FS server. My method of executing a file was extremely bad, and involved the kernel replacing the current program with an "exec server" which would then contact FS and load the file and jump to the entry point.

I want to do something better. I want proper memory management, with different regions of memory - something higher level than raw pages. I want an mmap() that can load files, anonymous regions and memory shares, and I want an execve() that doesn't need a special loader program. I've come up with all the technical designs, but I don't know where to put it. I can't figure out quite how it works without something being in the kernel - which I don't really want.

If the MM (memory manager) is a server, then it can cleanly request files from FS in an mmap(), but won't be able to force another process to jump to the entry point in an execve().
If the MM is in the kernel, then it can't cleanly request from the FS (kernel shouldn't be IPCing to a program directly), but it can force the entry point jump.
If both the MM and FS are in the kernel, then it's not really a micro-kernel anymore.

The only solutions I can think of are:

- have a special syscall just for the MM server, that lets it modify any PCB (to set the EIP)
- have the MM server run in ring-0 but still as its own process, with its own address space
- map the kernel data structure into MM's address space as RW/User/Present
- have both as servers, make execve() occur in my libc, and have it mmap() the new file in first, then jump itself to the entry point. The only issue is if the libc gets (re)moved in memory as part of the mmap() (this is why it's normally in the kernel)

All of these seem to have their own issues with them though.

Does anyone have any better solutions? (I'm sure there are many)
Has anybody got cautionary tales from when they tried to do something similar?
It seems like the MM and FS are pretty closely intertwined - how have people separated them into different servers?

Thanks,
Barry

thewrongchristian · Post by **thewrongchristian** » Mon May 09, 2022 9:45 am

I've never understood the idea of removing paging entirely from a microkernel. Handling CPU page faults is inherently a kernel task, and fixing up mappings similarly so.

What isn't necessarily a micro-kernel task is loading the page that will be mapped. This is where the separation needs to come in.

So, in the (micro)kernel, you'll need a page fault handler, which maps the virtual address to some sort of region, and that region will need some sort of handler to load missing pages.

The microkernel page fault handler then becomes a matter of looking up the region, getting its handler, and asking that handler to provide the details of the page that will service the page fault in this region at the given offset.

Now, this handler would be some sort of memory segment driver, which in a monolithic kernel might ultimately resolve to a file system driver that will load the corresponding page from disk. Or it may be an anonymous page driver that provides fresh zero-filled pages for new mappings, or loads existing pages from swap for existing swapped data.

But the point is the (micro)kernel doesn't need to know these details. It just has a region, and a page offset, which it defers to something else to handle.

This is where your user level paging or file system kicks in. The page fault handler can just send a message to your user process to load the page desired however it should be handled. The process then just returns the possibly newly read page to the page fault handler, which maps it into the virtual address required, and returns from the page fault handler.
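The lookup-then-delegate flow described above can be sketched in C roughly as follows. This is only an illustration: `struct region`, `pager_fn`, `handle_page_fault` and `anon_pager` are hypothetical names, and the pager is modelled as a direct function call, whereas in a real microkernel it would be a message to a user-space server with the faulting thread blocked until the reply arrives.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Hypothetical pager callback: returns the physical frame backing
 * `page_index` within the region, or 0 if it cannot supply one. */
typedef uintptr_t (*pager_fn)(void *ctx, size_t page_index);

struct region {
    uintptr_t start;   /* first virtual address covered (page aligned) */
    size_t    length;  /* size in bytes */
    pager_fn  pager;   /* who supplies missing pages */
    void     *ctx;     /* opaque handle handed back to the pager */
};

struct address_space {
    struct region *regions;
    size_t         count;
};

/* Find the region containing `addr`, or NULL if nothing is mapped there. */
static struct region *find_region(struct address_space *as, uintptr_t addr)
{
    for (size_t i = 0; i < as->count; i++) {
        struct region *r = &as->regions[i];
        if (addr >= r->start && addr - r->start < r->length)
            return r;
    }
    return NULL;
}

/* Resolve a fault: returns the frame to map at `addr`, or 0 if the
 * access is invalid. The kernel never needs to know what backs the page. */
static uintptr_t handle_page_fault(struct address_space *as, uintptr_t addr)
{
    struct region *r = find_region(as, addr);
    if (r == NULL)
        return 0;
    size_t page_index = (addr - r->start) / PAGE_SIZE;
    return r->pager(r->ctx, page_index);
}

/* Example pager for anonymous memory: hands out frames from a simple
 * bump counter stored in `ctx` (an assumption made for this sketch). */
static uintptr_t anon_pager(void *ctx, size_t page_index)
{
    (void)page_index;
    uintptr_t *next_frame = ctx;
    uintptr_t frame = *next_frame;
    *next_frame += PAGE_SIZE;
    return frame;
}
```

A file-backed region would use the same `struct region` with a different `pager`, which is the point of the abstraction: the fault handler itself stays unchanged.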

So, for your microkernel abstraction, your mmap will create a microkernel region, that will be used as a handle to whatever provides the missing pages to be mapped. Your virtual address space manager can live in the microkernel, the virtual address space and set of regions mapped therein are sufficiently abstract already.

Your exec can then live in user space, creating the regions in the exec client address space as required (pointing them however you want to your file system code that does the actual page read/write), then once you've set up the regions, set the client process in motion by pointing it at the entry point.

The other thing I also struggle with microkernel wise is in separation of MM/FS. Minix provided a separate MM and FS, which I never understood as it would make page faults horribly inefficient.

I'm personally of the opinion that the mechanism of MM (such as page fault handling and mapping) lives entirely within the kernel. So long as the page desired is known to the kernel, there is no reason why we should transition to user space or filesystem code to resolve a page fault. That would also imply the MM handles or knows about the filesystem caching, caching by file identity/file offset, allowing the resolution of page faults to pages that are in the cache without filesystem-specific intervention code. So long as your management of working set pages is handled well in your kernel, you won't be reaching out to the filesystem code that often, and when you do, it's likely that you'll be doing I/O to resolve the fault anyway, so a trip to user space is not a big deal in the grand scheme of things.
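A minimal sketch of a cache keyed by file identity and offset, to make the idea concrete. Everything here is an assumption for illustration: the names, the fixed 64-slot direct-mapped table, and the toy hash. A real kernel would use a proper hash table or per-file tree and would handle eviction and locking.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 64u
#define PAGE_SIZE   4096u

struct cached_page {
    uint64_t  file_id;     /* filesystem-agnostic file identity */
    uint64_t  page_index;  /* file offset / PAGE_SIZE */
    uintptr_t frame;       /* physical frame holding the data; 0 = empty slot */
};

struct page_cache {
    struct cached_page slots[CACHE_SLOTS];
};

/* Toy hash: picks the single slot a (file, page) pair may occupy. */
static size_t slot_for(uint64_t file_id, uint64_t page_index)
{
    return (size_t)((file_id * 31u + page_index) % CACHE_SLOTS);
}

/* Returns the cached frame, or 0 on a miss. Only on a miss does the
 * kernel have to call out to the filesystem code. */
static uintptr_t cache_lookup(const struct page_cache *pc,
                              uint64_t file_id, uint64_t offset)
{
    uint64_t page_index = offset / PAGE_SIZE;
    const struct cached_page *p = &pc->slots[slot_for(file_id, page_index)];
    if (p->frame != 0 && p->file_id == file_id && p->page_index == page_index)
        return p->frame;
    return 0;
}

/* Record a freshly loaded page, replacing whatever occupied the slot. */
static void cache_insert(struct page_cache *pc, uint64_t file_id,
                         uint64_t offset, uintptr_t frame)
{
    uint64_t page_index = offset / PAGE_SIZE;
    struct cached_page *p = &pc->slots[slot_for(file_id, page_index)];
    p->file_id = file_id;
    p->page_index = page_index;
    p->frame = frame;
}
```

The page fault handler would call `cache_lookup` first and map the result directly on a hit, so the filesystem is only involved when real I/O is needed.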

That's what I'm aiming at in my kernel. It's not a microkernel, it still presents as a monolithic kernel in that the system call interface is handled by the kernel proper.

But my MM/VFS is integrated to cache exclusively by file identity/offset, which the page fault handling can use without filesystem intervention, and my file system drivers have the option to be moved to user space, where they'll only be invoked for the relatively slow operations of doing actual I/O.

Barry · Post by **Barry** » Mon May 09, 2022 10:51 am

Thanks - that's a nice little insight into how it works for your system. I'll probably end up taking quite a bit of inspiration from it.
It sounds like you opted for having the MM in the kernel and letting it spit out calls to servers directly. I'm starting to think this is the best approach. I've been debating putting MM and the VFS in the kernel completely, but having all the FS drivers still as their own servers - or maybe I'll just switch to a mono-kernel.
I'm just struggling to understand how a completely separate process can update a different process - only the kernel should be able to do this, right?

I know exactly what you mean about Minix though, it's been confusing me for the last few weeks straight. The Minix model for exec / page-fault handling seems really inefficient. It just seems like there will always be some things that have to occur outside of an isolated "process". Obviously they're still in a process, but they're running as the calling process, not their own thing.

Is having the micro-kernel send messages itself normal for micro-kernel designs? I always assumed only user-mode programs should.

Thanks again,
Barry

Ethin · Post by **Ethin** » Mon May 09, 2022 11:16 am

I mean, if I can figure out paging and all that properly (the paging wiki article uses graphics all over the place and for some reason the Intel manuals just confuse me), I plan to let the kernel send messages to processes via IPC or something else. I don't see why it shouldn't be okay. There's no reason to strictly follow the microkernel or unikernel design; it's your OS and you should feel free to build it however you like. If you want to mix microkernel and monokernel together, go for it. If you want to follow a design strictly, go for it.

Barry · Post by **Barry** » Mon May 09, 2022 12:13 pm

Ethin wrote: I don't see why it shouldn't be okay. There's no reason to strictly follow the microkernel or unikernel design; it's your OS and you should feel free to build it however you like. If you want to mix microkernel and monokernel together, go for it. If you want to follow a design strictly, go for it.

This is a very good point, and I think I'll just put MM in the kernel and let it send messages itself. This seems to be what most other people are doing.

Ethin wrote:I mean, if I can figure out paging and all that properly (the paging wiki article uses graphics all over the place and for some reason the Intel manuals just confuse me)

If you haven't taken a look at recursive paging before, it really simplifies it all, even though it probably takes a good bit of thinking to understand initially.

Quick rundown of paging on 32-bit x86 (pretty sure everyone has trouble understanding it at first, you're not alone at all):
- CR3 holds the physical address of one page (4096 bytes).
- That page (the page directory) holds 1024 entries, each containing the physical address of another page (plus flag bits).
- Each of those pages (the page tables) holds 1024 entries, each containing the physical address of a page frame (plus flag bits).
- Those frames appear in virtual memory at an address determined by their position in the page tables / directory overall, e.g.
  - table#0, page#0 is at 0x00000000
  - table#0, page#1 is at 0x00001000
  - table#0, page#1023 is at 0x003ff000
  - table#1, page#0 is at 0x00400000
- Each page is accessed by reading/writing at that virtual address, but the access actually goes to whatever physical address is stored in the corresponding page table entry.

Hopefully that helps a little, and best of luck with paging. Recursive paging is very similar: you just point the last page directory entry at the page directory itself, so that the last 4MB of the address space becomes a contiguous, in-order view of all the page table entries, and the last 4KB of that is the page directory. Look into it further if you want to.
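The address arithmetic in the rundown above can be written out in C. This is a sketch for 32-bit x86 with 4 KiB pages and no PAE. The helper names are made up for illustration, and the recursive helpers assume the last page directory entry (index 1023) points back at the directory.

```c
#include <stdint.h>

#define PAGE_SIZE      4096u
#define RECURSIVE_BASE 0xFFC00000u  /* start of the last 4 MiB of the address space */

/* Virtual address of the first byte of page `page` in page table `table`. */
static uint32_t page_vaddr(uint32_t table, uint32_t page)
{
    return (table << 22) | (page << 12);
}

/* Split a virtual address into the three parts the MMU uses. */
static uint32_t dir_index(uint32_t vaddr)   { return vaddr >> 22; }
static uint32_t table_index(uint32_t vaddr) { return (vaddr >> 12) & 0x3ffu; }
static uint32_t page_offset(uint32_t vaddr) { return vaddr & 0xfffu; }

/* With recursive paging, page table `table` is visible at this virtual
 * address. Table 1023 is the page directory itself. */
static uint32_t recursive_table_vaddr(uint32_t table)
{
    return RECURSIVE_BASE + table * PAGE_SIZE;
}

/* Virtual address of the 4-byte page table entry that maps `vaddr`. */
static uint32_t recursive_pte_vaddr(uint32_t vaddr)
{
    return RECURSIVE_BASE + (vaddr >> 12) * 4u;
}
```

For example, `page_vaddr(1, 0)` is 0x00400000, matching the list above, and `recursive_table_vaddr(1023)` is 0xFFFFF000, the last 4KB of the address space.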

thewrongchristian · Post by **thewrongchristian** » Mon May 09, 2022 3:56 pm

Barry wrote:Thanks - that's a nice little insight into how it works for your system. I'll probably end up taking quite a bit of inspiration from it.
It sounds like you opted for having the MM in the kernel and letting it spit out calls to servers directly. I'm starting to think this is the best approach. I've been debating putting MM and the VFS in the kernel completely, but having all the FS drivers still as their own servers - or maybe I'll just switch to a mono-kernel.
I'm just struggling to understand how a completely separate process can update a different process - only the kernel should be able to do this, right?

gdb can trace other processes for debugging purposes. Same thing here. It's not unreasonable for related processes to be able to manipulate each other's state, security permitting.

Microkernels just take that a step further. Setting up a child state could even be done using the same messages used to manipulate a process' own resources. For example, a mmap syscall message might be something like:

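#include <sys/types.h>  /* pid_t, size_t, off_t */

struct dommap {
  pid_t pid;      /* process whose address space is being changed */
  void * addr;    /* requested virtual address */
  size_t length;  /* size of the mapping in bytes */
  int prot;       /* protection flags */
  int flags;      /* mapping flags */
  int fd;         /* file to map, if any */
  off_t offset;   /* offset into that file */
};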

Now, you can use that same message format to set up your own mmap (using your own pid) or a new process being exec'd (using that process' pid).

Starting the exec'd process off as well could be a message sent to that process. The kernel might have a core of simple message primitives that it can interpret on the process's behalf, to manipulate state (such as mapping data, setting register state) that could be used not only for debugging, but the entire exec state for a process.
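As a rough illustration of reusing that one message format for exec, here is a C sketch that turns a list of program segments into `dommap` messages aimed at a child pid. `struct segment` and `build_exec_maps` are hypothetical, the `PROT_*`/`MAP_*` constants are borrowed from POSIX `<sys/mman.h>` purely for convenience, and actually sending the messages is left out because the IPC primitive is system-specific.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

struct dommap {
    pid_t  pid;
    void  *addr;
    size_t length;
    int    prot;
    int    flags;
    int    fd;
    off_t  offset;
};

/* Hypothetical description of one loadable segment of an executable. */
struct segment {
    uintptr_t vaddr;        /* where the segment must appear */
    size_t    length;       /* bytes to map */
    int       prot;         /* PROT_* bits */
    off_t     file_offset;  /* where the segment starts in the file */
};

/* Fill `out` with one mmap message per segment, all targeting `target`.
 * Passing the caller's own pid gives an ordinary mmap; passing a child's
 * pid sets up that child's address space instead. Returns the count. */
static size_t build_exec_maps(pid_t target, int fd,
                              const struct segment *segs, size_t nsegs,
                              struct dommap *out)
{
    for (size_t i = 0; i < nsegs; i++) {
        out[i].pid    = target;
        out[i].addr   = (void *)segs[i].vaddr;
        out[i].length = segs[i].length;
        out[i].prot   = segs[i].prot;
        out[i].flags  = MAP_PRIVATE | MAP_FIXED;  /* exact address, copy-on-write */
        out[i].fd     = fd;
        out[i].offset = segs[i].file_offset;
    }
    return nsegs;
}
```

The only difference between mapping for yourself and mapping for a process being exec'd is the value of `pid`, which is what makes a single message format sufficient.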

Barry wrote: I know exactly what you mean about Minix though, it's been confusing me for the last few weeks straight. The Minix model for exec / page-fault handling seems really inefficient. It just seems like there will always be some things that have to occur outside of an isolated "process". Obviously they're still in a process, but they're running as the calling process, not their own thing.

I might be coloured by my memory of Minix. Going from memory, when a process forks, a message is sent to the MM to do any per-process work, and also to the FS, for it to do per-process work. Now, I can understand that a process has some per-process state, but that's merely a thin veneer of open file descriptor information. Open file state itself is not per-process, at least in UNIX, and that information is file system agnostic.

The actual FS state for an open file, such as tracking allocation data and inodes etc. is very much process agnostic. Files don't live in the context of a process (again, in UNIX at least) so it seems strange that a FS server would need to know when a process has forked and to track such information.

The other thing to remember is that much microkernel research was done on architectures that made address space isolation much cheaper than contemporary x86 designs. I think AST could envision RISC surging ahead in units shipped in the 1990s, and the MMU design of architectures like MIPS and SPARC had address space ids built in from the start, so switching address spaces wasn't nearly as expensive as the CR3 update the i386 required (with its corresponding complete TLB flush.)

Unfortunately, Microsoft DOS and later Windows put paid to that rosy future, and PC operating systems lagged probably 10-15 years as a result, as well as tying us to x86 compatibility in the long term.

Barry wrote: Is having the micro-kernel send messages itself normal for micro-kernel designs? I always assumed only user-mode programs should.

I don't see why a kernel cannot send messages. They're just packets of data to convey intent or information, so it makes sense for the kernel to do that in the same way any other message sender would. Then you just need a single mechanism to receive and act on those messages.

Barry · Post by **Barry** » Mon May 16, 2022 2:34 pm

Small update:
I've ended up putting the Memory Manager in the kernel, and the VFS is still a server. At a later point I may move MM into a server, or VFS into the kernel depending how my system changes. I'm opting to have some user-mode pagers (necessary when the VFS is in user-space), but anonymous memory is still handed out by the kernel, and it calls out to the disk driver server. It's a little bit clunky, but it works. I had an idea to either just have an actual swap server, use a swap file, or some kind of hybrid file system that stores the actual memory regions on disk (backing each region with its own file). The only thing I actually have to be careful about is that the kernel doesn't move the DISK/VFS servers into swap, since then it won't be able to call them to recover them.
To unify my interfaces a bit, and to make it easy to move stuff into / out of the kernel, I've decided that my MM calls (such as mmap) are going to be messages that you can send to the kernel, rather than actual interrupts. This just keeps my syscall list small, and saves me from rewriting a lot of lines if I decide later to put MM in userspace or VFS in the kernel.

thewrongchristian wrote: Your exec can then live in user space, creating the regions in the exec client address space as required (pointing them however you want to your file system code that does the actual page read/write), then once you've set up the regions, set the client process in motion by pointing it at the entry point.

This is a good idea, and I'm actually tempted to make exec its own server, having re-read this. That'd be really nice because then I wouldn't have something bulky like ELF parsing in the kernel (in case you couldn't tell, I really hate putting stuff in the kernel). I guess I'd just need to do some privilege checking on the caller of mmap() to make sure a malicious program isn't changing another program's memory layout. I just don't like the idea of having a feature, i.e. the id parameter in mmap(), just for a single program to use.
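One possible shape for that privilege check, sketched in C. The policy shown (a process may always map into itself, and a designated exec server may map into a process that has not started running yet) is just an assumption for illustration, as are `struct proc` and `may_map_into`. Other policies, such as capability handles, would work too.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical per-process record kept by the kernel. */
struct proc {
    pid_t pid;
    bool  started;         /* has this process run any instructions yet? */
    bool  is_exec_server;  /* granted at boot to the trusted loader only */
};

/* May `caller` change the memory layout of `target`? */
static bool may_map_into(const struct proc *caller, const struct proc *target)
{
    if (caller->pid == target->pid)
        return true;   /* mapping into your own address space is always fine */
    if (caller->is_exec_server && !target->started)
        return true;   /* loader preparing a process that is not yet running */
    return false;      /* everything else is a cross-process write: refuse */
}
```

Refusing once the target has started means even the exec server cannot rearrange a running program's memory under this policy.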

Thanks for the ideas,
Barry

AndrewAPrice · Post by **AndrewAPrice** » Tue Oct 11, 2022 6:27 pm

I'm implementing memory mapped IO in my microkernel OS now. I also have the same goal of making the kernel as generic as possible. My kernel doesn't know what a file is, nor should it matter.

I've had shared memory for a while. One process can create a buffer and send the buffer's ID to another process, and that other process can then join the buffer. I use this for communicating the contents of the window and screen between applications, the window manager, and the screen driver. I also use this for sending the contents of a file between applications, the VFS, and the disk driver.

The kernel can also already send messages to processes. These are messages such as a timer has elapsed, an interrupt has occurred (for drivers), another process you were monitoring has died, etc.

So this is my plan to support read-only memory mapped files:

I'm extending my shared memory buffers to support lazily loaded buffers.

The syscall to create a buffer now has a few extra parameters: a flag to say it's lazily allocated, and a flag to say whether anyone else joining is allowed write access (so we can share a single file between multiple programs). If it is lazily allocated, there's also a parameter for the ID the kernel should put on the messages it sends to the creator to say "someone wants a page".

In my exception handler, I added some extra logic if a page fault happens. If the page falls in where a shared buffer should be mapped but it isn't allocated, I can do either:

1) if you're the owner (or the owner no longer exists), allocate the page, and notify anyone waiting for the page that it now exists
2) if you're not the owner, sleep the thread until the page exists

When a process wants to read from a memory mapped file, it expects that as soon as it wakes from the page fault, the page is fully populated with the file's contents. To do that, I've added a new syscall that takes:
- a currently mapped page
- the buffer to assign it to
- the index in the buffer to assign it to
And it moves the virtual address of that page into the buffer, and notifies anyone waiting that the page is now ready.

Now, if a program wants to read a part of the memory mapped file that hasn't been mapped:
- the program will page fault and sleep the thread,
- a message will be dispatched from the kernel to the creator (the VFS) to say someone is waiting for this particular page of this shared buffer
- the VFS will grab a blank page
- the VFS will fill the page with the file's contents
- the VFS will use the new syscall to move that page into the shared buffer
- the kernel will wake up the sleeping thread that's waiting for that shared buffer page
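The fault/fill/wake sequence above can be modelled as a small state machine. The C sketch below is an illustration under assumptions, not the actual kernel code: the names, the fixed-size arrays, and the choice to notify the owner only for the first waiter on a page are all invented here, and the "owner no longer exists" case is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BUFFER_PAGES 8u
#define MAX_WAITERS  4u

struct lazy_buffer {
    int       owner;                             /* process that fills pages, e.g. the VFS */
    uintptr_t frame[BUFFER_PAGES];               /* 0 = page not populated yet */
    int       waiter[BUFFER_PAGES][MAX_WAITERS]; /* ids of sleeping threads */
    size_t    waiters[BUFFER_PAGES];             /* how many are sleeping per page */
};

enum fault_result { FAULT_RESOLVED, FAULT_ALLOCATE, FAULT_SLEEP, FAULT_BAD };

/* Called from the page fault handler. Sets *notify_owner when the kernel
 * should send the "someone wants a page" message to the owner. */
static enum fault_result on_fault(struct lazy_buffer *b, int pid, int tid,
                                  size_t page, bool *notify_owner)
{
    *notify_owner = false;
    if (page >= BUFFER_PAGES)
        return FAULT_BAD;
    if (b->frame[page] != 0)
        return FAULT_RESOLVED;           /* already populated: just map it */
    if (pid == b->owner)
        return FAULT_ALLOCATE;           /* owner faulted: kernel allocates the page */
    if (b->waiters[page] == MAX_WAITERS)
        return FAULT_BAD;                /* sketch limitation: waiter list is full */
    *notify_owner = (b->waiters[page] == 0);  /* assumption: message once per page */
    b->waiter[page][b->waiters[page]++] = tid;
    return FAULT_SLEEP;
}

/* The "move this page into the buffer" syscall. Copies the ids of the
 * threads to wake into `woken` and returns how many there are. */
static size_t assign_page(struct lazy_buffer *b, size_t page,
                          uintptr_t frame, int *woken)
{
    if (page >= BUFFER_PAGES || frame == 0 || b->frame[page] != 0)
        return 0;
    b->frame[page] = frame;
    size_t n = b->waiters[page];
    for (size_t i = 0; i < n; i++)
        woken[i] = b->waiter[page][i];
    b->waiters[page] = 0;
    return n;
}
```

A thread that faults after `assign_page` has run gets `FAULT_RESOLVED` and never sleeps, so the owner is only involved the first time each page is touched.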

I'm adding the ability for the kernel to tell a program what flags a buffer was created with, so you can't crash the window manager by saying "this is my window's contents, but it's lazily allocated and I'm never going to fill it".

AndrewAPrice · Post by **AndrewAPrice** » Fri Oct 14, 2022 10:05 pm

I finished implementing what I described above. Memory-mapped file I/O is much faster than read/seek (roughly a 100x speedup loading fontconfig: it went from many minutes to a few seconds). That's mainly because I use musl, and musl reads 1024 bytes of a file into a buffer, but seeking clears the buffer. fontconfig reads a few bytes (which triggers 1 KB of data being read), seeks a few bytes forward (clearing the buffer), reads a few bytes, and repeats.
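To see why that access pattern hurts, here is a back-of-the-envelope model in C. It only counts bytes requested from the file server under the behaviour described above (a 1024-byte refill after every seek) versus one load per touched 4 KiB page. It is a simplified estimate with made-up function names, not a measurement of musl or of any real system, and it ignores per-request IPC overhead.

```c
#include <stddef.h>

#define STDIO_BUF 1024u
#define PAGE_SIZE 4096u

/* read/seek path: every seek discards the buffer, so each small read
 * refills all STDIO_BUF bytes. */
static size_t bytes_via_read_seek(size_t records)
{
    return records * STDIO_BUF;
}

/* mapped path: each touched page is loaded once. Assumes the stride
 * (read_len + skip_len) is smaller than a page, so every page in the
 * span is touched. */
static size_t bytes_via_mapping(size_t records, size_t read_len, size_t skip_len)
{
    if (records == 0 || read_len == 0)
        return 0;
    size_t last_byte = (records - 1) * (read_len + skip_len) + read_len;
    size_t pages = (last_byte + PAGE_SIZE - 1) / PAGE_SIZE;
    return pages * PAGE_SIZE;
}
```

For 1000 records of 4 bytes with 12 bytes skipped between them, the first function gives 1,024,000 bytes and the second gives 16,384.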

qookie · Post by **qookie** » Sat Oct 15, 2022 11:53 am

The approach managarm takes to this is to have the kernel manage memory objects (including CoW memory), address spaces, and threads, and have things like processes, mmap, fork, and execve be handled by the POSIX server.

Memory mapped files are handled by having a concept of "managed memory objects", where a thread (one managing a file system in this case) is responsible for filling the memory with data and writing it back based on notifications it receives from the kernel (sent when a page is first accessed or when it's being evicted).
Each file has its own associated memory object, which is then handed out to other threads via IPC if they request it. These memory objects are more or less equivalent to the page cache in Linux. As such, read and write are also just implemented by reading/writing from the memory objects.
In the case of mmap, the POSIX server manages mapping the memory objects into the process that requested it, because it also needs to keep track of every mapping itself (needed for fork).

Executing new binaries is also implemented in the POSIX server, by:
- creating a new address space,
- loading the new executable (and the dynamic linker, and other necessary stuff) into it,
- tearing down the old thread (the kernel kind of thread),
- creating a new thread that uses the new address space.

This is all done "remotely", because the managarm system calls for mapping memory take the address space to operate on as an argument, and memory objects themselves are not associated with any particular mapping.
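The ordering of those four steps can be sketched generically in C. To be clear, none of the names below are managarm's API: `struct kernel_ops` is a hypothetical table of kernel operations invented for this sketch (including the extra `destroy_space` used for cleanup), and the real code is at the link that follows.

```c
#include <stddef.h>

typedef int handle_t;  /* hypothetical kernel object handle; negative = error */

/* Hypothetical kernel interface. Every call names the object it acts on,
 * which is what lets a server do this work "remotely". */
struct kernel_ops {
    handle_t (*create_space)(void *k);
    int      (*map_image)(void *k, handle_t space, const char *path);
    void     (*destroy_space)(void *k, handle_t space);
    void     (*destroy_thread)(void *k, handle_t thread);
    handle_t (*create_thread)(void *k, handle_t space);
};

/* Replace the program run by `old_thread`. Returns the new thread handle,
 * or -1 if loading failed, in which case the old thread is left untouched. */
static handle_t exec_replace(const struct kernel_ops *ops, void *k,
                             handle_t old_thread, const char *path)
{
    handle_t space = ops->create_space(k);        /* 1. new address space */
    if (space < 0)
        return -1;
    if (ops->map_image(k, space, path) != 0) {    /* 2. load executable etc. */
        ops->destroy_space(k, space);
        return -1;
    }
    ops->destroy_thread(k, old_thread);           /* 3. tear down the old thread */
    return ops->create_thread(k, space);          /* 4. start in the new space */
}
```

Doing the load before tearing anything down means a failed exec can still return an error to the caller, since the old thread and its address space are intact until step 3.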

The code that implements the meat of exec is here: https://github.com/managarm/managarm/bl ... 1034-L1102

AndrewAPrice · Post by **AndrewAPrice** » Sat Oct 15, 2022 2:45 pm

I'm curious about your POSIX server.

I was planning to create an Executor ("Launcher" is already taken by the graphical program that lists applications you can launch) that is responsible for parsing ELF binaries off disk and has permission to load data into another process's memory and create the initial thread.

I use musl as my C library, and libcxx calls the C library under the hood, and musl is built around Linux system calls, so I created a shim that emulates the behaviour of Linux system calls. For example, here's mmap. But, my shim is just a userland library. Are you doing something similar but putting this shim in your POSIX server?

qookie · Post by **qookie** » Sat Oct 15, 2022 7:25 pm

AndrewAPrice wrote:Are you doing something similar but putting this shim in your POSIX server?

We're using a custom C library (https://github.com/managarm/mlibc), and almost all of the "system calls" are implemented via regular IPC. The ones that aren't (mmap, fork, clone, exec, some signal stuff, etc) are implemented using supercalls, which are like actual syscalls except that the kernel notifies the POSIX server that it should handle it (and it is its job to read the thread registers and memory, do the thing, write results into the registers and memory, and resume the thread).

For the actual POSIX server (and other servers and drivers), we trade a bit of "true microkernel-ness" for simplicity, by having the kernel handle a small subset of these IPC messages (to the client there's no difference between an IPC stream where the other side is the kernel or a thread). The simplicity we gain from this is that our POSIX server can be a regular dynamically linked executable, that can just access files from the initramfs and log to the kernel console with printf, and we get access to the full hosted libstdc++.

Also, as a side note, the POSIX server is not an intermediary in all communications. For example, read and write requests go directly onto a stream associated with that file (most often to the server that handles the file system it's on). This is implemented by having open go to the POSIX server; the POSIX server then figures out which server to talk to, gets an IPC stream for the particular file, and sends a handle to it to the thread that opened the file.
Another thing worth noting about it is that it's completely asynchronous (while currently being single-threaded) due to our heavy usage of C++20 coroutines. This also means that clients that are aware of this can do things like asynchronous reads/writes natively, by sending requests but not completely blocking the thread while waiting for a response like the C library does.

AndrewAPrice · Post by **AndrewAPrice** » Sat Oct 15, 2022 8:29 pm

qookie wrote:The simplicity we gain from this is that our POSIX server can be a regular dynamically linked executable

I use "server" to mean a separate process, but you mention dynamically linking it in. Can you clarify what you mean?

qookie · Post by **qookie** » Sun Oct 16, 2022 6:38 am

AndrewAPrice wrote: I use "server" to mean a separate process, but you mention dynamically linking it in. Can you clarify what you mean?

The POSIX server is indeed its own process. What I meant is that it just dynamically links (using our ld.so) against a regular libc.so, libstdc++.so, and some other libraries (like libhw.so, which provides the client code for the hw protocol, for accessing PCI devices etc).
