Barry wrote: Hey everyone,
I've been thinking quite a bit about this lately, so I thought I'd ask how others have tackled the issue. Apologies if this fits better in General Ramblings; it's meant to be about design.
I am writing a new micro-kernel and I've just reached the section on memory management. Previously, I had a very simple manager in the ukernel which just allocated page frames to a process's address space. I never supported mmap(), because I don't want the kernel to have to send calls to user-space directly. In my head only user programs should IPC to other user programs, and the kernel should just facilitate this. To load a file into memory (for an execve()), I was literally calling read() on the FS server. My method of executing a file was extremely bad, and involved the kernel replacing the current program with an "exec server" which would then contact FS, load the file, and jump to the entry point.
I want to do something better. I want proper memory management, with different regions of memory - something higher level than raw pages. I want an mmap() that can load files, anonymous regions and memory shares, and I want an execve() that doesn't need a special loader program. I've come up with all the technical designs, but I don't know where to put it. I can't figure out quite how it works without something being in the kernel - which I don't really want.
If the MM (memory manager) is a server, then it can cleanly request files from FS in an mmap(), but won't be able to force another process to jump to the entry point in an execve().
If the MM is in the kernel, then it can't cleanly request from the FS (kernel shouldn't be IPCing to a program directly), but it can force the entry point jump.
If both the MM and FS are in the kernel, then it's not really a micro-kernel anymore.
The only solutions I can think of are:
- have a special syscall just for the MM server, that lets it modify any PCB (to set the EIP)
- have the MM server run in ring-0 but as its own process still, with its own address space.
- map the kernel data structure into MM's address space as RW/User/Present
- have both as servers, make execve() occur in my libc, and have it mmap() the new file in first, then jump itself to the entry point - only issue is if the libc gets (re)moved in memory as part of the mmap() (this is why it's normally in the kernel)
All of these seem to have their own issues with them though.
Does anyone have any better solutions? (I'm sure there are many)
Has anybody got cautionary tales from when they tried to do something similar?
It seems like the MM and FS are pretty closely intertwined - how have people separated them into different servers?
Thanks,
Barry
I've never understood the idea of removing paging entirely from a microkernel. Handling CPU page faults is inherently a kernel task, and fixing up mappings similarly so.
What isn't necessarily a micro-kernel task is loading the page that will be mapped. This is where the separation needs to come in.
So, in the (micro)kernel, you'll need a page fault handler, which maps the virtual address to some sort of region, and that region will need some sort of handler to load missing pages.
The microkernel page fault handler then becomes a matter of looking up the region, getting its handler, and asking that handler to provide the details of the page that will service the page fault in this region at the given offset.
Now, this handler would be some sort of memory segment driver, which, in a monolithic kernel, might ultimately resolve to a file system driver that will load the corresponding page from disk. Or it may be an anonymous page driver that provides fresh zero-filled pages for new mappings, or loads existing pages from swap for data that was previously swapped out.
But the point is the (micro)kernel doesn't need to know these details. It just has a region, and a page offset, which it defers to something else to handle.
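To make the region/handler split concrete, here is a minimal user-space sketch of the lookup-and-dispatch step. All names (struct region, pager_fn, find_region, handle_page_fault, anon_pager) are made up for illustration, and calloc() stands in for a real frame allocator. In a real kernel the returned page would be mapped at the faulting address rather than handed back to the caller.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096u

struct region;

/* Hypothetical pager interface: given a region and a page-aligned offset
 * into it, return the page that should back that offset, or NULL. */
typedef void *(*pager_fn)(struct region *r, size_t offset);

struct region {
    uintptr_t start;   /* first virtual address covered by the region */
    size_t    length;  /* size of the region in bytes */
    pager_fn  pager;   /* whoever supplies missing pages */
    void     *cookie;  /* private data for the pager (file handle, etc.) */
};

struct address_space {
    struct region *regions;
    size_t         count;
};

/* Anonymous pager: hands out a fresh zero-filled page.
 * calloc() is only a stand-in for a physical frame allocator. */
void *anon_pager(struct region *r, size_t offset)
{
    (void)r;
    (void)offset;
    return calloc(1, PAGE_SIZE);
}

/* Linear search is enough for a sketch; a real kernel would likely
 * keep regions in a tree or sorted list. */
struct region *find_region(struct address_space *as, uintptr_t addr)
{
    for (size_t i = 0; i < as->count; i++) {
        struct region *r = &as->regions[i];
        if (addr >= r->start && addr - r->start < r->length)
            return r;
    }
    return NULL;
}

/* Returns the page that should be mapped for the fault, or NULL if the
 * address is not inside any region (i.e. a genuine access violation).
 * The kernel never looks at what kind of pager it is calling. */
void *handle_page_fault(struct address_space *as, uintptr_t addr,
                        size_t *offset_out)
{
    struct region *r = find_region(as, addr);
    if (r == NULL)
        return NULL;

    size_t offset = (addr - r->start) & ~(size_t)(PAGE_SIZE - 1);
    if (offset_out != NULL)
        *offset_out = offset;
    return r->pager(r, offset);
}
```

A file-backed region would use the same struct region with a different pager function, which in a microkernel could forward the request as a message to a user-space server.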
This is where your user level paging or file system kicks in. The page fault handler can just send a message to your user process to load the page desired however it should be handled. The process then just returns the possibly newly read page to the page fault handler, which maps it into the virtual address required, and returns from the page fault handler.
So, for your microkernel abstraction, your mmap will create a microkernel region, which will be used as a handle to whatever provides the missing pages to be mapped. Your virtual address space manager can live in the microkernel; the virtual address space and the set of regions mapped therein are sufficiently abstract already.
Your exec can then live in user space, creating the regions in the exec client address space as required (pointing them however you want to your file system code that does that actual page read/write), then once you've set up the regions, set the client process in motion by pointing it at the entry point.
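As a rough sketch of that flow: the user-space exec code walks the segments of the new program, creates one region per segment in the target process, and only then starts the process at its entry point. Everything here is hypothetical. region_create() and proc_start() stand in for whatever kernel calls you end up with, struct proc is an in-memory stand-in for kernel state so the sketch can run anywhere, and struct segment is a simplified stand-in for something like an ELF program header.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_REGIONS 8

/* One file-backed mapping in the target address space. */
struct mapping {
    uintptr_t vaddr;
    size_t    len;
    int       file;      /* handle the pager uses to find the file */
    size_t    file_off;  /* where in the file this mapping starts */
};

/* Stand-in for the kernel's per-process state. */
struct proc {
    struct mapping regions[MAX_REGIONS];
    size_t         nregions;
    uintptr_t      pc;
    int            running;
};

/* Hypothetical kernel call: create a region in p backed by (file, file_off).
 * No pages are read here; they are faulted in on demand later. */
int region_create(struct proc *p, uintptr_t vaddr, size_t len,
                  int file, size_t file_off)
{
    if (p->nregions == MAX_REGIONS)
        return -1;
    struct mapping *m = &p->regions[p->nregions++];
    m->vaddr = vaddr;
    m->len = len;
    m->file = file;
    m->file_off = file_off;
    return 0;
}

/* Hypothetical kernel call: set the initial program counter and make
 * the process runnable. */
int proc_start(struct proc *p, uintptr_t entry)
{
    p->pc = entry;
    p->running = 1;
    return 0;
}

/* Simplified description of one loadable segment of the executable. */
struct segment {
    uintptr_t vaddr;
    size_t    len;
    size_t    file_off;
};

/* User-space exec: set up every region first, start the process last. */
int user_exec(struct proc *p, int file, const struct segment *segs,
              size_t nsegs, uintptr_t entry)
{
    for (size_t i = 0; i < nsegs; i++) {
        if (region_create(p, segs[i].vaddr, segs[i].len,
                          file, segs[i].file_off) != 0)
            return -1;
    }
    return proc_start(p, entry);
}
```

Because the regions are only descriptions, no file data has to be read before the process starts; the first instruction fetch at the entry point faults the first page in.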
The other thing I struggle with, microkernel-wise, is the separation of MM and FS. Minix provided a separate MM and FS, which I never understood, as it would make page faults horribly inefficient.
I'm personally of the opinion that the mechanism of MM (such as page fault handling and mapping) lives entirely within the kernel. So long as the page desired is known to the kernel, there is no reason why we should transition to user space or filesystem code to resolve a page fault. That would also imply the MM handles or knows about the filesystem caching, caching by file identity/file offset, allowing the resolution of page faults to pages that are in the cache without filesystem-specific intervention code. So long as your management of working set pages is handled well in your kernel, you won't be reaching out to the filesystem code that often, and when you do, it's likely that you'll be doing I/O to resolve the fault anyway, so a trip to user space is not a big deal in the grand scheme of things.
That's what I'm aiming at in my kernel. It's not a microkernel; it still presents as a monolithic kernel in that the system call interface is handled by the kernel proper.
But my MM/VFS is integrated to cache exclusively by file identity/offset, which the page fault handling can use without filesystem intervention, and my file system drivers have the option to be moved to user space, where they'll only be invoked for the relatively slow operations of doing actual I/O.
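For what it's worth, the cache-by-identity idea can be sketched roughly like this. It is not my actual kernel code, just an illustration with made-up names: a small open-addressed table keyed by (file_id, page index), and a resolve_fault() that only calls out to the filesystem driver on a miss. demo_fs_read() and the fs_calls counter are stand-ins for a (possibly user-space) driver doing real I/O.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 64
#define PAGE_SHIFT  12

struct cached_page {
    uint64_t file_id;     /* identity of the file (e.g. device + inode) */
    uint64_t page_index;  /* file offset >> PAGE_SHIFT */
    void    *frame;       /* the cached page */
    int      used;
};

struct page_cache {
    struct cached_page slots[CACHE_SLOTS];
};

static size_t slot_for(uint64_t file_id, uint64_t page_index)
{
    return (size_t)((file_id * 31u + page_index) % CACHE_SLOTS);
}

/* Open addressing with linear probing; entries are never removed in
 * this sketch, so an unused slot ends the probe sequence. */
void *cache_lookup(struct page_cache *c, uint64_t file_id, uint64_t offset)
{
    uint64_t idx = offset >> PAGE_SHIFT;
    size_t start = slot_for(file_id, idx);
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        struct cached_page *e = &c->slots[(start + i) % CACHE_SLOTS];
        if (!e->used)
            return NULL;
        if (e->file_id == file_id && e->page_index == idx)
            return e->frame;
    }
    return NULL;
}

int cache_insert(struct page_cache *c, uint64_t file_id, uint64_t offset,
                 void *frame)
{
    uint64_t idx = offset >> PAGE_SHIFT;
    size_t start = slot_for(file_id, idx);
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        struct cached_page *e = &c->slots[(start + i) % CACHE_SLOTS];
        if (!e->used || (e->file_id == file_id && e->page_index == idx)) {
            e->file_id = file_id;
            e->page_index = idx;
            e->frame = frame;
            e->used = 1;
            return 0;
        }
    }
    return -1; /* table full; a real cache would evict something */
}

/* Slow path: whatever actually performs the I/O. */
typedef void *(*fs_read_fn)(uint64_t file_id, uint64_t offset);

/* Fast path stays in the MM; the driver is only invoked on a miss. */
void *resolve_fault(struct page_cache *c, uint64_t file_id, uint64_t offset,
                    fs_read_fn fs_read)
{
    void *frame = cache_lookup(c, file_id, offset);
    if (frame != NULL)
        return frame;
    frame = fs_read(file_id, offset);
    if (frame != NULL)
        cache_insert(c, file_id, offset, frame);
    return frame;
}

/* Stand-in driver for demonstration: counts calls, returns a static page. */
int fs_calls = 0;

void *demo_fs_read(uint64_t file_id, uint64_t offset)
{
    static char frames[4][4096];
    (void)file_id;
    (void)offset;
    return frames[fs_calls++ % 4];
}
```

Two faults on the same page of the same file reach the driver only once; the second is resolved from the cache without leaving the MM.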