Hi,
Colonel Kernel wrote:A general comment about L4 based on what I've read -- the point of it being so minimalist is not just for the added flexibility, but also to keep its cache footprint as small as possible. RAM is the most critical performance bottleneck in modern hardware, and it's only getting worse. Cache and TLB thrashing seem to be the worst performance killers. Sounds to me like a good reason to keep the kernel as small as possible... Just my $0.02.
Not to me - keeping the kernel small just means having more services in other places, and doesn't make much difference for the code & data caches. The main difference would be the TLBs, where making good use of "global" pages and being careful to avoid changing CR3 can help (which is the main reason why L4 has those "small address spaces" IMHO).
Colonel Kernel wrote:I do have another question though. Let's say that I decide I don't need to have multiple pagers in user-space, and that I want most of memory management in the microkernel. How do I handle page faults that require disk I/O? The disk and filesystem drivers will be in user-space, and it would be awkward for the kernel to have special knowledge of them... Any suggestions?
For my OS there are "thread states" which record why a blocked thread was blocked. One of these states is "waiting for pager/disk IO".
For swap space, each "swap provider" notifies the kernel when it initializes and the kernel keeps a list of them (and keeps track of total size, free size, message port ID, etc for each one). For normal file access the VFS is used.
When data is needed from disk, the thread causes a page fault (page not present), and the page fault handler figures out why, then sends a message to the swap provider or VFS asking for the data. Then the page fault handler stores a sender ID and function ID, marks the thread as "waiting for pager/disk IO", and the scheduler switches to another thread for a while.
Sooner or later a message comes back from the VFS or swap provider containing the status of the operation and hopefully the data. The messaging code notices the thread was "waiting for pager/disk IO" and checks if the message sender matches the sender ID and function ID set by the page fault handler. When this happens it puts the message at the start of the message queue (rather than at the end, which is what would normally happen) and clears the sender ID and function ID. Then the messaging code clears the thread's "waiting for pager/disk IO" state and adds the thread to the scheduler's "ready to run" queue.
When the scheduler gives the thread CPU time the page fault handler gets the first message from the message queue, which happens to be the right message because of the extra stuff done by the messaging code. The message data is checked, and if the status is OK a free page is mapped where it needs to go, the data is copied into the free page and the page fault handler returns. If the status is bad (timeout, file IO error, etc) there's a critical error and the thread is terminated.
When data is swapped out to disk it's mostly the reverse of this (data is sent via messaging to be saved by the swap provider or VFS, and the pager blocks waiting for the returned status). If a page is sent to swap the "block number" is stored in the page table entry so that the data can be found again (where a block number is a reference to which 4096 byte block of swap space was used to store the data).
This gives some restrictions - for a not-present page there are 31 unused bits in the page table entry, and one bit is used to determine if the page is in swap or memory mapped. This means block numbers for swap space must be 30 bits or less, so 2^30 * 4096 gives a maximum swap space size of 4096 GB. When the kernel is using PAE this maximum is increased to 2^74 bytes, as there are 64 bit page table entries.
If a file is memory mapped an index into a "memory mapped range" list is stored in the page table entry. The scheduler steals a few MB of the thread's address space to store this memory mapped range list. Each entry in the list contains a starting address, size and file handle for the memory mapped file.
Of course for this to work you'd need to be able to transfer at least 4100 bytes in a message (easy for my OS). It's definitely not the fastest way, or the most flexible way, or the only way...
Cheers,
Brendan