Kernels that don't share address space with user processes

cardboardaardvark · Post by **cardboardaardvark** » Fri Nov 22, 2024 4:15 pm

I've got an architecture in mind where the kernel shares as little address space with user processes as possible. In my concept I do not intend to use an actual microkernel design. This is an offshoot of a topic I started about how to get the linker to deal with a kernel that is spread around in its address space: /viewtopic.php?p=350146.

This hypothetical architecture would avoid a standard higher-half kernel design and have a minimal trampoline area that is the only part of the address space that is shared between the kernel and user space. Here is a diagram of the address space layout and page mapping types I'm thinking of:

In my linker question post some discussion started related to the validity and merit of this architecture. I'd like to open up a discussion on those topics in the right place instead of having it in a post about a linker question.

First, this design is something I came up with because to me it "feels right." Sharing as little address space between the kernel and user space makes sense to me and my sysadmin brain. This design isn't based off anything else besides my gut telling me what to do. I also have barely any idea what I am doing.

Here are the pros of the design that I've come up with:

User mode process gets access to as much address space as is possible. This probably only matters for a 32 bit system and then also only matters for some rather specific cases where a program would need so much address space.
Likewise the kernel gets access to as much address space as possible. The benefit here I can think of is that the extra address space over a higher-half kernel could be used to maintain even larger filesystem and other caches.
Inherently resistant to the Meltdown data exfiltration attack.

A point was made previously that this design has some drawbacks of a microkernel design without the benefits. Here's my attempt at enumerating the cons:

Every transition between user space to kernel space or kernel space to user space requires the TLB to be flushed. This is going to have performance implications. Exactly how bad the performance hit is and how different it would be from a microkernel design is not clear to me.
Identifying a pointer from user space is now more complicated than the higher-half kernel approach of seeing if the numerical value of the pointer is above the 1/2 of the address space.
Using a user mode pointer will either involve setting up page table entries just for the pointer or doing manual address space translation and copying what the pointer actually points at. I also suspect setting up page table entries just for the pointer won't work correctly since the kernel is supposed to be able to use anything in its address space as part of its heap.
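
For comparison, the higher-half check mentioned in the second con can be sketched roughly as below. The 3 GiB split and the names `kUserLimit` and `is_user_range` are assumptions made up for illustration, not taken from any particular kernel:

```cpp
#include <cassert>
#include <cstdint>

// Assumed 32-bit layout: user space occupies [0, kUserLimit).
// 0xC0000000 is a 3 GiB / 1 GiB split; a 2 GiB / 2 GiB split would use 0x80000000.
constexpr std::uint32_t kUserLimit = 0xC0000000u;

// Returns true if the whole range [addr, addr + len) lies below the split.
// addr + len is never computed, so a huge len cannot wrap around to a small value.
constexpr bool is_user_range(std::uint32_t addr, std::uint32_t len) {
    return addr < kUserLimit && len <= kUserLimit - addr;
}
```

In the design proposed here no such numeric test exists, because a user address and a kernel address can have the same value.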

That's what I've got so far. I'm curious to hear the thoughts of people who have some idea of what they are doing.

thewrongchristian · Post by **thewrongchristian** » Fri Nov 22, 2024 5:23 pm

cardboardaardvark wrote: ↑Fri Nov 22, 2024 4:15 pm A point was made previously that this design has some drawbacks of a microkernel design without the benefits. Here's my attempt at enumerating the cons:

Every transition between user space to kernel space or kernel space to user space requires the TLB to be flushed. This is going to have performance implications. Exactly how bad the performance hit is and how different it would be from a microkernel design is not clear to me.

Newer x86 CPUs support process-context identifiers (PCID), which tag TLB entries with an address space ID, so switching address spaces can be done without flushing the entire TLB.

Using PCID should mitigate much of the performance impact of switching address spaces.
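
A minimal sketch of what a PCID-tagged address space switch loads into CR3, assuming CR4.PCIDE is already set (which, as far as I know, is only possible in long mode, so this does not help a pure 32-bit kernel). The helper name `make_cr3` and the constants are hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// With CR4.PCIDE set, bits 11:0 of CR3 hold the PCID.
constexpr std::uint64_t kPcidMask = 0xFFFull;
// Bit 63 of the value written to CR3 asks the CPU not to invalidate
// the TLB entries tagged with that PCID.
constexpr std::uint64_t kNoFlush = 1ull << 63;

// Builds the value to load into CR3.
// root_table_phys is the 4 KiB aligned physical address of the top-level page table.
constexpr std::uint64_t make_cr3(std::uint64_t root_table_phys,
                                 std::uint16_t pcid,
                                 bool keep_tlb) {
    return (root_table_phys & ~kPcidMask) | (pcid & kPcidMask) |
           (keep_tlb ? kNoFlush : 0ull);
}
```

A kernel with its own address space could, for example, give itself one PCID and each process another, and set `keep_tlb` on every transition where the target's stale entries are known to be harmless.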

When Meltdown first landed, the initial mitigations had measurable performance impacts; figures between 5% and 30% were bandied about. I suspect later work optimised the use of PCID in Linux and Windows to the point that the performance impact became negligible, probably not much more significant than regular benchmark noise.

Still, I've no figures to back any of this up, and have not bothered to look up any of the reports that claimed significant performance drops from the Linux and Windows mitigations.

cardboardaardvark wrote: ↑Fri Nov 22, 2024 4:15 pm

Identifying a pointer from user space is now more complicated than the higher-half kernel approach of seeing if the numerical value of the pointer is above the 1/2 of the address space.

On the contrary, it is easy to determine a user pointer. They should be coming from system call parameters only, and any user pointer must by definition be treated with caution.

Of course reading a user pointer is now more work, as you'll have to map the pages pointed to by the pointer into the kernel address space, but that's an update of the kernel page table and a TLB miss per user page away.

cardboardaardvark wrote: ↑Fri Nov 22, 2024 4:15 pm

Using a user mode pointer will either involve setting up page table entries just for the pointer or doing manual address space translation and copying what the pointer actually points at. I also suspect setting up page table entries just for the pointer won't work correctly since the kernel is supposed to be able to use anything in its address space as part of its heap.

What do you mean by that?

The heap does not equate to the address space. The address space includes the heap space, but you can include non-heap address mappings within the kernel.

When dealing with user pointers, say when copying to/from a user buffer, it is not difficult to simply provide a function that takes a user address space id and pointer, and return the equivalent memory mapped into the kernel address space:

Code:

#include <stdint.h>

/* Assumption: an x86 PCID is 12 bits wide, so a 16-bit integer can hold one */
typedef uint16_t pcid;

/* Given a user PCID and address, get or make a kernel alias mapping to same, and return the equivalent kernel pointer */
void *user_to_kernel_map(pcid user_address_space, uintptr_t user_address);

/* Given a user PCID and address, unmap the kernel side and release the region for reuse */
void user_to_kernel_unmap(pcid user_address_space, uintptr_t user_address);
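
One way the bookkeeping behind such a pair of functions might look is a small window of reserved kernel pages that get handed out as aliases. This is only a sketch: `AliasWindow`, the window base address, and the slot count are all invented for illustration, and the page table update itself is left as a comment so the code can run outside a kernel:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uintptr_t kPageSize = 4096;
constexpr std::uintptr_t kWindowBase = 0xF0000000u;  // hypothetical reserved kernel region
constexpr std::size_t kWindowSlots = 4;              // hypothetical window size in pages

struct AliasSlot {
    bool used;
    std::uint16_t pcid;        // user address space the page belongs to
    std::uintptr_t user_page;  // page-aligned user address
};

class AliasWindow {
public:
    // Get or make an alias for user_address; returns 0 when the window is full.
    std::uintptr_t map(std::uint16_t pcid, std::uintptr_t user_address) {
        const std::uintptr_t page = user_address & ~(kPageSize - 1);
        const std::uintptr_t offset = user_address & (kPageSize - 1);
        std::size_t free_slot = kWindowSlots;
        for (std::size_t i = 0; i < kWindowSlots; ++i) {
            const AliasSlot &s = slots_[i];
            if (s.used && s.pcid == pcid && s.user_page == page)
                return slot_address(i) + offset;  // already aliased
            if (!s.used && free_slot == kWindowSlots)
                free_slot = i;
        }
        if (free_slot == kWindowSlots)
            return 0;
        // A real kernel would point this slot's page table entry at the
        // physical frame behind the user page and invalidate the stale entry.
        slots_[free_slot] = AliasSlot{true, pcid, page};
        return slot_address(free_slot) + offset;
    }

    // Release the slot aliasing user_address, if there is one.
    void unmap(std::uint16_t pcid, std::uintptr_t user_address) {
        const std::uintptr_t page = user_address & ~(kPageSize - 1);
        for (AliasSlot &s : slots_)
            if (s.used && s.pcid == pcid && s.user_page == page)
                s.used = false;
    }

private:
    static std::uintptr_t slot_address(std::size_t i) {
        return kWindowBase + i * kPageSize;
    }
    std::array<AliasSlot, kWindowSlots> slots_{};
};
```

The sketch does no reference counting, so two callers sharing one alias would need extra care before either of them unmaps it.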

cardboardaardvark · Post by **cardboardaardvark** » Fri Nov 22, 2024 8:03 pm

thewrongchristian wrote: ↑Fri Nov 22, 2024 5:23 pm On the contrary, it is easy to determine a user pointer. They should be coming from system call parameters only, and any user pointer must by definition be treated with caution.

Indeed it is easy to know what is a user pointer when it first shows up in the kernel from a system call. I recall the "not using higher half kernel leads to harder to identify user pointers" from the wiki when I was doing my first reading there. I suspect what that is referring to is figuring out if a pointer is a user pointer after it runs around inside the kernel for a while. Perhaps I'm wrong.

I've had some thoughts about how to make sure a pointer to a physical page doesn't wind up in the wrong place by accident and I think it would work ok-ish for user pointers too. My kernel is in C++ and I could wrap a physical page pointer and user pointers up in a class that won't go where void * can until you take some explicit step to unwrap the pointer. It's not perfect but it would at least cause compile errors for a subset of sending the wrong pointer to the wrong place.
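
A rough sketch of that wrapper idea, with made-up names (`UserPtr`, `unwrap_address`). It stores the address as an integer so nothing can dereference it by accident, and the static assertions show the conversions the compiler now refuses:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Holds an address that is only meaningful in a user address space.
template <typename T>
class UserPtr {
public:
    // explicit: a plain integer or pointer never turns into a UserPtr silently.
    explicit UserPtr(std::uintptr_t address) : address_(address) {}

    // The single, deliberately verbose way to get the raw value back out.
    std::uintptr_t unwrap_address() const { return address_; }

    // Pointer-style arithmetic that stays inside the wrapper type.
    UserPtr operator+(std::size_t count) const {
        return UserPtr(address_ + count * sizeof(T));
    }

private:
    std::uintptr_t address_;
};

// Neither direction converts implicitly, so passing a UserPtr where a
// kernel pointer is expected (or the reverse) is a compile error.
static_assert(!std::is_convertible<UserPtr<char>, void *>::value,
              "UserPtr must not decay to a kernel pointer");
static_assert(!std::is_convertible<void *, UserPtr<char>>::value,
              "kernel pointers must not become UserPtr implicitly");
```

The same pattern with a second, unrelated class would cover physical page addresses, since two distinct wrapper types do not convert into each other either.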

thewrongchristian wrote: ↑Fri Nov 22, 2024 5:23 pm
Using a user mode pointer will either involve setting up page table entries just for the pointer or doing manual address space translation and copying what the pointer actually points at. I also suspect setting up page table entries just for the pointer won't work correctly since the kernel is supposed to be able to use anything in its address space as part of its heap.

What do you mean by that?

The heap does not equate to the address space. The address space includes the heap space, but you can include non-heap address mappings within the kernel.

When dealing with user pointers, say when copying to/from a user buffer, it is not difficult to simply provide a function that takes a user address space id and pointer, and return the equivalent memory mapped into the kernel address space:

What I meant was that allowing a user space pointer with any arbitrary virtual address into kernel space and then setting up a map in kernel space so the user pointer works the same from user space would require arbitrary portions of the kernel address space to be available for the mapping, wouldn't it? That would interfere with the idea that the kernel is free to use its entire address space as it wants. Am I mistaken here?

I was expecting to handle user pointers by doing the page table lookups and address translation in software. I'd be quite pleased if my expectations don't hold true. I suppose one way to deal with this is by reserving a part of the common address space (next to the trampoline) specifically for passing data between user and kernel space.
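
The first step of such a software lookup is splitting the user address into table indices. A minimal sketch, assuming classic 32-bit x86 paging with 4 KiB pages and no PAE; the struct and function names are invented for illustration:

```cpp
#include <cassert>
#include <cstdint>

// The three fields of a 32-bit virtual address under two-level paging.
struct VirtParts {
    std::uint32_t dir_index;    // bits 31:22, selects the page directory entry
    std::uint32_t table_index;  // bits 21:12, selects the page table entry
    std::uint32_t offset;       // bits 11:0, byte offset inside the page
};

constexpr VirtParts split_virtual(std::uint32_t vaddr) {
    return VirtParts{vaddr >> 22, (vaddr >> 12) & 0x3FFu, vaddr & 0xFFFu};
}

// Bits 31:12 of a page table entry hold the physical frame address;
// the low 12 bits are flags and are replaced by the page offset.
constexpr std::uint32_t physical_from_pte(std::uint32_t pte, std::uint32_t offset) {
    return (pte & 0xFFFFF000u) | offset;
}
```

The walk itself would read the directory entry at `dir_index`, then the table entry at `table_index`, checking the present bit at each level before trusting the result.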

nullplan · Post by **nullplan** » Fri Nov 22, 2024 10:28 pm

cardboardaardvark wrote: ↑Fri Nov 22, 2024 8:03 pm Indeed it is easy to know what is a user pointer when it first shows up in the kernel from a system call. I recall the "not using higher half kernel leads to harder to identify user pointers" from the wiki when I was doing my first reading there. I suspect what that is referring to is figuring out if a pointer is a user pointer after it runs around inside the kernel for a while. Perhaps I'm wrong.

Indeed that is a problem. Especially when you are allowing kernel threads to use similar APIs to user space, the confusion can quickly mount. There is a reason why Linux, despite being a higher-half kernel, took to annotating user pointers and using a special tool to find bad conversions.

cardboardaardvark wrote: ↑Fri Nov 22, 2024 8:03 pm What I meant was that allowing a user space pointer with any arbitrary virtual address into kernel space and then setting up a map in kernel space so the user pointer works the same from user space would require arbitrary portions of the kernel address space to be available for the mapping, wouldn't it? That would interfere with the idea that the kernel is free to use its entire address space as it wants. Am I mistaken here?

From a high-level perspective, this is really simple: You cannot access user pointers directly, you have to do so through special functions, anyway. Let's call it copy_from_user(). Its signature is

Code:

int copy_from_user(void *to, const void __user *from, size_t n);

So it takes a kernel address, a user address, and a size, and returns 0 if it could copy all that from the user or -EFAULT if not.

Now, on a higher-half kernel, you can make that function pretty much just an instrumented memcpy() that returns failure when the CPU faults, but in your case this will not be possible. However, you can look up the physical page the user address is pointing to, map that somewhere in your own address space, and start copying the data out.

This also means that you will need a place in your kernel address space for arbitrary memory maps, but then, you will need that anyway for device drivers. Most modern devices use MMIO or even in-memory structures to talk to the OS.
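
To make the page-by-page idea concrete, here is a toy version that runs in user space. It is a sketch under loud assumptions: the process is modelled as a `std::map` from user page number to page contents instead of real page tables, and the function takes that map as an extra parameter, so the signature differs from the one above:

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

constexpr std::size_t kPageSize = 4096;

// Toy stand-in for a user address space: page number -> page contents.
// Every vector is expected to hold exactly kPageSize bytes.
using UserSpace = std::map<std::uintptr_t, std::vector<unsigned char>>;

// Copies n bytes from user address src into the kernel buffer dst.
// Returns 0 on success or -EFAULT if any touched user page is unmapped.
// On failure, bytes from earlier pages may already have been copied.
int copy_from_user(void *dst, const UserSpace &user, std::uintptr_t src, std::size_t n) {
    auto *out = static_cast<unsigned char *>(dst);
    while (n > 0) {
        const auto page = user.find(src / kPageSize);
        if (page == user.end())
            return -EFAULT;
        const std::size_t offset = src % kPageSize;
        const std::size_t chunk = std::min(n, kPageSize - offset);
        // A real kernel would map the physical frame at a temporary
        // kernel address here, copy, and then drop the mapping.
        std::memcpy(out, page->second.data() + offset, chunk);
        out += chunk;
        src += chunk;
        n -= chunk;
    }
    return 0;
}
```

The loop never copies across a page boundary in one step, because adjacent user pages need not be backed by adjacent physical frames.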

In general, I don't think that such a system is worth building, since the pros are so spurious and easily outweighed by the cons. The only pros you have vanish on a 64-bit system after implementing PTI.

It is also worth noting that on a 32-bit system, the C standard actually prohibits the existence of objects larger than 2GB, because inside those, ptrdiff_t would not work correctly. To take advantage, you'd need a very special kind of application that can run on a 32-bit platform, and has objects, each not above 2GB in size, but adding up to between 3 and 4 GB (because 3GB is what you get on most higher-half systems). I know of no application that fits these criteria. The ones I know that go above 3GB go way above, and then you do really need a 64-bit system.

But who am I to stop you? If you really want to do this, go for it. This forum is filled with bad ideas to the brim, and ignoring the received wisdom is partially what the hobby is all about (else what's the point of building a new OS when NetBSD hasn't gone anywhere?)

cardboardaardvark · Post by **cardboardaardvark** » Sat Nov 23, 2024 2:36 am

nullplan wrote: ↑Fri Nov 22, 2024 10:28 pm Now, on a higher-half kernel, you can make that function pretty much just an instrumented memcpy() that returns failure when the CPU faults, but in your case this will not be possible. However, you can look up the physical page the user address is pointing to, map that somewhere in your own address space, and start copying the data out.

Ah right, the address of the physical page has to be determined and then the physical page has to be mapped into the address space of the kernel but not necessarily at the same virtual address from user space. Not sure why I didn't realize that before. I'm not used to thinking in terms of juggling an MMU.

I've already got a physical page manager going and a dynamic heap built on Doug Lea's memory allocator, which in turn uses the sbrk() interface to grow its address space. I haven't implemented the temporary mapping trick you mentioned, though. At the moment, after paging is turned on, there is only one specific place where I deal with memory that isn't allocated on the heap or in the identity mapped section of my kernel: returning a physical page to the free physical page stack.

I zero the page as it goes into the free list. The first time that happens is when the free list is built as a part of bootstrapping paging. After paging is enabled, when I want to put a physical page back on the free list, the function that zeros the page and returns it to the list temporarily disables paging while it does its work. Since the instruction pointer, the stack pointer, and the physical page list are all working in the identity mapped region this works just fine. Or seems to anyway. I'm pretty sure toggling the paging enable bit doesn't cause a TLB flush so I don't think there's a performance cost here.

It's a little bit worrisome that you might wind up coming out of the function somehow and not enable paging again so I wrote a C++ class that acts like a std::lock_guard. When the object is created it reads the paging status from the control register and when the object goes out of scope it returns the paging status to what it was when it was created. Covers exceptions, early returns, and paging being enabled or disabled when the function is entered.
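
The shape of such a guard might be something like the following. It is a sketch only: the control register is replaced by an ordinary variable (`fake_cr0`) so the code runs outside a kernel, and all names are invented. A kernel would implement `read_cr0` and `write_cr0` with inline assembly instead:

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kCr0Paging = 1u << 31;  // CR0.PG, the paging enable bit

// Stand-in for the real control register (assumption for this sketch).
static std::uint32_t fake_cr0 = kCr0Paging;
inline std::uint32_t read_cr0() { return fake_cr0; }
inline void write_cr0(std::uint32_t value) { fake_cr0 = value; }

inline void set_paging(bool enabled) {
    const std::uint32_t cr0 = read_cr0();
    write_cr0(enabled ? (cr0 | kCr0Paging) : (cr0 & ~kCr0Paging));
}

// Remembers the paging state on construction and restores it on destruction,
// whatever happened in between and however the scope is left.
class PagingGuard {
public:
    PagingGuard() : was_enabled_((read_cr0() & kCr0Paging) != 0) {}
    ~PagingGuard() { set_paging(was_enabled_); }

    // Copying a guard would restore the state twice, so forbid it.
    PagingGuard(const PagingGuard &) = delete;
    PagingGuard &operator=(const PagingGuard &) = delete;

private:
    bool was_enabled_;
};

// Example use: paging is off only while the guard is alive.
inline void zero_page_with_paging_off() {
    PagingGuard guard;
    set_paging(false);
    // ... zero the physical page and push it on the free list ...
}
```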

nullplan wrote: ↑Fri Nov 22, 2024 10:28 pm It is also worth noting that on a 32-bit system, the C standard actually prohibits the existence of objects larger than 2GB, because inside those, ptrdiff_t would not work correctly. To take advantage, you'd need a very special kind of application that can run on a 32-bit platform, and has objects, each not above 2GB in size, but adding up to between 3 and 4 GB (because 3GB is what you get on most higher-half systems). I know of no application that fits these criteria. The ones I know that go above 3GB go way above, and then you do really need a 64-bit system.

It's been a long time since I was an Oracle DBA but I recall that Oracle would use up as much RAM as it could because it had a preference for caching disk IO in its heap. Of course all of these arguments about how to fully utilize the space available in a 32 bit address space hardly matter anymore with 64 bit machines being the norm. It'll be an interesting day when this argument matters again on 64 bit machines. How much RAM could Windows version 82 use?

Contrast to PostgreSQL with its preference for letting the OS do the majority of the disk caching. Though now that I'm thinking about it, with a higher half kernel on a 32 bit machine doesn't that mean the kernel would have max 2 GiB of cache to offer Postgres?

nullplan wrote: ↑Fri Nov 22, 2024 10:28 pm But who am I to stop you? If you really want to do this, go for it. This forum is filled with bad ideas to the brim, and ignoring the received wisdom is partially what the hobby is all about (else what's the point of building a new OS when NetBSD hasn't gone anywhere?)

Now I'm curious how non-higher half kernels usually work. I'm going with what feels right for my design and of course feelings are often wrong. If something stands out as making this a catastrophically bad idea I'll likely move to a higher half design without worrying about implementing what I'm thinking of. Otherwise I'll probably go ahead with this concept unless I also see something that would make it overly difficult to change over to higher-half if this does turn out to be an awful idea.

iansjack · Post by **iansjack** » Sat Nov 23, 2024 5:59 am

cardboardaardvark wrote: ↑Sat Nov 23, 2024 2:36 am Though now that I'm thinking about it, with a higher half kernel on a 32 bit machine doesn't that mean the kernel would have max 2 GiB of cache to offer Postgres?

I'm not quite sure what you are getting at here.

A higher-half kernel doesn't use all of the higher half of memory; it just resides in part of that address space.

rdos · Post by **rdos** » Sat Nov 23, 2024 6:01 am

cardboardaardvark wrote: ↑Fri Nov 22, 2024 4:15 pm First, this design is something I came up with because to me it "feels right." Sharing as little address space between the kernel and user space makes sense to me and my sysadmin brain. This design isn't based off anything else besides my gut telling me what to do. I also have barely any idea what I am doing.

The idea makes sense, at least for parts of the OS.

cardboardaardvark wrote: User mode process gets access to as much address space as is possible. This probably only matters for a 32 bit system and then also only matters for some rather specific cases where a program would need so much address space.

How much space user mode gets is dependent on how much space the kernel needs. Typically, most space will be used for disc & file caches, and the more you can cache the better performance you can get. So, in order to give user mode more space, you should get the disc & file caches out of the linear memory of the kernel.

cardboardaardvark wrote: Likewise the kernel gets access to as much address space as possible. The benefit here I can think of is that the extra address space over a higher-half kernel could be used to maintain even larger filesystem and other caches.

True, but given the large sizes of files & data modern applications use, a 32-bit system will not be able to keep this in its linear address space, so another solution is required for that.

cardboardaardvark wrote: A point was made previously that this design has some drawbacks of a microkernel design without the benefits. Here's my attempt at enumerating

A true microkernel design is not required to solve the disc cache problem.

cardboardaardvark wrote: Every transition between user space to kernel space or kernel space to user space requires the TLB to be flushed. This is going to have performance implications. Exactly how bad the performance hit is and how different it would be from a microkernel design is not clear to me.

Depending on what the design solves this might not be an issue. For instance, if you can keep much larger caches, then you will still get better performance with disc-IO.

cardboardaardvark wrote: Identifying a pointer from user space is now more complicated than the higher-half kernel approach of seeing if the numerical value of the pointer is above the 1/2 of the address space.

I solve the pointer problem by giving user mode a selector with a 3 GB limit and letting syscalls operate on 48-bit (segment + 32-bit offset) pointers. So, I wouldn't say this is a trivial problem in regards to the typical flat memory-model kernel. It might even be easier to ensure stability if all user mode pointers must be mapped in kernel space rather than being able to use them directly.

I think the primary issue for 32-bit OSes is to provide a smart algorithm for caching disc-IO and file contents. Initially, I kept both the disc cache and the file cache mapped in kernel space. This means the kernel needs a somewhat aggressive method of reducing cache size, even in the presence of lots of RAM.

My new way of dealing with this is a bit similar to your design. Every disc creates its own process, reserving 1 GB for keeping the disc cache in the form of lists/trees pointing to physical addresses. Every partition is a "fork" where the lower 2 GB can be used as private space for caching meta data and similar stuff. The interface with user mode / kernel is through a 4k page (similar to your trampoline) where users can post requests, and then get blocked until the server has processed them. Every open file has another 4k page where users of the file can queue commands. The file cache is maintained in user mode of the process having the file open. This also reduces syscalls, since if the content is already mapped, then user mode can just read or write it. This solves the issue of having the cache mapped in kernel space. It solves this issue without actually separating user & kernel from each other; instead the disc functions run in their own processes as services. I wouldn't say I have a microkernel, rather this is a special way of dealing with disc caching.

rdos · Post by **rdos** » Sat Nov 23, 2024 6:16 am

iansjack wrote: ↑Sat Nov 23, 2024 5:59 am
Though now that I'm thinking about it, with a higher half kernel on a 32 bit machine doesn't that mean the kernel would have max 2 GiB of cache to offer Postgres?
I'm not quite sure what you are getting at here.

A higher-half kernel doesn't use all of the higher half of memory; it just resides in part of that address space.

Right. The term is a bit misleading. Most 32-bit OSes seem to give user mode 3 GB and kernel 1 GB. I think if you can get the disc cache out of kernel space, then it's possible to give user mode 3.5 GB and kernel 512 MB, or even to shift the balance further.

nullplan · Post by **nullplan** » Sat Nov 23, 2024 12:33 pm

cardboardaardvark wrote: ↑Sat Nov 23, 2024 2:36 am Contrast to PostgreSQL with its preference for letting the OS do the majority of the disk caching. Though now that I'm thinking about it, with a higher half kernel on a 32 bit machine doesn't that mean the kernel would have max 2 GiB of cache to offer Postgres?

The combination of 32-bit CPU and lots of memory is not usually one you need to concern yourself with these days. If you must, though, then the Linux idea of swapping out certain maps means the OS can utilize unlimited memory (well, the maintenance structures must still fit in the permanently mapped memory). If you are unfamiliar with the scheme, Linux on a 32-bit system (with the default 3 GB/1 GB split) will linearly map the first 896 MB of physical address space to the virtual 3GB mark, and will expect all kernel code to be contained in that. The remaining 128 MB of virtual address space can be swapped out as needed on a page-by-page basis. The kernel depends on the entire kernel code and all of the page tables, as well as the structures that manage that high memory, being inside those 896 MB, but everything else (including file system caches) can be swapped out. In this way, it is possible to make use of all physical memory, as long as the management stuff fits in 896 MB.
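
The linear part of that scheme boils down to one addition. A small sketch, with invented names and with the size of the linear window left as a parameter since it depends on how the split is configured:

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kPageOffset = 0xC0000000u;  // the virtual 3 GiB mark

// If phys lies inside the linearly mapped window, stores the matching kernel
// virtual address in *virt and returns true. Otherwise returns false, meaning
// the page has to be mapped temporarily before the kernel can touch it.
// window_bytes must not exceed 1 GiB, or the addition would wrap around.
inline bool phys_to_virt(std::uint64_t phys, std::uint64_t window_bytes,
                         std::uint32_t *virt) {
    if (phys >= window_bytes)
        return false;
    *virt = kPageOffset + static_cast<std::uint32_t>(phys);
    return true;
}
```

Everything for which this returns false is what Linux calls high memory, and the leftover virtual range above the window is where its temporary mappings live.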

cardboardaardvark wrote: ↑Sat Nov 23, 2024 2:36 am Now I'm curious how non-higher half kernels usually work.

The only non-higher-half kernel I know is OS-9. It instead uses a direct mapped approach. The MMU is only used for memory protection, and otherwise everything is getting identity mapped. The executable format is relocatable, and the ABI also allows for two registers to be reserved as pointers to the code and data sections respectively.

The design is deficient in many ways. Since nobody thought to reserve a thread-pointer register, threads are handled with a thread-pointer in memory, so a specific memory address contains a thread-specific pointer that the kernel writes there before scheduling the thread. But that design is fundamentally incompatible with SMP, so they've completely locked themselves out of that forever. And the direct mapping means that memory fragmentation can make it impossible to fulfill a request even though enough memory would be free. If you are left with only a 2 MB block over here, and another 3 MB block over there, and a program requests 4 MB, then the request is denied.
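
The fragmentation case is easy to state in code. A minimal sketch with hypothetical helper names, where the free list is just a vector of block sizes:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Total free memory across all blocks.
inline std::size_t total_free(const std::vector<std::size_t> &free_blocks) {
    return std::accumulate(free_blocks.begin(), free_blocks.end(), std::size_t{0});
}

// With direct mapping a request must fit inside one physically contiguous
// block, so only the largest block matters, not the sum.
inline bool can_allocate_contiguous(const std::vector<std::size_t> &free_blocks,
                                    std::size_t request) {
    return std::any_of(free_blocks.begin(), free_blocks.end(),
                       [request](std::size_t block) { return block >= request; });
}
```

With free blocks of 2 MB and 3 MB, `total_free` reports 5 MB yet `can_allocate_contiguous` rejects a 4 MB request. A paging MMU avoids this by stitching scattered frames into one contiguous virtual range.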

Octocontrabass · Post by **Octocontrabass** » Sat Nov 23, 2024 6:27 pm

cardboardaardvark wrote: ↑Sat Nov 23, 2024 2:36 am I'm pretty sure toggling the paging enable bit doesn't cause a TLB flush so I don't think there's a performance cost here.

Both Intel's and AMD's manuals say disabling paging will flush the TLB. Plus, in 64-bit mode, you can't disable paging.

rdos · Post by **rdos** » Sun Nov 24, 2024 7:50 am

Octocontrabass wrote: ↑Sat Nov 23, 2024 6:27 pm
cardboardaardvark wrote: ↑Sat Nov 23, 2024 2:36 am I'm pretty sure toggling the paging enable bit doesn't cause a TLB flush so I don't think there's a performance cost here.
Both Intel's and AMD's manuals say disabling paging will flush the TLB. Plus, in 64-bit mode, you can't disable paging.

You can only disable paging in protected mode if your code is running in a unity-mapped area. You can do the same in 64-bit mode if your code is below 4G and unity-mapped (although you also need to change an MSR). I know this works since I do it in my long mode <-> protected mode switching code that is part of the scheduler.

Octocontrabass · Post by **Octocontrabass** » Sun Nov 24, 2024 3:11 pm

rdos wrote: ↑Sun Nov 24, 2024 7:50 am You can do the same in 64-bit mode if your code is below 4G and unity-mapped (although you also need to change an MSR). I know this works since I do it in my long mode <-> protected mode switching code that is part of the scheduler.

No, you really can't disable paging in 64-bit mode. You must be switching to 32-bit mode before you disable paging.

rdos · Post by **rdos** » Sun Nov 24, 2024 3:34 pm

Octocontrabass wrote: ↑Sun Nov 24, 2024 3:11 pm
rdos wrote: ↑Sun Nov 24, 2024 7:50 am You can do the same in 64-bit mode if your code is below 4G and unity-mapped (although you also need to change an MSR). I know this works since I do it in my long mode <-> protected mode switching code that is part of the scheduler.
No, you really can't disable paging in 64-bit mode. You must be switching to 32-bit mode before you disable paging.

Yes, you need to switch to compatibility mode before you can disable paging. So, yes, you are correct that you cannot execute 64-bit code without paging.

OSDev.org
