OSDev.org

Posted: **Tue Mar 12, 2024 5:27 pm**

I've been thinking about how to best represent process memory maps and am curious about what others have come up with. Right now, I'm just doing something simple with having an array of "regions", where each region has a start,end,permissions,etc... which is "ok", but definitely is inefficient for certain operations.

I believe the requirements generally are:

* entries need to be non-overlapping, so detecting overlaps should be made easy
* at least two entries (heap and stack) should be extendable, so detecting the bounds of the nearest region in a given direction should be made easy
* ideally, finding holes of a given size should also be easy, so things like mmap can be efficiently implemented.

So, what have people done? Do they just use a sorted array and brute force it knowing that there won't be THAT many regions per process? or is there a clever data structure I'm not thinking of?

Posted: **Tue Mar 12, 2024 8:30 pm**

My idea is to use two collections. One will be a binary tree with each entry indexed by address and length. This will be used for looking up addresses in a page fault, so it provides the mapping address-->(file, offset). The other will be some collection of free blocks. Haven't decided on the exact collection yet. But all of the free virtual memory must be in there. I know that in practice I will also need to look up addresses in there, and that mmap() can be tasked with finding "the first free block of size x at or above address y", and I am not sure how to solve that one yet. A sorted array would be nice if not for the fact that this structure is likely to change all the time.

proxy wrote:Do they just use a sorted array and brute force it knowing that there won't be THAT many regions per process?

My firefox is currently running with 662 lines in /proc/self/maps. I am sure for mpv it would be even more. Recent versions of binutils create four mappings per dynamic library, and mpv is linked against 333 libraries. So yes, you can probably expect around 1500-2000 mappings in a process on the high side.

Posted: **Wed Mar 13, 2024 1:41 am**

I'm using the page table entries themselves for this. I also use some of the reserved bits for copy-on-write and to trigger a kernel mode callback.

I do not support the Unix model where everything is based on fork, as this is quite inefficient. Instead, I handle fork as a special case and build-up custom structures for this that normally do not exist. The normal way of creating a process is by actually creating a new paging structure, rather than forking the existing. Much like how Windows does too.

Posted: **Wed Mar 13, 2024 9:00 am**

rdos wrote:I do not support the Unix model where everything is based on fork, as this is quite inefficient. Instead, I handle fork as a special case and build-up custom structures for this that normally do not exist. The normal way of creating a process is by actually creating a new paging structure, rather than forking the existing. Much like how Windows does too.

Yeah, I also am not going for a fork/exec model. Unlike Linux, where "everything is a process", and threads are just multiple processes which happen to share the same address space (i know, over simplified); My design is more like I learned in college. That is, a process is simply an address space and associated resources file handles, and threads are streams of execution which exist in a given process. So to launch a new application, I simply create a new Process object, where it can inherit some things from its parent process such as file handles, and then I just attach a main thread to it.

So my syscall for launch a process is just "sys_spawn" which aesthetically looks like linux's exec, except that there is no fork, it just creates a process and adds a thread to it and moves on. I like this because it's just dead simple (in my mind).

Posted: **Wed Mar 13, 2024 9:05 am**

rdos wrote:I'm using the page table entries themselves for this. I also use some of the reserved bits for copy-on-write and to trigger a kernel mode callback.

@rdos, so I considered using the page tables themselves for this, but I eventually concluded that I didn't like it for a few reasons:

1. I couldn't think of a good algorithm for finding holes besides just linearly searching
2. there are attributes for regions that I'd want to track as well, like "is this page OWNED by this process?" this was important for my VDSO page because I don't want a process "freeing" that page back to the page manager
3. if it's an mmaped page, what file/device is it associated with?
etc..

Using reserved bits for this is something I hadn't considered. So I'll have to think on it some more, but, I thought that they HAVE to be 0's for any present pages or it will cause a page fault with the errcode having the RSVD bit set. am I mistaken about that?

Posted: **Wed Mar 13, 2024 9:41 am**

proxy wrote:
rdos wrote:I'm using the page table entries themselves for this. I also use some of the reserved bits for copy-on-write and to trigger a kernel mode callback.
@rdos, so I considered using the page tables themselves for this, but I eventually concluded that I didn't like it for a few reasons:

1. I couldn't think of a good algorithm for finding holes besides just linearly searching
2. there are attributes for regions that I'd want to track as well, like "is this page OWNED by this process?" this was important for my VDSO page because I don't want a process "freeing" that page back to the page manager
3. if it's an mmaped page, what file/device is it associated with?
etc..

Using reserved bits for this is something I hadn't considered. So I'll have to think on it some more, but, I thought that they HAVE to be 0's for any present pages or it will cause a page fault with the errcode having the RSVD bit set. am I mistaken about that?

Well, when the page is not present, you can use all other bits (31 for 32-bit paging) for status. I use one bit to indicate that a process has allocated a page, but it is not yet mapped to physical memory. If the process access a non-mapped page, the pagefault handler will check if it is allocated, and if so, will allocate physical memory and restart the instruction. If the page is not allocated, a page-fault exception is generated in userspace.

The same principle can be used for memory mapping. My new VFS memory maps all file contents automatically in userspace, and the pages are marked with a special "don't free me" bit to indicate that the physical memory is shared and should not be freed as part of process exit, or if the linear address is freed. This handles all types of sharing, and works with memory mapping too. I use the share bit in kernel too.

Posted: **Wed Mar 13, 2024 4:32 pm**

proxy wrote:I've been thinking about how to best represent process memory maps and am curious about what others have come up with. Right now, I'm just doing something simple with having an array of "regions", where each region has a start,end,permissions,etc... which is "ok", but definitely is inefficient for certain operations.

I believe the requirements generally are:

* entries need to be non-overlapping, so detecting overlaps should be made easy
* at least two entries (heap and stack) should be extendable, so detecting the bounds of the nearest region in a given direction should be made easy
* ideally, finding holes of a given size should also be easy, so things like mmap can be efficiently implemented.

So, what have people done? Do they just use a sorted array and brute force it knowing that there won't be THAT many regions per process? or is there a clever data structure I'm not thinking of?

I map virtual addresses to segment descriptors (with base, size and backing details) using binary search tree. There is one tree for the kernel address space shared between all processes, and one tree per process for the process user address space.

A sorted data structure here is important, to make lookups O(log2), as you'll be doing look ups quite often (each page fault, any system call argument that validates a user space provided address,) hence why I use a BST, but any sorted structure would do (sorted array, b+tree). In fact, an array based structure would be more optimal for lookup, as it'll be more cache friendly.

For holes to satisfy mmap calls, you're likely to have far fewer entries to track, as your space is likely to be in a small number of actual spaces. I don't have a mmap yet, so when exec'ing, I just allocate segments based on PT_LOAD segment entries from the ELF file, keeping track of the highest and lowest addresses used, then setting the break address to the highest address from which the heap is expanded.

But I do track holes in kernel space, and for that, I use a sorted map of the holes, again by base address. Given the small number of entries you'll probably have (segments are likely concentrated into contiguous regions), a sorted array will probably be optimal, but I currently use a BST here as well. I walk the available holes in address space order, using the first fit hole to allocate the desired new segment. "Removing" the new segment from the hole is as simple as advancing the hole base address by the size of the allocated segment, and reducing the size accordingly, so you don't even have to re-sort or resize your sorted map, unless the hole found was exactly the right size and can be removed.

For your specific requirements:

* entries need to be non-overlapping, so detecting overlaps should be made easy
* at least two entries (heap and stack) should be extendable, so detecting the bounds of the nearest region in a given direction should be made easy

Using a sorted map, you should be able to easily find entries that are greater than or less than your address key. My abstract map interface defines lookups using <, <=, ==, >= and > operators, so you can easily find the start of the next segment in either direction that will provide the bounds to which you can extend a segment. so both of these should be easy and efficient.

* ideally, finding holes of a given size should also be easy, so things like mmap can be efficiently implemented.

As noted above, a map of holes will likely be small, so using first-fit, you can linearly walk the map of holes by base address, for an O(N) lookup (where N is the number of holes, and will be small.) If your desired address has any alignment requirements, this can fragment your holes, but I don't find that a problem. Running up my kernel, running a fork bomb test case and mounting a filesystem, splits my kernel address space into 3 holes.

Using a sorted map by base address of the hole makes it easy to find adjacent holes when freeing segments, allowing you to easily merge adjacent holes. Certainly easier than if your holes are indexed by size. Plus, indexing by size would require a compound key of [size,base], to keep keys unique when faced with multiple holes of the same size.

TL; DR

The goto interface in my kernel is the sorted map, against which I have a number of implementations (BST, sorted array)

For address space segments, I add segments to the map sorted by segment base address.

For address space holes, I add holes to the map sort by hole base address, with a linear first fit search to find a suitable hole to satisfy an address space allocation.

Basically, use a sorted map by address for both is my advice.

Posted: **Wed Mar 13, 2024 8:27 pm**

proxy wrote:So I'll have to think on it some more, but, I thought that they HAVE to be 0's for any present pages or it will cause a page fault with the errcode having the RSVD bit set. am I mistaken about that?

Reserved bits must be zero, but available bits can be whatever you want. Present entries have a handful of available bits. Not-present entries are mostly available bits.

Posted: **Wed Mar 13, 2024 8:50 pm**

thewrongchristian wrote:any system call argument that validates a user space provided address

That is way overkill. The simplest idea is to verify that the user provided address does actually point to userspace (so is smaller than the start of kernel space, or whatever metric you use), and then just access it in a special instrumented function that your page fault handler can pick up and place you in an exception handler. Your way is how Linux used to do things, and my way is how they are doing things now (that is, ever since 2.4.x somewhere) that they realized that the CPU already does all of the checking (as long as CR4.WP is enabled in normal execution).

Basically, all of the user access functions write the address of the instruction actually doing the address into an exception table. Then all of the fault handlers notice if they are in kernel mode and the instruction pointer equals one of the ones in the table, and then the instruction pointer is simply overwritten with the corresponding handler and the fault handler returns. And the exception handler in the user access function simply returns failure.

OSDev.org

Good data structure for process memory maps?

Good data structure for process memory maps?

Re: Good data structure for process memory maps?

Re: Good data structure for process memory maps?

Re: Good data structure for process memory maps?

Re: Good data structure for process memory maps?

Re: Good data structure for process memory maps?

Re: Good data structure for process memory maps?

Re: Good data structure for process memory maps?

Re: Good data structure for process memory maps?