Hi,
kenz wrote:Suppose the system, while executing Process A, has run out of physical memory and a page needs to be evicted. The system must find something that is paged in but that is old. This may of course involve a page that is owned by a completely different process B, i.e. it is not even in the current virtual address space. What data structure do kernels typically use to do this? Does the kernel have a data structure of all _physical_ pages (or ranges) and which address spaces (i.e. processes) they are currently mapped into (and some bits for keeping track of last use)? Because for determining pages to evict we are of course only interested in something that is currently in physical memory. This table would be a kind of reverse-indexed page table (because the CPU's page tables are indexed by virtual address) and would be mapped into the global kernel space.
An alternative could be to have, for each process, data structures indexed by virtual address (perhaps even the native CPU page structures, which have to be there anyway). When finding a page to evict, this structure would have to be walked for all processes to find candidates for eviction. The question then is: would this data structure be in the global kernel space? Or would we have to switch between address spaces to access it? Or would we, on a per-process basis, map it into the kernel address space during the walk and unmap it afterwards? Self-referencing page table strategies would only be usable if we actually switch to the foreign process's address space during the scan.
First, assume that there's a "memory pressure" rating somewhere that reflects how much physical memory is free (e.g. maybe "memory_pressure = max(total_pages - free_pages - 256, 0) / total_pages;"). When "memory_pressure" is low enough the kernel does almost nothing, when "memory_pressure" is near the middle of its range the kernel is gathering "what was used when" statistics, and when "memory_pressure" is high enough the kernel starts trying to reduce the memory pressure (by doing things like sending pages to swap space to free the physical pages). This allows the kernel to avoid the overhead of gathering statistics when there's lots of free physical memory (which improves performance) but also ensures that the kernel starts reducing memory pressure before it becomes critical.
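For example, the calculation might be something like this (a rough sketch only; the function name and the "0 to 1000" scale are made up, and the scale just avoids integer division truncating everything to zero):

#include <stdint.h>

#define RESERVED_PAGES 256   /* pages the kernel tries to always keep free */

/* Pressure scaled so that 0 = no pressure, 1000 = no usable RAM left.
   Assumes total_pages != 0. */
uint32_t calc_memory_pressure(uint64_t total_pages, uint64_t free_pages) {
    int64_t shortage = (int64_t)total_pages - (int64_t)free_pages - RESERVED_PAGES;

    if (shortage < 0) {
        shortage = 0;   /* plenty of free RAM, no pressure at all */
    }
    return (uint32_t)(((uint64_t)shortage * 1000) / total_pages);
}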
Second, assume that each thing using memory also has a rating that is calculated from 2 factors - an estimate of when the data will be needed again next, and an estimate of how expensive it is to fetch the data if/when it is needed again. The estimate of when the data will be needed again next is typically derived from how long it's been since it was used last (but that's not necessarily the only way, and in some specific cases it's not necessarily the best way). How expensive it is to fetch the data again varies significantly, depending on where it came from and if it's been modified since. For example, if the data hasn't been modified and is part of a memory mapped file on a fast SSD then the kernel can free the page without saving the data on swap space and then fetch it from the fast SSD if it's needed again, but if the data was modified then it'd have to save it to swap space first and that's likely to be more expensive. The point of this calculation is to make more intelligent decisions (sometimes it's better for performance to free something that's more likely to be needed sooner simply because the data can be fetched faster).
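As a rough sketch (with made-up cost weights, scaled to match the same "0 to 1000" range as "memory_pressure"):

#include <stdint.h>

/* Made-up weights for "how expensive is it to get the data back". */
enum refetch_cost {
    COST_CLEAN_FILE_SSD = 1,   /* unmodified file data on a fast SSD     */
    COST_CLEAN_FILE_HDD = 4,   /* unmodified file data on a slow disk    */
    COST_DIRTY_TO_SWAP  = 8,   /* modified data, must be written to swap */
};

/* Recently used and/or expensive-to-refetch pages get a high rating;
   old, cheap-to-refetch pages get a rating near zero. */
uint32_t page_rating(uint64_t now, uint64_t time_used_last, uint32_t cost) {
    uint64_t age = now - time_used_last + 1;        /* ticks since last use */
    uint64_t rating = ((uint64_t)cost * 1000) / (8 * age);

    return (rating > 1000) ? 1000 : (uint32_t)rating;
}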
Of course both of these things should be scaled to allow simple comparisons - e.g. "if( memory_pressure > rating_for_this_data) { free_this_data_to_reduce_memory_pressure(); }".
What isn't as obvious is that it can also be used in reverse - e.g. "if( memory_pressure < rating_for_this_data) { prefetch_this_data_before_its_needed(); }". Sometimes "memory_pressure" decreases a lot in a short amount of time (e.g. when a large application is terminated) and when that happens you end up with lots of free physical pages of RAM being wasted that could be used (if things like disk drives aren't doing anything important) to improve performance later.
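Putting both comparisons together (and re-using the made-up helpers from the earlier sketches), the balancing logic for one page might look like:

#include <stdbool.h>
#include <stdint.h>

struct page_info {
    bool     resident;        /* currently in physical RAM?          */
    uint64_t time_used_last;  /* from the accessed-bit scans         */
    uint32_t cost;            /* refetch cost, see previous sketch   */
};

extern uint64_t now_ticks(void);
extern uint32_t page_rating(uint64_t now, uint64_t last, uint32_t cost);
extern void evict_page(struct page_info *p);
extern void prefetch_page(struct page_info *p);

void balance_one_page(struct page_info *p, uint32_t memory_pressure) {
    uint32_t rating = page_rating(now_ticks(), p->time_used_last, p->cost);

    if (p->resident && memory_pressure > rating) {
        evict_page(p);       /* free it to reduce memory pressure    */
    } else if (!p->resident && memory_pressure < rating) {
        prefetch_page(p);    /* use idle RAM/disk time to help later */
    }
}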
Now; let's assume that the OS happens to be running a web browser (that has a lot of web pages cached) and a bitmap image editor (that is consuming a lot of memory for "undo" - to allow the most recent changes to be reverted) and several other applications that use lazy algorithms (e.g. maybe something like "if( cached_result[x, y] == UNKNOWN) { cached_result[x, y] = calculate(x, y); } return cached_result[x, y];" to avoid expensive calculations that have been done previously). The kernel could notify these processes whenever the kernel's "memory_pressure" has changed enough since last time; so that processes are able to free memory (and/or pre-fetch/pre-calculate stuff) to help the kernel keep "memory_pressure" within a certain range.
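For example, a process might register a handler along these lines (where the notification mechanism and every name are made up):

#include <stdint.h>

#define PRESSURE_HIGH 700   /* start throwing cached data away       */
#define PRESSURE_LOW  200   /* safe to pre-fetch/pre-calculate again */

extern void discard_least_valuable_caches(void);
extern void prefetch_likely_needed_data(void);

/* Handler the process registers for the (hypothetical) notification. */
void on_memory_pressure_changed(uint32_t memory_pressure) {
    if (memory_pressure > PRESSURE_HIGH) {
        /* Cheapest-to-rebuild data first (cached web pages, memoised
           "cached_result" entries, oldest "undo" steps, ...). */
        discard_least_valuable_caches();
    } else if (memory_pressure < PRESSURE_LOW) {
        /* RAM is plentiful; spend it on things that help later. */
        prefetch_likely_needed_data();
    }
}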
Next; assume that the VFS has file data caches, and the networking layer has DNS caches. These can work on the same "notify if memory pressure changed enough since last time" mechanics that processes use (and if it's a micro-kernel where these things are implemented as processes there's no additional work needed).
The last piece is the kernel's virtual memory manager. This is 2 parts - tracking statistics, and finding the best pages to free (or pre-fetch). The raw information comes from the "accessed" and "dirty" flags in the page table entries - it's (almost literally) scanning the virtual address space looking for page table entries that have the accessed bit set, and if it is set you'd set a "time_this_page_was_used_last = now" and clear the flag. Because the raw information comes from virtual addresses it's probably best to keep all the statistics indexed by virtual address (which makes shared memory trickier, but makes everything else easier). This also allows you to cheat - if a process hasn't been given any CPU time for 1234 seconds then you know that none of its pages have been accessed for at least 1234 seconds without bothering to do a (relatively expensive) scan of its virtual address space. More specifically, if you scan a virtual address space immediately before doing a task switch you can do "actual time page was used last = time since process was given CPU time + time page was used last time the scan was done". You can also keep track of "oldest time page was used when scan was done" for each process, so that (when kernel wants to free some memory) the kernel can quickly decide which processes are/aren't worth looking at.
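For 80x86, the scanning part might look something like this (a rough sketch of the inner loop for one page table, with a made-up parallel array for the statistics):

#include <stdint.h>

#define PTE_PRESENT  0x001ULL
#define PTE_ACCESSED 0x020ULL   /* set by the CPU on any read or write */

/* One 512-entry x86-64 page table; "time_used_last" is a made-up parallel
   array of per-page statistics indexed the same way. */
void scan_page_table(volatile uint64_t *table, uint64_t *time_used_last, uint64_t now) {
    for (int i = 0; i < 512; i++) {
        uint64_t pte = table[i];

        if ((pte & PTE_PRESENT) && (pte & PTE_ACCESSED)) {
            time_used_last[i] = now;           /* page was touched since last scan */
            table[i] = pte & ~PTE_ACCESSED;    /* clear it for the next interval   */
            /* Note: the TLB entry needs to be invalidated too, or the CPU
               might not set the accessed flag again. */
        }
    }
}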
Also note that the page fault handler can be used to assist in tracking the usage statistics. Specifically; immediately before a task switch you can scan a task's virtual address space - if a page was accessed you'd clear the "accessed" flag, and if a page was not accessed you'd unmap the page and put it into a "hot pool"; and immediately after a task switch you'd merge everything from the "cool pool" into the "cold pool", then move everything from the "warm pool" to the "cool pool", then move everything from the "hot pool" to the "warm pool". Of course where I've said "move" I really mean swap some pointers and move nothing. In this case, if you get a page fault because something isn't mapped you'd search all of the pools (hot, warm, cool, cold) and if the page is in one of the pools you'd remove it from the pool and map it again. In this way you end up with 4 (or more or less) pools where the least recently used pages are in the same pool (e.g. all in the "cold pool" if it's not empty). You can also track "latest time for all pages in each pool at time of last scan" and do the same "latest time for all pages in pool = latest time at time of last scan + time since scan was done" thing.
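As a rough sketch (where every name is made up, and "list_merge" moves pointers rather than data), the "immediately after a task switch" part might be:

/* Made-up pool structures; "moving" a pool is just swapping pointers. */
struct page_node;                       /* intrusive list of unmapped pages */
struct page_list { struct page_node *head; };

struct pools {
    struct page_list hot, warm, cool, cold;
};

/* Appends everything on "src" to "dest" and leaves "src" empty. */
extern void list_merge(struct page_list *dest, struct page_list *src);

/* Called immediately after a task switch. */
void age_pools(struct pools *p) {
    list_merge(&p->cold, &p->cool);   /* cool pages become cold      */
    p->cool = p->warm;                /* warm pages become cool      */
    p->warm = p->hot;                 /* hot pages become warm       */
    p->hot.head = 0;                  /* hot pool starts empty again */
}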
kenz wrote:Then once a candidate for eviction has been found, the next question is how eviction actually happens. Does the kernel do some ad-hoc mapping of both the page itself and of the owning process' paging structures (since they have to be updated) in order to swap out the data and update the structure? Or will the kernel switch to the address space of the owning process (or wait until that happens naturally?) and then handle things from there, where self-mapping page tables can be used?
The actual eviction depends on where the data came from. If the data is from a memory mapped file and hasn't been modified, or if the data was already saved to swap space and hasn't been modified since it was loaded back from swap last, then the kernel might be able to (atomically) set the page table entry back to "not present" and then free the physical page. Otherwise (if the data was modified and has to be saved in swap space first) you have to worry about race conditions (the process accesses it and wants it back before it's saved); and you either have a "waiting for swap" state for the task so that the scheduler can ensure the task doesn't get CPU time until the data has been written to swap space (you're going to need a "waiting for swap" state for loading data back from swap space anyway), or you have a "currently being sent to swap" state for the page itself so that the task can continue running (if the scheduler gives it CPU time) until/unless it tries to access the page that is still being sent to swap space.
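For the second option, the page's state handling might look something like this (a rough sketch that ignores locking; every name is made up):

enum page_state {
    PAGE_RESIDENT,
    PAGE_SENDING_TO_SWAP,   /* asynchronous write to swap is in flight */
    PAGE_IN_SWAP,
};

struct evictable_page {
    enum page_state state;
    /* ...physical address, owner, location in swap, etc... */
};

extern void unmap_page(struct evictable_page *p);
extern void map_page(struct evictable_page *p);
extern void start_async_swap_write(struct evictable_page *p);
extern void cancel_swap_write(struct evictable_page *p);
extern void block_task_until_loaded(struct evictable_page *p);

void evict_dirty_page(struct evictable_page *p) {
    p->state = PAGE_SENDING_TO_SWAP;
    unmap_page(p);                   /* future accesses cause a page fault */
    start_async_swap_write(p);       /* completion sets "PAGE_IN_SWAP"     */
}

/* Called by the page fault handler if the faulting page is known. */
void handle_fault(struct evictable_page *p) {
    if (p->state == PAGE_SENDING_TO_SWAP) {
        cancel_swap_write(p);        /* data is still in RAM, just remap it */
        p->state = PAGE_RESIDENT;
        map_page(p);
    } else if (p->state == PAGE_IN_SWAP) {
        block_task_until_loaded(p);  /* the "waiting for swap" state */
    }
}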
Finally; don't forget that swap space may be tiered. For example, a video card driver might give you 256 MiB of "fast VRAM as swap space", an SSD might give you 4 GiB of swap partition, and a mechanical hard drive might give you another 60 GiB of slower swap partition. In this case you need swap space management to do things like shift pages from the faster swap providers to slower swap providers; and possibly for other purposes (e.g. maybe to provide redundancy - store 2 copies on 2 different swap providers so that if a device fails you're able to recover).
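As a rough sketch (with made-up structures, and assuming the provider list is kept ordered from fastest to slowest), deciding where an evicted page should go might be:

#include <stdint.h>

/* Made-up provider list matching the example above (4 KiB pages). */
struct swap_provider {
    const char *name;
    uint64_t    free_pages;
    uint32_t    speed;       /* higher = faster to fetch pages back */
};

static struct swap_provider providers[3] = {
    { "VRAM",          65536,    100 },   /* 256 MiB of fast VRAM */
    { "SSD partition", 1048576,   20 },   /* 4 GiB                */
    { "HDD partition", 15728640,   1 },   /* 60 GiB, slowest      */
};

/* Pick the fastest provider that still has free space. */
struct swap_provider *choose_swap_provider(void) {
    for (int i = 0; i < 3; i++) {
        if (providers[i].free_pages > 0) {
            return &providers[i];
        }
    }
    return 0;   /* completely out of swap space */
}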
Cheers,
Brendan