Hi,
kenz wrote:Suppose the system, while executing Process A, has run out of physical memory and a page needs to be evicted. The system must find something that is paged in but that is old. This may of course involve a page that is owned by a completely different process B, i.e. it is not even in the current virtual address space. What data structure do kernels typically use to do this? Does the kernel have a data structure of all _physical_ pages (or ranges) and which address spaces (i.e. processes) they are currently mapped into (and some bits for keeping track of last use)? Because for determining pages to evict we are of course only interested in something that is currently in physical memory. This table would be a kind of reverse-indexed page table (because the CPU's page tables are indexed by virtual address) and would be mapped into the global kernel space.
An alternative could be to have, for each process, data structures indexed by virtual address (perhaps even the native CPU page structures, which have to be there anyway). When finding a page to evict, this structure would have to be walked for all processes to find candidates for eviction. The question then is: would this data structure be in the global kernel space? Or would we have to switch between address spaces to access it? Or would we, on a per-process basis, map it into the kernel address space during the walk and unmap it afterwards? Self-referencing page table strategies would only be usable if we actually switch to the foreign process's address space during the scan.
First, assume that there's a "memory pressure" rating somewhere that reflects how much physical memory is free (e.g. maybe "memory_pressure = max(total_pages - free_pages - 256, 0) / total_pages;"). When "memory_pressure" is low enough the kernel does almost nothing, when "memory_pressure" is near the middle of its range the kernel is gathering "what was used when" statistics, and when "memory_pressure" is high enough the kernel starts trying to reduce the memory pressure (by doing things like sending pages to swap space to free the physical pages). This allows the kernel to avoid the overhead of gathering statistics when there's lots of free physical memory (which improves performance) but also ensures that the kernel starts reducing memory pressure before it becomes critical.
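For example, the calculation might be something like this (a rough sketch only; the function name and the "0 to 1000" scale are made up, and the scale just avoids integer division truncating everything to zero):

#include <stdint.h>

#define RESERVED_PAGES 256   /* pages the kernel tries to always keep free */

/* Pressure scaled so that 0 = no pressure, 1000 = no usable RAM left.
   Assumes total_pages != 0. */
uint32_t calc_memory_pressure(uint64_t total_pages, uint64_t free_pages) {
    int64_t shortage = (int64_t)total_pages - (int64_t)free_pages - RESERVED_PAGES;

    if (shortage < 0) {
        shortage = 0;   /* plenty of free RAM, no pressure at all */
    }
    return (uint32_t)(((uint64_t)shortage * 1000) / total_pages);
}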
Second, assume that each thing using memory also has a rating that is calculated from 2 factors - an estimate of when the data will be needed again next, and an estimate of how expensive it is to fetch the data if/when it is needed again. The estimate of when the data will be needed again next is typically derived from how long it's been since it was used last (but that's not necessarily the only way, and in some specific cases it's not necessarily the best way). How expensive it is to fetch the data again varies significantly, depending on where it came from and if it's been modified since. For example, if the data hasn't been modified and is part of a memory mapped file on a fast SSD then the kernel can free the page without saving the data on swap space and then fetch it from the fast SSD if it's needed again, but if the data was modified then it'd have to save it to swap space first and that's likely to be more expensive. The point of this calculation is to make more intelligent decisions (sometimes it's better for performance to free something that's more likely to be needed sooner simply because the data can be fetched faster).
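As a rough sketch (with made-up cost weights, scaled to match the same "0 to 1000" range as "memory_pressure"):

#include <stdint.h>

/* Made-up weights for "how expensive is it to get the data back". */
enum refetch_cost {
    COST_CLEAN_FILE_SSD = 1,   /* unmodified file data on a fast SSD     */
    COST_CLEAN_FILE_HDD = 4,   /* unmodified file data on a slow disk    */
    COST_DIRTY_TO_SWAP  = 8,   /* modified data, must be written to swap */
};

/* Recently used and/or expensive-to-refetch pages get a high rating;
   old, cheap-to-refetch pages get a rating near zero. */
uint32_t page_rating(uint64_t now, uint64_t time_used_last, uint32_t cost) {
    uint64_t age = now - time_used_last + 1;        /* ticks since last use */
    uint64_t rating = ((uint64_t)cost * 1000) / (8 * age);

    return (rating > 1000) ? 1000 : (uint32_t)rating;
}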
Of course both of these things should be scaled to allow simple comparisons - e.g. "if( memory_pressure > rating_for_this_data) { free_this_data_to_reduce_memory_pressure(); }".
What isn't as obvious is that it can also be used in reverse - e.g. "if( memory_pressure < rating_for_this_data) { prefetch_this_data_before_its_needed(); }". Sometimes "memory_pressure" decreases a lot in a short amount of time (e.g. when a large application is terminated) and when that happens you end up with lots of free physical pages of RAM being wasted that could be used (if things like disk drives aren't doing anything important) to improve performance later.
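Putting both comparisons together (and re-using the made-up helpers from the earlier sketches), the balancing logic for one page might look like:

#include <stdbool.h>
#include <stdint.h>

struct page_info {
    bool     resident;        /* currently in physical RAM?          */
    uint64_t time_used_last;  /* from the accessed-bit scans         */
    uint32_t cost;            /* refetch cost, see previous sketch   */
};

extern uint64_t now_ticks(void);
extern uint32_t page_rating(uint64_t now, uint64_t last, uint32_t cost);
extern void evict_page(struct page_info *p);
extern void prefetch_page(struct page_info *p);

void balance_one_page(struct page_info *p, uint32_t memory_pressure) {
    uint32_t rating = page_rating(now_ticks(), p->time_used_last, p->cost);

    if (p->resident && memory_pressure > rating) {
        evict_page(p);       /* free it to reduce memory pressure    */
    } else if (!p->resident && memory_pressure < rating) {
        prefetch_page(p);    /* use idle RAM/disk time to help later */
    }
}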
Now; let's assume that the OS happens to be running a web browser (that has a lot of web pages cached) and a bitmap image editor (that is consuming a lot of memory for "undo" - to allow the most recent changes to be reverted) and several other applications that use lazy algorithms (e.g. maybe something like "if( cached_result[x, y] == UNKNOWN) { cached_result[x, y] = calculate(x, y); } return cached_result[x, y];" to avoid expensive calculations that have been done previously). The kernel could notify these processes whenever the kernel's "memory_pressure" has changed enough since last time; so that processes are able to free memory (and/or pre-fetch/pre-calculate stuff) to help the kernel keep "memory_pressure" within a certain range.
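For example, a process might register a handler along these lines (where the notification mechanism and every name are made up):

#include <stdint.h>

#define PRESSURE_HIGH 700   /* start throwing cached data away       */
#define PRESSURE_LOW  200   /* safe to pre-fetch/pre-calculate again */

extern void discard_least_valuable_caches(void);
extern void prefetch_likely_needed_data(void);

/* Handler the process registers for the (hypothetical) notification. */
void on_memory_pressure_changed(uint32_t memory_pressure) {
    if (memory_pressure > PRESSURE_HIGH) {
        /* Cheapest-to-rebuild data first (cached web pages, memoised
           "cached_result" entries, oldest "undo" steps, ...). */
        discard_least_valuable_caches();
    } else if (memory_pressure < PRESSURE_LOW) {
        /* RAM is plentiful; spend it on things that help later. */
        prefetch_likely_needed_data();
    }
}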
Next; assume that the VFS has file data caches, and the networking layer has DNS caches. These can work on the same "notify if memory pressure changed enough since last time" mechanics that processes use (and if it's a micro-kernel where these things are implemented as processes there's no additional work needed).
The last piece is the kernel's virtual memory manager. This is 2 parts - tracking statistics, and finding the best pages to free (or pre-fetch). The raw information comes from the "accessed" and "dirty" flags in the page table entries - it's (almost literally) scanning the virtual address space looking for page table entries that have the accessed bit set, and if it is set you'd set a "time_this_page_was_used_last = now" and clear the flag. Because the raw information comes from virtual addresses it's probably best to keep all the statistics indexed by virtual address (which makes shared memory trickier, but makes everything else easier). This also allows you to cheat - if a process hasn't been given any CPU time for 1234 seconds then you know that none of its pages have been accessed for at least 1234 seconds without bothering to do a (relatively expensive) scan of its virtual address space. More specifically, if you scan a virtual address space immediately before doing a task switch you can do "actual time page was used last = time since process was given CPU time + time page was used last time the scan was done". You can also keep track of "oldest time page was used when scan was done" for each process, so that (when kernel wants to free some memory) the kernel can quickly decide which processes are/aren't worth looking at.
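For 80x86, the scanning part might look something like this (a rough sketch of the inner loop for one page table, with a made-up parallel array for the statistics):

#include <stdint.h>

#define PTE_PRESENT  0x001ULL
#define PTE_ACCESSED 0x020ULL   /* set by the CPU on any read or write */

/* One 512-entry x86-64 page table; "time_used_last" is a made-up parallel
   array of per-page statistics indexed the same way. */
void scan_page_table(volatile uint64_t *table, uint64_t *time_used_last, uint64_t now) {
    for (int i = 0; i < 512; i++) {
        uint64_t pte = table[i];

        if ((pte & PTE_PRESENT) && (pte & PTE_ACCESSED)) {
            time_used_last[i] = now;           /* page was touched since last scan */
            table[i] = pte & ~PTE_ACCESSED;    /* clear it for the next interval   */
            /* Note: the TLB entry needs to be invalidated too, or the CPU
               might not set the accessed flag again. */
        }
    }
}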
Also note that the page fault handler can be used to assist in tracking the usage statistics. Specifically; immediately before a task switch you can scan a task's virtual address space - if a page was accessed you'd clear the "accessed" flag, and if a page was not accessed you'd unmap the page and put it into a "hot pool"; and immediately after a task switch you'd merge everything from the "cool pool" into the "cold pool", then move everything from the "warm pool" to the "cool pool", then move everything from the "hot pool" to the "warm pool". Of course where I've said "move" I really mean swap some pointers and move nothing. In this case, if you get a page fault because something isn't mapped you'd search all of the pools (hot, warm, cool, cold) and if the page is in one of the pools you'd remove it from the pool and map it again. In this way you end up with 4 (or more or less) pools where the least recently used pages are in the same pool (e.g. all in the "cold pool" if it's not empty). You can also track "latest time for all pages in each pool at time of last scan" and do the same "latest time for all pages in pool = latest time at time of last scan + time since scan was done" thing.
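As a rough sketch (where every name is made up, and "list_merge" moves pointers rather than data), the "immediately after a task switch" part might be:

/* Made-up pool structures; "moving" a pool is just swapping pointers. */
struct page_node;                       /* intrusive list of unmapped pages */
struct page_list { struct page_node *head; };

struct pools {
    struct page_list hot, warm, cool, cold;
};

/* Appends everything on "src" to "dest" and leaves "src" empty. */
extern void list_merge(struct page_list *dest, struct page_list *src);

/* Called immediately after a task switch. */
void age_pools(struct pools *p) {
    list_merge(&p->cold, &p->cool);   /* cool pages become cold      */
    p->cool = p->warm;                /* warm pages become cool      */
    p->warm = p->hot;                 /* hot pages become warm       */
    p->hot.head = 0;                  /* hot pool starts empty again */
}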
kenz wrote:Then once a candidate for eviction has been found, the next question is how eviction actually happens. Does the kernel do some ad-hoc mapping of both the page itself and of the owning process' paging structures (since they have to be updated) in order to swap out the data and update the structure? Or will the kernel switch to the address space of the owning process (or wait until that happens naturally?) and then handle things from there, where self-mapping page tables can be used?
The actual eviction depends on where the data came from. If the data is from a memory mapped file and hasn't been modified, or if the data was already saved to swap space and hasn't been modified since it was loaded back from swap last, then the kernel might be able to (atomically) set the page table entry back to "not present" and then free the physical page. Otherwise (if the data was modified and has to be saved in swap space first) you have to worry about race conditions (the process accesses it and wants it back before it's saved); and you either have a "waiting for swap" state for the task so that the scheduler can ensure the task doesn't get CPU time until the data has been written to swap space (you're going to need a "waiting for swap" state for loading data back from swap space anyway), or you have a "currently being sent to swap" state for the page itself so that the task can continue running (if the scheduler gives it CPU time) until/unless it tries to access the page that is still being sent to swap space.
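For the second option, the page's state handling might look something like this (a rough sketch that ignores locking; every name is made up):

enum page_state {
    PAGE_RESIDENT,
    PAGE_SENDING_TO_SWAP,   /* asynchronous write to swap is in flight */
    PAGE_IN_SWAP,
};

struct evictable_page {
    enum page_state state;
    /* ...physical address, owner, location in swap, etc... */
};

extern void unmap_page(struct evictable_page *p);
extern void map_page(struct evictable_page *p);
extern void start_async_swap_write(struct evictable_page *p);
extern void cancel_swap_write(struct evictable_page *p);
extern void block_task_until_loaded(struct evictable_page *p);

void evict_dirty_page(struct evictable_page *p) {
    p->state = PAGE_SENDING_TO_SWAP;
    unmap_page(p);                   /* future accesses cause a page fault */
    start_async_swap_write(p);       /* completion sets "PAGE_IN_SWAP"     */
}

/* Called by the page fault handler if the faulting page is known. */
void handle_fault(struct evictable_page *p) {
    if (p->state == PAGE_SENDING_TO_SWAP) {
        cancel_swap_write(p);        /* data is still in RAM, just remap it */
        p->state = PAGE_RESIDENT;
        map_page(p);
    } else if (p->state == PAGE_IN_SWAP) {
        block_task_until_loaded(p);  /* the "waiting for swap" state */
    }
}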
Finally; don't forget that swap space may be tiered. For example, a video card driver might give you 256 MiB of "fast VRAM as swap space", an SSD might give you 4 GiB of swap partition, and a mechanical hard drive might give you another 60 GiB of slower swap partition. In this case you need swap space management to do things like shift pages from the faster swap providers to slower swap providers; and possibly for other purposes (e.g. maybe to provide redundancy - store 2 copies on 2 different swap providers so that if a device fails you're able to recover).
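As a rough sketch (with made-up structures, and assuming the provider list is kept ordered from fastest to slowest), deciding where an evicted page should go might be:

#include <stdint.h>

/* Made-up provider list matching the example above (4 KiB pages). */
struct swap_provider {
    const char *name;
    uint64_t    free_pages;
    uint32_t    speed;       /* higher = faster to fetch pages back */
};

static struct swap_provider providers[3] = {
    { "VRAM",          65536,    100 },   /* 256 MiB of fast VRAM */
    { "SSD partition", 1048576,   20 },   /* 4 GiB                */
    { "HDD partition", 15728640,   1 },   /* 60 GiB, slowest      */
};

/* Pick the fastest provider that still has free space. */
struct swap_provider *choose_swap_provider(void) {
    for (int i = 0; i < 3; i++) {
        if (providers[i].free_pages > 0) {
            return &providers[i];
        }
    }
    return 0;   /* completely out of swap space */
}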
Cheers,
Brendan