Handling SWAP

Discussions on more advanced topics such as monolithic vs micro-kernels, transactional memory models, and paging vs segmentation should go here. Use this forum to expand and improve the wiki!
Jeko
Member
Posts: 500
Joined: Fri Mar 17, 2006 12:00 am
Location: Napoli, Italy

Handling SWAP

Post by Jeko »

I'm thinking about swapping memory out to disk.
I will use a separate partition (like Linux does) to hold the swap.

How should I decide which memory to swap?
What are the best methods?
binutils
Member
Posts: 214
Joined: Thu Apr 05, 2007 6:07 am

Post by binutils »

Swap is just swap; is there really a method to it?

If there is one, just mimic it.
lukem95
Member
Posts: 536
Joined: Fri Aug 03, 2007 6:03 am
Location: Cambridge, UK

Post by lukem95 »

Do you have a stable ATA driver? (This may sound silly, but I've never seen one seen through to completion.)

Maybe you should also add a field to your memory manager so you can see which process allocated each block of memory. That way you can swap out memory belonging to low-priority tasks with little run time, and minimise the risk of having to swap it straight back in.
~ Lukem95 [ Cake ]
Release: 0.08b
JamesM
Member
Posts: 2935
Joined: Tue Jul 10, 2007 5:27 am
Location: York, United Kingdom

Re: Handling SWAP

Post by JamesM »

MarkOS wrote:I'm thinking about swapping memory out to disk.
I will use a separate partition (like Linux does) to hold the swap.

How should I decide which memory to swap?
What are the best methods?
Hi,

There are several methods of deciding what to swap.

1. Pick something randomly. This may seem like a stupid solution, but it actually has its benefits. If you have access to some source of pseudorandomness, for example a hardware downcounter that counts so fast its state could be anything, you can use that to choose a frame to swap. The advantage is that it avoids thrashing (as much as possible). Thrashing is where a set of frames is swapped out, then required again and paged back in, then the same frames are paged back out, and so on. With the random approach this is unlikely to happen repeatedly. Several architectures do something similar when handling TLB misses and choosing a TLB entry to refill; cf. MIPS.
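Option 1 can be very small. Here is a minimal sketch under some assumed names not in the post: a frame table of `NFRAMES` entries and a `pinned[]` array marking frames that must never be swapped (e.g. kernel pages).

```c
#include <stddef.h>

#define NFRAMES 1024  /* hypothetical size of the physical frame table */

/* 'counter' stands in for the fast hardware downcounter mentioned
 * above; a real kernel would read it directly. We probe linearly past
 * pinned frames so the function always terminates, provided at least
 * one frame is unpinned. */
int pick_random_victim(const int pinned[NFRAMES], unsigned counter)
{
    int victim = (int)(counter % NFRAMES);
    while (pinned[victim])
        victim = (victim + 1) % NFRAMES;
    return victim;
}
```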

2. Identify a working set. I believe Vista does this. Analysis has shown that at any one point a program will only be working on a subset of its allocated memory (that's a no-brainer). So if you can identify this "working set", you can ensure that anything swapped out doesn't belong to it, so as not to cause thrashing. As the working set can change at any time (programs go through "phases" of working sets: rather than a gradual migration of adding a few bytes and removing a few bytes, the entire set usually changes at once), you need to monitor this and use some carefully crafted algorithms to detect and identify these events unobtrusively.

There are probably a couple of others, so don't take this as a definitive list; I just can't think of them right now! Obviously 2 is more difficult than 1, so I would go with the first solution to start with. I'd actually be interested to know how Linux and Windows XP do it, because I honestly have no idea.

As far as storing the swapped data on disk is concerned, using a partition enables you to use your own filesystem layout, which is very useful, as swap data is implicitly more constrained and ordered than 'normal' files/directories. There are algorithms out there that can give you extremely good insert/delete/find speed (close to O(1), IIRC). Using a pagefile is similar, but you get a little (though not much) less flexibility in the arrangement of your data, IMHO.

Cheers,

James
Krox
Posts: 23
Joined: Fri Mar 21, 2008 3:52 am
Location: Germany

Post by Krox »

I saw a nice algorithm in a book about the internal structures of System V (it's ~20 years old; I don't know if current implementations do it similarly):

The kernel goes through all pages at a fixed interval (the book suggests 100-1000 ms, maybe more) and checks the accessed flag. If it isn't set, it increments the age counter (there are enough available bits inside the page maps to implement it directly there). If it is set, the flag gets cleared and the age counter is set to zero. Everything above a certain age can then be swapped to disk.
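A toy model of that scan might look like this. The page table is a plain array here, `accessed` mirrors the hardware accessed flag, `age` stands in for the counter kept in the OS-available page-map bits, and the threshold of 3 idle scans is an illustrative choice, not from the book.

```c
#include <stddef.h>

struct page {
    unsigned accessed : 1;  /* set by hardware on access, cleared here */
    unsigned age      : 5;  /* counts scans since last access */
};

#define SWAP_AGE 3  /* example threshold: swappable after 3 idle scans */

/* One pass of the fixed-interval scan; returns how many pages are now
 * old enough to be sent to swap. */
size_t aging_scan(struct page *pages, size_t n)
{
    size_t swappable = 0;
    for (size_t i = 0; i < n; i++) {
        if (pages[i].accessed) {
            pages[i].accessed = 0;  /* clear the flag, reset the age */
            pages[i].age = 0;
        } else if (pages[i].age < 31) {
            pages[i].age++;         /* idle for one more interval */
        }
        if (pages[i].age >= SWAP_AGE)
            swappable++;
    }
    return swappable;
}
```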

Obviously the algorithm can be optimized by not traversing all pages. You can exclude high-priority tasks, specially marked ones, or simply those which couldn't be swapped for a long time (but that's quite difficult to implement).
21 is only half the truth...
Jeko
Member
Posts: 500
Joined: Fri Mar 17, 2006 12:00 am
Location: Napoli, Italy

Post by Jeko »

Krox wrote:I saw a nice algorithm in a book about the internal structures of System V (it's ~20 years old; I don't know if current implementations do it similarly):

The kernel goes through all pages at a fixed interval (the book suggests 100-1000 ms, maybe more) and checks the accessed flag. If it isn't set, it increments the age counter (there are enough available bits inside the page maps to implement it directly there). If it is set, the flag gets cleared and the age counter is set to zero. Everything above a certain age can then be swapped to disk.

Obviously the algorithm can be optimized by not traversing all pages. You can exclude high-priority tasks, specially marked ones, or simply those which couldn't be swapped for a long time (but that's quite difficult to implement).
I'm very interested in this.
How can I do it?
lukem95
Member
Posts: 536
Joined: Fri Aug 03, 2007 6:03 am
Location: Cambridge, UK

Post by lukem95 »

Erm... add fields to your paging struct for usage and age. Traverse the list checking the usage flag; if the page hasn't been used, increment its age.

He told you how to do it.
Krox
Posts: 23
Joined: Fri Mar 21, 2008 3:52 am
Location: Germany

Post by Krox »

I assume you have already implemented paging. The accessed flag is a bit inside the page-map entries. It is set by the hardware on every read/write operation, but can be cleared only by the OS. Additionally, there are some bits ignored by the hardware (3 in protected mode and 14 in long mode), available for use by the OS. That is probably enough for the age counter, so you don't even need additional structures for that information.
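Concretely, for 32-bit x86 paging the accessed flag is bit 5 of a page table entry, and bits 9-11 are ignored by the MMU, so an age counter with values 0-7 fits directly in the entry. A sketch of one aging step on a single entry:

```c
#include <stdint.h>

#define PTE_ACCESSED  (1u << 5)     /* set by hardware on any access */
#define PTE_AGE_SHIFT 9             /* bits 9-11 are ignored by the MMU */
#define PTE_AGE_MASK  (7u << PTE_AGE_SHIFT)

unsigned pte_age(uint32_t pte)
{
    return (pte & PTE_AGE_MASK) >> PTE_AGE_SHIFT;
}

/* One aging step: reset on access, otherwise count up, saturating at
 * 7 since only three bits are available. */
uint32_t pte_age_step(uint32_t pte)
{
    if (pte & PTE_ACCESSED)
        pte &= ~(PTE_ACCESSED | PTE_AGE_MASK);  /* clear flag, age = 0 */
    else if (pte_age(pte) < 7)
        pte += 1u << PTE_AGE_SHIFT;             /* age++ in bits 9-11 */
    return pte;
}
```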
z180
Member
Posts: 32
Joined: Tue Mar 04, 2008 12:32 pm

Post by z180 »

If you mean the book by Maurice J. Bach that Linus used while creating Linux, I have to say that the VM system added in System V Release 2 is what BSD used from 3BSD up to 4.3BSD-Reno, and it is more than a bit unportable and limited: it assumes that there is more swap space than memory, it cannot use a regular file for swap, and it was first developed on the VAX. Mach and SunOS wrote their own VM systems, and I recommend looking at the source code of newer, portable systems like NetBSD.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Handling SWAP

Post by Brendan »

Hi,
MarkOS wrote:How should I decide which memory to swap?
What are the best methods?
Typically the only hard part is deciding which pages to send to swap space when you run out of RAM. This is often done with an LRU (or "least recently used") algorithm.

The general idea is to scan the page tables and check the "accessed" flag in the page table entry for each present page. For LRU you'd have one main counter, and for each present page some sort of "last_count" variable. Each time you do the scan you increase the counter; then, for each page that was accessed you'd do "last_count = 0xFFFFFFFF", and for each page that wasn't accessed you'd do "if (last_count == 0xFFFFFFFF) last_count = counter". Then, when you want to send pages to swap space, you find the pages with the lowest "last_count".
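As a toy model of that scheme (a plain array standing in for the page tables, with `accessed` mirroring the hardware flag):

```c
#include <stdint.h>
#include <stddef.h>

struct page {
    int      accessed;     /* mirrors the hardware accessed flag */
    uint32_t last_count;   /* 0xFFFFFFFF = accessed on the latest scan */
};

/* One scan pass; the caller increments 'counter' before each pass. */
void lru_scan(struct page *pages, size_t n, uint32_t counter)
{
    for (size_t i = 0; i < n; i++) {
        if (pages[i].accessed) {
            pages[i].last_count = 0xFFFFFFFFu;
            pages[i].accessed = 0;
        } else if (pages[i].last_count == 0xFFFFFFFFu) {
            /* first scan on which the page was idle: stamp it */
            pages[i].last_count = counter;
        }
    }
}

/* Victim = the page with the lowest last_count (idle the longest). */
size_t lru_victim(const struct page *pages, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (pages[i].last_count < pages[best].last_count)
            best = i;
    return best;
}
```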

This has a few problems though. First, the counter can overflow. When this happens you can do "counter = counter >> 1" and (for every page) do "last_count = last_count >> 1".

Secondly, without some improvements it's slow (too much overhead). To speed it up you can keep track of the lowest count for each process, so that when you're looking for a page to send to swap space you don't need to look at every page table entry in every process (instead, you'd start with the page table entries for the process with the lowest count). You can also keep track of a "pending count shifts" value for each process - when the counter overflows you'd increase the "pending count shifts" value for every process, and then do "last_count = last_count >> pending_shifts" when that process is scanned next.
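The overflow handling can be sketched like this (the struct layout is illustrative, not from the post): halving the main counter only records a pending shift per process, and pages are fixed up the next time that process is scanned anyway. The 0xFFFFFFFF "in use" marker is left alone so it keeps its special meaning.

```c
#include <stdint.h>
#include <stddef.h>

struct process {
    uint32_t *last_counts;    /* per-page stamps, as in the scan */
    size_t    npages;
    unsigned  pending_shifts; /* halvings not yet applied */
};

/* Called whenever the main counter overflows: O(processes), and it
 * touches no page entries at all. */
void note_overflow(struct process *procs, size_t nprocs)
{
    for (size_t i = 0; i < nprocs; i++)
        procs[i].pending_shifts++;
}

/* Applied lazily at the start of this process's next scan. */
void apply_pending(struct process *p)
{
    for (size_t i = 0; i < p->npages; i++)
        if (p->last_counts[i] != 0xFFFFFFFFu)   /* keep the in-use marker */
            p->last_counts[i] >>= p->pending_shifts;
    p->pending_shifts = 0;
}
```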

Also, it's simplistic. A better idea is to combine this with other ways of getting RAM, like evicting data from file system caches and flushing memory mapped files to disk. You'd also want to take into account the cost of writing a page to disk (if necessary) and the cost of potentially loading a page back in if it's needed.

For example, imagine there's a page of data that's part of a read-only memory-mapped file stored on a floppy that hasn't been accessed for 40 seconds, and a normal RAM page that hasn't been accessed for 10 seconds. In this case the page used by the memory-mapped file could be freed immediately (without writing any data to disk, because it's already on disk), while the normal RAM page would need to be written to swap before its frame could be re-used. However, the floppy drive is insanely slow, so it's probably still better to send the normal RAM page to swap (even though it's not the least recently used, and even though it involves transferring more data).

Now, what if there's also a page of file system cache that hasn't been used for 15 seconds - would it be better to recycle that page instead? That depends...

For another example, imagine there are two normal pages of RAM. The first hasn't been accessed for 11 seconds and belongs to a very high-priority process; the second hasn't been accessed for 9 seconds and belongs to a very low-priority process. Which page do you send to swap?

The basic idea would be to create some sort of formula that gives each page a rating (based on a range of factors, like time since the page was accessed last, process priority, time to write to disk, time to load from disk, etc), and then select pages with the best rating.
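A hypothetical rating function of this kind could look as follows; the field names and the weights (the factor 10 and the /2) are purely illustrative, and a real kernel would tune them empirically.

```c
struct page_info {
    long idle_ms;       /* time since the page was last accessed */
    long priority;      /* owner priority (higher = more important) */
    long writeback_ms;  /* cost to write the page out (0 if clean) */
    long reload_ms;     /* cost to load it back in if needed again */
};

/* Higher score = better eviction candidate. */
long eviction_score(const struct page_info *p)
{
    return p->idle_ms
         - p->priority * 10    /* protect pages of high-priority owners */
         - p->writeback_ms     /* expensive-to-evict pages score lower */
         - p->reload_ms / 2;   /* likewise pages costly to bring back */
}
```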


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Post by Brendan »

Hi,
Krox wrote:The kernel goes through all pages at a fixed interval (the book suggests 100-1000 ms, maybe more) and checks the accessed flag. If it isn't set, it increments the age counter (there are enough available bits inside the page maps to implement it directly there). If it is set, the flag gets cleared and the age counter is set to zero. Everything above a certain age can then be swapped to disk.
Imagine you've got 2000 blocked tasks with 2000 pages each, and every 100 ms you need to examine every page table entry for each of these blocked tasks. If it takes 20 cycles (on average) to update a page's age, that adds up to 800,000,000 (i.e. 2000*2000*20/0.1) cycles per second of overhead, or about 40% of CPU time on a 2 GHz CPU.

For my approach (see my last message in this topic) the "age" (or "last_count") for each page doesn't need to be touched unless the task was given CPU time (you could even have a "needs scanning" flag that is set by the scheduler when a task is given CPU time and cleared by the VMM after it updates the "last_counts"). That way you can have billions of tasks, and if only 100 tasks were given CPU time since the last scan then you'd only scan 100 task's pages.

Of course I should point out that some OSs keep track of the least recently used pages in each process and other OSs keep track of the least recently used pages in all processes (or the least recently used pages in the whole computer). I prefer the second approach... ;)


Cheers,

Brendan
Krox
Posts: 23
Joined: Fri Mar 21, 2008 3:52 am
Location: Germany

Post by Krox »

Well... I thought my approach would be slow, but not THAT slow. Actually I haven't implemented swapping yet, but your words make sense; a "really good" algorithm now sounds nearly impossible to me :lol:

@z180: Yes, I was referring to that book, though I didn't know Linus used it. Interesting. I don't know how BSD handles that stuff, but honestly I don't see why the approach I described is not portable. Of course, it needs paging, but that isn't your point, I think?
Krox
Posts: 23
Joined: Fri Mar 21, 2008 3:52 am
Location: Germany

Post by Krox »

Hm... I've taken a second look at your calculation. You assume RAM of 4 million pages = 16 GiB, which is pretty much for a 2 GHz CPU. Furthermore, I have to admit that 100 ms is too short an interval. That said, the calculation could look like this:

2 GiB of (used) RAM with 4 KiB pages makes 2^19 pages. With one traversal per second and 20 cycles per counter update, that sums up to 10,485,760 cycles per second, or about 0.5% of CPU time on a 2 GHz CPU.

And that can be optimized even further by excluding certain pages as described before. Or maybe you put pages which couldn't be excluded for a long time into a second list, which only gets traversed every 20 seconds; you could build something hierarchical out of that. Honestly, your idea of excluding sleeping tasks is simply good; I hadn't thought of that before.
mrvn
Member
Posts: 43
Joined: Tue Mar 11, 2008 6:56 am

Post by mrvn »

Doesn't the hardware also set the accessed and dirty flags on the page tables themselves?

So you look at the topmost page table and scan all 512 (or 1024) entries. For those that have been accessed, you recurse into the lower page tables. That should cut down majorly on the number of entries you actually have to look at.
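A toy two-level model of that shortcut (with a fan-out of 4 instead of 512, and a `visited` counter added just to show the saving): the top-level accessed flag is set whenever anything under that entry was touched, so untouched subtrees are skipped wholesale.

```c
#include <stddef.h>

#define ENTRIES 4   /* toy fan-out; 512 or 1024 on real hardware */

struct leaf_table { int accessed[ENTRIES]; };

struct top_table {
    int accessed[ENTRIES];            /* set if any leaf below was used */
    struct leaf_table *tables[ENTRIES];
};

/* Count and clear accessed leaf pages, recursing only under top-level
 * entries whose own accessed flag is set. */
size_t scan_hierarchy(struct top_table *top, size_t *visited)
{
    size_t hits = 0;
    for (int i = 0; i < ENTRIES; i++) {
        (*visited)++;
        if (!top->accessed[i] || !top->tables[i])
            continue;                 /* whole subtree untouched: skip it */
        top->accessed[i] = 0;
        for (int j = 0; j < ENTRIES; j++) {
            (*visited)++;
            if (top->tables[i]->accessed[j]) {
                top->tables[i]->accessed[j] = 0;
                hits++;
            }
        }
    }
    return hits;
}
```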

If that doesn't work, you can always mark entries as not present when they haven't been used for a while. That way you get a page fault on the next access. Using that, you can keep a set of pages to scan (young pages), and think of the other pages as swapped out but cached.

MfG
Mrvn
Life - Don't talk to me about LIFE!
So long and thanks for all the fish.
Ready4Dis
Member
Posts: 571
Joined: Sat Nov 18, 2006 9:11 am

Post by Ready4Dis »

What I intend to do is store a counter for each process (and one in general), and simply loop through active pages, marking them as not present one at a time as required. My thought is that once it makes its first round, the pages not being used will not be in the list next time around, leaving their memory swapped. I know it's not a great solution, and I plan on revising it after the initial implementation.

I tend to implement things at least twice, sometimes more, depending on testing/benchmarking. First I make a working copy that, well... works. After that, I implement a version that is supposed to work better and do a bit of testing. Then if I'm satisfied, I move on; if not, I write another version and try again until I'm satisfied. Most of my functions are abstract enough to be replaced without modifying kernel or user code, so I can leave a 'working' swap function in my kernel and change it out after I get my GUI up and running, because chances are I don't need an (efficient) swap disk in place before I am loading tons of bitmaps/images/applications into memory.

I will probably write a few stress testers to make sure it's working properly (like loading way more threads/processes than I know I have memory for), and possibly a bit of benchmarking with a few very large processes: allocate a few megs each across about 10 processes, leave 2 inactive and 8 active, and see how it runs and what is paged to disk versus kept in actual memory. Then re-run those 2 inactive processes and see how long it takes to page back in. This way I have a baseline when I re-implement a 'better' version (sometimes more complex doesn't mean better).