
Compression for Swap Space

Posted: Thu Jun 21, 2007 4:58 am
by Brendan
Hi,
bewing wrote:I'm wondering if it's possible to use RL memory compression, rather than paging to hard disk, to free up memory pages, under most realworld circumstances.
Of course it's possible, depending on the load. For example, if everything fits in RAM anyway, even a system with no swap space at all would work. The question then becomes how "good" it is for loads that don't fit in RAM.

For working this out, there are many factors that would affect performance:
1) the "working set" of pages that software is actually using
2) the "total set" of all pages that software has allocated
3) how quickly pages can be fetched from swap and stored in swap
4) the compression ratio
It'd also depend a lot on exactly how it's implemented. For example, a system that compresses all pages might perform worse than a system that only compresses pages that compress well (e.g. if you compress a 4 KB page and it's still 3 KB or larger, then it might be better to leave that page uncompressed and find a different page to compress).
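As a rough sketch of that "only keep it if it's worthwhile" check (assuming a hypothetical compress_page() helper and 4 KB pages; this isn't any particular kernel's API):

Code:
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE        4096
#define WORTHWHILE_LIMIT (PAGE_SIZE * 3 / 4)   /* only keep results <= 3 KB */

/* Hypothetical: compress src into dst, return compressed size (0 on failure). */
size_t compress_page(const uint8_t *src, uint8_t *dst, size_t dst_max);

/* Returns the compressed size if the page is worth storing compressed,
 * or 0 if it should stay uncompressed and another victim should be picked. */
size_t try_compress_page(const uint8_t *page, uint8_t *out)
{
    size_t n = compress_page(page, out, PAGE_SIZE);
    return (n != 0 && n <= WORTHWHILE_LIMIT) ? n : 0;
}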
Candy wrote:Nope. Typical memory usage would perform fairly badly under memory compression, and the only kind of "compression" that would work with RLE would be zeroes - which are handled by only actually giving the process a page when it writes to it.
With RLE, a page full of 0x33 (or any other non-zero byte) would compress just as well as a page full of zeros. Using RLE only wouldn't be as effective as using better compression algorithms though...
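For illustration, a toy byte-oriented RLE encoder for a single 4 KB page might look like the sketch below (just an illustration of the idea, not a format any real pager uses). A page of 4096 identical bytes - whether 0x00 or 0x33 - comes out as 17 (count, value) pairs, i.e. 34 bytes.

Code:
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Encode one page as (count, value) byte pairs, runs capped at 255.
 * Returns the compressed size, or 0 if the output didn't fit (leave the
 * page uncompressed in that case). */
size_t rle_encode_page(const uint8_t *page, uint8_t *out, size_t out_max)
{
    size_t o = 0;
    for (size_t i = 0; i < PAGE_SIZE; ) {
        uint8_t v = page[i];
        size_t run = 1;
        while (i + run < PAGE_SIZE && page[i + run] == v && run < 255)
            run++;
        if (o + 2 > out_max)
            return 0;
        out[o++] = (uint8_t)run;
        out[o++] = v;
        i += run;
    }
    return o;
}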
Combuster wrote:You could try huffman encoding. It can do fair compression under most circumstances and it encodes/decodes in linear time. The part of the equation that remains unknown is how well it actually performs, i.e. if the extra CPU time is worth the decrease in disk accesses...
I'm guessing that (if it's implemented well), compression/decompression could be much faster than waiting for hard disk seek and sector transfers. However, often the CPU can do other work while it's waiting for hard disk (but can't do other things while it's compressing/decompressing). IMHO this means compression/decompression might be better for I/O bound tasks, while hard disk space might be better for CPU bound tasks.
jnc100 wrote:The other problem is that using some of your memory as a compressed store reduces the amount of working memory, thus increasing the likelihood that you need to compress/uncompress data, as well as increasing the number of tlb invalidations/refills.

edit: this might be useful
Yes.

For a system with 1 GB of RAM and a compression ratio of 2:1, you'd be able to have a maximum of 2 GB of (compressed) data, but, unless the size of the working set is zero, you'll be constantly thrashing (compressing and decompressing all the time). If the working set is 512 MB, then you'd be able to have a total of 1.5 GB of data (1 GB compressed and 512 MB uncompressed) without causing thrashing.
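More generally (as a simplification, assuming a uniform compression ratio r, RAM size M, and an uncompressed working set W), the most data you can hold without thrashing is roughly W + r * (M - W); with M = 1 GB, r = 2 and W = 512 MB that gives the 1.5 GB above, and with W = 0 it gives the 2 GB maximum.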

For the same system using hard disk space, the total amount of data you can have is 1 GB plus the amount of hard disk space you're using (typically much more than the 2 GB you'd get with compression), and the size of the working set could always be up to 1 GB without thrashing.

IMHO if I had to choose between compression or hard drive space, then I'd have to choose hard drive space, as it allows for larger working sets and larger total data (even though I think compression would improve performance in some cases).

However, why do we have to choose one or the other? I can imagine a system that uses both methods to get better characteristics than either method would have by itself....


Cheers,

Brendan

Posted: Thu Jun 21, 2007 9:44 pm
by bewing
Yes, all these points were in my thought process, Brendan -- except for the one about how disk access "work" is mostly offloaded from the CPU.

I am thinking that when virtual memory is allocated in the 1st place, it is initialized to 0, anyway. So up to 4gb of virtual memory, initialized to 0, would compress to 8 bytes -- as the initial memory allocation for all userland proggies. RLE is the only encoding that can compress to those kind of ratios, so I consider all other compression mechanisms pretty worthless -- unless they would happen to be built into the machine's hardware.

Also, it seems to me that most programs allocate a bit of "extra" buffer space, when they do allocations. Unused memory, in this fashion, would always remain fully compressed -- as 8 bytes. Expanding compressed pages on demand is trivial, of course.

But yes, I'm planning on keeping the system usage of my OS so small that nothing ever needs to be paged out anyway. Death to bloatware!

Posted: Thu Jun 21, 2007 11:37 pm
by Candy
What about the worst case? A program de-encrypting a compressed object in memory? You can't compress either part of its buffer.

Posted: Fri Jun 22, 2007 12:53 am
by os64dev
bewing wrote:...
But yes, I'm planning on keeping the system usage of my OS so small that nothing ever needs to be paged out anyway. Death to bloatware!
Mostly it is not the OS parts that are huge, but the programs running on it.
It's like building the cross compiler: it gives large executables (> 2 MiB), but after a strip session there are only a few hundred KiB left. So what is the moral behind this? Generally, you as a software developer would like to have debug information or some other trace information in an executable so you can trace bugs better, or at least make a better bug report, which bloats your executable; or you can make bug tracing harder and have smaller executables. You as a developer will probably prefer the latter for releases, but software companies or businesses in general will prefer the bloated version.

Posted: Fri Jun 22, 2007 7:12 am
by JAAman
bewing wrote: I am thinking that when virtual memory is allocated in the 1st place, it is initialized to 0, anyway. So up to 4gb of virtual memory, initialized to 0, would compress to 8 bytes -- as the initial memory allocation for all userland proggies. RLE is the only encoding that can compress to those kind of ratios, so I consider all other compression mechanisms pretty worthless -- unless they would happen to be built into the machine's hardware.

Also, it seems to me that most programs allocate a bit of "extra" buffer space, when they do allocations. Unused memory, in this fashion, would always remain fully compressed -- as 8 bytes. Expanding compressed pages on demand is trivial, of course.
IIRC, most OSs don't actually allocate memory pages when they're requested; they allocate the pages when they're used (the first time the application reads/writes the new memory address, the OS allocates physical memory and zeros it). That means there aren't pages that are set to zero and unused -- so it takes even less than 8 bytes before it's used, once it's used it must be decompressed anyway, and 'overallocation' won't ever take any space at all

Posted: Fri Jun 22, 2007 7:22 am
by Brendan
Hi,
bewing wrote:I am thinking that when virtual memory is allocated in the 1st place, it is initialized to 0, anyway. So up to 4gb of virtual memory, initialized to 0, would compress to 8 bytes -- as the initial memory allocation for all userland proggies. RLE is the only encoding that can compress to those kind of ratios, so I consider all other compression mechanisms pretty worthless -- unless they would happen to be built into the machine's hardware.
Each page needs to be handled separately so that you can access one page without worrying about any other pages (you don't want to decompress up to 4 GB of data just to access one page in the middle of it). This means that you'd be compressing pages, not larger things.

If you've got huge amounts of virtual memory that's initialized to zero, then you can use allocation on demand and those "initialized to zero" pages cost nothing (not 4 bytes or more per page).

So, how well does RLE work on 4 KB pieces of data that aren't still initialized to zero?
bewing wrote:But yes, I'm planning on keeping the system usage of my OS so small that nothing ever needs to be paged out anyway. Death to bloatware!
I've got an 80486 here with 8 MB of RAM. I want to open all of the PDF documents that make up Intel's CPU Manual and "alt+tab" between the PDF documents. There are 7 documents that add up to 16.9 MB. Exactly how small would your OS need to be (and how good would your compression need to be) for this to work?

If you've got a computer with 256 MB of RAM and all of the software running only needs 200 MB of RAM, then you don't need swap space. But, if there's 100 MB of rarely used pages why not send them to swap space and then use 156 MB of RAM for file system caches? It doesn't make sense to waste actual RAM for things that are rarely used when you could use that RAM to improve the performance of things that are used....


Cheers,

Brendan

Posted: Mon Jun 25, 2007 12:17 am
by bewing
Heh. Thanks for the replies! Including this as part of my VMM is just a stray, vague idea at this point. I have no idea if I'll even want to try implementing it someday.

But basically, the point is: start with the traditional process of needing a free page, and therefore forcing a page swap out to disk -- which, as we all know, is damned slow to recover. It also puts some load on the disk bus -- which might be doing more important things.

This only really matters if the response time of a particular process is important, of course. When the swapped out page is finally needed again, the process that owned it will simply hang (for a long time) until it's back.

But let's say that userspace memory is not being completely utilized. The parts that are being used have some sparse arrays in them, or were initted by a program to all 0xffffffff's or something. Wouldn't it be kinda neat if the VMM pager could quickly check whether some memory is easily compressible? In 100M of space, maybe it finds 8K of contiguous memory that it can compress out -- in a process, say, that just got initialized and then task switched out. There are 2 pages free! Suddenly no disk swap is needed -- at the expense of doing an 8K compression/decompression. And if you are talking about an RLE compression, that means a few thousand opcodes. (And, yes, I DO realize that scanning 100M of memory every so often to find compressible stuff is going to be rather "expensive".)
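A cheap pre-check for that kind of scan might just look for all-zero pages before attempting any real compression (a minimal sketch; page_is_all_zero() is a made-up helper):

Code:
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* A page of all-zero words is trivially compressible; anything else would
 * need a real compression pass to know whether it's worth the effort. */
bool page_is_all_zero(const uint64_t *page)
{
    for (size_t i = 0; i < PAGE_SIZE / sizeof(uint64_t); i++)
        if (page[i] != 0)
            return false;
    return true;
}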
Candy wrote:What about the worst case? A program de-encrypting a compressed object in memory?
Correct. The parts of memory that are being actively used will almost certainly not be usefully compressible. This is meant as a marginal method, to free up a few pages of memory when you really need them.
os64dev wrote:RE: bloatware ... Mostly it is not the OS parts that are huge, but the programs running on it.
Mostly, but WinXP uses what, 100MB? More? It puts itself in virtual space, and pages itself out, and wastes a million years of cpu cycles doing it.
os64dev wrote:You as a developer will probably prefer the latter for releases, but software companies or businesses in general will prefer the bloated version.
Neither the OS developer, nor the software companies, own the machine that is running the specified software. If a program can run in either 2MB with tracing, or 200K without -- the superuser of the machine should be able to specify which will happen. Not the OSdever, or the software company that doesn't give a sh*t about the performance of the user's machine.
JAAman wrote:most OSs don't actually allocate memory pages when they're requested, they allocate the pages (on demand)
Yes, that part of my vague concept is intended as a replacement for on-demand paging. Instead of fully setting up the processes' page tables, pointing to unallocated memory pages -- the page tables themselves would only be set up on demand, out of the processes' pool of RLE compressed memory.
Brendan wrote:This means that you'd be compressing pages, not larger things.
Well, I'm not sure I agree with that. If you have a giant pool of compressed space, and you mess up a page in the middle of it, then you have compressed space up to there, and a messed up actual allocated memory page, and compressed space after. You certainly have to be able to handle the concept of multiple compressed spaces with allocated pages in the gaps, but it seems theoretically possible.
Brendan wrote:I've got an 80486 here with 8 MB of RAM ....
I'm requiring a 586 with a minimum of 16M, so you're outta luck on my OS. :wink:
But I was not suggesting this as a complete replacement for paging out to disk ... just as a mechanism to minimize disk swapping.
And if you're gonna be so silly as to overflow your machine's capacity to that extent, then you deserve all the lack of responsiveness that you get! :lol:

Posted: Mon Jun 25, 2007 7:36 am
by JAAman
Mostly, but WinXP uses what, 100MB? More? It puts itself in virtual space, and pages itself out, and wastes a million years of cpu cycles doing it.
no, winXP takes about 35MB, and yes it does page out most of it (it is extremely rare for more than about 8MB to actually be used during the entire time the computer is running)

it doesnt waste cpu cycles doing it, it significantly improves performance of the system by doing it
But basically, the point is: start with the traditional process of needing a free page, and therefore forcing a page swap out to disk -- which, as we all know, is damned slow to recover. It also puts some load on the disk bus -- which might be doing more important things.
No, it doesn't -- normally an OS will keep free memory at all times, so if you're running low on memory, you will page out unused portions long before you actually run out of memory, and the page-swap can be scheduled for when it's most efficient to do so

as for recovering, if you are constantly needing to recover memory from disk, then your virtual memory manager is probably very poorly designed (only pages which are not used should be swapped out -- unless you are completely out of memory, in which case compression isnt going to help either)
Yes, that part of my vague concept is intended as a replacement for on-demand paging. Instead of fully setting up the processes' page tables, pointing to unallocated memory pages -- the page tables themselves would only be set up on demand, out of the processes' pool of RLE compressed memory.
that is how most OSs do it (well, it isnt compressed, but they dont allocate page tables until they are needed -- the page table pointers are marked as not present/unallocated, until the application requests allocation which will use that page table, at which point the table is assigned a physical page out of the zero-page-pool)
And if you're gonna be so silly as to overflow your machine's capacity to that extent, then you deserve all the lack of responsiveness that you get! Laughing
thats just the point -- disk paging, does not become unresponsive unless you go way beyond the machines capacity -- a few years ago, before i increased my RAM from 128MB to 512MB, i had some really bad performance -- but i only had 128MB of RAM and firefox was taking about 500MB -- no 'compression' will ever help that
(And, yes, I DO realize that scanning 100M of memory every so often to find compressible stuff is going to be rather "expensive".)
probably much more expensive than a decent VMM -- most of the 'problems' you mention are simple to avoid

Posted: Mon Jun 25, 2007 7:43 am
by AJ
A while ago, I heard about the advent of 'hybrid' hard drives - which have a small amount of flash memory (or some other solid state memory) as additional buffer space - it would be interesting to see how that improves performance (if at all..), but I haven't heard much about them for a while.

Adam

Posted: Mon Jun 25, 2007 8:46 am
by JAAman
The most important thing about it isn't speed (flash memory isn't very good for speed anyway -- that's what the on-disk cache is for -- 16MB on most disks right now) but battery life in mobile systems -- the disk (especially faster disks) is very expensive on battery life, but if the most-used areas (including boot areas) are in flash memory, then the disk can stay powered down more often, allowing faster disks at lower power usage

It is theoretically possible to see a performance improvement in boot times, as loading the OS from flash could be a little faster than spinning up the disk, but at this point we are seeing very little improvement in either performance or power usage from this technology. It's still quite new, and it will take time to learn to fully utilize this feature (which should start appearing in notebooks within the next few months -- Windows Vista supports this natively)

Posted: Mon Jun 25, 2007 10:54 am
by bewing
JAAman wrote:it doesnt waste cpu cycles doing it, it significantly improves performance of the system by doing it
Until the system starts running out of RAM and swapping heavily. I disagree with you. System performance degrades rapidly just after the machine overflows its physical memory and begins serious paging. As you noted, you need to maintain approximately the same amount of physical RAM as your apps are using, or your machine bogs. If you do the experiment, I suspect you will see marked degradation by the time you've overflowed by a factor of 2.
JAAman wrote:(only pages which are not used should be swapped out -- unless you are completely out of memory, in which case compression isnt going to help either)
I suspect your overhead of keeping careful track of which pages are used often, and which are not, will more than offset the time it takes me to find compressible memory areas.
JAAman wrote:the page table pointers are marked as not present/unallocated, until the application requests allocation which will use that page table, at which point the table is assigned a physical page out of the zero-page-pool
As a userland app developer, unless I am absolutely certain that every single OS that I'm designing my app for will initialize my bss, stack, AND mallocs to zero -- the very very 1st thing I do is initialize everything to zero myself. It is generally considered "good programming practice." If we assume that almost all programmers do this (and I suspect they do), then every single page allocated by an application is always immediately written to by the app, with zeroes, by every app on the system, as soon as the app is run. Which means that "on demand allocation of pages" is a worthless joke. However, my system will recompress it all back down, if it feels the need. Yours will simply page out multiple pages of unused zeroes to the disk.
JAAman wrote: -- but i only had 128MB of RAM and firefox was taking about 500MB -- no 'compression' will ever help that
Agreed. My little scheme might help when you have 128, and userspace apps are using 200. Once you are heavily into the disk swapping realm anyway, the compression scheme would be turned off, most likely.
Brendan wrote:Each page needs to be handled separately ...
After thinking about it some more, you are completely right about this. I don't want to have the VMM have to trace "contiguous" memory from a userapp's point of view to do the compression, and I obviously do not want to compress more than 4K in kernelspace when userapp pages can (of course) move around.

Posted: Mon Jun 25, 2007 1:26 pm
by frank
AJ wrote:A while ago, I heard about the advent of 'hybrid' hard drives - which have a small amount of flash memory (or some other solid state memory) as additional buffer space - it would be interesting to see how that improves performance (if at all..), but I haven't heard much about them for a while.

Adam
Now they have completely solid state drives

Samsung announces 64gb Solid State Drives
16GB 2.5-inch IDE Solid State Drive

Posted: Mon Jun 25, 2007 2:31 pm
by Brendan
Hi,
bewing wrote:
JAAman wrote:(only pages which are not used should be swapped out -- unless you are completely out of memory, in which case compression isnt going to help either)
I suspect your overhead of keeping careful track of which pages are used often, and which are not, will more than offset the time it takes me to find compressible memory areas.
I doubt that. To determine if pages are compressible or not you need to actually look at the contents of each page, which would involve loading huge amounts of data from RAM.

To determine if a page has been accessed you only need to look at the page table entry (there are 2 flags that the CPU sets for you, one called "accessed" that's set if the CPU reads or writes to the page, and another called "dirty" that's only set if the CPU writes to the page). There are similar flags in page directory entries too, so that if the flags in the page directory entry haven't been set you don't need to check the page table entries.
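For x86, those flags are bits 5 and 6 of a page table entry; a minimal sketch of reading them (helper names are made up):

Code:
#include <stdbool.h>
#include <stdint.h>

/* x86 page table entry flag bits (32-bit paging). */
#define PTE_ACCESSED 0x20u   /* set by the CPU on any read or write */
#define PTE_DIRTY    0x40u   /* set by the CPU on a write */

static bool pte_was_accessed(uint32_t pte) { return (pte & PTE_ACCESSED) != 0; }
static bool pte_was_written(uint32_t pte)  { return (pte & PTE_DIRTY) != 0; }

/* Note: after clearing these flags in the entry, the TLB entry for that page
 * should be invalidated (invlpg) so the CPU sets them again on the next access. */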

Of course a well designed system that uses compression would use the "accessed" and "dirty" flags the same as a normal system would, and then (try to) compress pages that aren't being used (and avoid compressing pages that are being used).
bewing wrote:
JAAman wrote:the page table pointers are marked as not present/unallocated, until the application requests allocation which will use that page table, at which point the table is assigned a physical page out of the zero-page-pool
As a userland app developer, unless I am absolutely certain that every single OS that I'm designing my app for will initialize my bss, stack, AND mallocs to zero -- the very very 1st thing I do is initialize everything to zero myself. It is generally considered "good programming practice."
I doubt that - I'd generally call it paranoia. The .bss will be filled with zero by the standard library's startup code if the underlying OS doesn't guarantee that it's already filled with zero. The stack doesn't need to be filled with zero at all (as long as what you push on the stack can be popped off again, nothing else matters).

For "malloc()" I'd recommend having a look at "calloc()" if you actually do need the allocated space to be filled with zero. However, it's very rare that dynamically allocated space does need to be filled with zero. Typically you allocate the space for a reason and then initialize a structure in it, or copy data into it, etc. I honestly can't remember ever using "calloc()" (or needing a dynamically allocated area to be filled with zero).
bewing wrote:If we assume that almost all programmers do this (and I suspect they do), then every single page allocated by an application is always immediately written to by the app, with zeroes, by every app on the system, as soon as the app is run.
No. Not just because you're the only paranoid application programmer, but also because application programmers typically have no idea what they've actually allocated or when.

Application programmers tend to use the heap manager (e.g. "malloc()", "calloc()" and "free()", or perhaps "new()" and "delete()"), and have no idea when the heap manager allocates pages and when it doesn't. For example, you might use "malloc()" to allocate 2048 bytes, and the heap manager might take a look and ask the OS for another 64 KB of RAM, then give you 2048 bytes of it. That leaves 62 KB of RAM that the application programmer doesn't even know was allocated, that the heap manager hasn't used yet. This is actually fairly common (heap managers do it so they don't need to call the kernel every time anyone allocates some heap space, which improves performance by reducing CPL=3 -> CPL=0 -> CPL=3 context switches).
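A toy bump-style heap shows the effect (sys_alloc_pages() is a hypothetical kernel call and the slab size is arbitrary): one syscall grabs 64 KB, and later malloc() calls are satisfied from that slab without the application ever knowing the rest of it was allocated.

Code:
#include <stddef.h>
#include <stdint.h>

#define SLAB_SIZE (64 * 1024)

void *sys_alloc_pages(size_t bytes);   /* hypothetical kernel call */

static uint8_t *slab_cur, *slab_end;

void *toy_malloc(size_t size)
{
    size = (size + 15) & ~(size_t)15;              /* round up to 16 bytes */
    if (size > SLAB_SIZE)
        return sys_alloc_pages(size);              /* big requests go straight to the kernel */
    if (slab_cur == NULL || (size_t)(slab_end - slab_cur) < size) {
        slab_cur = sys_alloc_pages(SLAB_SIZE);     /* one syscall serves many malloc() calls */
        if (slab_cur == NULL)
            return NULL;
        slab_end = slab_cur + SLAB_SIZE;
    }
    void *p = slab_cur;
    slab_cur += size;
    return p;
}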
bewing wrote:
JAAman wrote: -- but i only had 128MB of RAM and firefox was taking about 500MB -- no 'compression' will ever help that
Agreed. My little scheme might help when you have 128, and userspace apps are using 200. Once you are heavily into the disk swapping realm anyway, the compression scheme would be turned off, most likely.
Hehe - no. You'd leave the compression scheme turned on and send pages of compressed data to hard disk (in addition to raw/uncompressable pages).


Cheers,

Brendan

Posted: Tue Jun 26, 2007 4:17 pm
by bewing
Brendan wrote:To determine if a page has been accessed you only need to look at the page table entry ...
Yes. I would certainly pay attention to accessed/dirty flags. But I was taking JAAman literally. He didn't say he was only checking whether pages were accessed. He said he was determining whether pages were accessed often. To calculate how often a page is used, relative to others, you need to gather per page statistics. You need to clear the accessed/dirty flags, and see how fast they get reset again. Multiple times. And do this for many/all pages. That was the overhead that I was referring to.
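That kind of bookkeeping is usually done with a simple "aging" pass over (a subset of) pages each sampling interval -- fold the accessed bit into a small history counter and clear it again (a sketch with made-up structures):

Code:
#include <stdint.h>

#define PTE_ACCESSED 0x20u

struct page_info {
    uint32_t *pte;      /* pointer to the page's page table entry */
    uint8_t   age;      /* higher value = used more recently/often */
};

void age_page(struct page_info *p)
{
    uint8_t used = (*p->pte & PTE_ACCESSED) ? 1 : 0;
    p->age = (uint8_t)((p->age >> 1) | (used << 7));   /* shift in this interval's bit */
    *p->pte &= ~PTE_ACCESSED;                          /* clear for the next interval */
    /* the TLB entry for this page must also be invalidated (invlpg) */
}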
Brendan wrote:I doubt that - I'd generally call it paranoia.
Hey! :wink:
Brendan wrote:The .bss will be filled with zero by the standard library's startup code ...
Please note that this still means that every single .bss "allocated" page just got accessed, and the on-demand page allocation system just had to assign actual compressible pages full of zeroes to the whole .bss space.
Brendan wrote:However, it's very rare that dynamically allocated space does need to be filled with zero. Typically you allocate the space for a reason and then initialize a structure in it ...
Structures are very inconvenient to initialize. You pretty much have to do them one at a time, even if you allocate a bunch of them. But I completely disagree with you in theory ... when you allocate an array (even dynamically) you almost always need to initialize it. And arrays are much more efficient than structures, and should be used much more often.

As discussed on other threads: for security purposes, a free()'ed memory page -- or a de-allocated code/data/.bss segment -- or a stack -- full of passwords (or any other supposedly secure info that JoeUser should not be able to access) needs to be zeroed out before being handed to a new process. And with my compression scheme this is even more important, as it makes the pages more compressible.

Posted: Tue Jun 26, 2007 5:52 pm
by Brendan
Hi,
bewing wrote:
Brendan wrote:To determine if a page has been accessed you only need to look at the page table entry ...
Yes. I would certainly pay attention to accessed/dirty flags. But I was taking JAAman literally. He didn't say he was only checking whether pages were accessed. He said he was determining whether pages were accessed often. To calculate how often a page is used, relative to others, you need to gather per page statistics. You need to clear the accessed/dirty flags, and see how fast they get reset again. Multiple times. And do this for many/all pages. That was the overhead that I was referring to.
Would you monitor page usage in the exact same way to determine if a page is being used often or not before compressing it?

If you don't monitor page usage you'd be continually compressing and decompressing used pages, which could waste more CPU time than monitoring page usage would. If you do monitor page usage, then your overhead will be similar (overhead for monitoring plus overhead for compression/decompression, rather than overhead for monitoring plus overhead for setting up disk transfers).
bewing wrote:
Brendan wrote:The .bss will be filled with zero by the standard library's startup code ...
Please note that this still means that every single .bss "allocated" page just got accessed, and the on-demand page allocation system just had to assign actual compressible pages full of zeroes to the whole .bss space.
You forgot the other half of my sentence: "...if the underlying OS doesn't guarantee that it's already filled with zero."

For example, an OS can guarantee that the .bss is filled with zeros by using allocation on demand for the pages used by the .bss. This means that the .bss would look like it's filled with zeros even though no RAM is actually allocated for it, and the standard library's startup code wouldn't fill the .bss with zeros.

The alternative is that the OS doesn't guarantee that the .bss is filled with zeros, which gives crappy performance and wasted RAM (as the standard library's startup code would need to fill it) and is also a security problem (a programmer could write their own startup code that scans the trash left in the .bss by the OS for passwords, etc).
bewing wrote:
Brendan wrote:However, it's very rare that dynamically allocated space does need to be filled with zero. Typically you allocate the space for a reason and then initialize a structure in it ...
Structures are very inconvenient to initialize. You pretty much have to do them one at a time, even if you allocate a bunch of them. But I completely disagree with you in theory ... when you allocate an array (even dynamically) you almost always need to initialize it. And arrays are much more efficient than structures, and should be used much more often.
That depends how you use the array. If it's an array of bytes used for a buffer or something, then you never initialize it (you just load data into it and keep track of how much data has been loaded in). If it's an array of structures where the structures are used in order it's the same thing (you have a variable that keeps track of how many entries are in use, where entries that aren't in use aren't initialized and don't need to be filled with zero). If it's an array of structures that aren't used in order, then you have some sort of "used/unused" field in the structure and this field does need to be initialized to "unused" (but not necessarily set to zero) while the other fields are left uninitialized (and not filled with zero).
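A small sketch of the "used in order" case: only a count says how much of the array is meaningful, so the rest never needs to be zeroed (names are illustrative only):

Code:
#include <stddef.h>
#include <stdlib.h>

struct record { int key; int value; };

struct record_list {
    struct record *items;   /* entries beyond `count` are uninitialized */
    size_t count;
    size_t capacity;
};

int record_list_init(struct record_list *l, size_t capacity)
{
    l->items = malloc(capacity * sizeof *l->items);  /* no memset needed */
    l->count = 0;
    l->capacity = capacity;
    return l->items ? 0 : -1;
}

int record_list_push(struct record_list *l, struct record r)
{
    if (l->count == l->capacity)
        return -1;
    l->items[l->count++] = r;    /* entries below count are always initialized */
    return 0;
}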

If you can show me some code that allocates *anything* that does need to be filled with zero, then I'll show you some code that should use "calloc()". Of course depending on how it's implemented, "calloc()" could return a pointer to "allocate on demand" page/s that don't have any RAM at all.
bewing wrote:As discussed on other threads: for security purposes, a free()'ed memory page -- or a de-allocated code/data/.bss segment -- or a stack -- full of passwords (or any other supposedly secure info that JoeUser should not be able to access) needs to be zeroed out before being handed to a new process. And with my compression scheme this is even more important, as it makes the pages more compressible.
The only time you'd compress a page is when it contains non-zero data.

Think about allocation on demand for a minute or so...

Let's say you've got one physical page full of zeros that is never modified, and a process has a 4 MB .bss section. You map this same page into the .bss section 1024 times as "read-only", so that the .bss section looks like it's full of zeros and you don't allocate any RAM at all. Then, when the process tries to modify something in its .bss section it causes a page fault, the page fault handler allocates a page of RAM and maps it where the page fault occurred, then lets the application complete its write to that page. The allocated page contains zeros for about 20 cycles before it's modified.
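A rough sketch of that scheme, using hypothetical helpers (alloc_phys_page(), map_page(), zero_page_phys) rather than any specific kernel's API:

Code:
#include <stdint.h>

#define PAGE_SIZE    4096
#define PTE_PRESENT  0x01u
#define PTE_WRITABLE 0x02u
#define PTE_USER     0x04u

extern uintptr_t zero_page_phys;     /* the shared, read-only, all-zero frame */
uintptr_t alloc_phys_page(void);     /* hypothetical: returns a zero-filled frame */
void map_page(uintptr_t virt, uintptr_t phys, uint32_t flags);  /* hypothetical */

/* Map a .bss region: every page points at the same zero page, read-only. */
void map_bss_on_demand(uintptr_t start, uintptr_t end)
{
    for (uintptr_t v = start; v < end; v += PAGE_SIZE)
        map_page(v, zero_page_phys, PTE_PRESENT | PTE_USER);   /* not writable */
}

/* On a write fault to one of those pages, give it a private zeroed frame. */
void handle_zero_page_write_fault(uintptr_t fault_addr)
{
    uintptr_t v = fault_addr & ~(uintptr_t)(PAGE_SIZE - 1);
    uintptr_t p = alloc_phys_page();                 /* already zero-filled */
    map_page(v, p, PTE_PRESENT | PTE_WRITABLE | PTE_USER);
    /* invlpg(v) would be needed here to flush the stale TLB entry */
}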

Now consider compression. If you find a page full of zeros you do the reverse - free the page and map the "one physical page full of zeros that is never modified" in its place as "read-only". This is like having an infinite compression ratio - 4096 bytes reduced down to 0 bytes!

So, what exactly is left to compress? Pages that aren't full of zeros...


Cheers,

Brendan