Hello all,
I've got an idea for a physical memory manager, which I will briefly introduce here. I wonder if there are any real downsides to this method...
Basically, the data structure in the information page is a tree with a fixed depth and size. The parent/child relationships are always the same, so they never need to be stored. Each node in the tree is associated with a page index (the actual page's address should probably be calculated as the information page's index, plus one, plus the node's index).
Each node has n child nodes, except for the leaf nodes, of course.
Now each node contains n+1 bits (n being the number of child nodes). Bit 0 indicates whether the page this node is associated with is available or not. Bits 1 through n are a bit mask indicating whether the corresponding child of the current node, or any of its descendants, has a free page.
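For n = 4 children, the per-node byte could be encoded like this (a sketch; the macro names are mine, not part of the original idea):

```c
/* One tree node in a single byte, for n = 4 children:
   bit 0     = the page associated with this node is free,
   bits 1..4 = child subtree i (i = 0..3) contains at least one free page. */
#define SELF_FREE      0x01u
#define CHILD_FREE(i)  (1u << ((i) + 1))

/* A node byte of 0 means neither its own page nor any descendant is free. */
static inline int node_has_free(unsigned char node)
{
    return node != 0;
}
```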
Finding a page for allocation would be done as follows:
1. If the first byte in the information page is 0, all pages are allocated and we move on to the next information page. (We could construct a similar second-order information page over these, so we don't have to loop through all of them serially, but I won't cover that for now.)
2. Otherwise, set index = 0.
3. If info[index] indicates that this page is available, we're done: allocate the page.
4. If not, find the first child that has an available page (either the page itself or its descendants), and set index to the index of this child.
5. Continue with step 3.
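The search in steps 1-5 might look like the following sketch. The byte layout (bit 0 = own page free, bits 1-4 = child subtree has a free page) and the implicit array layout (child c of node i at i*4 + c + 1) are my assumptions, not the poster's exact code, and the masks are assumed to be kept consistent:

```c
/* Sketch of the search in steps 1-5 (assumed layout, see lead-in). */
#define SELF_FREE 0x01u
#define CHILDREN  4u

/* Returns the index of a node whose page can be allocated, or -1. */
int find_free_node(const unsigned char *info)
{
    if (info[0] == 0)                  /* step 1: nothing free here */
        return -1;
    unsigned index = 0;                /* step 2 */
    for (;;) {
        if (info[index] & SELF_FREE)   /* step 3: this node's page is free */
            return (int)index;
        /* step 4: descend into the first child subtree with a free page */
        for (unsigned c = 0; c < CHILDREN; c++) {
            if (info[index] & (1u << (c + 1))) {
                index = index * CHILDREN + c + 1;
                break;
            }
        }
    }                                  /* step 5: repeat from step 3 */
}
```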
Actually allocating the page involves updating the allocated node's ancestors, clearing the child bit of any subtree that no longer has free pages. We're done as soon as we reach a node that still has free pages other than the one just allocated (worst case, the number of steps equals the depth of the tree).
Deallocating the page is simply the opposite of allocating it.
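Hedged sketch of that bookkeeping, under the same assumed layout (one byte per node, bit 0 = own page free, bits 1-4 = child subtree free, child c of node i at i*4 + c + 1):

```c
#define SELF_FREE 0x01u
#define CHILDREN  4u

/* Allocate: clear the node's own bit, then walk up clearing the "subtree
   has free pages" bit while the subtree has become empty.
   Worst case: one step per tree level, as noted above. */
void mark_allocated(unsigned char *info, unsigned index)
{
    info[index] &= (unsigned char)~SELF_FREE;
    while (index != 0 && info[index] == 0) {
        unsigned parent = (index - 1) / CHILDREN;
        unsigned slot   = (index - 1) % CHILDREN;  /* which child we are */
        info[parent] &= (unsigned char)~(1u << (slot + 1));
        index = parent;
    }
}

/* Free: the opposite walk, stopping once an ancestor already knows
   its subtree has free pages. */
void mark_free(unsigned char *info, unsigned index)
{
    info[index] |= SELF_FREE;
    while (index != 0) {
        unsigned parent = (index - 1) / CHILDREN;
        unsigned char bit = (unsigned char)(1u << ((index - 1) % CHILDREN + 1));
        if (info[parent] & bit)
            break;                     /* ancestors already up to date */
        info[parent] |= bit;
        index = parent;
    }
}
```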
Allocating a page at a specific address is easy: find the proper index and test if available. If not, you can even test its siblings to see if they are available.
All in all, it sounds like a fairly easy scheme, and very efficient. The parameters can be tweaked, but for example one can take:
- 4 children per node.
- 1 byte per node (4 + 1 bits, rounded up to a byte for speed).
- 4096 bytes per page, so 4096 pages per information page.
- Tree depth of 6 (4^6 = 4096)
This would involve one page every 4096 pages (0.02% overhead). Allocating or deallocating a page would have a worst-case scenario of 6 memory reads/writes.
My questions:
- Are there any problems with this technique I have to be careful about?
- Is this technique already known? (Unless it has a flaw, I can't imagine it not being known.)
- Would you implement this scheme in an OS? Or would you favor, for instance, stack-based memory management?
Re: Idea for physical memory management
berkus wrote:Have you considered concurrent access/locking?
Hmmm, I believe most implementations require it to be a critical section, right? Especially the stack-based implementation (as Windows does it)... As it's part of the kernel, I can't imagine that being too unexpected, but I might be wrong here.
Re: Idea for physical memory management
Hi,
evoex wrote:I've got an idea for a physical memory manager, which I will briefly introduce here. I wonder if there are any real downsides to this method...
It didn't make sense to me. Because each node's children are represented by one bit and not one pointer, I fail to see how you're going to find a node's children. The only way that makes sense would be to have a set of bitmaps (explained below), but in that case the "1 byte per node. (4 + 1, rounded up for optimization)" part doesn't make any sense.
For a set of bitmaps (with 6 levels), you might have:
- One bitmap where each bit represents one page (cost is 32 KiB per GiB of RAM)
- One bitmap where each bit represents a group of 4 pages (cost is 8 KiB per GiB of RAM)
- One bitmap where each bit represents a group of 4 groups of 4 pages, or 16 pages (cost is 2 KiB per GiB of RAM)
- One bitmap where each bit represents a group of 4 groups of 16 pages, or 64 pages (cost is 512 bytes per GiB of RAM)
- One bitmap where each bit represents a group of 4 groups of 64 pages, or 256 pages (cost is 128 bytes per GiB of RAM)
- One bitmap where each bit represents a group of 4 groups of 256 pages, or 1024 pages (cost is 32 bytes per GiB of RAM)
This could be improved by increasing the "nodes per group" and reducing the number of bitmaps. For example:
- One bitmap where each bit represents one page (cost is 32 KiB per GiB of RAM)
- One bitmap where each bit represents a group of 32 pages (cost is 1 KiB per GiB of RAM)
- One bitmap where each bit represents a group of 32 groups of 32 pages, or 1024 pages (cost is 32 bytes per GiB of RAM)
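A lookup over this two-level "set of bitmaps" could be sketched as follows. This assumes 4096 pages, 32-bit words, a consistent pair of bitmaps (a set top-level bit implies the group word below is non-zero), and uses the GCC/Clang `__builtin_ctz` builtin to find the lowest set bit; all names are mine:

```c
#include <stdint.h>

#define PAGES  4096u
#define GROUPS (PAGES / 32u)           /* 128 groups of 32 pages */

/* free_group: one bit per group of 32 pages (4 words total);
   free_page:  one bit per page (128 words total).
   Returns a free page number, or -1 if none. */
int find_free_page(const uint32_t *free_group, const uint32_t *free_page)
{
    /* Scan the small top-level bitmap one 32-bit word at a time. */
    for (unsigned w = 0; w < GROUPS / 32u; w++) {
        if (free_group[w] == 0)
            continue;                  /* 1024 pages skipped with one test */
        unsigned g = w * 32u + (unsigned)__builtin_ctz(free_group[w]);
        return (int)(g * 32u + (unsigned)__builtin_ctz(free_page[g]));
    }
    return -1;                         /* no free pages at all */
}
```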
For concurrency/locking, the best I can come up with (without thinking too hard) is having an extra "lock bit" for each bit in the highest level. For example:
- One bitmap where each bit represents one page (cost is 32 KiB per GiB of RAM)
- One bitmap where each bit represents a group of 32 pages (cost is 1 KiB per GiB of RAM)
- One bitmap where each even bit represents a group of 32 groups of 32 pages, or 1024 pages; and each odd bit is a lock that protects the corresponding even bit in this bitmap and all of the corresponding bits in the lower level bitmaps (cost is 64 bytes per GiB of RAM)
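The even-bit/odd-bit scheme could acquire a region's lock with an atomic fetch-OR, e.g. using C11 atomics (a sketch of the locking part only; the function names and the "one 32-bit word, regions 0..15" framing are my assumptions):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Each 1024-page region owns two adjacent bits in one 32-bit word of the
   top-level bitmap: the even bit means "region has free pages", the odd
   bit is the lock. `region` is 0..15 within this word. */
static void lock_region(_Atomic uint32_t *top, unsigned region)
{
    uint32_t lock_bit = 1u << (region * 2u + 1u);
    /* atomic fetch-OR returns the previous value: if the lock bit was
       already set, someone else holds the lock, so spin and retry. */
    while (atomic_fetch_or(top, lock_bit) & lock_bit)
        ;
}

static void unlock_region(_Atomic uint32_t *top, unsigned region)
{
    atomic_fetch_and(top, ~(1u << (region * 2u + 1u)));
}
```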
evoex wrote:- Would you implement this scheme in an OS? Or would you favor for instance stack based memory management.
For me, almost all allocations/de-allocations are a single page and nobody cares what the physical address is; there, stack-based memory management is faster, has less overhead, and is easier for locking (e.g. one lock per stack).
For the rare cases where you need to allocate physically contiguous pages or need to care what the physical address is, something like this probably isn't worth the hassle - a single bitmap that only covers the first 16 MiB of the physical address space is enough (especially if you keep track of lowest/highest "possibly free page" to reduce search times).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Idea for physical memory management
Hi,
berkus wrote:Have you considered concurrent access/locking?
evoex wrote:Hmmm I believe most implementations require it to be a critical section, right? Especially the stack based implementation (as windows does it)... As it's part of the kernel, I can't imagine that to be too unexpected, but I might be wrong here.
One lock means all the CPUs potentially fighting for the same lock and getting no useful work done. For "many CPUs" you need many locks or no locks (lockless algorithms).
I tried to think of a lockless way of doing the "set of bitmaps" thing, and it is possible, but only if you never allocate/free more than one page at a time.
Cheers,
Brendan
Re: Idea for physical memory management
So much for von Neumann; how about a Harvard example?
I personally use a flat memory model, so I'm going to have trouble on the x86 with its memory-mapped I/O in the 1 MB range.
Re: Idea for physical memory management
Thanks guys!
Yeah Brendan, your method actually seems to be better in every way ... Mine probably has a slightly better average case number of reads/writes, but that's not going to be worth it one bit (no pun intended).
Edit: come to think of it, even the average case isn't better in my idea... So I guess it's completely useless.
Re: Idea for physical memory management
Brendan wrote:Because each node's children are represented by one bit and not one pointer, I fail to see how you're going to find a node's children.
Ever heard of the heap stored as an array: left child = 2 * current + 1, right child = 2 * current + 2, parent = (current - 1) >> 1.
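For reference, here is a sketch of that array embedding, generalized from 2 to 4 children per node (the quoted binary formulas are the K = 2 case; the function names are mine):

```c
/* Implicit K-ary heap indexing: no pointers, pure arithmetic. */
#define K 4u

static inline unsigned kchild(unsigned node, unsigned c)  /* c in 0..K-1 */
{
    return node * K + c + 1u;
}

static inline unsigned kparent(unsigned node)             /* node > 0 */
{
    return (node - 1u) / K;
}
```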
Re: Idea for physical memory management
Hi,
Brendan wrote:Because each node's children are represented by one bit and not one pointer, I fail to see how you're going to find a node's children.
Combuster wrote:Ever heard of the heap stored as an array: left child = 2 * current + 1, right child = 2 * current + 2, parent = (current - 1) >> 1.
Doh - you're right. That would work (and can be extended to 4 children per node), and would have slightly better cache locality than the "sets of bitmaps" approach. I'm not sure if it's worth the extra calculations or the extra complexity (maintenance hassle) though.
Cheers,
Brendan
Re: Idea for physical memory management
Brendan wrote:One lock means all the CPUs potentially fighting for the same lock and getting no useful work done. For "many CPUs" you need many locks or no locks (lockless algorithms).
I tried to think of a lockless way of doing the "set of bitmaps" thing, and it is possible, but only if you never allocate/free more than one page at a time.
What I'm doing is each CPU gets a private stack it can use to satisfy page alloc/free requests; the only constraint is that the scheduler must be disabled for access. When the local stack fills/empties, it then goes to the common, locked stack. Assuming there's a mix of allocations and frees on an individual processor, it hits the lock infrequently. Of course, this means you can actually run out of memory while valid free memory is held invisibly, but you can use IPIs in that case to force flushes.
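The per-CPU stack idea above could be sketched like this. The capacity and names are illustrative, and the interesting parts (refilling from / flushing to the shared locked stack, disabling the scheduler) are left out:

```c
#include <stddef.h>
#include <stdint.h>

#define LOCAL_CAP 64  /* illustrative capacity of one CPU's private stack */

struct percpu_stack {
    uintptr_t page[LOCAL_CAP];
    size_t    top;    /* number of free pages currently held */
};

/* Pop a free page; returns 0 if the local stack is empty and the caller
   must refill from the common, locked stack. */
uintptr_t local_alloc(struct percpu_stack *s)
{
    if (s->top == 0)
        return 0;
    return s->page[--s->top];
}

/* Push a freed page; returns -1 if full, meaning the caller should flush
   some pages to the common, locked stack. */
int local_free(struct percpu_stack *s, uintptr_t page)
{
    if (s->top == LOCAL_CAP)
        return -1;
    s->page[s->top++] = page;
    return 0;
}
```

Because only the owning CPU touches its private stack (with the scheduler disabled), these paths need no lock at all; the shared stack's lock is only hit on refill/flush.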