Article on physical memory management
Posted: Sat Aug 13, 2011 7:22 am
Introduction
There are many tutorials on the net that teach you how to write a simple page frame allocator. Although there are many, most belong to one of two big schools: bitmap or stack. Fortunately there are other approaches as well; in this article I will show you a solution that works in a real-life kernel.
What's wrong with the two schools?
To answer that, we first have to define some terms.
Every algorithm can be measured in a way that makes it comparable to others in terms of scalability; that's called asymptotic behaviour. The worst case is when running time depends on the size of the input (n): that's theta of n, written Θ(n). In the best case the algorithm runs in a constant amount of time, or we can assume that n has a maximum (so we have a max(Θ(n)) bound): that's Θ(constant). Because I'm lazy, I'll write Θ(1), as 1 is a constant. Walking along a path in a tree, for example, is Θ(log(n)), as a binary tree of n elements is log(n) tall; that's a typical in-between value. So Θ(1) < Θ(log(n)) < Θ(n).
The other important thing is the memory footprint. This also scales, so we must know the bounds. The worst case is when the size of the administrative area depends on the amount of RAM and scales linearly: RR(n). The best case (I think you've already figured it out) is a constant, fixed-size area independent of the amount of RAM: RR(1).
Armed with these, let's compare what we have, shall we?
Code:
       | asymptotic      | resource
       | alloc  | free   | requirement (memory footprint)
-------+--------+--------+-------------------------------
ideal  | Θ(1)   | Θ(1)   | RR(1)
bitmap | Θ(n)   | Θ(n)   | RR(n)
stack  | Θ(1)   | Θ(1)   | RR(n)  (RR(bitmap) < RR(stack))

Using a bitmap to implement a page allocator is the worst case: it scales poorly and requires a large amount of data. The stack school scales very well, but requires even more memory. In an ideal world we would have good scalability and little memory used; in real life we have to make compromises. We would be happy if we could allocate pages fast, and we can let free run longer. Free should not scale with the amount of RAM, but with something considerably smaller.
Okay, I believe you. So what's the new school then?
There's really nothing new under the Sun. We just take some good old tricks and combine them to improve the algorithmic quality and decrease the memory requirement dramatically. Let's see how!
Memory representation
In a bitmap allocator each page has a bit that tells whether it's free or not. In the stack approach you have one entry for each free page, holding its address.
Both have the disadvantage that they scale poorly. But allocating from a stack is faster, so let's start with that. A stack is a FILO (First In Last Out) kind of memory; we'll use a FIFO (First In First Out) instead. This requires memory at changing addresses, so to handle that we'll store it in a circular buffer. The FIFO has the same methods (push and pop) as the stack, but pop eats the tail of the snake. If you do not know what a circular buffer is and how it works, you definitely do not meet the minimum requirements; read about the BIOS keyboard buffer at least.
We haven't reduced the required amount of memory yet. Imagine that the stack or FIFO holds not single elements but pairs of elements, so push and pop take two arguments. In a classic stack implementation every element is a linear address; here every even element is a frame address and every odd one is an amount.
By frame address I mean an index that gives the linear address in memory when multiplied by the frame size.
Code:
stack          | our fifo
---------------+------------------------------------------
push(address)  | push(address,amount)
pop()->address | pop()->(address,amount)
top()->address | top().base->address, top().size->amount
This way we do not store every possible free address as the stack school would, only the first address of each fragment, along with its size (the number of frames in the fragment).

Example: we have free pages at 12,13,14,22,23,24,25,26. In the stack model we would store "26,25,24,23,22,14,13,12" (8 words). In our model we store "(12,3),(22,5)" (only 4 words, that's 50% less!).
Good, I hope everything is clear so far.
We have defined our terms, stated our goals and chosen a representation, so let's see some code. I won't give you exact code; I'd rather use some kind of pseudo-code to describe the algorithms. If you understand the examples, it's easy to implement them in the language of your choice.
Initialization
I won't bore you with the details. The simplest way is to iterate over the E820 map and push(address,amount) for each record of type 1 (usable RAM). You also have to initialize the start and end pointers of the circular buffer first.
Code:
method init()
  empty buffer
  foreach E820 as mem
    if mem().type=1 then
      if mem().base>=end_of_kernel_in_memory then
        push(mem().base,mem().size)
Note that some of the free memory reported by E820 will be preallocated and used by your kernel; the condition in line 5 above takes care of that. Allocating a page that is in use by kernel code and clearing it out could be fatal. It's possible you have an upper half kernel; in that case you should check for

Code:
if mem().base+mem().size<start_of_kernel then

instead.

Not much surprise here.
Allocating
It's rather easy, only a bit more complicated than the stack's: we do the pop only in a certain case, otherwise we just modify the element at the bottom.
Code:
method alloc()
  bottom().size--
  if bottom().size=0 then
    (b,s)=pop()
  else
    b=bottom().base
    bottom().base++
  end if
  return b

First it decreases the amount of free memory in the bottom entry and, if that runs out, removes the entry from the FIFO. Otherwise it alters the base address to point to the next free frame. This algorithm is obviously Θ(1), since it contains no iteration. It can be implemented to be really fast, requiring only a few CPU cycles.
Freeing
This one is trickier: we must check whether the given address fits at the beginning or the end of an existing fragment. Since it's an iteration over the input, it's Θ(n). Notice that this is still good, because the input here is the number of fragments in the FIFO, not the amount of RAM. In other words we do not depend on the memory size, but on how fragmented it is. If we have a nice defragmenter running in the background, this should not be a problem.
Code:
method free(address)
  foreach bufferelements as this
    //bottom match
    if this().base-1=address then
      this().base--
      this().size++
      return OK
    //top match
    if this().base+this().size=address then
      this().size++
      return OK
  //no match, create a new pair
  push(address,1)
  //safety check
  if top().base!=address then
    //instead of reporting an error you could call the defragmenter here
    //to make room in the FIFO by merging blocks
    return ERROR
  return OK
Basically we do the opposite of what alloc does: we find a matching entry and increase the amount of free memory recorded in it. If no entry is found, we create a new one.

As I said before, this algorithm is Θ(m), where m is considerably less than n. But how much less is m?
Resource requirement
Let's assume x86 protected mode and say that the page size is 4096 bytes and the word size is 4 bytes. In this case a one-page circular buffer can hold 4096/(4+4)=512 fragments, which is more than enough. An iteration over at most 512 elements (since we know an upper bound) is considered to be Θ(1). Also, it requires only one page to hold all the information on the whole 4G address space (4k instead of several megabytes). So it's RR(1) and really efficient.
Final result
Code:
       | asymptotic      | resource
       | alloc  | free   | requirement (memory footprint)
-------+--------+--------+-------------------------------
ideal  | Θ(1)   | Θ(1)   | RR(1)
result | Θ(1)   | Θ(m)   | RR(1)  (m << n, footprint in the range of kilobytes)

I'd say that's not bad. We describe 4G within 4k, our allocation is terribly fast, and only freeing memory can take a while, but even that can be done in time.
Consideration
With 32-bit frame addresses we can describe physical memory of 2^(32+12)=2^44 bytes, so one 4k frame for the circular buffer is enough for 64-bit as well (enough for 16Tb of physical RAM). If you want more than 16Tb, use two frames.
The free algorithm can be sped up even more if you keep the entries in the FIFO ordered by base address. If you do so, you can find the matching fragment (or report no match) with a binary search, in Θ(log(m)).
I hope you found this article interesting and that it helps someone.