Article on physical memory management
Posted: Sat Aug 13, 2011 7:22 am
Introduction
There are many tutorials on the net that teach you how to write a simple page frame allocator. Although there are many, most belong to one of two big schools: bitmap or stack. Fortunately there are other approaches as well; in this article I will show you a solution that works in a real-life kernel.
What's wrong with the two schools?
To answer that, we first have to define some terms.
Every algorithm can be measured in a way that makes it comparable to others in terms of scalability; that's called asymptotic behaviour. The worst case is when running time depends on the size of the input (n): that's theta of n, written Θ(n). In the best case the algorithm runs in a constant amount of time, or we can assume that n has a maximum (so we have a max(Θ(n)) bound): that's Θ(constant). Because I'm lazy, I'll write Θ(1), as 1 is a constant. Walking along a path in a tree, for example, is Θ(log(n)), as a binary tree of n elements is log(n) tall; that's a typical in-between value. So Θ(1) < Θ(log(n)) < Θ(n).
The other important thing is the memory footprint. This also scales, so we must know the bounds. The worst case is when the size of the administrative area depends on the amount of RAM and scales linearly: RR(n). The best case (I think you've already figured it out) is a constant, fixed-size area independent of the amount of RAM: RR(1).
Armed with these, let's compare what we have, shall we?
Code:
       | asymptotic      | resource
       | alloc  | free   | requirement (memory footprint)
-------+--------+--------+-------------------------------
ideal  | Θ(1)   | Θ(1)   | RR(1)
bitmap | Θ(n)   | Θ(n)   | RR(n)
stack  | Θ(1)   | Θ(1)   | RR(n)  (RR(bitmap) < RR(stack))

Using a bitmap to implement a page allocator is the worst case: it scales poorly and requires a large amount of data. The stack school scales very well, but requires even more memory. In an ideal world we would have good scalability and little memory used; in real life we have to make compromises. We would be happy if we could allocate pages fast, and we can let free run longer. Free should not scale with the amount of RAM, but with something considerably smaller.
Okay, I believe you. So what's the new school then?
There's really nothing new under the Sun. We just take some good old tricks and combine them to improve the algorithmic quality and decrease the memory requirement dramatically. Let's see how!
Memory representation
In a bitmap allocator each page has a bit that tells whether it's free or not. In the stack approach you have one entry for each free page, holding its address.
Both have the disadvantage that they scale poorly. But allocating from a stack is faster, so let's start with that. A stack is a FILO (First In Last Out) kind of memory; we'll use a FIFO (First In First Out) instead. This requires memory at changing addresses, so to handle that we'll store it in a circular buffer. The FIFO has the same methods (push and pop) as the stack, but pop eats the tail of the snake. If you do not know what a circular buffer is and how it works, you definitely do not meet the minimum requirements; read about the BIOS keyboard buffer at least.
We haven't reduced the required amount of memory yet. Imagine that the stack or FIFO holds not single elements but pairs of elements, so push and pop take two arguments. In a classic stack implementation every element is a linear address; here every even element is a frame address and every odd one is an amount.
By frame address I mean an index that gives the linear address in memory when multiplied by the frame size.
Code:
stack          | our fifo
---------------+------------------------------------------
push(address)  | push(address,amount)
pop()->address | pop()->(address,amount)
top()->address | top().base->address, top().size->amount
This way we do not store every possible free address as the stack school would, only the first address of each fragment, along with its size (the number of frames in the fragment).

Example: we have free pages at 12,13,14,22,23,24,25,26. In the stack model we would store "26,25,24,23,22,14,13,12" (8 words). In our model we store "(12,3),(22,5)" (only 4 words, that's 50% less!).
Good, I hope everything is clear so far.
We have defined our terms, stated our goals and chosen a representation, so let's see some code. I won't give you exact code; I'd rather use some kind of pseudo-code to describe the algorithms. If you understand the examples, it's easy to implement them in the language of your choice.
Initialization
I won't bore you with the details. The simplest way is to iterate over the E820 map and push(address,amount) for each record of type 1 (usable RAM). You also have to initialize the start and end pointers of the circular buffer first.
Code:
method init()
  empty buffer
  foreach E820 as mem
    if mem().type=1 then
      if mem().base>=end_of_kernel_in_memory then
        push(mem().base,mem().size)
Note that some of the free memory reported by E820 will be preallocated and used by your kernel; the condition in line 5 above takes care of that. Allocating a page that is in use by kernel code and clearing it out could be fatal. It's possible you have an upper half kernel; in that case you should check for

Code:
if mem().base+mem().size<start_of_kernel then

instead.

Not much surprise here.
Allocating
It's rather easy, only a bit more complicated than the stack's: we do the pop only in a certain case, otherwise we just modify the element at the bottom.
Code:
method alloc()
  bottom().size--
  if bottom().size=0 then
    (b,s)=pop()
  else
    b=bottom().base
    bottom().base++
  end if
  return b

First it decreases the amount of free memory in the bottom entry and, if that runs out, removes the entry from the FIFO. Otherwise it alters the base address to point to the next free frame. This algorithm is obviously Θ(1), since it contains no iteration. It can be implemented to be really fast, requiring only a few CPU cycles.
Freeing
This one is trickier: we must check whether the given address fits at the beginning or the end of an existing fragment. Since it's an iteration over the input, it's Θ(n). Notice that this is still good, because the input here is the number of fragments in the FIFO, not the amount of RAM. In other words we do not depend on the memory size, but on how fragmented it is. If we have a nice defragmenter running in the background, this should not be a problem.
Code:
method free(address)
  foreach bufferelements as this
    //bottom match
    if this().base-1=address then
      this().base--
      this().size++
      return OK
    //top match
    if this().base+this().size=address then
      this().size++
      return OK
  //no match, create a new pair
  push(address,1)
  //safety check
  if top().base!=address then
    //instead of reporting an error you could call the defragmenter here
    //to make room in the FIFO by merging blocks
    return ERROR
  return OK
Basically we do the opposite of what alloc does: we find a matching entry and increase the amount of free memory recorded in it. If no entry is found, we create a new one.

As I said before, this algorithm is Θ(m), where m is considerably less than n. But how much less is m?
Resource requirement
Let's assume x86 protected mode and say that the page size is 4096 bytes and the word size is 4 bytes. In this case a one-page circular buffer can hold 4096/(4+4)=512 fragments, which is more than enough. An iteration over at most 512 elements (since we know an upper bound) is considered to be Θ(1). Also, it requires only one page to hold all the information on the whole 4G address space (4k instead of several megabytes). So it's RR(1) and really efficient.
Final result
Code:
       | asymptotic      | resource
       | alloc  | free   | requirement (memory footprint)
-------+--------+--------+-------------------------------
ideal  | Θ(1)   | Θ(1)   | RR(1)
result | Θ(1)   | Θ(m)   | RR(1)  (m << n, footprint in the range of kilobytes)

I'd say that's not bad. We describe 4G within 4k, our allocation is terribly fast, and only freeing memory can take a while, but even that can be done in time.
Consideration
With 32-bit frame addresses we can describe physical memory of 2^(32+12)=2^44 bytes, so one 4k frame for the circular buffer is enough for 64-bit as well (enough for 16Tb of physical RAM). If you want more than 16Tb, use two frames.
The free algorithm can be sped up even more if you keep the entries in the FIFO ordered by base address. If you do so, you can find the matching fragment (or report no match) with a binary search, in Θ(log(m)).
I hope you found this article interesting and that it helps someone.