Help with mapping bits to pages in buddy memory allocator

Questions about which tools to use, bugs, the best way to implement a function, etc. should go here. Don't forget to see if your question is answered in the wiki first! When in doubt post here.
jrhetf4xb
Member
Posts: 38
Joined: Fri May 16, 2014 11:50 am
Location: Bulgaria

Help with mapping bits to pages in buddy memory allocator

Post by jrhetf4xb »

Hi,

I've created my own buddy allocator with a bitmap, with each bit corresponding to a page block of a particular size.
Since I map the whole 4GiB address space, logically the 1st bit would represent a 4GiB block, the next two bits would represent blocks of 2GiB, the next four bits would represent blocks of 1GiB, and so on until we reach single blocks of 4KiB (the page size).
Getting the children of a bit would be as simple as noting the bit position, bitN, and:
leftNode = bitN * 2
rightNode = bitN * 2 + 1
Getting the parent of any bit would be bitN / 2.
In a sense, the whole bitmap would look like:

Code:

bit number:  [1]   [2 3]   [4 5 6 7]   [8 9 10 11 12 13 14 15]   [16 ... etc.]
holds block:  4G    2G      1G          512M                      256M ... 4KiB
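In C the index math is trivial; a minimal sketch (the helper names are just mine):

Code:

#include <stdint.h>

/* 1-indexed implicit binary tree over the bitmap: bit 1 is the 4GiB
 * root, bits 2-3 are the 2GiB blocks, and so on down to the leaves. */
static inline uint32_t left_child(uint32_t bitN)  { return bitN * 2; }
static inline uint32_t right_child(uint32_t bitN) { return bitN * 2 + 1; }
static inline uint32_t parent(uint32_t bitN)      { return bitN / 2; }
static inline uint32_t buddy(uint32_t bitN)       { return bitN ^ 1; }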
I can compute a unique physical address for each block of a given size, but because of the way this scheme works the addresses overlap between sizes, meaning that the same physical address can be returned, for example, for a 4KiB block or a 16KiB block. This is especially true for bit numbers that are powers of 2 -- bits 1, 2, 4, 8, etc. all point to the leftmost physical block, which is address 0. Likewise, bits 3, 6, 12, 24, etc. all point to the block beginning at address 2GiB, but that block may be of any (valid) size!
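For illustration, the forward mapping (bit number to physical address) in this scheme is something like the sketch below, assuming the managed 4GiB starts at physical address 0 and using the GCC/Clang __builtin_clz intrinsic:

Code:

#include <stdint.h>

/* Level of a bit in the implicit tree: 0 for the 4GiB root,
 * 20 for the 4KiB leaves. Equivalent to floor(log2(bitN)). */
static inline uint32_t level_of(uint32_t bitN)
{
    return 31 - __builtin_clz(bitN);
}

/* Physical address covered by a bit: its offset within the level,
 * multiplied by the block size at that level (4GiB >> level). */
static inline uint32_t bit_to_addr(uint32_t bitN)
{
    uint32_t level  = level_of(bitN);
    uint32_t offset = bitN - (1u << level);
    /* 64-bit shift so the level-0 case doesn't shift a 32-bit value by 32 */
    return (uint32_t)((uint64_t)offset << (32 - level));
}

Any bit whose number is a power of 2 has offset 0 within its level and therefore maps to address 0 -- the overlap described above. The reverse mapping is ambiguous unless the level (i.e. the block size) is also known.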
A good example output would be:

Code:

(assume address 0 is valid and isn't considered NULL for the sake of the example)
allocate(1) -- allocate 1 page, returns address 0x00000000. Valid usable range is 0-4095 (0x00000000 - 0x00000FFF).
allocate(524288) -- allocate 512k pages (2GiB), returns address 0x80000000. Valid usable range is 0x80000000-0xFFFFFFFF.
free((uint32_t *)0) -- free address 0. Page is reclaimed and merged with its free buddies.
allocate(16) -- allocate 16 pages, returns address 0x00000000! Same address but now with a bigger usable range of 0-65535!
As you can see, with this implementation the same address can point to a block of any size. To free a block I need to recover its size so I can index into the bitmap and start merging buddies -- but I've spent hours trying to figure out this reverse mapping from an address back to its bit in the bitmap!
I have no problems allocating any number of pages and I get unique addresses back; I just don't know how to index back into the bitmap once I have to free them!
Any ideas how to go about this?
Practice makes perfect.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Help with mapping bits to pages in buddy memory allocator

Post by Brendan »

Hi,

Suppose you're managing 1024 pages. You will have 1024 bits (one for each page), plus an additional 512 bits (one for each pair of pages), plus an additional 256 bits (one for each group of 4 pages), plus an additional 128 bits (one for each group of 8 pages), and so on down to a single bit for the whole group. This adds up to a total of 2047 bits.

What do the bits indicate? There are 3 answers.

The first answer is that the bit is set whenever any page in the group is free. This would make it an insignificant tiny little bit faster to find a free page, but every time you allocate or free a page you'll have to waste time updating all the bits for larger block sizes so it'd probably end up being slower than just using 1 bit per page without the over-complicated and unnecessary mess.

The second answer is that the bit is set whenever all pages in the group are free, and if/when that happens the bits for the smaller blocks and the individual pages are left "as is". In this case, when you allocate or free a page you have to waste time diddling about with the bits for larger blocks and it'd be faster to simply not bother and only have 1 bit per page.

The third answer is that the bit is set whenever all pages in the group are free, and if/when that happens the bits for the smaller blocks and the individual pages are cleared. In that case to allocate a page you'd have to break up the larger block and when you free a page you'd have to try to merge the smaller blocks into larger blocks. Basically, you have to waste time diddling about with the bits for larger blocks and it'd be faster to simply not bother and only have 1 bit per page.
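For example; the third scheme might look like this in practice (a rough sketch only, using the 1024-page example above; test_bit/set_bit/clear_bit are assumed one-line bitmap helpers, and at boot a single set_bit(1) marks the whole region free):

Code:

#include <stdint.h>
#include <stdbool.h>

#define PAGES     1024u            /* example size used above          */
#define LEAF_BASE PAGES            /* leaf bit for page p is PAGES + p */

static uint32_t bitmap[2 * PAGES / 32];   /* bits 1..2047 are used */

static bool test_bit(uint32_t n)  { return bitmap[n / 32] & (1u << (n % 32)); }
static void set_bit(uint32_t n)   { bitmap[n / 32] |=  1u << (n % 32); }
static void clear_bit(uint32_t n) { bitmap[n / 32] &= ~(1u << (n % 32)); }

/* Find any wholly free block: a linear scan for a set bit. This search,
 * and the splitting/merging below, is the "diddling about". */
static uint32_t find_free_block(void)
{
    for (uint32_t n = 1; n < 2 * PAGES; n++)
        if (test_bit(n))
            return n;
    return 0;                      /* out of memory */
}

/* Allocate one page: take a free block and split it down to leaf level,
 * leaving the right buddy of each split marked free. */
static int32_t alloc_page(void)
{
    uint32_t n = find_free_block();
    if (n == 0)
        return -1;
    clear_bit(n);                  /* block is no longer wholly free */
    while (n < LEAF_BASE) {
        set_bit(2 * n + 1);        /* right buddy stays free         */
        n = 2 * n;                 /* keep splitting the left buddy  */
    }
    return (int32_t)(n - LEAF_BASE);   /* page number */
}

/* Free one page: while the buddy block is also wholly free, merge with
 * it, then mark the resulting block free. */
static void free_page(uint32_t page)
{
    uint32_t n = LEAF_BASE + page;
    while (n > 1 && test_bit(n ^ 1)) {
        clear_bit(n ^ 1);          /* absorb the free buddy */
        n /= 2;                    /* move up to the parent */
    }
    set_bit(n);
}

Compare that with "allocate = clear one bit, free = set one bit" for a plain "1 bit per page" bitmap and you can see where the time goes.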

Note 1: For a 32-bit OS (with PAE) you can have up to 52-bit physical addresses. You will not be limited to 4 GiB of RAM.

Note 2: For a 32-bit OS (with "plain paging" and without PAE or PSE36) the legacy BIOS area will probably ruin the lowest 2 GiB block and the "PCI hole" will definitely ruin the highest 2 GiB block; so you will probably never see a 2 GiB block that's actually free and will also definitely never see a free 4 GiB block.

Note 3: If your OS uses 4 KiB page sizes only; then for 99.99% of all allocations/de-allocations you only need 1 page. The remaining 0.01% of allocations/de-allocations (where you actually need multiple physically contiguous pages and can't use multiple pages that aren't physically contiguous) are far too rare to matter for performance.

Note 4: If your OS uses 4 KiB, 2 MiB and/or 1 GiB page sizes; then you only need to care about 4 KiB, 2 MiB and/or 1 GiB block sizes and have no reason to care about all of the other sizes for 99.99% of all allocations/de-allocations. The remaining 0.01% of allocations/deallocations are still far too rare to matter for performance.

Note 5: Once upon a time (over 20 years ago) some uni student wrote a minimal kernel. At the time this person had no idea what he was doing; so he just mapped everything in the physical address space into kernel space because it sounded easy (and "good enough" probably was good enough back then, partly because he thought the kernel would never actually be used for much anyway, and partly because you were very lucky to see a computer with more than 16 MiB of RAM at the time). This kernel ended up using a buddy allocator to manage virtual memory (where a buddy allocator actually makes sense); however one consequence of the idiotic "map all of the physical address space into kernel space" hack was that the same memory manager that was used for virtual memory could also be used to manage physical memory (where using a buddy allocator is stupid). Unfortunately this kernel was in the right place at the right time, got popular for political rather than technical reasons, and grew too big too quickly. As a result, it became too hard to fix earlier design mistakes and do physical memory management properly (because too much code grew to depend on the older/original brain-dead stupidity). Linux still uses a buddy allocator for physical memory today, even though no other kernel manages physical memory that way, and even though every other kernel does it better than Linux.

Basically what I'm saying here is that for physical memory management buddy allocators are inferior to simpler approaches and you probably shouldn't bother with them in the first place. :)


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
jrhetf4xb
Member
Posts: 38
Joined: Fri May 16, 2014 11:50 am
Location: Bulgaria

Re: Help with mapping bits to pages in buddy memory allocator

Post by jrhetf4xb »

Interesting point of view on the buddy allocator. Do you have any references I could read that compare different allocators (e.g. research, benchmarks, user tests, example implementations)? I also find it difficult to follow the logic behind this 3-phase bit allocator... Can you clarify it with an example?

On a side note, isn't it better to just use paging within the kernel? I was thinking of having a simple stack of free page addresses: pop() returns a free address and push() stores one back. All of this is neatly done in a few lines of code and is O(1), requiring just the space for the addresses themselves (4MiB). If the kernel employs virtual memory then it accesses contiguous logical addresses, and the actual order of the physical pages behind the scenes doesn't really matter, right? An object may be split between two physical pages that are non-contiguous, but the PTEs/PDEs will map their corresponding virtual addresses as consecutive. Is there a problem with this approach?
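Something like this minimal sketch is what I have in mind (no locking or bounds checking; names are just for illustration):

Code:

#include <stdint.h>

/* One 32-bit address per free 4KiB page; for the full 4GiB that's
 * 1M entries = 4MiB of bookkeeping, as mentioned above. */
#define MAX_PAGES (1u << 20)

static uint32_t free_stack[MAX_PAGES];
static uint32_t top;                    /* number of free pages */

static void push(uint32_t phys_addr)    /* free a page: O(1)     */
{
    free_stack[top++] = phys_addr;
}

static uint32_t pop(void)               /* allocate a page: O(1) */
{
    return top ? free_stack[--top] : 0; /* 0 here means "out of memory" */
}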
Practice makes perfect.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Help with mapping bits to pages in buddy memory allocator

Post by Brendan »

Hi,
jrhetf4xb wrote:Interesting point of view on the buddy allocator. Do you have any references I could read that compare different allocators (e.g. research, benchmarks, user tests, example implementations)? I also find it difficult to follow the logic behind this 3-phase bit allocator... Can you clarify it with an example?
"3-phase bit allocator"?

All I'm saying is that your problem comes from not knowing what your bits indicate. Once you determine/decide exactly what a bit indicates (e.g. "any smaller block in the group is free" or "all smaller blocks in the group are free") you'll be able to figure out what steps need to be taken when allocating and freeing to ensure that you get correct behaviour.
jrhetf4xb wrote:On a side note, isn't it better to just use paging within the kernel?
Yes; but there's many very different ways of using paging. The problem is that some people see how Linux does things (physical address space mapped 1:1 into kernel space, buddy allocator that seems to be used for physical memory management) and don't realise that Linux is a steaming pile of donkey puke, and then implement a kernel that regurgitates the same mistakes.

In my opinion; it's best to:
  • Separate the 3 distinct concerns (physical memory manager, virtual memory manager and "heap")
  • For physical memory management, use something that's designed specifically for physical pages (e.g. free page stacks; see the sketch after this list) to improve performance for 99.9% of physical memory allocations/de-allocations
  • For 32-bit systems; avoid trying to map 4 GiB or more RAM into 1 GiB of kernel space, failing, implementing something else to manage the physical memory that couldn't be mapped (so you end up with 2 different schemes for the same thing), and then trying to blame Intel for your design flaw
  • For monolithic kernels; avoid trying to map the entire physical address space into kernel space (and only map what is necessary in kernel space) to increase the chance of catching dodgy pointers and other bugs in thousands of buggy device drivers
  • Avoid using fixed physical addresses for kernel space for performance reasons (e.g. NUMA optimisations, page colouring, etc)
  • Avoid using fixed physical addresses for kernel space for security reasons (e.g. to help make hardware based attacks like "rowhammer" harder)
  • Avoid using fixed physical addresses for kernel space for fault tolerance reasons (e.g. if the physical page at 0x00100000 is faulty you can just use a different physical page)
  • Avoid using fixed physical addresses for kernel space to ensure it can actually boot in the first place (e.g. UEFI doesn't provide any guarantees about which areas are/aren't usable RAM, and there's no guarantee that there's usable RAM at physical address 0x00100000)
  • For allocating/de-allocating data structures, buffers, etc in the kernel; to improve performance consider using a specially designed allocator for each specific purpose (rather than having a generic thing like "kmalloc" and "kfree" that's used for everything and can't be optimised for any specific purpose). For one simple example; maybe you've got process IDs and a "process data structure" for each process, and the fastest way to find the data structure from the process ID is to do "address_of_structure = &process_data_structure_table[process_ID]" where paging is used to ensure that the huge array doesn't waste RAM for unused entries.
  • For 32-bit monolithic kernels, recognise that VFS cache may need to be larger than kernel space, and that (e.g. with many video cards or something) kernel space may also be too small to contain all memory mapped PCI devices.
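As a sketch of the free page stack idea mentioned in the list above: the usual trick is to store the "next" link in the first dword of each free page itself, so the only fixed cost is one head pointer per stack (rather than an array of addresses). This assumes 32-bit physical addresses for brevity, and phys_to_virt() is a hypothetical helper that returns a kernel-accessible mapping of a physical page:

Code:

#include <stdint.h>

/* Hypothetical helper, defined elsewhere: map a physical page somewhere
 * the kernel can read/write (possibly via a temporary mapping). */
uint32_t *phys_to_virt(uint32_t phys);

static uint32_t stack_head = 0;        /* 0 = stack is empty */

static void free_phys_page(uint32_t phys)
{
    uint32_t *page = phys_to_virt(phys);
    *page = stack_head;                /* link to the previous head */
    stack_head = phys;
}

static uint32_t alloc_phys_page(void)
{
    uint32_t phys = stack_head;
    if (phys != 0)
        stack_head = *phys_to_virt(phys);   /* unlink the head page */
    return phys;                       /* 0 = out of physical memory */
}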

Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
jrhetf4xb
Member
Posts: 38
Joined: Fri May 16, 2014 11:50 am
Location: Bulgaria

Re: Help with mapping bits to pages in buddy memory allocator

Post by jrhetf4xb »

Now that you explain it, it does make sense; it seems much better to have a stack of addresses than a messy allocator that provides contiguous pages but suffers fragmentation and gets progressively worse, performance-wise, as more allocations are made.
But then again since no one has done this I feel like there's a reason :D
Is there a reason not to do it this way in 0.01% of the cases you mentioned?
Practice makes perfect.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Help with mapping bits to pages in buddy memory allocator

Post by Brendan »

Hi,
jrhetf4xb wrote:Now that you explain it, it does make sense; it seems much better to have a stack of addresses than a messy allocator that provides contiguous pages but suffers fragmentation and gets progressively worse, performance-wise, as more allocations are made.
But then again since no one has done this I feel like there's a reason :D
Is there a reason not to do it this way in 0.01% of the cases you mentioned?
For physical memory allocations, there's actually 4 cases:
  • Normal stuff - nobody cares what the physical address is and nobody needs physically contiguous pages. This is the 99%.
  • Device drivers for "32-bit only" PCI devices (and the PDPT for PAE) - the physical address must be below 0x0000000100000000
  • Device drivers for PCI devices that need physically contiguous pages
  • ISA DMA buffers - the RAM must be below 0x01000000, must be physically contiguous, and must not cross a 64 KiB boundary
For the normal stuff, the best option is typically free page stack/s (depending a little on a few other things) because they're very fast. For the "32-bit only" cases you can also use free page stacks (e.g. one free page stack for RAM below 4 GiB and another for RAM above 4 GiB).

For the remainder (physically contiguous pages for PCI devices and ISA DMA buffers) free page stacks won't work. To fix that it's enough to (e.g.) manage RAM below 0x01000000 with something else (e.g. a "1 bit per page" bitmap). This is slower, but nobody cares because it's rare and not performance critical.

Of course for allocations you'd have a search order. E.g. if possible allocate a page above 4 GiB; else if possible allocate a page above 0x01000000, else allocate a page below 0x01000000. The goal is to try to make sure the more versatile RAM isn't used unnecessarily.

For de-allocations, you'd use the page's physical address to figure out what to do with it (e.g. if the physical address is < 0x01000000 tell the bitmap thing to free it; else...).
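As a rough sketch of that search order and the matching de-allocation dispatch (the per-zone allocators are assumed and not shown):

Code:

#include <stdint.h>

#define ISA_LIMIT 0x01000000ull        /* 16 MiB */
#define PCI_LIMIT 0x100000000ull       /* 4 GiB  */

enum zone { ZONE_HIGH, ZONE_LOW };     /* above 4 GiB; 16 MiB to 4 GiB */

/* Hypothetical per-zone allocators: two free page stacks, plus a bitmap
 * (or similar) for the awkward RAM below 16 MiB. */
uint64_t stack_pop(enum zone z);       /* returns 0 if the stack is empty */
void     stack_push(enum zone z, uint64_t phys);
uint64_t bitmap_alloc_low(void);
void     bitmap_free_low(uint64_t phys);

/* Allocation: take the least versatile RAM last, so the more versatile
 * RAM isn't used unnecessarily. */
uint64_t alloc_page(void)
{
    uint64_t phys;
    if ((phys = stack_pop(ZONE_HIGH)) != 0) return phys;  /* above 4 GiB  */
    if ((phys = stack_pop(ZONE_LOW))  != 0) return phys;  /* 16 MiB-4 GiB */
    return bitmap_alloc_low();                            /* below 16 MiB */
}

/* De-allocation: dispatch on the page's physical address. */
void free_page(uint64_t phys)
{
    if      (phys < ISA_LIMIT) bitmap_free_low(phys);
    else if (phys < PCI_LIMIT) stack_push(ZONE_LOW, phys);
    else                       stack_push(ZONE_HIGH, phys);
}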

Note: There's several optional "OS features" that could complicate things; like support for large page sizes, NUMA optimisation, page colouring, "page cleaning" when CPU has nothing better to do, RAM fault tolerance, etc. Depending on the exact requirements/features supported, free page stacks might not be suitable, or you might need a large number of them.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
jrhetf4xb
Member
Posts: 38
Joined: Fri May 16, 2014 11:50 am
Location: Bulgaria

Re: Help with mapping bits to pages in buddy memory allocator

Post by jrhetf4xb »

Thank you again for the detailed explanation.
But why do the devices need physically contiguous memory? Isn't it enough to provide them with contiguous virtual memory with addresses below 0x0000000100000000 or below 0x01000000? As far as the drivers are concerned, they'd operate with seemingly contiguous addresses...
Practice makes perfect.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Help with mapping bits to pages in buddy memory allocator

Post by Brendan »

Hi,
jrhetf4xb wrote:Thank you again for the detailed explanation.
But why do the devices need physically contiguous memory? Isn't it enough to provide them with contiguous virtual memory with addresses below 0x0000000100000000 or below 0x01000000? As far as the drivers are concerned, they'd operate with seemingly contiguous addresses...
The CPU converts virtual addresses into physical addresses internally. Nothing outside the CPU knows anything about it. Devices only ever see the physical addresses.

To deal with this; some devices don't access RAM (e.g. they only have memory mapped IO) and it doesn't affect them at all; some devices are specially designed so that they only need "4 KiB or less" pieces to be physically contiguous (e.g. instead of transferring 40 KiB from device to RAM in one go, the device driver breaks it into ten 4 KiB pieces for the device); and some devices do need physically contiguous RAM for their transfers.
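For the middle case; the driver-side splitting might look something like this (a rough sketch; virt_to_phys() is a hypothetical helper that walks the page tables to find the physical address backing a virtual address):

Code:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, defined elsewhere. */
uint64_t virt_to_phys(const void *virt);

struct sg_entry {
    uint64_t phys;                 /* physical address of this piece */
    uint32_t len;                  /* length in bytes, at most 4096  */
};

/* Split a virtually contiguous buffer into physically contiguous pieces
 * of at most one 4 KiB page each, for a device that takes a segment list. */
size_t build_sg_list(const void *buf, size_t len,
                     struct sg_entry *sg, size_t max_entries)
{
    uintptr_t virt = (uintptr_t)buf;
    size_t n = 0;

    while (len > 0 && n < max_entries) {
        uint32_t piece = 4096 - (uint32_t)(virt & 0xFFF); /* to page edge */
        if (piece > len)
            piece = (uint32_t)len;
        sg[n].phys = virt_to_phys((const void *)virt);
        sg[n].len  = piece;
        n++;
        virt += piece;
        len  -= piece;
    }
    return n;                      /* number of entries filled */
}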

However; where performance matters it's good if the device can transfer data directly to/from a process' "normally allocated" pages (e.g. without wasting CPU time copying the data to/from a specially allocated buffer). For this reason most devices (especially high performance stuff, like hard disk controllers and video) don't need physically contiguous RAM for their transfers.

Note: If a computer has an IOMMU it may be possible to set it up so that it mirrors the way the CPUs' MMUs are configured. The problem there is that the IOMMU is considered part of Intel's "virtualisation stuff", which is disabled on a lot of CPUs (and is also typically "reserved by the OS for virtual machine use" when it is present).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
jrhetf4xb
Member
Posts: 38
Joined: Fri May 16, 2014 11:50 am
Location: Bulgaria

Re: Help with mapping bits to pages in buddy memory allocator

Post by jrhetf4xb »

Brendan wrote:The CPU converts virtual addresses into physical addresses internally. Nothing outside the CPU knows anything about it. Devices only ever see the physical addresses.
#-o You're right, I forgot about this part. Thanks for clarifying!
Practice makes perfect.