Preparations for buddy allocation system
Preparations for buddy allocation system
I am currently trying to understand the buddy system for the physical memory manager. Although I understand how the system itself works (meaning the split of blocks, merging of free blocks by checking the other buddy, etc), I can't seem to find how to connect the buddy allocator with the free and available memory listed by the multiboot info structure.
As of now, I can list the available free memory regions in the physical memory:
So given that, there are some question I would like to discuss/ask:
The buddy allocation system uses blocks with the size of multiples of 4096 bytes (page size). Depending on the order/level of the block, the total size of a chunk is 2^0 * 4096, 2^1 * 4096, ..., 2^10 * 4096 (max) bytes.
As you can see in the screenshot, the first available memory region usable for allocations is located at 0x00000000 and has a length of 0x9F000. When I want to use this block of continuous memory for allocation by using the buddy system, I cannot seem to split it since it is not a power of 2:
0x9F000 = 651264
651264 / 4096 = 159 -> 159 pages
The same goes for the other available block of memory, located at 0x00100000 with a size of 0x3FEF0000.
The other question would be, since the buddy allocation itself needs some memory for itself, should I simply reserve some memory statically similar to the situation in paging, where I used an array of uint64_t values for the pml4 table?
I just don't know where to start exactly when setting up a physical frame allocator.
As of now, I can list the available free memory regions in the physical memory:
So given that, there are some question I would like to discuss/ask:
The buddy allocation system uses blocks with the size of multiples of 4096 bytes (page size). Depending on the order/level of the block, the total size of a chunk is 2^0 * 4096, 2^1 * 4096, ..., 2^10 * 4096 (max) bytes.
As you can see in the screenshot, the first available memory region usable for allocations is located at 0x00000000 and has a length of 0x9F000. When I want to use this block of continuous memory for allocation by using the buddy system, I cannot seem to split it since it is not a power of 2:
0x9F000 = 651264
651264 / 4096 = 159 -> 159 pages
The same goes for the other available block of memory, located at 0x00100000 with a size of 0x3FEF0000.
The other question would be, since the buddy allocation itself needs some memory for itself, should I simply reserve some memory statically similar to the situation in paging, where I used an array of uint64_t values for the pml4 table?
I just don't know where to start exactly when setting up a physical frame allocator.
Re: Preparations for buddy allocation system
I can give you my observations from the way Linux works. Generally speaking, the buddy allocator is not used for physical memory allocation per se, but virtual memory allocation. The physical memory fragmentation and locality is not generally important for performance, unless you are trying to optimize for NUMA, bank row conflicts and memory channel utilization, etc. This means that the physical pages can be allocated even with a simple linked list LIFO allocator. Some legacy DMA (which don't support scatter-gather in systems that don't have IOMMU) may need larger contiguous physical chunks, but this can be resolved by setting aside zones of memory that are subdivided in larger units.
However, the solution for virtual memory buddy allocator still depends on how you map the physical pages. If you map the entire physical memory (or a large portion of it) linearly in the kernel address space, the physical layout is important. Essentially, the virtual address space may have holes, since it mirrors the physical address space.
Before going further, I should say that you can store your allocator metadata in the page frame database. This is not very inefficient, because the allocations are done at the page level anyway. Which leaves the question, how do you establish the address for the page frame record that holds information for a given page. I will leave this for the end. Also note, that the strategies for a page allocator and kernel pool allocator are different. The latter usually keeps its metadata in internal structures.
So, the buddy allocator has to check whether the buddy of a block is already coalesced to the previous respective size, and also if it falls inside a memory hole. First, it is important to observe that if the buddy is in a different available region, separated by a hole, it will never be able to coalesce and be paired with another block, because the hole will prevent this. What remains is to check if the buddy itself is inside a hole. The trivial solution is to have a flag in the page frame record, but this requires maintaining records even for the unavailable space, which can be rather inefficient. Alternatively, you can divide the physical memory uniformly in large sections of the same size. Large enough to be efficient in cutting away large memory holes, and small enough to encompass available memory tightly. From a few MB, to a few hundred MB. For each section, you maintain a structure with indication whether it has free memory in it, and the location of its first page frame record. This means you must set aside space for a sections database as well, one record per some fixed number of pages. Knowing the beginning of the sections database, the location of a section record for a given memory address can be determined with arithmetic. If the section indicates that it lacks available physical space, the allocator can be sure that blocks within that section are part of a hole. If the section contains available space (meaning, as reported by GRUB), then the allocator has to also check if the frame record indicates that the page is itself available (to handle sections with only some frames available).
Generally speaking, a better possibility may be to map the available physical regions reported by GRUB adjacent in the kernel address space. This will enable working without handling holes. Not only this, but you could also map the page frame database as one contiguous array as well, and covert between frame records and page addresses with linear arithmetic. The disadvantage is that converting between physical and virtual addresses will be somewhat more convoluted (with lookup into self-mapping page directories), and you cannot use 1G page entries. Also, finding the frame record of a page from a physical address will still involve the section database, as described above. This may be necessary for regions of the kernel address space that use different discontiguous mapping technique.
If you map the entire physical memory (not just the available memory) linearly in the kernel address space, you need to find a way to convert page addresses to page frame records. One way is to use the section lookup described above, adding offset to the page frame database pointer from the section record. Another is to reserve virtual address space to hold frame records for the entire physical address space, but to map only portions of it that will actually hold records corresponding to available memory. This will make the conversion possible using only address arithmetic.
Regarding the memory which you use for the section and page frame database. You can allocate it in any way you want, such as linear first fit, sort and best fit, etc. After that, you can simply treat that memory as if it was unavailable when creating the buddy allocator structure. (I assume that you will not support memory hot-plug for the time being.)
One last thing. If the buddy allocator has to end (or even start) on address that is not perfectly aligned for a power of two, you can always treat the buddies beyond the end boundary as "allocated". In other words, during free, before making further assessment of the candidate buddy, you first can check if it falls beyond the upper limit of the allocator memory and reject it for the purposes of coalsecing if it does. In other words, some blocks will never get buddies.
As you can see, there are many options.
However, the solution for virtual memory buddy allocator still depends on how you map the physical pages. If you map the entire physical memory (or a large portion of it) linearly in the kernel address space, the physical layout is important. Essentially, the virtual address space may have holes, since it mirrors the physical address space.
Before going further, I should say that you can store your allocator metadata in the page frame database. This is not very inefficient, because the allocations are done at the page level anyway. Which leaves the question, how do you establish the address for the page frame record that holds information for a given page. I will leave this for the end. Also note, that the strategies for a page allocator and kernel pool allocator are different. The latter usually keeps its metadata in internal structures.
So, the buddy allocator has to check whether the buddy of a block is already coalesced to the previous respective size, and also if it falls inside a memory hole. First, it is important to observe that if the buddy is in a different available region, separated by a hole, it will never be able to coalesce and be paired with another block, because the hole will prevent this. What remains is to check if the buddy itself is inside a hole. The trivial solution is to have a flag in the page frame record, but this requires maintaining records even for the unavailable space, which can be rather inefficient. Alternatively, you can divide the physical memory uniformly in large sections of the same size. Large enough to be efficient in cutting away large memory holes, and small enough to encompass available memory tightly. From a few MB, to a few hundred MB. For each section, you maintain a structure with indication whether it has free memory in it, and the location of its first page frame record. This means you must set aside space for a sections database as well, one record per some fixed number of pages. Knowing the beginning of the sections database, the location of a section record for a given memory address can be determined with arithmetic. If the section indicates that it lacks available physical space, the allocator can be sure that blocks within that section are part of a hole. If the section contains available space (meaning, as reported by GRUB), then the allocator has to also check if the frame record indicates that the page is itself available (to handle sections with only some frames available).
Generally speaking, a better possibility may be to map the available physical regions reported by GRUB adjacent in the kernel address space. This will enable working without handling holes. Not only this, but you could also map the page frame database as one contiguous array as well, and covert between frame records and page addresses with linear arithmetic. The disadvantage is that converting between physical and virtual addresses will be somewhat more convoluted (with lookup into self-mapping page directories), and you cannot use 1G page entries. Also, finding the frame record of a page from a physical address will still involve the section database, as described above. This may be necessary for regions of the kernel address space that use different discontiguous mapping technique.
If you map the entire physical memory (not just the available memory) linearly in the kernel address space, you need to find a way to convert page addresses to page frame records. One way is to use the section lookup described above, adding offset to the page frame database pointer from the section record. Another is to reserve virtual address space to hold frame records for the entire physical address space, but to map only portions of it that will actually hold records corresponding to available memory. This will make the conversion possible using only address arithmetic.
Regarding the memory which you use for the section and page frame database. You can allocate it in any way you want, such as linear first fit, sort and best fit, etc. After that, you can simply treat that memory as if it was unavailable when creating the buddy allocator structure. (I assume that you will not support memory hot-plug for the time being.)
One last thing. If the buddy allocator has to end (or even start) on address that is not perfectly aligned for a power of two, you can always treat the buddies beyond the end boundary as "allocated". In other words, during free, before making further assessment of the candidate buddy, you first can check if it falls beyond the upper limit of the allocator memory and reject it for the purposes of coalsecing if it does. In other words, some blocks will never get buddies.
As you can see, there are many options.
Re: Preparations for buddy allocation system
Sorry, I'm already a bit lost: I thought the buddy allocator was used as the physical memory allocator in Linux? (As stated here).simeonz wrote:I can give you my observations from the way Linux works. Generally speaking, the buddy allocator is not used for physical memory allocation per se, but virtual memory allocation.
I am mapping - or rather, I am going to map the complete physical address space into the virtual memory space. At which address this is going to happen is something I am deciding/researching in near future.simeonz wrote:However, the solution for virtual memory buddy allocator still depends on how you map the physical pages. If you map the entire physical memory (or a large portion of it) linearly in the kernel address space, the physical layout is important. Essentially, the virtual address space may have holes, since it mirrors the physical address space.
Sorry, never heard of that. A quick search on Google returns the Windows Internals book, but given the context of your answer, I think you were talking about Linux. Are there some similarities or are you giving your own example here? Maybe you can provide a reference?simeonz wrote:Before going further, I should say that you can store your allocator metadata in the page frame database.
So you are saying that there is a kind of "loss of memory" involved, since the rounding to the lower power of 2, which fits into the free memory block, does not use the other half, of which a part is technically free, but since the buddy system treats it as allocated, it is reserved and never used. Hm, hearing this makes me a bit uncomfortable, isn't that kind of a bad point for the buddy allocator?simeonz wrote:One last thing. If the buddy allocator has to end (or even start) on address that is not perfectly aligned for a power of two, you can always treat the buddies beyond the end boundary as "allocated". In other words, during free, before making further assessment of the candidate buddy, you first can check if it falls beyond the upper limit of the allocator memory and reject it for the purposes of coalsecing if it does. In other words, some blocks will never get buddies.
Think about having a huge block of memory, which is almost a power of 2. Since rounding down "reserves"/"discards" almost half of the block, there is a huge loss of memory here.
Thanks a lot for your detailed answer , but I am still looking for an answer to my OP above.
Re: Preparations for buddy allocation system
The Linux OS does not use a coalescing allocator for managing the physical memory intentionally. But since a large portion of the virtual memory directly maps onto the physical memory, the two issues become conflated. By using the buddy allocator to manage the virtual address space, Linux is effectively using it to manage the physical address space as well. This is, you could say, incidental. The actual reason for mapping the physical memory linearly is to avoid TLB thrashing, as well as to simplify the work with physical addresses (for DMA operations, for use in system threads, lookups in the page frame database/mem_map array, etc).CRoemheld wrote:Sorry, I'm already a bit lost: I thought the buddy allocator was used as the physical memory allocator in Linux?
What I was trying to point out is that you have to decide the kind of virtual mapping you prefer to use, before you can evaluate the interactions between the physical memory layout and your virtual memory allocator. If your virtual memory and physical memory are intervolved, it will be incidentally a physical memory allocator as well and will be impacted by the gaps in the available physical memory. Windows uses discontiguous mapping and hence its physical allocator relies on a simple linked list instead, whereas the virtual allocator is combination of first fit and best fit.
Page frame and page frame number are standard terms for any OS. Page frame (number) database is a term used in Windows for the array describing the page frames (which I find appropriate), and I took the liberty to call the structures inside that array page frame records. Page frame descriptor is sometimes used. In Linux, those structures are called page descriptors and are stored in something that is called the mem_map array (by the name of the identifier.) However, I intentionally try not to ascribe to nomenclature originating from source definitions, so either use multiple terms interchangeably to avoid bias and blur the lines, or choose one which I think is the most conceptually appropriate.CRoemheld wrote:Sorry, never heard of that. A quick search on Google returns the Windows Internals book, but given the context of your answer, I think you were talking about Linux. Are there some similarities or are you giving your own example here? Maybe you can provide a reference?
The buddy allocator relies on metadata in the "page descriptors" in Linux, as you can see here. This function decides if a candidate buddy block is ready for coalescing. It checks if the order of the two blocks coincides, by using the page_order function, which in turn uses the page_private macro, which simply extracts the "private" field of the page descriptor structure. (The order here is the block size indicator.) Also, while linux does not use the term "page frame database", both Linux and Windows refer to page frame numbers, as can be seen from the acronyms in many function names.
Last edited by simeonz on Fri Jun 15, 2018 4:39 pm, edited 2 times in total.
Re: Preparations for buddy allocation system
CRoemheld, for "physical" memory allocation, you do demand paging - allocating by 1 or slightly more pages at each page fault. no allocator at all. just pick first available page from the free list stored (inside PFN) map it and voila.
buddy allocation is space wasteful by the way. I wanted to use it on my UEFI, but decided to go with the first fit algorithm.
buddy allocation is space wasteful by the way. I wanted to use it on my UEFI, but decided to go with the first fit algorithm.
Re: Preparations for buddy allocation system
No, I meant something else. Let's say that you have a region starting at address 0 and ending at address 13 (not inclusive). In this case, your buddy allocator will have one free block of size 8, one free block of size 4, and one free block of size 1. In memory they will be ordered (B8),(B4),(B1). If B1 is allocated and then freed, since it has no buddy, it will not be coalesced. But the check here fails not because the buddy is not free, but because it exists outside the boundaries of the region (at address 13). Some blocks will never be coalesced with buddies, but this has no detrimental effect, except maybe making the largest satisfiable allocation slightly smaller.CRoemheld wrote:So you are saying that there is a kind of "loss of memory" involved, since the rounding to the lower power of 2, which fits into the free memory block, does not use the other half, of which a part is technically free, but since the buddy system treats it as allocated, it is reserved and never used. Hm, hearing this makes me a bit uncomfortable, isn't that kind of a bad point for the buddy allocator?
Re: Preparations for buddy allocation system
I should provide some references indeed. I can hardly explain the stuff properly. Unfortunately, there are very few usable documents, because the topic is outside the scope of introductory articles (don't know why).
For Linux, one interesting but brief set of articles is found here: Sparsemem
This is one of the few places that briefly discusses the sparsemem model (which allows efficient handling of large memory gaps), aside from the source tree documents. Another place where it is briefly mentioned is in the Linux Inside booklet. (search "Initialization of the sparse memory")
Linux however does not merely use the GRUB memory map or the bios calls (like int 0x15, 0xe820). There is code for processing the firmware memory maps, but there is also code that tries to discover the numa topology and uses one of two approaches for x86. The first is to parse the ACPI Memory Affinity Structures in SRAT (System/Static Resource Affinity Table), which is done here. Those structures are described in our wiki. The other one is to parse the pci configuration space of the northbridge PnP device, which is done here. The reason why I mention this, is that Linux may be using more sophisticated approach to partition the physical address space into sections. I cannot confirm this yet though. My reverse engineering skills are not that strong.
For Windows, there is a brief section for the page frame database here: Windows NT Virtual Memory (don't mind the article name)
And there is some more information on how it operates here: Mysteries of Windows Memory Management Revealed
You can also find information about the database here: Exploring Windows virtual memory management (search for "Let's get physical")
And also here: Catalog of key Windows kernel data structures (search for nt!_MMPFN or nt!_MMPFNLIST)
The physical memory allocator in Windows is simple enough and is conceptually combined into the page replacement policy algorithms. Both use the page frame state and the list links from the PFN database to maintain their lists.
For Linux, one interesting but brief set of articles is found here: Sparsemem
This is one of the few places that briefly discusses the sparsemem model (which allows efficient handling of large memory gaps), aside from the source tree documents. Another place where it is briefly mentioned is in the Linux Inside booklet. (search "Initialization of the sparse memory")
Linux however does not merely use the GRUB memory map or the bios calls (like int 0x15, 0xe820). There is code for processing the firmware memory maps, but there is also code that tries to discover the numa topology and uses one of two approaches for x86. The first is to parse the ACPI Memory Affinity Structures in SRAT (System/Static Resource Affinity Table), which is done here. Those structures are described in our wiki. The other one is to parse the pci configuration space of the northbridge PnP device, which is done here. The reason why I mention this, is that Linux may be using more sophisticated approach to partition the physical address space into sections. I cannot confirm this yet though. My reverse engineering skills are not that strong.
For Windows, there is a brief section for the page frame database here: Windows NT Virtual Memory (don't mind the article name)
And there is some more information on how it operates here: Mysteries of Windows Memory Management Revealed
You can also find information about the database here: Exploring Windows virtual memory management (search for "Let's get physical")
And also here: Catalog of key Windows kernel data structures (search for nt!_MMPFN or nt!_MMPFNLIST)
The physical memory allocator in Windows is simple enough and is conceptually combined into the page replacement policy algorithms. Both use the page frame state and the list links from the PFN database to maintain their lists.
Re: Preparations for buddy allocation system
Thank you simeonz, zaval!
However, I am going to stick with the buddy allocation here. It also is a good way to test multiple algorithms for allocations.
The only thing which is a bit confusing right now are things like:
However, I am going to stick with the buddy allocation here. It also is a good way to test multiple algorithms for allocations.
The only thing which is a bit confusing right now are things like:
- - In the screenshot in the OP you can see that at address 0x00100000 there's a free memory region. However, right now the length of this area changed, from 0x3FEF0000 to 0x07EE0000. Is this always random, depending on what address the memory hole in the upper memory is showing up?
- The buddy allocator used in Linux uses a maximum order of 10, meaning the biggest available block would be 2^10 * 4096 Bytes = 4MiB in size. However, The first free memory area in lower memory has a size of 639KiB, and the free area in upper memory is 129920KiB in size. Would I need to split up the upper memory area into smaller blocks of 4 MiB and use multiple lists to keep track of the allocations? With point 1: if the size of free memory is not constant, how does the allocator change?
- I am also looking for code examples, because as of now I haven't found a shred of code or even tutorial which uses the buddy allocation system in connection with the free memory areas listed by the multiboot info structs mmap_* fields , but instead simply start allocating pages after the end of the kernel (by defining a variable in the linker script and access it via an external declared variable in C). So this point is the main objective of my question here: How to I setup a buddy allocator which uses the memory areas listed in the screenshot above and how does it deal with regions in memory, which are much larger than 2^10 * 4096 Bytes?
Re: Preparations for buddy allocation system
Hi,
Given that the memory map provided by Bochs is relatively predictable; I'd assume that you changed the amount of RAM it emulates (in its configuration file) from 1 GiB to 128 MiB.
Note that (because of multiple different paging tricks - "allocate on write", memory mapped files, swap space, etc); the single most common case (and the case that the physical memory manager should be designed and optimised for) is single page allocations/de-allocations. Typically the only case where you ever need to allocate multiple contiguous physical pages is special buffers for some (older) devices, and these often have bizarre/annoying additional restrictions (e.g. "can't cross a 64 KiB boundary, must be below 0x01000000"). However; because these "larger than one physical page, with bizarre/annoying additional restrictions" allocations only happen very rarely (e.g. allocated once during boot when driver/s are initialising and never freed or re-allocated) it's a good idea to use two seperate physical memory managers - one designed for all the bizarre/annoying additional restrictions that doesn't need to be fast (e.g. maybe for all physical memory below 0x01000000); plus another physical memory manager (e.g. maybe for all physical memory above 0x01000000) that is incredibly fast because it only ever handles single pages and nothing else.
If you only support one page size, then "free page stack/s" is impossible to beat. If you support different page sizes (e.g. 4 KiB pages and 2 MiB "large pages") it gets a lot more complicated (because you have to be able recombine multiple smaller pages back into a large page, including hunting for the last few smaller pages that are still allocated and replacing them with other small pages so that the whole large page becomes free); and while this might seem "vaguely buddy-like" (because you are splitting larger pages and re-combining) anyone that's willing to deal with the complications will want to custom design something specifically for it.
Cheers,
Brendan
The memory map is different for different computers, and (for one computer) can change if you change the amount of RAM installed or some BIOS options (e.g. how much RAM to steal for integrated video); but if nothing changed the memory map shouldn't have changed either.CRoemheld wrote:In the screenshot in the OP you can see that at address 0x00100000 there's a free memory region. However, right now the length of this area changed, from 0x3FEF0000 to 0x07EE0000. Is this always random, depending on what address the memory hole in the upper memory is showing up?
Given that the memory map provided by Bochs is relatively predictable; I'd assume that you changed the amount of RAM it emulates (in its configuration file) from 1 GiB to 128 MiB.
Yes.CRoemheld wrote:The buddy allocator used in Linux uses a maximum order of 10, meaning the biggest available block would be 2^10 * 4096 Bytes = 4MiB in size. However, The first free memory area in lower memory has a size of 639KiB, and the free area in upper memory is 129920KiB in size. Would I need to split up the upper memory area into smaller blocks of 4 MiB and use multiple lists to keep track of the allocations?
Regardless of how you do physical memory management and regardless of what the memory map looks like; there's always some kind of "for each area that the memory map says is usable RAM { ... }" loop, where you tell the physical memory manager to add the area to its pool of free memory. Note that at some point (either earlier when pre-processing the memory map, or later when initialising the physical memory manager) you'll need to "chop out" any RAM that you are already using (e.g. to store kernel, its stack, information from boot loader, etc); and in some cases this can mean a larger area is divided multiple times (e.g. a large area of RAM might become a free area, a used area, another free area, another used area, ...).CRoemheld wrote:With point 1: if the size of free memory is not constant, how does the allocator change?
Because using a buddy allocator to manage physical RAM isn't a great idea, you probably won't ever find any example/tutorial code that uses a buddy allocator to manage physical RAM.CRoemheld wrote:I am also looking for code examples, because as of now I haven't found a shred of code or even tutorial which uses the buddy allocation system in connection with the free memory areas listed by the multiboot info structs mmap_* fields , but instead simply start allocating pages after the end of the kernel (by defining a variable in the linker script and access it via an external declared variable in C).
Note that (because of multiple different paging tricks - "allocate on write", memory mapped files, swap space, etc); the single most common case (and the case that the physical memory manager should be designed and optimised for) is single page allocations/de-allocations. Typically the only case where you ever need to allocate multiple contiguous physical pages is special buffers for some (older) devices, and these often have bizarre/annoying additional restrictions (e.g. "can't cross a 64 KiB boundary, must be below 0x01000000"). However; because these "larger than one physical page, with bizarre/annoying additional restrictions" allocations only happen very rarely (e.g. allocated once during boot when driver/s are initialising and never freed or re-allocated) it's a good idea to use two seperate physical memory managers - one designed for all the bizarre/annoying additional restrictions that doesn't need to be fast (e.g. maybe for all physical memory below 0x01000000); plus another physical memory manager (e.g. maybe for all physical memory above 0x01000000) that is incredibly fast because it only ever handles single pages and nothing else.
If you only support one page size, then "free page stack/s" is impossible to beat. If you support different page sizes (e.g. 4 KiB pages and 2 MiB "large pages") it gets a lot more complicated (because you have to be able recombine multiple smaller pages back into a large page, including hunting for the last few smaller pages that are still allocated and replacing them with other small pages so that the whole large page becomes free); and while this might seem "vaguely buddy-like" (because you are splitting larger pages and re-combining) anyone that's willing to deal with the complications will want to custom design something specifically for it.
For this; for each area you'd probably start with a "while( ( (next_block_size = address & 0x003FFFFF) != 0) && (remaining_size != 0) ) {" loop to figure out how to break up the start of an area (until "address" is aligned on a 4 MiB boundary); then after "address" is aligned you'd have a "while((remaining_size > 0x00400000) {" loop to handle all the 4 MiB blocks in the middle of the area; then you'd probably have a "while( (next_block_size = remaining_size & 0x003FFFFF) != 0) {" loop to handle any smaller pieces at the end of the area.CRoemheld wrote:So this point is the main objective of my question here: How to I setup a buddy allocator which uses the memory areas listed in the screenshot above and how does it deal with regions in memory, which are much larger than 2^10 * 4096 Bytes?
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Preparations for buddy allocation system
Are you sure that you haven't reconfigured Bochs in the meantime. The hole, wherever it is, should not just eat so much remaining physical space. I am not sure if the layout is always consistent in practice, but it should roughly correspond to the physically (or virtually) available resources.CRoemheld wrote:- In the screenshot in the OP you can see that at address 0x00100000 there's a free memory region. However, right now the length of this area changed, from 0x3FEF0000 to 0x07EE0000. Is this always random, depending on what address the memory hole in the upper memory is showing up?
In this context, lets also clarify that the free memory reported by GRUB (derived from int 15, e820 for BIOS) is not entirely free. This is the memory that a PnP protected mode OS can use. But a PnP protected mode OS could reconfigure the memory mapped IO of the PCI devices to desirable non-conflicting addresses. One can still presume that for backward compatibility, the area between 0x500 and 0xA0000 is free for use by a real-mode boot-loader (edit: minus the EBDA area right below 0xA0000), and since GRUB has already switched you to protected mode, anything below 0xA0000 (minus the EBDA area) is free . One could also hope that the BIOS has tried to configure the MMIO devices either inside the 0xA0000 to 0x100000 range (for legacy VGA) or in spaces that are not reported by 0xe820. To cut the story short, mark the region 0xA0000 to 0x100000 as unavailable in your memory map, and prey that nothing breaks, until you have PnP facilities.
Edit: The modules and structures of GRUB itself are also not reflected in the memory map. So, you have to be careful not to overwrite them while you still need them. How you do this will depend on whether you plan to relocate them somewhere or to keep them around longer.
Even with the standard buddy system you will need multiple lists to keep track of the free blocks. But the idea here is not to limit the maximum block size or use separate allocator per memory range. Let me return to my primitive example.CRoemheld wrote:- The buddy allocator used in Linux uses a maximum order of 10, meaning the biggest available block would be 2^10 * 4096 Bytes = 4MiB in size. However, The first free memory area in lower memory has a size of 639KiB, and the free area in upper memory is 129920KiB in size. Would I need to split up the upper memory area into smaller blocks of 4 MiB and use multiple lists to keep track of the allocations? With point 1: if the size of free memory is not constant, how does the allocator change?
This time suppose that the available memory range is for pages 3-13 (zero based, exclusive at the end, 12 is the last valid page). Let's denote a busy (allocated) buddy block of size X pages as BX, and a free budy block of size X pages as FX. Also, lets assume that your maximum buddy block is 16 pages in size. Essentially, you first need to create a maximally sized buddy block that covers the entire range with good alignment. In this case, we need one F16 block that covers the range 0-16. And then we have to break it up into configuration that covers the gaps with busy blocks. In this case the target configuration is B2,B1,F1,F4,F4,F1,B1,B2. As you can see, the allocator end up still managing the entire 16 pages, but now has marked the gaps on the two sides as busy blocks. If the gap is very big, in the trivial implementation, you could end up with a single B16. To track the state of the entire memory like this, you need one frame record per page frame, whether available or unavailable. If you can tolerate this inefficiency, good. If not, you have to make your page frame database a two-level structure, as I already mentioned in my posts. The first level will describe the memory in larger chunks, say 4MB (if you want to match the allocator's maximum block size), and will denote whether they contain usable memory or not. The second level will be a small per-chunk page frame database, only for chunks that have at least one available free frame (or some other threshold.)
You need to implement special allocation routine, e.g. that is parametrized by starting and ending address or something of the sorts (edit: so that you can direct it where to cut the "gap", allocating exactly that range), rather than just by desired block size. You also can give the unavailable blocks a special state if you like to distinguish them from normally allocated blocks.
P.S.: You could also simplify your life and admit to a little inefficiency by rounding all available memory to 4MB boundaries. For new systems, it is not such a great loss and you won't have to break partial blocks. You will still need to think about handling the large gaps with two-level structure, if you wish to support systems with very fragmented physical address space.
For this particular setup, I am not aware of any other public source aside from Linux. There may be others in the Unixes. But pointing you to the sources of a full blown OS would be a bad joke.CRoemheld wrote:- I am also looking for code examples, because as of now I haven't found a shred of code or even tutorial which uses the buddy allocation system in connection with the free memory areas listed by the multiboot info structs mmap_* fields , but instead simply start allocating pages after the end of the kernel (by defining a variable in the linker script and access it via an external declared variable in C). So this point is the main objective of my question here: How to I setup a buddy allocator which uses the memory areas listed in the screenshot above and how does it deal with regions in memory, which are much larger than 2^10 * 4096 Bytes?
Last edited by simeonz on Sun Jun 17, 2018 6:51 am, edited 6 times in total.
Re: Preparations for buddy allocation system
I forgot to mention that, generally speaking, allocators don't coalesce as much. There is usually exponential decay in the probability of coalescing at any higher level. You will rarely get a broken 4MB block to re-coalesce. In the Linux sources here, you can see that there is another constant, for the reasonable maximum order of auto-coalescing, and it is 3. That is, the kernel does not expect to frequently coalesce to a size of more than 32KB. To deal with this, the drivers try to use primarily small allocations. They also use "movable" allocations whenever they can, which enables the kernel to sometimes shuffle memory around, if in dire need of larger space. Also, the code that uses large allocations has fallback path with discontiguously mapped memory, from an entirely different allocator interface.
This issue is not particular to the buddy allocator. But when used with physical memory in a virtually dominant address space, the lack of coalescing becomes a sensitive shortcoming. Counteracting it requires commitment to specific programming techniques and constraints for both the kernel and driver code. (There are other allocators that mitigate some problems situationally, but I will not drift here.)
This issue is not particular to the buddy allocator. But when used with physical memory in a virtually dominant address space, the lack of coalescing becomes a sensitive shortcoming. Counteracting it requires commitment to specific programming techniques and constraints for both the kernel and driver code. (There are other allocators that mitigate some problems situationally, but I will not drift here.)
Re: Preparations for buddy allocation system
Note to past readers: Had to make a correction regarding the memory layout - the EBDA area below 0xA0000 is obviously not free memory, but it is reported by GRUB (and e820) as reserved already. So, even with that correction, unless you are reconfiguring the PCI BARs, the remark to mark 0xA0000 to 0x100000 as reserved in the provided memory map stays.
Re: Preparations for buddy allocation system
Actually, I did not change the value. As you said, the value for the RAM to be emulated is indeed 1GiB, but it does not change even if I increase it to 2GiB:Brendan wrote:The memory map is different for different computers, and (for one computer) can change if you change the amount of RAM installed or some BIOS options (e.g. how much RAM to steal for integrated video); but if nothing changed the memory map shouldn't have changed either.
Given that the memory map provided by Bochs is relatively predictable; I'd assume that you changed the amount of RAM it emulates (in its configuration file) from 1 GiB to 128 MiB.
Code: Select all
# Use $BXSHARE (/opt/bochs/share/bochs) environment variable for setting path:
romimage: file=$BXSHARE/BIOS-bochs-latest
vgaromimage: file=$BXSHARE/VGABIOS-lgpl-latest
# Set CPU options
cpu: model=broadwell_ult, reset_on_triple_fault=0
# cpu: model=corei7_skylake_x, ips=2600000000, reset_on_triple_fault=0
cpuid: vmx=1
clock: sync=slowdown
# Set memory parameters
megs: 1024
# Set boot option (selected in ata0-slave)
boot: cdrom
# CD-ROM
ata0-slave: type=cdrom, path=build/cr0S-x86_64.iso, status=inserted
# Display
display_library: x, options="gui_debug"
# Debug
magic_break: enabled=1
PS: Don't mind the other CPU model entry there, I was told there exists a skylake cpu model for emulation, but apparently my bochs version does not support it.
Since access to memory areas using the buddy allocator does not seem to be a problem, I guess I'll try to either use multiple buddy allocators or set up the limit of the order, so that e.g. 128 MiB is the largest possible "internal" allocation. I said "internal", since you still could limit the size of the to-be-allocated area returned to e.g. 32KiB or something like that.simeonz wrote:Even with the standard buddy system you will need multiple lists to keep track of the free blocks. But the idea here is not to limit the maximum block size or use separate allocator per memory range.
Re: Preparations for buddy allocation system
Hi,
For Bochs; if you tell it to emulate less than about 3 GiB of RAM you should see a single area of usable RAM that starts at 0x00100000 and ends at an address that is slightly less than the amount of RAM being emulated (because Bochs takes a small amount of RAM at the end and uses it for ACPI areas). The screenshot you showed a few days ago is a good example of this (assuming Bochs is emulating 1 GiB of RAM) - e.g. RAM from 0x00100000 to 0x40000000, but the highest 64 KiB of that RAM used by the firmware for an ACPI reclaimable area.
If you tell Bochs to emulate more than about 3 GiB of RAM, then you'll get an area of usable RAM that probably ends just below 0xC0000000 (probably at 0xBFFF0000 with the same "64 KiB used for ACPI"); then a second area starting at 0x0000000100000000 containing the remaining RAM.
If you change nothing (especially if you don't change the amount of RAM Bochs should emulate) the memory map should always be the same. If it's randomly changing (even when you don't change your code) then it'd almost have to be some kind of timing bug (timing is the only cause of non-determinism, unless Bochs has been configured for deterministic timing). If the memory map changes when you change your code then I'd assume it's a "dodgy pointer" type of problem (e.g. something overwriting BIOS data before you get the memory map, or something corrupting the memory map after you obtain it).
Cheers,
Brendan
Then I suspect you've got a bug somewhere.CRoemheld wrote:Actually, I did not change the value. As you said, the value for the RAM to be emulated is indeed 1GiB, but it does not change even if I increase it to 2GiB:Brendan wrote:The memory map is different for different computers, and (for one computer) can change if you change the amount of RAM installed or some BIOS options (e.g. how much RAM to steal for integrated video); but if nothing changed the memory map shouldn't have changed either.
Given that the memory map provided by Bochs is relatively predictable; I'd assume that you changed the amount of RAM it emulates (in its configuration file) from 1 GiB to 128 MiB.
For Bochs; if you tell it to emulate less than about 3 GiB of RAM you should see a single area of usable RAM that starts at 0x00100000 and ends at an address that is slightly less than the amount of RAM being emulated (because Bochs takes a small amount of RAM at the end and uses it for ACPI areas). The screenshot you showed a few days ago is a good example of this (assuming Bochs is emulating 1 GiB of RAM) - e.g. RAM from 0x00100000 to 0x40000000, but the highest 64 KiB of that RAM used by the firmware for an ACPI reclaimable area.
If you tell Bochs to emulate more than about 3 GiB of RAM, then you'll get an area of usable RAM that probably ends just below 0xC0000000 (probably at 0xBFFF0000 with the same "64 KiB used for ACPI"); then a second area starting at 0x0000000100000000 containing the remaining RAM.
If you change nothing (especially if you don't change the amount of RAM Bochs should emulate) the memory map should always be the same. If it's randomly changing (even when you don't change your code) then it'd almost have to be some kind of timing bug (timing is the only cause of non-determinism, unless Bochs has been configured for deterministic timing). If the memory map changes when you change your code then I'd assume it's a "dodgy pointer" type of problem (e.g. something overwriting BIOS data before you get the memory map, or something corrupting the memory map after you obtain it).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.