Article on physical memory management
Re: Article on physical memory management
Embedded systems and 64GB RAM?
- Member
- Posts: 595
- Joined: Mon Jul 05, 2010 4:15 pm
Re: Article on physical memory management
FlashBurn wrote:Embedded systems and 64GB RAM?
You will see it sooner than you think.
Re: Article on physical memory management
OSwhatever wrote:You will see it sooner than you think
Not before it will work that quickly.
If you have seen bad English in my words, tell me what's wrong, please.
Re: Article on physical memory management
Yeah, but then the CPU will be fast enough to init a 64GB stack in less than a second.
Re: Article on physical memory management
Hi,
gerryg400 wrote:A stack uses only 1 pointer. That's it.
turdus wrote:ROTFL. Please show me a stack implementation that uses *ONLY* a pointer. No further allocation allowed.
I've written plenty of them.
You need to understand that there are 3 different (but related) quantities:
- the amount of virtual address space used
- the amount of "used" RAM used (RAM that can't be allocated)
- the amount of "free" RAM used (RAM that can be allocated)
There are variations on the free page stack idea though. One variation is to have a linked list of "index" pages, where each "index page" contains the physical addresses of up to 1023 (for 32-bit addresses) or 511 (for 64-bit addresses) more pages. In this case the amount of virtual address space used is 4 KiB, but the last index page can be allocated when it's empty and therefore doesn't count as "used" RAM (so "used" RAM is effectively zero bytes). The amount of "free" RAM used is the same as before.
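For illustration, here's a minimal sketch of that variant in C for 32-bit physical addresses (the names and the map_index_page() helper are illustrative assumptions, not from the article). Keeping the count inside the page costs one slot, so this version holds 1022 entries per index page; the 1023 figure assumes the count for the topmost page is kept next to the head pointer instead.

Code:
#include <stdint.h>

#define IDX_ENTRIES 1022

typedef struct index_page {
    uint32_t next_phys;              /* phys addr of next index page, 0 = none */
    uint32_t used;                   /* entries in use in this page            */
    uint32_t entry[IDX_ENTRIES];     /* phys addrs of free pages               */
} index_page_t;

static index_page_t *top;            /* the one mapped index page    */
static uint32_t      top_phys;       /* its physical address         */

/* Assumed helper: (re)map the given physical page at the single
 * virtual address reserved for the index page. */
extern index_page_t *map_index_page(uint32_t phys);

uint32_t alloc_page(void)
{
    if (!top)
        return 0;                            /* out of memory                 */
    if (top->used)
        return top->entry[--top->used];      /* common case: pop one entry    */
    uint32_t page = top_phys;                /* index page empty: hand it out */
    top_phys = top->next_phys;
    top = top_phys ? map_index_page(top_phys) : 0;
    return page;
}

void free_page(uint32_t page)
{
    if (top && top->used < IDX_ENTRIES) {
        top->entry[top->used++] = page;      /* common case: push one entry   */
    } else {                                 /* freed page becomes a new index page */
        index_page_t *p = map_index_page(page);
        p->next_phys = top_phys;
        p->used = 0;
        top = p;
        top_phys = page;
    }
}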
For performance, there are 7 separate things to consider:
- allocating 1 page
- freeing 1 page
- allocating multiple (potentially non-contiguous) pages
- freeing multiple (potentially non-contiguous) pages
- allocating multiple contiguous (potentially with other restrictions - e.g. "must not cross 64 KiB boundary") pages
- freeing multiple contiguous pages
- initialisation/setup time
This also means that for allocating 1 page, freeing 1 page, and freeing multiple non-contiguous pages; performance is very important. The other things don't happen often enough to matter.
For free page stacks, allocating 1 page and freeing 1 page is extremely fast. Freeing multiple (potentially non-contiguous) pages is also relatively fast, because you can write all the "next" links into each page before adding them all to the stack itself (which is great for lock contention too). It gets more complex when you're using multiple free page stacks (e.g. one stack for pages with 32-bit addresses and another for pages with 64-bit physical addresses), but it's essentially the same (just done once per stack type). The same applies to initialisation/setup time (it's the equivalent of freeing multiple potentially non-contiguous pages).
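As a concrete sketch (assuming a hypothetical map_phys() helper that temporarily maps a physical page so its first word can hold the "next" link):

Code:
#include <stdint.h>

static uintptr_t stack_top;                  /* phys addr of top free page, 0 = empty */

extern void *map_phys(uintptr_t phys);       /* assumed temporary-mapping helper */

uintptr_t pop_page(void)                     /* allocate 1 page: O(1) */
{
    uintptr_t page = stack_top;
    if (page)
        stack_top = *(uintptr_t *)map_phys(page);   /* follow the "next" link */
    return page;
}

void push_page(uintptr_t page)               /* free 1 page: O(1) */
{
    *(uintptr_t *)map_phys(page) = stack_top;       /* write the "next" link  */
    stack_top = page;
}

/* Freeing multiple (potentially non-contiguous) pages as described:
 * write all the "next" links first, then splice the whole batch onto
 * the stack with a single update (good for lock contention). */
void push_pages(const uintptr_t *pages, int n)
{
    if (n <= 0)
        return;
    for (int i = 0; i < n - 1; i++)
        *(uintptr_t *)map_phys(pages[i]) = pages[i + 1];
    *(uintptr_t *)map_phys(pages[n - 1]) = stack_top;
    stack_top = pages[0];
}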
Anyway, let's have a look at the various methods with all the characteristics.
Free page stacks
- virtual address space used: 4 or 8 bytes
- "used" RAM used: 4 or 8 bytes
- "free" RAM used: depends on amount of free RAM
- allocating 1 page: extremely fast
- freeing 1 page: extremely fast
- allocating multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page allocations)
- freeing multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page deallocations)
- allocating multiple contiguous pages with restrictions: extremely slow
- freeing multiple contiguous pages: very fast
- initialisation/setup time: average or relatively slow (depending on how and when it's done)
Linked list of index pages
- virtual address space used: 4 KiB
- "used" RAM used: none
- "free" RAM used: depends on amount of free RAM
- allocating 1 page: extremely fast
- freeing 1 page: extremely fast
- allocating multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page allocations)
- freeing multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page deallocations)
- allocating multiple contiguous pages with restrictions: extremely slow
- freeing multiple contiguous pages: very fast
- initialisation/setup time: average or relatively slow (depending on how and when it's done)
- virtual address space used: "total RAM / 4 KiB / 8" bytes (32 KiB per GiB of RAM)
- "used" RAM used: "total RAM / 4 KiB / 8" bytes (32 KiB per GiB of RAM)
- "free" RAM used: none
- allocating 1 page: relatively slow (worst case is searching the entire bitmap)
- freeing 1 page: extremely fast (can be done with one "bts" instruction)
- allocating multiple (potentially non-contiguous) pages: relatively slow (worst case is searching the entire bitmap)
- freeing multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page deallocations)
- allocating multiple contiguous pages with restrictions: relatively slow (worst case is searching the entire bitmap)
- freeing multiple contiguous pages: very fast
- initialisation/setup time: average
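As a rough sketch of the bitmap scheme just listed (illustrative constants; __builtin_ctz is the GCC builtin), with a set bit meaning "free" so that freeing really is a single bit operation like the "bts" mentioned above:

Code:
#include <stdint.h>
#include <stddef.h>

#define TOTAL_PAGES (1024 * 1024)           /* e.g. 4 GiB of 4 KiB pages    */

static uint32_t bitmap[TOTAL_PAGES / 32];   /* 32 KiB per GiB, as listed    */

void bitmap_free(size_t page_num)           /* one bit set, like "bts"      */
{
    bitmap[page_num / 32] |= 1u << (page_num % 32);
}

intptr_t bitmap_alloc(void)                 /* page number, or -1 if none   */
{
    for (size_t i = 0; i < TOTAL_PAGES / 32; i++) {
        if (bitmap[i] == 0)
            continue;                       /* skip fully-used words        */
        int bit = __builtin_ctz(bitmap[i]); /* lowest set (= free) bit      */
        bitmap[i] &= ~(1u << bit);          /* mark it used                 */
        return (intptr_t)(i * 32 + bit);
    }
    return -1;                              /* worst case: searched it all  */
}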
Linked list of extents (fixed space)
- virtual address space used: Don't know - maybe 4 KiB (for one page full of "next + start + length" entries)?
- "used" RAM used: Don't know - maybe 4 KiB (for one page full of "next + start + length" entries)?
- "free" RAM used: none
- allocating 1 page: extremely fast
- freeing 1 page: may not be possible if there's no space left for a new extent ("infinitely slow"?)
- allocating multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page extents)
- freeing multiple (potentially non-contiguous) pages: "infinitely slow" (same as freeing 1 page)
- allocating multiple contiguous pages with restrictions: "infinitely slow" (if you need to allocate the middle pages of an extent to suit restrictions like "must not cross 64 KiB boundary" then the original extent may need to be split into 2 extents, and it may not be possible to create the second extent).
- freeing multiple contiguous pages: "infinitely slow" (same as freeing 1 page)
- initialisation/setup time: extremely fast
Linked list of extents (allocate on demand)
- virtual address space used: have to be able to handle worst case (every second page free or "total pages / 2 * sizeof(entry)") so maybe 1 MiB per GiB
- "used" RAM used: none (assuming "allocation on demand" used for storing extents)
- "free" RAM used: with "allocate on demand" depends on number of extents (worst case is maybe 1 MiB per GiB)
- allocating 1 page: extremely fast
- freeing 1 page: very slow. Worst case is searching the entire linked list and not finding an extent to add the page to
- allocating multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page extents)
- freeing multiple (potentially non-contiguous) pages: extremely slow (worst case is lots of 1 page deallocations)
- allocating multiple contiguous pages with restrictions: slow. Worst case is searching (almost) the entire linked list
- freeing multiple contiguous pages: very slow (worst case is searching the entire linked list and not finding an extent to add the page to)
- initialisation/setup time: extremely fast
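A sketch of what this "allocate on demand" extent list might look like, and why freeing is the slow path (allocation and recycling of the extent records themselves, and merging of adjacent extents, are deliberately omitted):

Code:
#include <stdint.h>
#include <stddef.h>

typedef struct extent {
    struct extent *next;
    uintptr_t start;        /* phys addr of first free page in the run */
    size_t    pages;        /* run length in 4 KiB pages               */
} extent_t;

static extent_t *extents;

uintptr_t extent_alloc_page(void)            /* O(1) */
{
    extent_t *e = extents;
    if (!e)
        return 0;
    uintptr_t page = e->start;
    e->start += 4096;
    if (--e->pages == 0)
        extents = e->next;                   /* record could be recycled here */
    return page;
}

void extent_free_page(uintptr_t page)        /* worst case: full list walk */
{
    for (extent_t *e = extents; e; e = e->next) {
        if (page + 4096 == e->start) {       /* grows the front of a run */
            e->start = page; e->pages++; return;
        }
        if (page == e->start + e->pages * 4096) {   /* grows the back */
            e->pages++; return;
        }
    }
    /* No neighbour found: a new 1-page extent record must be created,
       which is where the "may not be possible" failure mode above lives. */
}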
On top of everything above, there's the ability to support 3 more (potentially important) features: NUMA optimisations, page colouring, and large page sizes.
For NUMA optimisation, for page stacks and "linked list of index pages" you just have one stack or list per NUMA domain, and it's both easy and fast (see the sketch below). For "one global bitmap" it makes allocations slower; and for "one bitmap per NUMA domain" it doesn't affect allocation (but virtual space usage and "used" RAM usage increase). For "linked list of extents" you could have one linked list of extents per NUMA domain, which doesn't affect performance (but virtual space usage increases).
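For example, per-domain stacks could be as simple as this hypothetical sketch (domain_of_cpu() and pop_from() are assumed helpers, the latter as in the free-page-stack sketch earlier):

Code:
#include <stdint.h>

#define MAX_DOMAINS 8

static uintptr_t domain_stack_top[MAX_DOMAINS];  /* one stack head per domain   */

extern int domain_of_cpu(void);                  /* assumed topology helper     */
extern uintptr_t pop_from(uintptr_t *top);       /* pop, as in the stack sketch */

uintptr_t alloc_local_page(void)
{
    uintptr_t page = pop_from(&domain_stack_top[domain_of_cpu()]);
    for (int i = 0; i < MAX_DOMAINS && !page; i++)   /* fall back to any domain */
        page = pop_from(&domain_stack_top[i]);
    return page;
}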
For page colouring, for page stacks and "linked list of index pages" you just have one stack or list per page colour, and it's both easy and fast. For bitmaps it's also easy and fast to do (e.g. check every nth bit rather than every bit). For a "global linked list of extents" it'd be a disaster (you'd be splitting and joining extents all the time). One "linked list of extents" per page colour would work better (but that will also make it very hard to allocate contiguous pages).
Then there's the ability to support large page sizes (2 MiB or 4 MiB pages, and 1 GiB pages). The key characteristic here is being able to coalesce multiple smaller pages into a larger page (e.g. multiple free 4 KiB pages converted into one free 4 MiB page, and multiple free 4 MiB pages converted into one free 1 GiB page), because without that you quickly end up with all larger pages broken up with only 4 KiB pages left (which makes large page support pointless after the OS has been running a little while).
It's always been my opinion that for most OSs (where virtual memory is virtual), regardless of how you do memory management, supporting large page sizes is a waste of time, because you're still almost always allocating 4 KiB pages; and the disadvantages (e.g. the extra overhead of coalescing free pages, allocated pages or both) outweigh any advantages (e.g. fewer TLB misses). It's because of this that most OSs only use larger pages for the kernel's RAM and memory mapped devices, where you get some of the "fewer TLB misses" advantage without any disadvantages.
However, if I had to support large page sizes; then page stacks and "linked list of index pages" fail completely. Bitmaps alone could work (but would be slow and could easily be improved on). For "linked list of extents", it'd be easy to add support for large page sizes without making the performance any worse.
I'd probably use a bitmap (or a bitmap per NUMA domain) to track 4 KiB pages, with an array of "number of free 4 KiB pages" entries for each 2 MiB page (to make it easier to find 2 MiB pages, and find 2 MiB pages that are only missing a few 4 KiB pages). Of course this would mean searching the array and then part of the bitmap when allocating 4 KiB pages, and searching the array when allocating 2 MiB pages; and would also cost 1 bit per 4 KiB page plus 2 bytes per 2 MiB page (or a total of 33 KiB per GiB, for both virtual address space and "used" RAM usage). Of course part of the reason for this is that it'd be easy to make it work for NUMA optimisations and page colouring too. I haven't thought about it much though - there might be a better way.
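A hypothetical sketch of that hybrid (with illustrative constants for 4 GiB of RAM; the 16-bit counter per 2 MiB region is where the "2 bytes per 2 MiB page" cost comes from):

Code:
#include <stdint.h>
#include <stddef.h>

#define PAGES   (1024 * 1024)               /* assume 4 GiB of RAM           */
#define REGIONS (PAGES / 512)               /* number of 2 MiB regions       */

static uint32_t free_bits[PAGES / 32];      /* bit set = 4 KiB page is free  */
static uint16_t free_count[REGIONS];        /* free 4 KiB pages per region   */

intptr_t alloc_2mib(void)                   /* region number, or -1 if none  */
{
    for (size_t r = 0; r < REGIONS; r++) {
        if (free_count[r] != 512)
            continue;                       /* region not entirely free      */
        free_count[r] = 0;                  /* claim all 512 pages at once   */
        for (int w = 0; w < 16; w++)        /* 512 bits = 16 words           */
            free_bits[r * 16 + w] = 0;
        return (intptr_t)r;
    }
    return -1;
}

intptr_t alloc_4kib(void)                   /* page number, or -1 if none    */
{
    /* Two passes: prefer partially-used regions, so fully-free 2 MiB
       regions are preserved for large allocations. */
    for (int pass = 0; pass < 2; pass++) {
        for (size_t r = 0; r < REGIONS; r++) {
            if (free_count[r] == 0 || (pass == 0 && free_count[r] == 512))
                continue;
            for (int w = 0; w < 16; w++) {
                uint32_t bits = free_bits[r * 16 + w];
                if (!bits)
                    continue;
                int b = __builtin_ctz(bits);
                free_bits[r * 16 + w] = bits & ~(1u << b);
                free_count[r]--;
                return (intptr_t)(r * 512 + w * 32 + b);
            }
        }
    }
    return -1;
}

void free_4kib(size_t page)                 /* set the bit, bump the counter */
{
    free_bits[page / 32] |= 1u << (page % 32);
    free_count[page / 512]++;
}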
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Article on physical memory management
If you really want an allocator/deallocator that approaches O(1) for both operations and will support both page sizes, you can use a stack for the 2MB pages and a slab allocator on top of that to keep track of the 4kB pages. You can keep the slab info in one of the pages of the slab (and thus waste 0.2% of each big page that's broken up) or in one of the other memory structures, the way (IIRC) the Linux SLUB allocator does it. Unless you're prepared to run some sort of defragger, though, you'll always have the possibility of running out of big pages.
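A rough sketch of that arrangement (not the actual SLUB layout; pop_2mib() and map_phys() are assumed helpers), with the slab's bookkeeping kept in its own first page:

Code:
#include <stdint.h>
#include <string.h>

#define SLAB_PAGES 512                        /* 4 KiB pages per 2 MiB page   */

typedef struct slab {
    struct slab *next;                        /* list of partially-used slabs */
    uint32_t     used;                        /* pages handed out             */
    uint64_t     bitmap[SLAB_PAGES / 64];     /* bit set = 4 KiB page used    */
    uintptr_t    base;                        /* phys addr of the 2 MiB page  */
} slab_t;

static slab_t *partial;                       /* slabs with free 4 KiB pages  */

extern uintptr_t pop_2mib(void);              /* assumed 2 MiB stack allocator */
extern void     *map_phys(uintptr_t phys);    /* assumed mapping helper        */

uintptr_t alloc_4k(void)                      /* freeing is omitted for brevity */
{
    if (!partial) {                           /* break open a new 2 MiB page  */
        uintptr_t big = pop_2mib();
        if (!big)
            return 0;
        slab_t *s = map_phys(big);            /* metadata lives in page 0     */
        memset(s, 0, sizeof *s);
        s->base = big;
        s->bitmap[0] = 1;                     /* page 0 = the metadata itself */
        s->used = 1;
        partial = s;
    }
    slab_t *s = partial;
    for (int w = 0; w < SLAB_PAGES / 64; w++) {
        if (~s->bitmap[w] == 0)
            continue;                         /* word is fully used           */
        int b = __builtin_ctzll(~s->bitmap[w]);        /* first free page     */
        s->bitmap[w] |= 1ull << b;
        if (++s->used == SLAB_PAGES)
            partial = s->next;                /* slab is now full             */
        return s->base + ((uintptr_t)(w * 64 + b) << 12);
    }
    return 0;                                 /* unreachable if counts match  */
}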
I've not found any reason to allocate big pages yet, other than for static stuff at boot time, so I've not really explored this.
If a trainstation is where trains stop, what is a workstation ?
Re: Article on physical memory management
Well, in theory the whole point of using large pages would be to avoid hammering the PMM with allocation requests when it can be determined ahead of time that a larger area of memory will be needed. Instead of firing off 512 or 1024 page faults you simply get one and fill the request. Given the multimedia nature of today's computing environment, I don't think it's hard to envision a great number of scenarios where an application will predictably and quickly need in excess of 2-4 MB of memory.
That said, any frame allocator which implements multiple page sizes, and the ability to split and coalesce them, will be much slower than a single-size LIFO stack allocator. Even a well designed 2^N system with a predictable worst case, like the one Combuster described, is much more complicated and time-consuming than an allocator that just pushes and pops. So much so that you will end up with slower overall allocation speed if you're just allocating 4k pages most of the time anyway.
Ultimately it'll come down to how applications actually request memory. If you have a scenario where small applications won't frequently request small blocks of memory and large applications will frequently request larger spans of memory, then a mixed-size allocator would probably be best. If, on the other hand, every application is just nibbling away at 4k pages, there's no really compelling reason to avoid a single-size stack allocator.
Reserved for OEM use.
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
Re: Article on physical memory management
However, I still prefer to use doubly linked lists for my physical memory. De-allocation is faster than O(1). I found that I tended to allocate pages 1 (or a few) at a time, but de-allocate in big chunks when processes are killed or a file is closed. If all the pages in a process are linked together, you can return the entire list with 4 pointer assignments.

I can't deduce your implementation based on that description, but one implementation I can see of such a tactic is a lazy-reclaiming one: in each page of free RAM you have the following: a count of the pointers that are potentially present, followed by an array of pointers of that size, which might contain null pointers. The first entry in the array (if it exists) points to a structure of similar format; the others point to free pages potentially containing garbage. You can initialize the system by setting each page to an array of size one and a pointer to the next page, i.e. double the amount of work you would do for a stack:
Code:
start
  |
  v
+------+     +------+     +------+
|  4   |  +->|  2   |  +->|  3   |
+------+  |  +------+  |  +------+
| next |--+  | next |--+  | null |--X
+------+     +------+     +------+
|  p1  |     |  p4  |     | null |
+------+     +------+     +------+
|  p2  |     |  p5  |
+------+     +------+
|  p3  |
+------+
Freeing one page just involves setting n to 1, next to start, and start to the freed page.
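In C, the structure and the two fast paths might look like this (my reading of the description; the map() helper and 32-bit entries are assumptions), including the O(1) page-table reclaim described next:

Code:
#include <stdint.h>

typedef struct lazy_page {
    uint32_t n;             /* number of entries in ptr[] currently in use   */
    uint32_t ptr[1023];     /* ptr[0] = next page; rest = free pages or null */
} lazy_page_t;

static uint32_t start;      /* phys addr of the first lazy page, 0 = none */

extern lazy_page_t *map(uint32_t phys);     /* assumed mapping helper */

void free_one(uint32_t page)    /* "n to 1, next to start, start to the page" */
{
    lazy_page_t *p = map(page);
    p->n = 1;
    p->ptr[0] = start;
    start = page;
}

uint32_t alloc_one(void)
{
    if (!start)
        return 0;                               /* out of memory      */
    lazy_page_t *p = map(start);
    for (uint32_t i = p->n; i > 1; i--) {       /* strip nulls lazily */
        if (p->ptr[i - 1]) {
            p->n = i - 1;                       /* consume entry i-1  */
            return p->ptr[i - 1];
        }
    }
    uint32_t page = start;                      /* only the link is left:   */
    start = p->ptr[0];                          /* hand out the page itself */
    return page;
}

/* O(1) reclaim of a no-longer-needed page table (1024 32-bit entries):
 * entry 0 is overwritten with n and entry 1 with the next link, so the
 * remaining entries become the array. A real version must strip the
 * PTE flag bits first, and free the two overwritten PTEs separately
 * (the "second list" mentioned below). */
void reclaim_pagetable(uint32_t pt_page)
{
    lazy_page_t *p = map(pt_page);
    p->n = 1023;
    p->ptr[0] = start;
    start = pt_page;
}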
The trick comes with block deallocations of virtual memory. If you have a page table with several pages still pointed to from it, you can just reinterpret that page table as a free page stack. The only thing you need to do is make sure that the n and next fields have the proper values. If you do not do other bookkeeping, n will have to be 1023 (32-bit entries) or 511 (64-bit entries), and next should point to the existing stack. Any pointers you have to overwrite this way can be turned into a second list with a known size.
In the likely case, such page tables are filled contiguously where code/data/bss is, whereas the heap may have some fragmentation. If you do not do other bookkeeping, finding the used pages within that table might cost as much as searching that page table completely and freeing the pages one at a time. Instead, that process is now performed by the page request function, which might in the worst case strip all null values where necessary; but in more likely cases there are several allocated pages within a page table, and fewer operations have to be performed overall. In many cases you may even know exactly the n needed to cover all pages used. You might even want to add a signature bit if the page is known to be full, to make batch allocations even faster.
Let's add to Brendan's list for completeness' sake. If you have a better description for your algorithm, you may want to share it.
Pagetables as index tables
- virtual address space used: 4 or 8 bytes
- "used" RAM used: 4 or 8 bytes
- "free" RAM used: depends on amount of free RAM
- allocating 1 page: amortized extremely fast (read 4k of ram in the worst case)
- freeing 1 page: extremely fast
- allocating multiple (potentially non-contiguous) pages: average (amortized lots of 1-page allocations worst case; the real worst case is when there are lots of fragmented pages)
- freeing multiple (potentially non-contiguous) pages: average (lots of 1-page deallocations)
- allocating multiple contiguous pages with restrictions: extremely slow
- freeing multiple contiguous pages: fast
- initialisation/setup time: average or relatively slow (depending on how and when it's done)
2^N extent tree (buddy allocation)
- virtual address space used: 4 or 8 bytes
- "used" RAM used: 4 or 8 bytes
- "free" RAM used: depends on amount of free RAM
- allocating 1 page: average (worst case requires halving the entire address space until a 4k node is returned)
- freeing 1 page: average (worst case requires merging the entire address space)
- allocating multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page extents)
- freeing multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page extents with no positional relation)
- allocating multiple contiguous pages with restrictions: fast (worst case when unaligned allocations are allowed, the actual requested size is very small, and fragmentation is extremely high: the algorithm scales inversely proportional to the requested size)
- freeing multiple contiguous pages: extremely fast
- initialisation/setup time: extremely fast
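For reference, the split/merge steps that produce those worst cases can be sketched like this (a generic 2^N buddy sketch, not necessarily the exact scheme described here; take_block(), put_block() and remove_if_free() are assumed free-list helpers):

Code:
#include <stdint.h>

#define MAX_ORDER 19                      /* order 0 = 4 KiB, 9 = 2 MiB, ...  */

extern uintptr_t take_block(int order);              /* pop a free block      */
extern void put_block(uintptr_t blk, int order);     /* push a free block     */
extern int remove_if_free(uintptr_t blk, int order); /* unlink blk if free    */

uintptr_t buddy_alloc(int order)
{
    for (int o = order; o <= MAX_ORDER; o++) {   /* smallest block that fits  */
        uintptr_t blk = take_block(o);
        if (!blk)
            continue;
        while (o > order) {                      /* halve until it fits       */
            o--;
            put_block(blk + ((uintptr_t)4096 << o), o);  /* free upper half   */
        }
        return blk;
    }
    return 0;                                    /* nothing large enough free */
}

void buddy_free(uintptr_t blk, int order)
{
    while (order < MAX_ORDER) {                  /* merge while the buddy is free */
        uintptr_t buddy = blk ^ ((uintptr_t)4096 << order);
        if (!remove_if_free(buddy, order))
            break;
        blk &= ~((uintptr_t)4096 << order);      /* merged block starts lower */
        order++;
    }
    put_block(blk, order);
}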
NUMA can be covered by keeping a structure per domain for the pagetable indexes, which makes the optimal case less achievable because directory-wide deallocations require knowing whether the NUMA criterion was satisfied. The alternative is to keep two lists per domain: one with guaranteed members and one with likely members. If the guaranteed list is exhausted, the likely list is checked in turn, and then the foreign lists. Any out-of-place items can be dispatched to their correct guaranteed lists, nulls are moved to the end, and uniform lists can likewise be dispatched to the correct guaranteed list. This maintains the same time spent per actual deallocation, but with potentially much higher outliers for individual requests. Coloured pages are even nastier to optimize, and move away from the best-case scenarios even further, to the point where simply using a stack is both faster and more practical.
Where NUMA domains are single large ranges, the extent tree needs no modification at all, as operations can simply add the desired criterion in the descent; otherwise the tree can be separated per domain. Page colouring can easily be added in a node's data as a bitfield (where, for example, bit n set means this node or its subnodes contain pages with colour n).
Actually, my current implementation is a reference-counting tree with the same dimensions as the x86 PAE paging structures, with entries that are either a 32-bit reference count, or a pointer to the next table with the number of unused pages in the bottom bits. The top 8 bits of the pointer (which are architecturally never part of the address) are used as a tag to indicate RAM, and whether the entry is a pointer/counter pair or a usage reference count. It makes deallocations trivial and small allocations decent, as the tree is only recursed if there's free RAM available, and it allows automatic reclaiming of larger pages; and all operations are O(1) when userland wishes to apply a different allocation strategy and just supplies the page locations to map, with the kernel doing little more than refcounting and sanity checks.
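A sketch of what such a tagged 64-bit entry could look like (an illustrative encoding, not the actual one):

Code:
#include <stdint.h>

#define TAG_SHIFT  56
#define TAG_REFCNT 0x01u    /* entry is a 32-bit usage reference count */
#define TAG_TABLE  0x02u    /* entry points at a child table           */

typedef uint64_t tree_entry_t;

static inline uint32_t entry_tag(tree_entry_t e)
{
    return (uint32_t)(e >> TAG_SHIFT);          /* top 8 bits: the tag    */
}

static inline uint32_t entry_refcnt(tree_entry_t e)
{
    return (uint32_t)e;                         /* low 32 bits: the count */
}

static inline uint64_t entry_table(tree_entry_t e)
{
    return e & 0x00FFFFFFFFFFF000ull;           /* strip tag and low bits */
}

static inline uint32_t entry_unused(tree_entry_t e)
{
    return (uint32_t)(e & 0xFFF);   /* unused-page count in the bottom bits
                                       (a 512-entry table needs only 10)   */
}

static inline tree_entry_t make_table_entry(uint64_t phys, uint32_t unused)
{
    return ((tree_entry_t)TAG_TABLE << TAG_SHIFT)
         | (phys & 0x00FFFFFFFFFFF000ull)
         | (unused & 0xFFFu);
}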
Re: Article on physical memory management
Combuster wrote:Ok, it has been shown again that religion is more important than fact, due to the fact it's blatantly obvious that turdus thinks his implementation is flawless and the world disagrees.
Nothing like that. You only point out flaws that have already been thought of. For example, it will definitely not crash when the buffer is full, as many of you thought. I was only arguing against these.
Re: Article on physical memory management
Brendan wrote:
gerryg400 wrote:A stack uses only 1 pointer. That's it.
turdus wrote:ROTFL. Please show me a stack implementation that uses *ONLY* a pointer. No further allocation allowed.
I've written plenty of them.
Dear Brendan,
I think you misunderstand. I was referring to implementing a stack that uses *ONLY* a pointer as data; in other words, please show me a stack implementation that uses only 4 bytes of memory to keep track of pages. I'm pretty sure you agree that a pointer that points to nowhere is quite useless.
Cheers
- Owen
- Member
- Posts: 1700
- Joined: Fri Jun 13, 2008 3:21 pm
- Location: Cambridge, United Kingdom
- Contact:
Re: Article on physical memory management
But it does only use one pointer's worth of memory. All the other data is stored on the free pages. Now, as we have ascertained that these pages are free (After all, if they weren't free, they wouldn't be on the free page stack), any memory used on these free pages does not count.
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
Re: Article on physical memory management
turdus wrote:
Combuster wrote:Ok, it has been shown again that religion is more important than fact
Nothing like that.
(...)
Dear Brendan, I think you misunderstand.
I rest my case.
Re: Article on physical memory management
You can probably reduce the amount of physical memory used by a bitmap to almost zero too.
If a page in the bitmap contains all 1's (i.e. every page that it represents is in use), the physical page behind the bitmap can be removed for use by the OS and replaced by a COW page containing all 1's. When a page is freed in a bitmap page that is COW, the freed page can be used as the physical memory behind the page.
It requires that the backing page for each page in the bitmap come from the group of pages that that section of the bitmap represents. It might be doable but I've never tried it.
If a trainstation is where trains stop, what is a workstation ?
Re: Article on physical memory management
The question should be: why is it a problem that a structure for memory management needs something like 128 KiB (e.g. the bitmap) to manage 4 GiB of RAM?
@turdus
You haven't answered the "flaws" I found in your algo.
Re: Article on physical memory management
Owen wrote:any memory used on these free pages does not count.
As I wrote before, I'd rather use that area as cache, so it does count for me.