Hi,
turdus wrote: gerryg400 wrote: A stack uses only 1 pointer. That's it.
ROTFL. Please show me a stack implementation that uses *ONLY* a pointer. No further allocation allowed.
I've written plenty of them.
You need to understand that there are 3 different (but related) quantities:
- the amount of virtual address space used
- the amount of "used" RAM used (RAM that can't be allocated)
- the amount of "free" RAM used (RAM that can be allocated)
For the amount of virtual address space used, a free page stack needs a minimum of one pointer (the physical address of the page on the top of the stack), plus maybe 1 bit for a spinlock (which can be any of the 12 unused bits of the physical address). The amount of "used" RAM is the same (4 or 8 bytes). The amount of "free" RAM used depends on how many pages are actually free, but it never matters how much "free" RAM is used.
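To make the "only one pointer" point concrete, here's a minimal sketch of a free page stack, assuming a kernel that can reach any physical page through a hypothetical phys_to_virt() helper (e.g. a direct physical-memory mapping) and ignoring the spinlock bit for clarity. The allocator's only state is the physical address of the page on top of the stack; each free page stores the physical address of the next free page in its first bytes, so the links live entirely in "free" RAM:

Code:
#include <stdint.h>

#define PAGE_SIZE 4096

extern void *phys_to_virt(uintptr_t phys);   /* assumed helper, not part of the method itself */

static uintptr_t free_page_stack_top = 0;    /* the entire "used" state: one pointer */

void free_page(uintptr_t phys)
{
    /* Write the current top of the stack into the page being freed... */
    *(uintptr_t *)phys_to_virt(phys) = free_page_stack_top;
    /* ...then make the freed page the new top of the stack. */
    free_page_stack_top = phys;
}

uintptr_t alloc_page(void)
{
    uintptr_t phys = free_page_stack_top;
    if (phys != 0) {
        /* Pop: the link stored in the page becomes the new top of the stack. */
        free_page_stack_top = *(uintptr_t *)phys_to_virt(phys);
    }
    return phys;                             /* 0 means "out of physical memory" */
}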
There are variations on the free page stack idea though. One variation is to have a linked list of "index" pages, where each "index page" contains the physical addresses of up to 1023 (for 32-bit addresses) or 511 (for 64-bit addresses) more pages. In this case the amount of virtual address space used is 4 KiB, but the last index page can be allocated when it's empty and therefore doesn't count as "used" RAM (so "used" RAM is effectively zero bytes). The amount of "free" RAM used is the same as before.
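Here's a rough sketch of the index page variation, under the same phys_to_virt() assumption as above and with 64-bit physical addresses. To keep the sketch simple the count of valid entries is kept in a separate variable rather than packed into the index page, and there's no locking. The important property is that when the current index page becomes empty it is itself handed out as the allocated page, so it never counts as "used" RAM:

Code:
#include <stdint.h>

#define ENTRIES_PER_INDEX 511                 /* (4096 - 8) / 8 for 64-bit addresses */

struct index_page {
    uint64_t next;                            /* physical address of the next index page (0 = none) */
    uint64_t entries[ENTRIES_PER_INDEX];      /* physical addresses of free pages */
};

extern void *phys_to_virt(uintptr_t phys);    /* assumed helper */

static uint64_t current_index = 0;            /* physical address of the current index page */
static unsigned current_count = 0;            /* how many entries[] slots are in use */

void free_page_indexed(uint64_t page)
{
    if (current_index == 0 || current_count == ENTRIES_PER_INDEX) {
        /* No room: the freed page itself becomes a new index page at the head of the list. */
        struct index_page *idx = phys_to_virt(page);
        idx->next = current_index;
        current_index = page;
        current_count = 0;
    } else {
        struct index_page *idx = phys_to_virt(current_index);
        idx->entries[current_count++] = page;
    }
}

uint64_t alloc_page_indexed(void)
{
    if (current_index == 0)
        return 0;                             /* nothing free */

    struct index_page *idx = phys_to_virt(current_index);
    if (current_count > 0)
        return idx->entries[--current_count]; /* take a page listed in the index */

    /* The index page is empty: hand the index page itself out as the allocation.
     * Any older index page in the chain was full when this one was created. */
    uint64_t page = current_index;
    current_index = idx->next;
    current_count = (current_index != 0) ? ENTRIES_PER_INDEX : 0;
    return page;
}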
For performance, there are 7 separate things to consider:
- allocating 1 page
- freeing 1 page
- allocating multiple (potentially non-contiguous) pages
- freeing multiple (potentially non-contiguous) pages
- allocating multiple contiguous (potentially with other restrictions - e.g. "must not cross 64 KiB boundary") pages
- freeing multiple contiguous pages
- initialisation/setup time
For most OSs, virtual memory is virtual. For example, when a process allocates 123 MiB of RAM the OS doesn't actually allocate anything. Instead it maps the same "page full of zeros" as read-only for the entire 123 MiB area. When the process attempts to write to that area you get a page fault and only then does the kernel allocate a page. Similar tricks are used for memory mapped files (pages marked as "not present" and fetched if/when accessed), swap space, etc. This means that it's very rare to want to allocate more than 1 page at a time; and also means that it's extremely unlikely that a process' pages will ever be physically contiguous.
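As a simplified sketch of that "allocate on demand" trick, assuming hypothetical get_mapping()/set_mapping() paging helpers, a pre-zeroed shared page at zero_page_phys, and the alloc_page()/phys_to_virt() from the earlier sketch (a real kernel would also check permissions and handle memory mapped files, swap space, out-of-memory, etc.):

Code:
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE      4096
#define FLAG_PRESENT   0x1
#define FLAG_WRITABLE  0x2

extern uintptr_t zero_page_phys;                 /* the shared read-only "page full of zeros" */
extern uintptr_t get_mapping(uintptr_t virt);    /* assumed: returns the physical address a virtual page maps to */
extern void set_mapping(uintptr_t virt, uintptr_t phys, unsigned flags);
extern uintptr_t alloc_page(void);               /* the physical allocator (e.g. a free page stack) */
extern void *phys_to_virt(uintptr_t phys);

void handle_write_fault(uintptr_t fault_addr)
{
    uintptr_t virt = fault_addr & ~(uintptr_t)(PAGE_SIZE - 1);

    if (get_mapping(virt) == zero_page_phys) {
        /* The process finally wrote to an "allocated" area that was still
         * mapped to the shared zero page: only now allocate real RAM. */
        uintptr_t phys = alloc_page();
        memset(phys_to_virt(phys), 0, PAGE_SIZE);        /* the process expects zeros */
        set_mapping(virt, phys, FLAG_PRESENT | FLAG_WRITABLE);
    }
}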
This also means that for allocating 1 page, freeing 1 page, and freeing multiple non-contiguous pages; performance is very important. The other things don't happen often enough to matter.
For free page stacks, allocating 1 page and freeing 1 page is extremely fast. Freeing multiple (potentially non-contiguous) pages is also relatively fast, because you can write all the "next" links into each page before adding them all to the stack itself (which is great for lock contention too). It gets more complex when you're using multiple free page stacks (e.g. one stack for pages with 32-bit physical addresses and another for pages with 64-bit physical addresses), but it's essentially the same (just done once per stack type). The same applies to initialisation/setup time (it's the equivalent of freeing multiple potentially non-contiguous pages).
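Continuing the free page stack sketch from above, freeing a batch of (potentially non-contiguous) pages can pre-link the whole chain before touching the shared stack, so only the final splice needs the lock (lock_stack()/unlock_stack() are hypothetical here, and free_page_stack_top is the pointer from the earlier sketch):

Code:
#include <stddef.h>
#include <stdint.h>

extern void *phys_to_virt(uintptr_t phys);   /* assumed helper, as before */
extern uintptr_t free_page_stack_top;        /* the stack from the earlier sketch */
extern void lock_stack(void);                /* hypothetical spinlock acquire */
extern void unlock_stack(void);              /* hypothetical spinlock release */

void free_pages(const uintptr_t *pages, size_t count)
{
    if (count == 0)
        return;

    /* Chain the pages together without touching the shared stack at all. */
    for (size_t i = 0; i + 1 < count; i++)
        *(uintptr_t *)phys_to_virt(pages[i]) = pages[i + 1];

    /* Splice the whole chain onto the stack in one short critical section. */
    lock_stack();
    *(uintptr_t *)phys_to_virt(pages[count - 1]) = free_page_stack_top;
    free_page_stack_top = pages[0];
    unlock_stack();
}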
Anyway, let's have a look at the various methods and all the characteristics...
Free page stacks
- virtual address space used: 4 or 8 bytes
- "used" RAM used: 4 or 8 bytes
- "free" RAM used: depends on amount of free RAM
- allocating 1 page: extremely fast
- freeing 1 page: extremely fast
- allocating multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page allocations)
- freeing multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page deallocations)
- allocating multiple contiguous pages with restrictions: extremely slow
- freeing multiple contiguous pages: very fast
- initialisation/setup time: average or relatively slow (depending on how and when it's done)
Linked list of index pages
- virtual address space used: 4 KiB
- "used" RAM used: none
- "free" RAM used: depends on amount of free RAM
- allocating 1 page: extremely fast
- freeing 1 page: extremely fast
- allocating multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page allocations)
- freeing multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page deallocations)
- allocating multiple contiguous pages with restrictions: extremely slow
- freeing multiple contiguous pages: very fast
- initialisation/setup time: average or relatively slow (depending on how and when it's done)
Free page bitmap
- virtual address space used: "total RAM / 4 KiB / 8" bytes (32 KiB per GiB of RAM)
- "used" RAM used: "total RAM / 4 KiB / 8" bytes (32 KiB per GiB of RAM)
- "free" RAM used: none
- allocating 1 page: relatively slow (worst case is searching the entire bitmap)
- freeing 1 page: extremely fast (can be done with one "bts" instruction)
- allocating multiple (potentially non-contiguous) pages: relatively slow (worst case is searching the entire bitmap)
- freeing multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page deallocations)
- allocating multiple contiguous pages with restrictions: relatively slow (worst case is searching the entire bitmap)
- freeing multiple contiguous pages: very fast
- initialisation/setup time: average
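For comparison, here's a minimal sketch of the bitmap just described, with bit = 1 meaning "page is free" and sizes hard-coded for 1 GiB of RAM. Freeing is a single bit set (the "bts" mentioned above); allocating has to scan, which in the worst case walks the entire bitmap (__builtin_ctzll() is the GCC/Clang "find lowest set bit" builtin):

Code:
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE    4096
#define TOTAL_PAGES  (1024 * 1024 * 1024 / PAGE_SIZE)      /* 1 GiB worth of 4 KiB pages */

static uint64_t bitmap[TOTAL_PAGES / 64];                  /* 32 KiB per GiB of RAM */

void bitmap_free_page(uintptr_t phys)
{
    size_t page = phys / PAGE_SIZE;
    bitmap[page / 64] |= (uint64_t)1 << (page % 64);       /* mark as free (the "bts") */
}

uintptr_t bitmap_alloc_page(void)
{
    for (size_t i = 0; i < TOTAL_PAGES / 64; i++) {
        if (bitmap[i] != 0) {                              /* at least one free page in this group of 64 */
            int bit = __builtin_ctzll(bitmap[i]);          /* lowest set bit = lowest free page */
            bitmap[i] &= ~((uint64_t)1 << bit);            /* mark as allocated */
            return ((uintptr_t)i * 64 + bit) * PAGE_SIZE;
        }
    }
    return 0;                                              /* no free pages left */
}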
Turdus' "linked list of extents" method, with limited number of extents
- virtual address space used: Don't know - maybe 4 KiB (for one page full of "next + start + length" entries)?
- "used" RAM used: Don't know - maybe 4 KiB (for one page full of "next + start + length" entries)?
- "free" RAM used: none
- allocating 1 page: extremely fast
- freeing 1 page: may not be possible if there's no space left for a new extent ("infinitely slow"?)
- allocating multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page extents)
- freeing multiple (potentially non-contiguous) pages: "infinitely slow" (same as freeing 1 page)
- allocating multiple contiguous pages with restrictions: "infinitely slow" (if you need to allocate the middle pages of an extent to suit restrictions like "must not cross 64 KiB boundary" then the original extent may need to be split into 2 extents, and it may not be possible to create the second extent).
- freeing multiple contiguous pages: "infinitely slow" (same as freeing 1 page)
- initialisation/setup time: extremely fast
Turdus' "linked list of extents" method, with unlimited number of extents
- virtual address space used: have to be able to handle worst case (every second page free or "total pages / 2 * sizeof(entry)") so maybe 1 MiB per GiB
- "used" RAM used: none (assuming "allocation on demand" used for storing extents)
- "free" RAM used: with "allocate on demand" depends on number of extents (worst case is maybe 1 MiB per GiB)
- allocating 1 page: extremely fast
- freeing 1 page: very slow. Worst case is searching the entire linked list and not finding an extent to add the page to
- allocating multiple (potentially non-contiguous) pages: average (worst case is lots of 1 page extents)
- freeing multiple (potentially non-contiguous) pages: extremely slow (worst case is lots of 1 page deallocations)
- allocating multiple contiguous pages with restrictions: slow. Worst case is searching (almost) the entire linked list
- freeing multiple contiguous pages: very slow (worst case is searching the entire linked list and not finding an extent to add the page to)
- initialisation/setup time: extremely fast
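To show where the extent approach hurts, here's a rough sketch of a "next + start + length" extent list (the field names and layout are my guesses, not necessarily turdus' actual format, and there's no locking). Allocating one page just shrinks the first extent; freeing one page has to search for an adjacent extent to grow, and needs a brand new extent when none is found - which is the part that's slow in the unlimited case and may be impossible in the limited case:

Code:
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

struct extent {
    struct extent *next;      /* next extent in the linked list */
    uintptr_t      start;     /* physical address of the first free page in the run */
    size_t         length;    /* number of free pages in the run */
};

static struct extent *extent_list;

/* Allocating one page is cheap: take the first page of the first extent. */
uintptr_t extent_alloc_page(void)
{
    struct extent *e = extent_list;
    if (e == NULL)
        return 0;
    uintptr_t page = e->start;
    e->start += PAGE_SIZE;
    if (--e->length == 0)
        extent_list = e->next;   /* extent fully consumed (its record should be recycled) */
    return page;
}

/* Freeing one page is the expensive part: the worst case walks the whole
 * list without finding an adjacent extent to merge with. */
int extent_free_page(uintptr_t page)
{
    for (struct extent *e = extent_list; e != NULL; e = e->next) {
        if (page + PAGE_SIZE == e->start) {                /* extends this run downwards */
            e->start = page;
            e->length++;
            return 0;
        }
        if (page == e->start + e->length * PAGE_SIZE) {    /* extends this run upwards */
            e->length++;
            return 0;
        }
    }
    /* No adjacent extent: a brand new extent record is needed, which may be
     * impossible if the table of extents is already full. */
    return -1;
}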
For the most important things (allocating 1 page, freeing 1 page, and freeing multiple non-contiguous pages) the page stacks method wins. It also wins for virtual address space used and "used" RAM used. For the less important things (allocating multiple contiguous pages with restrictions and freeing multiple contiguous pages) the bitmap method wins. The only case where the "linked list of extents" method wins is for initialisation/setup time (but the difference is unlikely to matter - despite the faster initialisation/setup time, boot time will probably be slower anyway because of the extra overhead of allocations/deallocations that occur during boot).
On top of everything above, there's the ability to support 3 more (potentially important) features: NUMA optimisations, page colouring, and large page sizes.
For NUMA optimisation, for page stacks and "linked list of index pages" you just have one stack or list per NUMA domain and it's both easy and fast. For "one global bitmap" it makes allocations slower; and for "one bitmap per NUMA domain" it doesn't affect allocation (but virtual space usage and "used" RAM usage increase). For "linked list of extents" you could have one linked list of extents per NUMA domain, which doesn't affect performance (but virtual space usage increases).
For page colouring, for page stacks and "linked list of index pages" you just have one stack or list per page colour and it's both easy and fast. For bitmaps it's also easy and fast to do (e.g. check every nth bit rather than each bit). For a "global linked list of extents" it'd be a disaster (you'd be splitting and joining extents all the time). One "linked list of extents" per page colour would work better (but that will also make it very hard to allocate contiguous pages).
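As a small sketch of the "one stack per page colour" idea (NUM_COLOURS is an assumption here - it depends on cache size, associativity and page size), the colour of a page is just some of the address bits above the page offset, so picking the right stack costs a shift and a modulo:

Code:
#include <stdint.h>

#define PAGE_SHIFT   12
#define NUM_COLOURS  64                               /* assumed; cache size / (page size * associativity) */

extern void *phys_to_virt(uintptr_t phys);            /* assumed helper, as before */

static uintptr_t colour_stack_top[NUM_COLOURS];       /* one free page stack per colour */

static unsigned page_colour(uintptr_t addr)
{
    return (unsigned)((addr >> PAGE_SHIFT) % NUM_COLOURS);
}

void colour_free_page(uintptr_t phys)
{
    unsigned c = page_colour(phys);
    *(uintptr_t *)phys_to_virt(phys) = colour_stack_top[c];
    colour_stack_top[c] = phys;
}

uintptr_t colour_alloc_page(uintptr_t virt)
{
    /* Prefer a physical page whose colour matches the virtual address, so it
     * doesn't compete with its neighbours for the same cache sets. */
    unsigned c = page_colour(virt);
    uintptr_t phys = colour_stack_top[c];
    if (phys != 0)
        colour_stack_top[c] = *(uintptr_t *)phys_to_virt(phys);
    return phys;                                      /* 0 = no page of that colour left */
}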
Then there's the ability to support large page sizes (2 MiB or 4 MiB pages, and 1 GiB pages). The key characteristic here is being able to coalesce multiple smaller pages into a larger page (e.g. multiple free 4 KiB pages converted into one free 4 MiB page, and multiple free 4 MiB pages converted into one free 1 GiB page), because without that you quickly end up with all larger pages broken up with only 4 KiB pages left (which makes large page support pointless after the OS has been running a little while).
It's always been my opinion that for most OSs (where virtual memory is virtual), regardless of how you do memory management, supporting large page sizes is a waste of time because you're still almost always allocating 4 KiB pages; and the disadvantages (e.g. the extra overhead of coalescing free pages, allocated pages or both) outweigh any advantages (e.g. fewer TLB misses). It's because of this that most OSs only use larger pages for the kernel's RAM and memory mapped devices, where you get some of the "fewer TLB misses" advantage without any disadvantages.
However, if I had to support large page sizes; then page stacks and "linked list of index pages" fail completely. Bitmaps alone could work (but would be slow and could easily be improved on). For "linked list of extents", it'd be easy to add support for large page sizes without making the performance any worse.
I'd probably use a bitmap (or a bitmap per NUMA domain) to track 4 KiB pages, with an array of "number of free 4 KiB pages" entries for each 2 MiB page (to make it easier to find free 2 MiB pages, and to find 2 MiB pages that are only missing a few 4 KiB pages). Of course this would mean searching the array and then part of the bitmap when allocating 4 KiB pages, and searching the array when allocating 2 MiB pages; and would also cost 1 bit per 4 KiB page plus 2 bytes per 2 MiB page (or a total of 33 KiB per GiB, for both virtual address space and "used" RAM usage). Part of the reason for this is that it'd be easy to make it work for NUMA optimisations and page colouring too. I haven't thought about it much though - there might be a better way.
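Here's a data structure sketch of that hybrid, with sizes hard-coded for 1 GiB of RAM (the names are mine, not an existing implementation). The small per-2-MiB array is what lets you find a completely free 2 MiB page, or a nearly free one, without scanning the bitmap:

Code:
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE        4096
#define PAGES_PER_2MIB   512
#define TOTAL_PAGES      (1024 * 1024 * 1024 / PAGE_SIZE)   /* 1 GiB worth of 4 KiB pages */
#define TOTAL_2MIB       (TOTAL_PAGES / PAGES_PER_2MIB)

static uint64_t free_bitmap[TOTAL_PAGES / 64];   /* 1 bit per 4 KiB page: 32 KiB per GiB */
static uint16_t free_in_2mib[TOTAL_2MIB];        /* free 4 KiB pages in each 2 MiB page: 1 KiB per GiB */

/* Finding a whole free 2 MiB page only searches the small array, then clears
 * the corresponding 512 bits in one go. */
uintptr_t alloc_2mib_page(void)
{
    for (size_t i = 0; i < TOTAL_2MIB; i++) {
        if (free_in_2mib[i] == PAGES_PER_2MIB) {
            free_in_2mib[i] = 0;
            for (size_t j = 0; j < PAGES_PER_2MIB / 64; j++)
                free_bitmap[i * (PAGES_PER_2MIB / 64) + j] = 0;   /* mark all 512 pages allocated */
            return (uintptr_t)i * PAGES_PER_2MIB * PAGE_SIZE;
        }
    }
    return 0;                                    /* no completely free 2 MiB page */
}

void free_4kib_page(uintptr_t phys)
{
    size_t page = phys / PAGE_SIZE;
    free_bitmap[page / 64] |= (uint64_t)1 << (page % 64);
    free_in_2mib[page / PAGES_PER_2MIB]++;       /* keep the per-2-MiB count in sync */
}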
Cheers,
Brendan