Hi,
rdos wrote:I'll use the 4K bitmap approach, with a 4 byte header for each bitmap, as this can be implemented without spinlocks, which is a great advantage.
If there are 20 different ways of implementing something (where some methods use spinlocks and some don't); you have to compare each method to determine the advantages/disadvantages of each. You can't assume that "no spinlocks" is an advantage - it's possible that one of the methods that does use spinlocks is better than all of the methods that don't use spinlocks.
rdos wrote:I think I will use 8MB linear address space for the page bitmaps, which would support 256GB of physical memory.
Is there any reason why you can't determine how many page bitmaps you need during boot? You've said that your OS supports old computers and that it's mostly intended for embedded systems; an old embedded system that only has 16 MiB of RAM would only need (part of) 1 bitmap, and doesn't need to have half of its usable RAM wasted.
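Just as a rough illustration of sizing this at boot (a minimal sketch in C; the constants and the function name are my assumptions, not your actual layout):

/* Minimal sketch: size the bitmap area at boot from the detected amount of RAM
 * instead of reserving a fixed 8 MiB of linear address space. Assumes 4 KiB
 * pages and one 4096-byte bitmap per 32768 pages (128 MiB) of physical memory. */
#include <stdint.h>

#define PAGE_SIZE        4096ull
#define PAGES_PER_BITMAP (PAGE_SIZE * 8)     /* one bit per page -> 32768 pages */

/* Returns how many bitmaps are needed to cover 'highest_phys_byte' bytes of RAM. */
static uint32_t bitmaps_needed(uint64_t highest_phys_byte)
{
    uint64_t pages = (highest_phys_byte + PAGE_SIZE - 1) / PAGE_SIZE;
    return (uint32_t)((pages + PAGES_PER_BITMAP - 1) / PAGES_PER_BITMAP);
}

/* e.g. a 16 MiB embedded system: bitmaps_needed(16ull << 20) == 1, so roughly
 * 4 KiB of bitmap space is enough, rather than 8 MiB. */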
rdos wrote:Switching active block is done by finding the first block with the largest number of free pages. This is done like that because allocation speed is faster the more free pages a bitmap contains. This search algorithm is lockless.
If each CPU were using a different active block, then sooner or later 2 or more CPUs would go looking for the first block with the largest number of free pages at the same/similar time and find the same active block. Once 2 or more CPUs are using the same active block they end up "accidentally synchronised" - that block becomes full and the CPUs that were using it search for a new active block at the same/similar time, and are likely to find the same block as each other again. Once you've got groups of CPUs that happen to be using the same active block, when 2 or more groups of CPUs search for a new active block at the same/similar time the groups combine.
For example, for a system with 16 CPUs that's been running for a while, you might end up with a 50% chance of 4 CPUs using the same active block.
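To make the "accidental synchronisation" concrete, here's a rough sketch (in C, not your actual code; the header layout and names are guesses) of the kind of deterministic search you're describing - any two CPUs that run it against the same header array at the same/similar time get the same answer:

#include <stdint.h>

#define MAX_BITMAPS 2048                  /* 8 MiB of bitmaps / 4 KiB each     */

struct bitmap_hdr {                       /* assumed 4-byte header layout      */
    uint16_t free_pages;
    uint16_t flags;
};

extern struct bitmap_hdr hdrs[MAX_BITMAPS];
extern uint32_t          nr_bitmaps;

/* "Find the first block with the largest number of free pages" (lockless). */
static uint32_t pick_active_block(void)
{
    uint32_t best = 0, best_free = 0;

    for (uint32_t i = 0; i < nr_bitmaps; i++) {
        uint32_t f = hdrs[i].free_pages;  /* unsynchronised read               */
        if (f > best_free) {
            best_free = f;
            best = i;
        }
    }
    return best;                          /* CPUs searching at the same time
                                             return the same index             */
}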
When CPUs share the same active block, one will modify the block (causing the cache line to be invalidated in all the other CPUs' caches) and then modify the bitmap's header (causing another cache line to be invalidated in all the other CPUs' caches). Then the next CPU gets 2 cache misses and invalidates 2 cache lines, then the next, etc. In short, when multiple CPUs happen to be using the same active block (which is relatively likely), you end up with a high chance of many cache misses.
These cache misses would obviously be bad for performance. For a NUMA system (where different CPUs are on different physical chips) it's worse - you end up with a lot of extra traffic across a HyperTransport or QuickPath link as the cache line bounces between the physical chips. Under heavy load (lots of CPUs allocating pages at the same time), this extra traffic can consume a lot of the link's bandwidth and harm the performance of other CPUs that are trying to use that bandwidth for other things. For example, for a 16-CPU NUMA system, if 8 CPUs are trying to allocate a lot of pages at the same time then the remaining 8 CPUs (which aren't allocating any pages) may be starved and (depending on what they're doing) end up with half the performance due to constantly stalling while waiting for data to come over a congested HyperTransport/QuickPath link.
However, if the bitmap headers are only 4 bytes and are packed together (e.g. 16 of these 4-byte headers packed into each 64-byte cache line) then one CPU can modify one bitmap's header and invalidate the cache line that other CPUs are using for different bitmap headers - the CPUs don't even need to be using the same active block. Consider a computer with 2 GiB of RAM - for this you'd never use more than 16 bitmaps, so every CPU may be pounding the same "header cache line" regardless of which active block it uses. Also consider the cache misses caused by searching for a new active block (which reads every header).
Note: This false sharing can be avoided by placing the 4-byte bitmap headers on separate cache lines (e.g. make them "4 bytes with 60 bytes or 124 bytes of unused padding").
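For example (a sketch, assuming 64-byte cache lines and GCC-style attributes):

#include <stdint.h>

#define CACHE_LINE 64                     /* use 128 on CPUs with 128-byte lines */

/* One header per cache line: a CPU updating its own bitmap's header no longer
 * invalidates the line holding other CPUs' headers. Costs 16 times the space,
 * but even for 256 GiB of RAM that's only 2048 * 64 bytes = 128 KiB of headers. */
struct bitmap_hdr {
    uint16_t free_pages;
    uint16_t flags;
    uint8_t  pad[CACHE_LINE - 4];         /* 60 bytes of padding (124 for 128)  */
} __attribute__((aligned(CACHE_LINE)));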
Next: for the pathological worst case, allocating a page may take an infinite amount of time. Imagine that there is no bitmap with more than one free page. A CPU searches its active bitmap and finds that it's full, then finds the block with the largest number of free pages (and finds a block with 1 free page because there's nothing with more); but immediately after this a different CPU allocates the last free page in that bitmap. Your CPU searches its new active bitmap and finds that it's full, then finds a new block with the largest number of free pages (and finds another block with 1 free page); but immediately after this the last page of that bitmap is allocated too. This could happen indefinitely - e.g. for a system with 16 CPUs, 2 of those CPUs might be allocating the last page in blocks and freeing 1 page in previously full blocks; and you could have 14 "very unlucky" CPUs that are all trying (and constantly failing) to allocate a page, even when there are many free pages that could be allocated.
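The reason this can't be bounded is that the allocation path boils down to a retry loop with no forward-progress guarantee - something like this sketch (the names are illustrative only, not your API):

#include <stdint.h>

extern uint32_t get_active_block(void);              /* this CPU's active block */
extern void     set_active_block(uint32_t blk);
extern uint32_t pick_active_block(void);             /* search all the headers  */
extern void    *try_alloc_from(uint32_t blk);        /* scan bitmap, claim page */

void *alloc_page(void)
{
    for (;;) {                            /* nothing bounds this loop           */
        void *page = try_alloc_from(get_active_block());
        if (page)
            return page;

        /* The active block was (or just became) full, so search for a new one.
         * Another CPU can allocate that block's last free page before we try
         * it, sending us around the loop again - potentially forever, even
         * though free pages exist in other blocks. */
        set_active_block(pick_active_block());
    }
}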
rdos wrote:I've also figured out a few optimizations I'll probably implement.
First, I think the search algorithm for a good bitmap should break when it has found a bitmap with over 50% free pages. No sense in searching for bitmaps that are better than that.
This won't reduce the "higher than expected chance of 2 or more CPUs using the same active block" problem (different CPUs searching for a new active block at the same/similar time would still find the same block).
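To see why, compare it with the earlier search sketch - the early exit only shortens the scan, it doesn't make the result CPU-specific (again, the names and threshold are my assumptions):

#include <stdint.h>

#define PAGES_PER_BITMAP 32768u

struct bitmap_hdr { uint16_t free_pages; uint16_t flags; };
extern struct bitmap_hdr hdrs[];
extern uint32_t          nr_bitmaps;

static uint32_t pick_active_block_early_exit(void)
{
    uint32_t best = 0, best_free = 0;

    for (uint32_t i = 0; i < nr_bitmaps; i++) {
        uint32_t f = hdrs[i].free_pages;
        if (f > best_free) {
            best_free = f;
            best = i;
            if (f > PAGES_PER_BITMAP / 2) /* "over 50% free" - stop searching   */
                break;
        }
    }
    return best;                          /* still deterministic: CPUs that
                                             search together still agree        */
}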
rdos wrote:Second, I think there is a very effective optimization that could be done. By providing a cache (which uses a stack) per processor-core that contain a fixed number of entries, the core can quickly reclaim recently freed pages.
If the per-CPU cache is too small, then it won't do much to reduce the "cache thrash when 2 or more CPUs are using the same active block" problem. If the per-CPU cache is too large it will create a new "out-of-memory even though lots of pages are free" problem. Deciding how large the per-CPU cache should be may be a bit like trying to find a value that is greater than 4 and also less than 2 - there might not be any good cache size.
The per-CPU cache also doesn't prevent the "allocating a page may take an infinite amount of time" pathological worst case (a pattern of allocations/de-allocations that causes indefinite starvation still exists).
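For reference, the per-CPU cache I'm picturing from your description is something like this (a sketch; the size and the names are made up, and choosing the size is exactly the hard part):

#include <stdbool.h>
#include <stdint.h>

#define CACHE_SLOTS 32                    /* assumed size - too small barely
                                             helps, too large strands free
                                             pages on idle CPUs                */

struct percpu_page_cache {
    uint32_t count;                       /* number of cached page frames      */
    uint64_t frame[CACHE_SLOTS];          /* physical addresses of free pages  */
};

/* Only ever touched by the owning CPU, so no locking is needed. */
static bool cache_pop(struct percpu_page_cache *c, uint64_t *frame)
{
    if (c->count == 0)
        return false;                     /* empty: fall back to the bitmaps   */
    *frame = c->frame[--c->count];
    return true;
}

static bool cache_push(struct percpu_page_cache *c, uint64_t frame)
{
    if (c->count == CACHE_SLOTS)
        return false;                     /* full: free the page to its bitmap */
    c->frame[c->count++] = frame;
    return true;
}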
Cheers,
Brendan