Lock-free bitmap based page frame allocator
Posted: Wed Aug 12, 2015 2:47 pm
by xenos
While I was thinking about how to create a page frame allocator that is thread-safe, lock-free and, most of all, simple and easy to implement, I stumbled across an idea which I'd like to share. Note that this allocator is certainly not the fastest solution, but performance was not the goal of this idea. It's more of a neat trick.
The basic concept of this allocator is a bitmap with one bit for each physical page: the bit is clear if the page is free, and set if the page is allocated. Allocating a page means searching for a clear bit and setting it; freeing a page means clearing its bit. Of course it must be ensured that two concurrent threads (such as threads running on different CPUs at the same time) will not allocate the same page simultaneously, so one can, for example, have a global spin lock for the whole allocator. While one thread is looking for a free page to allocate, all others have to wait. Of course this is terrible: it leads to lock contention, and threads will spend lots of time waiting for the lock.
The idea I had is to have one lock per page, and to simply use its bit in the bitmap as lock. So basically the allocator becomes:
Code:
page alloc(void)
{
    for each page
    {
        if(!test_and_set_bit(page)) // atomically test and set bit; if it was clear, the page was free and is now allocated
            return page;
    }
    error("No free pages.");
}

void free(page)
{
    clear_bit(page);
}
So while this would work in principle and resolves the lock contention problem (there is no waiting / spinning on a lock anymore), it would be terribly slow to test every single bit like that. Normally one would rather check a whole block of pages at once, say 64 pages, by testing whether a block of type int64_t still has any clear bits, or whether it is equal to -1 (all bits are set). So the allocator would look like this:
Code:
page alloc(void)
{
    for each block
    {
        pattern = atomic_exchange(&block, -1); // grab all pages in this block, allocating any free ones
        if(pattern != -1) // free pages found?
        {
            page = find_clear_bit(pattern); // find a free page among those we just took
            set_bit(pattern, page); // set the bit in the pattern - no need to be atomic, since no other thread accesses this (local) variable
            atomic_and(&block, pattern); // atomically free those pages which we allocated, but did not use
            return page;
        }
    }
    error("No free pages.");
}

void free(page)
{
    clear_bit(page);
}
Of course one can think of more efficient approaches, but this was the simplest one I could think of. Certainly I'm not the first to have this idea, and certainly there are people who have implemented it and use it (unless it has flaws I don't see right now). Any comments and ideas are welcome!
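For concreteness, here is a minimal C11 sketch of the block-based version above. The fixed bitmap size, the names (BLOCKS, alloc_page, free_page) and the use of the GCC/Clang __builtin_ctzll builtin are illustrative assumptions, not something from the post:

```c
#include <stdatomic.h>
#include <stdint.h>

#define BLOCKS 4                          /* illustrative: 4 * 64 = 256 pages */
static _Atomic uint64_t bitmap[BLOCKS];   /* bit set = page allocated */

/* Returns a page number, or -1 if no pages are free. */
long alloc_page(void)
{
    for (int b = 0; b < BLOCKS; b++) {
        /* Grab every page in the block at once. */
        uint64_t pattern = atomic_exchange(&bitmap[b], UINT64_MAX);
        if (pattern != UINT64_MAX) {             /* at least one page was free */
            int bit = __builtin_ctzll(~pattern); /* first clear bit (GCC/Clang builtin) */
            pattern |= 1ULL << bit;              /* keep that one page allocated */
            /* Atomically release the pages we grabbed but did not use;
             * concurrent frees (cleared bits) survive the AND. */
            atomic_fetch_and(&bitmap[b], pattern);
            return (long)b * 64 + bit;
        }
    }
    return -1; /* no free pages */
}

void free_page(long page)
{
    atomic_fetch_and(&bitmap[page / 64], ~(1ULL << (page % 64)));
}
```

The atomic_fetch_and at the end is the "free the pages we allocated but did not use" step: any page freed by another thread between our exchange and the AND stays freed, because ANDing can only clear bits.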
Re: Lock-free bitmap based page frame allocator
Posted: Wed Aug 12, 2015 4:03 pm
by bluemoon
This can be replaced by
A suggestion is to somehow "cache" the last search result to speed up the next search, and since the cache is just a hint, it can be localized per CPU.
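One way to sketch that hint, using a per-thread variable as a stand-in for "per CPU" data (NBLOCKS, bits and the __builtin_ctzll builtin are illustrative assumptions):

```c
#include <stdatomic.h>
#include <stdint.h>

#define NBLOCKS 8
static _Atomic uint64_t bits[NBLOCKS];  /* bit set = page allocated */
static _Thread_local int hint;          /* start hint; a stale value is harmless */

/* Search starting at the block where the last allocation succeeded, wrapping. */
long alloc_page_hinted(void)
{
    for (int i = 0; i < NBLOCKS; i++) {
        int b = (hint + i) % NBLOCKS;
        uint64_t old = atomic_load(&bits[b]);
        while (old != UINT64_MAX) {
            int bit = __builtin_ctzll(~old);   /* first clear bit */
            if (atomic_compare_exchange_weak(&bits[b], &old, old | 1ULL << bit)) {
                hint = b;                      /* remember where we found space */
                return (long)b * 64 + bit;
            }
            /* CAS failed: `old` was reloaded, try this block again */
        }
    }
    return -1;
}
```

Because the hint is only a starting point, no synchronization is needed on it; at worst a stale hint costs a longer search.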
Re: Lock-free bitmap based page frame allocator
Posted: Wed Aug 12, 2015 5:19 pm
by turdus
If being atomic is a main concern, I'd use inline assembly:
- BTS (0F AB) bit test and set
- CMPXCHG (0F B1 x) compare and exchange
prefixed by
- LOCK (F0)
I also suggest to read Intel's article on
implementing scalable, multi-core locks.
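For reference, the same effect as LOCK BTS / LOCK BTR can be obtained portably with C11 atomics, which the compiler can lower to those instructions on x86 (a sketch; the function names are illustrative):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically set a bit and report whether it was already set -
 * the portable equivalent of "lock bts" followed by reading CF. */
static inline int test_and_set_bit(_Atomic uint64_t *word, int bit)
{
    uint64_t old = atomic_fetch_or(word, 1ULL << bit);
    return (int)((old >> bit) & 1);
}

/* Atomically clear a bit - the portable equivalent of "lock btr". */
static inline void clear_bit(_Atomic uint64_t *word, int bit)
{
    atomic_fetch_and(word, ~(1ULL << bit));
}
```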
Re: Lock-free bitmap based page frame allocator
Posted: Wed Aug 12, 2015 8:01 pm
by Brendan
Hi,
With inline assembly (assuming 80x86) you can search for a free page without using any LOCK prefix (e.g. starting with a normal "rep scasb"); then (when you've found a bit) do a "lock bts; jc retry" to set the bit (and go back to searching if it was set by another CPU after you found it but before you set it).
For performance; typically you have a pair of variables ("lowest possibly free page" and a "highest possibly free page") to minimise the area you search. You can't atomically do "set bit and update both variables". However, you don't necessarily need to - as long as you update these variables before freeing a page, and update these variables after allocating a page (and as long as you update the variables with a "lock cmpxchg") the worst case is that you end up searching more than you need to.
For allocating physically contiguous pages (e.g. buffers for ISA DMA or something) it can still work; just using a "lock cmpxchg" to set multiple bits in the bitmap at the same time. This is limited by the maximum number of bits you can atomically set at once (either 32 pages for ancient 32-bit CPUs, 64 pages with either CMPXCHG8B or a 64-bit CPU, or 128 pages with CMPXCHG16B). This is fine for some things (e.g. buffers for ISA DMA where they can't be larger anyway) but may not be fine for other things.
For physically contiguous allocations that are "larger than the limit", you can use multiple "lock cmpxchg" instructions, such that if a later one fails you free the previously "tentatively allocated" pages from previous "lock cmpxchg" instructions before going back to searching.
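The single-word case can be sketched with C11's compare-exchange (the function name and the rescan-on-failure policy are illustrative assumptions):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Try to atomically allocate `count` contiguous pages inside one 64-bit
 * bitmap word with compare-and-swap. Returns the starting bit number,
 * or -1 if no run of `count` clear bits exists in the word. */
int alloc_contiguous(_Atomic uint64_t *word, int count)
{
    uint64_t run = (count >= 64) ? UINT64_MAX : ((1ULL << count) - 1);
    uint64_t old = atomic_load(word);
    for (int start = 0; start + count <= 64; start++) {
        uint64_t mask = run << start;
        if ((old & mask) == 0) {
            /* All `count` bits are clear: try to set them in one shot. */
            if (atomic_compare_exchange_strong(word, &old, old | mask))
                return start;
            start = -1; /* CAS failed: `old` was reloaded, rescan from bit 0 */
        }
    }
    return -1;
}
```

Either the whole run is claimed in one atomic operation or nothing is, so there is no "tentatively allocated" state to clean up within a single word; that only becomes necessary for runs spanning multiple words, as described above.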
The real problems are that it's going to be a bit complicated, and if you overlook anything you'll end up with race conditions / "heisenbugs"; and it's very likely that performance won't be any better than using a spinlock when there's no contention, and much worse than using a spinlock when there is contention. Finally, you can't make it "fair", which is worth worrying about when you're expecting (potentially) 200+ CPUs all trying to allocate/free pages frequently.
Cheers,
Brendan
Re: Lock-free bitmap based page frame allocator
Posted: Thu Aug 13, 2015 12:29 am
by xenos
@Brendan: Could you explain why performance would be worse than a spinlock when there is contention? I would expect that several threads waiting for a lock would waste more time than searching at different places of a bitmap without waiting. One could let them start their search at different offsets of the "likely to be free" window.
bluemoon wrote:
This can be replaced by
Not really, because another process may have freed some pages within this block in the meantime, and the free bits in the bitmap would be overwritten by the "allocated" status they had when the bitmap was read.
turdus wrote:If being atomic is a main concern, I'd use inline assembly:
- BTS (0F AB) bit test and set
- CMPXCHG (0F B1 x) compare and exchange
prefixed by
- LOCK (F0)
Yes, this is what I had in mind. I used pseudocode in my example because I did not specify a particular architecture like x86. In principle one could also use something like C++11 atomics and let the compiler optimize them (which of course requires getting the memory ordering right and stuff like that, but that also applies to assembly, one might need memory barriers and so on). I'll also have a look at the article, sounds quite interesting.
Re: Lock-free bitmap based page frame allocator
Posted: Thu Aug 13, 2015 12:38 am
by Brendan
Hi,
XenOS wrote:@Brendan: Could you explain why performance would be worse than a spinlock when there is contention? I would expect that several threads waiting for a lock would waste more time than searching at different places of a bitmap without waiting. One could let them start their search at different offsets of the "likely to be free" window.
It's best to think of things in terms of "cache lines bouncing between CPUs". Every time a lockless algorithm has to retry, that's additional cache lines being transferred that would've been avoided if a lock was used.
I'm not sure where you got "several threads searching at different places of a bitmap" from; or how you'd even begin to coordinate something like that (e.g. how one thread would know how many other threads are searching which areas).
Cheers,
Brendan
Re: Lock-free bitmap based page frame allocator
Posted: Thu Aug 13, 2015 1:05 am
by willedwards
This is exactly the kind of task that Hardware Transactional Memory (HTM) solves.
Intel had problems with the first implementation of Transactional Synchronization Extensions (TSX) on Haswell and had to disable it, so HTM fits in fine on the 'theory' forum.
But it'll be back, and it'll work, and you'll love it!
RISC-style Load-Linked / Store-Conditional (LL/SC) can be viewed as a tiny one-word HTM, and that might be enough for particular tasks like acquiring a page.
Various big metal CPUs have transactional memory already too.
Here's how the Mill's optimistic concurrency works (which seems so much more humane than TSX):
You execute the enterAtomic instruction.
You then make some loads and stores, which are tagged as belonging to the atomic section.
When you are done you execute the exitAtomic instruction and see if it succeeded.
If there were any writes by any other core to any of the memory locations you read from or wrote to, then the whole transaction fails and isn't applied.
(And on the Mill, we're proud to say, there's no 'locking' of cache lines and bouncing of cache lines between cores etc.)
Re: Lock-free bitmap based page frame allocator
Posted: Thu Aug 13, 2015 1:06 am
by xenos
Brendan wrote:It's best to think of things in terms of "cache lines bouncing between CPUs". Every time a lockless algorithm has to retry, that's additional cache lines being transferred that would've been avoided if a lock was used.
Good point, I didn't think about that.
So what about using a traditional bitmap with a spinlock, which can also be used for allocating contiguous blocks of pages, plus having a per-CPU cache (maybe as a stack) of free pages for frequent allocate / free of a single page? (With this cache being filled from / emptied to the bitmap if some watermark is reached). I have seen a similar design in some kernels.
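That hybrid can be sketched like this (a user-space model: a C11 atomic_flag spinlock in place of a kernel spinlock, _Thread_local in place of true per-CPU data; all names and watermarks are illustrative):

```c
#include <stdatomic.h>
#include <stdint.h>

#define PAGES 256
#define CACHE_MAX 16

static uint64_t bitmap[PAGES / 64];               /* bit set = allocated; guarded by the lock */
static atomic_flag bitmap_lock = ATOMIC_FLAG_INIT;

static void lock(void)   { while (atomic_flag_test_and_set(&bitmap_lock)) /* spin */; }
static void unlock(void) { atomic_flag_clear(&bitmap_lock); }

struct cache { long pages[CACHE_MAX]; int count; };
static _Thread_local struct cache cache;          /* stands in for per-CPU data */

/* Pull free pages out of the bitmap into the local cache, up to half full. */
static void refill(void)
{
    lock();
    for (long p = 0; p < PAGES && cache.count < CACHE_MAX / 2; p++)
        if (!(bitmap[p / 64] & (1ULL << (p % 64)))) {
            bitmap[p / 64] |= 1ULL << (p % 64);
            cache.pages[cache.count++] = p;
        }
    unlock();
}

/* Return pages to the bitmap until the cache is back to half full. */
static void spill(void)
{
    lock();
    while (cache.count > CACHE_MAX / 2) {
        long p = cache.pages[--cache.count];
        bitmap[p / 64] &= ~(1ULL << (p % 64));
    }
    unlock();
}

long alloc_page_cached(void)
{
    if (cache.count == 0) refill();
    if (cache.count == 0) return -1;              /* genuinely out of pages */
    return cache.pages[--cache.count];
}

void free_page_cached(long page)
{
    if (cache.count == CACHE_MAX) spill();
    cache.pages[cache.count++] = page;
}
```

The common single-page case never touches the lock; the lock is only taken on the low/high watermark crossings, amortizing the contention over many allocations.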
Brendan wrote:I'm not sure where you got "several threads searching at different places of a bitmap" from; or how you'd even begin to coordinate something like that (e.g. how one thread would know how many other threads are searching which areas).
Actually they don't know what the other ones are doing. My thought was that, assuming threads want to allocate frames at random, uncorrelated times, it is likely that another thread has already started its search and has moved to some random place in the bitmap, while the newly searching thread starts at some predefined place (which is likely to be different from the random one where the other is searching).
Re: Lock-free bitmap based page frame allocator
Posted: Thu Aug 13, 2015 2:04 am
by Combuster
If two cores always start looking from the start, you always get cache line movement - especially since every core has to scan the same fully-allocated leading area and then bit-test roughly the same region. If each core starts (core# / total cores) of the way into the bitmap, each core gets its own area to scan initially and the space doesn't need as much cache sharing between cores.
This does give some issues with maintaining first-free and last-free pointers, since if you don't scan from the start you don't know if you might have missed any leading bits.
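The staggered starting points are a one-liner (function names illustrative; the scan wraps around so every block is still reachable from every core):

```c
/* Each of `ncores` cores starts its scan a fraction (core/ncores) of the
 * way into the bitmap, so cores initially touch disjoint cache lines. */
static int start_block(int core, int ncores, int nblocks)
{
    return (int)((long)core * nblocks / ncores);
}

/* i-th block visited by a given core: start_block, start_block+1, ..., wrapping. */
static int nth_block(int core, int ncores, int nblocks, int i)
{
    return (start_block(core, ncores, nblocks) + i) % nblocks;
}
```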
Re: Lock-free bitmap based page frame allocator
Posted: Thu Aug 13, 2015 2:33 am
by Brendan
Hi,
XenOS wrote:Brendan wrote:It's best to think of things in terms of "cache lines bouncing between CPUs". Every time a lockless algorithm has to retry, that's additional cache lines being transferred that would've been avoided if a lock was used.
Good point, I didn't think about that.
So what about using a traditional bitmap with a spinlock, which can also be used for allocating contiguous blocks of pages, plus having a per-CPU cache (maybe as a stack) of free pages for frequent allocate / free of a single page? (With this cache being filled from / emptied to the bitmap if some watermark is reached). I have seen a similar design in some kernels.
That would be better (for performance); but creates a balancing problem - e.g. 15 CPUs with heaps of spare pages each, but one CPU with no free pages left (and none left in the bitmap), where allocations fail even though there's plenty of free pages.
XenOS wrote:Brendan wrote:I'm not sure where you got "several threads searching at different places of a bitmap" from; or how you'd even begin to coordinate something like that (e.g. how one thread would know how many other threads are searching which areas).
Actually they don't know what the other ones are doing. My thought was that, assuming threads want to allocate frames at random, uncorrelated times, it is likely that another thread has already started its search and has moved to some random place in the bitmap, while the newly searching thread starts at some predefined place (which is likely to be different from the random one where the other is searching).
I'm imagining a scenario that goes like this:
- 2 CPUs both read the same "lowest/highest possibly free pages" variables at the same time, causing that cache line to be transferred to both CPU's caches.
- Both CPUs search the same part of the same bitmap at the same time looking for a free page, causing all those cache lines to be transferred to both CPU's caches.
- Both CPUs find the same free page.
- At this point cache lines for the bitmap and cache lines for the "lowest/highest possibly free pages" variables are all in the "shared" state in both CPU's caches.
- Both CPUs do the "lock bts" part. The first CPU "wins" and forces the cache line into the "modified" state on that CPU (and evicted from the other CPU's cache). The second CPU "loses" and has to fetch the modified cache line again just to know it lost.
- The winner updates the "lowest/highest possibly free pages" variables, forcing that cache line into the "modified" state (and evicting it from the second CPU's cache).
- The second CPU goes back to the start and reads the "lowest/highest possibly free pages" variables again. This can happen before or after the first CPU updates these variables and therefore may or may not cause that cache line to be fetched again.
- The second CPU searches the bitmap again, where the cache lines will probably still be in the "shared" state from last time.
- Eventually (after the second CPU finds and allocates a page) the second CPU has to update the "lowest/highest possibly free pages", which means that if the second CPU was fast enough to avoid fetching it again at the start of retrying then it'll probably have to fetch it now instead.
Now; with 2 CPUs this is fine - the chance of 2 CPUs allocating pages at "close enough to the same time" is tiny, so the cost of the extra "loser has to fetch the modified cache line/s again" is fairly irrelevant. However it's not a linear relationship - twice as many CPUs doesn't just double the probability that this will happen. It's more like an exponential increase in the probability that a loser will have to re-fetch (and search again, etc), combined with the possibility of "one winner, many losers". With 8 or more CPUs you could easily be spending more CPU time on retries than on successful attempts.
Cheers,
Brendan
Re: Lock-free bitmap based page frame allocator
Posted: Thu Aug 13, 2015 2:47 am
by Brendan
Hi,
willedwards wrote:This is exactly the kind of task that Hardware Transactional Memory (HTM) solves.
What HTM solves is the difficulty of getting locks right; but only if you're willing to sacrifice performance.
What HTM does is build "lock free" into the hardware, with all of the same problems caused by wasting CPU time on retries. Back when Intel released HTM there were some patches for the Linux kernel and subsequent research, which found that HTM just made performance worse than the existing locking, and the Linux developers decided not to bother with HTM.
willedwards wrote:Here's how the Mill's optimistic concurrency works (which seems so much more humane than TSX):
You execute the enterAtomic instruction.
You then make some loads and stores, which are tagged as belonging in the atomic section.
When you are done you execute exitAtomic instruction and see if it succeeded.
If there were any writes by any other core to any of the memory locations you read from or wrote to, then the whole transaction fails and isn't applied.
That's virtually exactly the same as all the other HTM implementations: you start a transaction, do a pile of work in the hope that there's a point, then try to commit that work and find out you just wasted a metric buttload of time/power for nothing while also putting unnecessary pressure on shared resources (e.g. RAM/cache bandwidth and any shared execution units if/when SMT is involved).
Note that it's extremely susceptible to live-lock and doesn't even guarantee anything makes any forward progress (2 CPUs can repeatedly cause each other to retry forever).
Cheers,
Brendan
Re: Lock-free bitmap based page frame allocator
Posted: Thu Aug 13, 2015 3:02 am
by xenos
Thanks Brendan, that makes it quite clear. Just one minor question:
Brendan wrote:The second CPU goes back to the start and reads the "lowest/highest possibly free pages" variables again.
Why should the second CPU do that, instead of just checking the next block in the bitmap, until it finds a free one?
Re: Lock-free bitmap based page frame allocator
Posted: Thu Aug 13, 2015 4:23 am
by Brendan
Hi,
XenOS wrote:Thanks Brendan, that makes it quite clear. Just one minor question:
Brendan wrote:The second CPU goes back to the start and reads the "lowest/highest possibly free pages" variables again.
Why should the second CPU do that, instead of just checking the next block in the bitmap, until it finds a free one?
I was thinking you'd start searching from "lowest possibly free" until you found a free page, then you'd do "lowest = found + 1"; but you can't actually do that because a page could've been freed while you were searching, so you'd actually do "if(lowest == original_lowest) { lowest = found + 1; }" (atomically) so that if "lowest" changed while you were searching you don't mess things up. This would've meant you'd want the most up-to-date version of "original_lowest" you could get.
I realise this is wrong and broken now - it only works if a page lower than "original_lowest" is freed while you're searching, and fails when a page between "original_lowest" and "found" is freed while you're searching.
One alternative is to do "if(lowest == found) lowest = found + 1". In this case you don't need to know "original_lowest" at all. This is correct but still broken; because as soon as the free pages become fragmented (e.g. a page with a much, much lower address is freed) you end up with "lowest" containing a low number that doesn't change after that one lowest page is allocated (unless you're very lucky), and you don't avoid very much searching.
The best "correct and not broken" alternative is probably to add a revision number and increment it every time you change the lowest/highest or free a page. More specifically, maybe like this when allocating:
Code:
retry:
old_packed_stuff = packed_stuff;
old_revision = old_packed_stuff >> 16;
old_lowest = old_packed_stuff & 0xFFFF;
// Find page and allocate it here (or "goto retry;" if failed to alloc)
new_revision = (old_revision + 1) & 0xFFFF;
new_lowest = found + 1;
new_packed_stuff = new_lowest << 16 | new_revision;
if(packed_stuff == old_packed_stuff) { packed_stuff = new_packed_stuff; } // Must be done atomically
..and like this when freeing:
Code:
// Free the page
retry:
    old_packed_stuff = packed_stuff;
    old_revision = old_packed_stuff >> 16;
    new_revision = (old_revision + 1) & 0xFFFF;
    lowest = old_packed_stuff & 0xFFFF;
    if(freed_page < lowest) {
        lowest = freed_page;
    }
    new_packed_stuff = new_revision << 16 | lowest; // pack in the same order as the unpacking above
    if(packed_stuff == old_packed_stuff) { packed_stuff = new_packed_stuff; } // Must be done atomically (e.g. "lock cmpxchg")
    else goto retry;
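The free-side update can be written directly with a C11 compare-exchange loop, which does the "must be done atomically" comparison and store in one operation (a sketch of the scheme above; the 16/16 bit split and the name note_page_freed are illustrative):

```c
#include <stdatomic.h>
#include <stdint.h>

/* High 16 bits: revision counter; low 16 bits: "lowest possibly free" page. */
static _Atomic uint32_t packed = 0xFFFF;   /* initially lowest = 0xFFFF (nothing known free) */

/* After freeing `page`, lower the hint if needed and bump the revision,
 * retrying if another CPU changed the packed word in the meantime. */
void note_page_freed(uint16_t page)
{
    uint32_t old = atomic_load(&packed);
    uint32_t val;
    do {
        uint16_t revision = (uint16_t)((old >> 16) + 1);
        uint16_t lowest   = old & 0xFFFF;
        if (page < lowest)
            lowest = page;
        val = (uint32_t)revision << 16 | lowest;
    } while (!atomic_compare_exchange_weak(&packed, &old, val));
    /* on failure, `old` is reloaded automatically and the loop retries */
}
```

The revision in the high bits is what makes the compare meaningful: even if "lowest" happens to wrap back to the same value, a concurrent update still changes the packed word and forces a retry.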
Cheers,
Brendan