Concurrency with Free Page Queues
Posted: Mon May 19, 2014 4:07 pm
by Marionumber1
Right now in my kernel, I'm starting to write the physical memory manager. I've already written code that converts a firmware-provided physical memory map into an array of page structures (the PFN database), with each page containing metadata such as the flags, NUMA domain, and reference count. My idea is to group the page structures into different "lists" of pages with the same flags, like free, used, and bad memory. When allocating a page, you would take the first page off the beginning of the free list and move it to a process's used list. Sounds simple enough, but problems start to appear when attempting to implement the free list system concurrently.
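For concreteness, here's a rough sketch of what one entry in the PFN database and a free list might look like (all names here are placeholders, not a final design):
Code: Select all
/* Rough sketch of one PFN database entry and a per-NUMA-domain
 * free list; all names here are placeholders. */
#include <stdint.h>

typedef struct page {
    struct page *next;        /* link within whichever list this page is on */
    uint32_t     flags;       /* e.g. PAGE_FREE, PAGE_USED, PAGE_BAD */
    uint32_t     numa_domain; /* domain this frame belongs to */
    uint32_t     refcount;    /* reference count */
} page_t;

typedef struct {
    page_t *head;             /* first free page in this domain */
} free_list_t;

extern page_t      pfn_database[];  /* one entry per physical frame */
extern free_list_t free_lists[];    /* one free list per NUMA domain */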
The idea I had for allocation is this. There's one free page list per NUMA domain, and allocations take place from the closest possible one. You start at the head of the free list and attempt to acquire a lock on the page. If acquiring fails, you move on to the next page. If it succeeds, you have to move the page to the process's used page list, set the head of the free page list to the next page, and release the lock before returning the address of the page. However, this could lead to a race condition like so:
-CPU 0 locks page 0
-CPU 1 locks page 1 and sets the head of the free list to page 2
-CPU 0 sets the head of the free list to page 1
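In rough C (using the structures from the sketch above, plus hypothetical try_lock_page()/unlock_page() per-page trylock helpers), the allocation path I described looks something like this, and the unsynchronized head update is exactly where the race comes from:
Code: Select all
/* Uses page_t / free_list_t from the sketch above. */
extern int  try_lock_page(page_t *p);   /* hypothetical per-page trylock */
extern void unlock_page(page_t *p);

page_t *alloc_page_racy(free_list_t *list)
{
    for (page_t *p = list->head; p != NULL; p = p->next) {
        if (try_lock_page(p)) {
            list->head = p->next;   /* RACE: another CPU doing the same
                                       thing can interleave its head
                                       update with ours, exactly as in
                                       the interleaving above */
            /* ...move p onto the process's used list here... */
            unlock_page(p);
            return p;
        }
    }
    return NULL;                    /* no free pages in this domain */
}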
The way I see it, the only way to prevent this is to lock the entire free list during allocations, which would be bad for performance and scalability due to lock contention. So now I'm left with a few options, all of which I'm unsure about:
-Go ahead and lock the entire list. Although this isn't an optimal solution, since page allocations are pretty fast (just take the page off the beginning and modify the head pointer), it may end up working fairly well.
-Create page lists for each CPU. This would work; however, it leads to problems because there's no good way to divide the pages between CPUs. For example, if the pages are divided evenly, it may put certain threads at a disadvantage if only one CPU is doing most of the page allocations. Another problem is the availability of consecutive pages (not needed for most purposes, but required for old DMA). Even if pages are divided evenly in terms of numbers, the number of consecutive pages may differ greatly.
-Find some lockless queue algorithm to use.
I'd be interested to know what people here think is the best option, or if there are any others I haven't considered.
Re: Concurrency with Free Page Queues
Posted: Mon May 19, 2014 4:56 pm
by Brendan
Hi,
Marionumber1 wrote:I'd be interested to know what people here think is the best option, or if there are any others I haven't considered.
You can use lock-less linked lists. Imagine you've got a variable called "top" that contains a pointer/reference for the page on the top of the stack. For "pop" (allocating a page) you'd:
- Atomically read top
- Get the next pointer from the page on the top of the stack
- Use "lock cmpxchg" to change top and retry if this fails
For "push" (freeing a page) you'd:
- Atomically read top
- Set the next pointer in the page you're freeing
- Use "lock cmpxchg" to change top and retry if this fails
Now; I don't like (and mostly don't use) lock-less algorithms. The problem is that instead of lock contention you get repeated retries; and (because it's typically impossible to make it fair) some CPUs may be "lucky" while other CPUs retry repeatedly. Basically (for kernel code where it's easy to prevent task switches while a lock is held) lock-less algorithms are no better than a badly implemented spinlock and are typically inferior to a properly implemented (fair) ticket lock.
For my previous kernel/s; I just have a lock for each free page list. Allocating and freeing pages should be fast, so the lock isn't held for long, so lock contention shouldn't be a large problem. However, I normally support page colouring. This means that for each NUMA domain you might have (e.g.) 256 different free page lists (one for each page colour); and the chance of 2 or more CPUs wanting the same free page list at the same time is significantly reduced.
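For example, picking the colour (and therefore which free page list to lock) for a physical address might look like this (just a sketch, with made-up names):
Code: Select all
#include <stdint.h>

#define PAGE_SHIFT  12
#define NUM_COLOURS 256    /* e.g. derived from cache size / associativity */

typedef struct page { struct page *next; } page_t;
typedef struct { page_t *head; } free_list_t;

/* one array of NUM_COLOURS lists per NUMA domain; made-up layout */
extern free_list_t free_lists[/* num_domains */ 4][NUM_COLOURS];

static inline unsigned page_colour(uint64_t phys_addr)
{
    return (unsigned)((phys_addr >> PAGE_SHIFT) % NUM_COLOURS);
}

/* the list to lock for a given frame */
static inline free_list_t *list_for(unsigned domain, uint64_t phys_addr)
{
    return &free_lists[domain][page_colour(phys_addr)];
}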
Cheers,
Brendan
Re: Concurrency with Free Page Queues
Posted: Mon May 19, 2014 5:02 pm
by singerng
After some Googling, I found http://www.cs.rochester.edu/u/scott/pap ... queues.pdf, which is a paper about concurrent queues. It seems like the first alternative they offer (the "non-blocking" alternative) is the right algorithm for this application. However, I don't understand how the pseudocode they give would translate into code for use in your kernel, specifically which operations have to be atomic. Here is the pseudocode (CAS is their equivalent of cmpxchg), with a couple of modifications:
Code: Select all
structure pointer_t {ptr: pointer to node_t, count: unsigned integer}
structure node_t {value: data type, next: pointer_t}
structure queue_t {Head: pointer_t, Tail: pointer_t}

func initialize(Q: pointer to queue_t)
    node = new node()                   # Allocate a free node
    node->next.ptr = NULL               # Make it the only node in the linked list
    Q->Head.ptr = Q->Tail.ptr = node    # Both Head and Tail point to it
endfunc

func enqueue(Q: pointer to queue_t, value: data type)
    node = new node()                   # Allocate a new node from the free list
    node->value = value                 # Copy enqueued value into node
    node->next.ptr = NULL               # Set next pointer of node to NULL
    loop                                # Keep trying until Enqueue is done
        tail = Q->Tail                  # Read Tail.ptr and Tail.count together
        next = tail.ptr->next           # Read next ptr and count fields together
        if tail == Q->Tail              # Are tail and next consistent?
            if next.ptr == NULL         # Was Tail pointing to the last node?
                if CAS(&tail.ptr->next, next, <node, next.count+1>)  # Try to link node at the end of the linked list
                    break               # Enqueue is done. Exit loop
                endif
            else                        # Tail was not pointing to the last node
                CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)        # Try to swing Tail to the next node
            endif
        endif
    endloop
    CAS(&Q->Tail, tail, <node, tail.count+1>)   # Enqueue is done. Try to swing Tail to the inserted node
endfunc

func dequeue(Q: pointer to queue_t, pvalue: pointer to data type): boolean
    loop                                # Keep trying until Dequeue is done
        head = Q->Head                  # Read Head
        tail = Q->Tail                  # Read Tail
        next = head.ptr->next           # Read Head.ptr->next
        if head == Q->Head              # Are head, tail, and next consistent?
            if head.ptr == tail.ptr     # Is queue empty or Tail falling behind?
                if next.ptr == NULL     # Is queue empty?
                    return FALSE        # Queue is empty, couldn't dequeue
                endif
                CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)   # Tail is falling behind. Try to advance it
            else                        # No need to deal with Tail
                # Read value before CAS, otherwise another dequeue might free the next node
                *pvalue = next.ptr->value
                if CAS(&Q->Head, head, <next.ptr, head.count+1>)    # Try to swing Head to the next node
                    break               # Dequeue is done. Exit loop
                endif
            endif
        endif
    endloop
    free(head.ptr)                      # It is safe now to free the old dummy node
    return TRUE                         # Queue was not empty, dequeue succeeded
endfunc
Do you think this solution would work? If so, which operations would have to be atomic?
Re: Concurrency with Free Page Queues
Posted: Mon May 19, 2014 5:33 pm
by Marionumber1
Brendan wrote:
For my previous kernel/s; I just have a lock for each free page list. Allocating and freeing pages should be fast, so the lock isn't held for long, so lock contention shouldn't be a large problem. However, I normally support page colouring. This means that for each NUMA domain you might have (e.g.) 256 different free page lists (one for each page colour); and the chance of 2 or more CPUs wanting the same free page list at the same time is significantly reduced.
Thanks for the response. As I expected, there isn't much overhead in removing a page from the free list, so locking it should work fine. Cache coloring is also something I was thinking about, but I haven't made any serious plans to add it yet. However, locking the entire list may be less efficient when allocating consecutive pages (for ISA and legacy PCI DMA). Allocating consecutive pages would involve iterating through the entire list (which would be sorted, by the way) until a certain number of consecutive pages have been found, then returning all of them. Something like this could take longer (and would have negative effects if all other CPUs were locked out of the list). Also, it's very difficult to implement alongside cache coloring. I'd like to hear your thoughts on this.
Re: Concurrency with Free Page Queues
Posted: Mon May 19, 2014 6:10 pm
by Brendan
Hi,
Marionumber1 wrote:However, locking the entire list may be less efficient when allocating consecutive pages (for ISA and legacy PCI DMA). Allocating consecutive pages would involve iterating through the entire list (which would be sorted, by the way) until a certain number of consecutive pages have been found, then returning all of them.
If you need to allocate consecutive pages, then you shouldn't be using free page lists at all. Fortunately, you almost never need to allocate consecutive pages to begin with.
For the rare cases where consecutive physical pages are needed (buffers used by some device drivers, and nothing else) it's enough to set aside a special area of RAM with its own physical memory management. I typically use a bitmap for the first 16 MiB of RAM for this reason (and use faster free page stacks/lists for all other physical memory management).
Note that old devices that use the legacy/ISA DMA controllers (e.g. floppy controller) have additional constraints - not only are consecutive physical pages needed, but the allocated RAM can't cross a 64 KiB boundary and must be within the first 16 MiB of the physical address space.
The other thing to be aware of is that some PCI devices can only handle 32-bit addresses. With this in mind, you could split physical memory into 3 pools:
- RAM above 4 GiB = managed with one set of free pages stacks that never need to care about consecutive physical pages
- RAM from 16 MiB to 4 GiB = managed with another/separate set of free pages stacks that never need to care about consecutive physical pages
- RAM in the first 16 MiB = managed via a bitmap or something (for any/all cases when consecutive physical pages are needed)
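As a sketch (simplified, with made-up names), the bitmap pool for the first 16 MiB might look something like this - the 64 KiB boundary check is only needed for the legacy/ISA DMA case above, and page 0 is assumed permanently reserved so that 0 can mean failure:
Code: Select all
#include <stdint.h>
#include <stddef.h>

#define LOW_POOL_PAGES (16u * 1024 * 1024 / 4096)     /* 4096 pages */

static uint8_t low_bitmap[LOW_POOL_PAGES / 8];        /* 1 bit per page, 1 = used */

static int page_used(size_t i)
{
    return low_bitmap[i / 8] & (1u << (i % 8));
}

static void set_used(size_t i)
{
    low_bitmap[i / 8] |= (1u << (i % 8));
}

/* Find 'count' consecutive free pages below 16 MiB; if 'isa_dma' is
 * set, also refuse runs that cross a 64 KiB boundary (16 pages).
 * Returns the physical address, or 0 on failure. Caller holds this
 * pool's lock - the pool is rarely used, so one lock is fine. */
uint32_t alloc_low_pages(size_t count, int isa_dma)
{
    size_t run = 0;
    for (size_t i = 0; i < LOW_POOL_PAGES; i++) {
        if (page_used(i)) { run = 0; continue; }
        if (run > 0 && isa_dma && (i % 16) == 0)
            run = 0;                  /* run would cross a 64 KiB boundary */
        run++;
        if (run == count) {
            size_t first = i - count + 1;
            for (size_t j = first; j <= i; j++)
                set_used(j);
            return (uint32_t)(first * 4096);
        }
    }
    return 0;
}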
Cheers,
Brendan
Re: Concurrency with Free Page Queues
Posted: Mon May 19, 2014 6:27 pm
by Marionumber1
Thanks for all that, Brendan. I was actually already planning to divide the free lists into under-4 GiB and over-4 GiB areas, but I'll also add in the bitmap for the first 16 MiB. With that, I'm almost ready to start implementing my free list code. However, there is one other thing I have to decide beforehand, which isn't strictly related to this thread, but I think it's wasteful to make another thread just for it. Right before jumping to my kernel, my bootloader allocates per-CPU and per-NUMA-domain data structures, which will contain things such as the GDT, TSS, IDT, and scheduling queues (per CPU) and free lists (per NUMA domain). These structures will probably be indexed by Local APIC IDs (on x86). However, early in the boot process, before the Local APICs are initialized (and before there's any way to map them into the address space), how exactly do I get the current CPU and NUMA domain number?
Re: Concurrency with Free Page Queues
Posted: Mon May 19, 2014 7:21 pm
by Brendan
Hi,
Marionumber1 wrote:Right before jumping to my kernel, my bootloader allocates per-CPU and per-NUMA-domain data structures, which will contain things such as the GDT, TSS, IDT, and scheduling queues (per CPU) and free lists (per NUMA domain). These structures will probably be indexed by Local APIC IDs (on x86).
I wouldn't assume that Local APIC IDs are numbered consecutively - e.g. for a system with 4 CPUs, don't be too surprised if the IDs are 0, 1, 16 and 17. For this example, using the APIC ID as an index into a table means you'd need a table with 18 entries where 14 of the entries are never used.
What I would do is prepare for x2APIC support, where Local APIC IDs are 32-bit. In this case (for a silly example) a system with 4 CPUs might use the IDs 0x00000000, 0x00000001, 0xF0000000 and 0xF0000001; and you might need a table with about 4 billion entries where all but 4 are never used.
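One way around the sparseness (not the only one) is to never index by APIC ID directly; instead, record each CPU's APIC ID during boot in enumeration order and search for the dense CPU index. A sketch with made-up names:
Code: Select all
#include <stdint.h>
#include <stddef.h>

#define MAX_CPUS 256

static uint32_t apic_id_of_cpu[MAX_CPUS];  /* filled from the ACPI MADT */
static size_t   num_cpus;

/* Map an APIC ID to a dense CPU index (or -1 if unknown). A linear
 * search is fine for small systems; use a hash or sorted array for
 * huge CPU counts. */
int cpu_index_from_apic_id(uint32_t apic_id)
{
    for (size_t i = 0; i < num_cpus; i++)
        if (apic_id_of_cpu[i] == apic_id)
            return (int)i;
    return -1;
}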
Marionumber1 wrote:However, early in the boot process, before the Local APICs are initialized (and before there's any way to map them into the address space), how exactly do I get the current CPU and NUMA domain number?
See this wiki page.
Cheers,
Brendan
Re: Concurrency with Free Page Queues
Posted: Mon May 19, 2014 7:41 pm
by Marionumber1
Thanks for the link, I'll take that into account, but it still doesn't solve my problem of how to dynamically get the current CPU. Also, this seems to be straddling the line between a theory and implementation question.
Re: Concurrency with Free Page Queues
Posted: Mon May 19, 2014 9:55 pm
by Brendan
Hi,
Marionumber1 wrote:
Thanks for the link, I'll take that into account, but it still doesn't solve my problem of how to dynamically get the current CPU. Also, this seems to be straddling the line between a theory and implementation question.
If you mean "get the current CPU's local APIC ID", then the easiest way is to determine the base address of the local APIC (using the ACPI "APIC" table or Multi-Processor Specification table) and read it from the local APIC's "Local APIC ID Register".
Note: You may have to switch to protected mode (possibly temporarily) to do that, but that's something you must do anyway.
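For the xAPIC case, once you know the base address the read itself is trivial (a sketch; 0xFEE00000 is only the architectural default, so use the address from the ACPI/MP tables):
Code: Select all
#include <stdint.h>

#define LAPIC_DEFAULT_BASE 0xFEE00000u   /* architectural default only */
#define LAPIC_ID_REG       0x20u         /* "Local APIC ID Register" offset */

/* Read this CPU's Local APIC ID; for xAPIC the ID is in bits 24..31
 * of the memory-mapped register at base + 0x20. */
static inline uint8_t lapic_id(uintptr_t lapic_base)
{
    volatile uint32_t *reg =
        (volatile uint32_t *)(lapic_base + LAPIC_ID_REG);
    return (uint8_t)(*reg >> 24);
}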
Cheers,
Brendan
Re: Concurrency with Free Page Queues
Posted: Wed Jul 09, 2014 5:39 am
by jbemmel
Brendan wrote:
Now; I don't like (and mostly don't use) lock-less algorithms. The problem is that instead of lock contention you get repeated retries; and (because it's typically impossible to make it fair) some CPUs may be "lucky" while other CPUs retry repeatedly. Basically (for kernel code where it's easy to prevent task switches while a lock is held) lock-less algorithms are no better than a badly implemented spinlock and are typically inferior to a properly implemented (fair) ticket lock.
Lock-less algorithms are fundamentally different from lock-based ones, because they allow progress by any thread. If a cmpxchg fails, a retry by the same thread ( i.e. re-read, modify, write again using cmpxchg ) typically succeeds, whereas a spinlock keeps the thread waiting until the lock is released by whichever other thread is holding it ( i.e. the other thread must perform some operations first ). In case the other thread crashes or hangs indefinitely ( e.g. due to buggy hardware ), a spinlock may remain locked forever - unlike a lock-less algorithm.
Lock-less algorithms aren't a panacea and they are tricky to get right, but I wouldn't say they are "no better than a badly implemented spinlock".
Re: Concurrency with Free Page Queues
Posted: Wed Jul 09, 2014 7:46 am
by Brendan
Hi,
jbemmel wrote:Brendan wrote:Now; I don't like (and mostly don't use) lock-less algorithms. The problem is that instead of lock contention you get repeated retries; and (because it's typically impossible to make it fair) some CPUs may be "lucky" while other CPUs retry repeatedly. Basically (for kernel code where it's easy to prevent task switches while a lock is held) lock-less algorithms are no better than a badly implemented spinlock and are typically inferior to a properly implemented (fair) ticket lock.
Lock-less algorithms are fundamentally different from lock-based ones, because they allow progress by any thread.
Wrong. Block-less algorithms guarantee that all threads make progress (and tend to be superior to both spinlocks and lock-less algorithms because of it). Lock-less algorithms are not block-less algorithms and (just like spinlocks) only guarantee that one thread can make progress.
Note: Some people call this "wait free" instead of "block-less" - same thing.
Of course block-less algorithms are significantly more difficult to design than lock-less algorithms; and it's simply not possible to do a lot of things with block-less algorithms that can be done with lock-less algorithms or spinlocks.
jbemmel wrote:If a cmpxchg fails, a retry by the same thread ( i.e. re-read, modify, write again using cmpxchg ) typically succeeds whereas a spinlock keeps the thread waiting until the lock is released by whichever other thread is holding it ( i.e. other thread must perform some operations first ). In case the other thread crashes or hangs indefinitely ( e.g. due to buggy hardware ), a spinlock may remain forever locked - unlike a lock-less algorithm.
Lock-less algorithms aren't a panacea and they are tricky to get right, but I wouldn't say they are "no better than a badly implemented spinlock"
Under contention; do you honestly think that "repeatedly retrying (to commit work done)" is better than "repeatedly retrying (to acquire a lock)"? Do you understand what "starvation" and "fairness" are? Do you think it's easy to design "fair lock-less algorithms" that limit worst case starvation?
Note that I'm specifically talking about kernel code here; where it is possible to disable/postpone task switches, the maximum number of "things" competing for a lock is determined by the number of CPUs and not the number of tasks/threads, and the length of the critical section is typically very short. For user-space code (where you can't disable/postpone task switches) locks (including mutexes and spinlocks) do have a major problem (e.g. you acquire the lock, then the kernel decides it's time for a task switch and does other things for ages before switching back to the task that holds the lock) and this is where lock-free is actually useful.
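For what I mean by "disable/postpone task switches"; in kernel code you can simply keep local IRQs disabled for the (very short) critical section. A sketch (x86-64, GCC inline asm, made-up names):
Code: Select all
#include <stdint.h>

/* usage: f = irq_save(); spin_lock(&l); ...; spin_unlock(&l); irq_restore(f); */

static inline uint64_t irq_save(void)
{
    uint64_t flags;
    __asm__ volatile("pushfq; popq %0; cli" : "=r"(flags) : : "memory");
    return flags;                       /* old RFLAGS (incl. IF) */
}

static inline void irq_restore(uint64_t flags)
{
    __asm__ volatile("pushq %0; popfq" : : "r"(flags) : "memory", "cc");
}

static inline void spin_lock(volatile int *lock)
{
    while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE))
        __asm__ volatile("pause");      /* be polite to the sibling CPU */
}

static inline void spin_unlock(volatile int *lock)
{
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}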
Cheers,
Brendan
Re: Concurrency with Free Page Queues
Posted: Wed Jul 09, 2014 8:19 am
by jbemmel
See e.g. http://newsgroups.derkeiler.com/Archive ... 00009.html
Example assembly code from http://stackoverflow.com/questions/1195 ... y-spinlock:
Code: Select all
__asm{
spin_lock:
    xorl %ecx, %ecx
    incl %ecx
spin_lock_retry:
    xorl %eax, %eax
    lock; cmpxchgl %ecx, (lock_addr)
    jnz spin_lock_retry
    ret

spin_unlock:
    movl $0, (lock_addr)
    ret
}
The issue is that the cmpxchg will loop until some other thread releases the lock by calling spin_unlock. In algorithms that do not use spinlocks ( which is what I meant by "lock-less", not a common term in computer theory ) one would typically use cmpxchg in the following way:
Code: Select all
retry:
    var h = list->head;
    if (h) {
        var n = h->next;
        if (!cmpxchg( &list->head, h, n )) goto retry;
    }
    return h;
Here it may happen that some other thread changes list->head after it is read, causing the cmpxchg to fail ( disregarding ABA issues ). However, the thread can retry and succeed on the 2nd iteration, independent of any other thread's execution.
Re: Concurrency with Free Page Queues
Posted: Wed Jul 09, 2014 1:42 pm
by alexfru
Brendan wrote:
For the rare cases where consecutive physical pages are needed (buffers used by some device drivers, and nothing else) ...
I remember Intel starting to require the TSS to be physically contiguous as of the first Pentiums. It looks like they introduced a CPU bug (unless it was a documentation bug that existed all along since 1986) and decided to document it instead of fixing it. I wasn't able to trigger the bug back in 2000. It's possible that it had been fixed by then. It's also possible that the condition was more complex than just having a TSS whose parts are located in multiple pages that are far away from each other. Intel being Intel said nothing of value when I asked. Go figure.
Re: Concurrency with Free Page Queues
Posted: Wed Jul 09, 2014 2:30 pm
by Brendan
Hi,
jbemmel wrote:The issue is that the cmpxchg will loop until some other thread releases the lock by calling spin_unlock. In algorithms that do not use spinlocks ( which is what I meant by "lock-less", not a common term in computer theory ) one would typically use cmpxchg in the following way:
Code: Select all
retry:
    var h = list->head;
    if (h) {
        var n = h->next;
        if (!cmpxchg( &list->head, h, n )) goto retry;
    }
    return h;
Here it may happen that some other thread changes list->head after it is read, causing the cmpxchg to fail ( disregarding ABA issues ). However, the thread can retry and succeed on the 2nd iteration, independent of any other thread's execution.
No. Imagine a typical 80x86 "4-core with hyper-threading" system. The first problem is that (e.g.) CPU #0 can read list->head, do some work, then fail to commit the work (the cmpxchg) because CPU #1 modified list->head; then CPU #0 retries and wastes more time doing work and fails to commit it because CPU #2 modified list->head; then wastes more time and fails to commit it because another CPU modified list->head again; and so on. For "worst case" there is no limit on how many times this can happen, and therefore no guarantee that CPU #0 will ever make progress.
The other problem is that, for hyper-threading, a core's resources are shared and work done by one logical CPU affects the performance of the other logical CPU in that core; regardless of whether the work done is successfully committed or discarded/retried. Basically, lock-less algorithms mean that other CPUs get less work done for no reason. Of course this problem isn't limited to the core's shared resources, but affects all shared resources that CPUs compete for (e.g. bus bandwidth, RAM bandwidth, cache slots, etc), which means that it can also affect systems that don't use hyper-threading.
The solution to the first problem is "fairness"; which is typically done using something "loosely derived from" Lamport's bakery algorithm, where some form of counter/ticket is used to limit the worst case. For example, with our typical 80x86 "4-core with hyper-threading" system, with a ticket lock you can guarantee that CPU #0 will make progress after waiting for a maximum of 7 other CPUs to have their turn, and therefore guarantee that CPU #0 will make progress eventually.
The only solution to the second problem is not wasting time doing work unless you know it can be committed.
Now; your spinlock example code and your lock-free example code both suffer from both of these problems. For the second problem they should both be using the "pause" instruction to minimise "core resources wastage", where the spinlock would minimise wastage more. For the first problem, it'd be relatively easy to upgrade the spinlock and make it a "ticket lock"; but for the lock-free version you're basically screwed - to improve it you need to use a lock.
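For reference, a minimal "ticket lock" sketch (GCC builtins, made-up names) showing both the fairness fix and where "pause" goes:
Code: Select all
#include <stdint.h>

/* Each arriving CPU takes a ticket, then spins (with pause) until
 * the serving counter reaches it; waiters are served in FIFO order,
 * so the worst case wait is bounded by the number of CPUs. */
typedef struct {
    volatile uint32_t next_ticket;  /* incremented by each arriving CPU */
    volatile uint32_t now_serving;  /* incremented on unlock */
} ticket_lock_t;

static inline void ticket_lock(ticket_lock_t *l)
{
    uint32_t my = __atomic_fetch_add(&l->next_ticket, 1, __ATOMIC_RELAXED);
    while (__atomic_load_n(&l->now_serving, __ATOMIC_ACQUIRE) != my)
        __asm__ volatile("pause");  /* don't starve the sibling logical CPU */
}

static inline void ticket_unlock(ticket_lock_t *l)
{
    /* only the holder writes now_serving, so a plain read is fine */
    __atomic_store_n(&l->now_serving, l->now_serving + 1, __ATOMIC_RELEASE);
}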
Cheers,
Brendan
Re: Concurrency with Free Page Queues
Posted: Wed Jul 09, 2014 2:46 pm
by Brendan
Hi,
alexfru wrote:Brendan wrote:
For the rare cases where consecutive physical pages are needed (buffers used by some device drivers, and nothing else) ...
I remember Intel starting to require the TSS to be physically contiguous as of the first Pentiums. It looks like they introduced a CPU bug (unless it was a documentation bug that existed all along since 1986) and decided to document it instead of fixing it. I wasn't able to trigger the bug back in 2000. It's possible that it had been fixed by then. It's also possible that the condition was more complex than just having a TSS whose parts are located in multiple pages that are far away from each other. Intel being Intel said nothing of value when I asked. Go figure.
I don't remember ever seeing this mentioned anywhere, and can't find it in Intel's current "system programming guide". There may have been a CPU bug that I don't remember; however, in the past I went through all Intel CPU errata from 80486 up to "Pentium 4" and enumerated everything that could possibly affect my OS, and if there is a bug I don't remember then it's extremely likely that it's a bug that doesn't affect me (e.g. perhaps it's a bug that only affects people that use hardware task switching).
In any case, if this problem actually exists, most tasks don't need an "IO permission bitmap" and therefore for most tasks the TSS has no need to cross a page boundary anyway. If there are tasks that do need an IO permission bitmap (e.g. some device drivers, for some micro-kernels), then you would need contiguous pages for them, but the cost would be negligible.
Cheers,
Brendan