Concurrency with Free Page Queues

dschatz
Member
Posts: 61
Joined: Wed Nov 10, 2010 10:55 pm

Re: Concurrency with Free Page Queues

Post by dschatz »

Just to add to the discussion about lock-free algorithms:

The whole idea is that, independent of how threads are scheduled, some thread will be making forward progress at any given time. This is particularly useful when you have over-committed your threads to your cores (more threads than cores). In that case a locking algorithm could see a thread acquire a lock and then be preempted, which blocks all other threads from making forward progress. Kernel-level locks can simply disable interrupts so that preemption can't happen, so there is little reason to use lock-free algorithms in a kernel.
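
To make that concrete, here's a minimal C11 sketch of the kind of lock-free push being described (all names here are hypothetical, not from any real kernel):

Code:
#include <stdatomic.h>

/* Hypothetical free-page stack: the first word of each free page is
 * reused as the "next" link. */
struct page {
    struct page *next;
};

static _Atomic(struct page *) free_page_stack;

/* Lock-free push: if the CAS fails because another CPU changed the
 * head, just retry. Whichever CPU wins the CAS makes progress, no
 * matter how the losers are scheduled. */
void free_page_push(struct page *p)
{
    struct page *head = atomic_load(&free_page_stack);
    do {
        p->next = head;
    } while (!atomic_compare_exchange_weak(&free_page_stack, &head, p));
}

(The matching lock-free pop is trickier - it needs ABA protection such as a version counter - which is part of why these structures get messy.)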
jbemmel
Member
Posts: 53
Joined: Fri May 11, 2012 11:54 am

Re: Concurrency with Free Page Queues

Post by jbemmel »

Brendan wrote: No. Imagine a typical 80x86 "4-core with hyper-threading" system. The first problem is that (e.g.) CPU #0 can read list->head, do some work, then fail to commit the work (the cmpxchg) because CPU #1 modified list->head; then CPU #0 retries and wastes more time doing work and fails to commit it because CPU #2 modified list->head; then wastes more time and fails to commit it because another CPU modified list->head again; and so on. For "worst case" there is no limit on how many times this can happen, and therefore no guarantee that CPU #0 will ever make progress.
In a real kernel (as opposed to a synthetic test case in which multiple threads all call lock/release repeatedly) the number of instructions in the lock-less code is small compared to the other instructions executed by each thread. This means that such a "worst case" scenario, in which a CPU is starved, is extremely unlikely; and even if all CPUs enter the code section at the same time, they would all be executing at more or less the same speed, so it is very likely that each CPU eventually gets its turn within at most #CPUs iterations. Therefore, I think starvation is an issue in theory, but not in practice.

You are right about the 'pause' instruction, but the code wasn't meant to be complete or optimized; I merely wanted to illustrate the difference in the use of cmpxchg. IIRC Linux now uses "lock dec [mem]" instead of cmpxchg anyway.
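
Roughly, the two styles look like this (a sketch only - not the actual Linux code; a lock value of 1 means free):

Code:
#include <stdatomic.h>

typedef struct { atomic_int v; } spinlock_t;   /* v == 1 means free */

/* cmpxchg style: try to atomically swap 1 -> 0. */
void cas_acquire(spinlock_t *l)
{
    int expected = 1;
    while (!atomic_compare_exchange_weak(&l->v, &expected, 0)) {
        expected = 1;
        __builtin_ia32_pause();     /* GCC/Clang builtin for 'pause' */
    }
}

/* "lock dec [mem]" style: one atomic decrement attempts the acquire,
 * and the old value tells us whether we got it. */
void dec_acquire(spinlock_t *l)
{
    while (atomic_fetch_sub(&l->v, 1) != 1) {       /* lock dec */
        /* didn't get it: spin read-only until it looks free, retry */
        while (atomic_load_explicit(&l->v, memory_order_relaxed) <= 0)
            __builtin_ia32_pause();
    }
}

void spin_release(spinlock_t *l)
{
    atomic_store(&l->v, 1);   /* plain store; no locked operation */
}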

My main motivation for looking into lock-less alternatives is that I dislike the idea of disabling interrupts as a general principle - even if only for the briefest period of time - because it degrades the worst-case responsiveness of the OS. That, plus the fact that "everyone else" is already using spinlocks, makes me want to explore lock-less alternatives. What good is an "alternative OS" if it works the same as everything else? :wink:
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Concurrency with Free Page Queues

Post by Brendan »

Hi,
jbemmel wrote:
Brendan wrote: No. Imagine a typical 80x86 "4-core with hyper-threading" system. The first problem is that (e.g.) CPU #0 can read list->head, do some work, then fail to commit the work (the cmpxchg) because CPU #1 modified list->head; then CPU #0 retries and wastes more time doing work and fails to commit it because CPU #2 modified list->head; then wastes more time and fails to commit it because another CPU modified list->head again; and so on. For "worst case" there is no limit on how many times this can happen, and therefore no guarantee that CPU #0 will ever make progress.
In a real kernel (as opposed to a synthetic test case in which multiple threads all call lock/release repeatedly) the number of instructions in the lock-less code is small compared to the other instructions executed by each thread. This means that such a "worst case" scenario, in which a CPU is starved, is extremely unlikely; and even if all CPUs enter the code section at the same time, they would all be executing at more or less the same speed, so it is very likely that each CPU eventually gets its turn within at most #CPUs iterations. Therefore, I think starvation is an issue in theory, but not in practice.
In theory "starved forever" (the worst case) is "infinitely unlikely". In practice "starved for long enough to cause latency problems" depends on how much starvation/latency is a problem and the number of CPUs that could be waiting for the lock to be released (which in turn depends on both the length of the critical section and the number of CPUs). It doesn't matter too much if it's a large critical section with a small number of CPUs or a short critical with a large number of CPUs.

Also; don't forget that there's typically a "hierarchy of things" - something (e.g. IPC) might acquire a lock and call something else (e.g. the virtual memory manager) which acquires a different lock and then calls something else (e.g. the physical memory manager) that acquires a third lock. For lock-less algorithms this can become a major mess, where you end up with new code paths for backing out that reverse work successfully committed by subordinates. E.g. maybe the virtual memory manager asks the physical memory manager to allocate a page (and the physical memory manager has to retry several times to allocate successfully), but then the virtual memory manager can't commit its work and has to back out by telling the physical memory manager to free the physical page that was previously allocated (and the physical memory manager has to retry several times to free the page successfully). Essentially; the "back out on commit failed" makes the problems with lock-less algorithms under contention compound.
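
As a sketch of what that back-out path looks like (every name here is hypothetical):

Code:
/* Hypothetical sketch of the back-out problem; pmm_alloc_page() and
 * pmm_free_page() are assumed to contain their own CAS retry loops. */
struct address_space;
unsigned long pmm_alloc_page(void);
void pmm_free_page(unsigned long paddr);
int vmm_commit_mapping(struct address_space *as,
                       unsigned long vaddr, unsigned long paddr);

int vmm_map_page(struct address_space *as, unsigned long vaddr)
{
    unsigned long paddr = pmm_alloc_page();   /* may retry internally */
    if (paddr == 0)
        return -1;

    /* Try to publish the new mapping lock-free... */
    if (!vmm_commit_mapping(as, vaddr, paddr)) {
        /* ...and if that fails, reverse the subordinate's committed
         * work - another retry loop, under the same contention that
         * made this commit fail. */
        pmm_free_page(paddr);
        return -1;   /* or restart the whole sequence from the top */
    }
    return 0;
}

And if vmm_map_page() is itself a subordinate of something (e.g. IPC), its caller needs a back-out path for it in turn.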

Now; the next version of Xeon Phi (due to be released soon) will have up to 72 cores with 4 logical CPUs per core, and will be usable as a normal/main CPU. This means a large server with a 4-socket motherboard could end up with 1152 logical CPUs. It also means that the "wasted work" (work that can't be committed) done on one logical CPU in a core will affect the performance of the 3 other logical CPUs (and the cache/bus traffic of many more). If you think about lock-less algorithms running on one of these machines, with all the "failed to commit; back out then retry" stuff, it's a major disaster (e.g. I'd expect that scalability may be so bad that disabling half of the CPUs might improve performance).

For normal workstation/desktop/mobile things, hardware really needs more cores/CPUs for several reasons (mostly performance per watt) but is being held back by crappy software - there's no point having 8 or more logical CPUs when applications won't use them. Sadly, application developers are still learning how to write scalable applications; it will continue to be a frustratingly slow progression, but they are going to get there eventually.

Of course it is wrong to only look at the "contended" case. For the "no contention" case, a well-written ticket lock is no worse than a lock-less algorithm. For example, consider Intel's transactional extensions - they essentially convert locks and critical sections into lock-less algorithms (the lock is ignored, and the work is retried if it can't be committed). Recent research into using these transactional extensions in the Linux kernel found that the existing ticket locks are better than the "converted to lock-less by hardware" equivalent code.
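
For reference, a minimal ticket lock sketch - the FIFO ordering is what bounds the worst case:

Code:
#include <stdatomic.h>

/* Minimal ticket lock sketch. Waiters are served strictly in arrival
 * order, so the worst-case wait is bounded by the number of waiters -
 * unlike a CAS retry loop or a plain spinlock, which are unfair. */
typedef struct {
    atomic_uint next_ticket;   /* the ticket dispenser */
    atomic_uint now_serving;   /* the ticket currently allowed in */
} ticket_lock_t;

void ticket_lock_acquire(ticket_lock_t *l)
{
    unsigned me = atomic_fetch_add(&l->next_ticket, 1);  /* take a ticket */
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != me)
        __builtin_ia32_pause();
}

void ticket_lock_release(ticket_lock_t *l)
{
    /* Only the holder advances now_serving. */
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}
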
jbemmel wrote:My main motivation for looking into lock-less alternatives, is that I dislike the idea of disabling interrupts as a general principle - even if for the briefest period of time - because it degrades the worst case responsiveness of the OS.
Of course the "worst contended case" applies to interrupt handlers too; and anything that isn't fair (has an unlimited "worst contended case") is likely to be quite bad for "worst case interrupt latency".

I don't like disabling interrupts much either. Fortunately, for a lot of a kernel's locks there's no reason to disable interrupts - disabling task switches alone is enough; only locks that may be acquired by code executed in the context of an interrupt handler need to disable interrupts.
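
For example (a sketch reusing the ticket lock above; preempt_count and the scheduler check are hypothetical):

Code:
/* Hypothetical sketch: a lock that forbids task switches on the
 * current CPU but leaves interrupts enabled. The scheduler is assumed
 * to refuse to preempt a task whose preempt_count is non-zero. */
static _Thread_local unsigned preempt_count;  /* per-CPU in a real kernel */
static ticket_lock_t vmm_lock;                /* hypothetical example lock */

void vmm_lock_acquire(void)
{
    preempt_count++;             /* scheduler won't switch us out now */
    ticket_lock_acquire(&vmm_lock);
}

void vmm_lock_release(void)
{
    ticket_lock_release(&vmm_lock);
    preempt_count--;             /* preemption allowed again */
}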

Most OSs split interrupt handling into 2 parts anyway - one part that executes in the context of an interrupt handler and pushes some sort of notification onto some sort of queue (e.g. the "top half" interrupt handler in Linux, queuing a deferred procedure call in Windows, or some sort of "send message" in most micro-kernels), and another part that does not (e.g. the "bottom half" interrupt handler in Linux, or code using some sort of "get message" in most micro-kernels). This means that for most OSs very little code is executed in the context of an interrupt handler, and therefore very few locks need to disable interrupts (e.g. only those used for sending the notification to the bottom half or message receiver or whatever).
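
A sketch of that split (all names hypothetical; the ticket lock is from the sketch above):

Code:
/* Hypothetical sketch of the two-part split. The queue lock is the
 * only lock ever taken from interrupt context, so it is the only one
 * that must be held with interrupts disabled. */
#define IRQ_QUEUE_SIZE 256   /* power of two; overflow check omitted */

extern void irq_disable(void);            /* hypothetical helpers */
extern void irq_enable(void);
extern void handle_device(unsigned irq);  /* the expensive work */

static unsigned irq_queue[IRQ_QUEUE_SIZE];
static unsigned q_head, q_tail;           /* guarded by q_lock */
static ticket_lock_t q_lock;

/* Part 1: runs in interrupt context - just record that the IRQ fired. */
void irq_top_half(unsigned irq)
{
    ticket_lock_acquire(&q_lock);         /* interrupts already off here */
    irq_queue[q_head++ % IRQ_QUEUE_SIZE] = irq;
    ticket_lock_release(&q_lock);
    /* wake the worker thread, send the EOI, return */
}

/* Part 2: runs as a normal thread - the real work, with interrupts on. */
void irq_bottom_half(void)
{
    for (;;) {
        irq_disable();
        ticket_lock_acquire(&q_lock);
        int empty = (q_tail == q_head);
        unsigned irq = empty ? 0 : irq_queue[q_tail++ % IRQ_QUEUE_SIZE];
        ticket_lock_release(&q_lock);
        irq_enable();
        if (empty)
            break;
        handle_device(irq);               /* long work, no locks held */
    }
}
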
jbemmel wrote:That, plus the fact that "everyone else" is already using spinlocks makes me want to explore lock-less alternatives. What good is an "alternative OS" if it works the same as everything else? :wink:
While I've never been afraid of doing things differently, an alternative does need to have at least some benefits - "different and better" is good, "same" is average, and "different but worse" is bad. With this in mind, perhaps you could try several alternatives and profile/benchmark several cases (e.g. single-CPU, 2-CPU and "many CPU") to determine where lock-less actually helps and where it doesn't; a rough harness is sketched below. I think you'll find that the more CPUs there are, the worse lock-less algorithms (and simple spin locks) become compared to ticket locks.
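
Something along these lines, perhaps (a rough user-space sketch reusing the ticket lock above; kernel results will differ, but the scaling trend usually shows up):

Code:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 1000000
#define MAX_THREADS 64

static ticket_lock_t bench_lock;   /* swap in each lock being compared */
static long shared_counter;        /* stand-in for the critical section */

static void *worker(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERS; i++) {
        ticket_lock_acquire(&bench_lock);
        shared_counter++;          /* protected by the lock */
        ticket_lock_release(&bench_lock);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 2;  /* number of threads */
    if (n < 1 || n > MAX_THREADS)
        return 1;
    pthread_t threads[MAX_THREADS];
    for (int i = 0; i < n; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < n; i++)
        pthread_join(threads[i], NULL);
    printf("counter = %ld\n", shared_counter);
    return 0;
}

Run it under time(1) with 1, 2 and many threads, once per lock implementation being compared.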


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
jbemmel
Member
Posts: 53
Joined: Fri May 11, 2012 11:54 am

Re: Concurrency with Free Page Queues

Post by jbemmel »

If I had a system with 1152 logical CPUs, I'd probably give each its own share of memory to avoid contention altogether. This is good for NUMA alignment too. In the extreme case, you'd have 1152 separate systems that happen to be able to communicate directly through memory.
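
As a sketch of that partitioning (all names hypothetical):

Code:
/* Hypothetical sketch: each CPU owns a private free-page list, so the
 * common path needs no atomics at all. CPUs touch each other's lists
 * only to rebalance, which is rare enough to use an ordinary lock. */
#define MAX_CPUS 1152

struct page { struct page *next; };

static struct {
    struct page *head;        /* touched only by the owning CPU */
    unsigned long count;
} free_list[MAX_CPUS];

struct page *alloc_page_local(unsigned cpu)
{
    struct page *p = free_list[cpu].head;
    if (p) {                              /* fast path: no lock, no CAS */
        free_list[cpu].head = p->next;
        free_list[cpu].count--;
        return p;
    }
    /* Slow path: steal a batch from a neighbour's list (locked path,
     * omitted here). */
    return NULL;
}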

Thanks for the elaborate response, it's an interesting topic. To be continued...