Hi,
jbemmel wrote:Brendan wrote:No. Imagine a typical 80x86 "4-core with hyper-threading" system. The first problem is that (e.g.) CPU #0 can read list->head, do some work, then fail to commit the work (the cmpxchg) because CPU #1 modified list->head; then CPU #0 retries and wastes more time doing work and fails to commit it because CPU #2 modified list->head; then wastes more time and fails to commit it because another CPU modified list->head again; and so on. For "worst case" there is no limit on how many times this can happen, and therefore no guarantee that CPU #0 will ever make progress.
In a real kernel ( as opposed to a synthetic test case in which multiple threads all call lock/release repeatedly ) the number of instructions in the lock-less code is small compared to the other instructions executed by each thread. This means that such a "worst case" scenario in which a CPU is starved is extremely unlikely, and even if it happens that all CPUs enter the code section at the same time, they would all be executing at more or less the same speed, such that it is very likely that each CPU eventually gets its turn within at most #CPUs iterations. Therefore, I think starvation is an issue in theory, but not in practice.
In theory "starved forever" (the worst case) is "infinitely unlikely". In practice "starved for long enough to cause latency problems" depends on how much starvation/latency is a problem and the number of CPUs that could be waiting for the lock to be released (which in turn depends on both the length of the critical section and the number of CPUs). It doesn't matter too much if it's a large critical section with a small number of CPUs or a short critical with a large number of CPUs.
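For concreteness; here's a minimal sketch (in C11, with made-up list/node types) of the kind of "do the work, then try to commit it with a compare-and-swap" loop being discussed - note that the retry loop has no upper bound when other CPUs keep winning the race:

```c
#include <stdatomic.h>

struct node {
    struct node *next;
    int payload;
};

struct list {
    _Atomic(struct node *) head;
};

/* Lock-less push: prepare the work, then try to commit it with a
 * compare-and-swap. If another CPU changed list->head in the meantime
 * the commit fails, node->next is redone, and the CAS is retried -
 * with no bound on how many times this can repeat under contention. */
static void list_push(struct list *list, struct node *node)
{
    struct node *old_head = atomic_load_explicit(&list->head,
                                                 memory_order_relaxed);
    do {
        node->next = old_head;    /* the "work" that may be wasted */
    } while (!atomic_compare_exchange_weak_explicit(
                 &list->head, &old_head, node,
                 memory_order_release, memory_order_relaxed));
}
```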
Also; don't forget that there's typically a "hierarchy of things" - something (e.g. IPC) might acquire a lock and call something else (e.g. the virtual memory manager) which acquires a different lock and then calls something else (e.g. the physical memory manager) that acquires a third lock. For lock-less algorithms this can become a major mess, where you end up with new code paths for backing out that reverse work already successfully committed by subordinates. E.g. maybe the virtual memory manager asks the physical memory manager to allocate a page (and the physical memory manager has to retry several times to allocate successfully) but then the virtual memory manager can't commit its own work and has to back out by telling the physical memory manager to free the physical page that was previously allocated (and the physical memory manager has to retry several times to free the page successfully). Essentially; the "back out on commit failed" paths multiply the problems that lock-less algorithms already have under contention.
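As a rough illustration of that back-out mess (all of the names here - vmm_map_page(), pmm_alloc_page(), pmm_free_page(), vmm_try_commit_mapping() - are hypothetical, invented for the sketch):

```c
typedef unsigned long phys_addr_t;
#define PHYS_NULL ((phys_addr_t)0)
struct addr_space;

/* Hypothetical lower-layer primitives (each may retry internally): */
phys_addr_t pmm_alloc_page(void);
void pmm_free_page(phys_addr_t page);
int vmm_try_commit_mapping(struct addr_space *as, void *virt,
                           phys_addr_t page);

/* The "back out on commit failed" pattern: if the upper layer's commit
 * fails it must reverse work the lower layer already committed, then
 * start over from scratch. */
int vmm_map_page(struct addr_space *as, void *virt)
{
    for (;;) {
        phys_addr_t page = pmm_alloc_page();  /* may loop retrying its CAS */
        if (page == PHYS_NULL)
            return -1;                        /* genuinely out of memory */

        if (vmm_try_commit_mapping(as, virt, page))
            return 0;                         /* committed; done */

        /* Commit failed (something else changed the mapping first):
         * undo the subordinate's committed work, then redo everything. */
        pmm_free_page(page);                  /* may loop retrying its CAS */
    }
}
```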
Now; the next version of Xeon Phi (due to be released soon) will have up to 72 cores with 4 logical CPUs per core, and will be usable as a normal/main CPU. This means a large server with a 4-socket motherboard could end up with 1152 logical CPUs. It also means that the "wasted work" (work that can't be committed) done on one logical CPU in a core will affect the performance of the 3 other logical CPUs in that core (and the cache/bus traffic will affect many more). If you think about lockless algorithms running on one of these machines with all the "failed to commit; back out then retry" stuff, it's a major disaster (e.g. I'd expect that scalability may be so bad that disabling half of the CPUs might improve performance).
For normal workstation/desktop/mobile things, hardware really needs more cores/CPUs for several reasons (mostly performance per watt) but is being held back by crappy software - there's no point having 8 or more logical CPUs when applications won't use them. Sadly; application developers are still learning how to write software that actually uses many CPUs, and while it will continue to be a frustratingly slow progression, they are going to get there eventually.
Of course it is wrong to only look at the "contended" case. For the "no contention" case, a well written ticket lock is no worse than a lockless algorithm. For an example, consider Intel's transactional extensions (TSX) - these essentially convert locks and critical sections into lock-less algorithms (e.g. the lock is ignored, and the work is retried if it can't be committed). Recent research investigated the use of transactional extensions for the Linux kernel and found that the existing ticket locks are better than the "converted to lock-less by hardware" equivalent code.
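For comparison; a minimal ticket lock sketch in C11 (the x86 PAUSE hint via GCC's __builtin_ia32_pause() is an assumption about the toolchain/target) - the uncontended acquire path is a single atomic add plus one load, and waiters are served strictly in arrival order, which is what makes it fair:

```c
#include <stdatomic.h>

struct ticket_lock {
    atomic_uint next_ticket;   /* ticket handed to the next arrival */
    atomic_uint now_serving;   /* ticket currently allowed to enter */
};

static void ticket_lock_acquire(struct ticket_lock *lock)
{
    /* Take a ticket; arrival order determines service order (fairness). */
    unsigned my_ticket = atomic_fetch_add_explicit(&lock->next_ticket, 1,
                                                   memory_order_relaxed);
    while (atomic_load_explicit(&lock->now_serving, memory_order_acquire)
           != my_ticket)
        __builtin_ia32_pause();   /* x86 PAUSE; be kind to hyper-threads */
}

static void ticket_lock_release(struct ticket_lock *lock)
{
    /* Only the holder writes now_serving, so load+store is safe here. */
    atomic_store_explicit(&lock->now_serving,
                          atomic_load_explicit(&lock->now_serving,
                                               memory_order_relaxed) + 1,
                          memory_order_release);
}
```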
jbemmel wrote:My main motivation for looking into lock-less alternatives, is that I dislike the idea of disabling interrupts as a general principle - even if for the briefest period of time - because it degrades the worst case responsiveness of the OS.
Of course the "worst contended case" applies to interrupt handlers too; and anything that isn't fair (has an unlimited "worst contended case") is likely to be quite bad for "worst case interrupt latency".
I don't like disabling interrupts much either. Fortunately, for a lot of a kernel's locks there's no reason to disable interrupts and disabling task switches alone is enough; only locks that may be acquired by code that's executed in the context of an interrupt handler need to disable interrupts.
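A hedged sketch of those two kinds of locks (reusing the ticket lock above; disable_task_switches(), save_and_disable_irqs() and friends are hypothetical kernel primitives, not any real API):

```c
/* Hypothetical kernel primitives assumed for this sketch: */
void disable_task_switches(void);          /* postpone preemption        */
void enable_task_switches(void);
unsigned long save_and_disable_irqs(void); /* CLI; return previous flags */
void restore_irqs(unsigned long flags);

/* ticket_lock_acquire()/ticket_lock_release() as sketched earlier. */

/* For a lock that is never touched by interrupt handlers, disabling
 * task switches while it's held is enough - interrupts stay enabled. */
void lock_normal(struct ticket_lock *lock)
{
    disable_task_switches();
    ticket_lock_acquire(lock);
}

void unlock_normal(struct ticket_lock *lock)
{
    ticket_lock_release(lock);
    enable_task_switches();
}

/* A lock that may also be acquired in interrupt context must disable
 * interrupts; otherwise a handler could spin forever on a lock that
 * the code it interrupted already holds (deadlock on one CPU). */
unsigned long lock_irq_safe(struct ticket_lock *lock)
{
    unsigned long flags = save_and_disable_irqs();
    ticket_lock_acquire(lock);
    return flags;
}

void unlock_irq_safe(struct ticket_lock *lock, unsigned long flags)
{
    ticket_lock_release(lock);
    restore_irqs(flags);
}
```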
Most OSs split interrupt handling into 2 parts anyway - one part that executes in the context of an interrupt handler and pushes some sort of notification onto some sort of queue (e.g. the "top half" interrupt handler in Linux, queueing a deferred procedure call in Windows, or some sort of "send message" in most micro-kernels), and another part that does not (e.g. the "bottom half" interrupt handler in Linux, or code using some sort of "get message" in most micro-kernels). This means that for most OSs very little code is executed in the context of an interrupt handler, and therefore very few locks need to disable interrupts (e.g. only those used for sending the notification to the "bottom half" or message receiver or whatever).
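A rough sketch of that shape (the event queue and handle_event() are invented for illustration; real code would handle queue overflow, EOI, waking the worker thread, etc.):

```c
#define EVENT_QUEUE_SIZE 256

/* The only shared state touched in interrupt context is a small
 * notification queue, guarded by the one lock that must disable IRQs
 * (ticket_lock, lock_irq_safe(), unlock_irq_safe() as sketched above). */
struct event_queue {
    struct ticket_lock lock;
    unsigned head, tail;
    int events[EVENT_QUEUE_SIZE];
};

void handle_event(int event);   /* hypothetical: the real driver work */

/* Part 1 ("top half" / "send message"): runs in interrupt context,
 * so it only pushes a notification and returns as quickly as possible. */
void irq_top_half(struct event_queue *q, int event)
{
    unsigned long flags = lock_irq_safe(&q->lock);
    q->events[q->head++ % EVENT_QUEUE_SIZE] = event;  /* overflow check omitted */
    unlock_irq_safe(&q->lock, flags);
}

/* Part 2 ("bottom half" / "get message"): runs as a normal thread, so
 * any other locks it takes don't need interrupts disabled at all. */
void irq_bottom_half(struct event_queue *q)
{
    unsigned long flags = lock_irq_safe(&q->lock);
    while (q->tail != q->head) {
        int event = q->events[q->tail++ % EVENT_QUEUE_SIZE];
        unlock_irq_safe(&q->lock, flags);
        handle_event(event);    /* the bulk of the work, interrupts enabled */
        flags = lock_irq_safe(&q->lock);
    }
    unlock_irq_safe(&q->lock, flags);
}
```

The point being that the IRQ-disabling lock only ever covers a few instructions around the tiny queue, while the bulk of the work in handle_event() runs with interrupts enabled.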
jbemmel wrote:That, plus the fact that "everyone else" is already using spinlocks makes me want to explore lock-less alternatives. What good is an "alternative OS" if it works the same as everything else?
While I've never been afraid of doing things differently; an alternative does need to have at least some benefits - "different and better" is good, "same" is average, and "different but worse" is bad. With this in mind, perhaps you could implement several alternatives and profile/benchmark several cases (e.g. single-CPU, 2 CPUs and "many CPUs") to determine where lock-less actually helps and where it doesn't. I think you'll find that the more CPUs there are, the worse lock-less algorithms (and simple spin locks) become compared to ticket locks.
Cheers,
Brendan