OSDev.org

Posted: **Sat Nov 05, 2016 7:54 pm**

I'm writing an MCS spinlock using C11's atomics. I would like to specify the weakest memory ordering constraints possible for a correct implementation that will work on any architecture (so, not just x86). This looks correct to me, but I'd love it if anyone could poke holes in it, particularly with examples of when it might fail or perform sub-optimally.

Another question I have is regarding _Atomic qualifiers. Since I'm not using any language-level operations that need to be atomic, and I would like for spin_lock's initialization of node not to be atomic (since no other thread will read those values), is it okay that I've left them out? Or is it illegal to cast a pointer to _Atomic?

(As an aside, the standard doesn't seem to require <stdatomic.h> in freestanding implementations, but GCC seems to provide it anyway. Am I missing something?)

Code: Select all

struct spinlock {
	struct spinlock_node *tail;
};

struct spinlock_node {
	struct spinlock_node *next;
	bool locked;
};

static inline void spin_lock(struct spinlock *lock, struct spinlock_node *node) {
	node->next = NULL;
	node->locked = false;

	struct spinlock_node *prev = atomic_exchange_explicit(&lock->tail, node, memory_order_acquire);
	if (prev == NULL) {
		return;
	}

	atomic_store_explicit(&prev->next, node, memory_order_relaxed);
	while (!atomic_load_explicit(&node->locked, memory_order_acquire)) {
		__asm__ volatile ("pause");
	}
}

static inline void spin_unlock(struct spinlock *lock, struct spinlock_node *node) {
	struct spinlock_node *next = atomic_load_explicit(&node->next, memory_order_relaxed);
	if (next == NULL) {
		struct spinlock_node *prev = node;
		if (atomic_compare_exchange_strong_explicit(
			&lock->tail, &prev, NULL, memory_order_release, memory_order_relaxed
		)) {
			return;
		}
		while ((next = atomic_load_explicit(&node->next, memory_order_relaxed)) == NULL) {
			__asm__ volatile ("pause");
		}
	}
	atomic_store_explicit(&next->locked, true, memory_order_release);
}

Posted: **Sat Nov 05, 2016 8:59 pm**

Hi,

Rusky wrote:I'm writing an MCS spinlock using C11's atomics. I would like to specify the weakest memory ordering constraints possible for a correct implementation that will work on any architecture (so, not just x86). This looks correct to me, but I'd love it if anyone could poke holes in it, particularly with examples of when it might fail or perform sub-optimally.

You can tell that it either isn't portable, isn't correct or isn't optimal from the topic's title alone (without even looking at the code). For a very obvious example, the "pause" instruction is not portable (but you can't simply delete it, even though it'd be "correct" if you did, because it'd be "less optimal" for 80x86).

Rusky wrote:(As an aside, the standard doesn't seem to require <stdatomic.h> in freestanding implementations, but GCC seems to provide it anyway. Am I missing something?)

Imagine a simple (RISC?) CPU designed and used in a single-CPU system, that doesn't have atomics built into the instruction set. For this case, to implement <stdatomic.h> the library will have to either ask the kernel to disable the scheduler temporarily or ask the kernel to do the atomic operation (so that a task switch doesn't happen at the wrong time - e.g. between a "compare" and an "exchange").

Basically; I'd assume that in some cases it's impossible to support <stdatomic.h> in freestanding implementations; and GCC only provides it in freestanding implementations if it is possible.

Cheers,

Brendan

Posted: **Sun Nov 06, 2016 1:35 am**

Brendan wrote:For a very obvious example, the "pause" instruction is not portable (but you can't simply delete it, even though it'd be "correct" if you did, because it'd be "less optimal" for 80x86).

indeed. I specifically asked about memory ordering constraints that would work on any architecture. Obviously the pause instruction is a separate issue.

Brendan wrote:Basically; I'd assume that in some cases it's impossible to support <stdatomic.h> in freestanding implementations; and GCC only provides it in freestanding implementations if it is possible.

The standard provides a separate mechanism for that- __STDC_NO_ATOMICS__. My confusion is more from a lack of mention in the standard regarding the interaction between stdatomic.h and freestanding.

Posted: **Sun Nov 06, 2016 6:26 am**

Rusky wrote:My confusion is more from a lack of mention in the standard regarding the interaction between stdatomic.h and freestanding.

The architecture may lack direct support for some or all atomic operations and so they must be implemented by the OS/kernel/whatever (synchronizing CPUs, caches, that sort of thing) and you may have none of that in freestanding. The standard should probably be more explicit about this, but it's something that should be expected. It would be interesting to see what gcc will do for the i80386/i80486 in freestanding if you ask it to compile a 64-bit atomic increment or exchange. It may either completely crap out or generate calls to functions from libatomic, which may just use a few spinlocks (implemented with xchg) to guard accesses to atomic variables.

Posted: **Sun Nov 06, 2016 7:25 am**

From a brief googling, the majority opinion seems to be that if you use any non-atomic access, the following accesses from any other thread are undefined, no matter how you perform those. Seems very consistent with the standardization methodology for C++ and C in general. The workaround appears to be to use atomic types in the structure, but to qualify the initializing operations explicitly with memory_order_relaxed. That is, in your case, you can use things like atomic_store_explicit.

Alternatively, write your own portable types and use atomic builtins/interlocked operations with those. You wont get toolchain abstraction for free, but the behavior will always be well defined to the extent of your understanding of the underlying mechanisms (rather than some arbitrary standard definition). In practice, the standard constructs are probably implemented in the same way and you are wasting effort, but you just don't have the clarity you will require.

Also, in case you end up with runtime dependencies, you should be able to build some of the runtime library code for a freestanding environment. Very few of the language features that a system programmer might need rely on system calls. You may provide the implementations yourself, if some intrinsic or another generates a call (which probably indicates you don't want to use it to begin with).

Posted: **Sun Nov 06, 2016 11:20 am**

simeonz wrote:if you use any non-atomic access, the following accesses from any other thread are undefined, no matter how you perform those.

That can't be it, since atomics are used to implement locks, and you don't need to mark the data protected by the lock as _Atomic. This reference also seems to disagree (emphasis mine):

If an atomic store in thread A is tagged memory_order_release and an atomic load in thread B from the same variable is tagged memory_order_acquire, all memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B, that is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory.

Posted: **Sun Nov 06, 2016 12:01 pm**

Rusky wrote:That can't be it, since atomics are used to implement locks, and you don't need to mark the data protected by the lock as _Atomic.

You are right. The data accessed in synchronized sections of code does not have to be atomic. The way I formulated it, the usage of things like mtx_lock would be cumbersome to say the least. But casting of atomics to non-atomics shouldn't be legal.

Posted: **Sun Nov 06, 2016 12:19 pm**

simeonz wrote:casting of atomics to non-atomics shouldn't be legal.

That's not what I'm doing, though- I'm casting non-atomics to atomics, by passing them to the atomic_* functions. The analogous case for volatile is extremely common and well-supported by the standard and implementations- the idea is that the access itself is what's qualified, not the memory location or data.

Posted: **Sun Nov 06, 2016 2:27 pm**

Well, I am scanning through the C11 draft here. You may notice the use of the term "atomic object" when defining the associated order.

All modifications to a particular atomic object M occur in some particular total order, called the modification order of M.

Other places refer to memory locations instead, but if one memory location is used both atomically and non-atomically, does that location constitute an atomic object with an associated order? Also, note that there is a special macro for initializing atomics, named "ATOMIC_VAR_INIT" and a corresponding function "atomic_init".

The ATOMIC_VAR_INIT macro expands to a token sequence suitable for initializing an atomic object of a type that is initialization-compatible with value. An atomic object with automatic storage duration that is not explicitly initialized using ATOMIC_VAR_INIT is initially in an indeterminate state; however, the default (zero) initialization for objects with static or thread-local storage duration is guaranteed to produce a valid state.

The atomic_init generic function initializes the atomic object pointed to by obj to the value value, while also initializing any additional state that the implementation might need to carry for the atomic object.

This again hints that the atomic qualifier, unlike the cv-qualifiers may be based on structural type differences (although I doubt it).

OSDev.org

Code review: MCS lock with C11 atomics

Code review: MCS lock with C11 atomics

Re: Code review: MCS lock with C11 atomics

Re: Code review: MCS lock with C11 atomics

Re: Code review: MCS lock with C11 atomics

Re: Code review: MCS lock with C11 atomics

Re: Code review: MCS lock with C11 atomics

Re: Code review: MCS lock with C11 atomics

Re: Code review: MCS lock with C11 atomics

Re: Code review: MCS lock with C11 atomics