[solved]lock cmpxchg didn't happen atomically?

xeyes · Post by **xeyes** » Sun Jan 17, 2021 10:18 pm

solved, bug is elsewhere (described in this thread, but probably not of much interest as it is a "logic error" in the upper level), lock cmpxchg is doing what it should do, original post below:

I'm pretty sure it's my bug or misunderstanding of something, yet it looks like "lock cmpxchg" didn't happen atomically from 2 CPUs.

Context:

It's a simple reader/writer lock (lock for my inode cache) implemented using 1 DWORD, different states represented by values:

// 0xFFFFFFFF: write locked
// 1 to 0xFFFFFFFE: read locked, the number is ref count
// 0: not locked

Atomic instructions are used to modify the values.

Yet I was able to get this sequence to happen on 2 CPUs :

Code: Select all

CPU 1: before lock 2, val 0
CPU 1: after lock 2, val 1
CPU 0: before lock 2, val 0
CPU 0: after lock 2, val 1

As you can see both CPU 0 and CPU 1 think it has successfully incremented the value at index 2 from 0 to 1

Code doing the increment (read lock, which is what both the CPU ran between the pair of prints) is below. I can't seem to see what's wrong with it but something has to be off.

It is also quite ugly (especially the part that it reads in the value from *ecx and then having to do both copy and increment) so would also like some suggestions on improving.

r is pointer to the lock.
_R_LOCK_MAX is 0xFFFFFFFE, but both CPUs should have blasted through the initial spin part as the value is so far from reaching it anyway.

Code: Select all

    asm volatile ("_r_keep_spinning:                        \n\
                   movl (%%ecx), %%ebx                      \n\
                   movl %%ebx, %%eax                        \n\
                   addl $1, %%ebx                           \n\
                   cmp $" EXSTR(_R_LOCK_MAX) ", %%eax       \n\
                   jb _try_get_r_lock                       \n\
                   pause                                    \n\
                   jmp _r_keep_spinning                     \n\
                   _try_get_r_lock:                         \n\
                   lock cmpxchg %%ebx, (%%ecx)              \n\
                   jne _r_keep_spinning                      "
                   :
                   : "c"(r)
                   : "eax", "ebx", "memory", "cc");

Octocontrabass · Post by **Octocontrabass** » Mon Jan 18, 2021 12:39 am

xeyes wrote:As you can see both CPU 0 and CPU 1 think it has successfully incremented the value at index 2 from 0 to 1

You're not doing anything silly like setting it to zero on both CPUs before running your test code, right? (Or using different physical addresses?)

xeyes wrote:It is also quite ugly (especially the part that it reads in the value from *ecx and then having to do both copy and increment) so would also like some suggestions on improving.

I suggest writing it without using any inline assembly.

Something like this:

Code: Select all

void acquire_read( unsigned int * lock )
{
    unsigned int temp = __atomic_load_n( lock, __ATOMIC_RELAXED );
    do
    {
        while( temp >= _R_LOCK_MAX - 1 )
        {
            __builtin_ia32_pause();
            temp = __atomic_load_n( lock, __ATOMIC_RELAXED );
        }
    }
    while( !__atomic_compare_exchange_n( lock, &temp, temp + 1, 0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED ) );
}

(This code is untested and may be incorrect. Read the documentation carefully.)

kzinti · Post by **kzinti** » Mon Jan 18, 2021 2:19 am

Yes please, do not write such code in assembly. Use compiler intrinsics as suggested by Octocontrabass.

xeyes · Post by **xeyes** » Mon Jan 18, 2021 3:08 am

Octocontrabass wrote:
xeyes wrote:As you can see both CPU 0 and CPU 1 think it has successfully incremented the value at index 2 from 0 to 1
You're not doing anything silly like setting it to zero on both CPUs before running your test code, right? (Or using different physical addresses?)

Thanks! You are right it was a silly issue and the 2 CPUs indeed aren't targeting the same address.

I had another instance of this same r/w lock on the whole cache, and obviously normal look up is read lock and write lock is used for 'cache maintenance' like read in new inodes (and potentially evict old ones).

Since I assumed that this kind of r/w lock's owner can't atomically change from r->w (on second thought, it might be possible, like spin for a 1 to 0xFFFFFFFF transition), I let the thread has to give up r and relock with w after a cache miss in order to read from disk and add to cache.

Thus there are situations where 2 threads (CPUs) both look up inode 2, both see that the cache missed, and both added it to the cache (in serial, but neither re-checked after relock with w thus they added 2 copies) They then proceeded to both "atomically change 0 to 1" on the inode's lock (targeting the 2 different copies, of course).

The code itself actually works fine, to isolate the issue I moved it to run in user space and with 8 threads (1 writer 7 readers) on 8 CPUs spinning lock/unlock things did seem to happen atomically, at least the check that found the case I'm asking about didn't trigger.

Also noticed an unexpected "unfairness" where the write thread wins vast majority of the time and itself does more iterations than 7 readers combined. Probably because the write lock path is much faster and it can go for cmpxchg right after reading *ecx and a single cmp, without wasting time doing the move and add like the readers.

Code: Select all

    
asm volatile ("_w_init_spin:                            \n\
                   movl $" EXSTR(_RW_UNLOCKED) ", %%eax     \n\
                   _w_keep_spinning:                        \n\
                   cmp (%%ecx), %%eax                       \n\
                   je _w_try_get_lock                       \n\
                   pause                                    \n\
                   jmp _w_keep_spinning                     \n\
                   _w_try_get_lock:                         \n\
                   lock cmpxchg %%ebx, (%%ecx)              \n\
                   jne _w_init_spin                          "
                   :
                   : "c"(r), "b"(_W_LOCKED)
                   : "eax", "memory", "cc");

I suggest writing it without using any inline assembly.

Will look into this, I always thought this kind of code is where inline is useful, but maybe I'm missing something important.

xeyes · Post by **xeyes** » Mon Jan 18, 2021 3:16 am

kzinti wrote:Yes please, do not write such code in assembly. Use compiler intrinsics as suggested by Octocontrabass.

Thanks, if multiple people are saying the same thing there must be some good reasons for it.

nullplan · Post by **nullplan** » Mon Jan 18, 2021 1:08 pm

xeyes wrote:Thanks, if multiple people are saying the same thing there must be some good reasons for it.

There are good reasons for it. Inline assembly is notoriously difficult to get right. Did you use the right qualifiers, the right constraints, and the right clobbers? You might not immediately notice if you did not. Now, my solution to that is to write complete functions in assembler and call them with normal calls, but some people dislike this since it inhibits compiler optimization. Essentially, the compiler only sees a black box where the function would be, and cannot do anything with it. Cannot inline it, either. So compiler intrinsics help with that. Plus they make it easier to get the message across.

xeyes · Post by **xeyes** » Mon Jan 18, 2021 9:36 pm

nullplan wrote:
xeyes wrote:Thanks, if multiple people are saying the same thing there must be some good reasons for it.
There are good reasons for it. Inline assembly is notoriously difficult to get right. Did you use the right qualifiers, the right constraints, and the right clobbers? You might not immediately notice if you did not. Now, my solution to that is to write complete functions in assembler and call them with normal calls, but some people dislike this since it inhibits compiler optimization. Essentially, the compiler only sees a black box where the function would be, and cannot do anything with it. Cannot inline it, either. So compiler intrinsics help with that. Plus they make it easier to get the message across.

Good points!

Yes GCC somehow has the feature but doesn't really seem to play too nice with inline asm, very easy to get inlined label conflict with another copy of itself and lots of confusion on who clobbered who

Maybe it takes a real expert to use it well, I've replaced my ugly mov and add with lea now but probably still slower than GCC's output.

OSDev.org

[solved]lock cmpxchg didn't happen atomically?

[solved]lock cmpxchg didn't happen atomically?

Re: lock cmpxchg didn't happen atomically?

Re: lock cmpxchg didn't happen atomically?

Re: lock cmpxchg didn't happen atomically?

Re: lock cmpxchg didn't happen atomically?

Re: lock cmpxchg didn't happen atomically?

Re: lock cmpxchg didn't happen atomically?