Guaranteed atomic instructions

devc1 · **Posted:** Sat Aug 05, 2023 6:14 am

Are there any guaranteed fast atomic instructions on the AMD64 Platform ?

thewrongchristian · **Joined:** Tue Apr 03, 2018 2:44 am **Posts:** 403

devc1 wrote:

Are there any guaranteed fast atomic instructions on the AMD64 Platform ?

C11 defines atomic intrinsics

If you're not using C, C++ has similar atomics, as does rust

In all these, atomic operations will boil down to fast atomic instructions on AMD64 if that is your target.

So, something like the following, which atomically prepends a list element in a single linked list (I think, it's untested:)

Code:

#include <stdatomic.h>

struct list_item {
    void * item;
    struct list_item * next;
};

_Atomic(struct list_item *) head;

void prepend_list(struct list_item * item)
{
    do {
        item->next = head;
    } while(!atomic_compare_exchange_weak(&head, &item->next, item));
}

when plugged into Compiler Explorer, you can see it generates the following atomic instruction on amd64 to do the actual atomic compare and exchange:

Code:

lock cmpxchg QWORD PTR [rsi], rcx

devc1 · **Posted:** Sat Aug 05, 2023 4:09 pm

I know but,

what instructions are guaranteed to be atomic, because using the lock prefix is often expensive.

I tried something in real hardware by the way, removing interlocked(atomic) functions and replacing them with normal directives in a function made it almost 5 times faster, and probably even faster as of what I remember.

But I heard that some instructions are guaranteed to be atomic, does that still makes them fast and not slow the code down ?

Can I confidently add or substract a value from different cpus without an incorrect behaviour, for e.g. like RemainingMemory

I'm also planning on making a general purpose heap manager, that is designed to run the page allocator, and heap functions + user mode heaps + heap inside a device, e.g. allocating a gmr region in Vmware SVGA GPU.

And I'm trying to get a way to avoid searching, have a sorted list of free memory (I already found a solution that is a bit memory expensive, and found a way to make it less expensive).

Octocontrabass · **Joined:** Mon Mar 25, 2013 7:01 pm **Posts:** 5146

devc1 wrote:

But I heard that some instructions are guaranteed to be atomic, does that still makes them fast and not slow the code down ?

General-purpose instructions with naturally-aligned memory operands that are only read or only written are guaranteed to be atomic. Some floating-point instructions have the same guarantee. Read-modify-write operands are not guaranteed to be atomic without a LOCK prefix. Misaligned operands are not guaranteed to be atomic without a LOCK prefix. Even if individual instructions are guaranteed to be atomic, a sequence of such instructions is not guaranteed to be atomic.

The XCHG instruction always includes a LOCK prefix.

devc1 wrote:

Can I confidently add or substract a value from different cpus without an incorrect behaviour, for e.g. like RemainingMemory

That's a read-modify-write operation, so you need a LOCK prefix.

devc1 · **Posted:** Sat Aug 05, 2023 4:51 pm

And what about the heap allocator, should I just spinlock before allocating a heap when there are multiple processors involved.

Octocontrabass · **Joined:** Mon Mar 25, 2013 7:01 pm **Posts:** 5146

Any time multiple processors can access a single shared resource, you need either mutual exclusion (such as a spinlock) or a lock-free algorithm. Lock-free algorithms are not always possible. (And lock-free does not mean LOCK-prefix-free!)

thewrongchristian · **Joined:** Tue Apr 03, 2018 2:44 am **Posts:** 403

devc1 wrote:

And what about the heap allocator, should I just spinlock before allocating a heap when there are multiple processors involved.

Sounds to me like you're thinking of premature optimisation.

You need to get it working correctly, before you think about getting it working fast. And before you start thinking about what optimisations to apply, you need to measure where you need these optimisations.

After all, if it turns out your heap allocator lock is hardly contended, you might be able to redesign your locking to be coarser, reducing the overhead of the lock. Or it might be that your lock is highly contended, in which case it might be prudent to redesign your locking to be finer. But neither involves changing the lock mechanism, the lock still needs to function correctly as a lock, and it is this mechanism that by necessity makes it slower than non-atomic instruction sequences.

There are other ways to avoid per-thread heap locks. You could instead use a per-thread memory arena, for quick temporary allocations using a simple bump-pointer allocator. Such allocated memory might be used within the processing of a single system call, and once that system call is complete, you just discard all the memory from the arena (any data that must persist beyond the system call, however, would require the regular heap.) Being per-thread, this arena would require no locking, and thus be much faster than your general purpose heap, with the double whammy of reducing use of (and thus lock contention) the main heap, and would probably improve performance significantly more than any optimisation you could do in the main heap locking.

devc1 · **Posted:** Sun Aug 06, 2023 10:00 am

no I already remade the heap manager alot, everytime I come up with some idea to make it faster.

Now, it's time to optimize it !

I'm using a concept that I invented called the heap image, where you can share a heap between multiple images and use fixed alignment allocations. so for example if AHCI does not support 64 bit, I can go to Image2 that only contains heaps that are below 4gb, where Image 1 shares the heaps with Image 2 and adds heaps > 4gb.
This heap sharing will also benefit me when I decide to implement NUMA, so that each processor searches for memory closer to it.

But you gave me a great idea, I will add local thread heap images, so that you can take a portion of memory and manage it without worying about locks.

I will also use trees that resemble paging tables, to have a fast free function (In cases like a page allocator), but for normal heaps i will include an option where the heap descriptor is included with the heap and we will use much less memory space.

- The only con with this, which does not even seem like a con is that you will have to specify minAddr and maxAddr so the heap manager can optimize memory usage and not bloat the computer.

for e.g. having a tree that represents heaps on a 16GB computer like mine, better use a tree that represents only 16GB of memory divided by page size, comes out at 4 million entries to be represented which is way efficient than representing a whole 256TB Address space, or at least the 128TB Upper half of system memory.

I will also need this when implementing the VMSVGA Driver to allocate regions in the VRAM, because I want to learn abit about 3D before making a subsystem similar to DirectX/OpenGL and a compositor.

What do u guys think ?

OSDev.org

Guaranteed atomic instructions

Who is online