Reasons for false/true sharing?

simeonz · Post by **simeonz** » Thu Sep 22, 2016 9:39 am

I am reading an article on LWN here:
https://lwn.net/Articles/531254/

I will quote one paragraph, which discusses false sharing and made me reassess my understanding of the MESI protocol for x86 cpus:

Kernel code will acquire a lock to work with (and, usually, modify) a structure's contents. Often, changing a field within the protected structure will require access to the same cache line that holds the structure's spinlock. If the lock is uncontended, that access is not a problem; the CPU owning the lock probably owns the cache line as well. But if the lock is contended, there will be one or more other CPUs constantly querying its value, obtaining shared access to that same cache line and depriving the lock holder of the exclusive access it needs. A subsequent modification of data within the affected cache line will thus incur a cache miss. So CPUs querying a contended lock can slow the lock owner considerably, even though that owner is not accessing the lock directly.

I did not fully appreciate the performance implications of reader/writer asymmetric false contention until now. I wonder, why is this penalty necessary?

Let me illustrate with example. Suppose that objects "A" and "B" reside on the same cache line. core0 reads "A", core1 writes "B", eventually core0 reads "A" again. The invalidation caused from core1 will eventually mark the line invalid on core0. If I understand the above paragraph correctly, the next read on core0 will be stalled to fetch the new contents of the cache line. Why? I agree that the line must be fetched, but shouldn't this be done in the background. core0 is not guaranteed ordering anyway (due to invalidation queue optimizations). If that data is unchanged (which may be by design of the software), why stall the reads unnecessarily? What new guarantee will this achieve for core0?

As to core1 (the modifying core), thrashing seems to be MESI artifact. I am left with the impression that the MOESI protocol can avoid switching from modified to shared by keeping "Owned" state at all times. In principle speaking, broadcasting contents should not block the writer unless sequential ordering is desired. Only acquisition of the cache line to avoid conflict with concurrent updates to other parts has reason to stall. Even then, and even for MESI, with write buffering and speculative execution, I expected the performance effect to be largely mitigated.

What do you think? Do you think that there are reasons to degrade performance in such asymmetric (reader/writer) false sharing scenario when sequential consistency is not expected by the memory model? Is this some kind of architectural shortcoming or optimization?

dchapiesky · Post by **dchapiesky** » Tue Jan 03, 2017 10:53 pm

To be honest it looked more like Intel moving the cache towards being to handle Transactional Memory, when and if the Sun Rocks processor ever became a potential "we got it, intel doesn't" problem.

Even though sequential consistency was not expected at the time the chip was designed the hooks for becoming multi-socket transactionaly consistent were added for software based transactional memory.

Pure speculation on my part. (your post really made me think hard... thanks for it)

simeonz · Post by **simeonz** » Tue May 30, 2017 11:41 pm

Thanks for the response. I haven't visited in a while and didn't catch replies.

Your hunch may be right. I don't see why would such premature design be taken to production, but they could have kept draft ideas to reduce R&D and validation costs for all I know. Now, if Intel supported mixing with old generation CPUs on multi-processor boards and wanted to keep backwards compatibility for the cache coherence protocol, I would understand this. But heterogeneous pairing of CPUs it is not supported AFAIK. Anyway, all sorts of considerations may have come into play. Your speculation offers at least some explanation.

Brendan · Post by **Brendan** » Wed May 31, 2017 2:11 am

Hi,

simeonz wrote:I am reading an article on LWN here:
https://lwn.net/Articles/531254/

I will quote one paragraph, which discusses false sharing and made me reassess my understanding of the MESI protocol for x86 cpus:
Kernel code will acquire a lock to work with (and, usually, modify) a structure's contents. Often, changing a field within the protected structure will require access to the same cache line that holds the structure's spinlock. If the lock is uncontended, that access is not a problem; the CPU owning the lock probably owns the cache line as well. But if the lock is contended, there will be one or more other CPUs constantly querying its value, obtaining shared access to that same cache line and depriving the lock holder of the exclusive access it needs. A subsequent modification of data within the affected cache line will thus incur a cache miss. So CPUs querying a contended lock can slow the lock owner considerably, even though that owner is not accessing the lock directly.
I did not fully appreciate the performance implications of reader/writer asymmetric false contention until now. I wonder, why is this penalty necessary?

Let me illustrate with example. Suppose that objects "A" and "B" reside on the same cache line. core0 reads "A", core1 writes "B", eventually core0 reads "A" again. The invalidation caused from core1 will eventually mark the line invalid on core0. If I understand the above paragraph correctly, the next read on core0 will be stalled to fetch the new contents of the cache line. Why?

The CPU only knows "something in that cache line was changed" (and doesn't know what) so it has to fetch the whole cache line (in case the part that was changed actually matters). Keeping track of which bytes of each cache lines were/weren't changed (so that false sharing can be avoided) would completely ruin performance for almost everything (for the sake of a negligible improvement for a rare corner-case) due to massive increase in the amount of cache line coherency traffic and a huge efficiency reduction in the caches themselves (e.g. most of the cache's silicon spent on masks and/or tags and not used for cached data).

Cheers,

Brendan

simeonz · Post by **simeonz** » Wed May 31, 2017 3:30 am

Brendan wrote:The CPU only knows "something in that cache line was changed" ...

Agreed.

Brendan wrote:(and doesn't know what) so it has to fetch the whole cache line ...

Not in case of reads, IMO.

Technically, it needs to, but that is artifact of the MESI coherence scheme, not necessity of the memory ordering specification of the x86 ISA. The function of coherence for x86-based CPUs is only to guarantee that writes to non-overlapping aligned locations, possibly residing on the same cache line, do not suffer from the lost update syndrome. Aside from that you get some ordering derived from the write buffer and invalidation queues, and some data dependence ordering, but reads will retrieve arbitrary ancient data. Noone knows how old.

So my question was - when one core is notified that a cache line was modified in another core, why wouldn't it just keep the line with some flag that it is stale, and deliver reads from the stale data. Why stall a read on one core after write on another, when your ISA makes no promise that non-atomic instructions get fresh data anyway. The situation with writes is different, because you have no way to "merge" the changes after concurrent updates from two cores. So the hypothetical "stale" flag would have to stall writes in any case, but not reads.

In fact, the issue has nothing to do with false sharing. Even if the cache had word granularity, why would modifications on one core stall conflicting reads on other cores as per the memory ordering of x86. Granted, it would make some sense to stall, because useful work here would be corner case, but if the software designer did not introduce atomic operation, than it did not care to resolve the ordering, whether due to a bug or a deliberate choice. In the case of false sharing between a writer and a reader core, however, the code author could know that the write and the read are on non-overlapping location at design time. The code gets what it bargains for. So, why stall read instructions on cache lines that are currently altered, if that contributes nothing to the memory ordering?

Brendan · Post by **Brendan** » Wed May 31, 2017 6:19 am

Hi,

simeonz wrote:So my question was - when one core is notified that a cache line was modified in another core, why wouldn't it just keep the line with some flag that it is stale, and deliver reads from the stale data. Why stall a read on one core after write on another, when your ISA makes no promise that non-atomic instructions get fresh data anyway. The situation with writes is different, because you have no way to "merge" the changes after concurrent updates from two cores. So the hypothetical "stale" flag would have to stall writes in any case, but not reads.

The "80x86 ISA" does guarantee that writes are observed to have occurred in program order (excluding special corner cases that deliberately bypass that guarantee - e.g. non-temporal stores).

For example, one CPU can do:

Code: Select all

    mov dword [a],0x12345678
    mov dword [b],1

And another CPU can do:

Code: Select all

.wait:
    cmp dword [b],1
    jne .wait

    mov eax,[a]        ;This is guaranteed to be the new data, because writes are guaranteed to be observed in the correct order

This applies to everything (spinlocks, mutexes, semaphores and lockless algorithms). For example, one CPU does a write that releases a lock, and if another CPU sees that the lock was freed then all writes to data that the lock protected (done during the critical section) are guaranteed to have happened.

If you break this guarantee, then you break almost everything that could be used to ensure synchronised access to data shared by multiple CPUs/threads. The only work-around that could make it vaguely possible to write useful code (for multi-CPU) would be to resort to explicit cache line flushes and fences everywhere, but that would have the same "false sharing" problem, so the end result would be "exactly the same except more bugs and more bloat".

Cheers,

Brendan

simeonz · Post by **simeonz** » Wed May 31, 2017 7:42 am

Brendan wrote:The "80x86 ISA" does guarantee that writes are observed to have occurred in program order (excluding special corner cases that deliberately bypass that guarantee - e.g. non-temporal stores).

True. I always thought (incorrectly) that since write buffers and invalidation queues are processed in order, that provides certain ordering in and of itself. I believed that the only real price for no-writes-reordered guarantee is executing the write buffer changes in order (and the consequent parasitic stalls for writes). But in your example, I see that reading old data, while simultaneously triggering a read request for the new contents of the cache line on the interconnect will reorder the writes. So blocking reads on modified cache lines is also required to make sure that writes stay ordered.

Thanks.

OSDev.org

Reasons for false/true sharing?

Reasons for false/true sharing?

Re: Reasons for false/true sharing?

Re: Reasons for false/true sharing?

Re: Reasons for false/true sharing?

Re: Reasons for false/true sharing?

Re: Reasons for false/true sharing?

Re: Reasons for false/true sharing?