https://lwn.net/Articles/531254/
I will quote one paragraph, which discusses false sharing and made me reassess my understanding of the MESI protocol on x86 CPUs:
"Kernel code will acquire a lock to work with (and, usually, modify) a structure's contents. Often, changing a field within the protected structure will require access to the same cache line that holds the structure's spinlock. If the lock is uncontended, that access is not a problem; the CPU owning the lock probably owns the cache line as well. But if the lock is contended, there will be one or more other CPUs constantly querying its value, obtaining shared access to that same cache line and depriving the lock holder of the exclusive access it needs. A subsequent modification of data within the affected cache line will thus incur a cache miss. So CPUs querying a contended lock can slow the lock owner considerably, even though that owner is not accessing the lock directly."
I did not fully appreciate the performance implications of reader/writer asymmetric false contention until now. I wonder, why is this penalty necessary?
Let me illustrate with an example. Suppose that objects "A" and "B" reside on the same cache line. core0 reads "A", core1 writes "B", and eventually core0 reads "A" again. The invalidation caused by core1's write will eventually mark the line invalid on core0. If I understand the above paragraph correctly, the next read on core0 will stall until the new contents of the cache line are fetched. Why? I agree that the line must be fetched, but shouldn't this be done in the background? core0 is not guaranteed ordering anyway (due to invalidation queue optimizations). If the data being read is unchanged (which may be by design of the software), why stall the reads unnecessarily? What new guarantee does this achieve for core0?
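To make the scenario concrete, here is a toy model of the MESI transitions as I understand them. It is only a sketch: it tracks the state of a single cache line per core and counts transitions, and it deliberately ignores timing, store buffers, invalidation queues and anything else a real x86 implementation does. The names `Line`, `shared_line_trace` and `separate_lines_trace` are made up for this illustration.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class Line:
    """One cache line tracked across several cores (toy MESI, no timing)."""

    def __init__(self, cores):
        self.state = [State.I] * cores
        self.misses = [0] * cores    # accesses that found the line Invalid
        self.upgrades = [0] * cores  # writes that found the line Shared

    def read(self, core):
        if self.state[core] is not State.I:
            return  # M, E or S: read hit, no state change
        self.misses[core] += 1
        holders = [c for c, s in enumerate(self.state) if s is not State.I]
        for c in holders:
            # An M or E holder is downgraded to Shared (M would also write back).
            self.state[c] = State.S
        self.state[core] = State.S if holders else State.E

    def write(self, core):
        if self.state[core] is State.I:
            self.misses[core] += 1
        elif self.state[core] is State.S:
            self.upgrades[core] += 1  # must invalidate the other copies first
        # E -> M is silent, M stays M.
        for c in range(len(self.state)):
            if c != core:
                self.state[c] = State.I
        self.state[core] = State.M


def shared_line_trace(rounds):
    """A and B share one line: core0 reads A, core1 writes B, repeatedly."""
    line = Line(cores=2)
    line.read(0)
    for _ in range(rounds):
        line.write(1)
        line.read(0)
    return line


def separate_lines_trace(rounds):
    """A and B live on different lines: same access pattern as above."""
    line_a, line_b = Line(cores=2), Line(cores=2)
    line_a.read(0)
    for _ in range(rounds):
        line_b.write(1)
        line_a.read(0)
    return line_a, line_b


if __name__ == "__main__":
    shared = shared_line_trace(3)
    print("shared line:   misses", shared.misses, "upgrades", shared.upgrades)
    line_a, line_b = separate_lines_trace(3)
    print("separate lines: misses", line_a.misses, line_b.misses)
```

In this model every read of "A" after a write to "B" is counted as a miss on core0, and every write to "B" after such a read needs an upgrade on core1, while with separate lines each core misses exactly once. The model says nothing about whether the miss has to stall the reader, which is precisely what I am asking about.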
As for core1 (the modifying core), the thrashing seems to be a MESI artifact. I am left with the impression that the MOESI protocol can avoid switching from Modified to Shared by keeping the line in the "Owned" state at all times. In principle, broadcasting the contents should not block the writer unless sequential ordering is desired. Only acquiring the cache line to avoid conflicts with concurrent updates to other parts of it has a reason to stall. Even then, and even for MESI, I expected write buffering and speculative execution to largely mitigate the performance effect.
What do you think? Are there reasons to degrade performance in such an asymmetric (reader/writer) false sharing scenario when sequential consistency is not expected by the memory model? Is this some kind of architectural shortcoming, or is it an optimization?