(Solved) Is DMA cache coherent?
Posted: Fri Jul 21, 2017 12:18 pm
I am reading a paper (hopefully) describing the Linux graphics stack, and the following passage confuses me a little (my emphasis):
Notice that these caching modes apply to the CPU only, the GPU accesses are not directly affected by the current caching mode. However, when the GPU has to access an area of memory which was previously filled by the CPU, uncached modes ensure that the memory writes are actually done, and are not pending sitting in a CPU cache. Another way to achieve the same effect is the use of cache flushing instructions present on some x86 processors (like cflush). However this is less portable than using the caching modes. Yet another (portable) way is the use of memory barriers, which ensures that pending memory writes have been committed to main memory before moving on.
I am confused that mfence is suggested as a substitute for clflush for DMA coherence on x86. Admittedly, I am not yet familiar with such details of OS programming, so it may be an obvious question, but it prompted me to conduct a brief google search.
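To make my confusion concrete, here is how I picture the two approaches from the quoted passage in user-space code. The buffer, its size, and the "start the transfer" step are placeholders of mine, and I may well be misreading the semantics:

```c
/* Illustration only: the two "make the CPU writes visible" strategies from
 * the quoted passage, as I currently understand them. */
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64

static uint8_t dma_buf[4096];   /* pretend the device will read this region */

/* Option 1: explicitly write the dirty lines back with CLFLUSH. After the
 * flushes (ordered by a fence), the data is in RAM even if the device's
 * DMA does not snoop the CPU caches. */
static void flush_buffer(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    for (size_t off = 0; off < len; off += CACHE_LINE)
        _mm_clflush(p + off);
    _mm_mfence();   /* order the flushes before the doorbell/MMIO write */
}

/* Option 2: only a memory barrier. MFENCE orders this CPU's own loads and
 * stores, but as far as I can tell it does not write dirty cache lines back
 * to memory, so it would only be enough if the DMA snoops the caches. */
static void barrier_only(void)
{
    _mm_mfence();
}

int main(void)
{
    memset(dma_buf, 0xab, sizeof(dma_buf));   /* CPU fills the buffer */
    flush_buffer(dma_buf, sizeof(dma_buf));   /* or just barrier_only(); */
    /* ...program the (imaginary) device and start the transfer here... */
    return 0;
}
```

My (possibly wrong) reading is that option 2 only works if the device snoops the caches, which is exactly the property I am trying to confirm.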
I found a 10-year-old discussion on osdev, which seems to have ended somewhat inconclusively. The gist of it is that SMP cache coherence and PCI bus mastering are probably integrated, but as I said, that seems to have been left to empirical validation.
I also found discussions that pointed to the Intel Data Direct I/O technology as the mechanism responsible for DMA coherence. From skimming it, it seems to populate the cache directly (skipping memory) rather than just invalidate it, not to mention that it may be very model-specific. It sounds like an upgrade to an earlier coherence mechanism rather than the pioneering technology in this regard.
A remark in this SO response (somewhere in the middle) indicated that some Intel generations use the L4 cache for this purpose:
This even lets Skylake's eDRAM L4 (not available in any desktop CPUs) work as a memory-side cache (unlike Broadwell, which used it as a victim cache for L3 IIRC), sitting between memory and everything else in the system, so it can even cache DMA.
Here it was suggested that Nehalem introduced changes to the architecture which routed the I/O traffic through the CPU interconnect.
Another SO post discusses a PCIe feature (apparently exposed in the Linux kernel) that turns on cache snooping for a DMA request.
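For what it's worth, my current mental model is that on the Linux side all of this is hidden behind the streaming DMA-mapping API, so a driver never flushes caches by hand. A rough sketch, with a made-up device and buffer (not taken from the paper or the posts above):

```c
/* Sketch of the streaming DMA-mapping API as I understand it; dev, buf and
 * len are placeholders. On cache-coherent buses the cache maintenance can
 * be a no-op; on non-coherent ones it flushes/invalidates the CPU caches. */
#include <linux/dma-mapping.h>
#include <linux/errno.h>

static int send_to_device(struct device *dev, void *buf, size_t len)
{
    dma_addr_t handle;

    /* The CPU has filled buf; hand it to the device. On a non-coherent
     * system, mapping it DMA_TO_DEVICE is what writes the dirty cache
     * lines back to memory. */
    handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
    if (dma_mapping_error(dev, handle))
        return -ENOMEM;

    /* ...program the device with 'handle' and start the transfer... */

    /* When the device is done, return ownership of the buffer to the CPU. */
    dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
    return 0;
}
```

If DMA really is coherent on x86, I would expect that cache maintenance to amount to nothing there and to be real flushes on non-coherent ARM/MIPS-style systems, which is part of why I am asking about the non-x86 situation.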
I know those are random sources, but this was only a preliminary search. If someone could just give me a broad-strokes answer off the top of their head, I would be grateful. If not, I will find out on my own (eventually). I want to get an idea of the trend, for mainstream CPUs at least. If you are aware of the state of affairs on non-x86 architectures, I would appreciate hearing about it.
P.S. I understand that kernel memory can use the PTEs, the memory type range registers, or the page attribute table to exclude itself from the caches for the long haul, if the kernel so chooses. For user-space mapped files, however, caching should generally be enabled. Therefore, my understanding is that either a clflush or an mfence has to be performed before the operation (the latter in combination with the PCI/PCIe facilities), and I expect the choice to impact I/O stack performance.
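And for the kernel-memory route from the P.S., I imagine it looks something like this on x86 Linux (only a sketch of the PAT/PTE idea, not tested):

```c
/* Sketch: take a kernel page out of the caches via the page attributes,
 * so no clflush is needed before a device reads it. Illustrative only. */
#include <linux/gfp.h>
#include <linux/errno.h>
#include <asm/set_memory.h>

static unsigned long buf;

static int alloc_uncached_page(void)
{
    buf = __get_free_page(GFP_KERNEL);
    if (!buf)
        return -ENOMEM;

    /* Switch the page's PTE memory type to uncacheable (via the PAT). */
    if (set_memory_uc(buf, 1)) {
        free_page(buf);
        return -EIO;
    }
    return 0;
}

static void free_uncached_page(void)
{
    set_memory_wb(buf, 1);   /* restore the normal write-back attribute */
    free_page(buf);
}
```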
Thanks.