(Solved) Is DMA cache coherent?
Posted: Fri Jul 21, 2017 12:18 pm
I am reading a paper (hopefully) describing the Linux graphics stack, and the following passage confuses me a little (my emphasis):
Notice that these caching modes apply to the CPU only, the GPU accesses are not directly affected by the current caching mode. However, when the GPU has to access an area of memory which was previously filled by the CPU, uncached modes ensure that the memory writes are actually done, and are not pending sitting in a CPU cache. Another way to achieve the same effect is the use of cache flushing instructions present on some x86 processors (like cflush). However this is less portable than using the caching modes. Yet another (portable) way is the use of memory barriers, which ensures that pending memory writes have been committed to main memory before moving on.
I am confused that mfence is suggested as a substitute for clflush for DMA coherence on x86. Admittedly, I am not yet familiar with such details of OS programming, so it may be an obvious question, but it prompted me to conduct a brief google search.
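To make my confusion concrete, here is how I picture the two approaches from the quoted passage in user-space code. The buffer, its size, and the "start the transfer" step are placeholders of mine, and I may well be misreading the semantics:

```c
/* Illustration only: the two "make the CPU writes visible" strategies from
 * the quoted passage, as I currently understand them. */
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64

static uint8_t dma_buf[4096];   /* pretend the device will read this region */

/* Option 1: explicitly write the dirty lines back with CLFLUSH. After the
 * flushes (ordered by a fence), the data is in RAM even if the device's
 * DMA does not snoop the CPU caches. */
static void flush_buffer(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    for (size_t off = 0; off < len; off += CACHE_LINE)
        _mm_clflush(p + off);
    _mm_mfence();   /* order the flushes before the doorbell/MMIO write */
}

/* Option 2: only a memory barrier. MFENCE orders this CPU's own loads and
 * stores, but as far as I can tell it does not write dirty cache lines back
 * to memory, so it would only be enough if the DMA snoops the caches. */
static void barrier_only(void)
{
    _mm_mfence();
}

int main(void)
{
    memset(dma_buf, 0xab, sizeof(dma_buf));   /* CPU fills the buffer */
    flush_buffer(dma_buf, sizeof(dma_buf));   /* or just barrier_only(); */
    /* ...program the (imaginary) device and start the transfer here... */
    return 0;
}
```

My (possibly wrong) reading is that option 2 only works if the device snoops the caches, which is exactly the property I am trying to confirm.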
I found a 10-year-old discussion on osdev, which seems to have ended somewhat inconclusively. The gist of it is that SMP cache coherence and PCI bus mastering are probably integrated, but as I said, that seems to have been left to empirical validation.
I also found discussions that pointed to the Intel Data Direct I/O technology as the mechanism responsible for DMA coherence. From skimming it, it seems to populate the cache directly (skipping memory) rather than just invalidate it, not to mention that it may be very model-specific. It sounds like an upgrade to an earlier coherence mechanism rather than the pioneering technology in this regard.
A remark in this SO response (somewhere in the middle) indicated that some Intel generations use the L4 cache for this purpose:
This even lets Skylake's eDRAM L4 (not available in any desktop CPUs) work as a memory-side cache (unlike Broadwell, which used it as a victim cache for L3 IIRC), sitting between memory and everything else in the system, so it can even cache DMA.
Here it was suggested that Nehalem introduced changes to the architecture which routed the I/O traffic through the CPU interconnect.
Another SO post discusses a PCIe feature (apparently exposed in the Linux kernel) that turns on cache snooping for a DMA request.
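For what it's worth, my current mental model is that on the Linux side all of this is hidden behind the streaming DMA-mapping API, so a driver never flushes caches by hand. A rough sketch, with a made-up device and buffer (not taken from the paper or the posts above):

```c
/* Sketch of the streaming DMA-mapping API as I understand it; dev, buf and
 * len are placeholders. On cache-coherent buses the cache maintenance can
 * be a no-op; on non-coherent ones it flushes/invalidates the CPU caches. */
#include <linux/dma-mapping.h>
#include <linux/errno.h>

static int send_to_device(struct device *dev, void *buf, size_t len)
{
    dma_addr_t handle;

    /* The CPU has filled buf; hand it to the device. On a non-coherent
     * system, mapping it DMA_TO_DEVICE is what writes the dirty cache
     * lines back to memory. */
    handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
    if (dma_mapping_error(dev, handle))
        return -ENOMEM;

    /* ...program the device with 'handle' and start the transfer... */

    /* When the device is done, return ownership of the buffer to the CPU. */
    dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
    return 0;
}
```

If DMA really is coherent on x86, I would expect that cache maintenance to amount to nothing there and to be real flushes on non-coherent ARM/MIPS-style systems, which is part of why I am asking about the non-x86 situation.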
I know those are random sources, but this was only a preliminary search. If someone could just give me a broad-strokes answer off the top of their head, I would be grateful. If not, I will find out on my own (eventually). I want to get an idea of the trend, for mainstream CPUs at least. If you are aware of the state of affairs on non-x86 architectures, I would appreciate hearing about it.
P.S. I understand that kernel memory can use the PTEs, the memory type range registers, or the page attribute table to exclude itself from the caches for the long haul, if the kernel so chooses. For user-space mapped files, however, caching should generally be enabled. Therefore, my understanding is that either a clflush or an mfence has to be performed before the operation (the latter in combination with the PCI/PCIe facilities), and I expect the choice to impact I/O stack performance.
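And for the kernel-memory route from the P.S., I imagine it looks something like this on x86 Linux (only a sketch of the PAT/PTE idea, not tested):

```c
/* Sketch: take a kernel page out of the caches via the page attributes,
 * so no clflush is needed before a device reads it. Illustrative only. */
#include <linux/gfp.h>
#include <linux/errno.h>
#include <asm/set_memory.h>

static unsigned long buf;

static int alloc_uncached_page(void)
{
    buf = __get_free_page(GFP_KERNEL);
    if (!buf)
        return -ENOMEM;

    /* Switch the page's PTE memory type to uncacheable (via the PAT). */
    if (set_memory_uc(buf, 1)) {
        free_page(buf);
        return -EIO;
    }
    return 0;
}

static void free_uncached_page(void)
{
    set_memory_wb(buf, 1);   /* restore the normal write-back attribute */
    free_page(buf);
}
```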
Thanks.