Brendan wrote:Since 80386 Intel's caches have been controlled by global flags in CR0 and "per page" flags in page table entries that can be used to either disable caches or disable write-back/enable write-through. Then they added MTRRs (in Pentium) to allow caching to be controlled on a "per physical range" basis, then added support for write combining (in Pentium 4 I think). Of course other 80x86 CPU manufacturers (e.g. AMD, VIA) copied everything and support all of it; but Cyrix had something like MTRRs before Intel did (with a different name and different/incompatible access mechanism), and AMD added some extra stuff to MTRRs (mostly shifting some of the memory controller configuration from chipset into MTRRs).
The result of all of this is that for all 80x86 CPUs made in the last 10 years or more, there are multiple ways to configure an area as either "write-back" (default for RAM), "write-through", "write combining" or "uncached" (default for memory mapped devices).
I actually didn't pay much attention to the write-through option, but I assume that it is intended primarily for MMIO and DMA. For general-purpose transfers, the writes to memory will be synchronous and potentially more numerous than usual. And this is not what AMD does with its L1D write-through.
Brendan wrote:That's only talking about (e.g.) the implementation of the L1 cache, not the overall behaviour (including L2 and L3). For those AMD CPUs, if a page uses "write-back caching" then the L1 cache will "write through" to L2 (but will not "write through" to L3 or RAM).
AMD Bulldozer, to my understanding, flushes the changes in L1D asynchronously. Specifically, it uses a buffer of sorts (the write coalescing cache) that sends the recent changes to L2 while the core and L1D continue communicating. Since L2 is write-back, it might stall on a write miss, and L3 might in turn stall as well, but they won't block the core directly. The overall effect is that the write-out to RAM can be done in the breathing room between the bouts of memory modification, assuming that the program is designed to use this kind of access pattern. Of course, the memory traffic is buffered only to a limited extent, which is 4K worth, or even less if you trigger set collisions in the write coalescing cache. The overall hierarchy is not write-through, but rather something hybrid.
Brendan wrote:That depends (e.g. consider something like Gentoo Linux, where almost everything is AOT compiled from source when installed). In this case it's not the hardware information that libraries lack, it's usage information (will the caller be copying aligned data, and how much, and how often).
I used AOT in the sense of prepackaged code. But even when building on the client machine, it will be a little inconvenient to exploit the precise hardware characteristics. The software will have to be recompiled after any configuration change - such as a CPU change.
As I hinted in my post, macros and generics can offer more versatile compile-time code parametrization. With language features such as templates, a source library can cross-breed strategies controlled by different instantiating parameters. Some of the parameters can even be deduced from the type system and the argument values. This means that many usage hints can be combined together, controlled through type traits, etc. The developer can also use profile-guided optimization and link-time optimization to reduce the need for explicit hinting. To be clear, I am not making a covert case for open-source development here, although you will need at least some kind of intermediate code or obfuscated source code to do those things.
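To make the template idea a bit more concrete, here is a minimal C++17 sketch of a copy routine whose strategy is chosen at compile time from the element type and a caller-supplied usage hint. The names `small_copies`, `bulk_copies`, `pick_strategy` and `hinted_copy`, and the 256-byte threshold, are all made up for illustration and not taken from any real library.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical usage hints: the caller states how much data it typically copies.
struct small_copies { static constexpr std::size_t typical_bytes = 64; };
struct bulk_copies  { static constexpr std::size_t typical_bytes = 1u << 20; };

enum class strategy { element_loop, block_copy };

// Combines a type trait with the usage hint; the threshold is an arbitrary assumption.
template <class T, class Hint>
constexpr strategy pick_strategy() {
    if constexpr (!std::is_trivially_copyable_v<T>) {
        return strategy::element_loop;   // must go through T's copy assignment
    } else if constexpr (Hint::typical_bytes < 256) {
        return strategy::element_loop;   // small copies: skip the call overhead
    } else {
        return strategy::block_copy;     // large trivially copyable data
    }
}

// Copies n elements and reports which strategy was instantiated.
template <class Hint, class T>
strategy hinted_copy(const T* src, T* dst, std::size_t n) {
    constexpr strategy s = pick_strategy<T, Hint>();
    if constexpr (s == strategy::block_copy) {
        std::memcpy(dst, src, n * sizeof(T));
    } else {
        for (std::size_t i = 0; i < n; ++i) dst[i] = src[i];
    }
    return s;
}
```

A call such as `hinted_copy<bulk_copies>(a, b, n)` deduces `T` from the arguments, so only the hint has to be spelled out. The same pattern could take an alignment or frequency hint as a further template parameter.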
Brendan wrote:If the process doesn't care about performance and doesn't use things like memory mapped files, etc; then they deserve whatever (poor) performance they get; and I'd argue that deliberately making performance as bad as possible is a good way to encourage them to write better software (and/or encourage users to replace the bad software with better software) and improve performance in the long run.
If the software exhibits performance artifacts and that causes the OS to lose traction, it will even further reduce your ability to set the standards. Writing software is usually about attaining the best trade-off between cost and functionality. The platform can raise the quality by offering better facilities, but shouldn't forcefully constrain the developer's choice, because it doesn't have the complete picture of the budgeting constraints, the user's requirements, the long-term project goals, etc.
Brendan wrote:For discarding individual cache lines there's CLFLUSH.
Writing to part of a cache line (e.g. writing 8 bytes in a 64 byte cache line) means that the unmodified bytes must be read. The only cases where read-for-write can be avoided are when the stored data is not cached ("uncached", "write combining", non-temporal stores); or when the entire cache line is being modified (CLZERO, "rep movsb/d/q" and AVX512).
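As an illustration of the non-temporal store case, a minimal sketch using the SSE2 intrinsic `_mm_stream_si128` follows. The function name `fill_streaming` and the `HAVE_NT_STORES` macro are hypothetical, the buffer is assumed to be 16-byte aligned with a word count that is a multiple of 4, and a plain loop is used as a fallback on non-x86 targets. Whether the read is actually avoided depends on the CPU combining the stores into full lines, which this code cannot verify.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_NT_STORES 1
#else
#define HAVE_NT_STORES 0
#endif

// Fills `count` 32-bit words at `dst` with `value`.
// Assumption: dst is 16-byte aligned and count is a multiple of 4.
void fill_streaming(std::uint32_t* dst, std::size_t count, std::uint32_t value) {
#if HAVE_NT_STORES
    const __m128i v = _mm_set1_epi32(static_cast<int>(value));
    for (std::size_t i = 0; i < count; i += 4) {
        // Non-temporal store: bypasses the cache via the write combining buffers.
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
    }
    // Make the streamed data globally visible before any later stores.
    _mm_sfence();
#else
    // Portable fallback: ordinary cacheable stores.
    for (std::size_t i = 0; i < count; ++i) dst[i] = value;
#endif
}
```

Writing whole 64-byte lines back to back gives the hardware the best chance of emitting full-line writes without a prior read.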
CLZERO will avoid the read, but it will still zero the cache line, which is redundant if you plan to overwrite the data. It is still much cheaper than a memory read or L2 read, obviously. CLFLUSH discards the line, but only after writing it back. I would like to be able to mark the data as de-initialized, i.e. as trimmed from the set. For example, if I allocate memory and deallocate it, I want to be able to mark it as invalid without flushing it.
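A small sketch of the CLFLUSH behaviour described here, using the SSE2 intrinsics `_mm_clflush` and `_mm_mfence`. The function name `write_flush_read` and the `HAVE_CLFLUSH` macro are hypothetical, and on non-x86 targets the flush is simply skipped. The code can only show that the data is not lost by the flush; it cannot observe the write-back itself.

```cpp
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_CLFLUSH 1
#else
#define HAVE_CLFLUSH 0
#endif

// Writes a value, flushes the cache line holding it, then reads it back.
std::uint32_t write_flush_read(std::uint32_t* line, std::uint32_t value) {
    *line = value;
#if HAVE_CLFLUSH
    _mm_clflush(line);  // a modified line is written back to memory, then invalidated
    _mm_mfence();       // order the flush before the following load
#endif
    return *line;       // refetched from memory; the stored value is still there
}
```

What I am asking for would be the opposite: an invalidation after which the old contents are allowed to be lost, so no write-back traffic is generated.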
Brendan wrote:Why?
You do a bunch of little writes to a "write combining" area, and they get stored into the CPU's write combining buffer (ignoring cache) where they're combined if possible. Later (when CPU needs to evict/flush part of the write combining buffer, hopefully after you've finished writing to that area) CPU tries to store the "combined writes" using a small number of larger stores and knows exactly how large the store is (and knows if it does/doesn't fill an entire cache line).
My confusion was deeper. I was under the impression that something like write-combining may be working under the hood even for cacheable stores, with the aim of eliding reads of lines that will be completely overwritten. Store buffers serve a different purpose, so I thought that write-combining buffers are used. The idea was to accumulate cacheable writes, postponing the read-for-write until the sequence terminates. If it terminates with the cache line partially filled, the read would commence, but if it terminates with the cache line completely overwritten, the read would be elided. The problem is, of course, that if the line is not fully replaced and the last write is followed by a read from another cache line, you have twice the latency until the requests complete. You need explicit instructions to make this happen efficiently.
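The scheme I had in mind can be modelled in a few lines. This is only a toy model of the decision logic, not a description of what any real CPU does; the `PendingLine` type and its members are made up, and a 64-byte line is assumed.

```cpp
#include <cstddef>
#include <cstdint>

// Toy model of one pending 64-byte cache line that accumulates cacheable writes.
struct PendingLine {
    std::uint64_t written_mask = 0;  // bit i set => byte i of the line was overwritten

    // Records a store of `size` bytes at `offset` within the line.
    // Assumption: offset + size <= 64.
    void store(std::size_t offset, std::size_t size) {
        for (std::size_t i = 0; i < size; ++i) {
            written_mask |= std::uint64_t{1} << (offset + i);
        }
    }

    // When the write sequence terminates: the old contents must be fetched
    // only if some byte of the line was left untouched.
    bool needs_read_for_write() const {
        return written_mask != ~std::uint64_t{0};
    }
};
```

Eight 8-byte stores covering offsets 0 to 63 leave `needs_read_for_write()` false, so the read is elided. A single 8-byte store leaves it true, and that is the case where the postponed read adds latency.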