Brendan, your technical points (i.e. memcpy_small() and memcpy_large() will certainly outperform a single memcpy()) are valid, however, they miss the point of software design. There are large amounts of existing code that use memcpy() to copy small buffers; similarly, there are large amounts of code that use memcpy() to copy large buffers. You do not control this code. Developers have the expectation that memcpy() will perform reasonably well on any size. By making memcpy() deliberately bad for half of the use cases you're just screwing your (potential) users without any real benefit. "I made the code deliberately bad so that software writers are forced to adapt" is not something that you can sell to users; they will just switch to a libc that is not broken (and does respect common expectations) or just abandon your OS altogether.
People don't expect memcpy() to be optimal in very situation, but they expect it to always be reasonably fast.
Other assertions in this thread are also unfounded without further proof: For example, the use of paging tricks is only becoming more expensive with increasing clock speeds (as TLB size does not scale as well as memory bandwidth) and probably not worth it in many cases. Copying gigabyte-sized buffers can often be faster than any paging trick you can do.
Memset implementation optimizing ... to itself
Re: Memset implementation optimizing ... to itself
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
Re: Memset implementation optimizing ... to itself
Hi,
You're are right though - deliberately making it slower is silly. For huge memory copies it would be a lot more effective to terminate the process with a "the developer of this executable is incompetent" error message.
Cheers,
Brendan
Yes, there's large amounts of existing poor quality trash that should have been discarded and replaced (as part of some kind of "continual quality improvement" scheme, for the user's benefit), but nobody really writes better quality replacements because there isn't enough incentive to bother because the poor quality trash still "works"; so end users just end up suffering from poor quality trash for many decades.Korona wrote:Brendan, your technical points (i.e. memcpy_small() and memcpy_large() will certainly outperform a single memcpy()) are valid, however, they miss the point of software design. There are large amounts of existing code that use memcpy() to copy small buffers; similarly, there are large amounts of code that use memcpy() to copy large buffers. You do not control this code. Developers have the expectation that memcpy() will perform reasonably well on any size. By making memcpy() deliberately bad for half of the use cases you're just screwing your (potential) users without any real benefit. "I made the code deliberately bad so that software writers are forced to adapt" is not something that you can sell to users; they will just switch to a libc that is not broken (and does respect common expectations) or just abandon your OS altogether.
You're are right though - deliberately making it slower is silly. For huge memory copies it would be a lot more effective to terminate the process with a "the developer of this executable is incompetent" error message.
How many operating systems have switched from paging tricks to "memcpy()" for their implementation of "fork()"?Korona wrote:Other assertions in this thread are also unfounded without further proof: For example, the use of paging tricks is only becoming more expensive with increasing clock speeds (as TLB size does not scale as well as memory bandwidth) and probably not worth it in many cases. Copying gigabyte-sized buffers can often be faster than any paging trick you can do.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Memset implementation optimizing ... to itself
While nobody is preventing you to just crash userspace programs if they don't make optimal use of the CPU, no user will continue to run your OS if that happens. That's something I'm not going to accept for my OS but everyone has to decide which trade offs he is going to take anyway. But sure, go ahead and implement that. Might be more effort than a reasonably fast memcpy(), though.
That is taking my statement out of context - I never mentioned fork(). I'm talking about zero-copy vs. copy-based I/O. For network I/O or other complex I/O devices, copying I/O is often faster than zero-copy I/O. The same holds true for copying garbage collection, e.g. as implemented in VMs or in special-purpose high-performance memory allocators.Brendan wrote:How many operating systems have switched from paging tricks to "memcpy()" for their implementation of "fork()"?
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
Re: Memset implementation optimizing ... to itself
Hi,
No user would ever see it happen because no developer would be stupid enough to create native applications designed for my OS that always crash.
For copying a gigabyte-sized buffer, one paging trick you can do (assuming 64-bit 80x86) is to copy an 8 byte PDPT entry into a previously "not present" PDPT entry (so there's no need to invalidate TLBs) and set it as "read only". Show me how reading 1 GiB from RAM and writing 1 GiB to RAM (and touching all of the data and potentially suffering from thousands of TLB misses for both source and destination while you're doing this) can be faster.
Cheers,
Brendan
You're focusing on porting the same software that's available for free on every "*nix like" distro and ensuring that there's never any reason for anyone to bother writing native applications for your OS and never any reason for anyone to bother switching to your OS. That's not something I'm going to accept for my OS.Korona wrote:While nobody is preventing you to just crash userspace programs if they don't make optimal use of the CPU, no user will continue to run your OS if that happens. That's something I'm not going to accept for my OS but everyone has to decide which trade offs he is going to take anyway. But sure, go ahead and implement that. Might be more effort than a reasonably fast memcpy(), though.
No user would ever see it happen because no developer would be stupid enough to create native applications designed for my OS that always crash.
For high speed networking interfaces, both Windows and Linux (and NICs themselves) implemented a "per-connection network queues in user-space" approach to avoid kernel overhead and avoid copy-based I/O; and there's no sign of memory mapped files (and all the paging tricks they use) changing in any well known OS. For garbage collection, if you care about performance the first step is to make sure you're not using a garbage collector.Korona wrote:That is taking my statement out of context - I never mentioned fork(). I'm talking about zero-copy vs. copy-based I/O. For network I/O or other complex I/O devices, copying I/O is often faster than zero-copy I/O. The same holds true for copying garbage collection, e.g. as implemented in VMs or in special-purpose high-performance memory allocators.Brendan wrote:How many operating systems have switched from paging tricks to "memcpy()" for their implementation of "fork()"?
For copying a gigabyte-sized buffer, one paging trick you can do (assuming 64-bit 80x86) is to copy an 8 byte PDPT entry into a previously "not present" PDPT entry (so there's no need to invalidate TLBs) and set it as "read only". Show me how reading 1 GiB from RAM and writing 1 GiB to RAM (and touching all of the data and potentially suffering from thousands of TLB misses for both source and destination while you're doing this) can be faster.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Memset implementation optimizing ... to itself
I actually didn't pay much attention to the write-though option. But I assume that it is intended primarily for MMIO and DMA. For general purpose transfers, the writes to memory will be synchronous and potentially more numerous than usual. And this is not what AMD does with its L1D write-through.Brendan wrote:Since 80386 Intel's caches have been controlled by global flags in CR0 and "per page" flags in page table entries that can be used to either disable caches or disable write-back/enable write-through. Then they added MTRRs (in Pentium) to allow caching to be controlled on a "per physical range" basis, then added support for write combining (in Pentium 4 I think). Of course other 80x86 CPU manufacturers (e.g. AMD, VIA) copied everything and support all of it; but Cyrix had something like MTRRs before Intel did (with a different name and different/incompatible access mechanism), and AMD added some extra stuff to MTRRs (mostly shifting some of the memory controller configuration from chipset into MTTRs).
The result of all of this is that for all 8x086 made in the last 10 years or more; there are multiple ways to configure an area as either "write-back" (default for RAM), "write-through", "write combining" or "uncached" (default for memory mapped devices).
AMD Bulldozer to my understanding flushes the changes in L1D asynchronously. Specifically, it uses a buffer of sorts (write coalescing cache) that sends the recent changes to L2 while the core and L1D continue communicating. Since L2 is write-back, it might stall on write miss, and L3 might in turn stall as well, but they wont block the the core directly. The overall effect is that the write-out to RAM can be done in the breather room between the memory modification bouts, assuming that the program is designed to utilize this kind of access patterns. Of course, the memory traffic is buffered only to a limited extent, which is 4K worth. Even less, if you trigger set collisions in this write coalescing cache. The overall hierarchy is not write-through, but rather something hybrid.Brendan wrote:That's only talking about (e.g.) the implementation of the L1 cache, not the overall behaviour (including L2 and L3). For those AMD CPUs, if a page uses "write-back caching" then the L1 cache will "write through" to L2 (but will not "write through" to L3 or RAM).
I used AOT in the sense of prepackaged code. But even when building on the client machine, it will be a little inconvenient to exploit the precise hardware characteristics. The software will have to be recompiled after any configuration change - such as a CPU change.Brendan wrote:That depends (e.g. consider something like Gentoo Linux, where almost everything is AOT compiled from source when installed). In this case it's not the hardware information that libraries lack, it's usage information (will the caller be copying aligned data, and how much, and how often).
As I hinted in my post, macros and generics can offer more versatile compile-time code parametrization. With language features such as templates, a source library can cross-breed strategies controlled by different instantiating parameters. Some of the parameters can even be deduced from the type system and the argument values. This means that many usage hints can be combined together, controlled through type traits, etc. The developer can also use profile driven optimization and link time optimization to reduce the need for explicit hinting. To be clear, I am not making a covert case for open-source development here, although you will need at least some kind of intermediate code or obfuscated source code to do those things.
If the software exhibits performance artifacts and that causes the OS to loose traction, it will even further reduce your ability to set the standards. Writing software is usually about attaining the best trade-off between cost and functionality. The platform can raise the quality by offering better facilities, but shouldn't forcefully constrain the developer's choice, because it doesn't have the complete picture, of the budgeting constraints, the user's requirements, the long term project goals, etc.Brendan wrote:If the process doesn't care about performance and doesn't use things like memory mapped files, etc; then they deserve whatever (poor) performance they get; and I'd argue that deliberately making performance as bad as possible is a good way to encourage them to write better software (and/or encourage users to replace the bad software with better software) and improve performance in the long run.
CLZERO will avoid the read, but it will still zero the cache line, which is redundant if you plan to overwrite the data. It is still much cheaper then a memory read or L2 read obviously. CLFLUSH discards the line, but after flushing it. I would like to be able to mark the data as de-initialized, i.e. as trimmed from the set. For example, if I allocate memory and deallocate it, I want to be able to mark it as invalid without flushing it.Brendan wrote:For discarding individual cache lines there's CLFLUSH.
Writing to part of a cache line (e.g. writing 8 bytes in a 64 byte cache line) means that the unmodified bytes must be read. The only cases where read-for-write can be avoided is when the stored data is not cached ("uncached", "write combining", non-temporal stores); or when the entire cache line is being modified (CLZERO, "rep movsb/d/q" and AVX512).
My confusion was deeper. I was under the impression that something like write-combining may be working under the hood even for cacheable stores, with the aim to elide reads of lines that will be completely overwritten. Store buffers serve a different purpose, so I thought that write-combining buffers are used. The idea was to accumulate cacheable writes, postponing the read-for-write until the sequence terminates. If it terminates with the cache line partially filled, the read would commence, but if it terminates with the cache line completely overwritten, the read would be elided. The problem is, of course, that if the line is not fully replaced and the last write is followed by a read to another cache line, you have twice the latency until the requests complete. You need explicit instructions to make this happen efficiently.Brendan wrote:Why?
You do a bunch of little writes to a "write combining" area, and they get stored into the CPU's write combining buffer (ignoring cache) where they're combined if possible. Later (when CPU needs to evict/flush part of the write combining buffer, hopefully after you've finished writing to that area) CPU tries to store the "combined writes" using a small number of larger stores and knows exactly how large the store is (and knows if it does/doesn't fill an entire cache line).
Re: Memset implementation optimizing ... to itself
The ELF format apparently has a GNU extension, which enables dynamic selection of the function to be used at run-time, while performing the lazy binding. There is documentation here and some notes here. Not resurrecting the thread.simeonz wrote:So, it does report ERMS. I was looking at stale information. That is pretty good. If you could dynamically choose the code to use (say, at boot or load time), SSE or trivial, that would be best. I will reiterate that according to some insider source, just because the ERMS optimization exists in a range of models, it doesn't mean that it is equally efficiently implemented in microcode in all of them. But I doubt that trusting it will be much worse than manually aligned SSE transfers.Octocontrabass wrote:At least memcpy and memset should perform pretty well on CPUs that report ERMS through CPUID.