Interesting that none of you have given the default correct answer to any question of the form, 'Is x faster than y?', namely, "profile them and find out".
Mind you, in this case you'd need to have several use cases under several different test conditions, and probably would want to incorporate them into your regular testing suite to track changes in both the code and the system conditions.
How to speed up memory copying within RAM?
- Schol-R-LEA
- Member
- Posts: 1925
- Joined: Fri Oct 27, 2006 9:42 am
- Location: Athens, GA, USA
Re: How to speed up memory copying within RAM?
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.
Re: How to speed up memory copying within RAM?
Brendan wrote:
No; "rep movs" (of any size) has been optimised to work on entire cache lines for a very long time, and moves 64 bytes when it can (which even includes "swizzling" for modern CPUs, where the CPU can/will read 2 halves of a cache line and write a whole cache line).
mikegonta wrote:
Yes; however the total bytes moved is still the product of ecx and the bytes moved by movs, cache line size notwithstanding. The OP indicated that the sse iterations were 1/8 that of the movs, resulting in 4X as many bytes moved.
For the number of iterations of the loop, who cares? What is important (at least for copying to video display memory, where the PCI bus is always going to be the bottleneck) is the number of writes issued across the PCI bus.
For a simple example, let's assume that each write has a 7-byte header (1 byte for "type", 4 bytes for address, 2 bytes for length) that precedes the data being written. For the SSE version (without write combining), writing 512 bytes ends up being 32 writes (at 16 bytes per write), for a total of 32*7+512 = 736 bytes of PCI bus traffic. For the "rep movsd" version (assuming it is operating on cache lines), writing the same 512 bytes ends up being 8 writes (at 64 bytes per write), for a total of 8*7+512 = 568 bytes of PCI bus traffic. In this case (which is over-simplified), SSE would be almost 30% slower.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: How to speed up memory copying within RAM?
Hi,
Schol-R-LEA wrote:
Interesting that none of you have given the default correct answer to any question of the form, 'Is x faster than y?', namely, "profile them and find out".
Mind you, in this case you'd need to have several use cases under several different test conditions, and probably would want to incorporate them into your regular testing suite to track changes in both the code and the system conditions.
That's the problem with "profile them" - you'd need to do about 20 tests per computer (with different alignments, different amounts being moved, and different source and destination caching policies) on at least 10 different computers (older, newer, NUMA, Intel, AMD, discrete video, integrated video, ...) just to get a rough idea; and if you actually did the 200+ tests you'd just stare at the results for 10 minutes and give up, still without any idea of which was better.
The danger here is that it's just as likely that someone will do bad tests and come to a wrong conclusion. Unfortunately, I've seen this far too many times for "memcpy()" implementations, where they benchmark how quickly it can copy a massive amount of memory (several orders of magnitude larger than the last-level cache) and completely ignore the fact that most of the time "memcpy()" is used for small (less than 4 KiB) copies, where the setup overhead of SSE is far more significant.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
- glauxosdever
- Member
- Posts: 501
- Joined: Wed Jun 17, 2015 9:40 am
- Libera.chat IRC: glauxosdever
- Location: Athens, Greece
Re: How to speed up memory copying within RAM?
Hi,
On topic, though, I think you should read Intel's Optimisation Manual, section 3.7.6. It discusses the efficiency of the different possible implementations of the memcpy(), memmove() and memset() functions.
Regards,
glauxosdever
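[Editor's note] A recurring theme in that part of the manual is dispatching on copy size: tiny copies want minimal setup, mid-size copies can use "rep movsb" (fast on CPUs with ERMSB), and huge copies may warrant non-temporal stores. A hedged sketch of that shape; the thresholds here are invented, not Intel's recommendations:

```c
#include <stddef.h>
#include <string.h>

/* Size-dispatched copy, sketched for illustration only. */
static void *my_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n < 16) {               /* tiny: a byte loop beats any setup cost */
        while (n--)
            *d++ = *s++;
        return dst;
    }
#if defined(__x86_64__)
    if (n < (1u << 20)) {       /* mid-size: "rep movsb" (fast with ERMSB) */
        __asm__ volatile("rep movsb"
                         : "+D"(d), "+S"(s), "+c"(n)
                         :
                         : "memory");
        return dst;
    }
#endif
    return memcpy(dst, src, n); /* large (or non-x86): defer to libc */
}
```

The inline-assembly constraints put the destination in RDI, source in RSI, and count in RCX, which is exactly what `rep movsb` consumes.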
- Schol-R-LEA
- Member
- Posts: 1925
- Joined: Fri Oct 27, 2006 9:42 am
- Location: Athens, GA, USA
Re: How to speed up memory copying within RAM?
Perhaps we need to delineate a core set of bulk copy/compare/delete/update profiles for this purpose, then. Not actual code, of course, but more along the lines of a pattern repository - a list of condition cases, a checklist, a set of base algorithms and data structures that can be used for instrumenting the code to be tested, known hazards in performing the tests, and so on. Something that would make profiling these things less haphazard and arduous, especially the actual analysis of the results.
We could also set up a common set of analysis tools with a set of common data formats, with ways of collecting the data so that it could be sent to the development host. Maybe even some auto-balancing algorithms or heuristics that the OSes themselves could implement, running the tests automatically on boot and adjusting themselves accordingly, though that is going beyond the realm of most hobbyist development.
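[Editor's note] The boot-time auto-tuning idea could look something like the sketch below: time each candidate copy routine once on a representative workload and install the winner behind a function pointer. All names, candidates, and workload sizes here are invented for illustration:

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

static void *copy_byte(void *d, const void *s, size_t n)
{
    char *dp = d;
    const char *sp = s;
    while (n--)
        *dp++ = *sp++;
    return d;
}

static void *copy_libc(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n);
}

/* Candidate table; a real kernel would add rep-movs and SSE variants. */
static copy_fn candidates[] = { copy_byte, copy_libc };

/* Time each candidate and return the fastest for this machine. */
static copy_fn pick_fastest(void)
{
    static char src[4096], dst[4096];
    copy_fn best = candidates[0];
    double best_t = 1e300;

    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        clock_t t0 = clock();
        for (int r = 0; r < 10000; r++)
            candidates[i](dst, src, sizeof src);
        double t = (double)(clock() - t0);
        if (t < best_t) {
            best_t = t;
            best = candidates[i];
        }
    }
    return best;
}
```

The caveats from earlier in the thread still apply: a single 4 KiB workload measured at boot will not capture the alignment, size, and caching-policy variations a real memcpy() sees.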
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.
- MichaelFarthing
- Member
- Posts: 167
- Joined: Thu Mar 10, 2016 7:35 am
- Location: Lancaster, England, Disunited Kingdom
Re: How to speed up memory copying within RAM?
Schol-R-LEA wrote:
Perhaps we need to delineate a core set of bulk copy/compare/delete/update profiles for this purpose, then. Not actual code, of course, but more along the lines of a pattern repository - a checklist and set of base algorithms and data structures which can be used for instrumenting the code to be tested, known hazards in performing the tests, etc. Something that would make profiling these things less haphazard and arduous, especially the actual analysis of the results.
Well, if the user cannot immediately see which algorithm is faster, it might be sensible to weigh the expected saving in time over a few thousand invocations of the program against the time the developers spend profiling their algorithms instead of getting on with the next stage of their OS. [Perhaps that was the point you were making?]
- Schol-R-LEA
- Member
- Posts: 1925
- Joined: Fri Oct 27, 2006 9:42 am
- Location: Athens, GA, USA
Re: How to speed up memory copying within RAM?
That's... actually a good point. Arguing over the optimal algorithm for a given configuration and situation is sort of pointless if all the candidates perform adequately in the first place.
EDIT: Though it occurs to me that this is all the more reason to come up with the checklist; we would just put "is the performance of any of the existing implementations adequate for the current uses?" as the first question, with "yes -> stop", "no -> next question" as the checklist states. The real purpose of the checklist would not be to encourage premature optimization, but to shave time off for those who are going to insist on doing it anyway.
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.
-
- Member
- Posts: 73
- Joined: Wed Dec 23, 2015 10:42 pm
Re: How to speed up memory copying within RAM?
Ah, OK; so much argument ;_; BTW, I think in my scenario the "rep movsd" method performed slightly better than SSE, probably because of the PCI overhead calculation Brendan did :/ I came to this conclusion after 20-30 tests of both methods ;_; SSE might be better in other situations, but, as Brendan pointed out, to get around the PCI bottleneck we need to send as large a chunk of data at once as possible. Thanks for the help and for clearing up my concepts :p Thanks everyone :p
The best method for accelerating a computer is the one that boosts it by 9.8 m/s2.
My OS : https://github.com/AshishKumar4/Aqeous
Re: How to speed up memory copying within RAM?
Since we are on the subject, I just finished some code that I was working on that would blank the screen to a specific color using an interface that would work whether you were in graphics mode or in text mode. (Text mode pixels are just blank/null cells with foreground and background set to the nearest palette color)
I first implemented the text mode code using rep movsw (each cell is 16 bits wide) with CX=2000 (80x25), and I later replaced that code with MMX 64-bit wide copies with CX=500, and I swear that the first method is faster in VirtualBox. Also, the MMX method looks like crap, with glitches and blocks that are not updated properly. I assumed at first that I had a bug somewhere in the MMX code, but I've yet to find it, and the code's not that complex.
Strangely, I get entirely the opposite results in 1024x768x32 mode. Using MMX 64-bit copies is probably 8x faster, on screen. When I get some time, I'll do a little more digging to see if I can figure out what is going on, but for now, I'm just using whichever method looks best.
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott