How to speed up memory copying within RAM?

Schol-R-LEA
Member
Posts: 1925
Joined: Fri Oct 27, 2006 9:42 am
Location: Athens, GA, USA

Re: How to speed up memory copying within RAM?

Post by Schol-R-LEA »

Interesting that none of you have given the default correct answer to any question of the form, 'Is x faster than y?', namely, "profile them and find out".

Mind you, in this case you'd need to have several use cases under several different test conditions, and probably would want to incorporate them into your regular testing suite to track changes in both the code and the system conditions.
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: How to speed up memory copying within RAM?

Post by Brendan »

mikegonta wrote:
Brendan wrote: No; "rep movs" (of any size) has been optimised to work on entire cache lines for a very long time, and moves 64 bytes when it can (which even includes "swizzling" for modern CPUs, where the CPU can/will read 2 halves of a cache line and write a whole cache line).
Yes; however, the total bytes moved is still the product of ecx and the bytes moved by movs, cache line size notwithstanding.
The OP indicated that the SSE iterations were 1/8 that of the movs, resulting in 4X as many bytes moved.
For the number of iterations of the loop, who cares? What is important (at least for copying to video display memory where the PCI bus is always going to be the bottleneck) is the number of writes issued across the PCI bus.

For a simple example, let's assume that each write has a 7-byte header (1 byte for "type", 4 bytes for address, 2 bytes for length) that precedes the data being written. For the SSE version (without write combining), writing 512 bytes ends up being 32 writes (at 16 bytes per write), for a total of 32*7+512 = 736 bytes of PCI bus traffic. For the "rep movsd" version (assuming it is operating on cache lines), writing the same 512 bytes ends up being 8 writes (at 64 bytes per write), for a total of 8*7+512 = 568 bytes of PCI bus traffic. In this case (which is over-simplified), SSE would be almost 30% slower.
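As a rough sketch of the arithmetic in this simplified model (the 7-byte header is the hypothetical figure from the example, not a real PCI number, and `bus_traffic` is a name made up for illustration), the traffic for any write size could be computed like this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-write header from the simplified model:
   1 byte type + 4 bytes address + 2 bytes length. Not a real PCI figure. */
enum { HEADER_BYTES = 7 };

/* Total bus traffic, in bytes, for writing `total` bytes of data
   using writes that each carry `write_size` bytes of payload. */
static size_t bus_traffic(size_t total, size_t write_size)
{
    assert(write_size > 0);
    size_t writes = (total + write_size - 1) / write_size; /* round up */
    return writes * HEADER_BYTES + total;
}
```

Under these assumptions, `bus_traffic(512, 16)` gives 736 bytes and `bus_traffic(512, 64)` gives 568 bytes, so the fewer, larger writes carry noticeably less header overhead.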


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: How to speed up memory copying within RAM?

Post by Brendan »

Hi,
Schol-R-LEA wrote:Interesting that none of you have given the default correct answer to any question of the form, 'Is x faster than y?', namely, "profile them and find out".

Mind you, in this case you'd need to have several use cases under several different test conditions, and probably would want to incorporate them into your regular testing suite to track changes in both the code and the system conditions.
That's the problem with "profile them" - you'd need to do about 20 tests per computer (with different alignments, with different "amount being moved", with different source and destination caching policies) on at least 10 different computers (older, newer, NUMA, Intel, AMD, discrete video, integrated video, ...) just to get a rough idea, and if you actually do the 200+ tests you'd just stare at the test results for 10 minutes and give up while still not having any idea what was better.. :)

The danger here is that it's just as likely that someone will do bad tests and come to a wrong conclusion. Unfortunately, I've seen this far too many times for "memcpy()" implementations; where they benchmark how quickly it can copy a massive amount of memory (several orders of magnitude larger than "last level cache") and completely ignore the fact that most of the time "memcpy()" is used for small (less than 4 KiB) copies where the setup overhead of SSE is far more significant.
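To illustrate the point about copy sizes, here is a minimal benchmarking sketch in standard C. The names `copy_fn` and `ns_per_copy` are invented for this example, and `clock()` is a coarse CPU-time clock, so treat the numbers as rough indications only. The idea is to time the same routine at small sizes (say 16 bytes to 4 KiB) as well as at sizes far larger than the last level cache, instead of only the latter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Signature shared by memcpy() and any candidate replacement. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Average nanoseconds per call of `fn` copying `size` bytes, measured
   over `iterations` calls. Returns a negative value if allocation fails. */
static double ns_per_copy(copy_fn fn, size_t size, long iterations)
{
    assert(fn != NULL && size > 0 && iterations > 0);

    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, size);

    /* Calling through a volatile pointer keeps the compiler from
       inlining or eliding the copies. */
    copy_fn volatile call = fn;

    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        call(dst, src, size);
    clock_t end = clock();

    free(src);
    free(dst);
    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)iterations;
}
```

Calling, for example, `ns_per_copy(memcpy, 64, 1000000)` and `ns_per_copy(memcpy, 4096, 100000)` alongside one very large size gives a first impression of how much the fixed per-call overhead matters for small copies. It does not cover alignment or caching policy, which would need separate cases.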


Cheers,

Brendan
glauxosdever
Member
Posts: 501
Joined: Wed Jun 17, 2015 9:40 am
Libera.chat IRC: glauxosdever
Location: Athens, Greece

Re: How to speed up memory copying within RAM?

Post by glauxosdever »

Hi,


On topic, though, I think you should read Intel's Optimisation Manual, section 3.7.6. It covers the efficiency of the different possible implementations of the memcpy(), memmove() and memset() functions.


Regards,
glauxosdever
Schol-R-LEA
Member
Posts: 1925
Joined: Fri Oct 27, 2006 9:42 am
Location: Athens, GA, USA

Re: How to speed up memory copying within RAM?

Post by Schol-R-LEA »

Perhaps we need to delineate a core set of bulk copy/compare/delete/update profiles for this purpose, then. Not actual code, of course, but more along the lines of a pattern repository - a list of condition cases, a checklist and set of base algorithms and data structures which can be used for instrumenting the code to be tested, known hazards in performing the tests, etc. Something that would make profiling these things less haphazard and arduous, especially the actual analysis of the results.

We could also set up a common set of analysis tools with a set of common data formats, with ways of collecting the data so that it could be sent to the development host. Maybe even some auto-balancing algorithms or heuristics that the OSes themselves could implement, running the tests automatically on boot and adjusting themselves accordingly, though that is going out of the realm of most hobbyist development.
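Purely as an illustration of what a "list of condition cases" and a "common data format" might look like, here is a small C sketch. Every name in it (`copy_case`, `copy_result`, `build_cases`, `format_result`, the policy enum, the chosen sizes and offsets) is hypothetical; it only shows one way the test matrix and a CSV-style result line could be laid out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical caching policies to cover; the names are illustrative only. */
enum cache_policy { POLICY_WRITE_BACK, POLICY_WRITE_COMBINING, POLICY_UNCACHED, POLICY_COUNT };

/* One condition case: what is copied and under which conditions. */
struct copy_case {
    size_t size;               /* bytes per copy */
    unsigned misalign;         /* offset from a 64-byte boundary */
    enum cache_policy policy;  /* caching policy of the destination */
};

/* One measurement, ready to be sent to the development host. */
struct copy_result {
    const char *impl;          /* e.g. "rep_movsd" or "sse" */
    struct copy_case test;
    double ns_per_copy;
};

/* Enumerate every size x misalignment x policy combination.
   Writes at most `max` cases to `out` and returns how many were written. */
static size_t build_cases(struct copy_case *out, size_t max)
{
    static const size_t sizes[] = { 16, 512, 4096, 1u << 20 };
    static const unsigned offsets[] = { 0, 1, 8 };
    size_t n = 0;

    assert(out != NULL);
    for (size_t s = 0; s < sizeof sizes / sizeof sizes[0]; s++)
        for (size_t o = 0; o < sizeof offsets / sizeof offsets[0]; o++)
            for (int p = 0; p < POLICY_COUNT; p++) {
                if (n == max)
                    return n;
                out[n].size = sizes[s];
                out[n].misalign = offsets[o];
                out[n].policy = (enum cache_policy)p;
                n++;
            }
    return n;
}

/* Format one result as a CSV line: impl,size,misalign,policy,ns_per_copy.
   Returns what snprintf() returns. */
static int format_result(char *buf, size_t len, const struct copy_result *r)
{
    return snprintf(buf, len, "%s,%zu,%u,%d,%.1f",
                    r->impl, r->test.size, r->test.misalign,
                    (int)r->test.policy, r->ns_per_copy);
}
```

With the sizes, offsets and policies chosen here the matrix has 4 x 3 x 3 = 36 cases per implementation per machine. A plain-text line format keeps the collection side trivial, for instance over a serial port.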
MichaelFarthing
Member
Posts: 167
Joined: Thu Mar 10, 2016 7:35 am
Location: Lancaster, England, Disunited Kingdom

Re: How to speed up memory copying within RAM?

Post by MichaelFarthing »

Schol-R-LEA wrote: Perhaps we need to delineate a core set of bulk copy/compare/delete/update profiles for this purpose, then. Not actual code, of course, but more along the lines of a pattern repository - a checklist and set of base algorithms and data structures which can be used for instrumenting the code to be tested, known hazards in performing the tests, etc. Something that would make profiling these things less haphazard and arduous, especially the actual analysis of the results.

We could also set up a common set of analysis tools with a set of common data formats, with ways of collecting the data so that it could be sent to the development host. Maybe even some auto-balancing algorithms or heuristics that the OSes themselves could implement, running the tests automatically on boot and adjusting themselves accordingly, though that is going out of the realm of most hobbyist development.
Well, if the user cannot immediately see which algorithm is faster, it might be sensible to weigh the expected time saved over a few thousand invocations of the program against the time developers spend profiling their algorithms instead of getting on with the next stage of their OS. [Perhaps that was the point that you were making?]
Schol-R-LEA
Member
Posts: 1925
Joined: Fri Oct 27, 2006 9:42 am
Location: Athens, GA, USA

Re: How to speed up memory copying within RAM?

Post by Schol-R-LEA »

That's... actually a good point. Arguing over the optimal algorithm for a given configuration and situation is sort of pointless if all the candidates perform adequately in the first place.

EDIT: Though it occurs to me that this is all the more reason to come up with the checklist; we would just put "is the performance of any of the existing implementations adequate for the current uses?" as the first question, with "yes -> stop", "no -> next question" as the checklist states. The real purpose of the checklist would not be to encourage premature optimization, but to shave time off for those who are going to insist on doing it anyway.
ashishkumar4
Member
Posts: 73
Joined: Wed Dec 23, 2015 10:42 pm

Re: How to speed up memory copying within RAM?

Post by ashishkumar4 »

Ah, OK; so much argument ;_; BTW, I think in my scenario the "rep movsd" method performed slightly better than SSE, probably because of the bus-overhead calculation Brendan performed :/ I came to this conclusion after 20-30 tests of both approaches ;_; SSE might be better in other situations, but as Brendan pointed out, to get around the PCI bottleneck we need to send as large a chunk of data at once as possible. Thanks for the help and for clearing up my concepts :p Thanks everyone :p
The best method for accelerating a computer is the one that boosts it by 9.8 m/s².
My OS : https://github.com/AshishKumar4/Aqeous
SpyderTL
Member
Posts: 1074
Joined: Sun Sep 19, 2010 10:05 pm

Re: How to speed up memory copying within RAM?

Post by SpyderTL »

Since we are on the subject, I just finished some code that I was working on that would blank the screen to a specific color using an interface that would work whether you were in graphics mode or in text mode. (Text mode pixels are just blank/null cells with foreground and background set to the nearest palette color)

I first implemented the text mode code using rep movsw (each cell is 16 bits wide) with CX=2000 (80x25), and I later replaced that code using MMX 64-bit wide copies with CX=500, and I swear that the first method is faster in VirtualBox. Also, the MMX method looks like crap, with glitches and blocks that are not updated properly. I assumed at first that I had a bug somewhere in the MMX code, but I've yet to find it, and the code's not that complex.

Strangely, I get entirely opposite results when in 1024x768x32 mode. Using MMX 64-bit copies is probably 8x faster, on screen. When I get some time, I'll do a little more digging to see if I can figure out what is going on, but for now, I'm just using whichever method looks best.
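For reference, the text-mode case described above (2000 16-bit cells versus 500 64-bit stores) can be sketched in portable C. This is not the MMX code in question: it writes to an ordinary buffer standing in for video memory, and the function names are made up for the example. It only shows that the two approaches should leave identical contents behind, which may be a handy check when hunting for glitches.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { COLS = 80, ROWS = 25, CELLS = COLS * ROWS }; /* 2000 text cells */

/* One 16-bit store per cell: 2000 stores. */
static void fill_cells16(uint16_t *cells, uint16_t cell)
{
    for (int i = 0; i < CELLS; i++)
        cells[i] = cell;
}

/* Four cells packed into each 64-bit store: 500 stores. */
static void fill_cells64(uint16_t *cells, uint16_t cell)
{
    /* Replicate the 16-bit cell into all four lanes of a 64-bit value. */
    uint64_t packed = (uint64_t)cell * 0x0001000100010001ULL;

    for (int i = 0; i < CELLS / 4; i++)
        memcpy(&cells[i * 4], &packed, sizeof packed); /* alias-safe 64-bit store */
}
```

Filling two buffers, one with each function, and comparing them with `memcmp()` should report no difference. If the real 64-bit version shows stale blocks on screen while a buffer comparison like this passes, the problem is more likely in how the writes reach the (emulated) display than in the fill logic itself.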
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott