Interesting that none of you have given the default correct answer to any question of the form, 'Is x faster than y?', namely, "profile them and find out".
Mind you, in this case you'd need to have several use cases under several different test conditions, and probably would want to incorporate them into your regular testing suite to track changes in both the code and the system conditions.
How to speed up memory copying within RAM?
- Schol-R-LEA
- Member
- Posts: 1925
- Joined: Fri Oct 27, 2006 9:42 am
- Location: Athens, GA, USA
Re: How to speed up memory copying within RAM?
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.
Re: How to speed up memory copying within RAM?
Brendan wrote:
No; "rep movs" (of any size) has been optimised to work on entire cache lines for a very long time, and moves 64 bytes when it can (which even includes "swizzling" for modern CPUs, where the CPU can/will read 2 halves of a cache line and write a whole cache line).
mikegonta wrote:
Yes; however the total bytes moved is still the product of ecx and the bytes moved by movs, cache line size notwithstanding. The OP indicated that the sse iterations were 1/8 that of the movs, resulting in 4X as many bytes moved.
For the number of iterations of the loop, who cares? What is important (at least for copying to video display memory, where the PCI bus is always going to be the bottleneck) is the number of writes issued across the PCI bus.
For a simple example, let's assume that each write has a 7-byte header (1 byte for "type", 4 bytes for address, 2 bytes for length) that precedes the data being written. For the SSE version (without write combining), writing 512 bytes ends up being 32 writes (at 16 bytes per write), for a total of 32*7+512 = 736 bytes of PCI bus traffic. For the "rep movsd" version (assuming it is operating on cache lines), writing the same 512 bytes ends up being 8 writes (at 64 bytes per write), for a total of 8*7+512 = 568 bytes of PCI bus traffic. In this case (which is over-simplified), SSE would be almost 30% slower.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: How to speed up memory copying within RAM?
Hi,
Schol-R-LEA wrote:
Interesting that none of you have given the default correct answer to any question of the form, 'Is x faster than y?', namely, "profile them and find out".
Mind you, in this case you'd need to have several use cases under several different test conditions, and probably would want to incorporate them into your regular testing suite to track changes in both the code and the system conditions.
That's the problem with "profile them" - you'd need to do about 20 tests per computer (with different alignments, different amounts being moved, and different source and destination caching policies) on at least 10 different computers (older, newer, NUMA, Intel, AMD, discrete video, integrated video, ...) just to get a rough idea; and if you actually did the 200+ tests you'd just stare at the results for 10 minutes and give up, still without any idea of which was better.
The danger here is that it's just as likely that someone will do bad tests and come to a wrong conclusion. Unfortunately, I've seen this far too many times for "memcpy()" implementations, where they benchmark how quickly it can copy a massive amount of memory (several orders of magnitude larger than the last-level cache) and completely ignore the fact that most of the time "memcpy()" is used for small (less than 4 KiB) copies, where the setup overhead of SSE is far more significant.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
- glauxosdever
- Member
- Posts: 501
- Joined: Wed Jun 17, 2015 9:40 am
- Libera.chat IRC: glauxosdever
- Location: Athens, Greece
Re: How to speed up memory copying within RAM?
Hi,
On topic, though, I think you should read Intel's Optimisation Manual, section 3.7.6. It discusses the efficiency of the different possible implementations of the memcpy(), memmove() and memset() functions.
Regards,
glauxosdever
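[Editor's note] A recurring theme in that part of the manual is dispatching on copy size: tiny copies want minimal setup, mid-size copies can use "rep movsb" (fast on CPUs with ERMSB), and huge copies may warrant non-temporal stores. A hedged sketch of that shape; the thresholds here are invented, not Intel's recommendations:

```c
#include <stddef.h>
#include <string.h>

/* Size-dispatched copy, sketched for illustration only. */
static void *my_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n < 16) {               /* tiny: a byte loop beats any setup cost */
        while (n--)
            *d++ = *s++;
        return dst;
    }
#if defined(__x86_64__)
    if (n < (1u << 20)) {       /* mid-size: "rep movsb" (fast with ERMSB) */
        __asm__ volatile("rep movsb"
                         : "+D"(d), "+S"(s), "+c"(n)
                         :
                         : "memory");
        return dst;
    }
#endif
    return memcpy(dst, src, n); /* large (or non-x86): defer to libc */
}
```

The inline-assembly constraints put the destination in RDI, source in RSI, and count in RCX, which is exactly what `rep movsb` consumes.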
- Schol-R-LEA
- Member
- Posts: 1925
- Joined: Fri Oct 27, 2006 9:42 am
- Location: Athens, GA, USA
Re: How to speed up memory copying within RAM?
Perhaps we need to delineate a core set of bulk copy/compare/delete/update profiles for this purpose, then. Not actual code, of course, but more along the lines of a pattern repository - a list of condition cases, a checklist, a set of base algorithms and data structures that can be used for instrumenting the code to be tested, known hazards in performing the tests, and so on. Something that would make profiling these things less haphazard and arduous, especially the actual analysis of the results.
We could also set up a common set of analysis tools with a set of common data formats, with ways of collecting the data so that it could be sent to the development host. Maybe even some auto-balancing algorithms or heuristics that the OSes themselves could implement, running the tests automatically on boot and adjusting themselves accordingly, though that is going beyond the realm of most hobbyist development.
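[Editor's note] The boot-time auto-tuning idea could look something like the sketch below: time each candidate copy routine once on a representative workload and install the winner behind a function pointer. All names, candidates, and workload sizes here are invented for illustration:

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

static void *copy_byte(void *d, const void *s, size_t n)
{
    char *dp = d;
    const char *sp = s;
    while (n--)
        *dp++ = *sp++;
    return d;
}

static void *copy_libc(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n);
}

/* Candidate table; a real kernel would add rep-movs and SSE variants. */
static copy_fn candidates[] = { copy_byte, copy_libc };

/* Time each candidate and return the fastest for this machine. */
static copy_fn pick_fastest(void)
{
    static char src[4096], dst[4096];
    copy_fn best = candidates[0];
    double best_t = 1e300;

    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        clock_t t0 = clock();
        for (int r = 0; r < 10000; r++)
            candidates[i](dst, src, sizeof src);
        double t = (double)(clock() - t0);
        if (t < best_t) {
            best_t = t;
            best = candidates[i];
        }
    }
    return best;
}
```

The caveats from earlier in the thread still apply: a single 4 KiB workload measured at boot will not capture the alignment, size, and caching-policy variations a real memcpy() sees.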
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.
- MichaelFarthing
- Member
- Posts: 167
- Joined: Thu Mar 10, 2016 7:35 am
- Location: Lancaster, England, Disunited Kingdom
Re: How to speed up memory copying within RAM?
Schol-R-LEA wrote:
Perhaps we need to delineate a core set of bulk copy/compare/delete/update profiles for this purpose, then. Not actual code, of course, but more along the lines of a pattern repository - a checklist and set of base algorithms and data structures which can be used for instrumenting the code to be tested, known hazards in performing the tests, etc. Something that would make profiling these things less haphazard and arduous, especially the actual analysis of the results.
Well, if the user cannot immediately see which algorithm is faster, it might be sensible to weigh the expected saving in time over a few thousand invocations of the program against the time the developers spend profiling their algorithms instead of getting on with the next stage of their OS. [Perhaps that was the point you were making?]
- Schol-R-LEA
- Member
- Posts: 1925
- Joined: Fri Oct 27, 2006 9:42 am
- Location: Athens, GA, USA
Re: How to speed up memory copying within RAM?
That's... actually a good point. Arguing over the optimal algorithm for a given configuration and situation is sort of pointless if all the candidates perform adequately in the first place.
EDIT: Though it occurs to me that this is all the more reason to come up with the checklist; we would just put "is the performance of any of the existing implementations adequate for the current uses?" as the first question, with "yes -> stop", "no -> next question" as the checklist states. The real purpose of the checklist would not be to encourage premature optimization, but to shave time off for those who are going to insist on doing it anyway.
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.
-
- Member
- Posts: 73
- Joined: Wed Dec 23, 2015 10:42 pm
Re: How to speed up memory copying within RAM?
Ah, OK; so much argument ;_; BTW, I think in my scenario the "rep movsd" method performed slightly better than SSE, probably because of the PCI overhead calculation Brendan did :/ I came to this conclusion after 20-30 tests of both methods ;_; SSE might be better in other situations, but, as Brendan pointed out, to get around the PCI bottleneck we need to send as large a chunk of data at once as possible. Thanks for the help and for clearing up my concepts :p Thanks everyone :p
The best method for accelerating a computer is the one that boosts it by 9.8 m/s2.
My OS : https://github.com/AshishKumar4/Aqeous
Re: How to speed up memory copying within RAM?
Since we are on the subject, I just finished some code that I was working on that would blank the screen to a specific color using an interface that would work whether you were in graphics mode or in text mode. (Text mode pixels are just blank/null cells with foreground and background set to the nearest palette color)
I first implemented the text mode code using rep movsw (each cell is 16 bits wide) with CX=2000 (80x25), and I later replaced that code with MMX 64-bit wide copies with CX=500, and I swear that the first method is faster in VirtualBox. Also, the MMX method looks like crap, with glitches and blocks that are not updated properly. I assumed at first that I had a bug somewhere in the MMX code, but I've yet to find it, and the code's not that complex.
Strangely, I get entirely the opposite results in 1024x768x32 mode. Using MMX 64-bit copies is probably 8x faster, on screen. When I get some time, I'll do a little more digging to see if I can figure out what is going on, but for now, I'm just using whichever method looks best.
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott