What's faster?

Post by guferr »

Hi, I was just wondering about a small thing:
When you use the Rep instruction, it repeats a string operation, as in:

Code:

Mov cx,0Ah
Rep Movsw
In this code, the movsw operation is repeated ten times.
But when I looked at it in an 8086 emulator, it showed the Rep instruction being executed repeatedly, so half of the time is spent executing movsw's and the other half executing Rep's.

I would like to know if this also happens on a real processor because, according to the emulator, replacing the code above with this:

Code:

Movsw
Movsw
Movsw
Movsw
Movsw
Movsw
Movsw
Movsw
Movsw
Movsw


would be faster, since you don't have to spend time executing the Rep instruction.
Thanks.

Post by MDM »

It seems to me that repeating rep is still faster than fetching ten separate instructions from RAM. Even if you take the prefetch queue into account, the processor could be queuing more important instructions than 0xA movsw's.

Post by qw »

guferr wrote: would be faster, since you don't have to spend time executing the Rep instruction.
Technically, REP is not an instruction but a prefix. Whether REP is faster or not depends on the exact machine. If you are interested in optimization, you may want to read Agner Fog's articles: http://www.agner.org/optimize/.
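To make the "prefix" point concrete, here is a small C sketch that spells out the bytes of both versions from the question. The encodings are written out by hand for 16-bit code rather than taken from an assembler listing, and the array and function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* "mov cx,0Ah / rep movsw" in 16-bit code: 5 bytes in total. */
static const unsigned char rep_version[] = {
    0xB9, 0x0A, 0x00,   /* mov cx, 0Ah (opcode B9 + 16-bit immediate) */
    0xF3,               /* rep: a one-byte prefix, not an instruction of its own */
    0xA5                /* movsw: the instruction the prefix modifies */
};

/* Ten separate movsw instructions: 10 bytes in total. */
static const unsigned char unrolled_version[] = {
    0xA5, 0xA5, 0xA5, 0xA5, 0xA5,
    0xA5, 0xA5, 0xA5, 0xA5, 0xA5
};

/* Counts how often a byte value occurs in a buffer. This is a plain byte
   count, not an instruction decoder. */
size_t count_byte(const unsigned char *code, size_t len, unsigned char value)
{
    size_t n = 0;
    assert(code != NULL);
    for (size_t i = 0; i < len; i++) {
        if (code[i] == value) {
            n++;
        }
    }
    return n;
}
```

The A5 byte appears once in the first array and ten times in the second, so the prefixed form contains a single movsw that the CPU is told to repeat, not ten instructions.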

Roel

Post by Brendan »

Hi,
guferr wrote: But when I looked at it in an 8086 emulator, it showed the Rep instruction being executed repeatedly, so half of the time is spent executing movsw's and the other half executing Rep's.
The emulator is misleading.

For a modern CPU, the CPU would decode the entire "rep movsw" instruction once, then check the initial values to determine whether it can operate on whole cache lines (e.g. whether the count in CX is large enough, whether the addresses in SI and/or DI meet some other restrictions, etc.). If the CPU decides it can't work on entire cache lines (likely in your case, as it's only 20 bytes), it would probably run a loop in microcode (and wouldn't read or decode the instruction a second time).
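As a rough illustration of "decoded once, then looped", here is a C model of what "rep movsw" does to the registers and memory. It is only a sketch of the architectural effect, not of any real microcode. It assumes DS and ES point at the same 64 KiB segment and that the direction flag is clear; the struct and function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the registers that "rep movsw" reads and writes. */
struct regs {
    uint16_t cx;   /* remaining word count */
    uint16_t si;   /* source offset */
    uint16_t di;   /* destination offset */
};

/* mem must point at 65536 bytes (one segment). Assumes DS == ES and DF = 0. */
void rep_movsw(uint8_t *mem, struct regs *r)
{
    assert(mem != NULL && r != NULL);
    /* The instruction is "decoded" once (this call); only the body repeats. */
    while (r->cx != 0) {
        /* Read the whole word before writing it, as a word move would. */
        uint8_t lo = mem[r->si];
        uint8_t hi = mem[(uint16_t)(r->si + 1)];
        mem[r->di] = lo;
        mem[(uint16_t)(r->di + 1)] = hi;
        r->si = (uint16_t)(r->si + 2);   /* offsets wrap at 64 KiB */
        r->di = (uint16_t)(r->di + 2);
        r->cx--;                         /* count down, then test for zero */
    }
}
```

Starting with CX = 10, the loop body runs ten times and leaves CX at zero, with SI and DI each advanced by 20 bytes. An emulator that shows the prefix being stepped on every iteration is showing this loop, not a second fetch of the instruction.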

For performance, it's a compromise between different issues. The first is loading the executable from disk into RAM (smaller executables load faster). Then there's instruction fetch (fewer bytes is better), then decoding the instructions (fewer is typically better). The "10 separate movsw instructions" approach would make performance worse for all of these. However, there's also register pressure (e.g. using separate movsw instructions means CX isn't used, and that can mean that CX doesn't need to be stored/restored or that CX can be used to improve performance of something else - e.g. an outer loop). Then there are those checks the CPU does when it first starts a "rep movsw" to see if it can operate on cache lines rather than words - using separate movsw instructions avoids that check, which could improve performance a little for very small (e.g. 20-byte) string operations. Finally there's the "microcode loop" - at the end of each iteration of the loop the CPU would have to decrease CX and check whether it has become zero. The CPU would assume CX will be non-zero and the loop continues, but there may be a small "exit penalty" when the instruction completes (where the CPU assumes the loop will continue, but the assumption is wrong). As you can see, there are both advantages and disadvantages (and for different CPUs the advantages/disadvantages can be different).

The next thing to consider is how often you're doing this. If you only execute this piece of code once, then the disadvantages of "many separate movsw instructions" (e.g. code size) will probably make performance worse. If you execute the code multiple times, but not often enough to keep it in the CPU's instruction caches, then it'll probably be similar - the disadvantages (instruction fetch) make performance worse. However, if you execute the code often enough, then most of the disadvantages of "many separate movsw instructions" will only occur once and most of the advantages occur each time; so "many separate movsw instructions" could improve performance. However, different CPUs are different - "many separate movsw instructions" might be faster on some CPUs (e.g. where the bottleneck is data cache access latency) and slower on other CPUs (e.g. where the bottleneck is the instruction decoder).
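For the original 8086 that the emulator models, the trade-off can be put into rough numbers. The sketch below uses the execution clock counts as I recall them from Intel's 8086 timing tables (18 for a lone MOVS, 9 + 17 per repetition for REP MOVS, 4 for MOV register,immediate). Treat those constants as assumptions to check against the manual. The model also ignores instruction fetch, the prefetch queue and the penalty for odd addresses, so it is a back-of-the-envelope estimate only.

```c
#include <assert.h>

/* Assumed 8086 execution clocks (not measured, not from the emulator). */
enum {
    CLK_MOV_REG_IMM       = 4,    /* mov cx, imm16 */
    CLK_MOVS              = 18,   /* one movsw without a prefix */
    CLK_REP_MOVS_SETUP    = 9,    /* fixed cost of rep movsw */
    CLK_REP_MOVS_PER_ITER = 17    /* cost of each repetition */
};

/* Estimated clocks for "mov cx,words / rep movsw". */
unsigned rep_clocks(unsigned words)
{
    assert(words <= 0xFFFFu);     /* CX is a 16-bit register */
    return CLK_MOV_REG_IMM + CLK_REP_MOVS_SETUP
         + CLK_REP_MOVS_PER_ITER * words;
}

/* Estimated clocks for `words` separate movsw instructions. */
unsigned unrolled_clocks(unsigned words)
{
    assert(words <= 0xFFFFu);
    return CLK_MOVS * words;
}
```

Under these assumptions, ten words cost 183 clocks with rep and 180 clocks unrolled, so the two are nearly equal on execution time alone. The unrolled version takes 10 bytes of code instead of 5, and fetching those extra bytes is not counted here.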


Cheers,

Brendan

Post by Owen »

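Note that on most modern CPUs, a plain movs (without rep) is microcoded, and the equivalent sequence of primitive instructions (separate loads, stores and pointer updates) is in fact quicker. Compared with a run of separate movsw instructions, "rep movsw" is almost always faster.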

For pretty much everything between the first Pentium and Sandy Bridge, and pretty much all AMD CPUs, an MMX or SSE copy loop is faster than "rep movsw". In particular, for K8 and K10, "rep movs" is documented to take a minimum of 12+rCX cycles.
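For reference, a minimal sketch of what such a copy loop can look like in C, using the SSE2 integer load/store intrinsics from emmintrin.h. The function name is made up, unaligned loads and stores are used for simplicity, and the code falls back to plain memcpy when the compiler does not define __SSE2__. It shows the shape of the loop only and makes no claim about speed on any particular CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Copies `bytes` bytes from src to dst; the buffers must not overlap. */
void copy_sse2(void *dst, const void *src, size_t bytes)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    /* Move 16 bytes per iteration through an XMM register. */
    while (bytes >= 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, chunk);
        s += 16;
        d += 16;
        bytes -= 16;
    }
#endif
    /* Copy the tail (or everything, when SSE2 is not available). */
    memcpy(d, s, bytes);
}
```

A tuned version would typically align the destination and use aligned or non-temporal stores for large blocks. For a 20-byte copy like the one in the question, the setup work is likely to matter more than the loop itself.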