What's faster?

guferr · Post by **guferr** » Thu Jan 12, 2012 10:56 pm

Hi, I was just wondering about one small thing:
When you use the instruction Rep, it repeats a string operation, as in:

Mov cx,0Ah
Rep Movsw

In this code, the operation movsw will be repeated ten times.
But when I looked at this in the 8086 emulator, it showed the Rep instruction being executed repeatedly, so half of the time spent on the operation goes to executing movsw's and the other half to executing Rep's.

I would like to know if this happens on a real processor, because, according to this emulator, replacing the code above with this:

Movsw
Movsw
Movsw
Movsw
Movsw
Movsw
Movsw
Movsw
Movsw
Movsw

would be faster, since you don't have to spend time executing the Rep instruction.
Thanks.

MDM · Post by **MDM** » Thu Jan 12, 2012 11:45 pm

It would seem to me that repeating the rep instruction is still faster than continuously going out to RAM to fetch separate instructions. Even if you consider queuing, the processor could be queuing more important instructions than 0xA movsw's.

qw · Post by **qw** » Fri Jan 13, 2012 2:27 am

guferr wrote: would be faster, since you don't have to spend time executing the Rep instruction.

Technically, REP is not an instruction but a prefix. Whether REP is faster or not depends on the exact machine. If you are interested in optimization, you may want to read Agner Fog's articles: http://www.agner.org/optimize/.

Roel

Brendan · Post by **Brendan** » Fri Jan 13, 2012 2:38 am

Hi,

guferr wrote: But when I looked at this in the 8086 emulator, it showed the Rep instruction being executed repeatedly, so half of the time spent on the operation goes to executing movsw's and the other half to executing Rep's.

The emulator is misleading.

For a modern CPU, the CPU would decode the entire "rep movsw" instruction once, then check the initial values to determine whether it can operate on whole cache lines (e.g. whether the count in CX is large enough, whether the addresses in SI and/or DI meet some other restrictions, etc.). If the CPU decides it can't work on entire cache lines (likely in your case, as it's only 20 bytes), then it'd probably do a loop in microcode (and wouldn't read or decode the instruction a second time).

For performance, it's a compromise between different issues. The first is loading the executable from disk into RAM (smaller executables load faster). Then there's instruction fetch (fewer bytes is better), then decoding the instructions (fewer is typically better). The "10 separate movsw instructions" approach would make performance worse for all of these. However, there's also register pressure (e.g. using separate movsw instructions means CX isn't used, and that can mean that CX doesn't need to be saved/restored, or that CX can be used to improve the performance of something else, e.g. an outer loop). Then there are those checks the CPU does when it first starts a "rep movsw" to see if it can operate on cache lines rather than words. Using separate movsw instructions avoids those checks, which could improve performance a little for very small (e.g. 20-byte) string operations. Finally there's the "microcode loop": at the end of each iteration of the loop the CPU has to decrease CX and check whether it has become zero. The CPU would assume CX will be non-zero and that the loop continues, but there may be a small "exit penalty" when the instruction completes (where the CPU assumes the loop will continue, but the assumption is wrong). As you can see, there are both advantages and disadvantages (and for different CPUs the advantages/disadvantages can be different).
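To make the loop described above concrete, here is a rough C model of what "rep movsw" does architecturally, next to the unrolled version from the first post. It is only a sketch under stated assumptions: the direction flag is taken to be clear, SI and DI are modelled as word indexes into flat arrays rather than real segment:offset byte addresses, and the names (`struct regs`, `movsw`, `rep_movsw`, `movsw_x10`) are made up for illustration. It models the visible results only and says nothing about timing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register file: only what the string move touches.
   si/di are word indexes here (a simplifying assumption), not byte offsets. */
struct regs {
    uint16_t cx;
    size_t si;
    size_t di;
};

/* One MOVSW with the direction flag clear: copy one word, advance SI and DI. */
static void movsw(struct regs *r, const uint16_t *src, uint16_t *dst)
{
    dst[r->di] = src[r->si];
    r->si += 1;
    r->di += 1;
}

/* REP MOVSW: one instruction whose repeat is a count-down on CX.
   If CX is 0 on entry, nothing is copied. */
static void rep_movsw(struct regs *r, const uint16_t *src, uint16_t *dst)
{
    assert(r != NULL && src != NULL && dst != NULL);
    while (r->cx != 0) {
        movsw(r, src, dst);
        r->cx -= 1;   /* decrement and zero test belong to the same instruction */
    }
}

/* The unrolled alternative: ten MOVSWs in the code stream, CX left untouched.
   The for loop only stands in for ten copies of the instruction. */
static void movsw_x10(struct regs *r, const uint16_t *src, uint16_t *dst)
{
    for (int i = 0; i < 10; i++)
        movsw(r, src, dst);
}
```

With CX = 10 on entry, both routines copy the same ten words and leave SI and DI advanced by ten words. The only visible difference is that `rep_movsw` leaves CX at 0 while the unrolled version leaves CX alone, which is the register pressure point above.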

The next thing to consider is how often you're doing this. If you only execute this piece of code once, then the disadvantages of "many separate movsw instructions" (e.g. code size) will probably make performance worse. If you execute the code multiple times, but not often enough to keep it in the CPU's instruction caches, then it'll probably be similar: the disadvantages (instruction fetch) make performance worse. However, if you execute the code often enough, then most of the disadvantages of "many separate movsw instructions" will only occur once and most of the advantages occur each time, so "many separate movsw instructions" could improve performance. Even then, different CPUs are different: "many separate movsw instructions" might be faster on some CPUs (e.g. where the bottleneck is data cache access latency) and slower on other CPUs (e.g. where the bottleneck is the instruction decoder).

Cheers,

Brendan

Owen · Post by **Owen** » Fri Jan 13, 2012 4:31 am

Note that on most modern CPUs, a standalone movs is microcoded, and in fact the equivalent sequence of primitive instructions is quicker than it. Compared with a run of separate movsw instructions, "rep movsw" is almost always faster.

For pretty much everything between the first Pentium and Sandy Bridge, and pretty much all AMD CPUs, an MMX or SSE copy loop is faster than "rep movsw". In particular, for K8 and K10, "rep movs" is documented to take a minimum of 12+rCX cycles.
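Taking that K8/K10 figure at face value, the lower bound for a given count is simple arithmetic. A minimal sketch follows; the function name is hypothetical and the constant 12 is only the number quoted above, not something measured here.

```c
#include <assert.h>
#include <stdint.h>

/* Lower bound quoted in the post for "rep movs" on K8/K10: 12 + rCX cycles.
   The constant 12 comes from the post, not from a measurement. */
static uint64_t rep_movs_min_cycles(uint64_t rcx)
{
    assert(rcx <= UINT64_MAX - 12);   /* keep the sum from wrapping */
    return 12 + rcx;
}
```

For the ten-word copy in the first post (count of 10) this gives at least 22 cycles, 12 of which are fixed overhead that a very short copy cannot amortize.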

OSDev.org
