Hi,
guferr wrote: But, when i looked at the 8086 emulator, it showed that the Rep instruction is repeatedly executed, so the time spend on the operation is half executing movsw's, and other half executing Rep's.
The emulator is misleading.
For a modern CPU, the CPU would decode the entire "rep movsw" instruction once, then check the initial values to determine whether it can operate on whole cache lines (e.g. whether the count in CX is large enough, whether the addresses in SI and/or DI meet some other restrictions, etc). If the CPU decides it can't work on entire cache lines (likely for your case as it's only 20 bytes), then it'd probably do a loop in microcode (and wouldn't fetch or decode the instruction a second time).
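To make the "decode once, then loop internally" idea concrete, here's a rough C model of what "rep movsw" does architecturally. It's only a sketch: the struct regs16 type and the rep_movsw_model() name are made up for illustration, it assumes the direction flag is clear and that DS and ES point at the same flat buffer, and it says nothing about how any real CPU's microcode is actually organised.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the registers "rep movsw" uses (not a real API). */
struct regs16 { uint16_t cx, si, di; };

/*
 * Model of "rep movsw" with DF clear and DS == ES (both are assumptions).
 * The caller "decodes" the instruction once by calling this function;
 * the loop below stands in for whatever the CPU repeats internally.
 * Returns the number of words copied.
 */
unsigned rep_movsw_model(uint8_t *mem, size_t cap, struct regs16 *r)
{
    unsigned iterations = 0;

    while (r->cx != 0) {                 /* CX == 0 on entry copies nothing */
        assert((size_t)r->si + 2 <= cap && (size_t)r->di + 2 <= cap);
        memmove(&mem[r->di], &mem[r->si], 2);   /* one word: read, then write */
        r->si = (uint16_t)(r->si + 2);   /* 16-bit registers wrap like the 8086's */
        r->di = (uint16_t)(r->di + 2);
        r->cx--;                         /* decrement and test happen per iteration */
        iterations++;
    }
    return iterations;
}
```

The point is that the loop lives inside the one "instruction"; an emulator that single-steps each iteration and shows "rep" again is showing its own bookkeeping, not a second fetch and decode.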
For performance, it's a compromise between different issues. The first is loading the executable from disk into RAM (smaller executables load faster). Then there's instruction fetch (fewer bytes is better), then decoding the instructions (fewer is typically better). The "20 separate movsw instructions" approach would make performance worse for all of these. However, there's also register pressure (e.g. using separate movsw instructions means CX isn't used, and that can mean that CX doesn't need to be saved/restored or that CX can be used to improve performance of something else - e.g. an outer loop). Then there are those checks the CPU does when it first starts a "rep movsw" to see if it can operate on cache lines rather than words - using separate movsw instructions avoids those checks, which could improve performance a little for very small (e.g. 20-byte) string operations. Finally there's the "microcode loop" - at the end of each iteration of the loop the CPU would have to decrease CX and check if it's become zero. The CPU would assume CX will be non-zero and that the loop continues, so there may be a small "exit penalty" when the instruction completes (where the CPU assumes the loop will continue, but the assumption is wrong). As you can see there are both advantages and disadvantages (and for different CPUs the advantages/disadvantages can be different).
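For the code size part of that trade-off, the 16-bit encodings are easy to count. The sketch below assumes the "rep" version loads the count with "mov cx, 20" (the count of 20 is taken from the "20 separate movsw instructions" example, and any setup of SI, DI, the segment registers or the direction flag is left out because both versions need it). The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 16-bit mode encoding of:  mov cx, 20  /  rep movsw */
static const uint8_t rep_version[] = {
    0xB9, 0x14, 0x00,   /* mov cx, 20  (B9 iw)          */
    0xF3, 0xA5          /* rep movsw   (F3 prefix + A5) */
};

/* Size in bytes of the "mov cx, 20 / rep movsw" sequence. */
size_t rep_version_size(void)
{
    return sizeof rep_version;
}

/*
 * Writes `count` back-to-back "movsw" opcodes (A5, one byte each in
 * 16-bit mode) into `out`. Returns the bytes written, or 0 if they
 * don't fit in `cap` bytes.
 */
size_t emit_unrolled_movsw(uint8_t *out, size_t cap, size_t count)
{
    if (count > cap)
        return 0;
    memset(out, 0xA5, count);
    return count;
}
```

So under these assumptions it's 5 bytes against 20 bytes; the unrolled version is four times larger to load, fetch and decode, but it leaves CX free.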
The next thing to consider is how often you're doing this. If you only execute this piece of code once, then the disadvantages of "many separate movsw instructions" (e.g. code size) will probably make performance worse. If you execute the code multiple times, but not often enough to keep it in the CPU's instruction caches, then it'll probably be similar - the disadvantages (instruction fetch) make performance worse. However, if you execute the code often enough then most of the disadvantages of "many separate movsw instructions" will only occur once and most of the advantages will occur each time; so "many separate movsw instructions" could improve performance. Even then, different CPUs are different - "many separate movsw instructions" might be faster on some CPUs (e.g. where the bottleneck is data cache access latency) and slower on other CPUs (e.g. where the bottleneck is the instruction decoder).
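The "disadvantages occur once, advantages occur each time" argument is just amortisation, and a toy cost model shows where the break-even point sits. Everything here is an assumption for illustration: the struct cost fields are hypothetical cycle counts you'd have to measure on your own CPU, and the model assumes the code stays in the instruction cache after its first run (which is exactly the case that favours the unrolled version).

```c
#include <stdint.h>

/* Hypothetical costs in cycles; supply your own measurements. */
struct cost {
    uint64_t cold_fetch;  /* paid once: fetching/decoding from a cold cache */
    uint64_t per_run;     /* paid on every execution                         */
};

/* Total cost of `runs` executions, assuming the code stays cached. */
uint64_t total_cost(struct cost c, uint64_t runs)
{
    return runs == 0 ? 0 : c.cold_fetch + c.per_run * runs;
}

/*
 * Smallest run count (1..limit) at which the unrolled version is
 * strictly cheaper than the "rep" version, or 0 if that never
 * happens within `limit` runs.
 */
uint64_t break_even_runs(struct cost rep, struct cost unrolled, uint64_t limit)
{
    for (uint64_t n = 1; n <= limit; n++) {
        if (total_cost(unrolled, n) < total_cost(rep, n))
            return n;
    }
    return 0;
}
```

If the code gets evicted between runs, the cold cost is paid again each time and the larger version never catches up, which is the "not often enough to keep it in the CPU's instruction caches" case.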
Cheers,
Brendan