How can a nop instruction speed up my test program?

Question about which tools to use, bugs, the best way to implement a function, etc should go here. Don't forget to see if your question is answered in the wiki first! When in doubt post here.
Post Reply
User avatar
haolee
Posts: 22
Joined: Thu Jul 13, 2017 9:52 am

How can a nop instruction speed up my test program?

Post by haolee »

I write a test program to test cache performance on x86 and found an unexpected problem.

Code: Select all

#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>

#define BUF_SIZE 8192
#define ROUND 100000000UL
int main(int argc, char **argv)
{
	char *buf, *buf_newaddr, *buf_pageend;
	unsigned long i __attribute__((aligned(64)));
	int buf_realsize;
	unsigned long offset __attribute__((aligned(64)));
	struct timespec start={0,0}, end={0,0};
	double start_ns, end_ns;

	if (argc != 2) {
		printf("missing args\n");
		exit(-1);
	}

	offset = atoi(argv[1]);

again:
	buf = (void *)malloc(BUF_SIZE);
	buf_pageend = (void *)((unsigned long)(buf + 4095) & 0xfffffffffffff000UL);
	if (buf_pageend - buf < 1024) { // make sure we have enough space for negative 'offset'
		// don't free, occupt it in order to alloc another different block
		goto again;
	}
	memset(buf, 0, BUF_SIZE);

	printf("&i = %lx, &offset=%lx\n", &i, &offset);
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < ROUND; i++) {
		*((unsigned long *)(buf_pageend + offset)) = 0; // mark
	}
	clock_gettime(CLOCK_MONOTONIC, &end);
	start_ns = start.tv_sec*1000000000 + start.tv_nsec;
	end_ns = end.tv_sec*1000000000 + end.tv_nsec;
	printf("ns: %lf\n", (end_ns - start_ns)/ROUND);
}
The asm corresponding to the loop is as following:

Code: Select all

        call    clock_gettime
        movq    $0, -112(%rbp)
        jmp     .L5
.L6:
        movq    -176(%rbp), %rdx
        movq    -64(%rbp), %rax
        addq    %rdx, %rax
        movq    $0, (%rax)
        movq    -112(%rbp), %rax
        addq    $1, %rax
        movq    %rax, -112(%rbp)
.L5:
        movq    -112(%rbp), %rax
        cmpq    $99999999, %rax
        jbe     .L6
        leaq    -208(%rbp), %rax
        movq    %rax, %rsi
        movl    $1, %edi
        call    clock_gettime
The program will print:

Code: Select all

$ ./a.out 0
&i = 7fffd9811700, &offset=7fffd98116c0
ns: 3.217088
But if I insert a nop instruction:
.L6:
movq -176(%rbp), %rdx
movq -64(%rbp), %rax
addq %rdx, %rax
nop
movq $0, (%rax)
movq -112(%rbp), %rax
addq $1, %rax
movq %rax, -112(%rbp)
.L5:
The result will be:

Code: Select all

$ ./a.out 0
&i = 7ffef00bb380, &offset=7ffef00bb340
ns: 2.104905
My program is compiled with O0.
I can't understand how the nop instruction speed up my program. Thanks!
User avatar
iansjack
Member
Member
Posts: 4703
Joined: Sat Mar 31, 2012 3:07 am
Location: Chichester, UK

Re: How can a nop instruction speed up my test program?

Post by iansjack »

I couldn’t say if it’s the case with your program, but nop instructions can alter execution times by altering the alignment of memory accesses.
devc1
Member
Member
Posts: 439
Joined: Fri Feb 11, 2022 4:55 am
Location: behind the keyboard

Re: How can a nop instruction speed up my test program?

Post by devc1 »

the added latency can be just from the host OS scheduler. Does it show the same result when you run it multiple times ?
thewrongchristian
Member
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: How can a nop instruction speed up my test program?

Post by thewrongchristian »

haolee wrote:I write a test program to test cache performance on x86 and found an unexpected problem.

My program is compiled with O0.
I can't understand how the nop instruction speed up my program. Thanks!
-O0 turns off most optimisations. When I use -O0, I do replicate your results, but see a much smaller effect, as much as 10% difference compared to your ~50% difference. Running 10 times in each binary, I get about 10% variation from fastest to slowest of each type, so the margin of error is probably quite high.

But once I turn optimisations on, your for loops is in fact invariant, so it's optimised away and replaced with a single store.

The difference with the nop is probably micro-architecture specific, relating to pipeline scheduling (my CPU is a Sandy Bridge based i5-2520M). I'd be surprised if it's instruction alignment based, Intel will have optimised that away in the 25+ years of the P6 derived architecture.

Your printf format specifiers are wrong as well. To print pointers, you should use "%p".
User avatar
haolee
Posts: 22
Joined: Thu Jul 13, 2017 9:52 am

Re: How can a nop instruction speed up my test program?

Post by haolee »

iansjack wrote:I couldn’t say if it’s the case with your program, but nop instructions can alter execution times by altering the alignment of memory accesses.
This could be a reason. Do we have any methods to prove it?
User avatar
haolee
Posts: 22
Joined: Thu Jul 13, 2017 9:52 am

Re: How can a nop instruction speed up my test program?

Post by haolee »

devc1 wrote:the added latency can be just from the host OS scheduler. Does it show the same result when you run it multiple times ?
I have run it many times and the result is very steady.
User avatar
haolee
Posts: 22
Joined: Thu Jul 13, 2017 9:52 am

Re: How can a nop instruction speed up my test program?

Post by haolee »

-O0 turns off most optimisations. When I use -O0, I do replicate your results, but see a much smaller effect, as much as 10% difference compared to your ~50% difference. Running 10 times in each binary, I get about 10% variation from fastest to slowest of each type, so the margin of error is probably quite high.
Yes, the result may have a certain amount of error but the difference is still significant.
But once I turn optimisations on, your for loops is in fact invariant, so it's optimised away and replaced with a single store.
O2 will eliminate the assignment statement and we can use

Code: Select all

*((volatile unsigned long *)(buf_pageend + offset)) = 0;
to prevent it to be optimized out. And if we use O2 optimisation level, the code will be very fast and nop is useless.
The difference with the nop is probably micro-architecture specific, relating to pipeline scheduling (my CPU is a Sandy Bridge based i5-2520M). I'd be surprised if it's instruction alignment based, Intel will have optimised that away in the 25+ years of the P6 derived architecture.
I also suspect nop instruction affects the pipeline but I can find any direct evidence.
Your printf format specifiers are wrong as well. To print pointers, you should use "%p".
Yes, %p should be better. Thanks.
User avatar
haolee
Posts: 22
Joined: Thu Jul 13, 2017 9:52 am

Re: How can a nop instruction speed up my test program?

Post by haolee »

I use perf to collect some important counters. here are the results:

Without nop instruction:

Code: Select all

$ perf stat -e uops_executed.stall_cycles,uops_issued.stall_cycles,uops_retired.stall_cycles,cycle_activity.cycles_mem_any,icache_64b.iftag_hit -M Branch_Misprediction_Cost -M Pipeline ./slow 0
&i = 7ffc07966f40, &offset=7ffc07966f00
ns: 3.728576

 Performance counter stats for './slow 0':

       213,867,558      uops_executed.stall_cycles                                     (17.56%)
       491,423,003      uops_issued.stall_cycles                                      (23.45%)
       512,748,389      uops_retired.stall_cycles                                     (17.62%)
       883,538,689      cycle_activity.cycles_mem_any                                     (23.51%)
       204,676,274      icache_64b.iftag_hit                                          (23.55%)
             7,437      br_misp_retired.all_branches #    872.9 Branch_Misprediction_Cost  (23.55%)
           149,274      machine_clears.count                                          (23.55%)
     1,009,768,664      uops_issued.any                                               (23.55%)
     1,005,689,857      uops_retired.retire_slots                                     (23.56%)
         1,029,988      int_misc.recovery_cycles                                      (23.56%)
       889,352,008      cycles                                                        (29.44%)
       375,429,113      idq_uops_not_delivered.cycles_0_uops_deliv.core                                     (29.44%)
         1,303,779      int_misc.clear_resteer_cycles                                     (29.44%)
            24,653      baclears.any                                                  (29.44%)
     1,005,647,085      uops_retired.retire_slots #      0.9 UPI                      (29.44%)
     1,104,896,299      inst_retired.any                                              (29.44%)
     1,101,945,035      inst_retired.any          #      0.8 CPI                      (23.56%)
       889,131,234      cycles                                                        (23.56%)
     1,200,172,618      uops_executed.thread      #      1.7 ILP                      (17.60%)
       725,629,824      uops_executed.core_cycles_ge_1                                     (17.60%)

       0.374539772 seconds time elapsed

       0.370386000 seconds user
       0.000987000 seconds sys
With nop instruction:

Code: Select all

$ $ perf stat -e uops_executed.stall_cycles,uops_issued.stall_cycles,uops_retired.stall_cycles,cycle_activity.cycles_mem_any,icache_64b.iftag_hit -M Branch_Misprediction_Cost -M Pipeline ./fast 0
&i = 7ffd56fbeac0, &offset=7ffd56fbea80
ns: 2.212173

        91,068,788      uops_retired.stall_cycles                                     (18.01%)
       523,629,214      cycle_activity.cycles_mem_any                                     (23.87%)
         2,324,443      icache_64b.iftag_hit                                          (23.46%)
             4,823      br_misp_retired.all_branches #   1171.2 Branch_Misprediction_Cost  (23.42%)
           122,449      machine_clears.count                                          (23.43%)
     1,216,774,633      uops_issued.any                                               (23.42%)
     1,206,512,556      uops_retired.retire_slots                                     (23.42%)
           705,325      int_misc.recovery_cycles                                      (23.43%)
       528,397,938      cycles                                                        (29.28%)
        22,295,204      idq_uops_not_delivered.cycles_0_uops_deliv.core                                     (29.28%)
         1,151,776      int_misc.clear_resteer_cycles                                     (29.28%)
            15,162      baclears.any                                                  (29.28%)
     1,205,231,069      uops_retired.retire_slots #      1.0 UPI                      (29.28%)
     1,204,519,156      inst_retired.any                                              (29.28%)
     1,204,031,494      inst_retired.any          #      0.4 CPI                      (23.43%)
       528,232,797      cycles                                                        (23.43%)
     1,312,254,164      uops_executed.thread      #      2.6 ILP                      (17.57%)
       497,642,069      uops_executed.core_cycles_ge_1                                     (17.57%)

       0.222836590 seconds time elapsed

       0.217941000 seconds user
       0.002964000 seconds sys
We can see that the slower one has more cycle_activity.cycles_mem_any, uops_issued.stall_cycles, br_misp_retired.all_branches and icache_64b.iftag_hit. I don't know if this means the slower one has more misprediction in pipeline and thus have to re-fetch more instructions.
thewrongchristian
Member
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: How can a nop instruction speed up my test program?

Post by thewrongchristian »

haolee wrote:I use perf to collect some important counters. here are the results:

Without nop instruction:

Code: Select all

$ perf stat -e uops_executed.stall_cycles,uops_issued.stall_cycles,uops_retired.stall_cycles,cycle_activity.cycles_mem_any,icache_64b.iftag_hit -M Branch_Misprediction_Cost -M Pipeline ./slow 0
&i = 7ffc07966f40, &offset=7ffc07966f00
ns: 3.728576

 Performance counter stats for './slow 0':

       213,867,558      uops_executed.stall_cycles                                     (17.56%)
       491,423,003      uops_issued.stall_cycles                                      (23.45%)
       512,748,389      uops_retired.stall_cycles                                     (17.62%)
       883,538,689      cycle_activity.cycles_mem_any                                     (23.51%)
       204,676,274      icache_64b.iftag_hit                                          (23.55%)
             7,437      br_misp_retired.all_branches #    872.9 Branch_Misprediction_Cost  (23.55%)
           149,274      machine_clears.count                                          (23.55%)
     1,009,768,664      uops_issued.any                                               (23.55%)
     1,005,689,857      uops_retired.retire_slots                                     (23.56%)
         1,029,988      int_misc.recovery_cycles                                      (23.56%)
       889,352,008      cycles                                                        (29.44%)
       375,429,113      idq_uops_not_delivered.cycles_0_uops_deliv.core                                     (29.44%)
         1,303,779      int_misc.clear_resteer_cycles                                     (29.44%)
            24,653      baclears.any                                                  (29.44%)
     1,005,647,085      uops_retired.retire_slots #      0.9 UPI                      (29.44%)
     1,104,896,299      inst_retired.any                                              (29.44%)
     1,101,945,035      inst_retired.any          #      0.8 CPI                      (23.56%)
       889,131,234      cycles                                                        (23.56%)
     1,200,172,618      uops_executed.thread      #      1.7 ILP                      (17.60%)
       725,629,824      uops_executed.core_cycles_ge_1                                     (17.60%)

       0.374539772 seconds time elapsed

       0.370386000 seconds user
       0.000987000 seconds sys
With nop instruction:

Code: Select all

$ $ perf stat -e uops_executed.stall_cycles,uops_issued.stall_cycles,uops_retired.stall_cycles,cycle_activity.cycles_mem_any,icache_64b.iftag_hit -M Branch_Misprediction_Cost -M Pipeline ./fast 0
&i = 7ffd56fbeac0, &offset=7ffd56fbea80
ns: 2.212173

        91,068,788      uops_retired.stall_cycles                                     (18.01%)
       523,629,214      cycle_activity.cycles_mem_any                                     (23.87%)
         2,324,443      icache_64b.iftag_hit                                          (23.46%)
             4,823      br_misp_retired.all_branches #   1171.2 Branch_Misprediction_Cost  (23.42%)
           122,449      machine_clears.count                                          (23.43%)
     1,216,774,633      uops_issued.any                                               (23.42%)
     1,206,512,556      uops_retired.retire_slots                                     (23.42%)
           705,325      int_misc.recovery_cycles                                      (23.43%)
       528,397,938      cycles                                                        (29.28%)
        22,295,204      idq_uops_not_delivered.cycles_0_uops_deliv.core                                     (29.28%)
         1,151,776      int_misc.clear_resteer_cycles                                     (29.28%)
            15,162      baclears.any                                                  (29.28%)
     1,205,231,069      uops_retired.retire_slots #      1.0 UPI                      (29.28%)
     1,204,519,156      inst_retired.any                                              (29.28%)
     1,204,031,494      inst_retired.any          #      0.4 CPI                      (23.43%)
       528,232,797      cycles                                                        (23.43%)
     1,312,254,164      uops_executed.thread      #      2.6 ILP                      (17.57%)
       497,642,069      uops_executed.core_cycles_ge_1                                     (17.57%)

       0.222836590 seconds time elapsed

       0.217941000 seconds user
       0.002964000 seconds sys
We can see that the slower one has more cycle_activity.cycles_mem_any, uops_issued.stall_cycles, br_misp_retired.all_branches and icache_64b.iftag_hit. I don't know if this means the slower one has more misprediction in pipeline and thus have to re-fetch more instructions.

I would turn off hyper-threading, start up in single user mode, and run the tests again. The numbers above look like the pipeline might be switching between threads of execution, and the nop is perhaps allowing that switch to happen at a more opportune time (note, this is just a guess, I'm no hyper-threading expert.)

With no hyper-threading, I suspect the difference will narrow. I'll also check my hyper-threading status.
thewrongchristian
Member
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: How can a nop instruction speed up my test program?

Post by thewrongchristian »

thewrongchristian wrote:
haolee wrote:I use perf to collect some important counters. here are the results:

[snip]

I would turn off hyper-threading, start up in single user mode, and run the tests again. The numbers above look like the pipeline might be switching between threads of execution, and the nop is perhaps allowing that switch to happen at a more opportune time (note, this is just a guess, I'm no hyper-threading expert.)

With no hyper-threading, I suspect the difference will narrow. I'll also check my hyper-threading status.
I did some tests using perf, and the consistent result I saw was a higher CPU clock correlated with higher performance (as you'd expect).

I didn't turn off hyper-threading nor run single user to do the test.

Just using: perf spec <cmd> 0

Code: Select all

&i = 7ffcfc0a0d40, &offset=7ffcfc0a0d80
ns: 2.162476

 Performance counter stats for './cachep 0':

            216.82 msec task-clock                #    0.997 CPUs utilized          
                 2      context-switches          #    9.224 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                62      page-faults               #  285.953 /sec                   
       630,072,269      cycles                    #    2.906 GHz                      (83.40%)
       328,427,135      stalled-cycles-frontend   #   52.13% frontend cycles idle     (83.40%)
         4,334,423      stalled-cycles-backend    #    0.69% backend cycles idle      (66.80%)
       999,062,021      instructions              #    1.59  insn per cycle         
                                                  #    0.33  stalled cycles per insn  (83.40%)
        99,788,133      branches                  #  460.238 M/sec                    (83.40%)
             2,898      branch-misses             #    0.00% of all branches          (83.01%)

       0.217446367 seconds time elapsed

       0.217450000 seconds user
       0.000000000 seconds sys


&i = 7ffc256532c0, &offset=7ffc25653300
ns: 2.017880

 Performance counter stats for './cachepnop 0':

            202.43 msec task-clock                #    0.997 CPUs utilized          
                 5      context-switches          #   24.699 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                64      page-faults               #  316.152 /sec                   
       618,866,183      cycles                    #    3.057 GHz                      (82.22%)
       317,277,966      stalled-cycles-frontend   #   51.27% frontend cycles idle     (82.22%)
         4,277,784      stalled-cycles-backend    #    0.69% backend cycles idle      (67.89%)
     1,100,042,543      instructions              #    1.78  insn per cycle         
                                                  #    0.29  stalled cycles per insn  (84.20%)
       100,308,804      branches                  #  495.512 M/sec                    (84.22%)
             5,670      branch-misses             #    0.01% of all branches          (83.44%)

       0.203112777 seconds time elapsed

       0.203148000 seconds user
       0.000000000 seconds sys
So I think the nop is just preventing CPU clock throttling.

Retry your tests without -e flag to perf stat, and see what your CPU clock rate is for nop-less and nop tests.
User avatar
haolee
Posts: 22
Joined: Thu Jul 13, 2017 9:52 am

Re: How can a nop instruction speed up my test program?

Post by haolee »

So I think the nop is just preventing CPU clock throttling.

Retry your tests without -e flag to perf stat, and see what your CPU clock rate is for nop-less and nop tests.
Unfortunately, on my Xeon Platinum 8260 CPU, the slower one and the faster one don't have difference on CPU clock rate which is both 2.394GHz.

Code: Select all

# ./slow 0
772,752,989      cycles                    #    2.394 GHz
# ./fast 0
537,562,083      cycles                    #    2.394 GHz
More over, I found another strange thing. We first generate a asm file.

Code: Select all

$ gcc -S test.c -o slow.s
Then edit this file.

Before:

Code: Select all

.L6:
        movq    -176(%rbp), %rdx
        movq    -64(%rbp), %rax
        addq    %rdx, %rax
        movq    $0, (%rax)
        movq    -112(%rbp), %rax
        addq    $1, %rax
        movq    %rax, -112(%rbp)
.L5:
        movq    -112(%rbp), %rax
        cmpq    $99999999, %rax
After:

Code: Select all

.L6:
        movq    -176(%rbp), %rdx
        movq    -64(%rbp), %rax
        addq    %rdx, %rax
        movq    $0, (%rax)
        addq    $1, -112(%rbp)
.L5:
        movq    -112(%rbp), %rax
        cmpq    $99999999, %rax
Then compile it and test.

Code: Select all

$ gcc slow.s -o slow ; ./slow 0
&i = 7ffe81263600, &offset=7ffe812635c0
ns: 3.218585
It's slow as we expected. But now we use -ftree-ter to compile this program from .c file.

Code: Select all

$ gcc -ftree-ter test.c -o fast; ./fast 0
&i = 7ffc6c8c72c0, &offset=7ffc6c8c7280
ns: 2.094179
It becomes faster.

And now we compare their asm code:

Code: Select all

objdump slow -d > slow.dump
objdump fast -d > fast.dump
vim slow.dump # delete the address columns
vim fast.dump # delete the address columns
diff -u slow.dump fast.dump

Code: Select all

-e8 b0 fd ff ff         callq  400620 <clock_gettime@plt>
+e8 b3 fd ff ff         callq  400620 <clock_gettime@plt>
 48 c7 45 90 00 00 00   movq   $0x0,-0x70(%rbp)
 00
-eb 1a                  jmp    400894 <main+0x12e>
+eb 1a                  jmp    400891 <main+0x12b>
 48 8b 95 50 ff ff ff   mov    -0xb0(%rbp),%rdx
 48 8b 45 c0            mov    -0x40(%rbp),%rax
 48 01 d0               add    %rdx,%rax
@@ -204,11 +203,11 @@
 48 83 45 90 01         addq   $0x1,-0x70(%rbp)
 48 8b 45 90            mov    -0x70(%rbp),%rax
 48 3d ff e0 f5 05      cmp    $0x5f5e0ff,%rax
-76 da                  jbe    40087a <main+0x114>
+76 da                  jbe    400877 <main+0x111>
 48 8d 85 30 ff ff ff   lea    -0xd0(%rbp),%rax
 48 89 c6               mov    %rax,%rsi
 bf 01 00 00 00         mov    $0x1,%edi
-e8 6c fd ff ff         callq  400620 <clock_gettime@plt>
+e8 6f fd ff ff         callq  400620 <clock_gettime@plt>
They don't have any significant difference. This means their only differences are the instruction addresses. what a weird behavior...
Post Reply