
Re: Slow Floating Point Performance in VirtualBox

Posted: Sun Feb 09, 2020 8:49 am
by Octocontrabass
kemosparc wrote:So the XRSTOR or XRSTORS need to be done at boot time once, right?
Typically you would use XRSTOR/XRSTORS on each task switch (with a corresponding XSAVE/XSAVEC/XSAVEOPT/XSAVES). If you don't have task switches yet, then just once during boot should be enough.
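
Something like this, as a rough sketch (not tested; the 4096-byte size is a placeholder — the real XSAVE area size comes from CPUID leaf 0DH — and CR4.OSXSAVE/XCR0 are assumed to be set up already):

Code: Select all

#include <stdint.h>

/* Per-task save area; XSAVE/XRSTOR require 64-byte alignment. */
struct xsave_area {
    uint8_t bytes[4096];            /* placeholder size; query CPUID.0DH */
} __attribute__((aligned(64)));

struct task {
    struct xsave_area fpu;
    /* ... other per-task state ... */
};

/* Called on a task switch: save the outgoing task's extended state,
   load the incoming task's. EDX:EAX selects all enabled components. */
static inline void switch_fpu_state(struct task *prev, struct task *next)
{
    uint32_t lo = 0xFFFFFFFF, hi = 0xFFFFFFFF;
    asm volatile("xsave %0"  : "+m"(prev->fpu) : "a"(lo), "d"(hi));
    asm volatile("xrstor %0" : : "m"(next->fpu), "a"(lo), "d"(hi));
}
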
kemosparc wrote:I expect that if I inject XRSTOR/XRSTORS/VZEROUPPER/VZEROALL into my code, things might get faster. My question is: what other things should I look for regarding throttling? In my case it does not seem normal!
I'm not aware of any other situation that could cause the CPU to throttle only when running floating-point code.

Re: Slow Floating Point Performance in VirtualBox

Posted: Sun Feb 09, 2020 9:05 am
by kemosparc
Hi,

I got around a 3.5-4% increase in speed when I added vzeroupper to all my AVX2 code :)

I have another question, though, if you can please help me with it. How can I implement fastcall in x86_64? I have read that it should only be used with i386, but I have some very small functions whose bodies can work directly on the ABI's integer parameters (rdi, rsi, rdx, rcx, r8, r9) passed through the registers, and still the compiler copies the register values to the stack and then back into another set of registers; eliminating that, I think, could be an optimization.

Thanks,
Karim.

Re: Slow Floating Point Performance in VirtualBox

Posted: Sun Feb 09, 2020 9:25 am
by Octocontrabass
There's no fastcall ABI for x64. The two x64 ABIs, System V and Microsoft, are pretty close to fastcall anyway.
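
For example (my illustration, not your code), a small leaf function already works directly on its register arguments under the System V x86-64 ABI:

Code: Select all

/* gcc -O2 typically compiles this to something like
       imulq %rsi, %rdi
       leaq  (%rdi,%rdx), %rax
       ret
   i.e. the arguments stay in rdi/rsi/rdx and nothing touches the stack. */
long madd(long a, long b, long c)
{
    return a * b + c;
}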

Can you show the code for the function you're trying to optimize? You might be doing something that convinces GCC it has to spill the registers to the stack. (Unfortunately, it's also possible GCC is just stupid.)

Re: Slow Floating Point Performance in VirtualBox

Posted: Sun Feb 09, 2020 11:18 am
by kemosparc
It seems that I am the stupid one, not GCC. I had this question from earlier, but now that I have increased the optimization levels, GCC is getting smarter and eliminating those parts :)

One extra question :)

Now I am trying to run benchmarks on data being read from files. I did not have time to implement a buffer cache in my kernel, but I am planning to do so soon. So I would like Linux to bypass its buffer cache, since part of the measurement is the file I/O. In previous, simpler benchmarks I was able to isolate I/O from processing and measure only the processing part. But this time the I/O is interleaved with processing, so it is difficult to do that.

Is there a way to make Linux bypass its buffer cache?

Thanks,
Karim.

Re: Slow Floating Point Performance in VirtualBox

Posted: Sun Feb 09, 2020 11:24 am
by Korona
You can use O_DIRECT and/or madvise/fadvise(DONTNEED) for that.
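
Rough sketch of both (the file name and the 4096-byte block size are placeholders; O_DIRECT requires the buffer, file offset, and transfer size to be suitably aligned, typically to the logical block size):

Code: Select all

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Option 1: bypass the page cache entirely. */
    int fd = open("data.bin", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) return 1;  /* aligned buffer */
    ssize_t n = read(fd, buf, 4096);                 /* uncached read */
    printf("read %zd bytes\n", n);

    /* Option 2: after normal (cached) I/O, drop the cached pages for
       this file so the next run starts cold. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    free(buf);
    close(fd);
    return 0;
}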

Re: Slow Floating Point Performance in VirtualBox

Posted: Tue Feb 11, 2020 8:06 am
by kemosparc
Thanks a lot :)

I am working on it.

Karim.

Re: Slow Floating Point Performance in VirtualBox

Posted: Tue Feb 11, 2020 10:49 am
by Korona
By the way, I noticed that in my apples-to-apples comparison list, I forgot to mention the elephant in the room: make sure to disable speculative execution mitigations on Linux. That might explain the slowness of softirqs in your system.
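
On recent kernels (v5.2 and later) they can all be switched off at once from the kernel command line; something like this for a GRUB-based setup (older kernels need the individual switches such as nopti and nospectre_v2):

Code: Select all

# /etc/default/grub -- then run update-grub and reboot
GRUB_CMDLINE_LINUX_DEFAULT="mitigations=off"
# verify after reboot:
#   grep . /sys/devices/system/cpu/vulnerabilities/*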

Re: Slow Floating Point Performance in VirtualBox

Posted: Wed Feb 12, 2020 9:37 am
by kemosparc
Hi Korona,

Thanks for the message. Can you please provide more information about how to disable it?

Hi Octocontrabass,

So I am a little bit confused about XRSTOR/XRSTORS and XSAVE/XSAVEC/XSAVEOPT/XSAVES. I read the documentation, and it is clear that they save and restore the extended registers. So what I inferred, and please correct me if I am wrong, is that I need to xsave into memory at the beginning of my interrupt handler and xrstor before returning to user mode, right?

In my case I have 4 cores. One of them is the BSP and has the APIC local timer enabled on it, so I guess in that case I need to xsave/xrstor, right? Please confirm.
I have a core that I dedicated to disk I/O, using the I/O APIC to route all disk interrupts to it, and I use DMA to read/write LBA48, but no APIC timer is enabled. So in that case, will I need to xsave/xrstor on every ATA interrupt? Please confirm.
A third core has no HW interrupts enabled at all and no APIC timer enabled on it, and is used specifically for running my code, which contains a lot of floating point instructions; most of it is floating point calculation for some scientific application whose code I got ready. But the code generates system call interrupts for file I/O, and upon returning from I/O I still go back to the same process, so there is ring switching between ring 3 and ring 0, but really no switching between different processes so far; I do not even have to switch the address space, as I map the kernel space read-only into my process page table. Do I have to perform xsave/xrstor in the system call interrupt handler at ring 0?

My final, most important question is: why will xsave/xrstor enhance FPU execution and reduce throttling, when the documentation says they just save and restore the extended registers? Also, you advised that I run it once in the beginning in case I have no process switching; in that case, what should I save and restore?

Note: my kernel does not perform any FPU instructions but it still uses ymm registers in some cases.

Also, Korona, if you have answers to the above inquiries, kindly let me know.

Thank you both for your feedback :)
Karim.

Re: Slow Floating Point Performance in VirtualBox

Posted: Thu Feb 13, 2020 5:54 am
by Octocontrabass
kemosparc wrote:So I am a little bit confused about XRSTOR/XRSTORS and XSAVE/XSAVEC/XSAVEOPT/XSAVES. I read the documentation, and it is clear that they save and restore the extended registers. So what I inferred, and please correct me if I am wrong, is that I need to xsave into memory at the beginning of my interrupt handler and xrstor before returning to user mode, right?

In my case I have 4 cores. One of them is the BSP and has the APIC local timer enabled on it, so I guess in that case I need to xsave/xrstor, right? Please confirm.
I have a core that I dedicated to disk I/O, using the I/O APIC to route all disk interrupts to it, and I use DMA to read/write LBA48, but no APIC timer is enabled. So in that case, will I need to xsave/xrstor on every ATA interrupt? Please confirm.
Assuming both the interrupted program and the interrupt handler are allowed to use x87/SSE/AVX registers, yes, you must save and restore them. If the interrupted program was using AVX registers, your interrupt handler will have the AVX penalty until it clears the AVX state.
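
A rough sketch of that pattern (assuming one save area per CPU, no nested interrupts reusing the same buffer, and a placeholder 4096-byte size):

Code: Select all

#include <stdint.h>

struct xsave_area {
    uint8_t bytes[4096];            /* placeholder; query CPUID.0DH for size */
} __attribute__((aligned(64)));     /* XSAVE/XRSTOR need 64-byte alignment */

static struct xsave_area irq_fpu;   /* one per CPU in a real kernel */

void irq_handler(void)
{
    uint32_t lo = 0xFFFFFFFF, hi = 0xFFFFFFFF;  /* EDX:EAX: all components */

    /* Save the interrupted program's x87/SSE/AVX state... */
    asm volatile("xsave %0" : "+m"(irq_fpu) : "a"(lo), "d"(hi));
    /* ...and clear the dirty upper-YMM state so the handler itself
       doesn't run with the AVX penalty. */
    asm volatile("vzeroupper");

    /* ... handler body, free to use vector registers ... */

    /* Restore the interrupted program's state before returning. */
    asm volatile("xrstor %0" : : "m"(irq_fpu), "a"(lo), "d"(hi));
}
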
kemosparc wrote:A third core has no HW interrupts enabled at all and no APIC timer enabled on it, and is used specifically for running my code, which contains a lot of floating point instructions; most of it is floating point calculation for some scientific application whose code I got ready. But the code generates system call interrupts for file I/O, and upon returning from I/O I still go back to the same process, so there is ring switching between ring 3 and ring 0, but really no switching between different processes so far; I do not even have to switch the address space, as I map the kernel space read-only into my process page table. Do I have to perform xsave/xrstor in the system call interrupt handler at ring 0?
That's up to you. It's your system call ABI, you get to decide how it works.
kemosparc wrote:My final, most important question is: why will xsave/xrstor enhance FPU execution and reduce throttling, when the documentation says they just save and restore the extended registers?
It also tracks whether the extended registers are in their initial state. The YMM registers must be in their initial state to avoid penalties.
kemosparc wrote:Also, you advised that I run it once in the beginning in case I have no process switching; in that case, what should I save and restore?
Create an XSAVE area with all registers in their initial state and all bits of XSTATE_BV clear, then load it with XRSTOR or XRSTORS.
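
Something along these lines (rough sketch; assumes XCR0 is already configured and the placeholder 4096 bytes covers the enabled components):

Code: Select all

#include <stdint.h>

/* XRSTOR requires 64-byte alignment. In the XSAVE header, XSTATE_BV
   (bytes 512-519) all-clear marks every component as being in its
   initial state, and XCOMP_BV (bytes 520-527) must be 0 for the
   standard form of XRSTOR. */
static uint8_t init_area[4096] __attribute__((aligned(64)));

void load_init_state(void)
{
    for (unsigned i = 0; i < sizeof init_area; i++)
        init_area[i] = 0;

    /* MXCSR sits at offset 24 of the legacy region and is loaded even
       when the SSE bit of XSTATE_BV is clear; 0x1F80 is its reset value. */
    *(uint32_t *)(init_area + 24) = 0x1F80;

    uint32_t lo = 0xFFFFFFFF, hi = 0xFFFFFFFF;  /* restore all components */
    asm volatile("xrstor %0" : : "m"(init_area), "a"(lo), "d"(hi));
}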

Re: Slow Floating Point Performance in VirtualBox

Posted: Sun Mar 08, 2020 12:57 pm
by kemosparc
Hi,
I am still working on this, with very slow progress. I have applied the suggestions of adding XRSTOR/XRSTORS and XSAVE/XSAVEC/XSAVEOPT/XSAVES and made sure they execute during interrupt handling.

I have done some optimizations and things are relatively better, but here is the situation so far; I would appreciate it if anyone could provide an explanation.

I have two small pieces of code that I run in both environments, Linux and my OS. Both allocate memory and run a long doubly nested loop performing some arithmetic operations. The first piece of code performs double precision floating point operations in the body of the inner loop, and the other performs long integer (64-bit) operations.

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main(void)
{
    double *d = (double *) calloc(1024*1024*16, sizeof(double));
    struct timeval st, et;
    gettimeofday(&st, NULL);
    for (int j = 1; j < 1024*8; j++)
        for (int i = 0; i < 1024*1024*16; i++)
            d[i] = d[i] * j + i;
    printf("%f\n", d[1000]);
    gettimeofday(&et, NULL);
    printf("%lu microsec\n", ((et.tv_sec - st.tv_sec) * 1000000) + (et.tv_usec - st.tv_usec));
    free(d);
    return 0;
}

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main(void)
{
    long *d = (long *) calloc(1024*1024*16, sizeof(long));
    struct timeval st, et;
    gettimeofday(&st, NULL);
    for (int j = 1; j < 1024*8; j++)
        for (int i = 0; i < 1024*1024*16; i++)
            d[i] = d[i] * j + i;
    printf("%lu\n", d[1000]);
    gettimeofday(&et, NULL);
    printf("%lu microsec\n", ((et.tv_sec - st.tv_sec) * 1000000) + (et.tv_usec - st.tv_usec));
    free(d);
    return 0;
}
I compile both on Linux and on my OS using the following g++ compiler switches:

Code: Select all

-Ofast  -m128bit-long-double -m64 -m80387 -mabm -maes -malign-stringops -mavx -mavx2 -mbmi -mbmi2 -mcx16 -mf16c -mfancy-math-387 -mfma -mfp-ret-in-387 -mfsgsbase -mfxsr -mhard-float -mhle -mieee-fp -mlong-double-80 -mlzcnt -mmmx -mmovbe -mpclmul -mpopcnt -mprfchw -mpush-args -mrdrnd -mred-zone -msahf -msse -msse2 -msse3 -msse4 -msse4.1 -msse4.2 -mssse3 -mstv -mtls-direct-seg-refs -mvzeroupper -mxsave -mxsaveopt -ffast-math -malign-double -fno-omit-frame-pointer -fno-align-functions -fno-align-loops -m3dnow -fsched-spec -fsched-interblock -fschedule-insns -fschedule-insns2 -fsched2-use-traces -freschedule-modulo-scheduled-loops -fselective-scheduling  -fsel-sched-pipelining -fsel-sched-pipelining-outer-loops 
Now the floating point code takes almost the same time on Linux and my OS: around 50 seconds on average over multiple runs. The long integer code, on the other hand, takes 110 seconds on Linux and 60 seconds on my OS, again on average.

What actually attracts my attention is that the long integer code is slower than the floating point code in both environments. Of course, in my OS the slowdown is smaller. So, forgetting about my OS for a moment: the floating point code is almost twice as fast as the long integer code on Linux! This is very counterintuitive, as it is well known that floating point operations are more resource/time demanding than integer arithmetic!

Keep in mind that in my OS's case the cores running those two pieces of code have the timer disabled, so the integer arithmetic case makes sense to me, as it agrees with all the experiments I ran earlier.

This brings up the question: can it be that the floating point operations are executed by some co-processor or isolated hardware that is not affected by the timer and the interrupt handling?

Keep in mind that all the time measurements above exclude memory allocation, so it is just the loop and the arithmetic.

Please if you have any explanation for the above kindly let me know.

Thanks a lot,
Karim.

Re: Slow Floating Point Performance in VirtualBox

Posted: Sun Mar 08, 2020 2:12 pm
by nullplan
I cannot reproduce your results on my laptop (Intel Core i5 processor). Both programs take about the same time to complete. Several effects can be at work here, though:

The program using "double" prints "inf" on my machine. Maybe your processor has an optimization to shortcut calculations with infinity, thus removing processing time from the later runs through the loop. The program using "long" cannot do that, for obvious reasons.

Your compiler may be using vectorization in the case of "double" but not "long". Have it generate the assembly (-S) and check that to be sure. Generating two results per iteration probably makes the loop more efficient.
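
For example (file names are placeholders):

Code: Select all

g++ -Ofast -mavx2 -S -o test_dbl.s test_dbl.cpp
grep -cE 'vmulpd|vfmadd' test_dbl.s   # packed-double/FMA ops => vectorized

g++ -Ofast -mavx2 -S -o test_lng.s test_lng.cpp
grep -cE 'vpmul|vpadd' test_lng.s     # packed-integer ops, if any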

To answer your question: no, math is not done on a co-processor anymore, and in any case, it wouldn't ignore interrupts. Normal coprocessors are programmed by writing a program into their memory and having the coprocessor run it. That is not how the x87 ever worked: the x87 got its instructions directly from the x86 instruction stream. Thus, if the x86 core was interrupted, the x87 could not continue either, even when it was a separate unit. And ever since the 486DX, the x87 core has been integrated into the CPU (the 487 was a full-blown 486DX core and would take over all processing duties on systems it was installed in). None of this matters in your case anyway, since in 64-bit mode, arithmetic on "float" and "double" is done with SSE instructions, and those were never executed by a separate unit.

Re: Slow Floating Point Performance in VirtualBox

Posted: Wed Mar 11, 2020 9:31 am
by kemosparc
Hi,

Here is the objdump output of the long integer program:

Code: Select all

test_lng:     file format elf64-x86-64


Disassembly of section .init:

0000000000001000 <_init>:
    1000:	48 83 ec 08          	sub    $0x8,%rsp
    1004:	48 8b 05 e5 2f 00 00 	mov    0x2fe5(%rip),%rax        # 3ff0 <__gmon_start__>
    100b:	48 85 c0             	test   %rax,%rax
    100e:	74 02                	je     1012 <_init+0x12>
    1010:	ff d0                	callq  *%rax
    1012:	48 83 c4 08          	add    $0x8,%rsp
    1016:	c3                   	retq   

Disassembly of section .plt:

0000000000001020 <.plt>:
    1020:	ff 35 e2 2f 00 00    	pushq  0x2fe2(%rip)        # 4008 <_GLOBAL_OFFSET_TABLE_+0x8>
    1026:	ff 25 e4 2f 00 00    	jmpq   *0x2fe4(%rip)        # 4010 <_GLOBAL_OFFSET_TABLE_+0x10>
    102c:	0f 1f 40 00          	nopl   0x0(%rax)

0000000000001030 <printf@plt>:
    1030:	ff 25 e2 2f 00 00    	jmpq   *0x2fe2(%rip)        # 4018 <printf@GLIBC_2.2.5>
    1036:	68 00 00 00 00       	pushq  $0x0
    103b:	e9 e0 ff ff ff       	jmpq   1020 <.plt>

0000000000001040 <calloc@plt>:
    1040:	ff 25 da 2f 00 00    	jmpq   *0x2fda(%rip)        # 4020 <calloc@GLIBC_2.2.5>
    1046:	68 01 00 00 00       	pushq  $0x1
    104b:	e9 d0 ff ff ff       	jmpq   1020 <.plt>

0000000000001050 <free@plt>:
    1050:	ff 25 d2 2f 00 00    	jmpq   *0x2fd2(%rip)        # 4028 <free@GLIBC_2.2.5>
    1056:	68 02 00 00 00       	pushq  $0x2
    105b:	e9 c0 ff ff ff       	jmpq   1020 <.plt>

0000000000001060 <gettimeofday@plt>:
    1060:	ff 25 ca 2f 00 00    	jmpq   *0x2fca(%rip)        # 4030 <gettimeofday@GLIBC_2.2.5>
    1066:	68 03 00 00 00       	pushq  $0x3
    106b:	e9 b0 ff ff ff       	jmpq   1020 <.plt>

Disassembly of section .plt.got:

0000000000001070 <__cxa_finalize@plt>:
    1070:	ff 25 62 2f 00 00    	jmpq   *0x2f62(%rip)        # 3fd8 <__cxa_finalize@GLIBC_2.2.5>
    1076:	66 90                	xchg   %ax,%ax

Disassembly of section .text:

0000000000001080 <main>:
    1080:	55                   	push   %rbp
    1081:	be 08 00 00 00       	mov    $0x8,%esi
    1086:	bf 00 00 00 01       	mov    $0x1000000,%edi
    108b:	48 89 e5             	mov    %rsp,%rbp
    108e:	53                   	push   %rbx
    108f:	48 83 e4 e0          	and    $0xffffffffffffffe0,%rsp
    1093:	48 83 ec 60          	sub    $0x60,%rsp
    1097:	e8 a4 ff ff ff       	callq  1040 <calloc@plt>
    109c:	31 f6                	xor    %esi,%esi
    109e:	48 8d 7c 24 40       	lea    0x40(%rsp),%rdi
    10a3:	48 89 c3             	mov    %rax,%rbx
    10a6:	e8 b5 ff ff ff       	callq  1060 <gettimeofday@plt>
    10ab:	c5 7d 6f 2d 6d 0f 00 	vmovdqa 0xf6d(%rip),%ymm13        # 2020 <_IO_stdin_used+0x20>
    10b2:	00 
    10b3:	b9 ff 0f 00 00       	mov    $0xfff,%ecx
    10b8:	48 89 d8             	mov    %rbx,%rax
    10bb:	c5 7d 6f 1d 7d 0f 00 	vmovdqa 0xf7d(%rip),%ymm11        # 2040 <_IO_stdin_used+0x40>
    10c2:	00 
    10c3:	ba 01 00 00 00       	mov    $0x1,%edx
    10c8:	48 8d b3 00 00 00 08 	lea    0x8000000(%rbx),%rsi
    10cf:	c4 c2 7d 25 f5       	vpmovsxdq %xmm13,%ymm6
    10d4:	44 8d 4a 01          	lea    0x1(%rdx),%r9d
    10d8:	48 63 fa             	movslq %edx,%rdi
    10db:	c4 43 7d 39 e8 01    	vextracti128 $0x1,%ymm13,%xmm8
    10e1:	49 89 d8             	mov    %rbx,%r8
    10e4:	44 89 4c 24 20       	mov    %r9d,0x20(%rsp)
    10e9:	c4 e2 7d 58 44 24 20 	vpbroadcastd 0x20(%rsp),%ymm0
    10f0:	c4 c1 15 fe fb       	vpaddd %ymm11,%ymm13,%ymm7
    10f5:	c4 42 7d 25 c0       	vpmovsxdq %xmm8,%ymm8
    10fa:	48 89 7c 24 20       	mov    %rdi,0x20(%rsp)
    10ff:	c4 e2 7d 59 5c 24 20 	vpbroadcastq 0x20(%rsp),%ymm3
    1106:	c4 c3 7d 39 c1 01    	vextracti128 $0x1,%ymm0,%xmm9
    110c:	c4 e2 7d 25 d0       	vpmovsxdq %xmm0,%ymm2
    1111:	c5 9d 73 d2 20       	vpsrlq $0x20,%ymm2,%ymm12
    1116:	c5 ad 73 d3 20       	vpsrlq $0x20,%ymm3,%ymm10
    111b:	c4 42 7d 25 c9       	vpmovsxdq %xmm9,%ymm9
    1120:	c4 c1 55 73 d1 20    	vpsrlq $0x20,%ymm9,%ymm5
    1126:	c5 fd 7f 6c 24 20    	vmovdqa %ymm5,0x20(%rsp)
    112c:	c5 fa 6f 60 20       	vmovdqu 0x20(%rax),%xmm4
    1131:	c5 fa 6f 28          	vmovdqu (%rax),%xmm5
    1135:	48 83 c0 40          	add    $0x40,%rax
    1139:	c5 f8 29 7c 24 10    	vmovaps %xmm7,0x10(%rsp)
    113f:	c4 e3 5d 38 48 f0 01 	vinserti128 $0x1,-0x10(%rax),%ymm4,%ymm1
    1146:	c4 e3 55 38 40 d0 01 	vinserti128 $0x1,-0x30(%rax),%ymm5,%ymm0
    114d:	c4 e3 7d 39 3c 24 01 	vextracti128 $0x1,%ymm7,(%rsp)
    1154:	c4 c1 45 fe fb       	vpaddd %ymm11,%ymm7,%ymm7
    1159:	c5 d5 73 d1 20       	vpsrlq $0x20,%ymm1,%ymm5
    115e:	c5 dd 73 d0 20       	vpsrlq $0x20,%ymm0,%ymm4
    1163:	c5 2d f4 f9          	vpmuludq %ymm1,%ymm10,%ymm15
    1167:	c5 2d f4 f0          	vpmuludq %ymm0,%ymm10,%ymm14
    116b:	c5 d5 f4 eb          	vpmuludq %ymm3,%ymm5,%ymm5
    116f:	c5 dd f4 e3          	vpmuludq %ymm3,%ymm4,%ymm4
    1173:	c5 fd f4 c3          	vpmuludq %ymm3,%ymm0,%ymm0
    1177:	c5 f5 f4 cb          	vpmuludq %ymm3,%ymm1,%ymm1
    117b:	c4 c1 55 d4 ef       	vpaddq %ymm15,%ymm5,%ymm5
    1180:	c4 c1 5d d4 e6       	vpaddq %ymm14,%ymm4,%ymm4
    1185:	c5 d5 73 f5 20       	vpsllq $0x20,%ymm5,%ymm5
    118a:	c5 dd 73 f4 20       	vpsllq $0x20,%ymm4,%ymm4
    118f:	c5 f5 d4 cd          	vpaddq %ymm5,%ymm1,%ymm1
    1193:	c5 fd d4 e4          	vpaddq %ymm4,%ymm0,%ymm4
    1197:	c4 c1 75 d4 c0       	vpaddq %ymm8,%ymm1,%ymm0
    119c:	c5 dd d4 ce          	vpaddq %ymm6,%ymm4,%ymm1
    11a0:	c5 7d f4 7c 24 20    	vpmuludq 0x20(%rsp),%ymm0,%ymm15
    11a6:	c5 d5 73 d0 20       	vpsrlq $0x20,%ymm0,%ymm5
    11ab:	c5 dd 73 d1 20       	vpsrlq $0x20,%ymm1,%ymm4
    11b0:	c5 1d f4 f1          	vpmuludq %ymm1,%ymm12,%ymm14
    11b4:	c4 c1 55 f4 e9       	vpmuludq %ymm9,%ymm5,%ymm5
    11b9:	c5 dd f4 e2          	vpmuludq %ymm2,%ymm4,%ymm4
    11bd:	c4 c1 7d f4 c1       	vpmuludq %ymm9,%ymm0,%ymm0
    11c2:	c5 f5 f4 ca          	vpmuludq %ymm2,%ymm1,%ymm1
    11c6:	c4 c1 55 d4 ef       	vpaddq %ymm15,%ymm5,%ymm5
    11cb:	c4 c1 5d d4 e6       	vpaddq %ymm14,%ymm4,%ymm4
    11d0:	c5 d5 73 f5 20       	vpsllq $0x20,%ymm5,%ymm5
    11d5:	c5 dd 73 f4 20       	vpsllq $0x20,%ymm4,%ymm4
    11da:	c5 fd d4 c5          	vpaddq %ymm5,%ymm0,%ymm0
    11de:	c5 f5 d4 cc          	vpaddq %ymm4,%ymm1,%ymm1
    11e2:	c4 c1 7d d4 c0       	vpaddq %ymm8,%ymm0,%ymm0
    11e7:	c4 62 7d 25 04 24    	vpmovsxdq (%rsp),%ymm8
    11ed:	c5 f5 d4 ce          	vpaddq %ymm6,%ymm1,%ymm1
    11f1:	c4 e3 7d 39 40 f0 01 	vextracti128 $0x1,%ymm0,-0x10(%rax)
    11f8:	c5 f8 11 40 e0       	vmovups %xmm0,-0x20(%rax)
    11fd:	c4 e2 7d 25 74 24 10 	vpmovsxdq 0x10(%rsp),%ymm6
    1204:	c5 f8 11 48 c0       	vmovups %xmm1,-0x40(%rax)
    1209:	c4 e3 7d 39 48 d0 01 	vextracti128 $0x1,%ymm1,-0x30(%rax)
    1210:	48 39 f0             	cmp    %rsi,%rax
    1213:	0f 85 13 ff ff ff    	jne    112c <main+0xac>
    1219:	83 c2 02             	add    $0x2,%edx
    121c:	83 e9 01             	sub    $0x1,%ecx
    121f:	c4 c2 7d 25 f5       	vpmovsxdq %xmm13,%ymm6
    1224:	48 89 d8             	mov    %rbx,%rax
    1227:	0f 85 a7 fe ff ff    	jne    10d4 <main+0x54>
    122d:	4c 63 ca             	movslq %edx,%r9
    1230:	41 ba ff 1f 00 00    	mov    $0x1fff,%r10d
    1236:	31 f6                	xor    %esi,%esi
    1238:	41 29 d2             	sub    %edx,%r10d
    123b:	4c 89 c9             	mov    %r9,%rcx
    123e:	49 8d 51 01          	lea    0x1(%r9),%rdx
    1242:	49 8b 00             	mov    (%r8),%rax
    1245:	4a 8d 3c 12          	lea    (%rdx,%r10,1),%rdi
    1249:	48 0f af c1          	imul   %rcx,%rax
    124d:	48 89 d1             	mov    %rdx,%rcx
    1250:	48 01 f0             	add    %rsi,%rax
    1253:	48 39 fa             	cmp    %rdi,%rdx
    1256:	48 8d 52 01          	lea    0x1(%rdx),%rdx
    125a:	75 ed                	jne    1249 <main+0x1c9>
    125c:	48 83 c6 01          	add    $0x1,%rsi
    1260:	49 89 00             	mov    %rax,(%r8)
    1263:	4c 89 c9             	mov    %r9,%rcx
    1266:	49 83 c0 08          	add    $0x8,%r8
    126a:	48 81 fe 00 00 00 01 	cmp    $0x1000000,%rsi
    1271:	75 cb                	jne    123e <main+0x1be>
    1273:	48 8b b3 40 1f 00 00 	mov    0x1f40(%rbx),%rsi
    127a:	48 8d 3d 83 0d 00 00 	lea    0xd83(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
    1281:	31 c0                	xor    %eax,%eax
    1283:	c5 f8 77             	vzeroupper 
    1286:	e8 a5 fd ff ff       	callq  1030 <printf@plt>
    128b:	48 8d 7c 24 50       	lea    0x50(%rsp),%rdi
    1290:	31 f6                	xor    %esi,%esi
    1292:	e8 c9 fd ff ff       	callq  1060 <gettimeofday@plt>
    1297:	48 8b 74 24 50       	mov    0x50(%rsp),%rsi
    129c:	48 2b 74 24 40       	sub    0x40(%rsp),%rsi
    12a1:	31 c0                	xor    %eax,%eax
    12a3:	48 69 f6 40 42 0f 00 	imul   $0xf4240,%rsi,%rsi
    12aa:	48 8d 3d 58 0d 00 00 	lea    0xd58(%rip),%rdi        # 2009 <_IO_stdin_used+0x9>
    12b1:	48 03 74 24 58       	add    0x58(%rsp),%rsi
    12b6:	48 2b 74 24 48       	sub    0x48(%rsp),%rsi
    12bb:	e8 70 fd ff ff       	callq  1030 <printf@plt>
    12c0:	48 89 df             	mov    %rbx,%rdi
    12c3:	e8 88 fd ff ff       	callq  1050 <free@plt>
    12c8:	31 c0                	xor    %eax,%eax
    12ca:	48 8b 5d f8          	mov    -0x8(%rbp),%rbx
    12ce:	c9                   	leaveq 
    12cf:	c3                   	retq   

00000000000012d0 <set_fast_math>:
    12d0:	0f ae 5c 24 fc       	stmxcsr -0x4(%rsp)
    12d5:	81 4c 24 fc 40 80 00 	orl    $0x8040,-0x4(%rsp)
    12dc:	00 
    12dd:	0f ae 54 24 fc       	ldmxcsr -0x4(%rsp)
    12e2:	c3                   	retq   
    12e3:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
    12ea:	00 00 00 
    12ed:	0f 1f 00             	nopl   (%rax)

00000000000012f0 <_start>:
    12f0:	31 ed                	xor    %ebp,%ebp
    12f2:	49 89 d1             	mov    %rdx,%r9
    12f5:	5e                   	pop    %rsi
    12f6:	48 89 e2             	mov    %rsp,%rdx
    12f9:	48 83 e4 f0          	and    $0xfffffffffffffff0,%rsp
    12fd:	50                   	push   %rax
    12fe:	54                   	push   %rsp
    12ff:	4c 8d 05 3a 01 00 00 	lea    0x13a(%rip),%r8        # 1440 <__libc_csu_fini>
    1306:	48 8d 0d d3 00 00 00 	lea    0xd3(%rip),%rcx        # 13e0 <__libc_csu_init>
    130d:	48 8d 3d 6c fd ff ff 	lea    -0x294(%rip),%rdi        # 1080 <main>
    1314:	ff 15 ce 2c 00 00    	callq  *0x2cce(%rip)        # 3fe8 <__libc_start_main@GLIBC_2.2.5>
    131a:	f4                   	hlt    
    131b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)

0000000000001320 <deregister_tm_clones>:
    1320:	48 8d 3d 21 2d 00 00 	lea    0x2d21(%rip),%rdi        # 4048 <__TMC_END__>
    1327:	48 8d 05 1a 2d 00 00 	lea    0x2d1a(%rip),%rax        # 4048 <__TMC_END__>
    132e:	48 39 f8             	cmp    %rdi,%rax
    1331:	74 15                	je     1348 <deregister_tm_clones+0x28>
    1333:	48 8b 05 a6 2c 00 00 	mov    0x2ca6(%rip),%rax        # 3fe0 <_ITM_deregisterTMCloneTable>
    133a:	48 85 c0             	test   %rax,%rax
    133d:	74 09                	je     1348 <deregister_tm_clones+0x28>
    133f:	ff e0                	jmpq   *%rax
    1341:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
    1348:	c3                   	retq   
    1349:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)

0000000000001350 <register_tm_clones>:
    1350:	48 8d 3d f1 2c 00 00 	lea    0x2cf1(%rip),%rdi        # 4048 <__TMC_END__>
    1357:	48 8d 35 ea 2c 00 00 	lea    0x2cea(%rip),%rsi        # 4048 <__TMC_END__>
    135e:	48 29 fe             	sub    %rdi,%rsi
    1361:	48 c1 fe 03          	sar    $0x3,%rsi
    1365:	48 89 f0             	mov    %rsi,%rax
    1368:	48 c1 e8 3f          	shr    $0x3f,%rax
    136c:	48 01 c6             	add    %rax,%rsi
    136f:	48 d1 fe             	sar    %rsi
    1372:	74 14                	je     1388 <register_tm_clones+0x38>
    1374:	48 8b 05 7d 2c 00 00 	mov    0x2c7d(%rip),%rax        # 3ff8 <_ITM_registerTMCloneTable>
    137b:	48 85 c0             	test   %rax,%rax
    137e:	74 08                	je     1388 <register_tm_clones+0x38>
    1380:	ff e0                	jmpq   *%rax
    1382:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
    1388:	c3                   	retq   
    1389:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)

0000000000001390 <__do_global_dtors_aux>:
    1390:	80 3d b1 2c 00 00 00 	cmpb   $0x0,0x2cb1(%rip)        # 4048 <__TMC_END__>
    1397:	75 2f                	jne    13c8 <__do_global_dtors_aux+0x38>
    1399:	55                   	push   %rbp
    139a:	48 83 3d 36 2c 00 00 	cmpq   $0x0,0x2c36(%rip)        # 3fd8 <__cxa_finalize@GLIBC_2.2.5>
    13a1:	00 
    13a2:	48 89 e5             	mov    %rsp,%rbp
    13a5:	74 0c                	je     13b3 <__do_global_dtors_aux+0x23>
    13a7:	48 8b 3d 92 2c 00 00 	mov    0x2c92(%rip),%rdi        # 4040 <__dso_handle>
    13ae:	e8 bd fc ff ff       	callq  1070 <__cxa_finalize@plt>
    13b3:	e8 68 ff ff ff       	callq  1320 <deregister_tm_clones>
    13b8:	c6 05 89 2c 00 00 01 	movb   $0x1,0x2c89(%rip)        # 4048 <__TMC_END__>
    13bf:	5d                   	pop    %rbp
    13c0:	c3                   	retq   
    13c1:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
    13c8:	c3                   	retq   
    13c9:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)

00000000000013d0 <frame_dummy>:
    13d0:	e9 7b ff ff ff       	jmpq   1350 <register_tm_clones>
    13d5:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
    13dc:	00 00 00 
    13df:	90                   	nop

00000000000013e0 <__libc_csu_init>:
    13e0:	41 57                	push   %r15
    13e2:	49 89 d7             	mov    %rdx,%r15
    13e5:	41 56                	push   %r14
    13e7:	49 89 f6             	mov    %rsi,%r14
    13ea:	41 55                	push   %r13
    13ec:	41 89 fd             	mov    %edi,%r13d
    13ef:	41 54                	push   %r12
    13f1:	4c 8d 25 b8 29 00 00 	lea    0x29b8(%rip),%r12        # 3db0 <__frame_dummy_init_array_entry>
    13f8:	55                   	push   %rbp
    13f9:	48 8d 2d c0 29 00 00 	lea    0x29c0(%rip),%rbp        # 3dc0 <__init_array_end>
    1400:	53                   	push   %rbx
    1401:	4c 29 e5             	sub    %r12,%rbp
    1404:	48 83 ec 08          	sub    $0x8,%rsp
    1408:	e8 f3 fb ff ff       	callq  1000 <_init>
    140d:	48 c1 fd 03          	sar    $0x3,%rbp
    1411:	74 1b                	je     142e <__libc_csu_init+0x4e>
    1413:	31 db                	xor    %ebx,%ebx
    1415:	0f 1f 00             	nopl   (%rax)
    1418:	4c 89 fa             	mov    %r15,%rdx
    141b:	4c 89 f6             	mov    %r14,%rsi
    141e:	44 89 ef             	mov    %r13d,%edi
    1421:	41 ff 14 dc          	callq  *(%r12,%rbx,8)
    1425:	48 83 c3 01          	add    $0x1,%rbx
    1429:	48 39 dd             	cmp    %rbx,%rbp
    142c:	75 ea                	jne    1418 <__libc_csu_init+0x38>
    142e:	48 83 c4 08          	add    $0x8,%rsp
    1432:	5b                   	pop    %rbx
    1433:	5d                   	pop    %rbp
    1434:	41 5c                	pop    %r12
    1436:	41 5d                	pop    %r13
    1438:	41 5e                	pop    %r14
    143a:	41 5f                	pop    %r15
    143c:	c3                   	retq   
    143d:	0f 1f 00             	nopl   (%rax)

0000000000001440 <__libc_csu_fini>:
    1440:	c3                   	retq   

Disassembly of section .fini:

0000000000001444 <_fini>:
    1444:	48 83 ec 08          	sub    $0x8,%rsp
    1448:	48 83 c4 08          	add    $0x8,%rsp
    144c:	c3                   	retq 
It is clear that it uses ymm registers, which shows that it utilizes AVX2.

Thanks,
Karim.