Slow Floating Point Performance in VirtualBox

kemosparc
Member
Posts: 207
Joined: Tue Oct 29, 2013 1:13 pm

Slow Floating Point Performance in VirtualBox

Post by kemosparc »

Hi,

I have a performance problem with my kernel when executing floating point instructions.

My kernel executes normal code at a speed similar to or better than Linux and MacOS (same C code).

To isolate the problem I have used a very simple loop that iterates over an array of doubles and assigns each element the result of a floating point operation:

Code: Select all

for (int i = 0; i < 1024*1024*1024; i++)
    array[i] = b + aa[i] * c;
All the variables (array, b, aa, and c) are doubles. I also excluded the memory allocation time for the arrays from the time calculation; I just measured the loop execution time. This loop runs almost 3 times slower in my kernel. If I change the types to char, it runs 1.5 times faster than Linux and MacOS.

I have been reading a lot of resources and searching for a solution with no luck. I have read about the FPU, and there are some posts on the internet that indicate that I need to detect its support in the processor and enable it.

My questions are:
  • If the FPU is not enabled, does this disable the execution of floating point operations completely, or does it run them at a lower speed? I mean, my code still runs functionally correct, but the problem is the speed.
  • How can I detect the FPU and enable it if this is the problem?
Appreciate the feedback,
Thanks,
nielsd
Member
Posts: 31
Joined: Sun Apr 05, 2015 3:15 pm

Re: Slow Floating Point Performance in VirtualBox

Post by nielsd »

How are you measuring the performance?
Have you checked what code the compiler is generating? That may give a hint about the performance of the code.

On an i386, the FPU was optional, but on the i486 or higher it is built into the CPU. So unless you really want to support the 386, you can assume it is available.
If you want to detect the FPU anyway, check out the FPU wiki page. It has a section on how to detect the FPU, and it also explains the bits in the control registers needed to enable it.
Also, if you are in long mode, you can at least assume SSE2 support, because it is required by the specification.
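
For reference, the enable sequence the wiki describes looks roughly like this (a minimal sketch for long mode, assuming CPUID has already confirmed FPU and SSE support; the helper name is made up, and the bit positions are from the Intel SDM):

Code: Select all

#include <stdint.h>

/* Clear CR0.EM (no emulation), set CR0.MP and CR0.NE, then enable SSE
 * through CR4.OSFXSR and CR4.OSXMMEXCPT. */
static inline void enable_fpu_sse(void)
{
    uint64_t cr0, cr4;

    __asm__ volatile ("mov %%cr0, %0" : "=r"(cr0));
    cr0 &= ~(1UL << 2);            /* CR0.EM = 0: FPU present, no emulation  */
    cr0 |=  (1UL << 1);            /* CR0.MP = 1: monitor coprocessor        */
    cr0 |=  (1UL << 5);            /* CR0.NE = 1: native x87 error reporting */
    __asm__ volatile ("mov %0, %%cr0" :: "r"(cr0));

    __asm__ volatile ("mov %%cr4, %0" : "=r"(cr4));
    cr4 |= (1UL << 9);             /* CR4.OSFXSR: FXSAVE/FXRSTOR + SSE       */
    cr4 |= (1UL << 10);            /* CR4.OSXMMEXCPT: SSE exceptions via #XM */
    __asm__ volatile ("mov %0, %%cr4" :: "r"(cr4));

    __asm__ volatile ("fninit");   /* put the x87 FPU into a known state */
}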
osdev project, goal is to run wasm as userspace: https://github.com/kwast-os/kwast
kemosparc
Member
Posts: 207
Joined: Tue Oct 29, 2013 1:13 pm

Re: Slow Floating Point Performance in VirtualBox

Post by kemosparc »

Hi,

Thanks for the reply.

I made sure to compile both versions of the program with the exact same switches and the same gcc version, 8.3.0: -O3 -msse4.2

The only difference is that for my kernel I use a cross compiler, and for the Linux program I use the compiler that comes with it. But I made sure that my cross compiler is of the same version.

Any other ideas? Might it be a problem with VirtualBox?

Thanks,
Karim.
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm

Re: Slow Floating Point Performance in VirtualBox

Post by Korona »

What are the exact switches that you use for your OS? Is this in 64-bit long mode? Did you look at the generated code? Does the compiler generate an AVX target clone that it cannot use on your OS?

Note that it's also quite surprising that integer code is faster on your OS (that's even more surprising than the double code being slower!). It likely indicates a problem with your benchmarking method.
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
Octocontrabass
Member
Posts: 5575
Joined: Mon Mar 25, 2013 7:01 pm

Re: Slow Floating Point Performance in VirtualBox

Post by Octocontrabass »

kemosparc wrote: If the FPU is not enabled, does this disable the execution of floating point operations completely, or does it run them at a lower speed? I mean, my code still runs functionally correct, but the problem is the speed.
If you haven't enabled SSE instructions, trying to use SSE instructions will cause exceptions. Since your code runs without causing exceptions, you have enabled SSE instructions.

For SSE instructions, check MXCSR. By default, its value should be 0x1F80 to mask all exceptions. Your code may run faster if you set the FTZ and DAZ bits. Compare against Linux and MacOS to see if they're doing anything different.

For x87 instructions (when your code uses long double), make sure CR0.NE is set, and make sure FCW is 0x37F to mask all exceptions. Your code may run faster if you change the precision control in FCW, but it may also cause problems if your code isn't expecting the reduced precision.

Unfortunately, it's possible that the lower speed is caused by the overhead from running your OS in VirtualBox. Try doing the benchmark again, but running your OS directly on bare metal.
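
For reference, a minimal sketch of setting both registers as described above (assuming SSE is already enabled; the helper name is made up; FTZ is MXCSR bit 15 and DAZ is bit 6):

Code: Select all

#include <stdint.h>

/* Set FTZ and DAZ in MXCSR so denormals are flushed to zero, and load an
 * x87 control word with all exceptions masked. */
static inline void set_fp_control(void)
{
    uint32_t mxcsr;
    __asm__ volatile ("stmxcsr %0" : "=m"(mxcsr));
    mxcsr |= (1u << 15) | (1u << 6);     /* FTZ | DAZ */
    __asm__ volatile ("ldmxcsr %0" :: "m"(mxcsr));

    uint16_t fcw = 0x037F;               /* all x87 exceptions masked,
                                            extended precision */
    __asm__ volatile ("fldcw %0" :: "m"(fcw));
}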
kemosparc
Member
Posts: 207
Joined: Tue Oct 29, 2013 1:13 pm

Re: Slow Floating Point Performance in VirtualBox

Post by kemosparc »

Thanks for the reply,

I have set CR0.NE, and set DAZ and FTZ in MXCSR. It did not really improve things.

Can you please point me to some documentation or examples for setting the FCW?

Thanks a lot,
Karim.
kemosparc
Member
Posts: 207
Joined: Tue Oct 29, 2013 1:13 pm

Re: Slow Floating Point Performance in VirtualBox

Post by kemosparc »

Replying to Korona:

Everything is expected to be faster on my kernel, and that is my problem now with FP.

The whole idea is that I run those programs on dedicated cores with no timer enabled on them. I do have one core with the APIC local timer enabled to manage the others.

Having the timer off always runs programs 30-40% faster.

Thanks,
Karim.
kemosparc
Member
Posts: 207
Joined: Tue Oct 29, 2013 1:13 pm

Re: Slow Floating Point Performance in VirtualBox

Post by kemosparc »

Hi,

I have done some more precise experiments to isolate the problem further, basically running the same code on double and long types:

Code: Select all

        double * d  = (double *) calloc (1024*1024*16,sizeof(double));
        struct timeval st, et;
        gettimeofday(&st,NULL);
        for ( int j = 1 ; j < 1024*16 ; j ++)
                for ( int i = 0 ; i  < 1024*1024*16 ; i ++)
                        d[i] = d[i] * j + i;
        gettimeofday(&et,NULL);
        printf ("%lu microsec\n", ((et.tv_sec - st.tv_sec) * 1000000) + (et.tv_usec - st.tv_usec));
        free ( d );
        return 0;

Code: Select all

        long * d  = (long *) calloc (1024*1024*16,sizeof(long));
        struct timeval st, et;
        gettimeofday(&st,NULL);
        for ( int j = 1 ; j < 1024*16 ; j ++)
                for ( int i = 0 ; i  < 1024*1024*16 ; i ++)
                        d[i] = d[i] * j + i;
        gettimeofday(&et,NULL);
        printf ("%lu microsec\n", ((et.tv_sec - st.tv_sec) * 1000000) + (et.tv_usec - st.tv_usec));
        free ( d );

The "double" code takes on both my kernel and Linux 140 seconds, but the "long" code takes 140 seconds on my kernel compared to 204 seconds on Linux.

I used objdump to dump the assembly generated by the compiler for both. The "double" assembly code is pretty much the same, with minor variations. When I looked at the "long" assembly, it is fairly different. The most important difference is that my kernel version does not use SSE registers/operations but the Linux version does. Here is a sample of the dump.


Linux:

Code: Select all

0000000000001080 <main>:
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/time.h>
int main ()
{
    1080:	53                   	push   %rbx
    	long * d  = (long *) calloc (1024*1024*16,sizeof(long));
    1081:	be 08 00 00 00       	mov    $0x8,%esi
    1086:	bf 00 00 00 01       	mov    $0x1000000,%edi
{
    108b:	48 83 ec 30          	sub    $0x30,%rsp
    	long * d  = (long *) calloc (1024*1024*16,sizeof(long));
    108f:	e8 ac ff ff ff       	callq  1040 <calloc@plt>
	struct timeval st, et;
	gettimeofday(&st,NULL);
    1094:	31 f6                	xor    %esi,%esi
    1096:	48 8d 7c 24 10       	lea    0x10(%rsp),%rdi
    	long * d  = (long *) calloc (1024*1024*16,sizeof(long));
    109b:	48 89 c3             	mov    %rax,%rbx
	gettimeofday(&st,NULL);
    109e:	e8 bd ff ff ff       	callq  1060 <gettimeofday@plt>
    10a3:	b9 ff 1f 00 00       	mov    $0x1fff,%ecx
    10a8:	66 44 0f 6f 15 6f 0f 	movdqa 0xf6f(%rip),%xmm10        # 2020 <_IO_stdin_used+0x20>
    10af:	00 00 
    10b1:	66 0f 6f 3d 77 0f 00 	movdqa 0xf77(%rip),%xmm7        # 2030 <_IO_stdin_used+0x30>
    10b8:	00 
    	for ( int j = 1 ; j < 1024*16 ; j ++)
    10b9:	ba 01 00 00 00       	mov    $0x1,%edx
    10be:	48 8d b3 00 00 00 08 	lea    0x8000000(%rbx),%rsi
    10c5:	48 63 c2             	movslq %edx,%rax
    10c8:	8d 7a 01             	lea    0x1(%rdx),%edi
    10cb:	66 41 0f 6f da       	movdqa %xmm10,%xmm3
    10d0:	49 89 d8             	mov    %rbx,%r8
    10d3:	48 89 44 24 08       	mov    %rax,0x8(%rsp)
    10d8:	f2 0f 12 54 24 08    	movddup 0x8(%rsp),%xmm2
    10de:	66 0f 6f f2          	movdqa %xmm2,%xmm6
    10e2:	48 89 d8             	mov    %rbx,%rax
    10e5:	89 7c 24 08          	mov    %edi,0x8(%rsp)
    10e9:	66 0f 6e 4c 24 08    	movd   0x8(%rsp),%xmm1
    10ef:	66 0f 73 d6 20       	psrlq  $0x20,%xmm6
    10f4:	66 0f 70 c9 00       	pshufd $0x0,%xmm1,%xmm1
		for ( int i = 0 ; i  < 1024*1024*16 ; i ++)
       	     		d[i] = d[i] * j + i;
    10f9:	66 0f 6f e1          	movdqa %xmm1,%xmm4
    10fd:	66 0f 38 25 c9       	pmovsxdq %xmm1,%xmm1
    1102:	66 0f 73 dc 08       	psrldq $0x8,%xmm4
    1107:	66 44 0f 6f c1       	movdqa %xmm1,%xmm8
    110c:	66 0f 38 25 e4       	pmovsxdq %xmm4,%xmm4
    1111:	66 41 0f 73 d0 20    	psrlq  $0x20,%xmm8
    1117:	66 44 0f 6f cc       	movdqa %xmm4,%xmm9
    111c:	66 41 0f 73 d1 20    	psrlq  $0x20,%xmm9
    1122:	66 44 0f 6f e3       	movdqa %xmm3,%xmm12
    1127:	66 0f 38 25 eb       	pmovsxdq %xmm3,%xmm5
    112c:	48 83 c0 20          	add    $0x20,%rax
    1130:	66 41 0f 73 dc 08    	psrldq $0x8,%xmm12
    1136:	66 0f fe df          	paddd  %xmm7,%xmm3
    113a:	66 45 0f 38 25 ec    	pmovsxdq %xmm12,%xmm13
    1140:	f3 44 0f 6f 60 f0    	movdqu -0x10(%rax),%xmm12
    1146:	66 45 0f 6f dc       	movdqa %xmm12,%xmm11
    114b:	66 41 0f 6f c4       	movdqa %xmm12,%xmm0
    1150:	66 41 0f 73 d3 20    	psrlq  $0x20,%xmm11
    1156:	66 44 0f f4 e6       	pmuludq %xmm6,%xmm12
    115b:	66 44 0f f4 da       	pmuludq %xmm2,%xmm11
    1160:	66 0f f4 c2          	pmuludq %xmm2,%xmm0
    1164:	66 45 0f d4 dc       	paddq  %xmm12,%xmm11
    1169:	66 41 0f 73 f3 20    	psllq  $0x20,%xmm11
    116f:	66 41 0f d4 c3       	paddq  %xmm11,%xmm0
    1174:	66 41 0f d4 c5       	paddq  %xmm13,%xmm0
    1179:	66 44 0f 6f d8       	movdqa %xmm0,%xmm11
    117e:	66 44 0f 6f e0       	movdqa %xmm0,%xmm12
    1183:	66 41 0f 73 d3 20    	psrlq  $0x20,%xmm11
    1189:	66 41 0f f4 c1       	pmuludq %xmm9,%xmm0
    118e:	66 44 0f f4 dc       	pmuludq %xmm4,%xmm11
    1193:	66 44 0f f4 e4       	pmuludq %xmm4,%xmm12
    1198:	66 44 0f d4 d8       	paddq  %xmm0,%xmm11
    119d:	66 41 0f 73 f3 20    	psllq  $0x20,%xmm11
    11a3:	66 45 0f d4 e3       	paddq  %xmm11,%xmm12
    11a8:	66 45 0f d4 e5       	paddq  %xmm13,%xmm12
    11ad:	f3 44 0f 6f 68 e0    	movdqu -0x20(%rax),%xmm13
    11b3:	44 0f 11 60 f0       	movups %xmm12,-0x10(%rax)
    11b8:	66 45 0f 6f dd       	movdqa %xmm13,%xmm11
    11bd:	66 41 0f 6f c5       	movdqa %xmm13,%xmm0
    11c2:	66 41 0f 73 d3 20    	psrlq  $0x20,%xmm11
    11c8:	66 44 0f f4 ee       	pmuludq %xmm6,%xmm13
    11cd:	66 44 0f f4 da       	pmuludq %xmm2,%xmm11
    11d2:	66 0f f4 c2          	pmuludq %xmm2,%xmm0
    11d6:	66 45 0f d4 dd       	paddq  %xmm13,%xmm11
    11db:	66 41 0f 73 f3 20    	psllq  $0x20,%xmm11
    11e1:	66 41 0f d4 c3       	paddq  %xmm11,%xmm0
    11e6:	66 0f d4 c5          	paddq  %xmm5,%xmm0
    11ea:	66 44 0f 6f d8       	movdqa %xmm0,%xmm11
    11ef:	66 44 0f 6f e8       	movdqa %xmm0,%xmm13
    11f4:	66 41 0f 73 d3 20    	psrlq  $0x20,%xmm11
    11fa:	66 41 0f f4 c0       	pmuludq %xmm8,%xmm0
    11ff:	66 44 0f f4 d9       	pmuludq %xmm1,%xmm11
    1204:	66 44 0f f4 e9       	pmuludq %xmm1,%xmm13
    1209:	66 44 0f d4 d8       	paddq  %xmm0,%xmm11
    120e:	66 41 0f 73 f3 20    	psllq  $0x20,%xmm11
    1214:	66 45 0f d4 dd       	paddq  %xmm13,%xmm11
    1219:	66 41 0f d4 eb       	paddq  %xmm11,%xmm5
    121e:	0f 11 68 e0          	movups %xmm5,-0x20(%rax)
    1222:	48 39 f0             	cmp    %rsi,%rax
    1225:	0f 85 f7 fe ff ff    	jne    1122 <main+0xa2>
    	for ( int j = 1 ; j < 1024*16 ; j ++)
    122b:	83 c2 02             	add    $0x2,%edx
    122e:	83 e9 01             	sub    $0x1,%ecx
    1231:	0f 85 8e fe ff ff    	jne    10c5 <main+0x45>
    1237:	41 ba ff 3f 00 00    	mov    $0x3fff,%r10d
    123d:	31 f6                	xor    %esi,%esi
    123f:	4c 63 ca             	movslq %edx,%r9
    1242:	41 29 d2             	sub    %edx,%r10d
    1245:	49 8d 51 01          	lea    0x1(%r9),%rdx
    1249:	49 8b 00             	mov    (%r8),%rax
    124c:	4c 89 c9             	mov    %r9,%rcx
    124f:	4a 8d 3c 12          	lea    (%rdx,%r10,1),%rdi
    1253:	eb 07                	jmp    125c <main+0x1dc>
    1255:	0f 1f 00             	nopl   (%rax)
    1258:	48 83 c2 01          	add    $0x1,%rdx
       	     		d[i] = d[i] * j + i;
    125c:	48 0f af c1          	imul   %rcx,%rax
    1260:	48 89 d1             	mov    %rdx,%rcx
    1263:	48 01 f0             	add    %rsi,%rax
		for ( int i = 0 ; i  < 1024*1024*16 ; i ++)
    1266:	48 39 fa             	cmp    %rdi,%rdx
    1269:	75 ed                	jne    1258 <main+0x1d8>
    126b:	48 83 c6 01          	add    $0x1,%rsi
    126f:	49 89 00             	mov    %rax,(%r8)
    1272:	49 83 c0 08          	add    $0x8,%r8
    	for ( int j = 1 ; j < 1024*16 ; j ++)
    1276:	48 81 fe 00 00 00 01 	cmp    $0x1000000,%rsi
    127d:	75 c6                	jne    1245 <main+0x1c5>
	gettimeofday(&et,NULL);
    127f:	48 8d 7c 24 20       	lea    0x20(%rsp),%rdi
    1284:	31 f6                	xor    %esi,%esi
    1286:	e8 d5 fd ff ff       	callq  1060 <gettimeofday@plt>
	printf ("%lu microsec\n", ((et.tv_sec - st.tv_sec) * 1000000) + (et.tv_usec - st.tv_usec));
    128b:	48 8b 74 24 20       	mov    0x20(%rsp),%rsi
    1290:	48 2b 74 24 10       	sub    0x10(%rsp),%rsi
    1295:	31 c0                	xor    %eax,%eax
    1297:	48 69 f6 40 42 0f 00 	imul   $0xf4240,%rsi,%rsi
    129e:	48 8d 3d 5f 0d 00 00 	lea    0xd5f(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
    12a5:	48 03 74 24 28       	add    0x28(%rsp),%rsi
    12aa:	48 2b 74 24 18       	sub    0x18(%rsp),%rsi
    12af:	e8 7c fd ff ff       	callq  1030 <printf@plt>
    	free ( d );
    12b4:	48 89 df             	mov    %rbx,%rdi
    12b7:	e8 94 fd ff ff       	callq  1050 <free@plt>
	return 0;
}
    12bc:	48 83 c4 30          	add    $0x30,%rsp
    12c0:	31 c0                	xor    %eax,%eax
    12c2:	5b                   	pop    %rbx
    12c3:	c3                   	retq   
    12c4:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
    12cb:	00 00 00 
    12ce:	66 90                	xchg   %ax,%ax
My Kernel:

Code: Select all

void test_long()
{
  246970:	41 54                	push   %r12
    long * d  = (long *) calloc (1024*1024*16,sizeof(long));
  246972:	be 08 00 00 00       	mov    $0x8,%esi
  246977:	bf 00 00 00 01       	mov    $0x1000000,%edi
{
  24697c:	55                   	push   %rbp
  24697d:	48 83 ec 08          	sub    $0x8,%rsp
    long * d  = (long *) calloc (1024*1024*16,sizeof(long));
  246981:	e8 aa ae fe ff       	callq  231830 <calloc>
  246986:	48 89 c5             	mov    %rax,%rbp
    clock_t clock0 = clock();
  246989:	e8 f2 23 fe ff       	callq  228d80 <clock>
  24698e:	b9 02 00 00 00       	mov    $0x2,%ecx
  246993:	49 89 c4             	mov    %rax,%r12
    for ( int j = 1 ; j < 1024*16 ; j ++)
        for ( int i = 0 ; i  < 1024*1024*16 ; i ++)
  246996:	48 8d 71 ff          	lea    -0x1(%rcx),%rsi
    clock_t clock0 = clock();
  24699a:	31 d2                	xor    %edx,%edx
  24699c:	89 cf                	mov    %ecx,%edi
  24699e:	66 90                	xchg   %ax,%ax
            d[i] = d[i] * j + i;
  2469a0:	48 8b 44 d5 00       	mov    0x0(%rbp,%rdx,8),%rax
  2469a5:	48 0f af c6          	imul   %rsi,%rax
  2469a9:	48 01 d0             	add    %rdx,%rax
  2469ac:	48 0f af c1          	imul   %rcx,%rax
  2469b0:	48 01 d0             	add    %rdx,%rax
  2469b3:	48 89 44 d5 00       	mov    %rax,0x0(%rbp,%rdx,8)
        for ( int i = 0 ; i  < 1024*1024*16 ; i ++)
  2469b8:	48 83 c2 01          	add    $0x1,%rdx
  2469bc:	48 81 fa 00 00 00 01 	cmp    $0x1000000,%rdx
  2469c3:	75 db                	jne    2469a0 <_Z9test_longv+0x30>
    for ( int j = 1 ; j < 1024*16 ; j ++)
  2469c5:	48 83 c1 02          	add    $0x2,%rcx
  2469c9:	8d 47 01             	lea    0x1(%rdi),%eax
  2469cc:	48 81 f9 00 40 00 00 	cmp    $0x4000,%rcx
  2469d3:	75 c1                	jne    246996 <_Z9test_longv+0x26>
  2469d5:	41 ba ff 3f 00 00    	mov    $0x3fff,%r10d
  2469db:	49 89 e8             	mov    %rbp,%r8
        for ( int i = 0 ; i  < 1024*1024*16 ; i ++)
  2469de:	31 f6                	xor    %esi,%esi
  2469e0:	4c 63 c8             	movslq %eax,%r9
  2469e3:	41 29 c2             	sub    %eax,%r10d
  2469e6:	49 8d 51 01          	lea    0x1(%r9),%rdx
  2469ea:	49 8b 00             	mov    (%r8),%rax
  2469ed:	4c 89 c9             	mov    %r9,%rcx
  2469f0:	4a 8d 3c 12          	lea    (%rdx,%r10,1),%rdi
  2469f4:	eb 0e                	jmp    246a04 <_Z9test_longv+0x94>
  2469f6:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  2469fd:	00 00 00 
  246a00:	48 83 c2 01          	add    $0x1,%rdx
            d[i] = d[i] * j + i;
  246a04:	48 0f af c1          	imul   %rcx,%rax
  246a08:	48 89 d1             	mov    %rdx,%rcx
  246a0b:	48 01 f0             	add    %rsi,%rax
        for ( int i = 0 ; i  < 1024*1024*16 ; i ++)
  246a0e:	48 39 d7             	cmp    %rdx,%rdi
  246a11:	75 ed                	jne    246a00 <_Z9test_longv+0x90>
  246a13:	48 83 c6 01          	add    $0x1,%rsi
  246a17:	49 89 00             	mov    %rax,(%r8)
  246a1a:	49 83 c0 08          	add    $0x8,%r8
    for ( int j = 1 ; j < 1024*16 ; j ++)
  246a1e:	48 81 fe 00 00 00 01 	cmp    $0x1000000,%rsi
  246a25:	75 bf                	jne    2469e6 <_Z9test_longv+0x76>
    clock_t clock1 = clock();
  246a27:	e8 54 23 fe ff       	callq  228d80 <clock>
    double t = getClockDiff(clock0,clock1);
  246a2c:	4c 89 e7             	mov    %r12,%rdi
    clock_t clock1 = clock();
  246a2f:	48 89 c6             	mov    %rax,%rsi
    double t = getClockDiff(clock0,clock1);
  246a32:	e8 59 23 fe ff       	callq  228d90 <getClockDiff>
    printf ("Time taken: %f\n",t);
  246a37:	bf 3d 3e 25 00       	mov    $0x253e3d,%edi
  246a3c:	b8 01 00 00 00       	mov    $0x1,%eax
  246a41:	e8 ba 3f fe ff       	callq  22aa00 <printf>
    free ( d );
}
My feeling is that VirtualBox does not execute SSE instructions fast in my case. I then created a virtual machine on VirtualBox with Linux on it and tried the two programs above on the same bare metal underneath, and I got the same results! I have used the same compilation switches with all programs on both fronts.

Maybe there is something I need to set in my kernel to make it faster; I mean, there might be some hardware setting for the processor that I am missing for the FP? I did detect the SSE and AVX features of my cores and set them up for all cores: I enabled the AVX and SSE bits in CR4 and the Extended Control Register (XCR0).
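
For reference, the setup described above would look roughly like this (a sketch with a made-up helper name, assuming CPUID has already confirmed XSAVE and AVX support; XCR0 is written with the XSETBV instruction):

Code: Select all

#include <stdint.h>

/* Enable XSAVE in CR4, then enable x87/SSE/AVX state in XCR0 via XSETBV. */
static inline void enable_avx(void)
{
    uint64_t cr4;
    __asm__ volatile ("mov %%cr4, %0" : "=r"(cr4));
    cr4 |= (1UL << 18);                  /* CR4.OSXSAVE: allow XGETBV/XSETBV */
    __asm__ volatile ("mov %0, %%cr4" :: "r"(cr4));

    uint64_t xcr0 = (1UL << 0)           /* x87 state              */
                  | (1UL << 1)           /* SSE state (XMM, MXCSR) */
                  | (1UL << 2);          /* AVX state (upper YMM)  */
    __asm__ volatile ("xsetbv" :: "c"(0), "a"((uint32_t)xcr0),
                                  "d"((uint32_t)(xcr0 >> 32)));
}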

Please let me know if there is anything else I should look at.

Thanks,
Karim.
Octocontrabass
Member
Posts: 5575
Joined: Mon Mar 25, 2013 7:01 pm

Re: Slow Floating Point Performance in VirtualBox

Post by Octocontrabass »

kemosparc wrote: I did detect the SSE and AVX features of my cores and set them up for all cores: I enabled the AVX and SSE bits in CR4 and the Extended Control Register (XCR0).
Intel CPUs will throttle when the AVX registers are dirty. Did you use the AVX registers and forget to clear them afterwards with XRSTOR/XRSTORS/VZEROUPPER/VZEROALL?
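
For example, a minimal sketch of the cleanup (made-up helper name), to be called after any kernel routine that has touched the YMM registers:

Code: Select all

/* Mark the upper YMM halves clean so the CPU can leave the AVX power state. */
static inline void avx_cleanup(void)
{
    __asm__ volatile ("vzeroupper");
}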
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm

Re: Slow Floating Point Performance in VirtualBox

Post by Korona »

The timer IRQ has negligible overhead. If you see a 30% slowdown with the timer IRQ enabled, there is something wrong with your setup. Note that (modern) Linux also disables the timer IRQ when only a single program is active on a core. I'd rather suspect something else to be responsible for your differences, e.g., CPU frequency scaling. Which CPU governor do you use on Linux?
kemosparc
Member
Posts: 207
Joined: Tue Oct 29, 2013 1:13 pm

Re: Slow Floating Point Performance in VirtualBox

Post by kemosparc »

Hi,

Thanks for the reply,
Intel CPUs will throttle when the AVX registers are dirty. Did you use the AVX registers and forget to clear them afterwards with XRSTOR/XRSTORS/VZEROUPPER/VZEROALL?
Octocontrabass: Definitely I did not do that. This might be the problem! How can I do that? Can you point me to some documentation? Does the compiler do that when generating code that uses AVX?

Another thing I have discovered regarding my earlier post: when my "long" program gets compiled on Linux, it uses xmm registers. I tried to disable that using -mno-mmx -mno-sse and the compilation failed. It appears that the Linux standard library is compiled using SSE, and hence if I disable them in my program compilation, linking fails! Is there any way to find the compilation switches that were used to compile the C standard library?
The timer IRQ has negligible overhead. If you see a 30% slowdown with the timer IRQ enabled, there is something wrong with your setup. Note that (modern) Linux also disables the timer IRQ when only a single program is active on a core. I'd rather suspect something else to be responsible for your differences, e.g., CPU frequency scaling. Which CPU governor do you use on Linux?
Korona: it is not my setup, it is the Linux setup :) The same programs run faster in my VM on cores with the APIC local timer disabled.

Also, it is not accurately true that the Linux kernel disables the timer IRQ when only a single program is active on a core: the timer still needs to fire at least once per second to handle some soft_irqs. This benefits power saving more than performance, since address space switching still occurs, unlike in my case; the timer itself is not very significant, the address space switching is the pain. Moreover, with multi-threaded applications that map to multiple kernel threads, it is very difficult to make sure this situation can be established. You also need a lot of administration to manually pin specific processes to CPUs using taskset to set CPU affinity, or else redesign the whole scheduler. Under high workloads I do not think it is practical.

Surprisingly, in many cases the timer has significant overhead, especially with tasks that are 100% CPU bound with no interleaved processing and I/O. For example, consider a word count program that loads a couple of GB of file into memory in one shot at the beginning and then processes it by counting the spaces in a tight loop over the buffer; I was able to achieve a 2.5x speedup over Linux!

Anyway, I would be grateful if you could send me some resources regarding cleaning the AVX registers.

Thanks,
Karim.
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm

Re: Slow Floating Point Performance in VirtualBox

Post by Korona »

A 2.5x speedup seems absolutely out of proportion. In my experience, the differences among OSes for compute-bound workloads are usually within the noise and sometimes in the low single-digit percent range. I could maybe believe a 10% difference due to different synchronization mechanisms and the timer IRQ. Check that you're really doing an apples-to-apples comparison:
- Make sure that you run absolutely the same code on both OSes. Unless you are very careful to enable all of GCC's features for your OS, it might yield different code. It's probably best to just compare pre-compiled assembly on both OSes (i.e., compile the computational kernel to assembly with one compiler, then link it twice with main() and the support code).
- Make sure to control for differences due to static/dynamic linking, especially if you call into libraries.
- Are you using the same timing methods on both OSes? Are you comparing TSCs or counts of retired instructions? Does your CPU have invariant TSC?
- Make sure to use the maximum performance CPU governor.
- Control for I/O. If Linux spends a lot of time in softirqs, maybe there is a lot of RCU work to do? Maybe your server is connected to a high speed ethernet or infiniband interconnect and spends considerable time in work queues for those devices? Run perf to exclude these factors.
- Are you using CPU binding?
- Are you using NUMA memory binding?
- Do you control huge page allocation via madvise()?
- Do you allocate memory eagerly? Do you pin memory?
- Are you using the same synchronization primitives on all OSes?
- Do you ensure that background processes are not scheduled too often? Use perf to measure the number of involuntary context switches.
- Is your input large enough? E.g., a simple parallel word count on 16 GiB of RAM might finish in milliseconds. For such a simple task, consider using larger inputs that take seconds to hours.
- Do you control for variance? What's the standard deviation of your timings relative to the mean?
kemosparc
Member
Posts: 207
Joined: Tue Oct 29, 2013 1:13 pm

Re: Slow Floating Point Performance in VirtualBox

Post by kemosparc »

Dear Korona,

Thanks a lot for the thorough message; I will definitely look at it after I get the problem at hand fixed. Yes, the amount of data I process is pretty large, up to 32 GB, and it takes about 20 seconds on my Intel Core i7 on Linux, and 7 seconds on my kernel on the same hardware, producing the same count. Yes, I used the APIC timer coupled with the RTC to compute time, and I used a wall clock at the same time.

The only thing is that the 2.5x speedup was achieved on applications that read from memory. When I benchmarked quicksort, for example, I was not able to achieve more than a 1.6x speedup, because a lot of memory swapping happens. To be more accurate and general at the same time: the maximum speedup was achieved when the whole mode of operation of the application is reading memory in spatial mode. Definitely synchronization is not the same, and I use a SLAB-like implementation for my memory manager. There are a lot of differences between my kernel and Linux, I agree, and the speedup might be attributed to other factors, but as soon as I turn on the timer on the core I run the tests on, I get results close to or slower than Linux. Regarding the word count, it is just an 8-line program that callocs memory, reads the whole file from disk in one go, and loops over the buffer; I exclude file I/O and memory allocation time from my measurements.

Again, I appreciate all the points you raised, but the important thing for me now is to solve the slow floating point, as it is not even as fast as Linux :) which really stresses me out :) Thanks again. Please, if you can provide any help regarding the floating point issue, it will be much appreciated.

Dear Octocontrabass and Korona,

The current status is that I found that the compilation switches were not the same as what I am using for my kernel compilation. So I used the following command to inquire about the native compilation options:

Code: Select all

g++ -march=native --help=target -Q | grep enabled
I then took all of the enabled switches and added them to my makefile's compilation options. I got a huge speedup now! Not what I am after yet, but it was a huge speedup.

The thing is, when I increase the loop iteration count to consume more time, I see huge throttling in my kernel's case, unlike the Linux case, to the extent that I hear the fans spinning hard until the program finishes.

Do I understand correctly that if I inject XRSTOR/XRSTORS/VZEROUPPER/VZEROALL into my code, things might get faster? Also, what other things should I look for regarding throttling? In my case it seems abnormal!

Thanks,
Karim.
Octocontrabass
Member
Posts: 5575
Joined: Mon Mar 25, 2013 7:01 pm

Re: Slow Floating Point Performance in VirtualBox

Post by Octocontrabass »

kemosparc wrote: Octocontrabass: Definitely I did not do that. This might be the problem! How can I do that? Can you point me to some documentation? Does the compiler do that when generating code that uses AVX?
If you're using XRSTOR or XRSTORS to initialize the CPU state, make sure the appropriate bits of XSTATE_BV are clear to tell the processor to reset the AVX state to its initial configuration. You can use VZEROUPPER or VZEROALL at any time to clear the AVX state.

GCC and Clang will automatically emit VZEROUPPER instructions when generating AVX-optimized code, unless your target CPU is Xeon Phi (which doesn't have a penalty for using AVX, but does have a penalty for using VZEROUPPER).
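
A minimal sketch of that reset (made-up helper name; assumes XSAVE is enabled in CR4/XCR0, and a 64-byte-aligned, zeroed save area so XSTATE_BV = 0 and every component in the requested mask returns to its init state; MXCSR at offset 24 is pre-set to its default 0x1F80 because the standard form of XRSTOR loads it whenever the SSE or AVX bit is in the requested mask):

Code: Select all

#include <stdint.h>
#include <string.h>

static uint8_t xsave_area[4096] __attribute__((aligned(64)));

/* Reset x87/SSE/AVX to their initial state via XRSTOR with XSTATE_BV = 0. */
static void reset_extended_state(void)
{
    memset(xsave_area, 0, sizeof xsave_area);             /* XSTATE_BV = 0 */
    *(uint32_t *)(xsave_area + 24) = 0x1F80;              /* default MXCSR */

    uint64_t rfbm = (1UL << 0) | (1UL << 1) | (1UL << 2); /* x87|SSE|AVX */
    __asm__ volatile ("xrstor (%0)"
                      :: "r"(xsave_area),
                         "a"((uint32_t)rfbm), "d"((uint32_t)(rfbm >> 32))
                      : "memory");
}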
kemosparc
Member
Posts: 207
Joined: Tue Oct 29, 2013 1:13 pm

Re: Slow Floating Point Performance in VirtualBox

Post by kemosparc »

Thanks for the quick reply:)

So the XRSTOR or XRSTORS needs to be done only once at boot time, right?

Also, can you please give me your feedback on this:
Do I understand correctly that if I inject XRSTOR/XRSTORS/VZEROUPPER/VZEROALL into my code, things might get faster? Also, what other things should I look for regarding throttling? In my case it seems abnormal!
Thanks,
Karim.