
Re: Best way to implement MMU on x86_64 Higher Half kernel?

Posted: Tue May 10, 2022 5:00 pm
by Octocontrabass
rdos wrote:If you introduce a 16-bit sine table in the code and use integer parameters, then the compiler can produce code of similar speed, but this requires some thought about which algorithms to use. That is something the C compiler cannot fix for you, even if you present it with the same interface.
No one is saying that a compiler can magically come up with a better algorithm. What we are saying is that the compiler does a better job of turning algorithms into machine code.

I suggest you benchmark a C compiler against your hand-written assembly. I wouldn't be surprised if it's faster even without vectorization.

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Posted: Tue May 10, 2022 5:11 pm
by Ethin
kzinti wrote:
Ethin wrote:Why? You're not going to be using this in kernel mode, only in user mode. The normal sin() should be fine. There's no reason whatsoever to write your own math library from scratch or something.
Having worked in video games for almost 20 years and having had to avoid the C library's sin(), cos() and other trigonometric functions on every project, I have to disagree with you.

sin() is not fine where performance and exact behaviour matter. sin() is slow, especially on x86. C's floating-point model doesn't map perfectly to the x86 FPU, and the implementation has to jump through various hoops to work around that. Sure, you can disable some of this with compiler switches, but it is still slow and non-portable.

I remember a time when sin() would not produce the same results on different x86 processors because they had different versions of SSE. This would make our game go out of sync in network play because the computers were not producing the same simulation.
I thought that the hardware FSIN (for example) was less accurate and slower than the C version. Wouldn't that only be worse?

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Posted: Tue May 10, 2022 8:09 pm
by kzinti
Ethin wrote:I thought that the hardware FSIN (for example) was less accurate and slower than the C version. Wouldn't that only be worse?
I am not sure I follow your logic. What I have seen C implementations do is basically use fsin or the SSE equivalent depending on what is available on the processor. fsin's result can actually change depending on what is in the FPU state when you execute it (internally the FPU registers have more bits than what your C code is using). C library implementations typically have extra code to work around this, which means that calling sin() is more than just an fsin instruction. This can sometimes be disabled or tweaked with implementation-specific functions, but you lose determinism/portability when you start changing the behaviour of standard C functions.
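As an example of the kind of implementation-specific tweaking I mean (glibc-specific and just a sketch, other C libraries expose different knobs): you can force the x87 to round intermediates to 53-bit double precision instead of its default 80-bit extended mode.

Code: Select all

/* glibc-specific sketch (other C libraries have different knobs):
 * force the x87 to round intermediate results to 53-bit double
 * precision instead of the default 80-bit extended precision. */
#include <fpu_control.h>

static void use_double_precision_x87(void)
{
    fpu_control_t cw;
    _FPU_GETCW(cw);
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
    _FPU_SETCW(cw);
}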

What matters (for video games anyways) is:
1) performance
2) determinism

You can tune your implementation to whatever precision/accuracy you require (which might be different for each usage).

So in this context, we would go to extra lengths to ensure no one was using fsin (or SSE at the time, since it wasn't available everywhere). This basically meant no C library trigonometry (or sqrt). We had to write our own implementations, and that meant a mix of lookup tables and Newton iterations.
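To give an idea of the Newton part, here is a minimal illustration (not the actual game code): a rough table-based guess for 1/sqrt(x) gets refined with a Newton-Raphson step, and each step roughly doubles the number of correct bits.

Code: Select all

/* Minimal sketch, not production code: refine a rough initial guess
 * for 1/sqrt(x) with one Newton-Raphson step. For f(y) = 1/y^2 - x the
 * update is y' = y * (1.5 - 0.5 * x * y * y); apply it once or twice
 * depending on how accurate the initial (table-based) guess is. */
static inline float rsqrt_newton_step(float x, float guess)
{
    return guess * (1.5f - 0.5f * x * guess * guess);
}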

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Posted: Tue May 10, 2022 8:18 pm
by Ethin
kzinti wrote:
Ethin wrote:I thought that the hardware FSIN (for example) was less accurate and slower than the C version. Wouldn't that only be worse?
I am not sure I follow your logic. What I have seen C implementations do is basically use fsin or the SSE equivalent depending on what is available on the processor. fsin's result can actually change depending on what is in the FPU state when you execute it (internally the FPU registers have more bits than what your C code is using). Again, I understand you can configure your FPU precision/behaviour, but you are losing portability/determinism.

What matters (for video games anyways) is:
1) performance
2) determinism

You can tune your implementation to whatever precision/accuracy you require (which might be different for each usage).

So in this context, we would go to extra lengths to ensure no one was using fsin (or SSE at the time, since it wasn't available everywhere). This basically meant no C library trigonometry (or sqrt). We had to write our own implementations, and that meant a mix of lookup tables and Newton iterations.
As I said in a previous post, I've written a lot of math using cos(), sin(), etc., and I've never seen a compiler emit a single FPU instruction -- it's always a library call to libc's sin()/cos() function. If there's a way to get GCC/Clang to emit FPU instructions, I don't know what it is -- it's obviously not enabled by default, or the compiler chooses a more efficient method (SSE, SSE2, ...), though even in that case there are no SSE/AVX-based sin calculations; it's still a library call.

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Posted: Tue May 10, 2022 8:20 pm
by kzinti
Ethin wrote:As I said in a previous post, I've written a lot of math using cos(), sin(), etc., and I've never seen a compiler emit a single FPU instruction -- it's always a library call to libc's sin()/cos() function.
That is because standard C's sin() function doesn't map to the FPU fsin instruction or the SSE equivalent. You can't replace a call to sin() with a single fsin instruction. They don't do the same thing. I know it's not intuitive, but basically the C implementation has to do extra work to ensure the behaviour of the function matches what the C standard says it should do. This costs significant performance at a minimum. If you do a call once in a while, it doesn't matter. But if you need tons of sin() in a tight loop, it can be a problem.

And some of the C sin() implementations I have seen will use the FPU or a different version of SSE/AVX/etc. depending on what the processor supports.
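To illustrate one piece of that extra work (my own example, not any particular library's code): for large arguments, a conforming library has to reduce the argument modulo 2*pi very carefully, whereas a bare fsin loses precision because its internal pi constant only has about 66 bits.

Code: Select all

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* The mathematically correct value of sin(1e22) is about
     * -0.852200849767..., and getting it right requires reducing 1e22
     * modulo 2*pi with far more precision than fsin's internal
     * constant provides. That kind of argument reduction is part of
     * why a call to sin() is more than a single fsin instruction. */
    printf("%.17g\n", sin(1e22));
    return 0;
}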

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Posted: Tue May 10, 2022 9:57 pm
by Octocontrabass
Ethin wrote:As I said in a previous post, I've written a lot of math using cos(), sin(), etc., and I've never seen a compiler emit a single FPU instruction
GCC will do it if you enable optimizations that aren't compatible with the C standard. But even then it often chooses the library function for float and double, since many x87 instructions will waste time calculating a long double result.
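For example, something like this (just a sketch; the exact flags and behaviour depend on the GCC version and target, and the file/function names are placeholders):

Code: Select all

/* Sketch: compiling this with something like
 *     gcc -m32 -mfpmath=387 -ffast-math -O2 -S sine.c
 * may produce a bare fsin instruction instead of a call into libm,
 * precisely because -ffast-math waives the accuracy and corner-case
 * requirements of the C standard. Without those flags you get the
 * library call. */
#include <math.h>

double my_sine(double x)
{
    return sin(x);
}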

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Posted: Wed May 11, 2022 1:22 am
by Demindiro
So I decided to generate some random data and sin tables so I could at least benchmark something.

After struggling to get GCC to compile something that doesn't segfault (damn PIE code), I've come up with the following:

Code: Select all

#!/usr/bin/env python3

from random import randint

with open("/tmp/sin_table.c", "w") as f:
    f.write("short sin_table[0x40000] = {\n")
    for _ in range(0x4_0000):
        f.write(str(randint(-255, 255)) + '\n,')
    f.write("};")

with open("/tmp/data.c", "w") as f:
    f.write("short data[1 << 22] = {\n")
    for _ in range(1 << 22):
        f.write(str(randint(-255, 255)) + '\n,')
    f.write("};")

Code: Select all

#include <stdio.h>

extern short sin_table[0x40000];

extern short data[1 << 22];

static int CalcPower(short int *data, int size, int init_phase, int phase_per_sample, long long *power) {
   int phase = init_phase;
   long long sum = 0;

   for (int i = 0; i < size; i++) {
      int p = (phase >> 14) & 0x3ffff;
      sum += sin_table[p] * data[2 * i];
      phase += phase_per_sample;
   }

   *power = sum;
   return phase;
}

int main(int argc, char **argv) {

   volatile long long dont_optimize_me;
   for (int i = 0; i < 100; i++) {
      for (int k = 10; k < 20; k++) {
         long long power;
         CalcPower(data, sizeof(data) / sizeof(*data) / 2, i, k, &power); 
         dont_optimize_me = power;
      }
   }

   // Print one to verify correctness
   printf("%lld\n", dont_optimize_me);

   return 0;
}

Code: Select all

.intel_syntax noprefix
.globl _CalcFreqPowerA
.globl main

.section .text._CalcFreqPowerA
.p2align 4
#       PARAMETERS:     Data
#                       Size
#                       InitPhase
#                       PhasePerSample
#                       Power
#
#       RETURNS:        End phase

_CalcFreqPowerA:
    push ebx
    push ecx
    push edx
    push esi
    push edi
    push ebp
#
    mov esi,[esp+0x1C]
    mov ecx,[esp+0x20]
    mov ebp,[esp+0x24]
    mov edi,[esp+0x2C]
#
    xor eax,eax
    mov [edi + 0],eax
    mov [edi + 4],eax

cfpaLoop:
    mov ebx,ebp
    shr ebx,13
    and bl,0xFE
#
    mov ax, [ebx + sin_table]
    imul word ptr [esi]
    movsx edx,dx
    add  word ptr [edi + 0],ax
    adc dword ptr [edi + 2],edx
#
    add esi,4
    add ebp,[esp+0x28]
    loop cfpaLoop
#
    movsx eax,word ptr [edi + 4]
    mov [edi + 4],eax
#
    mov eax,ebp
#
    pop ebp
    pop edi
    pop esi
    pop edx
    pop ecx
    pop ebx
    ret 20

.section .text.main
.p2align 4
main:
    sub esp, 8 # reserve storage for power value

    # for (int i = 0; i < 100; i++)
    xor ecx, ecx

2:
    # for (int k = 10; k < 20; k++)
    mov edx, 10

3:
    # CalcPower
    push esp
    push edx
    push ecx
    push 1 << 21
    push [data_ptr]
    call _CalcFreqPowerA

    # for k
    inc edx
    cmp edx, 20
    jnz 3b

    # for i
    inc ecx
    cmp ecx, 100
    jnz 2b

    # printf
    push [format_str_lld_ptr]
    call printf
    add esp, 4 + 8

    # return 0
    xor eax, eax
    ret

data_ptr:
    .long data

format_str_lld_ptr:
    .long format_str_lld

.section .text.rodata
format_str_lld:
    .asciz "%lld\n"

Code: Select all

CARGS += -O3
CARGS += -Wall
#CARGS += -flto

all: sin2_64_c_native sin2_64_c sin2_32_c sin2_32_asm

sin2_64_c_native: sin2.c sin_table.c data_64.o
	$(CC) $(CARGS) -march=native $^ -o $@

sin2_64_c: sin2.c sin_table.c data_64.o
	$(CC) $(CARGS) $^ -o $@

sin2_32_c: sin2.c sin_table.c data_32.o
	$(CC) $(CARGS) -m32 $^ -o $@ -g3

sin2_32_asm: sin2.s sin_table.c data_32.o
	$(CC) $(CARGS) -no-pie -fno-pie -m32 $^ -o $@

data_64.o: data.c
	$(CC) $(CARGS) -c $^ -o $@

data_32.o: data.c
	$(CC) $(CARGS) -c -m32 $^ -o $@
Benchmarking with `perf stat` yields the following results:

Code: Select all

david@pc1:/tmp$ perf stat ./sin2_64_c_native
-50213557

 Performance counter stats for './sin2_64_c_native':

           1634.27 msec task-clock                #    1.000 CPUs utilized          
                 3      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               184      page-faults               #    0.113 K/sec                  
        6498854866      cycles                    #    3.977 GHz                      (83.36%)
           3982498      stalled-cycles-frontend   #    0.06% frontend cycles idle     (83.36%)
        3418439760      stalled-cycles-backend    #   52.60% backend cycles idle      (83.36%)
       23030851488      instructions              #    3.54  insn per cycle         
                                                  #    0.15  stalled cycles per insn  (83.36%)
        2090698244      branches                  # 1279.289 M/sec                    (83.36%)
             47624      branch-misses             #    0.00% of all branches          (83.22%)

       1.634828733 seconds time elapsed

       1.634865000 seconds user
       0.000000000 seconds sys


david@pc1:/tmp$ perf stat ./sin2_64_c
-50213557

 Performance counter stats for './sin2_64_c':

           1639.85 msec task-clock                #    1.000 CPUs utilized          
                 2      context-switches          #    0.001 K/sec                  
                 2      cpu-migrations            #    0.001 K/sec                  
               183      page-faults               #    0.112 K/sec                  
        6457718560      cycles                    #    3.938 GHz                      (83.17%)
           3505234      stalled-cycles-frontend   #    0.05% frontend cycles idle     (83.37%)
        3482426541      stalled-cycles-backend    #   53.93% backend cycles idle      (83.41%)
       23106438497      instructions              #    3.58  insn per cycle         
                                                  #    0.15  stalled cycles per insn  (83.41%)
        2098396719      branches                  # 1279.628 M/sec                    (83.41%)
             51983      branch-misses             #    0.00% of all branches          (83.22%)

       1.640573324 seconds time elapsed

       1.640610000 seconds user
       0.000000000 seconds sys


david@pc1:/tmp$ perf stat ./sin2_32_c
-50213557

 Performance counter stats for './sin2_32_c':

           2043.25 msec task-clock                #    1.000 CPUs utilized          
                 2      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               175      page-faults               #    0.086 K/sec                  
        8239753164      cycles                    #    4.033 GHz                      (83.36%)
           5344609      stalled-cycles-frontend   #    0.06% frontend cycles idle     (83.36%)
        5192399565      stalled-cycles-backend    #   63.02% backend cycles idle      (83.36%)
       27240135813      instructions              #    3.31  insn per cycle         
                                                  #    0.19  stalled cycles per insn  (83.36%)
        2097472578      branches                  # 1026.539 M/sec                    (83.36%)
             58111      branch-misses             #    0.00% of all branches          (83.20%)

       2.043735436 seconds time elapsed

       2.043777000 seconds user
       0.000000000 seconds sys


david@pc1:/tmp$ perf stat ./sin2_32_asm
-50213557

 Performance counter stats for './sin2_32_asm':

           3969.39 msec task-clock                #    0.999 CPUs utilized          
                19      context-switches          #    0.005 K/sec                  
                 7      cpu-migrations            #    0.002 K/sec                  
               175      page-faults               #    0.044 K/sec                  
       15536969090      cycles                    #    3.914 GHz                      (64.97%)
          10339765      stalled-cycles-frontend   #    0.07% frontend cycles idle     (64.86%)
        9893809047      stalled-cycles-backend    #   63.68% backend cycles idle      (64.83%)
       23164560938      instructions              #    1.49  insn per cycle         
                                                  #    0.43  stalled cycles per insn  (64.77%)
        2104231298      branches                  #  530.114 M/sec                    (64.85%)
            173766      branch-misses             #    0.01% of all branches          (64.97%)

       3.973235657 seconds time elapsed

       3.969949000 seconds user
       0.000000000 seconds sys
The 64-bit C version is ~25% faster than the 32-bit C version. The 32-bit C version is ~95% faster than the 32-bit assembly version. I think the results speak for themselves.

It is also important to note that while the assembly version executes fewer instructions than the C version, the C version has much better instruction scheduling, which allows it to achieve roughly twice the IPC.

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Posted: Wed May 11, 2022 11:33 am
by rdos
Demindiro wrote: The 64-bit C version is ~25% faster than the 32-bit C version. The 32-bit C version is ~95% faster than the 32-bit assembly version. I think the results speak for themselves.

It is also important to note that while the assembly version executes fewer instructions than the C version, the C version has much better instruction scheduling, which allows it to achieve roughly twice the IPC.
So, what is your hardware? Do you have the assembly code for the 32-bit C version?

The phase values should be random ints, but it might not matter.

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Posted: Wed May 11, 2022 11:40 am
by Demindiro
rdos wrote:So, what is your hardware?
I use a Ryzen 7 2700X.
rdos wrote:Do you have the assembly code for the 32 bit C version?
The relevant assembly is:

Code: Select all

00001060 <main>:
}

int main(int argc, char **argv) {

   volatile long long dont_optimize_me;
   for (int i = 0; i < 100; i++) {
    1060:	e8 04 02 00 00       	call   1269 <__x86.get_pc_thunk.ax>
    1065:	05 9b 2f 00 00       	add    eax,0x2f9b
int main(int argc, char **argv) {
    106a:	8d 4c 24 04          	lea    ecx,[esp+0x4]
    106e:	83 e4 f0             	and    esp,0xfffffff0
    1071:	ff 71 fc             	push   DWORD PTR [ecx-0x4]
    1074:	55                   	push   ebp
    1075:	89 e5                	mov    ebp,esp
    1077:	57                   	push   edi
    1078:	56                   	push   esi
    1079:	53                   	push   ebx
    107a:	51                   	push   ecx
    107b:	83 ec 38             	sub    esp,0x38
    107e:	89 45 c0             	mov    DWORD PTR [ebp-0x40],eax
    1081:	8d b8 40 00 00 00    	lea    edi,[eax+0x40]
    1087:	8d 80 40 00 08 00    	lea    eax,[eax+0x80040]
   for (int i = 0; i < 100; i++) {
    108d:	c7 45 c8 00 00 00 00 	mov    DWORD PTR [ebp-0x38],0x0
    1094:	89 7d cc             	mov    DWORD PTR [ebp-0x34],edi
    1097:	89 45 c4             	mov    DWORD PTR [ebp-0x3c],eax
    109a:	05 00 00 80 00       	add    eax,0x800000
    109f:	89 45 d0             	mov    DWORD PTR [ebp-0x30],eax
    10a2:	8d b6 00 00 00 00    	lea    esi,[esi+0x0]
      for (int k = 10; k < 20; k++) {
    10a8:	c7 45 d4 0a 00 00 00 	mov    DWORD PTR [ebp-0x2c],0xa
    10af:	90                   	nop
   for (int i = 0; i < size; i++) {
    10b0:	8b 4d c4             	mov    ecx,DWORD PTR [ebp-0x3c]
int main(int argc, char **argv) {
    10b3:	8b 5d c8             	mov    ebx,DWORD PTR [ebp-0x38]
   long long sum = 0;
    10b6:	31 f6                	xor    esi,esi
    10b8:	31 ff                	xor    edi,edi
    10ba:	8d b6 00 00 00 00    	lea    esi,[esi+0x0]
      sum += sin_table[p] * data[2 * i];
    10c0:	8b 55 cc             	mov    edx,DWORD PTR [ebp-0x34]
      int p = (phase >> 14) & 0x3ffff;
    10c3:	89 d8                	mov    eax,ebx
    10c5:	c1 e8 0e             	shr    eax,0xe
      sum += sin_table[p] * data[2 * i];
    10c8:	0f bf 04 42          	movsx  eax,WORD PTR [edx+eax*2]
    10cc:	0f bf 11             	movsx  edx,WORD PTR [ecx]
    10cf:	0f af c2             	imul   eax,edx
    10d2:	99                   	cdq    
    10d3:	01 c6                	add    esi,eax
    10d5:	11 d7                	adc    edi,edx
      phase += phase_per_sample;
    10d7:	03 5d d4             	add    ebx,DWORD PTR [ebp-0x2c]
   for (int i = 0; i < size; i++) {
    10da:	83 c1 04             	add    ecx,0x4
    10dd:	39 4d d0             	cmp    DWORD PTR [ebp-0x30],ecx
    10e0:	75 de                	jne    10c0 <main+0x60>
      for (int k = 10; k < 20; k++) {
    10e2:	83 45 d4 01          	add    DWORD PTR [ebp-0x2c],0x1
    10e6:	8b 45 d4             	mov    eax,DWORD PTR [ebp-0x2c]
         long long power;
         CalcPower(data, sizeof(data) / sizeof(*data) / 2, i, k, &power); 
         dont_optimize_me = power;
    10e9:	89 75 e0             	mov    DWORD PTR [ebp-0x20],esi
    10ec:	89 7d e4             	mov    DWORD PTR [ebp-0x1c],edi
      for (int k = 10; k < 20; k++) {
    10ef:	83 f8 14             	cmp    eax,0x14
    10f2:	75 bc                	jne    10b0 <main+0x50>
   for (int i = 0; i < 100; i++) {
    10f4:	83 45 c8 01          	add    DWORD PTR [ebp-0x38],0x1
    10f8:	8b 45 c8             	mov    eax,DWORD PTR [ebp-0x38]
    10fb:	83 f8 64             	cmp    eax,0x64
    10fe:	75 a8                	jne    10a8 <main+0x48>
      }
   }

   // Print one to verify correctness
   printf("%lld\n", dont_optimize_me);
    1100:	8b 45 e0             	mov    eax,DWORD PTR [ebp-0x20]
    1103:	8b 5d c0             	mov    ebx,DWORD PTR [ebp-0x40]
    1106:	83 ec 04             	sub    esp,0x4
    1109:	8b 55 e4             	mov    edx,DWORD PTR [ebp-0x1c]
    110c:	52                   	push   edx
    110d:	50                   	push   eax
    110e:	8d 83 08 e0 ff ff    	lea    eax,[ebx-0x1ff8]
    1114:	50                   	push   eax
    1115:	e8 16 ff ff ff       	call   1030 <printf@plt>

   return 0;
}
    111a:	83 c4 10             	add    esp,0x10
    111d:	8d 65 f0             	lea    esp,[ebp-0x10]
    1120:	31 c0                	xor    eax,eax
    1122:	59                   	pop    ecx
    1123:	5b                   	pop    ebx
    1124:	5e                   	pop    esi
    1125:	5f                   	pop    edi
    1126:	5d                   	pop    ebp
    1127:	8d 61 fc             	lea    esp,[ecx-0x4]
    112a:	c3                   	ret    
    112b:	66 90                	xchg   ax,ax
    112d:	66 90                	xchg   ax,ax
    112f:	90                   	nop

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Posted: Wed May 11, 2022 11:57 am
by rdos
I'm a bit impressed. :-)

So, I modified the code slightly to avoid mixing operand sizes and a few other things:

Code: Select all

_CalcFreqPowerA  Proc near
    push ebx
    push ecx
    push edx
    push esi
    push edi
    push ebp
;
    mov esi,[esp+1Ch]
    mov ecx,[esp+20h]
    mov ebp,[esp+24h]
    mov edi,[esp+2Ch]
;
    xor eax,eax
    mov [edi].pow_c,eax
    mov [edi].pow_c+4,eax

cfpaLoop:
    mov ebx,ebp
    shr ebx,14
;
    movsx eax,word ptr [2 * ebx].sin_tab
    movsx edx,word ptr [esi]
    imul eax,edx
    cdq
    add dword ptr [edi].pow_c,eax
    adc dword ptr [edi].pow_c+4,edx
;
    add esi,4
    add ebp,[esp+28h]
    sub ecx,1
    jnz cfpaLoop
;
    mov eax,ebp
;
    pop ebp
    pop edi
    pop esi
    pop edx
    pop ecx
    pop ebx
    ret 20
_CalcFreqPowerA    Endp
I wonder a bit about the use of imul eax,edx and cdq vs using only imul edx. The latter should be faster.

I will test it on real calculations to see if it makes a difference. I'm using a 3.5 GHz 12-core Threadripper (2920X).

However, this is done without fsin, by generating a 16-bit sine table instead. The standard code using sin() and floating point is sure to do much worse, since sin() in this implementation is basically a single integer operation.

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Posted: Thu May 12, 2022 1:17 am
by rdos
The "C" variant actually reduces the execution time of my analysis tool from a bit over 7 hours to 5 hours. Not 95%, but it is significantly faster.

Still, the big issue is that standard C code (using floating-point trigonometry) is hopelessly slow for these kinds of applications and needs to be replaced with better algorithms.

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Posted: Thu May 12, 2022 4:13 pm
by Ethin
rdos wrote:The "C" variant actually reduces the execution time of my analysis tool from a bit over 7 hours to 5 hours. Not 95%, but it is significantly faster.

Still, the big issue is that standard C code (using floating-point trigonometry) is hopelessly slow for these kinds of applications and needs to be replaced with better algorithms.
Talk about a very subjective claim... Given that 99 percent of C/C++ apps use libm for math routines, and sometimes go with -ffast-math, I'd say this is a pretty false one in the general case. Or they use openlibm. But regardless, the vast majority of C applications use math libraries because the processor's FPU math functions (or SSE/AVX equivalents, if they even exist) have varying behavior depending on the processor, and consistency is required. That, and most apps don't require math functions that can do their thing in 1.2us or something. (Also, I think you'll find that actually implementing the trigonometric functions is not exactly an easy thing to do, but hey, if you can write significantly faster versions, I'm sure the developers of openlibm or the math libraries that come with most Linux systems would appreciate the performance gains.) But hey, if you do need performance over accuracy or consistency, either implement your own versions of the trigonometric functions or go with FSIN/FCOS/etc. if you don't mind processor weirdness giving you different results on different processors (or Intel underestimating the error bounds).
Edit: or you could look into cmathl. Not sure how good or fast it is, but it looks promising. It appears to stick to using floats as much as possible instead of the weird FP hacks and undefined behavior I've seen in other FP libraries.

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Posted: Thu May 12, 2022 5:55 pm
by Octocontrabass
rdos wrote:Still, the big issue is that standard C code (using floating-point trigonometry) is hopelessly slow for these kinds of applications and needs to be replaced with better algorithms.
And standard x86 assembly (using fsin) is also hopelessly slow for these kinds of applications.

A compiler isn't going to magically come up with a better algorithm, but it will almost always come up with better machine code for your algorithm.

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Posted: Fri May 13, 2022 5:17 am
by rdos
Ethin wrote:
rdos wrote:The "C" variant actually reduces the execution time of my analysis tool from a bit over 7 hours to 5 hours. Not 95%, but it is significantly faster.

Still, the big issue is that standard C code (using floating-point trigonometry) is hopelessly slow for these kinds of applications and needs to be replaced with better algorithms.
Talk about a very subjective claim... Given that 99 percent of C/C++ apps use libm for math routines, and sometimes go with -ffast-math, I'd say this is a pretty false one in the general case. Or they use openlibm. But regardless, the vast majority of C applications use math libraries because the processor's FPU math functions (or SSE/AVX equivalents, if they even exist) have varying behavior depending on the processor, and consistency is required. That, and most apps don't require math functions that can do their thing in 1.2us or something. (Also, I think you'll find that actually implementing the trigonometric functions is not exactly an easy thing to do, but hey, if you can write significantly faster versions, I'm sure the developers of openlibm or the math libraries that come with most Linux systems would appreciate the performance gains.) But hey, if you do need performance over accuracy or consistency, either implement your own versions of the trigonometric functions or go with FSIN/FCOS/etc. if you don't mind processor weirdness giving you different results on different processors (or Intel underestimating the error bounds).
Edit: or you could look into cmathl. Not sure how good or fast it is, but it looks promising. It appears to stick to using floats as much as possible instead of the weird FP hacks and undefined behavior I've seen in other FP libraries.
I think you fail to see the issue. The example is a quite common scenario in digital signal analysis. When this is implemented on signal processors, it's not done with floating point, simply because that is overkill. I don't need a double or long double result when the precision of the input signal is only 14 bits; a 16-bit result is quite enough. I don't need a double or long double input either. It's more convenient to use an input of 32 bits or less that is a fraction of a full circle, where 0 is 0 degrees, 0x40000000 is 90 degrees, 0x80000000 is 180 degrees and 0xC0000000 is 270 degrees.
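In C terms, the lookup then boils down to something like this (a simplified sketch of the idea, using the same indexing as the benchmark above, not my production code):

Code: Select all

/* Simplified sketch: the phase is a 32-bit binary angle, so one full
 * period is the whole 32-bit range and wrap-around comes for free.
 * The table holds 0x40000 16-bit samples covering one period, which
 * matches the (phase >> 14) & 0x3ffff indexing in the benchmark. */
extern short sin_table[0x40000];

static inline int fixed_sin(unsigned int phase)
{
    return sin_table[(phase >> 14) & 0x3ffff];
}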

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Posted: Fri May 13, 2022 8:25 am
by nullplan
rdos wrote:I think you fail to see the issue.
Hello pot, I am kettle. You are black.

You can implement all of the things you said in C, even in standard C. Yes, there is no standard C function that implements sin() as a table lookup, but you can write your own version of that and call it my_sin() or whatever (actually, the source code of XClock contains something like that). And I guarantee you the compiler will turn that into assembly that is at least as efficient as your hand-written assembly, unless you write your code in a brain-dead way, or use a compiler that ceased development around the time Monica Lewinsky's choice of dry cleaner mattered.

With the added bonus, of course, that a well-written C version of the code compiles for x86, AMD64, ARM, AArch64, PowerPC, Microblaze, PDP-11 and even the 6502, whereas your code is and always will be x86, or maybe AMD64 with a lot of macros. That may not matter to you now, but time makes fools of us all.

BTW, how does any of this relate to VMM on AMD64?