Re: Best way to implement MMU on x86_64 Higher Half kernel?
Posted: Tue May 10, 2022 5:00 pm
by Octocontrabass
rdos wrote:If you introduce a 16-bit sin table in the code and use integer parameters, then the compiler can produce code with similar speed, but this needs some thinking about proper algorithms to use. Something the C compiler cannot fix for you, even if you present it with the same interface.
No one is saying that a compiler can magically come up with a better algorithm. What we are saying is that the compiler does a better job of turning algorithms into machine code.
I suggest you benchmark a C compiler against your hand-written assembly.
I wouldn't be surprised if it's faster even without vectorization.
Re: Best way to implement MMU on x86_64 Higher Half kernel?
Posted: Tue May 10, 2022 5:11 pm
by Ethin
kzinti wrote:Ethin wrote:Why? You're not going to be using this in kernel mode, only in user mode. The normal sin() should be fine. There's no reason whatsoever to write your own math library from scratch or something.
Having worked in video games for almost 20 years and having had to avoid the C library's sin(), cos() and other trigonometric functions on every project, I have to disagree with you.
sin() is not fine where performance and exact behaviour matter. sin() is slow, especially on x86. C's model doesn't map perfectly to the x86 FPU, and the implementation has to jump through various hoops to work around that. Sure, you can disable some of it with compiler switches, but it is still slow and non-portable.
I remember a time when sin() would not produce the same results on different x86 processors because they had different versions of SSE. This would make our game go out of sync in network play because the computers were basically not producing the same simulation.
I thought that the hardware FSIN (for example) was less accurate and slower than the C version. Wouldn't that only be worse?
Re: Best way to implement MMU on x86_64 Higher Half kernel?
Posted: Tue May 10, 2022 8:09 pm
by kzinti
Ethin wrote:I thought that the hardware FSIN (for example) was less accurate and slower than the C version. Wouldn't that only be worse?
I am not sure I follow your logic. What I have seen C implementations do is basically use fsin or the SSE equivalent depending on what is available on the processor. fsin's result can actually change depending on what is in the FPU state when you execute it (internally, the FPU registers have more bits than what your C code is using). C library implementations typically have extra code to work around this, which means that calling sin() is more than just an fsin instruction. This can sometimes be disabled or tweaked with implementation-specific functions, but you are losing determinism/portability when you start changing the behaviour of standard C functions.
What matters (for video games anyways) is:
1) performance
2) determinism
You can tune your implementation to whatever precision/accuracy you require (which might be different for each usage).
So in this context, we would go to extra lengths to ensure no one was using fsin (and, at the time, SSE, since it wasn't available everywhere). This basically means no C library trigonometry (or sqrt). We had to write our own implementations, and that meant a mix of lookup tables and Newton iterations.
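To give an idea of what the Newton part looks like (this is a rough sketch with an arbitrary initial guess and iteration count, not the code we actually shipped), the nice property is that you decide exactly how many steps you pay for, so the cost is fixed and the result is the same on every machine:
Code: Select all
/* Rough sketch of a Newton-Raphson square root in plain C.
 * The crude initial guess and the fixed four iterations are arbitrary
 * choices for illustration; real code would tune both, typically seeding
 * the guess from a small table or a bit trick. */
static float my_sqrt(float a)
{
    if (a <= 0.0f)
        return 0.0f;                          /* good enough for a sketch */

    float x = a < 1.0f ? 1.0f : a * 0.5f;     /* crude initial guess */
    for (int i = 0; i < 4; i++)               /* fixed, predictable cost */
        x = 0.5f * (x + a / x);               /* x = (x + a/x) / 2 */
    return x;
}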
Re: Best way to implement MMU on x86_64 Higher Half kernel?
Posted: Tue May 10, 2022 8:18 pm
by Ethin
kzinti wrote:Ethin wrote:I thought that the hardware FSIN (for example) was less accurate and slower than the C version. Wouldn't that only be worse?
I am not sure I follow your logic. What I have seen C implementations do is basically use fsin or the SSE equivalent depending on what is available on the processor. fsin's result can actually change depending on what is in the FPU state when you execute it (internally, the FPU registers have more bits than what your C code is using). Again, I understand you can configure your FPU precision/behaviour, but you are losing portability/determinism.
What matters (for video games anyways) is:
1) performance
2) determinism
You can tune your implementation to whatever precision/accuracy you require (which might be different for each usage).
So in this context, we would go to extra lengths to ensure no one was using fsin (and, at the time, SSE, since it wasn't available everywhere). This basically means no C library trigonometry (or sqrt). We had to write our own implementations, and that meant a mix of lookup tables and Newton iterations.
As I said in a previous post, I've written a lot of math using cos(), sin(), etc., and I've never seen a compiler emit a single FPU instruction -- it's always a library call to libc's sin()/cos() function. If there's a way to get GCC/Clang to emit FPU instructions, I don't know what it is -- it's obviously not enabled by default, or the compiler chooses a more efficient method (SSE, SSE2, ...), though even in that case there are no SSE/AVX-based sin calculations; it's a library call.
Re: Best way to implement MMU on x86_64 Higher Half kernel?
Posted: Tue May 10, 2022 8:20 pm
by kzinti
Ethin wrote:As I said in a previous post, I've written a lot of math using cos(), sin(), etc., and I've never seen a compiler emit a single FPU instruction -- it's always a library call to libc's sin()/cos() function.
That is because standard C's sin() function doesn't map to the FPU fsin instruction or the SSE equivalent. You can't replace a call to sin() with a single fsin instruction; they don't do the same thing. I know it's not intuitive, but basically the C implementation has to do extra work to ensure the behaviour of the function matches what the C standard says it should do. This costs significant performance at a minimum. If you call it once in a while, it doesn't matter. But if you need tons of sin() calls in a tight loop, it can be a problem.
And some of the C sin() implementations I have seen will use the FPU or a different version of SSE/AVX/etc. depending on what the processor supports.
Re: Best way to implement MMU on x86_64 Higher Half kernel?
Posted: Tue May 10, 2022 9:57 pm
by Octocontrabass
Ethin wrote:As I said in a previous post, I've written a lot of math using cos(), sin(), etc., and I've never seen a compiler emit a single FPU instruction
GCC will do it if you enable optimizations that aren't compatible with the C standard. But even then it often chooses the library function for float and double, since many x87 instructions will waste time calculating a long double result.
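For example (treat the exact flags as an illustration -- whether it actually inlines depends on your GCC version and target), something like this is how you would coax it into view:
Code: Select all
/* sin_inline.c -- illustrative only. Compile with something like
 *   gcc -O2 -ffast-math -mfpmath=387 -S sin_inline.c
 * and look for an fsin instruction in sin_inline.s. For float and double,
 * GCC often still prefers the library call, since x87 computes a long
 * double result anyway. */
#include <math.h>

long double poke_fsin(long double x)
{
    return sinl(x);
}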
Re: Best way to implement MMU on x86_64 Higher Half kernel?
Posted: Wed May 11, 2022 1:22 am
by Demindiro
So I decided to generate some random data and sin tables so I could at least benchmark something.
After struggling to get GCC to compile something that doesn't segfault (damn PIE code) I've come up with the following:
Code: Select all
#!/usr/bin/env python3
from random import randint

with open("/tmp/sin_table.c", "w") as f:
    f.write("short sin_table[0x40000] = {\n")
    for _ in range(0x4_0000):
        f.write(str(randint(-255, 255)) + '\n,')
    f.write("};")

with open("/tmp/data.c", "w") as f:
    f.write("short data[1 << 22] = {\n")
    for _ in range(1 << 22):
        f.write(str(randint(-255, 255)) + '\n,')
    f.write("};")
Code: Select all
#include "stdio.h"
extern short sin_table[0x40000];
extern short data[1 << 22];
static int CalcPower(short int *data, int size, int init_phase, int phase_per_sample, long long *power) {
int phase = init_phase;
long long sum = 0;
for (int i = 0; i < size; i++) {
int p = (phase >> 14) & 0x3ffff;
sum += sin_table[p] * data[2 * i];
phase += phase_per_sample;
}
*power = sum;
return phase;
}
int main(int argc, char **argv) {
volatile long long dont_optimize_me;
for (int i = 0; i < 100; i++) {
for (int k = 10; k < 20; k++) {
long long power;
CalcPower(data, sizeof(data) / sizeof(*data) / 2, i, k, &power);
dont_optimize_me = power;
}
}
// Print one to verify correctness
printf("%lld\n", dont_optimize_me);
return 0;
}
Code: Select all
.intel_syntax noprefix
.globl _CalcFreqPowerA
.globl main
.section .text._CalcFreqPowerA
.p2align 4
# PARAMETERS: Data
# Size
# InitPhase
# PhasePerSample
# Power
#
# RETURNS: End phase
_CalcFreqPowerA:
push ebx
push ecx
push edx
push esi
push edi
push ebp
#
mov esi,[esp+0x1C]
mov ecx,[esp+0x20]
mov ebp,[esp+0x24]
mov edi,[esp+0x2C]
#
xor eax,eax
mov [edi + 0],eax
mov [edi + 4],eax
cfpaLoop:
mov ebx,ebp
shr ebx,13
and bl,0xFE
#
mov ax, [ebx + sin_table]
imul word ptr [esi]
movsx edx,dx
add word ptr [edi + 0],ax
adc dword ptr [edi + 2],edx
#
add esi,4
add ebp,[esp+0x28]
loop cfpaLoop
#
movsx eax,word ptr [edi + 4]
mov [edi + 4],eax
#
mov eax,ebp
#
pop ebp
pop edi
pop esi
pop edx
pop ecx
pop ebx
ret 20
.section .text.main
.p2align 4
main:
sub esp, 8 # reserve storage for power value
# for (int i = 0; i < 100; i++)
xor ecx, ecx
2:
# for (int k = 10; k < 20; k++)
mov edx, 10
3:
# CalcPower
push esp
push edx
push ecx
push 1 << 21
push [data_ptr]
call _CalcFreqPowerA
# for k
inc edx
cmp edx, 20
jnz 3b
# for i
inc ecx
cmp ecx, 100
jnz 2b
# printf
push [format_str_lld_ptr]
call printf
add esp, 4 + 8
# return 0
xor eax, eax
ret
data_ptr:
.long data
format_str_lld_ptr:
.long format_str_lld
.section .text.rodata
format_str_lld:
.asciz "%lld\n"
Code: Select all
CARGS += -O3
CARGS += -Wall
#CARGS += -flto

all: sin2_64_c_native sin2_64_c sin2_32_c sin2_32_asm

sin2_64_c_native: sin2.c sin_table.c data_64.o
	$(CC) $(CARGS) -march=native $^ -o $@

sin2_64_c: sin2.c sin_table.c data_64.o
	$(CC) $(CARGS) $^ -o $@

sin2_32_c: sin2.c sin_table.c data_32.o
	$(CC) $(CARGS) -m32 $^ -o $@ -g3

sin2_32_asm: sin2.s sin_table.c data_32.o
	$(CC) $(CARGS) -no-pie -fno-pie -m32 $^ -o $@

data_64.o: data.c
	$(CC) $(CARGS) -c $^ -o $@

data_32.o: data.c
	$(CC) $(CARGS) -c -m32 $^ -o $@
Benchmarking with `perf stat` yields the following results:
Code: Select all
david@pc1:/tmp$ perf stat ./sin2_64_c_native
-50213557
Performance counter stats for './sin2_64_c_native':
1634.27 msec task-clock # 1.000 CPUs utilized
3 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
184 page-faults # 0.113 K/sec
6498854866 cycles # 3.977 GHz (83.36%)
3982498 stalled-cycles-frontend # 0.06% frontend cycles idle (83.36%)
3418439760 stalled-cycles-backend # 52.60% backend cycles idle (83.36%)
23030851488 instructions # 3.54 insn per cycle
# 0.15 stalled cycles per insn (83.36%)
2090698244 branches # 1279.289 M/sec (83.36%)
47624 branch-misses # 0.00% of all branches (83.22%)
1.634828733 seconds time elapsed
1.634865000 seconds user
0.000000000 seconds sys
david@pc1:/tmp$ perf stat ./sin2_64_c
-50213557
Performance counter stats for './sin2_64_c':
1639.85 msec task-clock # 1.000 CPUs utilized
2 context-switches # 0.001 K/sec
2 cpu-migrations # 0.001 K/sec
183 page-faults # 0.112 K/sec
6457718560 cycles # 3.938 GHz (83.17%)
3505234 stalled-cycles-frontend # 0.05% frontend cycles idle (83.37%)
3482426541 stalled-cycles-backend # 53.93% backend cycles idle (83.41%)
23106438497 instructions # 3.58 insn per cycle
# 0.15 stalled cycles per insn (83.41%)
2098396719 branches # 1279.628 M/sec (83.41%)
51983 branch-misses # 0.00% of all branches (83.22%)
1.640573324 seconds time elapsed
1.640610000 seconds user
0.000000000 seconds sys
david@pc1:/tmp$ perf stat ./sin2_32_c
-50213557
Performance counter stats for './sin2_32_c':
2043.25 msec task-clock # 1.000 CPUs utilized
2 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
175 page-faults # 0.086 K/sec
8239753164 cycles # 4.033 GHz (83.36%)
5344609 stalled-cycles-frontend # 0.06% frontend cycles idle (83.36%)
5192399565 stalled-cycles-backend # 63.02% backend cycles idle (83.36%)
27240135813 instructions # 3.31 insn per cycle
# 0.19 stalled cycles per insn (83.36%)
2097472578 branches # 1026.539 M/sec (83.36%)
58111 branch-misses # 0.00% of all branches (83.20%)
2.043735436 seconds time elapsed
2.043777000 seconds user
0.000000000 seconds sys
david@pc1:/tmp$ perf stat ./sin2_32_asm
-50213557
Performance counter stats for './sin2_32_asm':
3969.39 msec task-clock # 0.999 CPUs utilized
19 context-switches # 0.005 K/sec
7 cpu-migrations # 0.002 K/sec
175 page-faults # 0.044 K/sec
15536969090 cycles # 3.914 GHz (64.97%)
10339765 stalled-cycles-frontend # 0.07% frontend cycles idle (64.86%)
9893809047 stalled-cycles-backend # 63.68% backend cycles idle (64.83%)
23164560938 instructions # 1.49 insn per cycle
# 0.43 stalled cycles per insn (64.77%)
2104231298 branches # 530.114 M/sec (64.85%)
173766 branch-misses # 0.01% of all branches (64.97%)
3.973235657 seconds time elapsed
3.969949000 seconds user
0.000000000 seconds sys
The 64-bit C version is ~25% faster than the 32-bit C version. The 32-bit C version is ~95% faster than the 32-bit assembly version. I think the results speak for themselves.
Important to note is that while the assembly version executes fewer instructions than the C version, the C version has much better instruction scheduling, which allows it to achieve roughly twice the IPC.
Re: Best way to implement MMU on x86_64 Higher Half kernel?
Posted: Wed May 11, 2022 11:33 am
by rdos
Demindiro wrote:
The 64-bit C version is ~25% faster than the 32-bit C version. The 32-bit C version is ~95% faster than the 32-bit assembly version. I think the results speak for themselves.
Important to note is that while the assembly version executes fewer instructions than the C version, the C version has much better instruction scheduling, which allows it to achieve roughly twice the IPC.
So, what is your hardware? Do you have the assembly code for the 32 bit C version?
The phase values should be random ints, but it might not matter.
Re: Best way to implement MMU on x86_64 Higher Half kernel?
Posted: Wed May 11, 2022 11:40 am
by Demindiro
rdos wrote:So, what is your hardware?
I use a 2700X
rdos wrote:Do you have the assembly code for the 32 bit C version?
The relevant assembly is:
Code: Select all
00001060 <main>:
}
int main(int argc, char **argv) {
volatile long long dont_optimize_me;
for (int i = 0; i < 100; i++) {
1060: e8 04 02 00 00 call 1269 <__x86.get_pc_thunk.ax>
1065: 05 9b 2f 00 00 add eax,0x2f9b
int main(int argc, char **argv) {
106a: 8d 4c 24 04 lea ecx,[esp+0x4]
106e: 83 e4 f0 and esp,0xfffffff0
1071: ff 71 fc push DWORD PTR [ecx-0x4]
1074: 55 push ebp
1075: 89 e5 mov ebp,esp
1077: 57 push edi
1078: 56 push esi
1079: 53 push ebx
107a: 51 push ecx
107b: 83 ec 38 sub esp,0x38
107e: 89 45 c0 mov DWORD PTR [ebp-0x40],eax
1081: 8d b8 40 00 00 00 lea edi,[eax+0x40]
1087: 8d 80 40 00 08 00 lea eax,[eax+0x80040]
for (int i = 0; i < 100; i++) {
108d: c7 45 c8 00 00 00 00 mov DWORD PTR [ebp-0x38],0x0
1094: 89 7d cc mov DWORD PTR [ebp-0x34],edi
1097: 89 45 c4 mov DWORD PTR [ebp-0x3c],eax
109a: 05 00 00 80 00 add eax,0x800000
109f: 89 45 d0 mov DWORD PTR [ebp-0x30],eax
10a2: 8d b6 00 00 00 00 lea esi,[esi+0x0]
for (int k = 10; k < 20; k++) {
10a8: c7 45 d4 0a 00 00 00 mov DWORD PTR [ebp-0x2c],0xa
10af: 90 nop
for (int i = 0; i < size; i++) {
10b0: 8b 4d c4 mov ecx,DWORD PTR [ebp-0x3c]
int main(int argc, char **argv) {
10b3: 8b 5d c8 mov ebx,DWORD PTR [ebp-0x38]
long long sum = 0;
10b6: 31 f6 xor esi,esi
10b8: 31 ff xor edi,edi
10ba: 8d b6 00 00 00 00 lea esi,[esi+0x0]
sum += sin_table[p] * data[2 * i];
10c0: 8b 55 cc mov edx,DWORD PTR [ebp-0x34]
int p = (phase >> 14) & 0x3ffff;
10c3: 89 d8 mov eax,ebx
10c5: c1 e8 0e shr eax,0xe
sum += sin_table[p] * data[2 * i];
10c8: 0f bf 04 42 movsx eax,WORD PTR [edx+eax*2]
10cc: 0f bf 11 movsx edx,WORD PTR [ecx]
10cf: 0f af c2 imul eax,edx
10d2: 99 cdq
10d3: 01 c6 add esi,eax
10d5: 11 d7 adc edi,edx
phase += phase_per_sample;
10d7: 03 5d d4 add ebx,DWORD PTR [ebp-0x2c]
for (int i = 0; i < size; i++) {
10da: 83 c1 04 add ecx,0x4
10dd: 39 4d d0 cmp DWORD PTR [ebp-0x30],ecx
10e0: 75 de jne 10c0 <main+0x60>
for (int k = 10; k < 20; k++) {
10e2: 83 45 d4 01 add DWORD PTR [ebp-0x2c],0x1
10e6: 8b 45 d4 mov eax,DWORD PTR [ebp-0x2c]
long long power;
CalcPower(data, sizeof(data) / sizeof(*data) / 2, i, k, &power);
dont_optimize_me = power;
10e9: 89 75 e0 mov DWORD PTR [ebp-0x20],esi
10ec: 89 7d e4 mov DWORD PTR [ebp-0x1c],edi
for (int k = 10; k < 20; k++) {
10ef: 83 f8 14 cmp eax,0x14
10f2: 75 bc jne 10b0 <main+0x50>
for (int i = 0; i < 100; i++) {
10f4: 83 45 c8 01 add DWORD PTR [ebp-0x38],0x1
10f8: 8b 45 c8 mov eax,DWORD PTR [ebp-0x38]
10fb: 83 f8 64 cmp eax,0x64
10fe: 75 a8 jne 10a8 <main+0x48>
}
}
// Print one to verify correctness
printf("%lld\n", dont_optimize_me);
1100: 8b 45 e0 mov eax,DWORD PTR [ebp-0x20]
1103: 8b 5d c0 mov ebx,DWORD PTR [ebp-0x40]
1106: 83 ec 04 sub esp,0x4
1109: 8b 55 e4 mov edx,DWORD PTR [ebp-0x1c]
110c: 52 push edx
110d: 50 push eax
110e: 8d 83 08 e0 ff ff lea eax,[ebx-0x1ff8]
1114: 50 push eax
1115: e8 16 ff ff ff call 1030 <printf@plt>
return 0;
}
111a: 83 c4 10 add esp,0x10
111d: 8d 65 f0 lea esp,[ebp-0x10]
1120: 31 c0 xor eax,eax
1122: 59 pop ecx
1123: 5b pop ebx
1124: 5e pop esi
1125: 5f pop edi
1126: 5d pop ebp
1127: 8d 61 fc lea esp,[ecx-0x4]
112a: c3 ret
112b: 66 90 xchg ax,ax
112d: 66 90 xchg ax,ax
112f: 90 nop
Re: Best way to implement MMU on x86_64 Higher Half kernel?
Posted: Wed May 11, 2022 11:57 am
by rdos
I'm a bit impressed.
So, I modified the code slightly to avoid mixing the bitness of operands and a few other things:
Code: Select all
_CalcFreqPowerA Proc near
push ebx
push ecx
push edx
push esi
push edi
push ebp
;
mov esi,[esp+1Ch]
mov ecx,[esp+20h]
mov ebp,[esp+24h]
mov edi,[esp+2Ch]
;
xor eax,eax
mov [edi].pow_c,eax
mov [edi].pow_c+4,eax
cfpaLoop:
mov ebx,ebp
shr ebx,14
;
movsx eax,word ptr [2 * ebx].sin_tab
movsx edx,word ptr [esi]
imul eax,edx
cdq
add dword ptr [edi].pow_c,eax
adc dword ptr [edi].pow_c+4,edx
;
add esi,4
add ebp,[esp+28h]
sub ecx,1
jnz cfpaLoop
;
mov eax,ebp
;
pop ebp
pop edi
pop esi
pop edx
pop ecx
pop ebx
ret 20
_CalcFreqPowerA Endp
I wonder a bit about the use of imul eax,edx and cdq vs using only imul edx. The latter should be faster.
Will test it on real calculations to see if it makes a difference. I'm using a 3.5 GHz 12-core Threadripper (2920X).
However, this is without using fsin, instead generating a 16-bit sin table. The standard code using sin() and floating point is sure to do much worse, since sin() in this implementation is basically a single integer operation.
Re: Best way to implement MMU on x86_64 Higher Half kernel?
Posted: Thu May 12, 2022 1:17 am
by rdos
The "C" variant actually reduce the execution time of my analysis tool from a bit over 7 hours to 5 hours. Not 95%, but it is significantly faster.
Still, the big issue is how standard C code (using floating point trigometry) is hopelessly slow for these kinds of applications and needs to be replaced with better algorithms.
Re: Best way to implement MMU on x86_64 Higher Half kernel?
Posted: Thu May 12, 2022 4:13 pm
by Ethin
rdos wrote:The "C" variant actually reduce the execution time of my analysis tool from a bit over 7 hours to 5 hours. Not 95%, but it is significantly faster.
Still, the big issue is how standard C code (using floating point trigometry) is hopelessly slow for these kinds of applications and needs to be replaced with better algorithms.
Talk about a very subjective claim... Given that 99 percent of C/C++ apps use libm for math routines, and sometimes go with -ffast-math, I'd say this is a pretty false one in the general case. Or they use openlibm. But regardless, the vast majority of C applications use math libraries because the processor's FPU math functions (or SSE/AVX equivalents, if they even exist) have varying behavior depending on the processor, and consistency is required. That, and most apps don't require math functions that can do their thing in 1.2us or something. (Also, I think you'll find that actually implementing the trigonometric functions is not exactly an easy thing to do, but hey, if you can write significantly faster versions, I'm sure the developers of openlibm or the math libraries that come with most Linux systems would appreciate the performance gains.) But hey, if you do need performance over accuracy or consistency, either implement your own versions of the trigonometric functions or go with FSIN/FCOS/etc. if you don't mind processor weirdness giving you different results on different processors (or Intel underestimating the error bounds).
Edit: or you could see about cmathl. Not sure how good or fast it is, but it looks promising. It appears to stick to using floats as much as possible instead of the weird FP hacks and undefined behavior like the other FP libs I've seen.
Re: Best way to implement MMU on x86_64 Higher Half kernel?
Posted: Thu May 12, 2022 5:55 pm
by Octocontrabass
rdos wrote:Still, the big issue is how standard C code (using floating-point trigonometry) is hopelessly slow for these kinds of applications and needs to be replaced with better algorithms.
And standard x86 assembly (using fsin) is also hopelessly slow for these kinds of applications.
A compiler isn't going to magically come up with a better algorithm, but it will almost always come up with better machine code for your algorithm.
Re: Best way to implement MMU on x86_64 Higher Half kernel?
Posted: Fri May 13, 2022 5:17 am
by rdos
Ethin wrote:rdos wrote:The "C" variant actually reduces the execution time of my analysis tool from a bit over 7 hours to 5 hours. Not 95%, but it is significantly faster.
Still, the big issue is how standard C code (using floating-point trigonometry) is hopelessly slow for these kinds of applications and needs to be replaced with better algorithms.
Talk about a very subjective claim... Given that 99 percent of C/C++ apps use libm for math routines, and sometimes go with -ffast-math, I'd say this is a pretty false one in the general case. Or they use openlibm. But regardless, the vast majority of C applications use math libraries because the processor's FPU math functions (or SSE/AVX equivalents, if they even exist) have varying behavior depending on the processor, and consistency is required. That, and most apps don't require math functions that can do their thing in 1.2us or something. (Also, I think you'll find that actually implementing the trigonometric functions is not exactly an easy thing to do, but hey, if you can write significantly faster versions, I'm sure the developers of openlibm or the math libraries that come with most Linux systems would appreciate the performance gains.) But hey, if you do need performance over accuracy or consistency, either implement your own versions of the trigonometric functions or go with FSIN/FCOS/etc. if you don't mind processor weirdness giving you different results on different processors (or Intel underestimating the error bounds).
Edit: or you could see about cmathl. Not sure how good or fast it is, but it looks promising. It appears to stick to using floats as much as possible instead of the weird FP hacks and undefined behavior like the other FP libs I've seen.
I think you fail to see the issue. The example is quite a common scenario in digital signal analysis. When this is implemented in signal processors, it's not implemented using floating point, simply because that is overkill. I don't need a double or long double result when the precision of the input signal is only 14 bits; a 16-bit result is quite enough. I don't need a double or long double input either. It's more convenient to use an input of 32 bits or less that is a fraction where 0 is 0 degrees, 0x40000000 is 90 degrees, 0x80000000 is 180 degrees and 0xC0000000 is 270 degrees.
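To make the convenience concrete (the helper name below is just for illustration), with a 0x40000-entry table the index is simply the top 18 bits of the 32-bit phase, and the phase accumulator wraps around for free when it overflows, so there is no range reduction at all:
Code: Select all
/* Sketch of the 32-bit fixed-point phase convention:
 *   0x00000000 =   0 degrees, 0x40000000 =  90 degrees,
 *   0x80000000 = 180 degrees, 0xC0000000 = 270 degrees.
 * With 0x40000 (2^18) 16-bit entries covering one full turn, the index
 * is just the top 18 bits of the phase; adding phase_per_sample simply
 * wraps around when the accumulator overflows. */
extern short sin_table[0x40000];   /* filled elsewhere with scaled sin() values */

static short sin_fixed(unsigned int phase)
{
    return sin_table[(phase >> 14) & 0x3ffff];   /* top 18 bits select the entry */
}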
Re: Best way to implement MMU on x86_64 Higher Half kernel?
Posted: Fri May 13, 2022 8:25 am
by nullplan
rdos wrote:I think you fail to see the issue.
Hello pot, I am kettle. You are black.
You can implement all of the things you said in C, even in standard C. Yes, there is no standard C function that implements sin() as a table lookup, but you can write your own version of that and call it my_sin() or whatever (actually, the source code of XClock contains something like that). And I guarantee you the compiler will turn that into assembler that is at least as efficient as your hand-written assembler, unless you write your code in a brain-dead way, or use a compiler that ceased development around the time Monica Lewinsky's choice of dry cleaner mattered.
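Just to be concrete, here is the sort of my_sin() I have in mind -- a rough sketch written off the cuff (the table size, the build-once init, and the lack of interpolation are all arbitrary choices), not anyone's production code:
Code: Select all
/* Portable, plain-C sketch of a table-based my_sin(). Call my_sin_init()
 * once at startup; pick the table size and add interpolation to match
 * whatever precision your input data actually has. */
#include <math.h>

#define MY_SIN_BITS 12
#define MY_SIN_SIZE (1 << MY_SIN_BITS)

static const double MY_TWO_PI = 6.28318530717958647692;

static double my_sin_table[MY_SIN_SIZE];

static void my_sin_init(void)
{
    for (int i = 0; i < MY_SIN_SIZE; i++)
        my_sin_table[i] = sin(MY_TWO_PI * i / MY_SIN_SIZE);
}

static double my_sin(double x)
{
    double turns = x / MY_TWO_PI;                      /* radians -> turns */
    turns -= floor(turns);                             /* reduce to [0, 1) */
    int i = (int)(turns * MY_SIN_SIZE) & (MY_SIN_SIZE - 1);
    return my_sin_table[i];
}
Swap the double table for 16-bit entries and the radian argument for an integer phase and you are essentially back at rdos's version, still in portable C.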
With the added bonus, of course, that a well-written C version of the code compiles for x86, AMD64, ARM, AArch64, PowerPC, Microblaze, PDP-11 and even the 6502, whereas your code is and always will be x86, or maybe AMD64 with a lot of macros. That may not matter to you now, but time makes fools of us all.
BTW, how does any of this relate to VMM on AMD64?