rdos wrote:
That's still a lot of code, and the sin() function is a function call with unknown number of operations.
This is basically how it looks like in assembly (there is a 256kx2 entries sin table in the same source file, sin_tab):
Code: Select all
; PARAMETERS: Data
; Size
; InitPhase
; PhasePerSample
; Power
;
; RETURNS: End phase
_CalcFreqPowerA Proc near
push ebx
push ecx
push edx
push esi
push edi
push ebp
;
mov esi,[esp+1Ch]
mov ecx,[esp+20h]
mov ebp,[esp+24h]
mov edi,[esp+2Ch]
;
xor eax,eax
mov [edi].pow_c,eax
mov [edi].pow_c+4,eax
cfpaLoop:
mov ebx,ebp
shr ebx,13
and bl,0FEh
;
mov ax,[ebx].sin_tab
imul word ptr [esi]
movsx edx,dx
add word ptr [edi].pow_c,ax
adc dword ptr [edi].pow_c+2,edx
;
add esi,4
add ebp,[esp+28h]
loop cfpaLoop
;
movsx eax,word ptr [edi].pow_c+4
mov [edi].pow_c+4,eax
;
mov eax,ebp
;
pop ebp
pop edi
pop esi
pop edx
pop ecx
pop ebx
ret 20
_CalcFreqPowerA Endp
The 64-bit version would be one less instruction in the inner loop since it could use a 64-bit register.
One thing that immediately stands out is the use of a loop instruction. While this is fast on AMD processors, it's still very slow on Intel processors for some reason (7 cycles on Ice Lake according to Agner Fog's tables). It also seems to me it would be a good idea to put phase_per_sample in a register instead of loading it every iteration.
Anyhow, if I interpreted your code correctly the equivalent C version would look like this:
Code: Select all
static long *sin_table = { 0, /* ... */ };
long CalcPower(int *data, int size, int init_phase, int phase_per_sample, int power) {
long end_phase = 0;
for (int i = 0; i < size; i++) {
int p = (init_phase >> 13) & (~1);
end_phase += sin_table[p] * data[i];
init_phase += phase_per_sample;
}
return end_phase;
}
The corresponding generated assembly with `gcc -O3 sin.c -c -m32` is:
Code: Select all
00000000 <CalcPower>:
0: 57 push edi
1: 56 push esi
2: 53 push ebx
3: 8b 54 24 14 mov edx,DWORD PTR [esp+0x14]
7: 8b 4c 24 18 mov ecx,DWORD PTR [esp+0x18]
b: 8b 74 24 1c mov esi,DWORD PTR [esp+0x1c]
f: 85 d2 test edx,edx
11: 7e 35 jle 48 <CalcPower+0x48>
13: 8b 44 24 10 mov eax,DWORD PTR [esp+0x10]
17: 31 db xor ebx,ebx
19: 8d 3c 90 lea edi,[eax+edx*4]
1c: 8d 74 26 00 lea esi,[esi+eiz*1+0x0]
20: 89 ca mov edx,ecx
22: 83 c0 04 add eax,0x4
25: 01 f1 add ecx,esi
27: c1 fa 0d sar edx,0xd
2a: 83 e2 fe and edx,0xfffffffe
2d: 8b 14 95 00 00 00 00 mov edx,DWORD PTR [edx*4+0x0]
34: 0f af 50 fc imul edx,DWORD PTR [eax-0x4]
38: 01 d3 add ebx,edx
3a: 39 c7 cmp edi,eax
3c: 75 e2 jne 20 <CalcPower+0x20>
3e: 89 d8 mov eax,ebx
40: 5b pop ebx
41: 5e pop esi
42: 5f pop edi
43: c3 ret
44: 8d 74 26 00 lea esi,[esi+eiz*1+0x0]
48: 31 db xor ebx,ebx
4a: 89 d8 mov eax,ebx
4c: 5b pop ebx
4d: 5e pop esi
4e: 5f pop edi
4f: c3 ret
With `gcc -O3 sin.c -c -march=znver1` (I use a 2700X) the generated assembly is:
Code: Select all
0000000000000000 <CalcPower>:
0: 89 f0 mov eax,esi
2: 89 d6 mov esi,edx
4: 85 c0 test eax,eax
6: 7e 48 jle 50 <CalcPower+0x50>
8: ff c8 dec eax
a: 45 31 c0 xor r8d,r8d
d: 4c 8d 4c 87 04 lea r9,[rdi+rax*4+0x4]
12: 66 66 2e 0f 1f 84 00 data16 nop WORD PTR cs:[rax+rax*1+0x0]
19: 00 00 00 00
1d: 0f 1f 00 nop DWORD PTR [rax]
20: 89 f0 mov eax,esi
22: 48 63 17 movsxd rdx,DWORD PTR [rdi]
25: 48 83 c7 04 add rdi,0x4
29: 01 ce add esi,ecx
2b: c1 f8 0d sar eax,0xd
2e: 83 e0 fe and eax,0xfffffffe
31: 48 98 cdqe
33: 48 0f af 14 c5 00 00 imul rdx,QWORD PTR [rax*8+0x0]
3a: 00 00
3c: 49 01 d0 add r8,rdx
3f: 49 39 f9 cmp r9,rdi
42: 75 dc jne 20 <CalcPower+0x20>
44: 4c 89 c0 mov rax,r8
47: c3 ret
48: 0f 1f 84 00 00 00 00 nop DWORD PTR [rax+rax*1+0x0]
4f: 00
50: 45 31 c0 xor r8d,r8d
53: 4c 89 c0 mov rax,r8
56: c3 ret
My GCC version is `gcc (Debian 10.2.1-6) 10.2.1 20210110`
To see whether it's correct & to benchmark it I'll need the lookup table though.