Hi,
miaowei wrote:Have you even used 'bsf/bsr' instruction in your project ?
Yes. For one example:
Code: Select all
mov eax,[tempVector3+2*8+4] ;eax = highest 32-bits of C
mov eax,edx
xor eax,[tempVector3+1*8+4] ;eax = highest 32-bits of C ^ B
xor eax,[tempVector3+0*8+4] ;eax = highest 32-bits of C ^ B ^ A
cdq ;edx = sign of A ^ B ^ C
xor eax,edx ;eax = highest 32-bits of A ^ B ^ C ^ (sign of A ^ B ^ C)
mov ecx,31 ;ebx = number of times to shift A, B and C left - 32 (for best case)
test eax,eax ;Is there anything we need to keep?
je .gotShift1 ; no
mov ecx,30 ;ebx = number of times to shift A, B and C left - 32 (for best case)
bsr eax,eax ;eax = bit number for highest set bit
sub ecx,eax ;ecx = number of times to shift A, B and C left - 32 (if it's not <= 0)
ja .gotShift1
clr ecx ;ecx = number of times to shift A, B and C left - 32 (for worst case)
.gotShift1:
mov edx,[tempVector3+0*8+4]
mov eax,[tempVector3+0*8]
shld edx,eax,cl
mov [tempVector3+0*4],edx
mov edx,[tempVector3+1*8+4]
mov eax,[tempVector3+1*8]
shld edx,eax,cl
mov [tempVector3+1*4],edx
mov edx,[tempVector3+2*8+4]
mov eax,[tempVector3+2*8]
shld edx,eax,cl
mov [tempVector3+2*4],edx
Note: The idea here is to convert a vector with 64-bit components into a vector with 32-bit components, while shifting/scaling it to keep the most precision possible. It's used in code to find the A, B, C and D for the plane equation ("A*x + B*y + C*z +D = 0") from 3 points on the plane. Without this scaling I don't get enough precision and end up with ugly visible glitches; and this crude scaling is much much faster than doing full conversion to unit vector (which would be more precise, but would involve finding "1/sqrt(A*A + B*B + C*C)" and doing multiplications).
miaowei wrote:How about its performance?
Typically; BSR/BSF is a little slow compared to fast single instructions (like TEST, ADD, etc); but are still faster than many instructions in a loop on every 80x86 CPU ever made.
It is; but even on a Pentium they're probably still almost as fast as Agner Fog's "clever" (Pentium only) hack; and probably faster on a Pentium than Agner Fog's hack in some situations (e.g. when all FPU registers are in use, or when FPU is configured for MMX and not floating point, or when there's a cache miss writing to the temporary location, or when there's a branch misprediction, or when the bottleneck is instruction fetch/decode).
miaowei wrote:And another evidence, the do_softirq() function in kernel/softirq.c use repeatedly bit shift operation to traverse all 32 bits.
This is only evidence that C doesn't provide an easily usable method of doing it (other than resorting to non-portable inline assembly), and/or evidence that C programmers aren't assembly language programmers. Unless the compiler is smart enough to optimise that loop into a BSR/BSF (which is very unlikely) that code will be about 10 times slower than it could be on almost all 80x86 CPUs.
Note: This isn't quite true. On most compilers there are intrinsics (e.g. "__builtin_ffs", "__builtin_clz" and "__builtin_ctz" for GCC) that end up being a single BSR/BSF instruction; however these instrinsics aren't part of the C standard and aren't portable between compilers.
Cheers,
Brendan