Page 1 of 1

How about bitscan instruction on x86 ?

Posted: Fri Oct 28, 2016 6:15 pm
by miaowei
Hi.
Have you even used 'bsf/bsr' instruction in your project ?
How about its performance?
I saw in a post(http://www.asmcommunity.net/forums/topic/?id=3771) that bitscan instruction is quite slow on pentium.
And another evidence, the do_softirq() function in kernel/softirq.c use repeatedly bit shift operation to traverse all 32 bits.
He didn't use 'bsr/bsf'.
Here is the code:

Code: Select all

asmlinkage void do_softirq()
{
	int cpu = smp_processor_id();
	__u32 active, mask;

	if (in_interrupt())
		return;

	local_bh_disable();

	local_irq_disable();
	mask = softirq_mask(cpu);
	active = softirq_active(cpu) & mask;

	if (active) {
		struct softirq_action *h;

restart:
		/* Reset active bitmask before enabling irqs */
		softirq_active(cpu) &= ~active;

		local_irq_enable();

		h = softirq_vec;
		mask &= ~active;

		do {
			if (active & 1)
				h->action(h);
			h++;
			active >>= 1;
		} while (active);

		local_irq_disable();

		active = softirq_active(cpu);
		if ((active &= mask) != 0)
			goto retry;
	}

	local_bh_enable();

	/* Leave with locally disabled hard irqs. It is critical to close
	 * window for infinite recursion, while we help local bh count,
	 * it protected us. Now we are defenceless.
	 */
	return;

retry:
	goto restart;
}
And I pick out the relevant code:
do {
if (active & 1)
h->action(h);
h++;
active >>= 1;
} while (active);

Re: How about bitscan instruction on x86 ?

Posted: Fri Oct 28, 2016 8:11 pm
by Brendan
Hi,
miaowei wrote:Have you even used 'bsf/bsr' instruction in your project ?
Yes. For one example:

Code: Select all

	mov eax,[tempVector3+2*8+4]			;eax = highest 32-bits of C
	mov eax,edx
	xor eax,[tempVector3+1*8+4]			;eax = highest 32-bits of C ^ B
	xor eax,[tempVector3+0*8+4]			;eax = highest 32-bits of C ^ B ^ A
	cdq								;edx = sign of A ^ B ^ C
	xor eax,edx						;eax = highest 32-bits of A ^ B ^ C ^ (sign of A ^ B ^ C)
	mov ecx,31						;ebx = number of times to shift A, B and C left - 32 (for best case)
	test eax,eax						;Is there anything we need to keep?
	je .gotShift1						; no
	mov ecx,30						;ebx = number of times to shift A, B and C left - 32 (for best case)
	bsr eax,eax						;eax = bit number for highest set bit
	sub ecx,eax						;ecx = number of times to shift A, B and C left - 32 (if it's not <= 0)
	ja .gotShift1
	clr ecx							;ecx = number of times to shift A, B and C left - 32 (for worst case)

.gotShift1:
	mov edx,[tempVector3+0*8+4]
	mov eax,[tempVector3+0*8]
	shld edx,eax,cl
	mov [tempVector3+0*4],edx

	mov edx,[tempVector3+1*8+4]
	mov eax,[tempVector3+1*8]
	shld edx,eax,cl
	mov [tempVector3+1*4],edx

	mov edx,[tempVector3+2*8+4]
	mov eax,[tempVector3+2*8]
	shld edx,eax,cl
	mov [tempVector3+2*4],edx
Note: The idea here is to convert a vector with 64-bit components into a vector with 32-bit components, while shifting/scaling it to keep the most precision possible. It's used in code to find the A, B, C and D for the plane equation ("A*x + B*y + C*z +D = 0") from 3 points on the plane. Without this scaling I don't get enough precision and end up with ugly visible glitches; and this crude scaling is much much faster than doing full conversion to unit vector (which would be more precise, but would involve finding "1/sqrt(A*A + B*B + C*C)" and doing multiplications).
miaowei wrote:How about its performance?
Typically; BSR/BSF is a little slow compared to fast single instructions (like TEST, ADD, etc); but are still faster than many instructions in a loop on every 80x86 CPU ever made.
miaowei wrote:I saw in a post(http://www.asmcommunity.net/forums/topic/?id=3771) that bitscan instruction is quite slow on pentium.
It is; but even on a Pentium they're probably still almost as fast as Agner Fog's "clever" (Pentium only) hack; and probably faster on a Pentium than Agner Fog's hack in some situations (e.g. when all FPU registers are in use, or when FPU is configured for MMX and not floating point, or when there's a cache miss writing to the temporary location, or when there's a branch misprediction, or when the bottleneck is instruction fetch/decode).
miaowei wrote:And another evidence, the do_softirq() function in kernel/softirq.c use repeatedly bit shift operation to traverse all 32 bits.
This is only evidence that C doesn't provide an easily usable method of doing it (other than resorting to non-portable inline assembly), and/or evidence that C programmers aren't assembly language programmers. Unless the compiler is smart enough to optimise that loop into a BSR/BSF (which is very unlikely) that code will be about 10 times slower than it could be on almost all 80x86 CPUs.

Note: This isn't quite true. On most compilers there are intrinsics (e.g. "__builtin_ffs", "__builtin_clz" and "__builtin_ctz" for GCC) that end up being a single BSR/BSF instruction; however these instrinsics aren't part of the C standard and aren't portable between compilers.


Cheers,

Brendan