OSDev.org

Posted: **Sun Jun 28, 2009 4:07 am**

Hi,

For a modern CPU, will a multiplication like "IMUL EAX,2" be executed as "SHL EAX,1" internally?

thanks!

Posted: **Sun Jun 28, 2009 4:12 am**

No, although this would be an optimization.
The CPU blindly executes the opcodes you pass to it.

I don't have the specs at hand, but it is likely that IMUL and SHL affect the status flags differently, so the CPU can't simply swap one for the other.
So these optimizations aren't done by the CPU. They are handled by the compiler.

So, if you program in assembler, you have to optimize it yourself.

Posted: **Sun Jun 28, 2009 5:49 am**

No. Multiplications have, however, become really fast and tend to execute at speeds similar to shifts (on some chips, shifts are even slower in some cases).

Posted: **Sun Jun 28, 2009 6:41 am**

Hi,

blackoil wrote: Hi,

For a modern CPU, will a multiplication like "IMUL EAX,2" be executed as "SHL EAX,1" internally?

thanks!

It depends on what level you look at. At the lowest level, multiplication is implemented as shifts and adds. So yes, multiplying EAX by 2 would result in EAX being shifted left by one bit, but this would occur inside the multiplier unit - the processor won't turn the instruction into a shift instruction and send it to the ALU.

It's best to leave that sort of thing to the compiler. If, as Combuster says, shifts are more costly than multiplications on some architectures, the compiler will know that and work around it.
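The remark above that multiplication boils down to shifts and adds can be sketched in C. This is only an illustration of the idea: `mul_shift_add` is a hypothetical helper name, and real multiplier units use much more parallel hardware than this bit-by-bit loop.

```c
#include <stdint.h>

/* Illustrative shift-and-add multiply (hypothetical helper, not how any
   particular CPU's multiplier is actually wired).
   The result wraps modulo 2^32, like a 32-bit register would. */
uint32_t mul_shift_add(uint32_t a, uint32_t b)
{
    uint32_t result = 0;

    while (b != 0) {
        if (b & 1u)        /* lowest bit of the multiplier set? */
            result += a;   /* add the current shifted multiplicand */
        a <<= 1;           /* next partial product: a * 2 */
        b >>= 1;           /* move on to the next multiplier bit */
    }
    return result;
}
```

With a multiplier of 2, the loop adds exactly one partial product, namely the multiplicand shifted left by one bit, which is the "shift inside the multiplier" described above.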

Posted: **Sun Jun 28, 2009 10:34 am**

I read somewhere, although I can't remember where, that Intel (at least) optimizes MUL instructions when you multiply by a power of two. It didn't say exactly how, but I suspect it is by shifting, since that's the fastest way.

Posted: **Sun Jun 28, 2009 7:03 pm**

To multiply a var by 2, I can use one of these:

imul eax,[var],2

mov eax,[var]
shl eax,1

It's a bit difficult to determine which one is faster.

Also, I saw that Visual C++ 2008 Express uses the imul instruction for array indexing.

Posted: **Sun Jun 28, 2009 9:10 pm**

Hi,

blackoil wrote: To multiply a var by 2, I can use one of these:

imul eax,[var],2

mov eax,[var]
shl eax,1

It's a bit difficult to determine which one is faster.

In this case, the fastest way is probably "mov eax,[var]; add eax,eax", especially if there's other code you can place in between these instructions (so the CPU has something to do while waiting for the fetch from cache/RAM).

On some CPUs (e.g. early Pentium 4/Netburst) using several ADD instructions can be faster than using one instruction - e.g. "add eax,eax; add eax,eax" can be faster than "shl eax,2" or "lea eax,[eax*4]" or "imul eax,4".

Cheers,

Brendan
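As a rough illustration of why these instruction sequences are interchangeable, here is the same "times 4" written three ways in C. The function names are made up for this sketch, and the assembly in the comments only shows the form each function mirrors; a compiler is free to emit whichever instruction it considers fastest for the target CPU.

```c
#include <stdint.h>

/* Three spellings of x * 4 on a 32-bit value (all wrap modulo 2^32).
   The comments name the assembly form each one mirrors; the compiler
   may well emit something else entirely. */

uint32_t times4_mul(uint32_t x)      /* imul eax,4 */
{
    return x * 4u;
}

uint32_t times4_shift(uint32_t x)    /* shl eax,2 */
{
    return x << 2;
}

uint32_t times4_adds(uint32_t x)     /* add eax,eax; add eax,eax */
{
    x += x;   /* x * 2 */
    x += x;   /* x * 4 */
    return x;
}
```

Since all three give the same result for every input, the choice between them is purely about timing on a given CPU, which is why it is usually left to the compiler.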

Posted: **Wed Jul 01, 2009 2:42 pm**

Brendan wrote:Hi,

In this case, the fastest way is probably "mov eax,[var]; add eax,eax", especially if there's other code you can place in between these instructions (so the CPU has something to do while waiting for the fetch from cache/RAM).

On some CPUs (e.g. early Pentium 4/Netburst) using several ADD instructions can be faster than using one instruction - e.g. "add eax,eax; add eax,eax" can be faster than "shl eax,2" or "lea eax,[eax*4]" or "imul eax,4".

Cheers,

Brendan

Aah, the wonders of CPUs without barrel shifters. I doubt the two instructions together would be faster if they immediately followed each other, however - in fact I imagine they would be much slower considering the ridiculous length of NetBurst's pipeline.

Edit: I've just realised the irony: NetBurst does the exact opposite of what this thread was about.

Posted: **Wed Jul 01, 2009 3:41 pm**

From what I gathered, optimising for the Pentium 4 generally has negative effects on all other processors. I had a run at the cycle sheets to get some concrete details on the case mentioned. Might be interesting:

Consider the shift/lea/mul/add-add methods of computing reg * 4

On a 486 / Pentium 1 / Athlon, a shift takes one cycle. On NetBurst, it takes 4.
On a 486 / Pentium 1, a LEA takes one cycle; on an Athlon, a "complex" LEA takes two cycles. The Intel document does not give exact timings for NetBurst, but suggests that equivalent adds are faster given enough decoder space (which implies that it would be > 2 cycles).
An IMUL has a 4 cycle latency on an Athlon (and post-NetBurst), 10 on NetBurst, and something similarly awful on old processors. (Why again was everybody buying Athlon XPs at that time?)
A sequence of adds to do a multiplication by four would take 2 clocks on all processors except NetBurst, which does it in one cycle (two half-clock operations).

Summarized:
P1/486: depending on situation, use lea (take care of AGIs) or shifts (take care of the u-v schedule)
Athlon: always use shifts
Netburst: use adds and expect execution times for all other platforms to double.

The other conclusion: the CPU does not convert multiplications into shifts.
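The comparison above can be written down as a small lookup table. This is only a sketch: the cycle counts are the ones quoted in this post (not independently verified), `UNKNOWN` marks the cases where no concrete figure was given, and all names (`cycles`, `best_method`, the enum constants) are made up for illustration.

```c
/* Cycle counts for computing reg * 4, as quoted in the post above.
   UNKNOWN marks cases where the post gives no concrete number. */
#define UNKNOWN (-1)

enum cpu    { CPU_486_P1, CPU_ATHLON, CPU_NETBURST, CPU_COUNT };
enum method { M_SHIFT, M_LEA, M_IMUL, M_ADDS, METHOD_COUNT };

static const int cycles[CPU_COUNT][METHOD_COUNT] = {
    /*               shift  lea      imul     add+add */
    /* 486 / P1 */ { 1,     1,       UNKNOWN, 2 },
    /* Athlon   */ { 1,     2,       4,       2 },
    /* NetBurst */ { 4,     UNKNOWN, 10,      1 },
};

/* Return the method with the lowest known cycle count on the given CPU.
   Ties go to the method listed first (here: the shift). */
enum method best_method(enum cpu c)
{
    enum method best = M_SHIFT;   /* shift timings are known for every row */

    for (int m = 0; m < METHOD_COUNT; m++) {
        if (cycles[c][m] == UNKNOWN)
            continue;             /* skip entries without a figure */
        if (cycles[c][m] < cycles[c][best])
            best = (enum method)m;
    }
    return best;
}
```

On the 486 / Pentium 1 row, shift and LEA tie at one cycle, so the table alone cannot settle the choice; as the summary says, that depends on AGI stalls and U/V pairing, which this sketch does not model.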

Posted: **Wed Jul 01, 2009 5:57 pm**

Combuster wrote:P1/486: depending on situation, use lea (take care of AGIs) or shifts (take care of the u-v schedule)
Athlon: always use shifts
Netburst: use adds and expect execution times for all other platforms to double.

The nth Law of Optimization: if it's efficient somewhere, it's slow as molasses everywhere else.

Multiplication to shift internally?