multiplication to shift internally?
Hi,
For modern CPUs, will a multiplication like "IMUL EAX,2" be executed as "SHL EAX,1" internally?
thanks!
Re: multiplication to shift internally?
No, although this would be an optimization.
The CPU blindly executes the opcodes you pass to it.
I don't have the specs at hand, but it is possible that IMUL affects some status or flag registers while SHL doesn't.
So these optimizations aren't applied by the CPU; they are handled by the compiler.
So, if you program in assembler, you have to optimize it yourself.
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
Re: multiplication to shift internally?
No. Multiplications have, however, become really fast and tend to execute at speeds similar to shifts (on some chips, shifts are even slower in some cases).
Re: multiplication to shift internally?
Hi,

blackoil wrote:
For modern CPUs, will a multiplication like "IMUL EAX,2" be executed as "SHL EAX,1" internally?

It depends at what level you look. At the lowest level, all multiplications are implemented as shifts and adds. So yes, multiplying EAX by 2 would result in EAX being shifted left one bit, but this would occur inside the multiplier unit - the instruction won't be optimised into the ALU (as a shift instruction) by the processor.

It's best to leave that sort of thing to the compiler - if, as Combuster says, shifts are more costly than multiplications on some architectures, the compiler will know that and work around it.
- Love4Boobies
- Member
- Posts: 2111
- Joined: Fri Mar 07, 2008 5:36 pm
- Location: Bucharest, Romania
Re: multiplication to shift internally?
I read somewhere, although I can't actually remember where, that (at least) Intel optimizes MUL instructions when you multiply by a power of two. It didn't say exactly how, but I doubt it's anything other than shifting, since that's the fastest way.
"Computers in the future may weigh no more than 1.5 tons.", Popular Mechanics (1949)
[ Project UDI ]
Re: multiplication to shift internally?
To multiply a variable by 2, I can use

imul eax,[var],2

or

mov eax,[var]
shl eax,1

It's a bit difficult to determine which one is faster.
And I saw that Visual C++ 2008 Express uses the imul instruction for array indexing.
Re: multiplication to shift internally?

Hi,

blackoil wrote:
To multiply a variable by 2, I can use
imul eax,[var],2
or
mov eax,[var]
shl eax,1
It's a bit difficult to determine which one is faster.

In this case, the fastest way is probably "mov eax,[var]; add [var],eax", especially if there's other code you can place in between these instructions (so the CPU has something to do while waiting for the fetch from cache/RAM).

On some CPUs (e.g. early Pentium 4/NetBurst) using several ADD instructions can be faster than using one instruction - e.g. "add eax,eax; add eax,eax" can be faster than "shl eax,2" or "lea eax,[eax*4]" or "imul eax,4".

Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
- Owen
- Member
- Posts: 1700
- Joined: Fri Jun 13, 2008 3:21 pm
- Location: Cambridge, United Kingdom
- Contact:
Re: multiplication to shift internally?

Brendan wrote:
In this case, the fastest way is probably "mov eax,[var]; add [var],eax", especially if there's other code you can place in between these instructions (so the CPU has something to do while waiting for the fetch from cache/RAM).
On some CPUs (e.g. early Pentium 4/NetBurst) using several ADD instructions can be faster than using one instruction - e.g. "add eax,eax; add eax,eax" can be faster than "shl eax,2" or "lea eax,[eax*4]" or "imul eax,4".

Aah, the wonders of CPUs without barrel shifters. I doubt the two instructions together would be faster if they immediately followed each other, however - in fact, I imagine they would be much slower, considering the ridiculous length of NetBurst's pipeline.

Edit: I've just realised the irony: NetBurst does the exact opposite of what this thread was about.
Re: multiplication to shift internally?
From what I've gathered, optimising for the Pentium 4 generally has negative effects on all other processors. I had a run at the cycle sheets to get some concrete details on the mentioned case. Might be interesting:

Consider the shift/lea/mul/add-add methods of computing reg * 4:

On a 486 / Pentium 1 / Athlon, a shift takes one cycle. On NetBurst, it takes 4.
On a 486 / Pentium 1, an LEA takes one cycle; on an Athlon, a "complex" LEA takes two cycles. The Intel document does not give exact timings, but suggests that equivalent adds are faster given enough decoder space (which implies that it would be > 2 cycles).
An imul has a 4-cycle latency on an Athlon (and post-NetBurst CPUs), 10 on NetBurst, and something similarly awful on old processors. (Why again was everybody buying Athlon XPs at that time?)
A sequence of adds to do a multiplication by four would take 2 clocks on all processors except NetBurst, which does that in one cycle (two half-clock operations).

Summarized:
P1/486: depending on the situation, use lea (take care of AGIs) or shifts (take care of the u-v schedule)
Athlon: always use shifts
NetBurst: use adds and expect execution times on all other platforms to double.

The other conclusion: there is no conversion.
- Troy Martin
- Member
- Posts: 1686
- Joined: Fri Apr 18, 2008 4:40 pm
- Location: Langley, Vancouver, BC, Canada
- Contact:
Re: multiplication to shift internally?

Combuster wrote:
P1/486: depending on the situation, use lea (take care of AGIs) or shifts (take care of the u-v schedule)
Athlon: always use shifts
NetBurst: use adds and expect execution times on all other platforms to double.

The nth Law of Optimization: if it's efficient somewhere, it's slow as molasses everywhere else.