Assembly: code optimization

Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance

Re: Assembly: code optimization

Post by Combuster »

which means 5 cycles until retired on a 486 (4x single issue plus one potential AGI stall) and 3 on a Pentium 1 (2x2 issued, same stalling potential), assuming the instructions are in cache. The only chip that might do better is a P4, whose 2x2 half-clock issues could do the calculation in a single clock.
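
As a rough sanity check, the arithmetic above can be sketched as "cycles to issue, plus worst-case stall cycles". This is only a back-of-envelope model, not a simulation of any of these chips. The instruction count of 4 is an assumption read off the "4x" and "2x2" figures, the helper name `issue_cycles` is hypothetical, and the P4 is modelled as four issues per full clock (two per half clock) with no stall counted.

```python
import math

def issue_cycles(n_instructions, issue_width, stall_cycles=0):
    """Cycles needed to issue everything, plus any worst-case stall cycles."""
    return math.ceil(n_instructions / issue_width) + stall_cycles

N = 4  # assumed instruction count, taken from "4x" / "2x2" in the post

i486 = issue_cycles(N, issue_width=1, stall_cycles=1)     # 4 issue slots + 1 AGI stall
pentium = issue_cycles(N, issue_width=2, stall_cycles=1)  # 2 paired issues + 1 AGI stall
p4 = issue_cycles(N, issue_width=4)                       # 2 per half clock, no stall counted

print(i486, pentium, p4)  # 5 3 1
```

The model ignores cache misses and prefetch, which is exactly the "assuming the instructions are in cache" caveat.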

I'm not sure about the number given for 386s. As far as I can tell, you probably have two memory cycles (or one, with 25% probability) into the prefetch queue, thanks to single-byte opcodes. That is then followed by or overlapped with the issuing itself, which means 5+ clocks in case of maximum parallelisation.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
JamesM
Member
Posts: 2935
Joined: Tue Jul 10, 2007 5:27 am
Location: York, United Kingdom

Post by JamesM »

Combuster wrote: which means 5 cycles until retired on a 486 (4x single issue plus one potential AGI stall) and 3 on a Pentium 1 (2x2 issued, same stalling potential), assuming the instructions are in cache. The only chip that might do better is a P4, whose 2x2 half-clock issues could do the calculation in a single clock.

I'm not sure about the number given for 386s. As far as I can tell, you probably have two memory cycles (or one, with 25% probability) into the prefetch queue, thanks to single-byte opcodes. That is then followed by or overlapped with the issuing itself, which means 5+ clocks in case of maximum parallelisation.

Don't really want to argue, as your numbers seem much more quantitative than I could provide, but doesn't the P4 have a >30 stage pipeline?
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Post by Owen »

JamesM wrote: Firstly, i7 has an issue width of 3, as far as I know.

Secondly, it'll take at least n clock ticks to complete where n is the length of the i7's pipeline.

I was under the impression that i7 had a width of 4. Maybe my memory is wrong.

In any case, when I say an instruction takes n cycles, I mean that it has n cycles of latency: the cycles which must elapse between the instruction taking its input data and producing a result that a following instruction can consume. Pipeline length is an issue for mispredicted branches, not for individual instructions.
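
To make that definition of latency concrete, here is a minimal sketch of a dependency-chain model. It assumes unlimited issue width and no branches, and the instruction names and latency values are made up for illustration; `ready_times` is a hypothetical helper, not a model of any real chip. Note that pipeline depth does not appear anywhere in the calculation.

```python
def ready_times(instrs):
    """instrs: list of (name, latency_cycles, input_names) in program order.

    Returns the cycle at which each result can be consumed, assuming an
    instruction starts as soon as all of its inputs are ready.
    """
    ready = {}
    for name, latency, inputs in instrs:
        start = max((ready[i] for i in inputs), default=0)
        ready[name] = start + latency
    return ready

# Hypothetical instruction stream: a -> b -> c is a dependent chain, d is independent.
chain = [
    ("a", 1, []),
    ("b", 1, ["a"]),
    ("c", 3, ["b"]),  # a made-up multi-cycle operation
    ("d", 1, []),
]

times = ready_times(chain)
print(times)  # {'a': 1, 'b': 2, 'c': 5, 'd': 1}
```

The dependent chain costs the sum of its latencies (1 + 1 + 3 = 5), while the independent instruction is ready after its own latency alone.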
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance

Post by Combuster »

Which is why I included AGI stalls in the 486/586 calculation (because that's the worst latency until the next instruction can use the calculated value). There are no jumps, so pipeline flushes are not part of the scenario, and even then a jump does not normally introduce latency for the entire pipeline's length. I don't know the uop dispatch latencies for 686 or NetBurst, nor the exact port configuration for K6+, but just executing the code and having the results ready will take 3-4ish clocks on average on the current multi-issue chip architectures.