Assembly: code optimization

Combuster · Post by **Combuster** » Wed Mar 23, 2011 4:44 pm

which means 5 cycles until retired on a 486 (4x one issue plus one potential AGI stall) and 3 on a Pentium 1 (2x2 issued, same stalling potential), assuming the instructions are in cache. The only chip that might do better is a P4, whose 2x2 half-clock issues could do the calculation in a single clock.
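The arithmetic behind those figures can be sketched as a toy model: issue slots needed, plus any stall cycles. This is only an illustration of the counting, not a cycle-accurate simulator. The function name `cycles_until_retired`, the instruction count of four (read off "4x one issue"), and treating the P4's 2x2 half-clock issues as four issue slots per clock with no stall are all assumptions made for the example.

```python
from math import ceil

def cycles_until_retired(n_instructions, issue_width, stall_cycles=0):
    """Toy in-order estimate: clocks spent issuing, plus fixed stall clocks.

    Hypothetical helper for illustration; it ignores cache misses,
    decode limits and everything else a real chip does.
    """
    return ceil(n_instructions / issue_width) + stall_cycles

N = 4  # assumed instruction count, taken from "4x one issue"

estimates = {
    # 486: one instruction per clock, one potential AGI stall
    "486": cycles_until_retired(N, issue_width=1, stall_cycles=1),
    # Pentium: two pipes, so 2 clocks of issue, same AGI stall
    "Pentium": cycles_until_retired(N, issue_width=2, stall_cycles=1),
    # P4: assumed 4 simple ops per clock (2 ALUs x 2 half-clocks), no stall
    "P4": cycles_until_retired(N, issue_width=4, stall_cycles=0),
}

for chip, clocks in estimates.items():
    print(f"{chip}: {clocks}")
```

With these assumptions the model reproduces the 5, 3 and 1 clock figures quoted above.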

I'm not sure about the number given for 386s. As far as I can tell, you probably need two memory cycles (or one, with 25% probability) to get the code into the prefetch queue, thanks to the single-byte opcodes. That is then followed by, or overlapped with, the issuing itself, which means 5+ clocks in the case of maximum parallelisation.
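One way to read the "25% probability" figure is as an alignment argument. The sketch below assumes the code is four single-byte instructions and that the 386 prefetcher pulls in aligned 4-byte blocks; both are assumptions for illustration, and `fetches_needed` is a hypothetical helper, not anything from the thread.

```python
def fetches_needed(start_offset, n_bytes, fetch_width=4):
    """Count the aligned fetch_width-byte blocks touched by a byte range."""
    first_block = start_offset // fetch_width
    last_block = (start_offset + n_bytes - 1) // fetch_width
    return last_block - first_block + 1

CODE_BYTES = 4  # assumed: four single-byte opcodes

# Try every possible alignment of the first opcode within a 4-byte block.
counts = [fetches_needed(offset, CODE_BYTES) for offset in range(4)]

# Share of alignments where a single memory cycle covers all the code.
single_fetch_share = counts.count(1) / len(counts)

print(counts)
print(single_fetch_share)
```

Only the alignment with offset 0 fits in one fetch, so under these assumptions one start address in four needs a single memory cycle and the other three need two.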

JamesM · Post by **JamesM** » Wed Mar 23, 2011 4:59 pm

Combuster wrote: which means 5 cycles until retired on a 486 (4x one issue plus one potential AGI stall) and 3 on a Pentium 1 (2x2 issued, same stalling potential), assuming the instructions are in cache. The only chip that might do better is a P4, whose 2x2 half-clock issues could do the calculation in a single clock.

I'm not sure about the number given for 386s. As far as I can tell, you probably need two memory cycles (or one, with 25% probability) to get the code into the prefetch queue, thanks to the single-byte opcodes. That is then followed by, or overlapped with, the issuing itself, which means 5+ clocks in the case of maximum parallelisation.

I don't really want to argue, as your numbers are far more quantitative than anything I could provide, but doesn't the P4 have a >30-stage pipeline?

Owen · Post by **Owen** » Fri Mar 25, 2011 4:15 am

JamesM wrote: Firstly, i7 has an issue width of 3, as far as I know.

Secondly, it'll take at least n clock ticks to complete, where n is the length of the i7's pipeline.

I was under the impression that i7 had a width of 4. Maybe my memory is wrong.

In any case, when I say an instruction takes n cycles, I mean that it has n cycles of latency: the cycles which must elapse between the instruction taking its input data and producing a result that a following instruction can consume. Pipeline length is an issue for mispredicted branches, not for individual instructions.
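A rough sketch of that distinction, under simplified assumptions: in an idealised pipeline, a chain of dependent instructions costs the sum of their latencies, while the pipeline depth is paid once to fill the pipe rather than once per instruction. The function names, the 30-stage depth and the latency of 1 are illustrative values, not measurements of any real chip.

```python
from math import ceil

def dependent_chain_cycles(n_instructions, latency):
    """Each instruction waits for the previous result, so latencies add up."""
    return n_instructions * latency

def pipelined_stream_cycles(n_instructions, depth, issue_width=1):
    """Independent instructions: fill the pipeline once, then retire at issue rate."""
    return depth + ceil(n_instructions / issue_width) - 1

DEPTH = 30  # assumed pipeline depth, for illustration only

chain = dependent_chain_cycles(4, latency=1)      # 4 dependent ops, 1 cycle each
stream = pipelined_stream_cycles(1000, DEPTH)     # 1000 independent ops
per_instruction = stream / 1000                   # average cost per instruction

print(chain, stream, per_instruction)
```

In this model the 30-stage depth adds 29 clocks to the whole stream, so the average cost per instruction stays close to 1 clock instead of 30.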

Combuster · Post by **Combuster** » Fri Mar 25, 2011 10:18 am

Which is why I included AGI stalls in the 486/586 calculation (because that's the worst-case latency until the next instruction can use the calculated value). There are no jumps, so pipeline flushes are not part of the scenario, and even then a jump does not normally introduce latency for the entire pipeline's length. I don't know the uop dispatch latencies for the 686 or NetBurst, nor the exact port configuration for K6+, but just executing the code and having the results ready will take 3-4ish clocks on average on the current multi-issue chip architectures.

OSDev.org
