which means 5 cycles until retired on a 486 (4x one issue and one potential AGI stall) and 3 on a Pentium 1 (2x2 issued, same stalling potential), assuming the instructions are in cache. The only chip that might do better is a P4, whose 2x2 half-clock issues could do the calculation in a single clock.
I'm not sure about the number given for 386s. As far as I can tell, you probably have two memory cycles (or one, with 25% probability) into the prefetch queue, thanks to single-byte opcodes. That is then followed by, or overlapped with, the issuing itself, which means 5+ clocks in the case of maximum parallelisation.
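The issue arithmetic above can be written out as a rough sketch. It assumes a sequence of four instructions (inferred from "4x one issue" and "2x2 issued"), treats the AGI stall as one extra cycle, and ignores cache misses and prefetch effects. The function name `issue_cycles` is made up for illustration.

```python
from math import ceil

def issue_cycles(n_instructions, issue_width, stall_cycles=0):
    """Clocks needed to issue n instructions at issue_width per clock,
    plus a fixed number of stall cycles (toy model, not a simulator)."""
    return ceil(n_instructions / issue_width) + stall_cycles

# Assumption: four instructions and at most one AGI stall, as in the post.
cycles_486 = issue_cycles(4, 1, stall_cycles=1)      # 4 x 1 issue + 1 stall
cycles_pentium = issue_cycles(4, 2, stall_cycles=1)  # 2 x 2 issue + 1 stall
# P4: two issues per half clock, modelled here as four per full clock, no stall.
cycles_p4 = issue_cycles(4, 4)

print(cycles_486, cycles_pentium, cycles_p4)  # prints: 5 3 1
```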
Assembly: code optimization
Re: Assembly: code optimization
Combuster wrote:
which means 5 cycles until retired on a 486 (4x one issue and one potential AGI stall) and 3 on a pentium 1 (2x2 issued, same stalling potential), assuming the instructions are in cache. The only chip that might do better is a P4 with 2x2 half-clock issues which could do the calculation in a single clock.
I'm not sure about the number given for 386s, as far as I can tell you probably have two memory cycles (or one with 25% probability) into the prefetch queue, thanks to single-byte opcodes. That is then followed or overlapped with the issuing itself, which means 5+ clocks in case of maximum parallelisation.

Don't really want to argue as your numbers seem much more quantitative than I could provide, but doesn't the P4 have a >30 stage pipeline?
- Owen
Re: Assembly: code optimization
JamesM wrote:
Firstly, i7 has an issue width of 3, as far as I know.
Secondly, it'll take at least n clock ticks to complete where n is the length of the i7's pipeline.

I was under the impression that i7 had a width of 4. Maybe my memory is wrong.
In any case, when I say an instruction takes n cycles, what I'm saying is that an instruction has n cycles of latency, or: cycles which must elapse between the instruction taking its input data and producing a result which can be consumed by a following instruction. Pipeline length is an issue for mispredicted branches, not for individual instructions.
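To make the latency-versus-pipeline-length distinction concrete, here is a toy model, not a description of any real chip. It assumes a fully dependent chain in which each instruction waits for the previous result, and it charges a flush penalty only on a mispredicted branch. The name `dependent_chain_cycles` and the penalty of 20 cycles are hypothetical.

```python
def dependent_chain_cycles(latencies, mispredicts=0, flush_penalty=0):
    """Cycles from the first instruction reading its inputs until the last
    result of a fully dependent chain can be consumed.

    latencies:     per-instruction latency in cycles
    mispredicts:   number of mispredicted branches (assumed)
    flush_penalty: cycles lost per mispredict; this grows with pipeline
                   length, but the exact figure is chip-specific
    """
    return sum(latencies) + mispredicts * flush_penalty

# Straight-line code: only instruction latency matters, however deep
# the pipeline is.
no_branch = dependent_chain_cycles([1, 1, 1, 1])

# Same chain with one mispredicted branch and a hypothetical 20-cycle flush.
one_mispredict = dependent_chain_cycles([1, 1, 1, 1], mispredicts=1,
                                        flush_penalty=20)

print(no_branch, one_mispredict)  # prints: 4 24
```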
- Combuster
Re: Assembly: code optimization
Which is why I included AGI stalls in the 486/586 calculation: that's the worst-case latency until the next instruction can use the calculated value. There are no jumps, so pipeline flushes are not part of the scenario, and even then a jump does not normally introduce latency for the entire pipeline's length. I don't know the uop dispatch latencies for 686 or NetBurst, nor the exact port configuration for K6+, but just executing the code and having the results ready will take 3-4ish clocks on average on the current multi-issue chip architectures.