Combuster wrote:Owen wrote:BETTER. Trivial loop exit conditions are uninteresting and, well, trivial. Intel have been correctly predicting them for 15 years.
In fact your code is worse, because it wastes two registers and two instructions per cycle.
And you think it's fair to simply strip silicon somewhere and not give it back at an equal rate in another part of the machine?
A loop predictor does not require anywhere near the silicon needed to add an additional execution unit or an additional retirement unit.
Combuster wrote:Now that we've got enough background, let's take a look at Intel's own description of the loop detector:
The Loop Detector analyzes branches to see if they have loop behavior. Loop behavior is defined as moving in one direction (taken or not-taken) a fixed number of times interspersed with a single movement in the opposite direction. When such a branch is detected, a set of counters are allocated in the predictor such that the behavior of the program can be predicted completely accurately for larger iteration counts than typically captured by global or local history predictors.
Which means that it only works if the loop always has the same number of iterations, and that it always mispredicts the exit the first time through.
"Because the Pentium M did it that way, all Intel processors do it that way"?
Nehalem and above implement macro-op fusion, in which "dec reg; jnz backwards" and "cmp reg, const; jnz backwards" pairs can be fused into a single micro-op. Especially in the former case, tell me why it would be difficult for the processor to predict that correctly every time? It would surprise me entirely if Intel weren't predicting it correctly.
Now: please tell me how you would beat branch-history predictors while using equal silicon area. Remember, if you spend register or memory bandwidth to do so, the predictor wins, because those resources are hard to expand (every increase in them consumes a large amount of silicon area) compared to the small dedicated resources predictors take.