embryo wrote:Owen wrote:The compiler has no knowledge of branch directions that the program takes
The processor also doesn't know until it's too late; it just executes instructions speculatively and throws away all the speculative work once the branch outcome is calculated.
The processor has branch predictors which tell it which branch is
likely. Obviously, these are right more often than wrong. It speculatively executes the likely branch, and most of the time that is the right call.
Even if you do profile-guided optimization (i.e. the compiler looks at a previous run and encodes branch hints so the processor can choose the most likely direction, based on what happened in the profiling run), that can't adequately deal with cases where the situation on the user's machine is different (e.g. they're doing something different with the same code).
Accurate branch prediction is really important. Two decades ago a significant performance gain could be had just by getting the branch predictor to predict a loop's exit condition correctly (note that a pure history-based predictor will get this wrong every time). It's only gotten more exacting since.
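To make that concrete, here is a minimal C sketch (my own, not from the thread; the function, the likely() macro and the data layout are all made up for illustration) of the kind of hint PGO can bake in, using GCC/Clang's __builtin_expect. The hint is only as good as the profiling run it came from, and the loop's exit branch is exactly the case a pure history-based predictor keeps getting wrong:

#include <stddef.h>

/* Static branch hints of the sort PGO would emit; GCC/Clang builtin. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Sum the entries flagged as valid. If the profiling run saw mostly valid
 * entries, the hint below helps; if the user's data is mostly invalid, the
 * baked-in hint points the wrong way, and only the hardware predictor,
 * watching the actual run, can recover. */
long sum_valid(const int *data, const unsigned char *valid, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (likely(valid[i]))
            sum += data[i];
    }
    /* The loop's backward branch is taken n times and falls through once at
     * the end - that fall-through is what a pure history-based predictor
     * mispredicts every time. */
    return sum;
}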
embryo wrote:Owen wrote:no knowledge of the memory bus constraints of various systems
There are the PCI standard and the processor manufacturer's electrical constraints, and nothing more. The processor knows just two things - what its manufacturer writes in the specification and the same PCI standard. Why can't the compiler know such information?
I don't know about your system, but in mine the RAM is not connected by PCI. It's probably DDR3 - and in both of our machines likely different speeds of DDR3. You might have a lot of money and hence an Intel Sandy Bridge/Ivy Bridge E series processor with quad-channel memory for incredible bandwidth.
My compiler can't know this. It can't know the latency of the memory. It can't know the current cache pressure. It can't know the current contention between the various devices vying for the bus - the processor core doesn't exist in isolation, it sits on a chip probably with a number of other cores, plus an on-chip or external GPU, alongside a bunch of hardware devices which are all competing for access to memory.
It's a dynamic situation. The exact amount of memory bandwidth available changes from microsecond to microsecond. The latency of a memory access varies from microarchitecture to microarchitecture and from individual machine to individual machine.
The compiler can't know. So the processor works around it - out of order execution and all.
embryo wrote:But the compiler knows the program structure. It knows, for example, the location of a branch and can infer limits for both outcomes of the branch, then it just caches both code fragments to help speed things up. And the processor has no need to speculate, because the compiler can feed it the work the compiler knows for certain should be done. And it can have such knowledge because the program structure is available.
Owen wrote:You believe in the same delusions that sank the Itanium. Everyone in the know who was involved in the design process saw the disaster unfolding there and fled.
The Itanium has its market share. Maybe bad design has significantly reduced that share, but it's not a problem with the architecture, just with the design. There was no efficient compiler to support the architecture, and maybe the processor internals had design problems. And there was no market for it. No market and no good compiler - the result is a disaster. But what if there were a good compiler? And a market? That is the case for the Java server applications market: it is very big, and a processor with a good compiler, supported by a Java Operating System, could perform very well even with relatively small investments allocated.
Owen wrote:Itanium depended upon the compiler. It stunk (The compiler didn't have the information)
The information, or a bad design? It was the first thing of its class from Intel, which had never done a real processor architecture change before.
Itanium was a joint project by Intel (x86, of course, but also i860, i960, StrongARM and XScale), HP (HPPA) and Compaq (who had purchased DEC, who had developed VAX and Alpha). The expertise was all there.
But the architecture depends upon the compiler to deal with the fact that it does nothing out of order, and the compiler needs to specify exactly which instructions can be executed in parallel. The thing is, the compiler can't predict memory latencies, so it can't get the scheduling perfect, and because the processor does nothing out of order it can't paper over the fact that the compiler has to be conservative at the boundaries of what it can see.
For example, take a call to a virtual method (i.e. every non-final method in Java, since that seems to be your language of choice). How can the compiler know which implementation of that method it is landing in? It can't, so it has to be conservative and assume that any object it has a handle to may be modified by said method, unless it can prove otherwise (that is, the object was created in this method and never passed to anything whose code it can't see). This creates lots of memory loads and stores that are unnecessary at run time, and the processor has to deal with that (out-of-order execution saves the day).
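The same effect is easy to sketch in C (my own example, with hypothetical names): replace the virtual call with any call whose body the compiler cannot see, and the conservative reloads appear.

/* mystery() is defined in some other translation unit the compiler can't
 * see into - the same position the compiler is in at a virtual call site. */
void mystery(void);

int count_twice(int *counter)
{
    int a = *counter;   /* first load */
    mystery();          /* opaque call: *counter might have been modified */
    int b = *counter;   /* must reload; the compiler can't reuse 'a' */
    return a + b;
}

At run time mystery() usually doesn't touch *counter, so the second load is usually redundant - but the compiler can't prove that, and it's the out-of-order core that hides the cost of the extra traffic.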
embryo wrote:But with C code it is really hard to create a good compiler. Then there should be another target for compilation - Java, for example. There are no pointers in Java. That prevents the programmer from using many hacks that hide required information from the compiler.
Any compiler developer will tell you that pointers aren't the real problem; C99 and above place enough constraints on their use that optimizing with them around is very possible.
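For instance (a sketch of my own, not part of the thread), C99's restrict is exactly such a constraint: the programmer promises the pointers don't alias, and the optimizer gets its freedom back.

#include <stddef.h>

/* 'restrict' promises dst and src never overlap, so the compiler may keep
 * src values in registers, reorder the accesses and vectorize the loop
 * without fearing that a store to dst[i] changes some later src[j]. */
void scale(float *restrict dst, const float *restrict src, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}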
embryo wrote:Owen wrote:Never mind that in times of increasingly constrained memory bandwidth one cannot afford to have 33% of your instruction fetch bandwidth wasted by NOPs as Itanium does.
It is not the Itanium that wastes the bandwidth, but the compiler. It is not efficient. Why should the Itanium be blamed?
Because it's not the compiler's fault. For most code (i.e. everything outside the core loops of numeric code - and I'll admit the Itanium shines there... but most of the time so does the GPU, and where it doesn't you might find a modern x86 cheaper and almost as fast), people can't do better even with hand-written assembly.
The architecture is fundamentally flawed.
embryo wrote:Owen wrote:VLIW is even worse because every compiled binary becomes invalid (or highly inefficient) whenever you increase the issue width
If the processor changes then there has to be a change in the software, yes. And the C-like approach really does make permanent recompilation a very important issue. But there is another approach, where the recompilation is limited to a few components only. It is the Java Operating System approach. And having such an alternative, I can ask - why should we throw away a really good technology like VLIW? It manages to have a market share even with the inherent C compiler problems. What could it deliver if there were no such problems?
Because VLIW still grossly wastes memory bandwidth, which only gets worse with greater instruction word widths (and anyway, VLIW is of no help at all for the majority of control-flow-oriented code).
embryo wrote:Owen wrote:VLIW is great, if and only if you are designing an embedded processor which is never expected to exhibit binary backwards compatibility.
The limited recompilation mentioned above solves the backward-compatibility problem. Then I can declare a win for VLIW.
You could, if compilers didn't universally suck at a lot of problems. They really suck at vectorizing register-intensive maths code, such as that found at the core of every video codec ever. When pushed, they really struggle at register allocation, causing unnecessary loads and/or stores which just complicate matters.
These are problems we can never fix completely: optimal register allocation for arbitrary code is NP-complete, and most people would like their compilation to complete this millennium. Vectorization is highly complex (though I don't know of any formal analysis of its complexity, I wouldn't expect it to be better than complexity class P, and suspect it is probably NP).
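A small illustration of why (my own sketch, assuming a compiler with strict IEEE floating-point semantics by default): even a trivial reduction defeats the auto-vectorizer, because vectorizing it means reassociating the additions, which a conforming compiler may not do unless told otherwise (e.g. with -ffast-math). A human writing intrinsics or assembly simply decides the reordering is acceptable and does it.

#include <stddef.h>

/* A plain dot product. Vectorizing it requires several partial sums that
 * are combined at the end, which reassociates the floating-point additions
 * and changes rounding - so most compilers leave this loop scalar at
 * default settings, while a human would happily hand-vectorize it. */
float dot(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}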
Compilers must make approximations. Never underestimate the processing power of the human brain.
It may take a human 24 hours of thought and work to optimize an algorithm for a processor, but that only needs to be done once. Meanwhile, people tend to frown if a compiler spends 2 hours doing the same thing on every compile, and I think they'd complain if a JIT compiler took more than 0.1s...
embryo wrote:Rusky wrote:They also intercept potentially-aliasing stores, so loads can be hoisted as far as possible (this is an example of why the processor beats the compiler)
Why can't a compiler take care of potentially-aliasing stores? It knows the program structure and every variable in it, so why can't it detect aliases?
I addressed this above in my comments regarding virtual methods.