Re: The Mill: a new low-power, high-performance CPU design
Posted: Mon Mar 03, 2014 7:12 am
Owen wrote:
The processor has branch predictors which tell it which branch is likely

embryo wrote:
This and the following arguments are based on the same thing: the processor has runtime information. But how does runtime information make the processor work faster? It is all about algorithms. The processor has microprograms with those algorithms and runs some of them when that seems helpful. Now recall what the compiler is. It is also an entity with algorithms, and with much better algorithms. The compiler also has a lot of static information.

The processor updates the branch predictors every branch. It consults them every branch. When they're empty, it falls back upon developer/compiler-provided branch hints (backwards -> probably taken, forwards -> probably not).
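To be concrete about that fallback heuristic, here is a toy sketch in Java (since that seems to be the language of choice here; the names are mine, not any real predictor's):

Code:
// Static fallback when the predictor tables are empty: a backward
// branch is probably a loop, so predict taken; forward, not taken.
final class StaticHint {
    static boolean predictTaken(long branchAddr, long targetAddr) {
        return targetAddr < branchAddr; // backward -> probably taken
    }
}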
embryo wrote:
Now we can see the picture: just runtime information and some weak algorithms against a lot of static information and very powerful algorithms. And we should add one more thing to the picture: the compiler can inject those weak algorithms at compilation time and feed the processor locally optimized actions depending on the content of the current code section. In simple words, the compiler just replaces the internal microprograms with compiler-driven ones. The compiler also takes care of placing the injected algorithms in the right cache, ensuring that the processor will have everything it requires.
Consider:

Code:
void f(A a) {
    if (a.foo()) a.bar(); else a.baz();
}

Which branch should the compiler optimize for? It cannot know what a.foo() will return without running the code.
The processor, meanwhile, will get it wrong the first time, then have a weak prediction, which will strengthen with each invocation.
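That "weak, then strengthening" behaviour is just a saturating counter. A minimal sketch of the classic 2-bit scheme (the real thing is a hardwired table indexed by branch address, not software):

Code:
// 2-bit saturating counter: states 0-1 predict not-taken, 2-3 taken.
// One misprediction only weakens a strong state; it takes two to flip it.
final class TwoBitCounter {
    private int state = 1; // start weakly not-taken

    boolean predictTaken() { return state >= 2; }

    void update(boolean taken) {
        state = taken ? Math.min(3, state + 1) : Math.max(0, state - 1);
    }
}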
embryo wrote:
In the end, what is hardware? It's just dumb silicon. And where does it get the power to work fast? It is provided with algorithms by some external entity. And if that entity is the compiler, what is the problem with that?

Because the only way to learn what the code will actually do is to execute the code. Yes, the compiler can instrument the code, run it, and then recompile it once it knows better what the code does (pretty much every JIT does that!), but this requires the compilation process to be fast (thereby limiting the optimization possibilities), and often fast code requires removing said instrumentation hooks (because they pollute caches, etc.).
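For illustration, that instrument-then-recompile loop amounts to something like this (a hypothetical sketch, not HotSpot's actual mechanism; the record() call is exactly the hook that pollutes caches until the optimized build drops it):

Code:
// Hypothetical per-branch profile a baseline JIT might maintain.
final class BranchProfile {
    private long taken, notTaken;

    void record(boolean wasTaken) {       // the instrumentation hook
        if (wasTaken) taken++; else notTaken++;
    }

    boolean readyToRecompile() {
        long total = taken + notTaken;
        // Enough samples and a clear bias: recompile and drop the hook.
        return total > 10_000 && Math.max(taken, notTaken) > 0.9 * total;
    }
}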
embryo wrote:
Once more: the compiler has all information except runtime. The runtime behaviour of a processor is defined by externally provided algorithms. The compiler can provide the processor with ANY algorithm it decides is suitable for a particular task. Can you imagine a processor with all possible algorithms? I can imagine such a compiler, but not such a processor. At the very least, it takes a lot less time to update a compiler than to update a processor design and redistribute new chips across the world.

The processor, meanwhile, can determine these things with startlingly little resource use, using dedicated resources. The best algorithms in the world can't save you if you don't have the information, or worse, if your information is wrong.
embryo wrote:
The compiler wins, no question. Isn't it obvious?

It was obvious to the Itanium designers, and also completely wrong.
Owen wrote:
My compiler can't know this. It can't know the latency of the memory. It can't know the current cache pressure. It can't know the current contention between the various devices vying for the bus.

embryo wrote:
It is all just a mix of specification information availability and runtime data processing. The first is the same for the compiler and for the processor; the second is algorithm driven and, in the processor's case, leads to a monstrous algorithm storage within the processor. Why can't we eliminate such storage and allow the compiler to provide a suitable algorithm?

Because the compiler doesn't have a silicon fab at runtime?
Owen wrote:
The latency of said memory access varies from microarchitecture to microarchitecture

embryo wrote:
It's just another point to consider when we think about how fast a system would be with all the chip area allocated to logic units instead of a big (and, given compiler-provided algorithms, useless) microprogram storage.

You seem to think these are "microprograms", executed serially by the processor core. They aren't. They're hardwired, and they operate in parallel with the rest of the execution core. The x86 microprogram store is less than 1% of the processor. Worse, it has nothing to do with any of this, and everything to do with maintaining backwards compatibility with existing legacy code and implementing complex instructions such as far calls and privilege transfers.
Owen wrote:
Itanium was a joint project by Intel (x86, of course, but also i860, i960, StrongARM and XScale), HP (HPPA) and Compaq (who had purchased DEC, who had developed VAX and Alpha). The expertise was all there.

embryo wrote:
Corporate internals can look worse when viewed from inside. Maybe there were communication issues, maybe bad management, maybe the vision was too simplistic.

Owen wrote:
For example, a function call to a virtual method (i.e. every non final method in Java, since that seems to be your language of choice). How can the compiler know what implementation of that method it is landing in?

embryo wrote:
Actually in Java it is also non-private, non-static and non-special methods (constructors and static initializers). But to the question: the processor must wait until the actual object reference (address) is available, and once that happens it can decide on the actual method address. And what does the processor know about the method address? Only that the address should exist, nothing more. But the compiler always knows all the data structures responsible for the address lookup, and with unimaginable ease (compared to the processor's faulty attempts) it can order the system to cache the actual address information for a few successors of a base class. So when the processor executes a jump to the actual method, all the addresses will already be in its register file. The only additional operation required before the jump is an index calculation to select a particular register. Again we see that the compiler beats any processor with ease.

Let's look at a traditional call through a vtable:
Code:
// Read the vtable pointer from the object
ldr lr, [r0]
// Get the method pointer from the vtable
ldr lr, [lr, #8]
// Do some calculations of the method parameters
// ...
blx lr
The processor, meanwhile, can start fetching the body of that function as soon as it has calculated the value of lr from instruction 2 in the above sequence. It might see, for example, that we just did a store to a member and that we are about to do a load from that member, and therefore forward the store to the load instruction without needing to go through memory (~2 cycles for L1 cache, ~10 for L2, ~50 for L3, ~1000 for RAM).
embryo wrote:
It should be mentioned that there are cases of successors of a root class (Object) where the number of successors can be measured in the thousands, but even in such situations the compiler can just look at the code a bit before the virtual call and find the actual object's type there, as in the example:

Code:
Object var = new String();
var.hashCode(); // hashCode is Object's method

It means the compiler wins again.

Any compiler that isn't pants-on-head retarded can figure that one out. It's a trivial case.
It's the non-trivial cases which matter, and no compiler has the time in the world to figure those out.
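For contrast with the String example, here is the kind of call site I mean, where the receiver type only exists at runtime (a sketch of my own, not code from the thread):

Code:
// No static analysis can devirtualize this: which hashCode() runs
// depends on whatever objects the caller put in the list at runtime.
static int hashAll(java.util.List<Object> mixed) {
    int h = 0;
    for (Object o : mixed)   // a String, an Integer, a user type...
        h ^= o.hashCode();   // megamorphic virtual call
    return h;
}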
Owen wrote:
Any compiler developer will tell you that pointers aren't the real problem

embryo wrote:
If we have a pointer to some structure, how can we be sure there is no access to the same address from another thread?

Because that would be undefined behavior and the compiler can assume undefined behavior never happens?
(N.B. Java has no undefined behavior. This is one of the reasons Java compilers can struggle to optimize.)
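A concrete instance of that trade-off (my own example): in C an out-of-bounds access is undefined behavior, so the optimizer may assume it never happens; in Java the same access must throw, so the JIT can only drop the per-iteration bounds check when it proves the index is always in range:

Code:
// Unless the JIT can prove 0 <= i < a.length for every iteration,
// each access keeps a bounds check so that an
// ArrayIndexOutOfBoundsException is thrown at exactly the right point.
static long sum(int[] a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)  // n is not provably <= a.length
        s += a[i];
    return s;
}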
embryo wrote:
Next we use pointer arithmetic. What structure does the pointer address now? How do we deal with it? What do we cache? It is an inherent problem for unmanaged languages. And there are a lot of such problems. And all problems of this kind are simply eliminated in Java.

Complicated pointer arithmetic is complicated. Guess what? The solutions to the inability to do this in managed languages are often equally complicated.
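One example of "equally complicated" (the field layout and names are my own invention): the standard Java workaround for having no structs or pointer arithmetic is to do the offset arithmetic by hand in a ByteBuffer, which is precisely the kind of code an optimizer struggles to see through:

Code:
import java.nio.ByteBuffer;

// A hand-rolled "struct { int x; int y; }" array: the pointer
// arithmetic didn't disappear, it moved into index calculations.
final class PointBuffer {
    private static final int STRIDE = 8, X_OFF = 0, Y_OFF = 4;
    private final ByteBuffer buf;

    PointBuffer(int count) { buf = ByteBuffer.allocate(count * STRIDE); }

    int  x(int i)           { return buf.getInt(i * STRIDE + X_OFF); }
    void setY(int i, int v) { buf.putInt(i * STRIDE + Y_OFF, v); }
}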
Owen wrote:
If compilers didn't universally suck at a lot of problems. They really suck at vectorizing register intensive maths code

embryo wrote:
Can you provide an example of a compiler fault here?

Because developers don't like it when you tell them that their code isn't going to compile because the optimizer is allergic to it?

Owen wrote:
optimal register allocation for arbitrary problems is only possible in NP time

embryo wrote:
Why should we limit ourselves to an arbitrary case only? We have all the information required to remove all the uncertainty. So we should just use that information.

Because the cases for which register allocation is an NP problem are all the interesting, i.e. non-trivial, ones?
Owen wrote:
people tend to frown if a compiler spends 2 hours doing the same thing every compile

embryo wrote:
Yes, it is an issue. But for the Java Operating System there should be just a few full compiles per month - is that a problem? It is not a consumer toy, and there are no children waiting for the game to react quickly.

I see very few situations where "a few compiles a month" is feasible.