The Mill: a new low-power, high-performance CPU design

Owen · Post by **Owen** » Mon Mar 03, 2014 7:12 am

embryo wrote:
Owen wrote:The processor has branch predictors which tell it which branch is likely
This and following arguments are based on the same thing - the processor has runtime information. But how the runtime information makes the processor to work faster ? It is all about algorithms. The processor has microprograms with those algorithms and just runs some of them when it seems can be helpful. And now recall what is the compiler. It is also an entity with algorithms, and even much better algorithms. The compiler also has a lot of static information.

The processor updates the branch predictors every branch. It consults them every branch. When they're empty, it falls back upon developer/compiler provided branch hints (Backwards -> Probably taken, Forwards -> Probably not)

embryo wrote:Now we can see the picture - just runtime information and some weak algorithms against a lot of static information and very powerful algorithms. And we should add to the picture one more thing - the compiler can inject the weak algorithms at compilation time and feed the processor with locally optimized actions depending on the current code section content. In simple words the compiler just replaces internal microprograms with the compiler driven. Also the compiler takes care of placing the injected algorithms in a right cache ensuring that the processor will have everything required.

Code: Select all

void f(A a) {
    if(a.foo()) a.bar() else a.baz();
}

Different As are passed to this method from hundreds of locations in the program (In the fortunate case it can even see every different version of A! Note that it can't do this if you are dynamically loading classes). How is the compiler to know that 99% of the time a.foo() returns false?

The processor, meanwhile, will get it wrong the first time, then have a weak false prediction, which will strengthen with each invocation.

embryo wrote:In the end - what is a hardware? It's just a dumb silicon. And where it gets the power to work fast? It is provided with algorithms by some external entity. And if the entity will be a compiler - what will be the problem in such case?

Because the only way to learn what the code will actually do is to execute the code. Yes, the compiler can instrument the code, run it, and then recompile it once it knows better what the code does (Pretty much every JIT does that!), but this requires the compilation process to be fast (thereby limiting the optimization possibilities), and often fast code requires removing said instrumentation hooks (because they pollute caches, etc)

The processor, meanwhile, can determine these things with startlingly little resource use, using dedicated resources.

embryo wrote:Once more - the compiler has all information except runtime. The runtime behaviour of a processor is defined by externally provided algorithms. The compiler can provide the processor with ANY algorithm it decides is suitable for a particular task. Can you imagine a processor with all algorithms possible? I can imagine such compiler, but not processor. At least it takes a lot less time to update compiler instead of updating a processor design and redistributing new chips across the world.

The best algorithms in the world can't save you if you don't have the information, or worse, if your information is wrong.

embryo wrote:The compiler wins without questions. Isn't it obvious?

It was obvious to the Itanium designers, and also completely wrong

embryo wrote:
Owen wrote:My compiler can't know this. It can't know the latency of the memory. It can't know the current cache pressure. It can't know the current contention between the various devices vying for the bus.
It all is just a mix of specification information availability and runtime data processing. First is the same for compiler and for processor, the second is algorithm driven and in the processor case leads to the monstrous algorithm storage within a processor. Why we can't eliminate such storage and allow the compiler to provide a suitable algorithm?

Because the compiler doesn't have a silicon fab at runtime?

You seem to think these are "microprograms", executed serially by the processor core. They aren't. They're hardwired. They're operating in parallel to the rest of the execution core.

embryo wrote:
Owen wrote:The latency of said memory access varies from microarchitecture to microarchitecture
It's just another point to consider when we think about how fast will be a system with all the chip area allocated for logic units instead of a big (and useless in case of compiler provided algorithms) microprogram storage.

The x86 microprogram store is less than 1% of the processor. Worse, it has nothing to do with any of this, but everything to do with the maintaining backwards compatibility with existing legacy code and implementing complex instructions such as far calls and privilege transfers.

embryo wrote:
Owen wrote:Itanium was a joint project by Intel (x86, of course, but also i860, i960, StrongARM and XScale), HP (HPPA) and Compaq (who had purchased DEC, who had developed VAX and Alpha). The expertise was all there.
Corporate internals can look worse if viewed from inside. May be it was a communication issues, may be bad management, may be the vision was too simplistic.
Owen wrote:For example, a function call to a virtual method (i.e. every non final method in Java, since that seems to be your language of choice). How can the compiler know what implementation of that method it is landing in?
Actually in Java it is also non private, non static and non special methods (constructor and static initializer). But to the question - the processor must wait until the actual object reference (address) will be available and after it happens it can make decision about the actual method address. But what processor knows about method address? It only knows that the address should exist and nothing more. But the compiler always knows all the data structures responsible for address seeking and with unimaginable ease (comparing to the processor's faulty attempts) can order the system to cache data of a few successors of a base class with actual address information. And when the processor executes a jump to the actual method all addresses will be in it's register file. The only required additional operation before the jump is an index calculation to select a particular register. Again we see that compiler beats any processor with ease.

Lets look at a traditional call through a vtable:

Code: Select all

    // Read through vtable pointer
    ldr lr, [r0]
    // Get the method pointer
    ldr lr, [lr], #8
    // Do some calculations of the method parameters
    // ...
    blx lr

Now, the compiler has no idea what is behind that call (Unless it can see the entire program, which means recompiling the whole system every time new software is installed, which for me, with updates and all, is just about every day), so it has to follow some defined calling convention and make some assumptions (unless it can prove an object hasn't escaped, it must assume it will be modified by said function call)

The processor, meanwhile, can start fetching the body of that function as soon as it has calculated the value of lr from instruction 2 in the above sequence. It might see, for example, that we just did a store to a member and that we are about to do a load from said member, and therefore forward the store to the load instruction without need to go through memory (~2 cycles for L1 cache, ~10 for L2, ~50 for L3, ~1000 for RAM)

embryo wrote: It should be mentioned, that there are cases of a root class (the Object) successors when the number of successors can be measured in thousands, but even in such situations the compiler just can look at a code a bit before the virtual call and to find there an actual object's type like in the example:
Code: Select all
Object var=new String();
var.hashCode(); // hashCode is object's method
It means the compiler wins again.

Any compiler that isn't pants on head retarded can figure that one out. Its a trivial case.

Its the non-trivial cases which matter, and which no compiler has the time in the world to figure out.

embryo wrote:
Owen wrote:Any compiler developer will tell you that pointers aren't the real problem
If we have a pointer to some structure - how can we be sure there is no access to the same address from other thread?

Because that would be undefined behavior and the compiler can assume undefined behavior never happens?

(N.B. Java has no undefined behavior. This is one of the reasons Java compilers can struggle to optimize)

embryo wrote: Next we use a pointer arithmetic. What a structure the pointer addresses now? How to deal with it? What to cache? It is an inherent problem for unmanaged languages. And there are a lot of such problems. And all the likes of the problem are just eliminated in Java.

Complicated pointer arithmetic is complicated. Guess what? The solutions to the inability to do this in managed languages are also often equally complicated.

embryo wrote:
Owen wrote:If compilers didn't universally suck at a lot of problems. They really suck at vectorizing register intensive maths code
Can you provide an example of a compiler fault here?
Owen wrote:optimal register allocation for arbitrary problems is only possible in NP time
Why we should limit ourselves to an arbitrary case only? We have all required information to remove all the uncertainity. So - we just should use the information.

Because developers don't like it when you tell them that their code isn't going to compile because the optimizer is allergic to it?

Because the cases for which register allocation is an NP problem are all the interesting, i.e. nontrivial ones?

embryo wrote:
Owen wrote:people tend to frown if a compiler spends 2 hours doing the same thing every compile
Yes, it is an issue. But for the Java Operating System there should be just a few full compiles per month - is it a problem? It is not a consumer toy and there is no children waiting the game to react quickly.

I see very few situations where "a few compiles a month" is feasible.

embryo · Post by **embryo** » Mon Mar 03, 2014 10:36 am

bwat wrote:The negative answer to the Entschiedungsproblem says ...

General artificial intelligence and a universal solver are not related to the subject. The subject is narrower a lot.

bwat · Post by **bwat** » Mon Mar 03, 2014 11:40 am

embryo wrote:
bwat wrote:The negative answer to the Entschiedungsproblem says ...
General artificial intelligence and a universal solver are not related to the subject. The subject is narrower a lot.

It has nothing to do with artificial intelligence! Look, your compiler will never beat a dynamic (run-time) analysis. The problem you want the compiler to solve boils down to an old question that was proven only partially solvable in 1936 --- the Entscheidungsproblem. Logically speaking it's a done deal!

embryo · Post by **embryo** » Mon Mar 03, 2014 12:08 pm

Owen wrote:The processor updates the branch predictors every branch

And what is the problem? Branch predictors are just the same algorithms and processed data.

Owen wrote:
Code: Select all
void f(A a) {
    if(a.foo()) a.bar() else a.baz();
}
Different As are passed to this method from hundreds of locations in the program (In the fortunate case it can even see every different version of A! Note that it can't do this if you are dynamically loading classes). How is the compiler to know that 99% of the time a.foo() returns false?

The processor, meanwhile, will get it wrong the first time, then have a weak false prediction, which will strengthen with each invocation.

If the result of the foo() is data driven then a compiler can use many suitable data mining algorithms instead of only one (and simplest) embedded in the processors microprogram storage. Then in cases when the probability of 'false' differs from 99% or even has some volatile nature processor will lose spectacularly. In the worst case for the compiler, when the data is selected in such a manner that exactly processor's prediction algorithm works the best, the compiler's version will loose on a short data sequences, but on a long sequences the difference will be negligible. But it is worst case scenario. In general - compiler wins, it's just smarter.

Owen wrote:the only way to learn what the code will actually do is to execute the code

Yes. But the prediction is still possible. And smart compiler can use such prediction. But there is no processor which can be so smart.

Owen wrote:The processor, meanwhile, can determine these things with startlingly little resource use, using dedicated resources.

It is not a processor's decision. It's all about algorithms used. Compiler can use any algorithm, the processor - only a few ones and all them should be very simple.

Owen wrote:The best algorithms in the world can't save you if you don't have the information, or worse, if your information is wrong.

Just remember genetic algorithms.

Owen wrote:You seem to think these are "microprograms", executed serially by the processor core. They aren't. They're hardwired. They're operating in parallel to the rest of the execution core.

Not exactly. There are basic units for logic and arithmetic operations and there are some means to operate the basic units. The operator units do something like open and close some gates from basic units to an internal bus/buses. And a microprogram here can be in the form of internal memory and simple shifter which switches operator unit input to the following simple instruction in a loop. It, of course, is my general understanding, but not exact internal schematics of a particular processor. However, my understanding tells me that it is a program, may be simple and even partially hardwired when some basic units are connected directly without operator unit, but it is still a program. And such program is optimized for a particular task. If the task is a bit different - the program fails. The compiler knows the task and can select appropriate program for operator units or decide to use optimized hardwired units like divider, for example. When such decision is made by a processor - it is not optimal in many cases.

Owen wrote:The x86 microprogram store is less than 1% of the processor. Worse, it has nothing to do with any of this, but everything to do with the maintaining backwards compatibility with existing legacy code and implementing complex instructions such as far calls and privilege transfers.

And what other billions of transistors do?

Owen wrote:Lets look at a traditional call through a vtable:
Code: Select all
    // Read through vtable pointer
    ldr lr, [r0]
    // Get the method pointer
    ldr lr, [lr], #8
    // Do some calculations of the method parameters
    // ...
    blx lr
Now, the compiler has no idea what is behind that call

Even in case of a C compiler it should to have an idea. It has source code where the type is given as a program construct like variable declaration or argument type definition. There must be the type. But in case of pointers the type can be hidden, yes. This means the pointers are not compiler friendly.

Owen wrote:Unless it can see the entire program, which means recompiling the whole system every time new software is installed

Why there is a recompilation needed? Any compiler can see the method/function it compiles. It is enough.

Owen wrote:The processor, meanwhile, can start fetching the body of that function as soon as it has calculated the value of lr from instruction 2

The compiler can start fetch the body of a few functions long before the processor can calculate the value of lr. It knows the class hierarchy and can determine the functions needed. Very often there is just one such function. Sometimes - two or three. The cases with hundreds of functions are very seldom.

Owen wrote:It might see, for example, that we just did a store to a member and that we are about to do a load from said member, and therefore forward the store to the load instruction without need to go through memory

Compiler sees it better. It, for example, can see that there was a NULL assignment right before the load and can optimize all the sequence to just one instruction for loading the NULL constant. And as usual, in the worst case the compiler falls to the processor's algorithm.

Owen wrote:Its the non-trivial cases which matter, and which no compiler has the time in the world to figure out.

Your generalization is too misleading. It supposes that all cases are irresolvable. But in real life the cases are often very simple.

Owen wrote:Because that would be undefined behavior and the compiler can assume undefined behavior never happens?

In case of managed languages it often can be assumed. For example a local object never can be accessed from outside, then we just should have an information about the source of an object. The compiler can have such information. The processor can't.

Owen wrote:Complicated pointer arithmetic is complicated. Guess what? The solutions to the inability to do this in managed languages are also often equally complicated.

In managed languages the solutions can be found (if they still aren't) because the problem space is very limited. In unmanaged languages the problem space is much bigger. This is an important difference. Every your argument about long compilation goes here.

Owen wrote:developers don't like it when you tell them that their code isn't going to compile because the optimizer is allergic to it

Optimization hints are absolutely accepted by a majority of developers. There's just no allergy.

Owen wrote:I see very few situations where "a few compiles a month" is feasible.

If we nave the standard bytecode as an input for execution then we have no need in a JVM recompilation. Also all frequently used applications are compiled long ago using background threads. Why do we need to recompile every day?

embryo · Post by **embryo** » Mon Mar 03, 2014 12:15 pm

bwat wrote:Look, your compiler will never beat a dynamic (run-time) analysis.

Do you have any particular argument?

bwat wrote:The problem you want the compiler to solve boils down to an old question that was proven only partially solvable

Why do you think the problem boils down to something proven? Can you prove the proposed boil will show us your theorem?

bwat · Post by **bwat** » Mon Mar 03, 2014 1:14 pm

embryo wrote:
bwat wrote:Look, your compiler will never beat a dynamic (run-time) analysis.
Do you have any particular argument?

You're trying to do the impossible.

embryo wrote:
bwat wrote:The problem you want the compiler to solve boils down to an old question that was proven only partially solvable
Why do you think the problem boils down to something proven? Can you prove the proposed boil will show us your theorem?

Your compiler will need to predict the execution path of any given input program if it is to be better than today's conservative static analyses* and compete with dynamic analyses. If you can solve this prediction problem then you can also solve the halting problem. This is impossible.

This is a well known result from theoretical computer science which all students are taught. As a student I was told to read On computable numbers, with an application to the Entscheidungsproblem.

*) Of course languages change and new techniques for static analysis will be developed but tomorrow's techniques won't be predicting the unpredictable.

Combuster · Post by **Combuster** » Mon Mar 03, 2014 2:23 pm

You're applying arguments to things where they don't apply. Why after all do we have static and dynamic analysis in the first place if by your reasoning the halting problem would make them impossible in the first place?

bwat · Post by **bwat** » Mon Mar 03, 2014 2:29 pm

Combuster wrote:Why after all do we have static and dynamic analysis in the first place if by your reasoning the halting problem would make them impossible in the first place?

You haven't understood. Nowhere do I state what you claim I state.

Combuster · Post by **Combuster** » Mon Mar 03, 2014 2:54 pm

Yet that is exactly what your argument claim:

Your compiler will need to predict the execution path of any given input program if it is to be better than today's conservative static analyses* and compete with dynamic analyses. If you can solve this prediction problem then you can also solve the halting problem. This is impossible.

1: You need X to be better than Y
2: X is equivalent to the halting problem
3: The halting problem is impossible
4: Therefore there exists no X better than Y (Which by induction means there exists no Y, or that the halting problem prevents any form of analysis to be created in the first place)

The next part is even more hilarious:

Of course languages change and new techniques for static analysis will be developed

5: There exists an Z better than Y
6. You lied.

embryo · Post by **embryo** » Mon Mar 03, 2014 3:33 pm

bwat wrote:If you can solve this prediction problem then you can also solve the halting problem. This is impossible.

Here is a key definition of the problem you are insisting to be a stopper:

a general algorithm to solve the halting problem for all possible program-input pairs cannot exist

Now look at the phrase "all possible program-input pairs". And please tell us where should we look to find "all possible program-input pairs" in statements like this:

Code: Select all

int a=1;
int b=a;
int c=b;
print(c);

Even if there somehow are "all possible program-input pairs" it just doesn't stop us from rewriting the code as follows:

Code: Select all

print(1);

Now, if we can "do a little", why do you think we can not do many such little things?

bwat · Post by **bwat** » Mon Mar 03, 2014 4:02 pm

Combuster wrote: 4: Therefore there exists no X better than Y (Which by induction means there exists no Y, or that the halting problem prevents any form of analysis to be created in the first place)

You've got a problem with thinking. By your reasoning, we can state that because there is no better than some optimum there can be no optimum. Just substitute "optimum" for your Y. Nonsense!

You also need to read up on static analysis. My argument is covered quite succinctly in the first two pages of Principles of Program Analysis, Nielson F., Nielson H.R., Hankin C., Springer, 1998. Here are the important quotes:

Program analysis offers static compile-time techniques for predicting safe and computable approximations to the set of values or behaviours arising dynamically at run-time when executing a program on a computer.

One common theme behind all approaches to program analysis is that in order to remain computable one can only provide approximate answers. As an example consider a simple language of statements and the program

read(x); (if x>0 then y:=1 else (y:=2;S)); z:= y

where S is some statement that does not contain an assignment to y, Intuitively, the values of y that can reach z:=y will be 1 or 2.

Now suppose that an analysis claims that the only value for y that can reach z:=y is in fact 1. While this seems intuitively wrong, it is in fact correct in the case where S is known never to terminate for x<=0 and y=2. But since it is undecidable whether or not S terminates, we normally do not expect our analysis to detect this situation. So in general, we expect the program analysis to produce a possibly larger set of possibilities than what will ever happen during the execution of the program.

If you read the paper I linked to above you'll see the reasoning behind these quotes.

bwat · Post by **bwat** » Mon Mar 03, 2014 4:10 pm

embryo wrote:Here is a key definition of the problem you are insisting to be a stopper:
a general algorithm to solve the halting problem for all possible program-input pairs cannot exist
Now look at the phrase "all possible program-input pairs".

You're right about the importance of dealing with all input. I was quite specific in my claim to cover this. Here's what I wrote with the important bits in bold:

bwat wrote:Your compiler will need to predict the execution path of any given input program if it is to be better than today's conservative static analyses* and compete with dynamic analyses. If you can solve this prediction problem then you can also solve the halting problem. This is impossible.

Have a quick read of my answer above with the quotes from the Program Analysis book. It might clear things up for you w.r.t. how a compiler has to deal with decidability.

Combuster · Post by **Combuster** » Mon Mar 03, 2014 4:37 pm

bwat wrote:
Combuster wrote: 4: Therefore there exists no X better than Y (Which by induction means there exists no Y, or that the halting problem prevents any form of analysis to be created in the first place)
You've got a problem with thinking. By your reasoning, we can state that because there is no better than some optimum there can be no optimum. Just substitute "optimum" for your Y. Nonsense!

Substitute optimum for Y? Like I said, your reasoning is hilarious.

5: There exists an Z better than Y
6. You lied.

7. You lied. Again.

Rusky · Post by **Rusky** » Tue Mar 04, 2014 10:44 am

I think the real point bwat is trying to make is that there are some properties of a program that can only be discovered by running it, thus the compiler cannot beat the CPU in those areas. It can use the same algorithms and data structures, but it doesn't have the same data until the program has already run. Even if it considers "all possible inputs," how is it going to encode that information in a way that the CPU can use it faster than just figuring it out itself in parallel with decoding the instruction stream?

Combuster · Post by **Combuster** » Tue Mar 04, 2014 11:28 am

By embedding such algorithms into the program when it can't statically predict what will happen?

I mean, if you have an explicit pipeline (like you'd have seen in delay-slot jump architectures), then you can for pretty much all for-loops explicitly inform the CPU where execution will go and have zero mispredictions - where every other CPU with a branch predictor would fail at least at the exit condition. Similarly, you can explicitly encode branch predictors for if statements, or even run a part of both branches until the jump kicks in. Being VLIW easily gives you the option to do such parallelism for free when you're doing typical dependent computational series anyway.

Bottom line: there's no such thing as the CPU being superior to the compiler.

OSDev.org

The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design