That's probably what you get for claiming perfect accuracy for 15 years, so I took your 15-year-old example. Quite interesting that they couldn't beat the AMDs of that time even though they seemed to have more advanced predictors.

Owen wrote:
"Because the Pentium M did it that way, all Intel processors do it that way"?
And because I'm already in the mood of bashing concepts containing absolutes, here's the more up-to-date version.

Owen wrote:
Nehalem and above implement macro-op fusion, in which "dec reg, jnz backwards" and "cmp reg, const, jnz backwards" pairs can be fused into one micro-op. Especially in the former case, tell me why it would be difficult for the processor to correctly predict that every time? It would surprise me entirely if Intel weren't predicting that correctly.

Let's troll the predictor:
Code:
.restart_loop:
(...)
opcode someplace, ecx ; ecx is needed for other work here
mov ecx, [ebp+4]      ; loop counter stashed due to register pressure, reloaded from memory
dec ecx
jnz .restart_loop     ; fused op; the prediction already needed to happen before the exact value of ecx was known
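To make the point concrete, here is a minimal C analogue of that fragment - my sketch, not part of the original example. The volatile qualifier stands in for the register-pressure spill: it forces the counter to be reloaded from memory each iteration, so the loop-closing branch must be predicted before the loaded value is even available.

Code:

#include <stdio.h>

int main(void) {
    volatile unsigned counter = 1000; /* spilled loop counter, like [ebp+4] */
    unsigned work = 0;
    unsigned c;

    do {
        work += 2;        /* stand-in for the loop body that hogs the registers */
        c = counter;      /* mov ecx, [ebp+4] : reload from the spill slot */
        c--;              /* dec ecx */
        counter = c;      /* keep the spill slot current for the next iteration */
    } while (c != 0);     /* jnz .restart_loop : branch on the just-loaded value */

    printf("work = %u\n", work);
    return 0;
}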
My original example showed that by adding delayed jumps for those cases where the condition is known well ahead of the jump, the predictor never gets a chance to guess wrong - especially relevant considering how most architectures force the condition to be evaluated immediately before the jump. That's something static analysis can give you, but not dynamic prediction; a sketch of the idea follows below.

Now: please tell me how you would beat branch-history predictors (a toy one is simulated below as well) while using equal silicon area.
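Since no mainstream ISA at hand exposes delayed jumps of that kind, here is a hedged C-level sketch of the scheduling idea (the variable names and the "delay slot" work are illustrative, not from my original example): the exit condition is fully resolved several instructions before the branch that consumes it. On x86 a compiler will still emit an ordinary conditional jump, so the point is only where the condition becomes known, not what the hardware does with it.

Code:

#include <stdio.h>

int main(void) {
    unsigned counter = 1000;
    unsigned sum = 0, product = 1;

    for (;;) {
        counter--;
        int exit_now = (counter == 0); /* condition resolved early */

        /* independent "delay slot" work, scheduled between the
           evaluation of the condition and the jump that uses it */
        sum += counter;
        product *= 3;

        if (exit_now)                  /* jump consumed long after evaluation */
            break;
    }

    printf("sum = %u, product = %u\n", sum, product);
    return 0;
}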
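And for reference, here is a toy two-bit saturating-counter predictor simulated in C - the textbook scheme, not any particular vendor's design. It shows the baseline you would have to beat: a loop-closing branch like the one above costs exactly one misprediction, at the exit.

Code:

#include <stdio.h>

int main(void) {
    int state = 2;        /* 0-1: predict not-taken, 2-3: predict taken */
    int mispredicts = 0;
    const int N = 1000;   /* iterations of the simulated loop */

    for (int i = 1; i <= N; i++) {
        int actual = (i < N);          /* loop-closing branch: taken until the last pass */
        int predicted = (state >= 2);
        if (predicted != actual)
            mispredicts++;

        /* train the saturating counter on the actual outcome */
        if (actual) { if (state < 3) state++; }
        else        { if (state > 0) state--; }
    }

    printf("%d mispredict(s) out of %d branches\n", mispredicts, N);
    return 0;
}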
But I agree, I should shut up now.