
Re: The Mill: a new low-power, high-performance CPU design

Posted: Sat Mar 15, 2014 9:05 am
by h0bby1
Well yeah, again, it's not about the ability of the language to do it. You could write a Crysis engine in QBasic or JavaScript. It would just be super slow, or much more bothersome to write optimized code for.

With GLSL you have neither of those inconveniences: the syntax is clear and neat, and it makes it easy, and even explicit, for the compiler to generate code optimized for a SIMD-capable CPU.

I'm not saying GLSL could entirely replace C or Java; it still lacks many features of a complete language, and it's not even supposed to be Turing-complete, but it's designed to make vectorization/parallelization easy and obvious.

But yes, again, you could represent integers as arrays of 32 bools and code the addition manually, or make a Crysis engine in QBasic. It's not really a question of whether the language is able to do it; that kind of thing has been around for a very long time, and nothing specific is needed in the language to program it, except that some languages will be 10 times slower for the same thing.

What is less portable is having the compiler generate the most optimized code for any CPU or platform without you having to give a single **** about putting a pragma or an annotation for a specific CPU or a specific kind of build. You just tell the compiler: I want this code compiled for ARM, Intel, or anything else, optimized for size or speed, using calls to the runtime or intrinsics/inlining, using this instruction set or not, or even selecting a specific code path at runtime for the specific version of the CPU. All without writing a single line of additional code for any specific platform, because the language already defines the behavior a function has, and the compiler already knows how that function or behavior can be optimized for the target platform.
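Just to illustrate what I mean, this is the kind of per-platform code you end up writing by hand today (a rough sketch; the add4 helper is made up, the intrinsics and macros are the usual x86/ARM ones):

[code]
/* The same 4-float add, hand-written once per target. This is exactly the
 * per-platform code the language and compiler should make unnecessary. */
#if defined(__SSE__)
#include <immintrin.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

void add4(float *dst, const float *a, const float *b) {
#if defined(__SSE__)
    _mm_storeu_ps(dst, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));  /* x86 SSE */
#elif defined(__ARM_NEON)
    vst1q_f32(dst, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));             /* ARM NEON */
#else
    for (int i = 0; i < 4; i++) dst[i] = a[i] + b[i];                  /* portable fallback */
#endif
}
[/code]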

Re: The Mill: a new low-power, high-performance CPU design

Posted: Sat Mar 15, 2014 4:02 pm
by willedwards
(I am the author of the linked guide: Introduction to the Mill CPU Programming Model)

It's nice to read what everyone thinks about our Mill CPU :)

This thread has meandered a bit, so I won't try to answer each old post, but I'm happy to answer any questions about the Mill that you have.

Re: The Mill: a new low-power, high-performance CPU design

Posted: Sun Mar 16, 2014 2:41 am
by Combuster
willedwards wrote:It's nice to read what everyone thinks about our Mill CPU :)
Yup, second opinions for free :D

Re: The Mill: a new low-power, high-performance CPU design

Posted: Sun Mar 16, 2014 5:08 am
by embryo
h0bby1 wrote:it's not about the ability of the language to do it. You could write a Crysis engine in QBasic or JavaScript. It would just be super slow.
Yes, my answers are not about the ability to write a GPU application in some language. My answers are about the performance of Java or C if the applications were written in Java or C. And that performance is not worse than in the case of GLSL.
h0bby1 wrote:Because the language already defines the behavior a function has, and the compiler already knows how that function or behavior can be optimized for the target platform.
It is absolutely irrelevant whether the language can natively define some behavior. What matters is the compiler's ability to understand the required behavior.

Re: The Mill: a new low-power, high-performance CPU design

Posted: Sun Mar 16, 2014 5:17 am
by embryo
willedwards wrote:I am the author of the linked guide: Introduction to the Mill CPU Programming Model
I hope the author can kindly answer some questions:
Is it a compatibility issue that has prevented the Mill designers from having many more MIMD operations per cycle? Was compatibility a main concern when the Mill was designed? Why is there less VLIW than possible? What is the expected throughput of the Mill's bus? Where are the limits of the Mill's performance for an easily parallelizable program? Was single-threaded execution the only goal of the Mill designers?

Re: The Mill: a new low-power, high-performance CPU design

Posted: Sun Mar 16, 2014 5:42 am
by willedwards
embryo wrote:
willedwards wrote:I am the author of the linked guide: Introduction to the Mill CPU Programming Model
I hope the author can kindly answer some questions:
Is it a compatibility issue that has prevented the Mill designers from having many more MIMD operations per cycle? Was compatibility a main concern when the Mill was designed? Why is there less VLIW than possible? What is the expected throughput of the Mill's bus? Where are the limits of the Mill's performance for an easily parallelizable program? Was single-threaded execution the only goal of the Mill designers?
Well, in practice 33 ops/cycle is a heroic world record :)

It's not really possible to have more. The first talk, about "encoding", explains the mechanics really well.

Regarding compatibility: at the core level, different members of the Mill family are incompatible, but they are all compatible at the load module level. Under the hood, we provide a "specializer" to the OS which converts generic, portable Mill IR to target IR.

For us, compatibility means compatibility (and efficiency) with the programs you already have. We are just another LLVM backend. We care a lot about not breaking C etc.

We have a really exciting Security talk in a month - I'll announce on this forum when the streaming details are settled - which will be of great interest to people doing microkernels. But we support a Unix model too.

Re: The Mill: a new low-power, high-performance CPU design

Posted: Sun Mar 16, 2014 7:20 am
by h0bby1
embryo wrote:
h0bby1 wrote:it's not about the ability of the language to do it. You could write a Crysis engine in QBasic or JavaScript. It would just be super slow.
Yes, my answers are not about the ability to write a GPU application in some language. My answers are about the performance of Java or C if the applications were written in Java or C. And that performance is not worse than in the case of GLSL.
h0bby1 wrote:Because the language already defines the behavior a function has, and the compiler already knows how that function or behavior can be optimized for the target platform.
It is absolutely irrelevant whether the language can natively define some behavior. What matters is the compiler's ability to understand the required behavior.
Yes, and what is the purpose of a compiler? It's to transform behavior defined in one language into another language. Now, it's not defined anywhere in the C language how to compile a vec4 or a matrix, so the compiler has no reason to understand it.

If the language doesn't define the operation, the compiler can't magically produce code to do it. It can only produce code for the operation as a sequence of several sub-operations that are defined in the language, which is not optimized.

It would be like a language with no built-in type for 32-bit integers: you would define them as arrays of bools, with the program doing the xor/carry bit by bit, 32 times per integer. The compiler would compile that code to 32 xor/carry operations instead of a single instruction, because it doesn't know about the 32-bit integer type and its operators, or how to generate optimized code for the target CPU to do the operation.
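To make the analogy concrete, here is a minimal sketch (bits32 and add_bits32 are made-up names, purely for illustration):

[code]
/* An "integer" stored as 32 bools, added bit by bit. Nothing tells the
 * compiler this is an integer add, so it can only emit ~32 scalar steps. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { bool bit[32]; } bits32;

/* Ripple-carry addition: one xor/and/carry step per bit. */
static bits32 add_bits32(bits32 a, bits32 b) {
    bits32 sum;
    bool carry = false;
    for (int i = 0; i < 32; i++) {
        sum.bit[i] = a.bit[i] ^ b.bit[i] ^ carry;
        carry = (a.bit[i] & b.bit[i]) | (carry & (a.bit[i] ^ b.bit[i]));
    }
    return sum;
}

int main(void) {
    bits32 one = { .bit = { true } };      /* the value 1 */
    bits32 two = add_bits32(one, one);     /* 32 scalar steps...        */
    printf("%d\n", two.bit[1]);            /* ...to compute 1 + 1 = 2   */

    unsigned a = 1, b = 1;
    printf("%u\n", a + b);                 /* built-in type: one add instruction */
    return 0;
}
[/code]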

To use an image: it's like a car, where the CPU is the engine. Even if the engine is fast, if you don't have good brakes, a good gearbox to handle the engine's torque, good steering, airbags, well, you'd go faster and safer in a car with a less powerful engine.

The language is like the rest of the car: it's what feeds the CPU with what it has to do, and it's how you control what the CPU will do. For this, Java is nice: it has all the gadgets, tinted windows, leather seats and everything; in terms of comfort of use Java is among the best languages. BUT it doesn't let you define many operations a CPU can do, like operations on memory, I/O, and possibly vectors.

The performance will be worse if the generated code doesn't use SIMD. And if you use only C, the language doesn't define the behavior of the operations that could use SIMD. As a matter of fact, I haven't seen any C compiler today that can really generate optimized MMX assembly.

And it doesn't do that great a job of compiling the kind of operations that would be done in GLSL either. It can compile them, you can define those operations in any language, but the output will not be optimized.

What I'm saying is basically that the types and operators that GLSL defines should be integrated into the C language so that the compiler is able to optimize them better.
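For what it's worth, GCC and Clang already offer a non-standard vector extension that goes in this direction; a minimal sketch (not standard C):

[code]
/* GLSL-like vector types via the GCC/Clang vector extension.
 * Only an illustration of the idea. */
#include <stdio.h>

typedef float vec4 __attribute__((vector_size(16)));   /* four floats, SIMD-sized */

int main(void) {
    vec4 a = {1.0f, 2.0f, 3.0f, 4.0f};
    vec4 b = {5.0f, 6.0f, 7.0f, 8.0f};
    vec4 c = a + b;                 /* element-wise, typically one SIMD add */
    printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
[/code]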

And I'm not speaking specifically about GPU applications either. I'm speaking about the features of GLSL regarding how it handles SIMD and parallelization. Whether the code is compiled for a GPU, or for an Intel, ARM or Mill CPU, is irrelevant.

And for me the example shown with the string copy is not that convincing, because I'm not sure strcpy is really the bottleneck in many programs. It would be more convincing to see how it optimizes the kind of loop you would have in a GLSL program. I'm not saying it should be programmed in GLSL, but the principle of having complex loops execute over arrays of complex types. Something more complex than a strcpy.
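For example, something like this kind of loop (a hypothetical sketch; vec4p, mat4 and transform are made-up names):

[code]
/* A 4x4 matrix applied to an array of 4-component points. In plain C this
 * is scalar code that the compiler has to vectorize entirely on its own. */
#include <stddef.h>

typedef struct { float x, y, z, w; } vec4p;
typedef struct { float m[4][4]; } mat4;

void transform(vec4p *out, const vec4p *in, const mat4 *M, size_t n) {
    for (size_t i = 0; i < n; i++) {
        vec4p p = in[i];
        out[i].x = M->m[0][0]*p.x + M->m[0][1]*p.y + M->m[0][2]*p.z + M->m[0][3]*p.w;
        out[i].y = M->m[1][0]*p.x + M->m[1][1]*p.y + M->m[1][2]*p.z + M->m[1][3]*p.w;
        out[i].z = M->m[2][0]*p.x + M->m[2][1]*p.y + M->m[2][2]*p.z + M->m[2][3]*p.w;
        out[i].w = M->m[3][0]*p.x + M->m[3][1]*p.y + M->m[3][2]*p.z + M->m[3][3]*p.w;
    }
}
[/code]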

Re: The Mill: a new low-power, high-performance CPU design

Posted: Sun Mar 16, 2014 7:24 am
by Combuster
Yes, and what is the purpose of a compiler? It's to transform behavior defined in one language into another language. Now, it's not defined anywhere in the C language how to compile a vec4 or a matrix, so the compiler has no reason to understand it.
That's plain nonsense as written, and even if you fix the statement to what I think you mean it only demonstrates you're completely oblivious to what a vectoriser is supposed to do.

Re: The Mill: a new low-power, high-performance CPU design

Posted: Sun Mar 16, 2014 7:26 am
by h0bby1
Combuster wrote:
Yes, and what is the purpose of a compiler? It's to transform behavior defined in one language into another language. Now, it's not defined anywhere in the C language how to compile a vec4 or a matrix, so the compiler has no reason to understand it.
That's plain nonsense as written, and even if you fix the statement to what I think you mean it only demonstrates you're completely oblivious to what a vectoriser is supposed to do.
A vectorizer, as it's generally used, parallelizes operations done on several elements in a loop. But I'm speaking about vectors in the linear algebra sense, and about how well a language can define those linear algebra operations.

So there are two things at play here: the definition of linear algebra operations on vectors, and the 'vectorization' of those operations when they are done on arrays or in loops. GLSL defines a mechanism for both. 'Regular' Java or C does neither.

Actually those 'vectors' (vertices, pixels, texture coordinates) should not be dealt with as 'vectors' in the programming sense; they shouldn't be defined as arrays. In GLSL you don't write vec4 point; point[0], point[1]. It's point.x/y/z/w, or texcoord.u/v, or pixel.r/g/b/a.

But it just happens that the operations done on them are linear algebra operations, defined as 'vectorial' in the linear algebra sense; that the CPU has registers and instructions specially made to do those operations in an optimized manner; and that you often have to execute such linear operations on actual 'vectors' (here in the programming sense, as arrays) of those vertex/pixel/texture coordinates.

But instead of having glVertexPointer/glLoadProgram/glDrawArray, you could have setDataPointer/loadProgram/processArray, with the "program" being some linear algebra operation to be executed on each element of the data, and an internal loop over those elements that can then be parallelized/vectorized/dispatched. The "program" could run just as well on the main CPU or on the GPU, it doesn't matter. What matters is why they use GLSL, and not COBOL, to program this kind of vectorized linear algebra operation.
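Roughly something like this in plain C (all the names here are hypothetical, it's just a sketch of the idea):

[code]
/* The "program" is a per-element kernel applied inside one internal loop,
 * which is the loop a runtime could vectorize, parallelize or send to a GPU. */
#include <stddef.h>

typedef struct { float x, y, z, w; } vec4p;
typedef void (*kernel_fn)(vec4p *element, const void *uniforms);

struct pipeline {
    vec4p      *data;
    size_t      count;
    kernel_fn   program;
    const void *uniforms;
};

static void setDataPointer(struct pipeline *p, vec4p *data, size_t count) {
    p->data = data; p->count = count;
}

static void loadProgram(struct pipeline *p, kernel_fn program, const void *uniforms) {
    p->program = program; p->uniforms = uniforms;
}

static void processArray(struct pipeline *p) {
    for (size_t i = 0; i < p->count; i++)      /* each iteration is independent */
        p->program(&p->data[i], p->uniforms);
}

/* Example "program": scale a point, like a trivial vertex shader would. */
static void scale_kernel(vec4p *v, const void *uniforms) {
    float s = *(const float *)uniforms;
    v->x *= s; v->y *= s; v->z *= s;
}

int main(void) {
    vec4p points[2] = { {1, 2, 3, 1}, {4, 5, 6, 1} };
    float scale = 2.0f;
    struct pipeline p;
    setDataPointer(&p, points, 2);
    loadProgram(&p, scale_kernel, &scale);
    processArray(&p);
    return 0;
}
[/code]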

It's more on this kind of thing that I'd like to see what they can come up with on a CPU like the Mill. And that stuff is not really easy to exploit from plain vanilla C or Java.

Re: The Mill: a new low-power, high-performance CPU design

Posted: Sun Mar 16, 2014 8:41 am
by Brendan
Hi,
willedwards wrote:Well, in practice 33 ops/cycle is a heroic world record :)
It would be easy to break this record.

For example, rather than only doing things during a cycle and at cycle boundaries, you could reduce the maximum clock frequency by a factor of 16, split each cycle into 16 "sub-cycles", and then claim that you're doing 528 ops/cycle. The ops/second wouldn't change at all, but the ops/cycle would be a lot higher.

My question is: what is the maximum clock frequency Mill can achieve, and/or how does Mill compare to traditional CPUs for "ops/second at max. speed"; and what is power consumption like at this max. clock frequency?


Cheers,

Brendan

Re: The Mill: a new low-power, high-performance CPU design

Posted: Sun Mar 16, 2014 9:10 am
by willedwards
Brendan wrote:Hi,
willedwards wrote:Well, in practice 33 ops/cycle is a heroic world record :)
It would be easy to break this record.

....

My question is: what is the maximum clock frequency Mill can achieve, and/or how does Mill compare to traditional CPUs for "ops/second at max. speed"; and what is power consumption like at this max. clock frequency?
Well, those kinds of games wouldn't fool anybody :)

We aim to start at 1GHz for practical fab reasons, but we can go all the way up, just like the OoO superscalars.

For the current designs we are picking one cycle to be a 32-bit add. We could have picked another quantity. The latencies for each model are something the specialising compiler knows; they are not exposed to the earlier compile steps.

I can't recommend the Encoding talk strongly enough. It was the first talk, so Ivan explained a lot of the reasoning behind the Mill and addressed much of what has been raised in this thread :)

In the Encoding talk Ivan describes the power versus ops/cycle for the Gold model at 1GHz.

Re: The Mill: a new low-power, high-performance CPU design

Posted: Mon Mar 17, 2014 11:41 am
by embryo
willedwards wrote:Well, in practice 33 ops/cycle is a heroic world record
Does it mean the Mill's designers do not know how to increase this number? Is there any visible limitation?
willedwards wrote:Regarding compatibility: at the core level, different members of the Mill family are incompatible, but they are all compatible at the load module level. Under the hood, we provide a "specializer" to the OS which converts generic, portable Mill IR to target IR.
Does it mean there is no relation to x86 instruction set compatibility? If so, then the Mill has the freedom to be whatever the designers want. Then maybe it is possible to describe the reasons for implementing complex instructions like smearx instead of exposing just a set of very simple instructions? In the latter case the compiler would be able to optimize a program better. Or do the Mill designers think that complex instructions can deliver better performance?
willedwards wrote:For us, compatibility means compatibility (and efficiency) with the programs you already have. We are just another LLVM backend. We care a lot about not breaking C etc.
Yes, the sentence "compatibility means compatibility" has some expressive power, but I still have doubts - what is the Mill compatible with? What can be broken in C? Why should there be a virtual machine if the Mill instruction set changes just a little in new models?
willedwards wrote:We have a really exciting Security talk in a month - I'll announce on this forum when the streaming details are settled - which will be of great interest to people doing microkernels. But we support a Unix model too.
What is the difference between a Unix kernel and a microkernel from the point of view of the Mill designers? What does the processor have to do to support such a difference?

Re: The Mill: a new low-power, high-performance CPU design

Posted: Mon Mar 17, 2014 12:10 pm
by embryo
h0bby1 wrote:Now, it's not defined anywhere in the C language how to compile a vec4 or a matrix, so the compiler has no reason to understand it.
Compiler developers have such a reason. It is called optimization.
h0bby1 wrote:If the language doesn't define the operation, the compiler can't magically produce code to do it. It can only produce code for the operation as a sequence of several sub-operations that are defined in the language, which is not optimized.
It is wrong to think that the compiler produces something unoptimized. Any compiler should optimize as much as possible.
h0bby1 wrote:Because it doesn't know about the 32-bit integer type and its operators, or how to generate optimized code for the target CPU to do the operation.
It knows all the types and all the required operators. If there is no integer type in some language, the compiler developers are still able to think about a 32-bit number. Hiding something from the language users doesn't mean the compiler developers should close their eyes and forget about integers, numbers, mathematics and logic.
h0bby1 wrote:As a matter of fact, I haven't seen any C compiler today that can really generate optimized MMX assembly.
Do you mean the C compiler should generate better code than a GLSL compiler when it is required to calculate all the 3D stuff? But what about the difference between a CPU and a GPU? How can the code for the CPU be better if the CPU just misses some important parts? The GPU was designed for 3D operations, while the CPU was designed for general-purpose operations. It means a C compiler will always generate somewhat worse code than a GLSL compiler for 3D operations.
h0bby1 wrote:What I'm saying is basically that the types and operators that GLSL defines should be integrated into the C language so that the compiler is able to optimize them better.
No.
h0bby1 wrote:Whether the code is compiled for a GPU, or for an Intel, ARM or Mill CPU, is irrelevant.
Then C compiler developers can copy some part of a GLSL compiler and in that way implement any required optimizations. And there is no need to invent any new native types - the compiler can treat certain derived types as optimizable.
h0bby1 wrote:It would be more convincing to see how it optimizes the kind of loop you would have in a GLSL program.
And what is the problem with just copying the corresponding part of the GLSL compiler? The only difference is that there are no new native types. All the GLSL native types are replaced by derivatives of the standard C types, but for the compiler these are just the same regions in memory.

Re: The Mill: a new low-power, high-performance CPU design

Posted: Mon Mar 17, 2014 12:42 pm
by willedwards
embryo wrote:
willedwards wrote:Well, in practice 33 ops/cycle is a heroic world record
Does it mean the Mill's designers do not know how to increase this number? Is there any visible limitation?
We might work out a way to increase this number. It's not like you can wave a magic wand, sadly; if you don't pack your instructions, you won't have a big enough instruction cache to keep the thing fed. If you pack your instructions, you have to work some heroics to unpack them as quickly as they're eaten. The Mill [url=ootbcomp.com/topic/instruction-encoding/]Instruction Encoding[/url] talk really does explain a lot about how chips in general, and the Mill in particular, work.
Does it mean there is no relation to x86 instruction set compatibility? If so, then the Mill has the freedom to be whatever the designers want. Then maybe it is possible to describe the reasons for implementing complex instructions like smearx instead of exposing just a set of very simple instructions?
Well, from my perspective, smearx is a thing of beauty :) It's actually a very, very simple instruction to implement. It's also a critical, no-longer-secret sauce for how a VLIW DSP-like architecture can vectorize almost all loops. We can vectorize loops that simply cannot be vectorized on the whole previous state of the art, and it's because of None, pick, smear and their kin.
In the latter case the compiler would be able to optimize a program better. Or do the Mill designers think that complex instructions can deliver better performance?
We have a lot of very experienced compiler engineers on-board. This is the "speaker bio" we put on Ivan's talks:

Ivan Godard has designed, implemented or led the teams for 11 compilers for a variety of languages and targets, an operating system, an object-oriented database, and four instruction set architectures. He participated in the revision of Algol68 and is mentioned in its Report, was on the Green team that won the Ada language competition, designed the Mary family of system implementation languages, and was founding editor of the Machine Oriented Languages Bulletin. He is a Member Emeritus of IFIPS Working Group 2.4 (Implementation languages) and was a member of the committee that produced the IEEE and ISO floating-point standard 754-2011.

We get a lot of "sufficiently smart compiler syndrome" waved at us, which feels terribly unfair. Fundamentally, the Mill is designed to be able to do its thing without compiler heroics. We know compilers, we know what is viable and what is wishful thinking.
What is the difference between a Unix kernel and a microkernel from the point of view of the Mill designers? What does the processor have to do to support such a difference?
The security talk will explain what we bring to the table. We have already talked about our byte-granularity protection scheme, but not in much detail. We have already talked a little about stack protection, and interrupts, but not in detail. I'm afraid I really cannot say more for filing reasons, but can promise the details will be public really really soon.

Re: The Mill: a new low-power, high-performance CPU design

Posted: Mon Mar 17, 2014 3:01 pm
by embryo
willedwards wrote:If you pack your instructions, you have to work some heroics to unpack them as quickly as they're eaten.
I see a problem here. Why can't a pipeline, for example, deliver unpacked instructions? In the case of branches there can be more than one pipeline. And caches today are relatively big, tens of kilobytes for the fastest. Also, if the cache is so important, then why not extend it? I suppose the silicon area of the Mill is much smaller than that of some of the monsters from Intel.
willedwards wrote:We have a lot of very experienced compiler engineers on-board.
Unfortunately, it is not the best argument ever.
willedwards wrote:We get a lot of "sufficiently smart compiler syndrome" waved at us, which feels terribly unfair. Fundamentally, the Mill is designed to be able to do its thing without compiler heroics. We know compilers, we know what is viable and what is wishful thinking.
It is very important - the processor designers just refuse to think about a "sufficiently smart compiler". And as a result there are no smart compilers, because there is no processor designed to do its thing with compiler heroics. We have an infinite loop.
willedwards wrote:
What is the difference between a Unix kernel and a microkernel from the point of view of the Mill designers? What does the processor have to do to support such a difference?
We have already talked about our byte-granularity protection scheme, but not in much detail. We have already talked a little about stack protection
So, the Mill's benefit for the OS is some means to protect memory regions. But x86 also has such protection (and has had it for 20 years). Is it heroic enough to have something that was invented very long ago?