The Mill: a new low-power, high-performance CPU design
- Combuster
Re: The Mill: a new low-power, high-performance CPU design
h0bby1 wrote:
> Maybe they should make CPUs that are really designed for a particular language, to make it easy for the compiler to generate optimized code for the CPU

It will actually complicate the compiler's job. Java bytecode, for example, is a high-level entity, and as such it simply does not help the compiler: for the compiler it is almost the same whether it consumes Java source code or its compiled bytecode form. So if you enforce some high-level constructs in hardware, the compiler will be unable to change the algorithm behind such a high-level implementation and will struggle for performance.
Re: The Mill: a new low-power, high-performance CPU design
Rusky wrote:
> The loop isn't "split into two parts" with smearx any more than it is with a conditional branch instruction.

The split is shown in the pictures in the blog you referenced above.
Rusky wrote:
> Nor is there any reason a Mill loop must always use smearx. It's just another tool like cmp, jnz, etc.

So why should the processor allocate part of its silicon for a smearx implementation? Is it worth it?
Re: The Mill: a new low-power, high-performance CPU design
embryo wrote:
> It will actually complicate the compiler's job. Java bytecode, for example, is a high-level entity and as such it simply does not help the compiler. So if you enforce some high-level constructs in hardware, the compiler will be unable to change the algorithm behind them and will struggle for performance.

Well, for a Java program the optimization goals would be different. But at the level of what a compiler (or a not-so-short-sighted bytecode executor) can do, I think part of what cripples Java performance is having to deal so often with copy constructors, and having every operation done on objects that get copied and constructed many times over the course of execution. Some hardware support for this mechanism could help: some sort of "object granularity" instead of "word granularity" in the memory layout, possibly with atomic operations on those objects, and the ability to manipulate memory areas and references to them as objects. That would provide a better way to handle these operations than what the current assembler/memory model offers, which is mostly based on word/register-sized operations. A CPU could probably handle a certain number of things like that in hardwired form and improve performance, for example by accelerating access to arrays of simple-typed objects; the compiler could feed the CPU the actual memory layout, and a certain number of Java operations, such as copy/construct, could be accelerated.
But as far as I understand, Java was not made specifically for computational performance. Sun developed Java as a language for the desktop, especially to avoid having to deal with the many cases of a memory instance being used/modified/deleted in many different parts of a program, to simplify dealing with complex shared types in real-time applications, and for memory safety, rather than for performance. Their goal was clearly not computational performance, optimal memory use, or optimal use of CPU data types and structures; there is already the Frankensteinized version of C for that =)
With GLSL, the syntax is still quite close to something you could come up with in C, but even though I haven't directly tried it, I'm pretty sure it would be much easier to write a GLSL-to-Intel compiler that makes good use of SIMD instructions than to write a compiler that can do the same from 'naive' C. Even naive GLSL makes explicit use of vector types and operations that the compiler can easily match to optimized CPU instructions, even though it's still a rather 'high level' language. C can still compete using inlined SSE intrinsics; I actually ran some tests with raytracing on a quad core with hyperthreading (8 threads), and with SSE optimization the performance was about half that of the shader with quasi-identical code. But it would make the optimization process much easier to give the compiler a shader-like language that already supports all the floating-point linear algebra operations (matrices, vectors, dot, cross, etc.), pixel operations, and eventually some 8x8 DCT, than to get the same from plain C.
What I mean is that it's a whole: programmers will express their intentions in the terms of a compiled/interpreted language most of the time, and those terms are not necessarily easy to translate for a specific architecture with specific instructions for specific operations. For me it's almost impossible to really take advantage of CPU features with languages like C/C++ or Java unless you add some tweaks, but the problem with tweaks is that you never know how another compiler will interpret them; you break the global coherency of the program definition a little, and lose many of the benefits a compiler can offer in terms of error detection and optimization.
Re: The Mill: a new low-power, high-performance CPU design
Combuster wrote:
> Rusky wrote:
> > Combuster wrote:
> > > <x86 implementation>
> > Try moving a maximum vector of bytes per cycle.
> Show me an example of the mill doing that.

The example I linked does:
a can be loaded as a vector and b can be stored as a vector. Any elements in a that the process does not have permission to access will be NaR, but this will only fault if we try and store them.
The a vector can be compared with 0; this yields a vector of booleans, which is then smearx'ed. The smear result can then be used to pick between a and a vector of None, producing b. The smearx offsetting ensures that the trailing zero is still copied from a to b. The second result of smearx, which records whether any 0 was found in a, drives a conditional branch that decides whether another iteration of the vectorised loop is required.
The phasing of the strcpy operations allows all 27 operations to be executed in just one cycle, which moves a full maximum vector of bytes each cycle.
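The semantics described above can be sketched in ordinary code. The following is a loose Python model of one vectorized strcpy step, not actual Mill operations: `None` stands in for the Mill's None/NaR metadata, and the vector width and helper names are illustrative.

```python
# Loose model of the Mill-style vectorized strcpy step described above.
# WIDTH, smearx and strcpy_vec are illustrative names, not Mill encoding.
WIDTH = 4

def smearx(bools):
    """Exclusive smear: out[i] is True iff any of bools[0..i-1] was True.
    Also returns whether any element at all was True (the loop-exit flag)."""
    out, seen = [], False
    for b in bools:
        out.append(seen)
        seen = seen or b
    return out, seen

def strcpy_vec(src):
    dst, i, done = [], 0, False
    while not done:
        a = src[i:i + WIDTH]                      # vector load
        eq0 = [byte == 0 for byte in a]           # vector compare with 0
        smear, done = smearx(eq0)                 # exclusive smear + exit flag
        # pick: keep bytes up to and including the terminator, None past it
        picked = [None if s else byte for s, byte in zip(smear, a)]
        dst += [b for b in picked if b is not None]  # None is never stored
        i += WIDTH
    return dst

print(strcpy_vec([104, 105, 0, 120, 121, 122, 0, 99]))  # [104, 105, 0]
```

Because the smear is exclusive (offset by one), the element at the terminator's position is still copied, which is exactly why the trailing zero lands in the destination.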
Re: The Mill: a new low-power, high-performance CPU design
h0bby1 wrote:
> i think a part of what cripples java performance is having to deal with copy constructors so often, and having all operations done on objects, often copied/constructed many times during execution

No, that is an optimizable part of the problem. It's just like in C: when the developer uses a structure, there is no object-related overhead.
h0bby1 wrote:
> java was not made specially for computational performance

It seems Java was made to fix problems with C, like pointers and other unsafe features. So performance was neither decreased nor increased. But eliminating the pointer problem helps to decrease the compiler's complexity. In this way Java still has some unused potential.
h0bby1 wrote:
> With GLSL, the syntax is still quite close to something you could come up with in C

If it is close to C, then what advantages does GLSL have over C? For example, Java is safe and frees the developer from many tedious tasks like memory management and safety checks. And what about GLSL? If it's all about some DSP-related libraries, why not use C or Java?
h0bby1 wrote:
> Even naive GLSL makes explicit use of vector types and operations that the compiler can easily match to optimized cpu instructions

That's like adding a vector library to C or Java and upgrading the compiler so it understands that the vector can be optimized in some efficient way. Why do we need GLSL?
h0bby1 wrote:
> programmers will express their intentions in the terms of a compiled/interpreted language most of the time, and those terms are not necessarily easy to translate for a specific architecture [...] it's almost impossible to really take advantage of cpu features with languages like C/C++ or java, unless you add some tweaks

The problem can be solved by introducing a set of libraries, with each item optimized for particular hardware. Then it's enough to annotate the Java code, or use C pragmas, for the compiler to select the appropriate library. And, of course, the compiler should be enhanced to understand such performance annotations or pragmas.
h0bby1 wrote:
> the problem with tweaks is that you never know how another compiler will interpret them; you break the global coherency of the program definition a little, and lose many of the benefits a compiler can offer in terms of error detection and optimization.

With Java it is different. There are actually two compilers: the first translates source text into bytecode and ensures "the global coherency of the program definition"; the second, let's call it the JIT for simplicity, can use any annotations the Java bytecode carries and, in turn, can link a particular hardware-dependent library without breaking any coherence. And, of course, such a compiler chain gives us hardware independence, which is a really nice advantage.
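The two-compiler scheme can be sketched in a few lines. This is an illustrative Python analogy, not real JVM machinery: a tag plays the role of the Java annotation, and a registry plays the role of a vendor-supplied compiler backend that may replace the portable fallback at "JIT" time.

```python
# Illustrative sketch of annotation-driven dispatch: a vendor backend may
# swap in an optimized kernel, while the portable fallback always works.
# VENDOR_KERNELS and vector_op are hypothetical names, not a real API.
VENDOR_KERNELS = {}  # filled in by a hardware vendor, if one exists

def vector_op(name):
    """Acts like a Java annotation: tags a function so the runtime may
    substitute an optimized kernel while keeping the default body."""
    def decorate(default_impl):
        def dispatch(*args):
            impl = VENDOR_KERNELS.get(name, default_impl)
            return impl(*args)
        return dispatch
    return decorate

@vector_op("VectorSummation")
def add(v1, v2):
    # portable default: correct on unknown hardware, just slower
    return [a + b for a, b in zip(v1, v2)]

print(add([1, 2, 3], [4, 5, 6]))  # runs the fallback -> [5, 7, 9]

# a vendor "JIT" registers its (here only pretend-optimized) kernel:
VENDOR_KERNELS["VectorSummation"] = lambda v1, v2: [a + b for a, b in zip(v1, v2)]
print(add([1, 2, 3], [4, 5, 6]))  # now runs the vendor path -> [5, 7, 9]
```

The design point matches the argument above: the annotated function keeps a correct default implementation, so the program still runs when no vendor backend is present.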
Re: The Mill: a new low-power, high-performance CPU design
embryo wrote:
> If it is close to C, then what advantages does GLSL have over C? [...] If it's all about some DSP-related libraries, why not use C or Java?

Because the language recognizes, and forces the use of, specific types that can be mapped to registers for pixels and vectors, and it also has types for matrices and textures, so the compiler can easily map those operations to fast SSE/SIMD instructions. With C you can do it using inlined SSE intrinsics, but then it's not really C anymore; if you do it in plain C, without any inline assembly or intrinsics, there is very little chance the compiler will really make the best use of SIMD. It's not really supposed to, either.
If you upgrade the compiler to recognize specific things in the language, it becomes something other than Java or C. You can make the compiler recognize them, but they are not really part of the language itself, so you lose all the benefits of good documentation, maintained compiler tooling, error checks, etc.
Re: The Mill: a new low-power, high-performance CPU design
h0bby1 wrote:
> because the language recognizes and forces the use of specific types that can be mapped to registers for pixels and vectors, and also has types for matrices and textures, and the compiler can easily map those operations to fast SSE/SIMD instructions

Then such a language is very tightly bound to the hardware. It's a kind of Assembler. So why not use Assembler? And the compatibility benefits that all high-level languages share are lost with such a language.
h0bby1 wrote:
> With C you can do it using inlined SSE intrinsics, but then it's not really C anymore

Ok, then there is Java.
h0bby1 wrote:
> If you upgrade the compiler to recognize specific things in the language, it becomes something other than java or C

In the case of Java it is not a problem. The language is untouched; only the execution environment is updated. And since the execution environment always runs on some particular hardware, it is only natural to support hardware-specific things at runtime, like the JIT does.
So I see the Java way as much more compatible; it is safer and its level of abstraction is higher than in the other solutions.
Re: The Mill: a new low-power, high-performance CPU design
GLSL is for shaders. It's not general purpose. Java is designed for a completely different purpose.
Re: The Mill: a new low-power, high-performance CPU design
embryo wrote:
> Then such a language is very tightly bound to the hardware. It's a kind of Assembler. So why not use Assembler? And the compatibility benefits that all high-level languages share are lost with such a language.

No, it's not a kind of assembler at all. It's still very high level, because there is a whole context associated with the shader, which makes it behave much like a DSP with inputs and outputs, and allows very easy parallelisation of the code over large arrays. It has a much more convenient syntax than pure assembler, and it can be compiled at run time for the specific CPU without problems.
For a lot of operations, SSE4 instructions like dot product or horizontal addition can help a great deal with linear algebra. And those algorithms are often complex enough already without having to deal with pure assembly and register allocation on top. The language also supports many data types that are not specially recognized by the CPU, like matrices, or functions to interpolate pixels from a texture, and many operations that are still plainly high level.
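To make the "horizontal" point concrete, here is the shape of a 4-component dot product written out as scalar steps in Python. On SSE4.1 hardware the whole reduction can map to a single dot-product instruction (DPPS); without it, a compiler must emit parallel multiplies followed by additions across lanes, which is the pattern being described.

```python
# A 4D dot product decomposed the way a SIMD compiler sees it.
def dot4(a, b):
    prods = [x * y for x, y in zip(a, b)]      # 4 lane-wise multiplies
    # horizontal reduction: adds that cross vector lanes
    return (prods[0] + prods[1]) + (prods[2] + prods[3])

print(dot4([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]))  # 70.0
```

The lane-wise multiplies vectorize trivially; it is the cross-lane additions that need dedicated horizontal instructions to stay fast.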
A shader is a bit specific in the sense that it's really meant to be used in a DSP-like fashion: it only works on a very specific memory area, the code is supposed to be rather standalone, and the memory it will access is well defined. That makes it easy for a higher-level layer to parallelize the function, in addition to it being easy to compile using SIMD features.
The main thing that makes GLSL specific is that it expects input arrays from OpenGL as vertex/normal/texcoord arrays, but it could be extended to support other types of array as input, to work on lists of strings or whatever else. The principle is to have a language with built-in types and operators that let you write complex, high-level operations in a way that is easy to compile to optimized machine code and to parallelize. It's what many C compilers do with the libc functions, which they can optimize in a variety of ways, but that is still somewhat non-standard behavior if you stick to the C89/C99 definition of the language. And libc functions are still simple operations, and not necessarily trivial to parallelize in the context of a general C program.
Java is fine, but if you have some heavy floating-point vector math to do, on arrays of hundreds of thousands if not millions of 4D float vectors, that needs to run about 50 times per second, then bye bye Java.
The problem is entirely in how the JIT can recognize the operations that can be sped up by specific hardware. The more general purpose the language, the harder it is for the compiler to recognize specific operations that can be accelerated or parallelized. GLSL makes it very easy for the compiler and the higher-level execution framework to figure out what can be parallelized and how to compile the code to machine code efficiently.
And if you use a non-standard feature, whether a pragma or anything else specific to one compiler, another compiler or interpreter will not necessarily compile the code correctly. Then you change a compiler switch, or some define, and all the code crashes or misbehaves, even though it's supposed to be valid code. It can be acceptable if the code still compiles and runs perfectly when the compiler ignores the specific directives, that is, when they are purely optional.
And with OpenGL ES 2.0, GLSL becomes the core of the whole OpenGL rendering process, with very little fixed pipeline left. On any kind of modern desktop machine it's this kind of code, whether for video, 3D, image processing, or audio, that eats most of the CPU time.
For server machines, like databases or file servers, it might be different.
Re: The Mill: a new low-power, high-performance CPU design
h0bby1 wrote:
> no it's not a kind of assembler at all, it's still very high level

And it even has its own bytecodes: I mean the strings which OpenGL sends to the GLSL compiler at runtime.
h0bby1 wrote:
> The language also supports many data types that are not specially recognized by the cpu, like matrices, or functions to interpolate pixels from a texture.

In fact a CPU does recognize those data types, but that CPU's name is GPU. So we have just another processor with its own compiler from every hardware vendor. It is exactly the same for a Java OS: any hardware vendor can provide its part of the JIT and get very good speed.
h0bby1 wrote:
> And many operations that are still plainly high level.

And now, if we look at the picture without the hardware, we have two successors of the C language: Java and GLSL. Partially it is your preference that keeps you with a particular language, but it should be noted that the Java developer community is much bigger than GLSL's, and Java has much richer OOP capability. It's simply more high level.
h0bby1 wrote:
> The main thing that makes GLSL specific is that it expects input arrays from opengl as vertex/normal/texcoord arrays, but it could be extended to support other types of array as input, to work on lists of strings or whatever else

With Java it is no less easy to introduce any useful GL type, like matrices or vectors. And, of course, the type set can be extended too.
h0bby1 wrote:
> the principle is to have a language with built-in types and operators that let you write complex, high-level operations in a way that is easy to compile to optimized machine code and to parallelize.

Frankly, I see no point in special language constructs instead of general data types like integer or float. It is perfectly possible to define a structure with a root class of 3 doubles to represent a vector. And it is also very easy to annotate a method like this:
Code:
@VectorSummation(vector1ParamIndex=0, vector2ParamIndex=1)
public VectorSuccessor add(VectorSuccessor v1, VectorSuccessor v2)
{
    ...
}

Having such a method, we can provide its default implementation for any hardware and default compiler, but the annotation also tells a vendor-provided compiler that there are two vectors and that the result of the summation is returned as another vector. So the vendor-specific compiler (if present) can easily optimize the function by replacing its body with an optimized vector-summation routine. And there's no need to compile the whole GL-related program from a string representation; we only need to recompile some annotated methods.
h0bby1 wrote:
> but if you have some heavy floating point vector math to do, on arrays of hundreds of thousands if not millions of 4D float vectors, that needs to run about 50 times per second, then bye bye java.

Even on a general CPU like Intel's chips, a 20 millisecond interval is a very long time. For 3 million 32-bit floats (a million 3d vectors), with 256-bit vector units available for parallel operations, we have about 160 processor cycles' worth of float operations per float, which is more than enough for a multiplication or a division. The only problem is that we must tell the compiler where our 3d vectors are. With primitive arrays it is very easy, but it is still possible even with objects.
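The 160 figure can be checked with back-of-the-envelope arithmetic. The clock speed is my assumption (embryo does not state one); with a 3 GHz clock and 8 floats per 256-bit vector operation, the numbers come out as claimed.

```python
# Budget check for the figure above. clock_hz is an assumed value.
clock_hz = 3e9            # assumed 3 GHz clock
frame_s  = 0.020          # 50 updates per second -> 20 ms per frame
floats   = 3_000_000      # one million 3-component vectors
lanes    = 256 // 32      # 8 x 32-bit floats per 256-bit vector op

cycles_in_budget   = clock_hz * frame_s          # 60 million cycles
float_ops_possible = cycles_in_budget * lanes    # one vector op per cycle
ops_per_float      = float_ops_possible / floats
print(ops_per_float)  # 160.0
```

So under these assumptions the hardware can spend roughly 160 float operations on each input float per frame, which supports the "more than enough" claim, provided the compiler actually keeps the vector units busy.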
h0bby1 wrote:
> The more general purpose the language, the harder it is for the compiler to recognize specific operations that can be accelerated or parallelized.

No. Just use annotations, and that's all that is required.
h0bby1 wrote:
> And with opengl ES 2.0, GLSL becomes the core of the whole opengl rendering process

I am glad to see a victory for the fighters against legacy problems. It's just the thing jEmbryoS introduces for all of Java.
h0bby1 wrote:
> For server machines, like databases or file servers, it might be different.

No, it's right as it is for GL. And hopefully such an approach will win the hearts of the whole Java community.
Re: The Mill: a new low-power, high-performance CPU design
embryo wrote:
> In fact a CPU does recognize those data types, but that CPU's name is GPU. So we have just another processor with its own compiler from every hardware vendor. It is exactly the same for a Java OS: any hardware vendor can provide its part of the JIT and get very good speed.

If the target is an Intel CPU, it doesn't recognize matrices or textures, but it would still be pretty easy for a compiler to generate good machine code that uses SIMD, and for the programmer to make the opportunity easy for the compiler to recognize. Yes, you can use arrays; for pixels you can do the saturated addition on each member, checking for potential saturation or using intermediate 16-bit values. But there is very little chance a compiler will be able to produce efficient MMX code for that on its own.
The point is not that you can't make a C, Java, PHP, or JavaScript program do it; people even do quaternion math in JavaScript. The point is how efficiently it will run on the CPU: how efficiently the code can be executed on a given CPU by using the data types and registers the CPU has for those operations.
embryo wrote:
> Frankly, I see no point in special language constructs instead of general data types like integer or float. [...] the vendor-specific compiler (if present) can easily optimize the function by replacing its body with an optimized vector-summation routine.

Well yeah, the point is not that you can't do vector math in any other language; even in QBasic you can do matrices and vectors. It's just faster if the code can use SIMD instructions.
And then, yes, the most used solution nowadays is to have a set of 'libraries' that are more than libraries, in the sense that the compiler still recognizes the semantics being used, as with libc: most C/C++ compilers recognize the libc functions. But that is not really part of the language itself, and it depends on the compiler recognizing the function as such in order to use the most optimized instructions, so that you don't have to write any assembler routine yourself, because the compiler already knows the operation can be implemented with SSE code. If 4D vectors were a built-in type, any compiler compiling the program would know them as such and produce optimized assembler using SIMD instructions. It also determines how well the compiler can detect potential problems, or patterns in the code that make it less suitable for optimization.
Using arrays as vectors and matrices, you can define operators and the whole thing, and have a syntax very close to the shader languages, but the compiler won't necessarily recognize those types as actual vectors and optimize them properly with SIMD instructions all the time.
It's not so much about the capacity of the language to define an algorithm, but the capacity of the compiler to extract the meaningful operations from the code and to generate the assembly that achieves them most efficiently.
GLSL cannot replace C or Java for many things, but the principle is not that specific either: it is specific to DSP-like algorithms working on input arrays of vectors. At least the 'vertex shader' part; the 'fragment shader' is a bit different because it outputs directly to a drawing buffer. But the principle used for vertex shaders could be applied to much more general cases than arrays of vertices or texture coordinates set up by OpenGL.
Or they would need to add more types to C. They already added the complex type, which the compiler is supposed to optimize properly with the right trigonometric code; they would just need to extend that to quaternions, 4x4 matrices, and vectors, and potentially add Fourier primitives, for it to suit most of the operations that can be substantially sped up by specific instruction sets. You wouldn't have to annotate anything, or write any assembly or CPU-specific code, because the compiler would already know what the type is: it's built into the language.
I'm not saying it would be very hard to make a language like that, with the same properties that make GLSL easy to vectorize/parallelize over large arrays, but for now neither Java nor C really offers it. GLSL does. And it would not be too hard to write a GLSL-to-Intel SSE/MMX compiler either, much simpler than writing a full C or Java compiler; it just needs a bit of context handling for inner state and input/output buffers. Beyond that, it would not be hard to write a compiler for GLSL-like code that handles parallelization and vectorization without annotations, pragmas, or anything else specific to a particular hardware target. The compiler recognizes it by default, because how those types are handled is fully part of the language definition, and the operations defined in the language are designed to be close enough to what the hardware can do efficiently.
But the idea can be applied to purposes other than vector math and GLSL-like things: built-in types for strings, or lists of complex types, with complex operations on them defined in a way that is close to what hardware can do efficiently, so the compiler can recognize those operations as such and make the best use of the hardware to execute them. C and Java, even though they are supposed to be high level, don't offer that many complex data types and operations in the language itself.
You'd say that's what assembler is for, if you want to use specific CPU registers and instructions, but it's also supposed to be the goal of a good performance-oriented language to map efficiently onto the kinds of operations the CPU can do, and to have types in the language itself that match CPU registers, to make it easier for the compiler to produce efficient machine code.
And as things stand, there is no language that natively offers data types, and operations on them, that match SIMD. What you can express about parallelization in plain native C or Java is also limited. Beyond that it's about libraries and language extensions, or inline assembly, pragmas, annotations and/or tweaks, JNI, or other such things.
Re: The Mill: a new low-power, high-performance CPU design
h0bby1 wrote:
> But there is very little chance a compiler will be able to produce efficient MMX code for that on its own.

GLSL has its own compiler from each GPU vendor. Why can't a Java OS have its own compiler from each GPU vendor? The situation is exactly the same for Java and for GLSL, so the Java solution can use a vendor-provided compiler to get the best performance. And even if there is no vendor-provided compiler, the standard JIT will compile the default implementation of a vector function and we will still be able to run the program, just a bit less efficiently. GLSL, by contrast, will fail to run anything at all without a vendor-provided compiler. This is the important advantage of Java: it can run even on unknown hardware, just with somewhat worse performance.
h0bby1 wrote:
> Even in QBasic you can do matrices and vectors. It's just faster if the code can use SIMD instructions.

The vendor-provided compilers are the actual drivers of performance. There is no technical problem in providing such a compiler for a Java OS.
h0bby1 wrote:
> It also determines how well the compiler can detect potential problems, or patterns in the code that make it less suitable for optimization.

Suitability is all about the information the compiler has. Annotations provide that information. With a standard compiler (one without knowledge of the annotations), the resulting code will simply perform worse.
h0bby1 wrote:
> because the compiler would already know what the type is: it's built into the language.

The base of any type is always the same: bytes. If we can show the compiler where the required bytes are, the compiler needs no more complex types, "built in" or otherwise.
h0bby1 wrote:
> What you can express about parallelization in plain native C or Java is also limited.

Where are the limits? The limits are in the information the compiler has. Annotations are the means of transmitting information from the developer to the compiler. We can provide any information we want. So there are no limits at all.
Re: The Mill: a new low-power, high-performance CPU design
embryo wrote:
> Where are the limits? The limits are in the information the compiler has. Annotations are the means of transmitting information from the developer to the compiler. We can provide any information we want. So there are no limits at all.

It's not only a problem of having a dedicated compiler for the architecture; the compiler must also be able to extract meaningful information from the language in order to optimize it.
The annotations are the means of information transmission from developer to the compiler. We can provide any information we want. So - there are no limits, at all.h0bby1 wrote:And what you can define regarding paralellization with plain native C or java is also limited.
Say you want to perform a series of linear algebra operations on a vector. You would have to write it with a syntax like
vec4 my_vec;
vec4 my_transformed_vec;
mat4 my_mat;
mat_mul(my_vec,my_mat,my_transformed_vec);
Okay, so far so good. You could do that with operators and have
my_transformed_vec=my_vec*my_mat;
my_transformed_vec+= something;
cross(my_transformed_vec,another_vec);
mat_mul(my_mat,another_mat);
my_transformed_vec*=another_mat;
Okay, nothing special here; you could come up with that kind of syntax in pretty much any language.
Now if you want that to work in C, Java, or any other language, you have to define the operators for it; each use makes a call, potentially stores variables in temporary locations on the stack, and makes copies. You can't really do a mat4*mat4 operation in C without using a temporary matrix, for example.
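As a concrete sketch of that last point (the type and function names here are illustrative, not from any real library): an in-place C mat4 multiply would read entries it has already overwritten, so a temporary copy is unavoidable:

```c
#include <string.h>

/* Hypothetical column-major 4x4 matrix type, for illustration only. */
typedef struct { float m[16]; } mat4;

/* out = a * b. The temporary lets out safely alias a or b --
   without it, an in-place mat4*mat4 would overwrite source
   entries that later products still need. */
void mat_mul(mat4 *out, const mat4 *a, const mat4 *b)
{
    mat4 tmp;
    for (int col = 0; col < 4; col++)
        for (int row = 0; row < 4; row++) {
            float s = 0.0f;
            for (int k = 0; k < 4; k++)
                s += a->m[k * 4 + row] * b->m[col * 4 + k];
            tmp.m[col * 4 + row] = s;
        }
    memcpy(out, &tmp, sizeof tmp);
}
```

A native-type-aware compiler could elide that copy when it proves the operands don't alias; a plain C function can't.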
Compare that to a language that recognizes these as native types: you wouldn't have to define any of those functions at all. The compiler would recognize them as native types and generate good assembly for them. You would not need to annotate anything, use any pragma, or write anything special at all for the compiler to recognize those operations, generate optimized assembly, and potentially do error checking, without having to worry about anything specific to the compiler.
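For what it's worth, GCC and Clang already give a taste of this through their (non-standard) vector extensions: declare the type's layout once, and plain arithmetic operators compile straight to SIMD instructions where available, with no operator functions, temporaries, or intrinsics in the source:

```c
/* GCC/Clang vector extension: a 16-byte vector of four floats.
   Not standard C, but supported by both compilers. */
typedef float vec4 __attribute__((vector_size(16)));

/* Component-wise ops use ordinary operators; on x86 the compiler
   can emit mulps/addps directly instead of a function call. */
vec4 madd(vec4 a, vec4 b, vec4 c)
{
    return a * b + c;
}
```

This is exactly the "compiler knows the type" situation the post describes, just bolted onto C as an extension rather than part of the language.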
If there is no native type, you have to use some kind of static __inline functions and xmm intrinsics to reach the same level of optimization, so that the compiler can keep all the vectors and matrices in registers when needed, eliminate the temporary variables a C/C++/Java implementation would use, detect potential errors (uninitialized variables, etc.), and eventually optimize the arithmetic on a whole-routine basis, which would make the optimization process much more straightforward.
Even if you write optimized C code, or code that the compiler could potentially optimize with SIMD, you have no guarantee it will do so. It isn't written anywhere in any C or Java language specification how a compiler is supposed to optimize vector math, what kind of code the compiler will recognize as SIMD-friendly, or how it will manage successive calls.
Given how small these routines are with SSE4, the whole thing could be totally inlined and would barely take more instructions than making the call, with no memory or stack access involved anywhere; everything can be kept in registers.
If you had to do a mat mul with the same matrix on 1000 vectors, the matrix could be kept entirely in registers the whole time. If you make calls to external functions or operators, there is very little chance the compiler will really manage that.
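A sketch of that batched pattern with SSE intrinsics (assuming an x86 target with SSE; the function and variable names are made up for the example): the four matrix columns are loaded once and can stay in xmm registers across every vector in the batch:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Transform n packed vec4s by one column-major mat4. The columns
   c0..c3 are loaded once before the loop, so the matrix can live
   in registers for the whole batch -- no per-vector reloads. */
void transform_all(float *out, const float *vecs, const float *mat, int n)
{
    __m128 c0 = _mm_loadu_ps(mat + 0);
    __m128 c1 = _mm_loadu_ps(mat + 4);
    __m128 c2 = _mm_loadu_ps(mat + 8);
    __m128 c3 = _mm_loadu_ps(mat + 12);
    for (int i = 0; i < n; i++) {
        const float *v = vecs + 4 * i;
        __m128 r = _mm_mul_ps(c0, _mm_set1_ps(v[0]));
        r = _mm_add_ps(r, _mm_mul_ps(c1, _mm_set1_ps(v[1])));
        r = _mm_add_ps(r, _mm_mul_ps(c2, _mm_set1_ps(v[2])));
        r = _mm_add_ps(r, _mm_mul_ps(c3, _mm_set1_ps(v[3])));
        _mm_storeu_ps(out + 4 * i, r);
    }
}
```

If the matrix instead arrived through an opaque function or operator call per vector, the compiler could not hoist those loads out of the loop.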
But again, it's just an example to say that the language used should be made to match the kind of operations the CPU can do quickly. Writing it in C or Java syntax won't necessarily make the compiler do the smartest or most efficient thing; even if you annotate functions separately, it will not necessarily optimize successive calls unless you __inline everything. And inlining plain C code for this could be pretty expensive if the compiler doesn't use the SSE instructions as well. So unless you really use __inline plus xmm intrinsics, and you are sure every compiler will use SSE math, you can't really inline it either.
The only limit is the time you spend writing and debugging the code.
With GLSL that time is very low, because you can't make many errors with the language. You don't have to care whether you used the right annotation or pragma in the right place, or whether you wrote your routine with the right static __inline attribute, because the language already defines the nature of the operations and the compiler can't be mistaken about them. And you can control the behavior of the compiler easily through well-documented compilation options, whether you need to optimize for size, speed, or anything else.
Again, it's not that you couldn't achieve something similar with a monstrous amount of #pragma directives and conditional compilation; it's just much less convenient, much less secure, much less portable, etc.
GLSL can be used on anything ranging from a Windows PC to a Mac, Linux, or Android device. It's completely portable, and the compiler can easily generate optimized code for any platform that has SIMD instructions.
And the GLSL syntax could be compiled for Intel using SIMD without much problem either. You can do everything a GLSL program does on the main CPU; the compiler simply takes care of how the whole program gets compiled, using SIMD instructions where they are available.
And you don't have to worry at all about how those functions will actually be compiled, because the compiler already knows how to compile them to efficient code. That will not be the case if you use custom/user-defined types and operators and perform the operation as a call to an external function that the compiler isn't even necessarily supposed to have access to. Again, unless you __inline everything for the case where the CPU can do the operation in a small and efficient manner, that requires conditional compilation, eventually a bunch of #defines to decide whether those functions should be declared inline or not, and a whole mess that can be avoided altogether if the compiler recognizes the data types and operations as native and its behavior can be controlled with more general options.
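The conditional-compilation route being criticized looks roughly like this (a minimal sketch; a real project header accumulates many more of these branches, one per compiler and instruction set):

```c
/* One of these per operation, per instruction set, per compiler --
   and the portable fallback must always exist beside the fast path. */
#if defined(__SSE__)
#include <xmmintrin.h>
static inline void vec4_add(float *out, const float *a, const float *b)
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
#else
static inline void vec4_add(float *out, const float *a, const float *b)
{
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];  /* scalar fallback */
}
#endif
```

With a native vector type, both branches and the inline plumbing disappear, which is the post's point.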
But I bring up SIMD and GLSL to show that, in the current state of things, it's already very hard to get a compiler to optimize code for instruction sets that are many years old, using languages like C that are decades old. Compilers still can't really manage this efficiently starting from plain C or Java; it's always sort of a mess, even for something relatively trivial like vec4/mat4 or the other things defined in GLSL, which could equally be compiled for Intel or any CPU with SIMD instructions.
So I have a hard time imagining how a compiler could do a good job of managing the complex kind of algorithm the Mill requires. If it takes a bunch of pragmas, defines, and a whole lot of weird stuff in the C code for the compiler to take full advantage of the features, I'm not sure that's really so great. Nor is it clear how much the compiler can exploit the Mill's features starting from an average, regular C program, or how much the Mill would benefit from a language designed specially for it, with data types and operators that match the kind of operations it can do.
Re: The Mill: a new low-power, high-performance CPU design
A native type is just information. That information is consumed by a compiler and translated into some low-level code. With annotations the path is exactly the same: the information is consumed by a compiler and good machine code is delivered. In both cases the compiler knows the possible variants of the information; in both cases it has the required information, just in a different form. Since the cases are identical apart from surface syntax, nothing prevents us from achieving the same results. But with Java we have a standard-compatible solution, while GLSL is a different language, incompatible with its ancestor (C). Another point: in Java we can define a default implementation and manage to do our job without a special compiler, but with GLSL the situation is much worse: if there is no special compiler, there is no solution at all.

h0bby1 wrote:compared to if the language would recognize those as native type, you wouldn't have to define any of those function at all. The compiler would recognize it as native type, and would generate the good assembler to do that.
Annotations are safe entities: the default Java compiler still checks type compatibility and everything else related to safety. So if we replace some special native types with annotations, there will be no difference in the compiler's help with error checking and bug hunting. Alternatively, we can just define some new objects that the special compiler knows about; in that case we have no need for annotations at all. And the fact that the objects are not native in no way prevents us from using them as required, because the default compiler will treat the new objects like any others, while the special compiler can recognize them as a case for predefined optimization. There's just no need for native data structures when we can use structures/objects derived from the standard language base.

h0bby1 wrote:And you would no need to annotate anything, use any pragma, or write any special at all for the compiler to recognize those operation, and generate optimize assembler, potentially do error checking, and all that. Without you have to put any annotation, or to worry about anything specific to the compiler.
Everything you have mentioned above is possible with annotations or special types, without any intrinsics or anything else that breaks the language standard.

h0bby1 wrote:If there is no native type, you'll have to either use some kind static __inline, and xmm instrinsic, to get it to the same level optimization, for that the compiler can keep all the vectors and matrix on the registers when needed, and eliminate any temporary variable that could be used in the C/C++/java implementation, detect potentiall error (uninitialized variables etc), eventually optimizing the whole arithmetic on the whole routine basis, and would make the optimization process much more straightforward.
It's write once, run anywhere: we have a default implementation acceptable to every standard compiler, but when a special compiler is present we get all the performance we could ever need.

h0bby1 wrote:Even if you could write optimized C code, or code that the compiler could potentially optimize with the SIMD, you have no garantee it would do so, it's not wrote anywhere in any C or Java
Let's compare: if GLSL has some native type, what prevents us from having exactly the same data structure in C or Java? Nothing at all. Next: what is so inconvenient about working with such data structures/objects in C or Java? What is less secure? And what is less portable?

h0bby1 wrote:Again it's not that you couldn't come with similar thing using some monstruous amount of #pragma conditional compilation and stuff, but it's just much less convenient , much less secure , much less portable etc.
The thing that understands a native type, or a newly introduced object, is the compiler. If the compiler knows about our object, there are no more problems. It's just that simple.
Absolutely the same can be said about Java, even without SIMD.

h0bby1 wrote:GLSL can be used on anything ranging from a windows PC, a mac, a linux, an android, or anything. It's completely portable, and the compiler can generate optimized code easily for any platform that has the SIMD instruction.
Why do you think the compiler should not have access to some useful functions? It is a special compiler, written to handle special cases; it simply must have access to any desired function.

h0bby1 wrote:And you don't have to worry at all about how those function will be actually compiled, because the compiler already know how to compile them to efficient code. Which will not be the case if you use custom/user defined type and operators and do the operation as a call to an external function that the compiler is not even specially supposed to have access to.
Provide the compiler with the corresponding algorithm and the required information, and it will manage even the seemingly unimaginable.

h0bby1 wrote:So i have hard time to imagine how a compiler could do a good job at managing a complex kind of algorithm like the Mill provide.