Re: The Mill: a new low-power, high-performance CPU design
Posted: Tue Mar 11, 2014 11:52 pm
Combuster wrote: <x86 implementation>
Rusky wrote: Try moving a maximum vector of bytes per cycle.

Show me an example of the mill doing that.
The Place to Start for Operating System Developers
https://f.osdev.org/
h0bby1 wrote: Maybe either they should do cpu that are really made for a particular language to make it easy for the compiler to generate optimized code for the cpu

It would actually complicate the compiler's job. For example, Java bytecode is a high-level entity and as such it simply does not help the compiler: for the compiler it is almost the same whether it consumes Java source code or its compiled bytecode form. So if you enforce some high-level construct in hardware, the compiler becomes unable to change the algorithm behind that construct and will struggle for performance.
Rusky wrote: The loop isn't "split into two parts" with smearx any more than it is with a conditional branch instruction.

The split is shown in the pictures in the blog you referenced above.
Rusky wrote: Nor is there any reason a Mill loop must always use smearx. It's just another tool like cmp, jnz, etc.

So why should the processor allocate part of its silicon to a smearx implementation? Is it worth it?
embryo wrote: It actually will complicate compiler's job. [...]

For a Java program the optimization goals would be different. On the level of what a compiler (or a not-so-short-sighted bytecode executor) can do, I think a large part of what cripples Java performance is constantly having to deal with copy constructors: every operation is done on objects, which are often copied and constructed many times during execution. Hardware support for this mechanic could help - for example, some sort of "object granularity" instead of "word granularity" in the memory layout, possibly making some operations on those objects atomic, and manipulating memory areas and references to them as objects rather than through the word/register-sized operations that the current assembler/memory-unit model offers. A CPU could handle a certain number of such things in hardwired mode and probably improve performance, e.g. by accelerating access to arrays of simple-typed objects; the compiler could feed the CPU the actual memory layout, and some copy/construct operations and a certain number of other Java operations could be accelerated.
Combuster wrote: Show me an example of the mill doing that.

The example I linked does:
a can be loaded as a vector and b can be stored as a vector. Any elements of a that the process does not have permission to access will be NaR, but those will only fault if we try to store them.
The a vector can be compared to 0, producing a vector of booleans, which is then smearx'ed. The result can then be picked, together with a vector of None, into b. The smearx offsetting ensures that the trailing zero is copied from a to b. The second result of smearx, recording whether any 0 was found in a, feeds a conditional branch that decides whether another iteration of the vectorised loop is required.
The phasing of the strcpy operations allows all 27 operations to be executed in just one cycle, which moves a full maximum vector of bytes each cycle.
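As a rough illustration only (plain Java, not Mill code), the per-iteration logic described above can be emulated lane by lane. The vector width of 8 and all names here are my own assumptions, and untouched destination lanes stand in for None:

```java
// Lane-by-lane emulation of one iteration of the vectorised strcpy above.
// VEC plays the role of the hardware vector width; src is assumed padded to
// a multiple of VEC bytes past the terminator (on real hardware the Mill's
// NaR machinery is what removes that padding requirement).
class VectorStrcpySketch {
    static final int VEC = 8;

    // Copies one vector's worth of bytes; returns true when the
    // terminating zero was seen (the "second result of smearx").
    static boolean copyStep(byte[] src, byte[] dst, int off) {
        // Compare each lane to 0 -> vector of booleans.
        boolean[] isZero = new boolean[VEC];
        for (int i = 0; i < VEC; i++) isZero[i] = src[off + i] == 0;

        // "smearx": mask[i] is true iff some lane BEFORE i was zero; the
        // one-lane offset is what lets the trailing zero itself be copied.
        boolean[] mask = new boolean[VEC];
        boolean seen = false;
        for (int i = 0; i < VEC; i++) { mask[i] = seen; seen |= isZero[i]; }

        // "pick": store lanes up to and including the zero; lanes after it
        // would be None on the Mill - here we simply leave dst untouched.
        for (int i = 0; i < VEC; i++) if (!mask[i]) dst[off + i] = src[off + i];

        return seen; // drives the conditional branch
    }

    static void strcpy(byte[] src, byte[] dst) {
        int off = 0;
        while (!copyStep(src, dst, off)) off += VEC;
    }
}
```

The difference to the hardware, of course, is that these three loops happen as single wide operations in one cycle on the Mill, rather than lane by lane.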
h0bby1 wrote: i think a part of what cripple java performance is to often have to deal with copy-constructor, and having all operation done on object, and often copy/constructed many time alongside the execution

No, it is an optimizable part of the problem. It's just like in C: when a developer uses a structure, there is no object-related overhead.
h0bby1 wrote: java was not made specially for computational performance

It seems Java was made to fix problems with C, like pointers and other unsafe features. So performance was neither decreased nor increased. But eliminating the pointer problem helps decrease the compiler's complexity. In this way Java still has some unused potential.
h0bby1 wrote: With GLSL, the synthax is still quite close to something you could come up with in C

If it is close to C, then what advantages does GLSL have over C? For example, Java is safe and frees the developer from many tedious tasks like memory management and safety checks. And what about GLSL? If it's all about some DSP-related libraries, why not use C or Java?
h0bby1 wrote: Even writing naive GLSL make explicit use of vector type and operation that the compiler can easily match to optimized cpu instruction

It's like adding a vector library to C or Java and upgrading the compiler so it understands that the vector can be optimized efficiently. Why do we need GLSL?
h0bby1 wrote: as programmer will express their intention in the context of a compiled/interpreted language most of the time, they will not necessarily be expressed in term that are easily translated for specific architecture that has specific set of instruction to handle some specific operation, and for me it's almost impossible to have something that really take advantage of cpu feature with languages like C/C++ or java, unless you add some tweak into it

The problem can be solved by introducing a set of libraries, each item optimized for particular hardware. Then it is enough to annotate the Java code, or use C pragmas, for the compiler to select the appropriate library. And, of course, the compiler must be enhanced to understand the performance annotations or pragmas.
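A crude runtime analogue of that library-selection idea can be sketched in plain Java. Everything here - the TunedFor annotation, the DotProduct interface, the architecture strings - is my own illustration, not an existing API; a real solution would do the selection inside the JIT rather than via reflection:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.List;

// Hypothetical sketch: tag each implementation with the hardware it is tuned
// for, and pick the best match at run time, falling back to a portable default.
class LibrarySelection {
    @Retention(RetentionPolicy.RUNTIME)
    @interface TunedFor { String arch(); }

    interface DotProduct { double dot(double[] a, double[] b); }

    // Portable default implementation - runs on any hardware.
    @TunedFor(arch = "generic")
    static class Scalar implements DotProduct {
        public double dot(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) s += a[i] * b[i];
            return s;
        }
    }

    // Return the first implementation annotated for the given architecture,
    // or the scalar fallback when none matches.
    static DotProduct select(String arch, List<DotProduct> candidates) {
        for (DotProduct impl : candidates) {
            TunedFor t = impl.getClass().getAnnotation(TunedFor.class);
            if (t != null && t.arch().equals(arch)) return impl;
        }
        return new Scalar();
    }
}
```

The fallback mirrors the point made below: a vendor-specific implementation is a bonus, but the default one keeps the program runnable everywhere.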
h0bby1 wrote: but the problem with tweaks is that you never know how another compiler will interpret it, and you break a little the global coherency of the program definition, and loose many benefit of what a compiler can offer in term of error detection and optimization.

With Java it is different. There are actually two compilers: the first translates source text into bytecode and ensures "the global coherency of the program definition"; the second - let's call it the JIT for simplicity - can use any annotations the bytecode carries and, in turn, link a particular hardware-dependent library without breaking any coherence. And, of course, such a compiler chain gives us hardware independence, which is a really nice advantage.
embryo wrote: If it is close to C, then what advantages GLSL has over the C? [...]

Because the language recognizes, and forces the use of, specific types that can be mapped to registers for pixels and vectors; it also has types for matrices and textures, and the compiler can easily map operations on them to fast SSE operations on SIMD units. With C you can do this using inlined SSE intrinsics, but then it's not really C any more; and if you write it in plain C without any inline assembly or intrinsics, there is very little chance the compiler will really make the best use of SIMD - nor is it really supposed to.
h0bby1 wrote: because the language recognize and force the use of specific type that can be mapped to registers for pixels and vectors, and have also type for matrices, texture, and the compiler can easily map those operation to fast sse operations on SIMD things

Then such a language is very tightly bound to the hardware - a kind of assembler. Then why not use assembler? And the compatibility benefits that all high-level languages share are lost with such a language.
h0bby1 wrote: With C you can do it either using inlined SSE intrinsic , but then it's not really C anymore

Ok, then there is Java
h0bby1 wrote: If you upgrade the compiler to recognize specific things in the language, then it become something else than java or C

In the case of Java it is not a problem. The language is untouched; only the execution environment is updated. And since the execution environment always runs on some particular hardware, it is natural to support hardware-specific things at runtime, as the JIT does.
embryo wrote: Then such language is very tightly bound to the hardware. It's a kind of Assembler. Then why not to use Assembler? [...]

No, it's not a kind of assembler at all. It's still very high level, because there is a whole context associated with the shader, which makes it behave much like a DSP with inputs and outputs, and it allows very easy parallelisation of the code over large arrays. Its syntax is much more convenient than pure assembler, and it can be compiled at run time for the specific CPU without problems.
So I see the Java way as much more compatible and safer, and its level of abstraction is higher than in the other solutions.
h0bby1 wrote: no it's not a kind of assembler at all, it's still very high level

And it even has its own bytecode - I mean the strings which OpenGL sends to the GLSL compiler at runtime.
h0bby1 wrote: And the language support many data types that are not specially recognized by cpu, like matrices, or functions to interpolate pixels from a texture.

In fact a CPU does recognize those data types - but that CPU is called a GPU. So we have just another processor with its own compiler from every hardware vendor. It is exactly the same for a Java OS: any hardware vendor can contribute its part of the JIT and get very good speed.
h0bby1 wrote: And many operation that are still plain high level.

And now, if we look at the picture without the hardware, we have two successors of the C language: Java and GLSL. Partly it is personal preference that sticks you to a particular language, but it should be noted that the Java developer community is much bigger than GLSL's, and Java has much richer OOP capability. It's just a higher-level thing.
h0bby1 wrote: The main thing that make GLSL specific is that it expect input array from opengl as vertex/normal/texcoord array, but it could be extended to support other type of array as input, to work on list of strings or whatever else

With Java it is no less easy to introduce any useful GL type, like matrices or vectors. And, of course, the type set can be extended too.
h0bby1 wrote: it's more the principle to have a language that has built-in types and operators that allow to write some complex/high level operation in a way that is easy to compile to machine code in optimized manner and paralellize.

Frankly, I see no point in special language constructs instead of general data types like integer or float. It is perfectly possible to define a structure with a root class of 3 doubles to represent a vector. And it is also very easy to annotate a method like this:
Code:
@VectorSummation(vector1ParamIndex=0, vector2ParamIndex=1)
public VectorSuccessor add(VectorSuccessor v1, VectorSuccessor v2)
{
...
}

Having such a method, we can provide its default implementation for any hardware and default compiler, but the annotation also tells a vendor-provided compiler that there are two vectors and that the result of the summation should be returned as another vector. So the vendor-specific compiler (if present) can easily optimize the function by replacing its body with an optimized vector summation routine. And there's no need to compile the whole GL-related program from a string representation - we need to recompile just some annotated methods.
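For reference, such an annotation could itself be declared as an ordinary Java annotation type. This is only a sketch: @VectorSummation is the hypothetical name from the example above, not a standard API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical marker consumed by a vendor-specific compiler/JIT: it flags a
// method as a vector summation and says which parameters hold the operands.
@Retention(RetentionPolicy.RUNTIME) // keep it visible to a JIT at run time
@Target(ElementType.METHOD)
@interface VectorSummation {
    int vector1ParamIndex();
    int vector2ParamIndex();
}
```

The RUNTIME retention is the important design choice: it is what lets a JIT inspect the annotation while compiling the method, rather than only a source-level tool.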
h0bby1 wrote: but if you have some heavy floating point vector math to do, on array of hundred of thousand if not millions of floating point 4D vectors, that need to be done about 50 time per second, then bye bye java.

Even on a general CPU like Intel's chips, a 20 millisecond interval is a very long time. For 3 million 32-bit floats (a million 3D vectors), with 256-bit SIMD registers available for parallel operations, we have about 160 processor cycles per vectorised float operation - more than enough for a multiplication or division. The only problem is that we must tell the compiler where our 3D vectors are. With primitive arrays it is very easy, but it is still possible even with objects.
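One way to reproduce that budget (the 3 GHz clock is my assumption; the post only fixes the 20 ms interval and the data size): 20 ms at 3 GHz is 60 million cycles, and 3 million floats in 8-lane 256-bit operations is 375,000 vector operations, i.e. 160 cycles each.

```java
// Back-of-envelope cycle budget for the scenario above. The 3 GHz clock is
// an assumption; the post only gives the 20 ms interval and the data size.
class CycleBudget {
    static long cyclesPerVectorOp() {
        long clockHz = 3_000_000_000L;      // assumed CPU clock
        double intervalSec = 0.020;         // 20 ms per pass (50 updates/s)
        long floats = 3_000_000;            // one million 3D vectors of floats
        int lanes = 256 / 32;               // 8 single-precision lanes per op
        long cycles = (long) (clockHz * intervalSec); // 60,000,000 cycles
        long vectorOps = floats / lanes;              // 375,000 vector ops
        return cycles / vectorOps;                    // 160 cycles each
    }
}
```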
h0bby1 wrote: The more general purpose the language is, the harder it is for compiler to recognize some specific operations that can be speed up by the hardware or paralellized.

No - just use annotations, and that's all that's required.
h0bby1 wrote: And with opengl ES 2.0, GLSL become the core of the whole opengl rendering process

I am glad to see the victory of the legacy-problem fighters. It's just the thing that jEmbryoS introduces for all of Java.
h0bby1 wrote: For server station, like databases, or files server, it might be different.

No, it's right as it is for GL. And hopefully such an approach will win the hearts of the whole Java community.
embryo wrote: In fact the CPU recognizes those data types, but the name for CPU is GPU. [...]

If the target is an Intel CPU, it doesn't recognize matrices or textures, but it would still be pretty easy for a compiler to generate good machine code that uses SIMD, and for the programmer to make that easy for the compiler to recognize. Yes, you can use arrays; for pixels, yes, you can do the saturated addition on each member, checking for potential saturation or using intermediate 16-bit values. But there is very little chance the compiler will be able to produce efficient MMX code to handle it properly.
embryo wrote: Frankly, I see no point in some special language constructs instead of general data types like integer or float. [...] Having such method we can provide it's default implementation for any hardware and default compiler, but also the annotation tells the vendor provided compiler, that there are two vectors and the result of the summation should be provided as another vector for function return value. So the vendor specific compiler (if it is present) can easily optimize the function by replacing it's body with optimized vector summation routine. [...]

Well, yeah - the point is not that you can't do vector math in any other language. Even in QBasic you can do matrices and vectors. It's just faster if the code can use SIMD instructions.
h0bby1 wrote: And there is very little chance the compiler will be able to produce efficient mmx code to handle it properly.

GLSL has its own compiler from each GPU vendor - why can't a Java OS have its own compiler from each GPU vendor? The situation is exactly the same for Java as for GLSL, so the Java solution can use a vendor-provided compiler to get the best performance. Even without a vendor-provided compiler, the standard JIT will compile the default implementation of a vector function and we can still run the program, just somewhat less efficiently. GLSL, by contrast, will fail to run anything at all without a vendor-provided compiler. This is an important advantage of Java: it can run even on unknown hardware, merely with somewhat worse performance.
h0bby1 wrote: Even in Qbasic you can do matrices and vectors. It's just faster if the code use can use SIMD instruction.

The vendor-provided compilers are the actual drivers of performance. There is no technical problem in providing such a compiler for a Java OS.
h0bby1 wrote: And how much the compiler can detect potential problems or optimization on the code that would make it less suitable for optimization.

Suitability is all about the information the compiler has. Annotations provide that information. With a standard compiler (one without knowledge of the annotations) the resulting code will simply perform worse.
h0bby1 wrote: because the compiler already know what it is because it's a type that built in the language.

The base of any type is always the same: bytes. If we can show the compiler where the required bytes are, the compiler needs no more complex types, whether "built in" or anything else.
h0bby1 wrote: And what you can define regarding paralellization with plain native C or java is also limited.

Where are the limits? The limits are in the information the compiler has. Annotations are the means of transmitting information from the developer to the compiler. We can provide any information we want. So there are no limits at all.
embryo wrote: Where are the limits? The limits are in the information the compiler has. [...] So - there are no limits, at all.

It's not only a problem of having a personal compiler for the architecture: the compiler must also be able to extract meaningful information from the language in order to optimize it.
h0bby1 wrote: compared to if the language would recognize those as native type, you wouldn't have to define any of those function at all. The compiler would recognize it as native type, and would generate the good assembler to do that.

A native type is just information. That information is consumed by a compiler and translated into low-level code. With annotations the path is exactly the same: the information is consumed by a compiler and good machine code is delivered. In both cases the compiler knows the possible variants of the information; in both cases it has the required information, just in a different form. Two identical cases that differ only in surface syntax do not prevent us from achieving the same results. But with Java we keep a standard-compatible solution, while GLSL is a different language, incompatible with its ancestor (C). Another point: in Java we can define a default implementation and do our job without a special compiler, while with GLSL the situation is much worse - no special compiler, no solution at all.
h0bby1 wrote: And you would no need to annotate anything, use any pragma, or write any special at all for the compiler to recognize those operation, and generate optimize assembler, potentially do error checking, and all that. Without you have to put any annotation, or to worry about anything specific to the compiler.

Annotations are safe entities: the default Java compiler still checks type compatibility and all the other safety rules. So if we replace some special native types with annotations, there is no difference in the compiler's help with error checking and bug hunting. Alternatively, we can define some new classes that the special compiler knows about; in that case we need no annotations at all. That the classes are not native in no way prevents us from using them as required: the default compiler treats the new classes like any others, while the special compiler can recognize them as a case for predefined optimization. There's simply no need for native data structures when we can derive the structures/objects we need from the standard language base.
h0bby1 wrote: If there is no native type, you'll have to either use some kind static __inline, and xmm instrinsic, to get it to the same level optimization, for that the compiler can keep all the vectors and matrix on the registers when needed, and eliminate any temporary variable that could be used in the C/C++/java implementation, detect potentiall error (uninitialized variables etc), eventually optimizing the whole arithmetic on the whole routine basis, and would make the optimization process much more straightforward.

Everything you mention above is possible with annotations or special types, without any intrinsics or anything else that breaks the language standard.
h0bby1 wrote: Even if you could write optimized C code, or code that the compiler could potentially optimize with the SIMD, you have no garantee it would do so, it's not wrote anywhere in any C or Java

It's write once, run anywhere: we have a default implementation acceptable to all standard compilers, but when a special compiler is present we get all the performance we could ever need.
h0bby1 wrote: Again it's not that you couldn't come with similar thing using some monstruous amount of #pragma conditional compilation and stuff, but it's just much less convenient , much less secure , much less portable etc.

Let's compare: if GLSL has some native type, what prevents us from having exactly the same data structure in C or Java? Nothing does. Next: what is so inconvenient about working with such data structures/objects in C or Java? What is less secure? And what is less portable?
h0bby1 wrote: GLSL can be used on anything ranging from a windows PC, a mac, a linux, an android, or anything. It's completely portable, and the compiler can generate optimized code easily for any platform that has the SIMD instruction.

Absolutely the same can be said about Java - even without SIMD.
h0bby1 wrote: And you don't have to worry at all about how those function will be actually compiled, because the compiler already know how to compile them to efficient code. Which will not be the case if you use custom/user defined type and operators and do the operation as a call to an external function that the compiler is not even specially supposed to have access to.

Why do you think the compiler should not have access to some useful functions? It is a special compiler, written to handle special cases; it simply must have access to any function it needs.
h0bby1 wrote: So i have hard time to imagine how a compiler could do a good job at managing a complex kind of algorithm like the Mill provide.

Provide the compiler with the corresponding algorithm and the required information, and it will manage even the unimaginable.