The Mill: a new low-power, high-performance CPU design

embryo · Post by **embryo** » Mon Mar 10, 2014 9:16 am

Rusky wrote:So you move the actual copying logic into the CPU so the program doesn't have to copy byte by byte?

There's no logic, it's just an order to do things the hardware can do. The bus access, the memory access, the synchronization of both - all this is not efficient to control from a program. It's just atomic operations. If you name any atomic and simplest operation a 'logic' then you should believe in smart things around you - your chair is holding you with some very interesting logic involved in the process.

But it is much better to say the program controls things like processor or memory/bus controller.

Rusky wrote:In the Mill's case, however, it can also do the discovery-of-how-much-to-copy at the same time (and speed) as the actual copy

The Mill has one simple algorithm implemented, is having such algorithm enough to claim victory in a fight against something smart? An amoeba can claim it's victory in a competition with a human because it has at least some form of very primitive logic too. Is such a win sane?

Rusky wrote:Your example would have to scan the string a byte at a time first.

My example was about the speed the interaction entities can provide a dumb processor with data. For the compiler the limit is the mentioned speed. For the Mill the limit is it's single and simple algorithm. If the system has high memory bandwidth the Mill is just unable to utilize it, but the compiler, while using a slave processor, can.

Combuster · Post by **Combuster** » Mon Mar 10, 2014 11:57 am

Rusky wrote:as opposed to dropping down to byte-by-byte speeds to avoid overshooting

There's a certain amount of nonsense involved with forcing byte reads. For x86, the C compiler is always free to read the entire cache line when it needs one byte of it, and it would do so without failing or causing other read errors (you should have used volatile or avoided segmentation otherwise). Thus you can perform checks and operations on all 16 bytes at once and find your exit condition in chunks rather than forcing sequential ops for some silly reason.

If you didn't find your exit condition, you can also write the whole bunch of output if you need it.

Rusky · Post by **Rusky** » Mon Mar 10, 2014 4:40 pm

embryo wrote:There's no logic, it's just an order to do things the hardware can do. The bus access, the memory access, the synchronization of both - all this is not efficient to control from a program. It's just atomic operations. If you name any atomic and simplest operation a 'logic' then you should believe in smart things around you - your chair is holding you with some very interesting logic involved in the process.

Oh okay, then the Mill just supports a couple of really useful atomic operation called "smearx" and "store None", whose use enables faster and simpler execution of the logic from the compiler. Happy with my definitions now? And no, the compiler cannot implement smearx and store None as efficiently as the hardware.

embryo wrote:My example was about the speed the interaction entities can provide a dumb processor with data. For the compiler the limit is the mentioned speed. For the Mill the limit is it's single and simple algorithm. If the system has high memory bandwidth the Mill is just unable to utilize it, but the compiler, while using a slave processor, can.

The Mill's limit is higher than your compiler-is-always-better CPU, because it executes the full loop in a single instruction (note this is VLIW-esque so a single instruction includes multiple operations that can be composed many ways. The Mill is not limited to this specific algorithm- the compiler can vectorize a great many more loops on the Mill than it can on other architectures). Your example still requires the program to scan for the null terminator in a separate loop, using many instructions.

Combuster wrote:There's a certain amount of nonsense involved with forcing byte reads. For x86, the C compiler is always free to read the entire cache line when it needs one byte of it, and it would do so without failing or causing other read errors (you should have used volatile or avoided segmentation otherwise). Thus you can perform checks and operations on all 16 bytes at once and find your exit condition in chunks rather than forcing sequential ops for some silly reason.

If you didn't find your exit condition, you can also write the whole bunch of output if you need it.

The advantage the Mill has here is smearx- where it finds the null terminator, it marks all the subsequent data in the vector as not-to-be-written. Because of the Mill's phasing, smearx does in a single instruction (along with the cmp, pick, branch, and store ops, all of which issues together in a single cycle and takes only 5 cycles to move all the way through the pipeline), what your compiler-is-always-better CPU would take multiple dependent instructions to do (to resolve which null terminator is first), even within a single cache line.

The Mill also has byte-granularity protection, in a single address space (making IPC waaay simpler, faster, and safer). Vectorizing strcpy in a single instruction is still possible in this environment because loading a protected value does not fault, it simply returns NaR, which only faults if you try to store it (otherwise it just propagates more NaRs, with metadata so the debugger can find their origin). So the strcpy store would only fault if the null terminator didn't occur before the protected memory, which is correct. Not so in a compiler-is-always-better design.

embryo · Post by **embryo** » Mon Mar 10, 2014 5:57 pm

Rusky wrote:then the Mill just supports a couple of really useful atomic operation called "smearx" and "store None", whose use enables faster and simpler execution of the logic from the compiler

From the algorithm point of view there are atomic operations of a basic logic like OR, AND, XOR. One more basic operation is MOV. When you talk about smearx there are a number of such atomic basic operations involved. When I talk about memory access there is just one operation - the copying of a number of bits from one location to another. In such sense it is atomic. When you take a set of atomic operations as another atomic operation then there should be some reason for it. Your reason is the location of algorithm - it is placed in the Mill. And I tell you that if the compiler provide processor with algorithms then there will be more efficient system. It leads to the conclusion about leaving only atomic operations in the processor, because every compound operation can be provided by the compiler. It's just reducing of redundancy. The only reason for a compound operation to be implemented in a processor is a great number of it's calls, then we can reduce number of program/algorithm bytes loaded by a processor every time the compound instruction is required. Such operations can be division or multiplication, for example. But smearx is not such frequently used operation. And beside of number of use there is the efficiency concern - smearx fills ALL flags after a starting one with the same values, but if there are more than one values to be skipped? So we have to place another algorithm somewhere. And, obviously, the best place for it is the compiler.

Rusky wrote:this is VLIW-esque so a single instruction includes multiple operations that can be composed many ways

The basic capability of executing many simple operations in parallel batch is a good capability. It allows the complier to get more tight control over processor and deliver really great performance. But when you start to implement some complex operations it leads you to the x86 architecture or something alike. You reduce compiler's ability to select best algorithm for a particular task. The x86 also hides atomic operations from compiler and exposes mostly compound operations. So the VLIW-esqueness is compromised and in the end we will have another CISC processor.

Rusky wrote:The Mill is not limited to this specific algorithm- the compiler can vectorize a great many more loops on the Mill than it can on other architectures

And this approach should be extended as much as possible.

Rusky · Post by **Rusky** » Mon Mar 10, 2014 7:58 pm

So multiplication was pulled into the CPU because, even though it's not really an indivisible operation that the compiler can't use multiple algorithms for, it's so much more efficient to put it in there than having the compiler do it, and it's used all the time so it's worth using transistors for it. I've got news for you:

smearx is a while loop implemented in the CPU.

Are while loops not widely-used enough for you? Do compilers gain great performance advantages by using different algorithms for while loops all over the place? Is it better for the program to loop through a vector using many instructions, multiple registers, the branch predictor, and the cache, rather than loop within a single instruction operating on registers? How does using smearx make it harder for the compiler to use different looping logic? Why do we use a TLB instead of just trapping to the kernel on every single read? Why don't we use CPUs with only one instruction that are still turing complete?

embryo wrote:So the VLIW-esqueness is compromised and in the end we will have another CISC processor.

No, because on a CISC you only get the combinations the CPU provides. The whole point of a VLIW is to split the operations out and expose them to the compiler so it can specify exactly what to do and when. Doesn't that sound like something your compiler could use to its advantage?

embryo wrote:
Rusky wrote:The Mill is not limited to this specific algorithm- the compiler can vectorize a great many more loops on the Mill than it can on other architectures
And this approach should be extended as much as possible.

Guess how? By using smear, smearx, and pick, which are just implementations of while loops and conditionals in the processor. Kind of like multiplication.

h0bby1 · Post by **h0bby1** » Tue Mar 11, 2014 4:28 am

the main difference between compiler and cpu is that compiler can have better understanding of the program as a whole, so the difference between the two approach matter when the benefit of optimizing the program as whole is superior than the benefits of having very small chunk of operation being done sequentially as fast as possible

it also depend a lot on the kind of language being used, the problem with C is that it has been done on purpose to avoid architecture specific operation, and there is basically no way a C program can be made to take in account architecture specific things, or it come with compiler specific directive, but it's not really part of the language itself

I mean many for trivial loops like memcpy/strcpy it can make a difference, but if you take for example a video station who can have several input video stream, either from direct live aquisition, or from hard drive, networking, and having some software to manipulate the video stream live in real time, it come down also to execute many complex mathematic operation, like DCT's, FFT's, quantizations, and operation that happen on a level of abstraction that only the compiler can understand, i don't see how a cpu could figure out 'hey you are trying to do a 8x8 DCT here, so i'll remake your routine because your code is not efficient to do this'. There is no a way a cpu can do this, and even compilers often have hard time to figure that out from plain regular C.

A good case of a language being able to take in account the architecture is GLSL/shader language, the language is made in sort to map the data type that the cpu can handle, like vectors, textures/pixel etc and the kind of operation the gpu can do, and the compiler is shipped by the gpu manufacturer and the GLSL is compiled for the specific gpu by the same compiler.

The best approach is like intel do with their intel compiler, and kernel math lib for example, and having a library who already expose all the high level function that are useful and can be optimized substantially if using a particular way of representing the data directly inside of the program, and particular way to do the operation, and the represent the whole chain of operation that has to be made in a way for that the compiler can produce optimal code for the particular architecture, and they probably use a bunch of inline or asm routine to figure out the best code path at runtime, C alone can't really do this, unless it would be able to recognize higher level block of code and how a particular cpu can optimize it.

Unless they would do something like GLSL, where the cpu more or less contain it's own compiler, and can do some sort of global analysis of the whole program being run, even in order to do the alignment, and organizing how static routine or data structure are placed in memory, or can be the most efficient use of registers, and to recognize also higher level routine in order to figure what's the best way to implement them, it will always be limited in term of optimization.

Even for the mill, probably it could benefit of something like they do with the intel math/performance libs, to provide also structure to organize the data in a way that is already easy to figure out for the compiler how to make optimized code for the target cpu. Because with plain C, it's very hard to see how a compiler could really take best advantage of the capacity of the cpu to manage some specific complex operation if it's just wrote from scratch in plain C.

They are trying to do this a little with the SSE intrinsic header in C, but it's still limited to simple operations, even if compiler generally do a good job for optimizing those sequential operation on vectors using sse intrinsic, specially if they are inlined, it still can't really manage well complex operation that cpu can do wrote in plain C. It's the same thing with the ALTIVec that there was on ppc in the days.

From the moment the cpu can implement complex operations, and since even the first pentium, and pipelining, mmx, cache levels etc, it will most likely need a compiler to have optimized code for it, i don't see any benefit doing all alignment, taking in account cacheline, and how different data and code can be best layout in memory , doing all this manually in a flat binary assembler program, compared to letting a C compiler doing all this work.

Or the cpu would need to be able to have global understanding of the program in order to auto optimize a whole lot of things, the problem is that nowdays feature of cpus can only be taken advantage of fully using some kind of compiler, and that cpu manufacturer don't really release compiler for their own platform, and don't always provide all what would be necessary for developers or compilers to produce the most optimized code easily.

The cpu can't do much either at runtime to reorganize alignment, global memory layout, so it's always a bit tricky between having general language who don't care about any specific of the architecture, and use mostly scientific/mathematic world types and operations, and not cpu related kind of directive or having cpu data types more exposed into the language itself, otherwise it's very hard either way to really produce optimized code, and for now neither C compiler or cpu can really do a good job at it, until they use some kind of cpu specific library or directive, that are not part of the language itself. Unlike GLSL language who is a language made specially for some kind of gpu in order to make it easy for the compiler to generate optimized code.

embryo · Post by **embryo** » Tue Mar 11, 2014 7:21 am

Rusky wrote:smearx is a while loop implemented in the CPU.

Are while loops not widely-used enough for you?

No. General while loops are not used widely enough. There are always different bodies and very often there are different exit conditions.

Even the operation of division of one number by another can be improved, but it is relatively seldom case. The loop operation can not be generalized enough to fit all possible cases in a processor's hardware logic. Having just one of many loop cases implemented in the Mill is not a good idea.

Rusky wrote:Why don't we use CPUs with only one instruction that are still turing complete?

There's just the efficiency problem that prevents us from doing it. In general the simplest possible hardware is a good idea, but one instruction case is not as simple as you can think. There are a lot of transistors inside. We can manage every transistor from an ideal program in an ideal world. And even better - we can manage every electron in the crystal. But I hope it is obvious that the expenses of managing every electron will prevent us from doing so. Absolutely the same problem is actual for the 'one instruction processor'.

There should be an optimum. In the real world only such solutions work. And for processors the optimum (from my point of view) includes multiplication and excludes loop implementation.

Rusky wrote:The whole point of a VLIW is to split the operations out and expose them to the compiler so it can specify exactly what to do and when. Doesn't that sound like something your compiler could use to its advantage?

Yes, it's just the advantage I think the VLIW is offering. But the smearx instruction doesn't expose it's internals and it is not a VLIW way. It's complex and specific with a narrow usage area. It's a bad idea to generalize loops and place such implementation into processor.

embryo · Post by **embryo** » Tue Mar 11, 2014 7:37 am

h0bby1 wrote:A good case of a language being able to take in account the architecture is GLSL/shader language

It's just a specific software thing for another specific hardware thing. But there is a need for a general software solution for a general hardware implementation. And here we see the following chain - developer - language - compiler - hardware. Developer sets the system goal and he should be able to influence any decision in the chain. Such influence is possible only in case of completely exposed internals of all lower chains. Processor should expose it's internals to the compiler, compiler - to the language and the language - to the developer.

With Java I can see such chain when Java annotations direct the compiler's actions, then compiler's actions select most optimal algorithm and feed it to a very open processor, which in the end just does things an algorithm can't do (like actual signal level changing). And of course, there should be an optimum at every level of the responsibility hierarchy.

h0bby1 · Post by **h0bby1** » Tue Mar 11, 2014 9:06 am

embryo wrote:
h0bby1 wrote:A good case of a language being able to take in account the architecture is GLSL/shader language
It's just a specific software thing for another specific hardware thing. But there is a need for a general software solution for a general hardware implementation. And here we see the following chain - developer - language - compiler - hardware. Developer sets the system goal and he should be able to influence any decision in the chain. Such influence is possible only in case of completely exposed internals of all lower chains. Processor should expose it's internals to the compiler, compiler - to the language and the language - to the developer.

With Java I can see such chain when Java annotations direct the compiler's actions, then compiler's actions select most optimal algorithm and feed it to a very open processor, which in the end just does things an algorithm can't do (like actual signal level changing). And of course, there should be an optimum at every level of the responsibility hierarchy.

The problem with this chain is that there is very rarely a continuous chain between developers and hardware manufacturer.

Language and compilers are bit the middle man, and regarding the nature of PC and most 'computers' architecture (opposed to dsp or game console like things), it's all made in sort to do a lot of things in software, and so the general orientation is more to use general language that are able to represent problems that developers has to solve, rather than having language that can represent the hardware very closely, because hardware change all the time, and a certain number of problem that has to be done by the hardware are still a always same, and can be expressed in more general form.

So it's better to have the program expressed in a language that can be then compiled with maximum efficiency on wide range of hardware, rather than have to develop program that are made for a specific hardware, but it also reduce how efficiently the hardware can be taken in account by compiler, because the language is too generic to integrate cpu related feature in the definition of the program.

So there is necessarily a disconnection between hardware manufacturer and developers at this point, because developers who use language like C and Java do it so especially to avoid to have to deal with hardware specific things, and for lot of task that are really resource intensive, it's hard to concieve how you could write generic C or Java program that could really take the best advantage of the hardware if it has specific data types and operations that are not exposed into the language, so either need to have some kind of library or language extension made specifically to take advantage of some hardware feature, like the link with "native code" in java, or SSE intrinsic in C/C++, or various built-in type that can be found in some languages and/or compilers.

Or either not having the hardware to have any feature or data type that the C or high level language cannot express explicitly, so it limit to basic C operators , or Java operation, and having a simple mapping directly between the high level language and cpu instruction, but then it prevent to have cpu that can handle more complex type and operation, that the language do not recognize.

Or then you design the cpu to fit for a specific language, or make a language specially for the cpu, like it has been made with shaders. Maybe the Mill is an attempt at that, to make a cpu that can easily optimize wide range of things expressed in common C language.

Otherwise, in either way, either you'll have to tweak with the language to have maximum performance using specific hardware things, or either you'll have to use some kind of lib made by the hardware manufacturer to take advantage of those feature.

Rusky · Post by **Rusky** » Tue Mar 11, 2014 1:15 pm

Look, embryo- compilers have their advantages. But so do CPUs. You need both for an efficient program. Compilers can see the whole program at once, they can spend more time thinking, etc. But CPUs know exactly what the state of the program is, which in many cases is impossible for the compiler. Some operations are more efficient directly in the CPU, as opposed to software. CPUs control exactly what operations are available to the compiler.

So for an efficient program, you can't just have the CPU do a bunch of simple things- they have to be organized well too. CPUs determine the limits of what is possible. Again, smearx is independent of the loop body and condition. It is not "just one of many loop cases." It just takes a vector of bools- the exit condition evaluated on each loop iteration in the vector- and returns a vector of bools for which iterations should execute. This is precisely the logic of literally every while loop in existence. You can either give this tool to the compiler (along with existing ones of course), and speed up your loop by the size of a vector, or you can stick with a conditional branch for every single iteration. I think we both know which one is more efficient.

Plus, you still haven't shown how a CPU without smearx could implement strcpy as efficiently.

embryo · Post by **embryo** » Tue Mar 11, 2014 1:46 pm

h0bby1 wrote:The problem with this chain is that there is very rarely a continuous chain between developers and hardware manufacturer.

It's an artificial situation. The Intel drives markets and markets drive the Intel. A bit of infinite loop.

The division of the software chain into language and compiler is just the tool to fight vendor locked infinite loop. The developer expresses his thoughts in a high level language and then lets the compiler to connect the language to the hardware. And the compiler can hide any hardware vendor specific things from language or developer. But if developer decides to achieve the best performance, then he can use annotations, for example, to tell the compiler what optimization technic is expected.

For example in case of Java there is obvious target for high performance - the Java OS. It should be optimized for a particular hardware. But most of the bytecodes the system can execute as a separate application often should be without hardware related annotations to allow it's execution on different hardware platforms. And having things separated in such manner we can easily jump out of manufacturer lock. That's why the chain not only should work fine, but it can even stimulate all the hardware industry to compete in a less relaxed manner and for our profit.

And by the way - using annotations it is possible to tell the compiler what to do in case of every particular kind of the hardware. So if a developer is ready to provide such instructions, then there can be very efficient system on every platform the developer knows. It means the efficiency of the chain is very high.

embryo · Post by **embryo** » Tue Mar 11, 2014 2:03 pm

Rusky wrote:Look, embryo- compilers have their advantages. But so do CPUs.

CPU has advantage of just one thing that is able to do with a physical world. And that's all about it's advantages. Everything else is an algorithm. But we can discuss the place where the algorithm can be found. You know my opinion - primitive algorithms like OR, AND or XOR are for the processor and more complex algorithms are for the compiler.

Rusky wrote:Some operations are more efficient directly in the CPU, as opposed to software.

It is important to understand that it is not the operations in question, but it is the costs of operations. If the cost of providing the processor with new algorithm is low, then the algorithm must be in the compiler. Else - it must be in the processor.

Rusky wrote:CPUs control exactly what operations are available to the compiler.

It's not the CPU. It's all about CPU vendor. It's two very different things.

Rusky wrote:Again, smearx is independent of the loop body and condition. It is not "just one of many loop cases." It just takes a vector of bools- the exit condition evaluated on each loop iteration in the vector- and returns a vector of bools for which iterations should execute.

If it returns a vector of flags it doesn't meant it is independent of the loop. It just splits the loop into two parts - condition evaluation and body execution. When it returns the vector it has processed the condition part. Next it must process the body. Compiler can do the same with ease. Or the compiler can determine, that in a particular case it is not efficient to split loop into such parts. But the Mill always must split the loop.

Rusky wrote:You can either give this tool to the compiler (along with existing ones of course), and speed up your loop by the size of a vector, or you can stick with a conditional branch for every single iteration.

The last option is here because the Mill just not exposes it's internals enough to let the compiler to choose the best algorithm.

Combuster · Post by **Combuster** » Tue Mar 11, 2014 2:04 pm

Plus, you still haven't shown how a CPU without smearx could implement strcpy as efficiently.

Something like this (untested)?

Code: Select all

.loop:
  movaps xmm0, [esi]
  pxor xmm1, xmm1
  pcmpeqb xmm1, xmm0
  pmovmskb ecx, xmm1
  or ecx, ecx
  jnz last_bytes_manual
  movaps [edi], xmm0
  add edi, 16
  add esi, 16
  jmp .loop
last_bytes_manual:
  (...)

for an average of two bytes copied per opcode.

h0bby1 · Post by **h0bby1** » Tue Mar 11, 2014 2:32 pm

Well in the current state of things, C/C++ compilers can still do a good job at optimizing most things on most architecture, i mean for most task, they just do fine with regular C/C++ code, even for specific things, that can really benefit from some cpu feature, you can hardly gain 10 or 20% of performance on the overall execution, even java it doesn't do too bad, maybe for really high res video encoding or decoding, it would be a bit too slow, but even generally this sort of stuff use massive chunck of assembler code also to optimize the thing the best for particular cpu, and the plain C version is still substantially slower.

Maybe either they should do cpu that are really made for a particular language to make it easy for the compiler to generate optimized code for the cpu, they could do that for java, i'm sure there could be way that some specific cpu instruction could speed up execution of java byte code, or even for C, or other language that contain more complex data type or operators.

With the Mill, it looks like it aim at optimizing globally what a C program can have to do, after would need to see how well programs in other language like java byte code or shaders can take advantage of it. If they can run some kind of fast opengl on it with shader support, or some kind of video/audio framework, it can be interesting.

And how well the compiler can really vectorize complex loops, like if you would have a loop to decode macro blocks of an h264 stream, with the quantization, dct and deblocking, and eventually conversion from yuv to rgb and resizing. If there would still be benefits to write it in assembler for the cpu, it mean already the interest is limited to optimize more complex type of operation, and then would need some kind of lib like intel math lib to take advantage of it in a C program. Most of this stuff could be done with shaders language as well.

Optimizing the execution of java byte code is probably different than optimizing the assemblers generated from a C program though.

Rusky · Post by **Rusky** » Tue Mar 11, 2014 3:32 pm

embryo wrote:If it returns a vector of flags it doesn't meant it is independent of the loop. It just splits the loop into two parts - condition evaluation and body execution. When it returns the vector it has processed the condition part. Next it must process the body. Compiler can do the same with ease. Or the compiler can determine, that in a particular case it is not efficient to split loop into such parts. But the Mill always must split the loop.

Wait what? The loop isn't "split into two parts" with smearx any more than it is with a conditional branch instruction. Nor is there any reason a Mill loop must always use smearx. It's just another tool like cmp, jnz, etc.

Combuster wrote:Something like this (untested)?
Code: Select all
.loop:
  movaps xmm0, [esi]
  pxor xmm1, xmm1
  pcmpeqb xmm1, xmm0
  pmovmskb ecx, xmm1
  or ecx, ecx
  jnz last_bytes_manual
  movaps [edi], xmm0
  add edi, 16
  add esi, 16
  jmp .loop
last_bytes_manual:
  (...)
for an average of two bytes copied per opcode.

Try moving a maximum vector of bytes per cycle.

OSDev.org

The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design

Re: The Mill: a new low-power, high-performance CPU design