The Mill: a new low-power, high-performance CPU design


Re: The Mill: a new low-power, high-performance CPU design

Post by Brendan »

Hi,
embryo wrote:
Brendan wrote:For other things (e.g. branch prediction) it's impossible for the compiler to beat the processor because the optimisation can't be done statically and if it is done in software at run-time (e.g. JIT or self modifying code) the overhead of doing the optimisation is greater than the overhead it saves.
We already have examples of the compiler winning the run-time optimisation contest. A few messages above are exactly about such a win.
I'm not sure which messages you're talking about - is it the one where Combuster failed to unroll a small loop and remove all branches?

Note: Would you mind specifying which compiler you're referring to? When you say "compiler" I assume you mean the compiler that converts Java source code into Java byte-code; but usually you seem to be referring to the JVM instead. ;)
embryo wrote:
Brendan wrote:Of course for almost all cases where Java is faster you can find out why and improve the C/C++ code until the C/C++ version is at least as fast or faster than the Java version
It is not quite that simple. For C/C++ to match Java on the pointer problem, it has to give up using pointers. But then C/C++ is just emulating Java, so why not use Java directly?
I'll assume you're talking about the "pointer aliasing" problem that was partially fixed in C++ (strict aliasing rules), and then properly fixed in C99 (the "restrict" keyword) and then adopted by most C++ compilers to completely fix their previous partial fix.

From my point of view, even though the problem is "fixed" in C/C++, it's a symptom of a larger problem with compiler implementations (object files and linking causing difficulty with whole program optimisation); and not necessarily a problem that's part of the languages themselves. Also note that it's definitely not related to "ahead of time" vs. "just in time" - e.g. Fortran never had the problem to begin with even though it also uses an "ahead of time" compiler and doesn't use JIT at all.
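As a sketch of what the C99 fix buys you: `restrict` is a promise to the compiler that the pointers never alias, so it can keep values in registers and vectorize instead of reloading after every store. A minimal example (the function name and parameters are invented for illustration):

```c
#include <stddef.h>

/* Without restrict, the compiler must assume dst and src might overlap,
 * so it has to reload src[i] after every store to dst[i].  With restrict,
 * the programmer promises no aliasing, and the loop can be vectorized. */
static void scale_add(float *restrict dst,
                      const float *restrict src,
                      float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}
```

Fortran never needed this because its arrays are assumed not to alias by default, which is exactly the point above.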
embryo wrote:With memory allocation it's again the same - to have fewer bugs, C/C++ needs to give up manual memory allocation. And there are more items on this list.
Um, we were talking about speed, not safety. For C/C++ the normal malloc/new is stupid and slow (causes cache locality problems that can be avoided with smarter memory allocation). Modern Java does implement the smarter memory allocation, which gives it an advantage over the typical "written by a noob" C/C++ code you'll find in most benchmarks (and find in most C/C++ projects). This isn't a problem caused by C/C++ though (it's relatively easy to write your own special purpose allocators that don't suffer from cache locality problems), it's caused by lazy programmers. Again; it's definitely not related to "ahead of time" vs. "just in time".
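One kind of "smarter allocation" meant here is an arena (bump) allocator: allocations come out of one contiguous block, so objects allocated together share cache lines, and freeing is one cheap bulk operation. A toy sketch (all names invented):

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bump allocator: objects allocated together end up
 * adjacent in memory, unlike with general-purpose malloc. */
typedef struct {
    char  *base;
    size_t used, cap;
} arena_t;

int arena_init(arena_t *a, size_t cap) {
    a->base = malloc(cap);
    a->used = 0;
    a->cap  = cap;
    return a->base != NULL;
}

void *arena_alloc(arena_t *a, size_t size) {
    size = (size + 15) & ~(size_t)15;        /* keep 16-byte alignment */
    if (a->used + size > a->cap) return NULL;
    void *p = a->base + a->used;
    a->used += size;
    return p;
}

void arena_free_all(arena_t *a) { a->used = 0; }  /* one bulk "free" */
```

This is the kind of special-purpose allocator that is easy to write in C but that the "written by a noob" benchmark code never bothers with.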

For safety, C/C++ were never intended to be safe to begin with; and Java was never intended to be useful for low-level code. Complaining that C/C++ aren't safe is as stupid as complaining that Java won't let you have direct access to any address you like.
embryo wrote:
Brendan wrote:and for most of the cases where C/C++ is faster than Java you can't do anything about it.
Why can't I use a compiler like GCC and get exactly the result that C/C++ gets? And I can use such a compiler at run time, like a JIT. And there will be no significant performance overhead, because the compiler runs once and the server code then runs for a very long time.
I'm not sure what you're trying to say here. You can use (e.g.) GCJ to "ahead of time" compile Java directly to native code and not bother having any JIT; but the performance will probably be even worse than running Java byte-code on a JVM because the language was never designed to do this efficiently (although I'd expect the memory footprint will be better - no big bloated virtual machine involved).
embryo wrote:
Brendan wrote:For my goals, Java has ... However, far too many bugs aren't discovered until run-time, it's harder to learn, and none of the overhead is obvious.
The number of bugs is a slippery issue, but we can recall the causes of the most common bugs - manual memory allocation, pointers, unsafe operations, etc.
Heh. For detecting bugs, it's like trying to determine the winner of a horse race when both horses have broken legs - you end up trying to determine "least worst" when you know both options are bad.
embryo wrote:About the learning problem - it just seems nonexistent - why do young people prefer not to learn C, but learn Java and other safe languages instead?
If you were a novice bull fighter, you'd probably want to start learning with a "safe" fat cow too.
embryo wrote:
Brendan wrote:consider writing code to switch the CPU from protected mode to long mode, or code to do a task switch to a thread in a different virtual address space, or code to support hardware virtualisation, or even just code for a boring interrupt handler. Without being able to use assembly you're screwed.
In jEmbryoS the assembler parts are as easy to implement as the inline assembly in C. Again we have no problems with Java.
You have no problems with Java because you're not using Java. If you want to disagree, then show me where in the Java documentation I can find the "using assembly language in Java" section.
embryo wrote:
Brendan wrote:When you start looking at performance there are 4 main things to worry about - memory hierarchy efficiency (e.g. things like knowing which cache lines you touch when), doing stuff in parallel efficiently (both multiple CPUs and SIMD), controlling when things happen (e.g. important work done before unimportant work) and latency. Something like garbage collection is enough to completely destroy all of this.
With a Java OS all 4 things are under control. And the GC is not an issue, because it can be as controllable as you wish. It can run at predefined points in the program thread, or it can run in a separate thread on a dedicated core with its own independent cache. It's just a matter of design, not a stopper any more.
Again; once you stop using Java and start using assembly (to implement your own garbage collectors, etc), the restrictions of Java no longer apply.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.

Post by h0bby1 »

For memory allocation, it's possible to have reference counters in C as well, avoiding the whole pointer ownership and deallocation problem, and to deal with pointer aliasing explicitly using references - but you can't really force the use of references through the language/compiler.
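One common way this is done in C is an intrusive reference count (a minimal sketch; the struct and function names are invented):

```c
#include <stdlib.h>

/* Intrusive reference count: the object frees itself when the last
 * reference is released, which removes the single-owner question. */
typedef struct {
    int refcount;
    int value;
} rc_obj_t;

rc_obj_t *rc_new(int value) {
    rc_obj_t *o = malloc(sizeof *o);
    if (o) { o->refcount = 1; o->value = value; }
    return o;
}

void rc_retain(rc_obj_t *o) { o->refcount++; }

int rc_release(rc_obj_t *o) {                /* returns 1 if freed */
    if (--o->refcount == 0) { free(o); return 1; }
    return 0;
}
```

As the post says, nothing in the language forces callers to go through `rc_retain`/`rc_release` - the discipline is by convention only.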

What seems especially interesting about the Mill is its capacity to vectorize a whole lot of things, and its abundance of pipelines to handle parallelization. Even if the compiler and libc/runtime probably need to be Mill-specific to handle it well, it doesn't require that high a degree of prediction to do this kind of optimization and to figure out how to optimize assembly for the pipeline configuration of the CPU, at least for straightforward loops over vectors/arrays.

Post by Rusky »

embryo wrote:Metadata needs storage. The compiler can allocate the storage at the compile time. It means the compiler can do every trick the Mill can do with metadata. And, of course, the compiler will do it in more efficient manner, it just has a lot of algorithms for such task.
Metadata storage is a few bits, associated with each register. The CPU uses that to reduce instruction size, because the instructions no longer need to store their operand type. This enables a faster decode path, which is the opposite of the compiler method. The metadata used for vectorizing marks the elements of the vector past the end of the data as None, so that when the program stores the vector (one instruction) those elements are not written. If the compiler implemented the same algorithm, it would have to individually extract each element of the vector, test it against the metadata stored in another register, and then write it, effectively de-vectorizing the loop and using more, rather than fewer, registers vs the vanilla implementation.

So no, the compiler cannot do every trick the Mill can do, and it definitely cannot do it more efficiently. The Mill's tricks do, however, enable the compiler to do some pretty cool stuff it couldn't do on other processors.
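The de-vectorized tail described above looks roughly like this in C (a sketch with invented names): without hardware support for None-marked lanes, storing a partial vector falls back to an element-by-element loop.

```c
#include <stddef.h>

/* Store only the first `valid` lanes of a 4-wide vector.  On a machine
 * without masked stores (the Mill's None metadata), the compiler must
 * emit a scalar loop like this for the partial tail of a vectorized
 * copy - exactly the de-vectorization described above. */
static void store_partial(int *dst, const int lanes[4], size_t valid)
{
    for (size_t i = 0; i < valid && i < 4; i++)
        dst[i] = lanes[i];          /* one lane at a time */
}
```

On the Mill, by contrast, the out-of-range lanes are marked None and a single full-width store skips them with no per-element work.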
embryo wrote:Value semantics just means an immutable object. From the beginning Java had (and has) immutable objects, for example - strings.
What I meant by value semantics is where objects are stored. All objects in Java are allocated on the heap, so if you use an object as, say, a member of another, it is implemented as a pointer. In many cases, especially in OS dev, you must be able to control whether a member is stored directly in the object or as a pointer. Java does not have that, so if your Java OS does, you modified the language.
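The distinction can be illustrated in C, where the programmer chooses the layout (type names invented for illustration):

```c
#include <stddef.h>

typedef struct { int x, y; } point_t;

/* Value semantics: the member lives inside its container - one
 * allocation, contiguous memory, predictable cache behaviour.
 * This is what C structs give you by default. */
typedef struct {
    point_t origin;     /* embedded directly */
} shape_by_value_t;

/* Reference semantics: the member is a separate heap object reached
 * through a pointer - this is how every object-typed field behaves in
 * Java, and the choice is not available to the programmer there. */
typedef struct {
    point_t *origin;    /* one indirection away */
} shape_by_ref_t;
```

Copying a `shape_by_value_t` copies the point; copying a `shape_by_ref_t` shares it - which is exactly the semantic difference Java fixes at reference-only.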

Post by Combuster »

Rusky wrote:What I meant by value semantics is where objects are stored. All objects in Java are allocated on the heap, so if you use an object as, say, a member of another, it is implemented as a pointer. In many cases, especially in OS dev, you must be able to control whether a member is stored directly in the object or as a pointer. Java does not have that, so if your Java OS does, you modified the language.
Mind you that "fully defined behaviour" is not equivalent to "implementations must do it like this", and therefore your logic doesn't hold.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]

Post by Rusky »

Java-the-language does specify reference semantics for objects, does it not?

Post by Combuster »

Like I said, semantics (= behaviour) does not dictate the specifics of the implementation. If you create your own ABI, you also know how to construct Java classes to follow the storage guidelines of your choice.

Post by Rusky »

Object semantics is different from reference semantics, regardless of its implementation.

Post by embryo »

Brendan wrote:I'm not sure which messages you're talking about - is it the one where Combuster failed to unroll a small loop and remove all branches?
Why do you think Combuster failed? He just showed how a compiler can do several things simultaneously (if I understand correctly).

But as an example you can consider this message - http://forum.osdev.org/viewtopic.php?f= ... 3&start=42.
Brendan wrote:Would you mind specifying which compiler you're referring to?
Yes, the definition should be clarified. I mean the bytecode-to-native compiler: it takes Java bytecode and produces machine code. And it is not a JIT - a JIT is just one form of such a compiler, but there can be other forms.
Brendan wrote:I'll assume you're talking about the "pointer aliasing" problem that was partially fixed in C++
Not only that. The actual problem is the amount of time required to produce good machine code. With pointers that time is greater than without them. The safety of a language ensures a smaller problem space. It means that for a C/C++ compiler to compete with a safe language's compiler, there should be safe code as input to the C/C++ compiler; otherwise the complexity of pointer manipulation will slow the C/C++ compiler down, or even prevent it from producing good (fast) code. Of course, the developer can think about smart pointer usage and prevent compilation problems, but then we have an obvious advantage of Java - it lets us concentrate on the architecture rather than being bothered with a lot of details of code that must smartly conform to the compiler's requirements.
Brendan wrote:Also note that it's definitely not related to "ahead of time" vs. "just in time"
It's better to see the whole picture - from static compilation through JIT and up to switching algorithms at certain points of the program's execution. Splitting the whole process into parts is only interesting when some problems are actually avoided by using a particular part of the process.
Brendan wrote:Modern Java does implement the smarter memory allocation, which gives it an advantage over the typical "written by a noob" C/C++ code
It's not only about noobs. It's about the programmer's freedom to implement his decisions. If a noob can use that freedom, it's still freedom - an experienced developer can leverage it even better.
Brendan wrote:Complaining that C/C++ aren't safe is as stupid as complaining that Java won't let you have direct access to any address you like.
Even if it is that way 'by design', that doesn't free C/C++ from the slow-compiler problem. The problem is the number of steps required to produce good code, and it should not be conflated with language design. The problem either exists or it doesn't; the design intent is irrelevant here.
Brendan wrote:
embryo wrote:Why can't I use a compiler like GCC and get exactly the result that C/C++ gets? And I can use such a compiler at run time, like a JIT. And there will be no significant performance overhead, because the compiler runs once and the server code then runs for a very long time.
I'm not sure what you're trying to say here.
It's about applying the same algorithms to different languages - Java and C. That way we can get equal performance from both when the language constructs are close. The only difference is the unsafe features of C - there we have to go different ways.
Brendan wrote:You can use (e.g.) GCJ to "ahead of time" compile Java directly to native code and not bother having any JIT; but the performance will probably be even worse than running Java byte-code on a JVM because the language was never designed to do this efficiently
Do you mean that inline assembler is not part of the Java specification? The following comment describes the situation.
Brendan wrote:You have no problems with Java because you're not using Java. If you want to disagree, then show me where in the Java documentation I can find the "using assembly language in Java" section.
The way it is used in Java is pure Java. It looks like assembly, but it is Java. The goal was to emulate the look and feel of assembly in Java using only Java constructs, and that goal has been achieved. Now anybody can write a pure Java program and have low-level code defined right in the Java source code. The actual machine code representation is produced by the compiler.
Brendan wrote:once you stop using Java and start using assembly (to implement your own garbage collectors, etc), the restrictions of Java no longer apply.
The GC is written in Java. But yes, the restrictions of Java are circumvented, and the assembly usage is not safe; the same is true of direct memory access. However, all the code is 100% Java.

One more point - if we keep all Java restrictions untouched, then there is no way to access hardware except through a JVM. But jEmbryoS is itself the JVM; there is no other JVM to ask for hardware access. The JVM part is isolated from the rest of the system and lets a system user see a pure Java environment, while on the other side a developer can quickly go down to the JVM level and have everything required for efficient hardware access.

Post by embryo »

h0bby1 wrote:for memory allocation, it's possible to have reference counters as well in C
It's all about automatic memory management. Reference counting is not 'the best' - it requires no less effort to manage memory than a garbage collector does. The root of the problem is that mutually referencing objects keep each other alive, leaking memory; preventing that leak requires a more complex solution.
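The cycle problem mentioned here can be shown concretely (a sketch with an invented refcount scheme): two objects that point at each other never reach a count of zero, so plain reference counting leaks them.

```c
#include <stdlib.h>

/* Two nodes referencing each other: after the external references are
 * dropped, each node still holds one count on the other, so neither
 * count ever reaches zero - the classic reference-counting cycle leak. */
typedef struct node {
    int refcount;
    struct node *peer;
} node_t;

node_t *node_new(void) {
    node_t *n = calloc(1, sizeof *n);
    if (n) n->refcount = 1;
    return n;
}

int node_release(node_t *n) {          /* returns 1 if freed */
    if (--n->refcount == 0) { free(n); return 1; }
    return 0;
}
```

A tracing garbage collector handles this case because it looks for reachability from roots rather than counting references, which is the "more complex solution" the post alludes to.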

Post by embryo »

Rusky wrote:If the compiler implemented the same algorithm, it would have to individually extract each element of the vector, test it against the metadata stored in another register, and then write it, effectively de-vectorizing the loop and using more, rather than fewer, registers vs the vanilla implementation.
Why should the compiler read each element of the vector? It can read just the metadata and select only the particular elements.
Rusky wrote:All objects in Java are allocated on the heap
There is no such requirement in either the Java language specification or the Java virtual machine specification.

Post by Rusky »

embryo wrote:
Rusky wrote:If the compiler implemented the same algorithm, it would have to individually extract each element of the vector, test it against the metadata stored in another register, and then write it, effectively de-vectorizing the loop and using more, rather than fewer, registers vs the vanilla implementation.
Why should the compiler read each element of the vector? It can read just the metadata and select only the particular elements.
Show us the generated code for the vectorized strcpy example on this page for the compiler method.
embryo wrote:
Rusky wrote:All objects in Java are allocated on the heap
There is no such requirement in either the Java language specification or the Java virtual machine specification.
So I misspoke. But as I said, there is a requirement for reference semantics, which is incompatible with value semantics. You can get value semantics if you explicitly clone objects all over the place (or, if they're immutable, use an interning pool), but there's no way the compiler is going to inline the member object into its container that way - this is one example of Java not having the control you need for a systems language. You have to hope the compiler will do a highly unlikely optimization to control cache behavior, etc. with inline values.

Post by embryo »

Rusky wrote:Show us the generated code for the vectorized strcpy example on this page for the compiler method.
It's insane to ask me to read the Mill manuals and present the Mill program here. But I can describe the algorithm. It reads the metadata and scans for signs of elements that should not be copied. Once such a sign is found, or the end of the metadata is reached, the processor is simply ordered to copy a range of elements from the start address up to the last determined element. All those actions can be done using parallel execution units. Cache usage optimization is also absolutely possible with such an approach.
Rusky wrote:but there's no way the compiler's going to inline the member object into its container that way - this is one example of Java not having the control you need for a systems language
There are many ways the compiler can manage objects. It can unbox an object's internals and place them on the stack without using memory allocation at all, for example. But maybe you have some example where just inlining one object within another can save the world?

Post by Rusky »

embryo wrote:It's insane to ask me to read the Mill manuals and present the Mill program here. But I can describe the algorithm. It reads the metadata and scans for signs of elements that should not be copied. Once such a sign is found, or the end of the metadata is reached, the processor is simply ordered to copy a range of elements from the start address up to the last determined element. All those actions can be done using parallel execution units. Cache usage optimization is also absolutely possible with such an approach.
There are no Mill manuals yet - you were going to do it without Mill features. Besides, did you even look at the example I linked? Telling the processor to copy a range of elements between addresses essentially is the operation we're vectorizing here - give us some pseudo assembly for your imaginary perfect CPU that lets the compiler do everything.
embryo wrote:But maybe you have some example where just inlining one object within another can save the world?
When a table structure like EFI or ACPI requires it. Or I guess you could just copy and paste those fields everywhere...
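The point can be made concrete with a firmware-defined layout (a simplified sketch modelled on the ACPI system description table header; the field list is abridged): the byte offsets are fixed by the firmware spec, so the fields must be stored inline, not behind references.

```c
#include <stdint.h>

/* Simplified ACPI-style system description table header.  Firmware lays
 * these bytes out at fixed offsets in memory; a language that stores
 * object-typed members as heap references cannot overlay this layout. */
typedef struct {
    char     signature[4];   /* e.g. "APIC", "FACP" */
    uint32_t length;         /* total table size in bytes */
    uint8_t  revision;
    uint8_t  checksum;       /* all table bytes must sum to zero */
    char     oem_id[6];
} __attribute__((packed)) sdt_header_t;
```

In C you simply cast the firmware pointer to `const sdt_header_t *` and read the fields in place; with Java's reference semantics every member would live elsewhere on the heap, so no cast-and-read is possible without copying field by field.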

Post by embryo »

Rusky wrote:did you even look at the example I linked?
Yes. And the example tells me it is just another version of the algorithm I described above. First, the Mill looks at the metadata and determines the end of the string. Second, the Mill just copies the array elements from the start up to the identified last element.
Rusky wrote:Telling the processor to copy a range of elements between addresses essentially is the operation we're vectorizing
How can the task of copying from one memory location to another be enhanced by a processor? It is obviously something that hardware (like a memory controller) should do using DMA or something similar. The same applies to copying between memory and a cache at a particular level. The only case where the processor is required is when the data should be used by the processor after the copy operation finishes - but even that case is still about getting data from memory and placing it in another memory (the register file, in particular). It's just a matter of system bus bandwidth and the controller hardware involved (memory and bus controllers).
Rusky wrote:give us some pseudo assembly for your imaginary perfect CPU that lets the compiler do everything
There should be some means of data transfer control in my imaginary hardware.

Code:

mov destAddr, srcAddr, numberOfWordsToMove      ; memory or bus controller instruction
mov regStartIndex, srcAddr, numberOfWordsToMove ; here the processor is involved
The program execution unit, while prefetching the program, can select the memory-transfer commands using some bits as a flag, then switch its output lines to the memory or bus interaction unit to start the actual transfer.

But the actual implementation at the silicon level is outside my expertise. So it is possible to blame all the hell on me :)

Post by Rusky »

So you move the actual copying logic into the CPU so the program doesn't have to copy byte by byte? That's essentially what I've been saying the whole time - the CPU can be a better place for logic than the compiler's output.

In the Mill's case, however, it can also do the discovery-of-how-much-to-copy at the same time (and speed) as the actual copy, while still copying at the full speed available to the CPU (as opposed to dropping down to byte-by-byte speeds to avoid overshooting). Your example would have to scan the string a byte at a time first.

Thus you both proved my point (that moving logic into the CPU can be good) and disproved your own (that the compiler is always better) at the same time. Congrats.