Hi Berkus,
berkus wrote:DavidCooper wrote:because it can't simply change the contents of your arrays and rewrite your code for you, chopping it up into separate threads to farm out the work to different processors in the most efficient way.
Will I be too wrong here if I say OpenMP? Or maybe Grand Central Dispatch.
I don't know. Are you telling me that they can look at your code and data, determine that it's doing something repetitive which could be chopped up into lots of threads producing results that don't depend on earlier results, and then write a new version of your program for you, possibly even changing your algorithm or the structure of a database to ensure that it processes the data in the order that minimises cache misses? I'm sure these technologies do something helpful, but I don't think they can go that far.
DavidCooper wrote:That work has to be done by the programmer, and it's vital that the programmer understands how much contiguous data is collected and loaded into the caches
It's nice when the programmer understands this and takes it into account whilst writing their application. But it doesn't help as much as it could, precisely because of differences in cache sizes and performance that only runtime checks can detect and exploit. At the theoretical maximum, compilers and smarter runtime systems are better at this than the programmer can be.
With JIT and AOT, the code can be compiled to run on the specific processor in each individual machine, but it would be easy enough to have a program in an OS which goes through apps, modifying them to run on the specific processor in the machine, provided they're written in a particular way to work with such a system. It would be like using machine code as a bytecode and "compiling" it into machine code on loading: some instructions would be replaced with others which are more efficient if the processor supports them, but the "bytecode" would also run successfully without modification. In most cases this could be done simply by marking parts of the code for modification: precede them with perhaps 147 147 and follow them with a string of 144s (the NOP opcode) long enough to make room for the possible replacement code, the 147s being overwritten during the conversion process.
DavidCooper wrote:I could easily run my own code indirectly (by interpreting it) in order to count up how often it accesses all the different variables and rearrange them accordingly, so everything that a compiler can automate could equally be done for a machine code or assembler programmer
If you had said "I could use valgrind or gprof" or any other existing tool, I would agree, but you're demonstrating a NIHS which suggests you should learn a bit more about existing tools and the state of the art in the compiler area.
NIHS: Not Invented Here Syndrome? National Institute of Health Sciences (Japan)? National Institute for Hometown Security (Somerset, KY)? North Iredell High School (North Carolina)? National Intercollegiate Horse Show?
I can't comment on that mystery, but the tools you speak of simply aren't available on my OS - I have to write all my own tools for everything.
DavidCooper wrote:You've got exactly the same work to do if you're using a compiler (apart from determining the order the instructions come in, though I can automate that too, and will when I have the time - the actual algorithms that you use to process your data are your own design, and the compiler can't help you with that).
In this case the compiler does the reordering for you. This is something the compiler developer has to do only once: give the compiler knowledge of the pipelines, out-of-order interlocked stages and other constraints, and the compiler can take it from there. Of course it will still miss sometimes, because cache and memory access times are not constant, but you cannot do any better by hand - you can only do it slower (measured in code-generation time).
There's no reason why it should be any slower the way I do it - this reordering of code could also be done on loading, by running a program which reads through the app's code making changes to optimise it for the processor - and this could be done so fast you wouldn't notice any delay.
Again: algorithms are important, but we're not questioning that here - I personally assume a reasonably skilled C/asm programmer who can design their programs, but can also profile and fix their code (I have a link to a nice presentation about it if you're interested) if the original design turns out to be inefficient. And compilers are much better at the dumb, mechanically repetitive work that is low-level code optimisation.
I'm interested, but this isn't a priority for me so it can probably wait. Anything a compiler does to aid optimisation of code can be done through a similar kind of program designed to work directly with machine code numbers. I'm not against automation at all: I simply prefer to do my programming in the same way as A.I. will do its programming in the future by working directly with machine code instructions, and once I've written the tools to automate all the things which compilers automate, I won't be at any disadvantage through programming this way.
Hi Gerry,
gerryg400 wrote:I don't believe that you, yourself, can look at a reasonably large program and optimise cache hits the way you describe. Perhaps you could optimise a number of routines chosen after profiling, but all reasonably capable software engineers can do that, and do do it as required. You do realise that modern CPUs have 3 levels of cache for data and code, and that those caches have vastly different miss-penalties at the different cache levels. CPUs also have instruction pipelining, out-of-order execution, branch prediction, etc. To imagine that you can take an average-sized program (let's say 100,000 lines) and analyse it while understanding and keeping in mind all these CPU features, plus the ones that are barely documented, and produce better code than a compiler, in reasonable time, is, well, just imagining.
Do you seriously imagine that app developers write a different version of every large array and database, plus a different algorithm to process each of them, in order to optimise their program for every individual kind of processor it might run on? They can't possibly do that. There will be certain limited procedures working with large amounts of repetitively-structured data which may be worth tuning to specific processors, with multiple versions of that code in the app, but that will be a tiny part of an app which can probably be written in under a thousand lines of code.

Imagination is what you need in order to create new ways of doing things, and what I've done is come up with a system which makes direct machine code programming not just practical, but fast. Once I've written a code-optimisation program to go with it, it will do all that hard work for me and leave me at no disadvantage, so my code will never run slower than yours, but it will probably run faster in many places and it will certainly be a great deal more compact. We can have a code race to see who's right when everything's finished, but in the meantime I need to concentrate on writing my programs without worrying too much about the speed they'll run at - it'll be easy enough to make all the speed tweaks at the end of the process.