Hi Berkus,
berkus wrote:DavidCooper wrote:because it can't simply change the contents of your arrays and rewrite your code for you, chopping it up into separate threads to farm out the work to different processors in the most efficient way.
Will I be too wrong here if I say OpenMP? Or maybe Grand Central Dispatch.
I don't know. Are you telling me that they can look at your code and data, determine that it's doing something repetitive which could be chopped up into lots of threads producing results that don't depend on earlier results, and then write a new version of your program for you, possibly even changing your algorithm or the structure of a database to ensure that it processes the data in the order that minimises cache misses? I'm sure these technologies do something helpful, but I don't think they can go that far.
DavidCooper wrote:That work has to be done by the programmer, and it's vital that the programmer understands how much contiguous data is collected and loaded into the caches
It's nice when the programmer understands this and takes it into account whilst writing their application. But it doesn't help as much as it could, precisely because of differences in cache sizes and performance that only runtime checks can detect and exploit. At the theoretical maximum, compilers and smarter runtime systems are better at this than the programmer can be.
With JIT and AOT, the code can be compiled to run on the specific processor in each individual machine, but it would be easy enough to have a program in an OS which goes through apps, modifying them to run on the specific processor in the machine, provided they're written in a particular way to work with such a system. It would be like using machine code as a bytecode and "compiling" it into machine code on loading: some instructions would be replaced with others which are more efficient if the processor supports them, but the "bytecode" would also run successfully without modification. In most cases this could be done simply by marking parts of the code for modification: precede them with perhaps 147 147 and follow them with a string of 144s (the NOP opcode) long enough to make room for the possible replacement code, the 147s being overwritten during the conversion process.
DavidCooper wrote:I could easily run my own code indirectly (by interpreting it) in order to count up how often it accesses all the different variables and rearrange them accordingly, so everything that a compiler can automate could equally be done for a machine code or assembler programmer
If you had said "I could use valgrind or gprof" or any other existing tool, I would agree, but you're demonstrating a NIHS which suggests you should learn a bit more about existing tools and the state of the art in the compiler area.
NIHS: Not Invented Here Syndrome? National Institute of Health Sciences (Japan)? National Institute for Hometown Security (Somerset, KY)? North Iredell High School (North Carolina)? National Intercollegiate Horse Show?
I can't comment on that mystery, but the tools you speak of simply aren't available on my OS - I have to write all my own tools for everything.
DavidCooper wrote:You've got exactly the same work to do if you're using a compiler (apart from determining the order the instructions come in, though I can automate that too, and will when I have the time - the actual algorithms that you use to process your data are your own design, and the compiler can't help you with that).
In this case the compiler does the reordering for you. This is something the compiler developer has to do only once: give the compiler knowledge of the pipelines, out-of-order interlocked stages and other constraints, and the compiler can take it from there. Of course it will still miss sometimes, because cache and memory access times are not constant, but you cannot do any better by hand - you can only do it slower (measured in code-generation time).
There's no reason why it should be any slower the way I do it - this reordering of code could also be done on loading, by running a program which reads through the app's code making changes to optimise it for the processor - and this could be done so fast you wouldn't notice any delay.
Again: algorithms are important, but we're not questioning that here - I personally assume a reasonably skilled C/asm programmer who can design their programs, but can also profile and fix their code (I have a link to a nice presentation about it if you're interested) if the original design turns out to be inefficient. And compilers are much better at the dumb, mechanically repetitive work that is low-level code optimisation.
I'm interested, but this isn't a priority for me so it can probably wait. Anything a compiler does to aid optimisation of code can be done through a similar kind of program designed to work directly with machine code numbers. I'm not against automation at all: I simply prefer to do my programming in the same way as A.I. will do its programming in the future by working directly with machine code instructions, and once I've written the tools to automate all the things which compilers automate, I won't be at any disadvantage through programming this way.
Hi Gerry,
gerryg400 wrote:I don't believe that you, yourself, can look at a reasonably large program and optimise cache hits the way you describe. Perhaps you could optimise a number of routines chosen after profiling, but all reasonably capable software engineers can do that, and do do it as required. You do realise that modern CPUs have 3 levels of cache for data and code, and that those caches have vastly different miss-penalties at the different cache levels. CPUs also have instruction pipelining, out-of-order execution, branch prediction, etc. To imagine that you can take an average-sized program (let's say 100,000 lines) and analyse it while understanding and keeping in mind all these CPU features, plus the ones that are barely documented, and produce better code than a compiler, in reasonable time, is, well, just imagining.
Do you seriously imagine that app developers write a different version of every large array and database, plus a different algorithm to process each of them, in order to optimise their program for every individual kind of processor it might run on? They can't possibly do that. There will be certain limited procedures working with large amounts of repetitively-structured data which may be worth tuning to specific processors, with multiple versions of that code in the app, but that will be a tiny part of an app which can probably be written in under a thousand lines of code.

Imagination is what you need in order to create new ways of doing things, and what I've done is come up with a system which makes direct machine code programming not just practical, but fast. Once I've written a code-optimisation program to go with it, it will do all that hard work for me and leave me at no disadvantage, so my code will never run slower than yours, but it will probably run faster in many places and it will certainly be a great deal more compact. We can have a code race to see who's right when everything's finished, but in the meantime I need to concentrate on writing my programs without worrying too much about the speed they'll run at - it'll be easy enough to make all the speed tweaks at the end of the process.