Been looking around for some recent information on opcode timings. Everything seems to be based on old processors (Pentium 1, etc.). I have an AMD Athlon XP 2600, and there are only a couple of instructions I'm after:
1) Callgate in use on the same segment and of less protection
2) Ret in the same segment and of lesser protection
The reason they don't publish them anymore is that they were never accurate once all the caching/pipelining stuff had been taken into consideration. Just do some benchmarking against your clock/counter.
Moreover, the actual number of clock cycles required for such operations will depend heavily on how those 'complex instructions' are encoded into microcode ROM instructions (they're more or less interpreted on modern RISC-cored CPUs ...)
If you want to optimize your code, check out the "IA-32 Intel Architecture Optimization Reference Manual". It's a companion to the Software Developer Manuals. It explains all the branch prediction, pipelining, etc. I'm afraid there is no easy "here's the table" answer to the question of how to write the most efficient code.
Note, however, the old rule: Premature optimization is the root of all evil. (D. Knuth) Before you start hacking Assembler "because it's faster", assert that:
* the code section you are optimizing actually is the one causing performance problems, and
* your Assembler actually is faster than compiled C.
In both cases, chances are the answer is "no"...
Every good solution is obvious once you've found it.
great stuff, candy ... all those "optimized long integer multiply" and other "optimized decimal-to-asciiz conversion" routines will certainly please those who're writing a stdlib replacement
Candy wrote:
I just hate to see people give incorrect answers...
Note that:
1) That table does not list execution time in clock cycles, but execute latencies and decode type - which is a different ballgame, and to understand all implications you have to understand the architecture;
2) Numbers given are for AMD Athlon; are you sure it won't differ with the Athlon XP?
Solar wrote:
1) That table does not list execution time in clock cycles, but execute latencies and decode type - which is a different ballgame, and to understand all implications you have to understand the architecture;
Not necessarily. The numbers CAN also be read as maximum clock cycles, not counting memory access latencies, just like on a 486, for example.
2) Numbers given are for AMD Athlon; are you sure it won't differ with the Athlon XP?
Athlon is an architecture (actually K7); Athlon XP is a die size plus marketing. The Morgan-series Durons are exactly the same but have less cache, the Spitfire Durons are the same at a larger die size (0.18 afaik), and so on. The 32-bit Athlons are all the same (note, this is NOT true for the 64-bit ones), except for some minor differences in features (but not in latencies!). A processor won't have a latency for a feature it doesn't have, of course.