ASM fine-tuning

Pype.Clicker · Post by **Pype.Clicker** » Sat Jun 01, 2002 2:33 am

Can someone remind me which is fastest on the current Pentium architecture?

    pusha

or

    push ebx
    push esi
    push edi

Well, the idea is to fine-tune my task-switching code. I know pusha saves other registers too (eax, ecx, edx, ebp and esp, for those wondering what I'm talking about), but under my calling conventions they can normally be ignored safely or are already saved somewhere else.

With all the pairing, caching, etc., I really wonder which will be fastest. If you have any info, let me know.

crazybuddha · Post by **crazybuddha** » Sat Jun 01, 2002 7:24 am

I don't have any idea to what extent the following table is accurate, but it might be interesting to know about:

http://www.quantasm.com/opcode_i.html

Regarding the actual time of PUSHA, there's no single answer. If you are at the point where you are trying to shave cycles, you need to read these:

http://www.agner.org/assem/
http://www.azillionmonkeys.com/qed/optimize.html

as well as Intel's manuals (which are presently offline).

Although usually optimization issues are prefaced with a reprimand about how unnecessary it is, nothing will teach you as much as trying to quantify and speed up your code performance, even if you do a lot of silly things in the process.

f2 · Post by f2 » Sat Jun 01, 2002 9:51 am

>> as well as Intel's manuals (which are presently offline). <<

Where would you get these?

crazybuddha · Post by **crazybuddha** » Sat Jun 01, 2002 10:36 am

http://www.intel.com/design/litcentr/index.htm

f2 · Post by f2 » Sat Jun 01, 2002 11:16 am

I thought that was the place. I ordered the books about a month and a half ago, and they STILL haven't come in!

Tim · Post by **Tim** » Sat Jun 01, 2002 11:37 am

I'm not going to guess which is faster on a modern processor, but I'd say that for more than five registers, pusha will be quicker than a series of register pushes.

However, it looks like Pype.Clicker is only saving three registers (why, may I ask?) so the three push instructions will apparently be faster on all CPUs than a pusha.

crazybuddha · Post by **crazybuddha** » Sat Jun 01, 2002 12:07 pm

Don't quote me, but I believe PUSHA will take significantly more than 5 cycles on a Pentium unless the stack it writes to is already in cache, which may be a real problem given how it's being used.

My money is on the register pushes in any case.

Pype.Clicker · Post by **Pype.Clicker** » Sun Jun 02, 2002 11:55 am

>> Although usually optimization issues are prefaced with a reprimand about how unnecessary it is, nothing will teach you as much as trying to quantify and speed up your code performance, even if you do a lot of silly things in the process. <<

Well, I have virtually no experience with quantifying code speed ... do you have any nice tutorials about clock-cycle counting? I've been told of a "time stamp counter" or something similar in the Pentium+ architecture ... how exactly can it be used?

crazybuddha · Post by **crazybuddha** » Sun Jun 02, 2002 1:27 pm

RDTSC is the instruction. If it's not supported in your assembler, you can just hand-code it (the opcode bytes are 0F 31). Do a Google search for the following document:

rdtscpm1.pdf
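As a rough illustration of cycle counting with RDTSC, here is a minimal sketch in C. It assumes GCC or Clang inline assembly on x86; the names `read_tsc`, `sample_work` and `min_ticks` are made up for this example, and the `clock()` fallback is only there so the sketch still builds on other targets. It hand-codes the instruction as raw bytes, times a routine repeatedly, and keeps the smallest count.

```c
#include <stdint.h>
#include <time.h>

/* Read the time stamp counter as one 64-bit value. */
static uint64_t read_tsc(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t lo, hi;
    /* RDTSC hand-coded as its opcode bytes 0F 31; the result arrives in EDX:EAX. */
    __asm__ __volatile__(".byte 0x0f, 0x31" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    /* Assumption: no TSC on this target, so fall back to clock() ticks. */
    return (uint64_t)clock();
#endif
}

/* Hypothetical stand-in for the code under test. */
static void sample_work(void)
{
    volatile int sink = 0;
    for (int i = 0; i < 100; i++)
        sink += i;
}

/* Time fn() `runs` times and keep the smallest tick count, which
   filters out runs disturbed by interrupts or cache misses.
   Returns UINT64_MAX if runs <= 0. */
static uint64_t min_ticks(void (*fn)(void), int runs)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t start = read_tsc();
        fn();
        uint64_t elapsed = read_tsc() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}
```

Calling `min_ticks(sample_work, 1000)` gives a best-case tick count for the routine. Timing an empty function the same way gives an estimate of the measurement overhead to subtract. On out-of-order processors it is commonly advised to put a serializing instruction such as CPUID before RDTSC; this sketch leaves that out for brevity.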

BTW, the Intel docs appear to be mirrored here:

http://www.x86.org/intel.doc/inteldocs.htm

The Agner Fog document is really the only tutorial I'm aware of.

OSDev.org
