ASM; what is faster? And why can I not benchmark it?

Programming, for all ages and all languages.
Daevius
Posts: 13
Joined: Wed Oct 07, 2009 12:38 pm
Location: Groningen, The Netherlands

ASM; what is faster? And why can I not benchmark it?

Post by Daevius »

I made a string-to-integer (char * -> unsigned int) procedure in ASM, but I found 3 methods (two of which are very similar) to multiply a number by 10. I use NASM and GCC.

eax has the number to be multiplied

Code:

    ; method 1: plain multiply (result in edx:eax)
    mov edx, 10
    mul edx

    ; method 2: 10x = 8x + x + x
    mov edx, eax
    shl eax, 3
    add eax, edx
    add eax, edx

    ; method 3: 10x = (4x + x) * 2
    mov edx, eax
    shl eax, 2
    add eax, edx
    shl eax, 1
I did benchmarks using this C code:

Code:

#include <stdio.h>
#include <windows.h>

#define N 1000
#define I 10000

void InitSystemTime();
double SystemTime();

unsigned int a(const char *string);
unsigned int b(const char *string);
unsigned int c(const char *string);

////////////////////////////////////////////////////////////////

int main()
{
    InitSystemTime();

    const char *string = "421";

    double time;
    double stats[3];
    unsigned int n, i;

    stats[0] = stats[1] = stats[2] = 0.0;
    for (n = 0; n < N; n++)
    {
        time = SystemTime();
        for (i = 0; i < I; i++)
        {
            a(string);
        }
        stats[0] += SystemTime() - time;

        time = SystemTime();
        for (i = 0; i < I; i++)
        {
            b(string);
        }
        stats[1] += SystemTime() - time;

        time = SystemTime();
        for (i = 0; i < I; i++)
        {
            c(string);
        }
        stats[2] += SystemTime() - time;
    }
    printf("%g\n", stats[0] / N);  /* average seconds per run of I calls */
    printf("%g\n", stats[1] / N);
    printf("%g\n", stats[2] / N);
    return 0;
}

////////////////////////////////////////////////////////////////

LARGE_INTEGER Frequency;
BOOL UseHighPerformanceTimer;

void InitSystemTime()
{
    /* note: no BOOL here -- declaring a local would shadow the global
       and leave it forever FALSE */
    UseHighPerformanceTimer = QueryPerformanceFrequency(&Frequency);
}

double SystemTime()
{
    if (UseHighPerformanceTimer)
    {
        LARGE_INTEGER CurrentTime;
        QueryPerformanceCounter(&CurrentTime);

        /* cast before dividing: integer division would truncate to whole seconds */
        return (double) CurrentTime.QuadPart / (double) Frequency.QuadPart;
    }
    else
    {
        return GetTickCount() * 0.001;
    }
}
Results vary between 0.060 and 0.150 seconds for all three of them. How can I detect a speed difference between two instructions, or between two ASM procedures? Can I look up the documentation and calculate the number of clock cycles it should take in theory?

Thanks!
Creature
Member
Posts: 548
Joined: Sat Dec 27, 2008 2:34 pm
Location: Belgium

Re: ASM; what is faster? And why can I not benchmark it?

Post by Creature »

What I always do when speed differences are small is something like this:

Code:

unsigned long Timer;

// Perform test 1.
Timer = GetTicks();

for(unsigned i(0); i < 500000; ++i)
   perform_test_1();

std::cout << "Test 1: Time: " << (GetTicks() - Timer) << '\n';

// Perform test 2.
Timer = GetTicks();

for(unsigned i(0); i < 500000; ++i)
   perform_test_2();

std::cout << "Test 2: Time: " << (GetTicks() - Timer) << '\n';

...
I make the loops incredibly huge so that subtle speed differences add up over many iterations. Of course, if the speed difference is vanishingly small, you should ask yourself whether it is even worth optimizing, or whether you should go for readability rather than speed.
When the chance of succeeding is 99%, there is still a 50% chance of that success happening.
NickJohnson
Member
Posts: 1249
Joined: Tue Mar 24, 2009 8:11 pm
Location: Sunnyvale, California

Re: ASM; what is faster? And why can I not benchmark it?

Post by NickJohnson »

Daevius wrote:I made a string-to-integer (char * -> unsigned int) procedure in ASM, but I found 3 methods (two of which are very similar) to multiply a number by 10. I use NASM and GCC.

eax has the number to be multiplied

Code:

    ; method 1: plain multiply (result in edx:eax)
    mov edx, 10
    mul edx

    ; method 2: 10x = 8x + x + x
    mov edx, eax
    shl eax, 3
    add eax, edx
    add eax, edx

    ; method 3: 10x = (4x + x) * 2
    mov edx, eax
    shl eax, 2
    add eax, edx
    shl eax, 1
You should really write that routine in C, so the compiler can optimize it for you. Something like that may have a different optimal implementation depending on the exact processor model, and you don't want to lock yourself into one implementation. If you have to use assembly, just go with the shortest version, e.g. the three-operand form "imul eax, eax, 10" (note that mul itself doesn't accept an immediate operand). Short means fewer instruction fetches, which is usually faster. Either way, you shouldn't optimize until you profile, and even then optimization is often a bad idea outside the very innermost loops.
Daevius wrote:Can I look up the documentation and calculate the clock cycles it should take in theory?
Yes, there are tables of how many clock cycles each instruction takes, but the counts often vary by model. That is actually a decent approach when the processor model is known and the instruction sequence is short and branch-free, so the instruction cache behaviour is predictable. It may even beat empirical testing in a user environment (like under Windows), where multitasking can introduce unknown delays and cache misses. That is about the only situation where it works, though.
Daevius
Posts: 13
Joined: Wed Oct 07, 2009 12:38 pm
Location: Groningen, The Netherlands

Re: ASM; what is faster? And why can I not benchmark it?

Post by Daevius »

Creature wrote:Of course, if the speed difference is incredibly small, you should ask yourself the question if it is even worth optimizing or whether you should go for readability rather than speed.
Heh, yes, true. I just really enjoy finding faster ways ;-)
NickJohnson wrote:You should really write that routine in C, so the compiler can optimize it for you. Something like that may have a different optimal implementation depending on the exact processor model you have, and you don't want to lock yourself into one implementation.
You have a good point there. C is also far more readable than ASM. Thanks, good idea!