ASM; what is faster? And why can I not benchmark it?

Programming, for all ages and all languages.
Daevius
Posts: 13
Joined: Wed Oct 07, 2009 12:38 pm
Location: Groningen, The Netherlands

ASM; what is faster? And why can I not benchmark it?

Post by Daevius »

I made a string-to-integer (char * -> unsigned int) procedure in ASM, but I found 3 methods (two of which are very similar) to multiply a number by 10. I use NASM and GCC.

eax has the number to be multiplied

Code:

    ; method 1: plain multiply (result in edx:eax)
    mov edx, 10
    mul edx

    ; method 2: 10x = 8x + x + x
    mov edx, eax
    shl eax, 3
    add eax, edx
    add eax, edx

    ; method 3: 10x = (4x + x) * 2
    mov edx, eax
    shl eax, 2
    add eax, edx
    shl eax, 1
I did benchmarks using this C code:

Code:

#include <stdio.h>
#include <windows.h>

#define N 1000
#define I 10000

void InitSystemTime();
double SystemTime();

unsigned int a(const char *string);
unsigned int b(const char *string);
unsigned int c(const char *string);

////////////////////////////////////////////////////////////////

int main()
{
    InitSystemTime();

    const char *string = "421";

    double time;
    double stats[3];
    unsigned int n, i;

    stats[0] = stats[1] = stats[2] = 0.0;
    for (n = 0; n < N; n++)
    {
        time = SystemTime();
        for (i = 0; i < I; i++)
        {
            a(string);
        }
        stats[0] += SystemTime() - time;

        time = SystemTime();
        for (i = 0; i < I; i++)
        {
            b(string);
        }
        stats[1] += SystemTime() - time;

        time = SystemTime();
        for (i = 0; i < I; i++)
        {
            c(string);
        }
        stats[2] += SystemTime() - time;
    }
    printf("%g\n", stats[0] / N);  /* average seconds per run of I calls */
    printf("%g\n", stats[1] / N);
    printf("%g\n", stats[2] / N);
    return 0;
}

////////////////////////////////////////////////////////////////

LARGE_INTEGER Frequency;
BOOL UseHighPerformanceTimer;

void InitSystemTime()
{
    /* note: no BOOL here -- declaring a local would shadow the global
       and leave it forever FALSE */
    UseHighPerformanceTimer = QueryPerformanceFrequency(&Frequency);
}

double SystemTime()
{
    if (UseHighPerformanceTimer)
    {
        LARGE_INTEGER CurrentTime;
        QueryPerformanceCounter(&CurrentTime);

        /* cast before dividing: integer division would truncate to whole seconds */
        return (double) CurrentTime.QuadPart / (double) Frequency.QuadPart;
    }
    else
    {
        return GetTickCount() * 0.001;
    }
}
Results vary between 0.060 and 0.150 seconds for all three of them. How can I detect a speed difference between two instructions, or between two ASM procedures? Can I look up the documentation and calculate the number of clock cycles it should take in theory?

Thanks!
Creature
Member
Posts: 548
Joined: Sat Dec 27, 2008 2:34 pm
Location: Belgium

Re: ASM; what is faster? And why can I not benchmark it?

Post by Creature »

What I always do when speed differences are small is something like this:

Code:

unsigned long Timer;

// Perform test 1.
Timer = GetTicks();

for(unsigned i(0); i < 500000; ++i)
   perform_test_1();

std::cout << "Test 1: Time: " << (GetTicks() - Timer) << '\n';

// Perform test 2.
Timer = GetTicks();

for(unsigned i(0); i < 500000; ++i)
   perform_test_2();

std::cout << "Test 2: Time: " << (GetTicks() - Timer) << '\n';

...
I make the loops incredibly huge so that subtle speed differences add up over many iterations. Of course, if the speed difference is vanishingly small, you should ask yourself whether it is even worth optimizing, or whether you should go for readability rather than speed.
When the chance of succeeding is 99%, there is still a 50% chance of that success happening.
NickJohnson
Member
Posts: 1249
Joined: Tue Mar 24, 2009 8:11 pm
Location: Sunnyvale, California

Re: ASM; what is faster? And why can I not benchmark it?

Post by NickJohnson »

Daevius wrote:I made a string-to-integer (char * -> unsigned int) procedure in ASM, but I found 3 methods (two of which are very similar) to multiply a number by 10. I use NASM and GCC.

eax has the number to be multiplied

Code:

    ; method 1: plain multiply (result in edx:eax)
    mov edx, 10
    mul edx

    ; method 2: 10x = 8x + x + x
    mov edx, eax
    shl eax, 3
    add eax, edx
    add eax, edx

    ; method 3: 10x = (4x + x) * 2
    mov edx, eax
    shl eax, 2
    add eax, edx
    shl eax, 1
You should really write that routine in C, so the compiler can optimize it for you. Something like that may have a different optimal implementation depending on the exact processor model, and you don't want to lock yourself into one implementation. If you have to use assembly, just go with the shortest version, e.g. the three-operand form "imul eax, eax, 10" (note that mul itself doesn't accept an immediate operand). Short means fewer instruction fetches, which is usually faster. Either way, you shouldn't optimize until you profile, and even then optimization is often a bad idea outside the very innermost loops.
Daevius wrote:Can I look up the documentation and calculate the clock cycles it should take in theory?
Yes, there are tables of how many clock cycles each instruction takes, but the counts often vary by model. That is actually a decent approach when the processor model is known and the instruction sequence is short and branch-free, so the instruction cache behaviour is predictable. It may even beat empirical testing in a user environment (like under Windows), where multitasking can introduce unknown delays and cache misses. That is about the only situation where it works, though.
Daevius
Posts: 13
Joined: Wed Oct 07, 2009 12:38 pm
Location: Groningen, The Netherlands

Re: ASM; what is faster? And why can I not benchmark it?

Post by Daevius »

Creature wrote:Of course, if the speed difference is incredibly small, you should ask yourself the question if it is even worth optimizing or whether you should go for readability rather than speed.
Heh, yes, true. I just really enjoy finding faster ways ;-)
NickJohnson wrote:You should really write that routine in C, so the compiler can optimize it for you. Something like that may have a different optimal implementation depending on the exact processor model you have, and you don't want to lock yourself into one implementation.
You have a good point there. C is also far more readable than ASM. Thanks, good idea!