OSDev.org • float/double/long double Implementation - Page 2

Page 2 of 3

Re: float/double/long double Implementation

Posted: Mon Oct 13, 2008 6:22 pm

by 01000101

Thanks Brendan for the info, it helped me out quite a bit with the multiplication functions. I have implemented add/sub/mov/clr/cmp/shr/shl/pow/mul so far, but I am having real troubles with the modulus function. I had a small one made that was able to handle smaller numbers, but it cracks under larger numbers. I was also using the subtraction function for this, which was EXTREMELY slow. How exactly am I supposed to calculate the modulus of a number without using a subtraction loop? I am currently implementing a division function, so if the mod function requires it, it will be there in the near future.

I tested my functions using keys of > 1024 and the seem to work fine, I'll be doing more testing soon though. How fast should a multiplication or power function take on integers of > 512 bits? should it be noticeable, or are my functions just too crude and unoptimized in comparison to time-tested designs? I can see a definite ~.5 sec pause when performing large multiplication or power operations.

Re: float/double/long double Implementation

Posted: Tue Oct 14, 2008 1:45 am

by AJ

01000101 wrote:How exactly am I supposed to calculate the modulus of a number without using a subtraction loop?

I'm no mathematician, but I can certainly see a way by using your division routine (once that's in place).

* Perform the division.
* Multiply the answer back up by the divisor.
* Subtract the answer of the multiplication from the original dividend.

A lot of integer division routines use a method that arrives with the modulus and division simultaneously. I have written an OS library that I am using with my boot loader. One of the features of this library is that it allows GCC to perform 64 bit modulus/division functions without unresolved references to __divdi3/__moddi3 and __udivdi3/__umoddi3. I found the implementation for this after some googling. As I say, I am not a mathematician and it may be that this implementation is not efficient. Also, these functions assume the availability of existing 32 bit division/mod routines and I don't know if they can be applied to floats in the same way:

Code: Select all

uint64_t __div_64_32(uint64_t* dividend, uint32_t divisor)
{
	uint32_t dividendLow, dividendLow2, dividendHigh, remainder;
	dividendLow = (uint32_t)(*dividend & 0xFFFFFFFF);
	dividendHigh = (uint32_t)(*dividend >> 32);
	remainder = dividendHigh % divisor;
	dividendHigh = dividendHigh / divisor;
	dividendLow2 = dividendLow >> 16;
	dividendLow2+= remainder << 16;
	remainder = dividendLow2 % divisor;
	dividendLow2 = dividendLow2 / divisor;
	dividendLow = dividendLow & 0xFFFF;
	dividendLow+= remainder << 16;
	remainder = dividendLow % divisor;
	dividendLow = dividendLow / divisor;
	*dividend = dividendLow + ((uint64_t)dividendLow2 << 16) + ((uint64_t)dividendHigh << 32);

	return(remainder);
}

As you can probably see, the division is stored in the dividend and the modulus is returned. Now I'm off to hide before someone starts flaming about how untidy the code is and how inefficient that routine is

HTH,
Adam

Re: float/double/long double Implementation

Posted: Tue Oct 14, 2008 3:02 am

by Combuster

And division itself is still a long subtraction loop. Just look at how long division works...

Re: float/double/long double Implementation

Posted: Tue Oct 14, 2008 3:41 am

by Brendan

Hi,

01000101 wrote:I had a small one made that was able to handle smaller numbers, but it cracks under larger numbers.

Modulus is the remainder of a division, so once you've got division sorted out modulus should be easy - the same function should give you the quotient and remainder (and therefore, the same function should be useful for both division and modulus).

For an example of this, here's the "long division" example from my previous post:

Code: Select all

         1 2 3 4      <-answer
 1 2 ) 1 4 8 0 9
     - 1 2
      ----
         2 8
       - 2 4
        ----
           4 0
         - 3 6
          ----
             4 9
           - 4 8
            ----
             0 1         <- remainder

From this example you can see that "14809 % 12 = 1" (and also that "14809 / 12 = 1234")...

01000101 wrote:I tested my functions using keys of > 1024 and the seem to work fine, I'll be doing more testing soon though. How fast should a multiplication or power function take on integers of > 512 bits? should it be noticeable, or are my functions just too crude and unoptimized in comparison to time-tested designs? I can see a definite ~.5 sec pause when performing large multiplication or power operations.

For multiplication, you can skip entire "rows" if a digit is zero; and avoid doing any multiplication for a row if the digit is one. You can also test if all remaining digits in the second operand are zero and stop if they are.

For a number with a lot of digits you might be able to improve performance by caching previous results. This would work a lot better with small number bases though. For e.g. with "base 8" (octal) there's a good chance that a digit will be repeated, but for "base 4294967296" there's almost no chance a digit will be repeated. Hopefully you'll be using the largest base you can to reduce the number of smaller multiplications needed, so this probably won't help (I just thought I'd mention it in case you can think of a way of using it).

I'm hoping at least some of this made sense. Here's an example that might help:

Code: Select all

              1 2 3 4 5 6 7      <-first operand
*             0 0 0 1 0 2 2      <-second operand
 ---------------------------
              2 4 6 9 1 3 4      <- row needed to be calculated
+           2 4 6 9 1 3 4 0      <- row obtained from recycling results from previous row
+         0 0 0 0 0 0 0 0 0      <- entire row can be skipped because digit was zero
+       1 2 3 4 5 6 7 0 0 0      <- entire row copied from first operand because digit was one
+     0 0 0 0 0 0 0 0 0 0 0      <- entire row can be skipped because remaining digits in second operand are all zero
+   0 0 0 0 0 0 0 0 0 0 0 0      <- entire row can be skipped because remaining digits in second operand are all zero
+ 0 0 0 0 0 0 0 0 0 0 0 0 0      <- entire row can be skipped because remaining digits in second operand are all zero
 --------------------------
  0 0 0 1 2 6 1 7 2 7 4 7 4      <- answer

To work out powers, there's the simple way like this:

Code: Select all

int pow(int a, int b) {
    int result = 1;

    while(b < 0) {
        result /= a;
        b++;
    }
    while(b > 0) {
        result /= a;
        b--;
    }
    return result;
}

A faster way might go something like this:

Code: Select all

int pow(int a, int b) {
    int result;

    if(b == 0) return 1;
    if(b > 0) {
        temp = a;
    } else {
        temp = 1 / a;
    }
    do {
        if( (b & 1) != 0) result *= temp;
        b = b >> 1;
        temp *= temp;
    } while(b != 0);
    return result;
}

I haven't tested this, but in theory I think it should work, and if it does work it'd be a lot faster than the simple way (for signed 32-bit integers you'd do a maximum of 31 multiplications, instead of a maximum of over 2 billion multiplications)...

Cheers,

Brendan

Re: float/double/long double Implementation

Posted: Tue Oct 14, 2008 7:22 am

by Craze Frog

This is how I do Pow().

Code: Select all

function POW(Num as real, Exp as int) as real
    new Result as real
    Result = 1
    while Exp > 0
        if Exp & 1
            Result = Result * Num
            Exp = Exp - 1
        else
            Num = Num * Num
            Exp = Exp >> 1
        endif
    wend
  return Result
endfunction

Re: float/double/long double Implementation

Posted: Tue Oct 14, 2008 12:25 pm

by 01000101

Thanks guys!
Those were great examples, very useful.

Hopefully I'll be able to extract the necessary info from them to formulate a function that can be used on two 128-bit+ integers.

I believe I will purchase the Art of Computer Programming, Volume 2: Seminumerical Algorithms (3rd Edition) by Donald E. Knuth to aid in this as well. It seems to be the 'classic' book on basic algorithms implemented in multi/arbitrary precision designs.

Re: float/double/long double Implementation

Posted: Wed Oct 15, 2008 6:42 pm

by 01000101

OK, so I've been working on some of these functions more and trying to do some more optimizations and such. Here is what I have so far, I'd really like to get these functions faster and as quick as possible as they are 'core' mp functions in essence and will determine higher-level multi-precision functions' speeds.

They are a bit bulky, but I believe it is necessary, I'd be very happy to get proven wrong though.

These are public domain as I believe these are very far from being implemented in any stable version of my OS. I would like them to at least have a reference to me though if implemented in the wiki or this forum. I think it would be great for this forum to tackle something like multi/arbitrary precision integers as I myself had trouble finding documentation or simple explanations/source code (that was PD/LGPL of course). Maybe if these get worked out, a wiki article could be in order?

I know these are crude, but the point is to get them leaner.
BTW, some/most of these are reliant on the d_bits variable that declares the width of the largest amount of bits to be used. Also, these are for positive integers only, my goal was not to get decimal numbers, but large primes and keys for encryption algorithms.

Code: Select all

#define d_bits 512

basic addition with integer overflow alert (mp integer >= max for bit width).

Code: Select all

void mp_add(uint32 *num1, uint32 *num2, uint32 *result, uint32 bits)
{
    static uint64 sum64 = 0;
    static uint32 i = 0, carry = 0, sum32 = 0;
    static uint32 out[d_bits/32];
    
    for(i = 0; i < bits/32; i++)
    {
        sum64 = (uint64)((uint64)num1[i] + (uint64)num2[i] + carry);
        sum32 = sum64 & 0xFFFFFFFF;
        carry = ((sum64 >> 32) & 0xFFFFFFFF);

        if(carry)
        {if(i == bits-1){printf("mp - integer overflow (add) \n");}}
        
        out[i] = sum32;
    }  
    mp_mov(out, result, bits);
}

basic subtraction between two mp integers. How can I catch negative results?

Code: Select all

void mp_sub(uint32 *num1, uint32 *num2, uint32 *result, uint32 bits)
{
    static uint64 sum64 = 0;
    static uint32 i = 0, carry = 0, sum32 = 0;
    static uint32 out[d_bits/32];
    
    for(i = 0; i < bits/32; i++)
    {
        if(num2[i])
        {
            sum64 = (uint64)((uint64)num1[i] - (uint64)num2[i] - carry);
            sum32 = sum64 & 0xFFFFFFFF;
            carry = ((sum64 >> 32) & 0xFFFFFFFF);
            carry = ~carry + 1;
            out[i] = sum32;
        }
        else
        {
            out[i] = num1[i] + carry;
            carry = 0;
        }
    }
    mp_mov(out, result, bits);
}

mp_clr (clear/set to 0) and mp_cmp (compare) may or may not rely on standard functions as I kind of doubt my versions of memset and memcmp use the same function variables as the 'standards' do.

Code: Select all

void mp_clr(uint32 *out, uint32 bits)
{memset((char *)out, 0, (bits/8));}

uint8 mp_cmp(uint32 *num1, uint32 *num2, uint32 bits)
{return memcmp((char *)num1, (char *)num2, (bits/8));}

mp_shl (SHift Left) and mp_shr (SHift Right) pad the appropriate ends with 0's, unlike most examples where it takes a value from a second mp variable.

Code: Select all

void mp_shl(uint32 *num1, uint32 *result, uint32 count, uint32 bits)
{
    static uint32 i = 0, x = 0, carry = 0, carry_tmp = 0;
    static uint32 out[d_bits/32];
    
    mp_mov(num1, out, bits);
    
    while(count--)
    {
        for(x = 0; x < (bits/32); x++)
        {
            carry_tmp = (out[x] >> 31) & 0x00000001;
            out[x] <<= 1;
            out[x] += carry;
            carry = carry_tmp;
        }
    }
    
    mp_mov(out, result, bits);
}

void mp_shr(uint32 *num1, uint32 *result, uint32 count, uint32 bits)
{
    static uint32 i = 0, x = 0, carry = 0;
    static uint32 out[d_bits/32];
    
    mp_mov(num1, out, bits);
    
    while(count--)
    {
        for(x = 0; x < (bits/32); x++)
        {
            if(x == (bits/32)-1){out[x] >>= 1;}
            else
            {
                carry = (out[x+1] << 31) & 0x80000000;
                out[x] >>= 1;
                out[x] += carry;
            }
        }
    }
    mp_mov(out, result, bits);
}

Here is the mp_mul (multiply) and mp_pow (power^) functions. I took advice from Brendan about catching 0's or 1's in the multiplier to speed things up. The mp_mul is a beast of a function, but the comparisons take up a lot of lines.

Code: Select all

void mp_mul(uint32 *num1, uint32 *num2, uint32 *result, const uint32 bits)
{
    static uint64 sum64 = 0;
    static uint32 i = 0, x = 0, carry = 0, sum32 = 0; 
    static uint32 buf[d_bits/32], out[d_bits/32];
    
    mp_clr(buf, bits);  // init mp integers
    mp_clr(out, bits);
    
    // see if either mp integers are 0, if so, return 0
    if(mp_cmp(num1, buf, bits)){mp_mov(out, result, bits); return;}
    if(mp_cmp(num2, buf, bits)){mp_mov(out, result, bits); return;}
    buf[0] = 1;
    // see if either mp integers are 1, if so return the other mp integer
    if(mp_cmp(num1, buf, bits)){mp_mov(num2, result, bits); return;}
    if(mp_cmp(num2, buf, bits)){mp_mov(num1, result, bits); return;}
    buf[0] = 0;
    
    for(i = 0; i < bits/32; i++)
    {
        if(num2[i] == 0){mp_clr(buf, bits); buf[0] = carry; carry = 0;}
        else if(num2[i] == 1){mp_mov(num1, buf, bits); buf[0] += carry; carry = 0;}
        else
        {
            for(x = 0; x < bits/32; x++)
            {
                if(num1[x] == 0){buf[x] = carry; carry = 0;}
                else if(num1[x] == 1){buf[x] = num2[i] + carry; carry = 0;}
                else
                {
                    sum64 = (uint64)num2[i] * (uint64)num1[x];
                    sum64 += carry;     // add back in previous carry integer
                    carry = (uint32)((uint64)sum64 >> 32) & 0xFFFFFFFF; // get HI_32
                    buf[x] = (uint32)((uint64)sum64 & 0xFFFFFFFF);      // get LO_32
                }
            }
        }
        mp_shl(buf, buf, i*32, bits); // (in, out, shift_count, size)
        mp_add(buf, out, out, bits);  // (in, +in, out, size)
    }
    mp_mov(out, result, bits);        // (from, to, size)
    
    // the result+out mp integers are there for the ability to do 
    // something like mp_mul(num1, num2, num1, bits) and not have 
    // num1 get screwed up because it is both an input and output
    // integer. 
}

void mp_pow(uint32 *num1, uint32 *num2, uint32 *result, const uint32 bits)
{
    static uint32 buf[d_bits/32], num[d_bits/32], exp[d_bits/32], out[d_bits/32];
    mp_mov(num1, num, bits);
    mp_mov(num2, exp, bits);
    
    mp_clr(buf, bits);
    mp_clr(out, bits);
    if(mp_cmp(exp, out, bits)){out[0] = 1; mp_mov(out, result, bits); return;}
    out[0] = 1;
    
    while(1)
    {
        
        buf[0] = 0;
        if(mp_cmp(exp, buf, bits)){break;}

        if(exp[0] & 0x01)
        {
            mp_mul(out, num, out, bits);
            buf[0] = 1;
            mp_sub(exp, buf, exp, bits);
        }
        else
        {
            mp_shr(exp, exp, 1, bits);
            mp_mul(num, num, num, bits);  
        }
    }
    
    mp_mov(out, result, bits);
}

Once I write the mp_div (divide) and mp_mod (modulus) functions, I'll get back about that. I'll soon whip up the xor/or/and/not functions in a bit.

Don't be to harsh on these as I have never declared to be a mathematician.

Re: float/double/long double Implementation

Posted: Wed Oct 15, 2008 9:15 pm

by Brendan

Hi,

01000101 wrote:OK, so I've been working on some of these functions more and trying to do some more optimizations and such. Here is what I have so far, I'd really like to get these functions faster and as quick as possible as they are 'core' mp functions in essence and will determine higher-level multi-precision functions' speeds.

The problem with C is that you can't get any access to the carry flag, so you need to do extra/unnecessary work to detect overflows (where it's extremely unlikely that the compiler's optimizer can remove this extra/unnecessary work and just use the carry). Switching to assembly or inline assembly would solve that...

The problem with assembly is that it's machine specific - you'd need to use some conditional code with a different version for each target architecture (for e.g. a 32-bit 80x86 version and a 64-bit 80x86 version). The difference between assembly and C could be significant though.

For example, to detect if subtraction overflows you could compare the numbers first (if "a < b" then "a - b" will overflow) but that'll double the time it'd take to do the subtraction.

For addition, you're calculating your own carry flag and that would probably double the time addition takes.

For shifts, I have a theory that it's possible to do the initial data move and the shift at the same time, something like this (but repeated or in a loop):

Code: Select all

    mov eax,[esi+?]
    shld ebx,eax,ecx
    mov [edi+edx+?],ebx

Note: I also think your C code for shifts is broken - the initial data move needs to move the data by "bits / 32" dwords while the post-shift needs to use "bits % 32".

For multiplication, when the CPU does multiplication the result is twice as wide as the operands (for e.g. multiplying a pair of 32-bit values gives you a 64-bit result, or multiplying a pair of 64-bit values gives you a 128-bit result). You can't use this in C and you end up with a situation where multiplying a pair of 32-bit values gives you a 32-bit result that may have overflowed; and to get around that you'd need to work on 16-bit values (so you'd be doing "16-bit * 16-bit = 32-bit with no chance of overflow" instead of "32-bit * 32-bit = 32-bit with likely overflow problems"; or "32-bit * 32-bit = 64-bit with no chance of overflow" instead of "64-bit * 64-bit = 64-bit with likely overflow problems"). Because of the nature of the operation this will make your code about 4 times slower - for 512-bit multiplication, on 32-bit computers you'd end up doing 1024 multiplications instead of 256 multiplications (or on 64-bit computers, 256 multiplications instead of 64 multiplications).

For the power function, the performance you get will mostly depend on the performance of multiplication. If multiplication is 4 times slower, then the power function will be 4 times slower too.

Then there's smaller things. When setting a 512-bit number to zero, the code should be unrolled so it uses 16 "mov [edi+?],eax" instructions (or 8 "mov [rdi+?], rax" instructions, or maybe 4 "MOVDQA" instructions). For comparisons, I'd use "rep cmpsd".

Mostly what I'm saying here is that if you care about performance you need to use assembly, because C isn't very good at this sort of thing (partly because C won't let you access the carry flag).

For an example, here's some code for 512-bit addition on 64-bit computers:

Code: Select all

;Addition
;
;Input
; rdx   Address of first operand
; rsi   Address of second operand
; rdi   Address to put result
;
;Output
; rax   Zero if no overflow, 0xFFFFFFFFFFFFFFF if overflowed
;
;Trashed
; r8, r9, r10, r11, r12, r13, r14

mp_add:
	mov rax,[rdx]
	mov r8,[rdx+8]
	mov r9,[rdx+16]
	mov r10,[rdx+24]
	mov r11,[rdx+32]
	mov r12,[rdx+40]
	mov r13,[rdx+48]
	mov r14,[rdx+56]
	add rax,[rsi]
	adc r8,[rsi+8]
	adc r9,[rsi+16]
	adc r10,[rsi+24]
	adc r11,[rsi+32]
	adc r12,[rsi+40]
	adc r13,[rsi+48]
	adc r14,[rsi+56]
	mov [rdi],eax
	mov [rdi+8],r8
	mov [rdi+16],r9
	mov [rdi+24],r10
	mov [rdi+32],r11
	mov [rdi+40],r12
	mov [rdi+48],r13
	mov [rdi+56],r14
	mov rax,0
	sbc rax,0
	ret

Here's a modified version of your code:

Code: Select all

int mp_add(uint32 *num1, uint32 *num2, uint32 *result, uint32 bits) {
    static uint64 sum64 = 0;
    static uint32 i = 0, carry = 0, sum32 = 0;

    for(i = 0; i < bits/32; i++)
    {
        sum64 = (uint64)((uint64)num1[i] + (uint64)num2[i] + carry);
        sum32 = sum64 & 0xFFFFFFFF;
        carry = ((sum64 >> 32) & 0xFFFFFFFF);
        result[i] = sum32;
    }  
    return carry;
}

Compiling this version with "gcc -O3" gave me this disassembly:

Code: Select all

00000000004004d0 <mp_add>:
  4004d0:       8d 41 1f                lea    0x1f(%rcx),%eax
  4004d3:       85 c9                   test   %ecx,%ecx
  4004d5:       41 89 ca                mov    %ecx,%r10d
  4004d8:       49 89 d3                mov    %rdx,%r11
  4004db:       c7 05 4b 0b 20 00 00    movl   $0x0,0x200b4b(%rip)        # 601030 <i.2240>
  4004e2:       00 00 00
  4004e5:       44 0f 48 d0             cmovs  %eax,%r10d
  4004e9:       41 c1 fa 05             sar    $0x5,%r10d
  4004ed:       45 85 d2                test   %r10d,%r10d
  4004f0:       7e 51                   jle    400543 <mp_add+0x73>
  4004f2:       8b 0d 34 0b 20 00       mov    0x200b34(%rip),%ecx        # 60102c <carry.2241>
  4004f8:       45 31 c9                xor    %r9d,%r9d
  4004fb:       45 31 c0                xor    %r8d,%r8d
  4004fe:       66 90                   xchg   %ax,%ax

  400500:       4a 63 14 87             movslq (%rdi,%r8,4),%rdx
  400504:       4a 63 04 86             movslq (%rsi,%r8,4),%rax
  400508:       48 63 c9                movslq %ecx,%rcx
  40050b:       41 ff c1                inc    %r9d
  40050e:       48 01 ca                add    %rcx,%rdx
  400511:       48 01 d0                add    %rdx,%rax
  400514:       48 89 c2                mov    %rax,%rdx
  400517:       43 89 04 83             mov    %eax,(%r11,%r8,4)
  40051b:       49 ff c0                inc    %r8
  40051e:       48 c1 fa 20             sar    $0x20,%rdx
  400522:       45 39 d1                cmp    %r10d,%r9d
  400525:       89 d1                   mov    %edx,%ecx
  400527:       75 d7                   jne    400500 <mp_add+0x30>

  400529:       44 89 15 00 0b 20 00    mov    %r10d,0x200b00(%rip)        # 601030 <i.2240>
  400530:       89 05 f2 0a 20 00       mov    %eax,0x200af2(%rip)        # 601028 <sum32.2242>
  400536:       48 89 05 fb 0a 20 00    mov    %rax,0x200afb(%rip)        # 601038 <sum64.2239>
  40053d:       89 15 e9 0a 20 00       mov    %edx,0x200ae9(%rip)        # 60102c <carry.2241>
  400543:       8b 05 e3 0a 20 00       mov    0x200ae3(%rip),%eax        # 60102c <carry.2241>
  400549:       c3                      retq

My assembly only does 8 64-bit additions. GCC's loop does 32 64-bit additions (a pair of 64-bit additions for each 32-bits of results), which is 4 times as many additions. Including loop overhead, shifting, etc I'd guess that GCC's code will be at least 6 times slower.

Cheers,

Brendan

Re: float/double/long double Implementation

Posted: Thu Oct 16, 2008 3:38 am

by 01000101

You have some very good points concerning the inefficiency of C under these circumstances. As I'm not all that concerned about architecture compatibility, inline assembly is fine.

I changed a couple functions and hoped we could continue this critiquing until a polished (or at least less embarrassing) code-set emerges as I believe it could be useful later on for both OS and library developers.

new mp_shr and mp_shl.

Code: Select all

void mp_shr(uint64 *out)
{
    __asm__ __volatile__("  addq %%rcx, %0;     "
                         "  subq $0x08, %0;     " 
                         "  shrq $0x01, (%0);   "
                         "  subq $0x08, %0;     "
                         "  decq %%rcx;         "
                         "shrloop:              "
                         "  rcrq $0x01, (%0);   "
                         "  subq $0x08, %0;     "
                         "  loopnz shrloop;     "
                         :
                         : "r"((uint64)out), "c"((uint64)(d_bits/64))
                         : "memory");
}

void mp_shl(uint64 *out)
{
    __asm__ __volatile__("  shlq $0x01, (%0);   "
                         "  addq $0x08, %0;     "
                         "  decq %%rcx;         "
                         "shlloop:              "
                         "  rclq $0x01, (%0);   "
                         "  addq $0x08, %0;     "
                         "  loopnz shlloop;     "
                         :
                         : "r"((uint64)out), "c"((uint64)(d_bits/64))
                         : "memory");
}

I was unable to find what the SBC assembly instruction that you used earlier does or how it is used, so I will need a little more info about handling overflows/negation.

add+sub+mov

Code: Select all

void mp_mov(uint64 *from, uint64 *to)
{
    __asm__ __volatile__("movloop:              "
                         "  movq (%0), %%rax;   "
                         "  movq %%rax, (%1);    "
                         "  addq $0x08, %0;     "
                         "  addq $0x08, %1;     "
                         "  loopnz movloop;     "
                         :
                         : "r"((uint64)from), "r"((uint64)to), "c"((uint64)(d_bits/64))
                         : "memory");
}

void mp_add_op(uint64 *num1, uint64 *num2, uint64 *result)
{    
    __asm__ __volatile__("  movq (%0), %%rax;           "
                         "  addq (%1), %%rax;           "
                         "  movq %%rax, (%2);           "
                         "  decq %%rcx;                 "
                         "addloop:                      "
                         "  movq (%0), %%rax;           "
                         "  adcq (%1), %%rax;           "
                         "  movq %%rax, (%2);           "
                         "  addq $0x08, %0;             "
                         "  addq $0x08, %1;             "
                         "  addq $0x08, %2;             "
                         "  loopnz addloop;            "
                         : 
                         : "r" ((uint64)num1), "r" ((uint64)num2), 
                           "r" ((uint64)result), "c" ((uint64)(d_bits/64))
                         : "memory"); 
}

void mp_add(uint64 *num1, uint64 *num2, uint64 *result)
{
    static uint64 out[d_bits/64];
    mp_add_op(num1, num2, out);
    mp_mov(out, result);
}

void mp_sub_op(uint64 *num1, uint64 *num2, uint64 *result)
{    
    __asm__ __volatile__("  movq (%0), %%rax;           "
                         "  subq (%1), %%rax;           "
                         "  movq %%rax, (%2);           "
                         "  decq %%rcx;                 "
                         "subloop:                      "
                         "  movq (%0), %%rax;           "
                         "  sbbq (%1), %%rax;           "
                         "  movq %%rax, (%2);           "
                         "  addq $0x08, %0;             "
                         "  addq $0x08, %1;             "
                         "  addq $0x08, %2;             "
                         "  loopnz subloop;            "
                         : 
                         : "r" ((uint64)num1), "r" ((uint64)num2), 
                           "r" ((uint64)result), "c" ((uint64)(d_bits/64))
                         : "memory"); 
}

void mp_sub(uint64 *num1, uint64 *num2, uint64 *result)
{
    static uint64 out[d_bits/64];
    mp_sub_op(num1, num2, out);
    mp_mov(out, result);
}

these are all 64-bit instructions now and have been reduced in size significantly. Also, does the Xeon 'pause' optimization work in this case, or is that only for extremely tight 'spinlock-like' loops?

Re: float/double/long double Implementation

Posted: Thu Oct 16, 2008 7:41 am

by Brendan

Hi,

01000101 wrote:I was unable to find what the SBC assembly instruction that you used earlier does or how it is used, so I will need a little more info about handling overflows/negation.

That's because I wrote the wrong thing!

I was thinking SBC (SuBtract with Carry) when I should've written SBB (SuBtract with Borrow).

Basically, if RAX is zero and the carry flag is clear then "sbb rax,0" will leave RAX set to zero, and if the carry flag is set then "sbb rax,0" will set RAX to 0xFFFFFFFFFFFFFFFF (or -1 if you look at it as a signed number). Mostly it's the same as "if (carry == clear) { RAX = 0 } else { RAX = 0xFFFFFFFFFFFFFFFF }" except there is no branch.

If you want the function to return "RAX = 1" if an overflow was detected then there's plenty of alternatives - "setc al" (which sets AL to zero if carry is clear, and sets AL to one if carry is set); or maybe "rcl rax,1" (which just shifts the carry flag into the lowest bit of RAX), or "adc rax, 0" (which works in a similar way to "sbb"), or "cmovc eax,1" (which does "eax = 1" if carry is set). I'm not sure which way would be faster...

01000101 wrote:Also, does the Xeon 'pause' optimization work in this case, or is that only for extremely tight 'spinlock-like' loops?

Most modern CPUs do several instructions at the same time. For tight loops, the CPU can be doing several iterations of the same loop at the same time (e.g. some instructions at the end of the loop being retired/completed while more instructions at the start of the loop are being speculatively executed). For spinlocks on systems with hyper-threading this creates problems because many iterations of the same loop may be being executed, which wastes execution resources that could be used by the other logical CPU. It also increases the costs to exit the loop, because those unnecessary instructions being speculatively executed need to be cleared from the pipeline.

The PAUSE instruction fixes this by telling the CPU to wait until all instructions are retired/completed before continuing (which also uses less CPU resources and let's the other logical CPU use those resources).

For a normal loop (anything that isn't a "polling loop"), you want the CPU to execute as much as it can at the same time to improve performance (e.g. so that by the time the CPU figures out which way the "loopnz" will branch, most of the work for the next iteration of the loop is already done). In this case a PAUSE instruction will hurt performance.

Also note that using any form of loop adds overhead (extra calculations and a branch mis-prediction on exit), so unrolling the loop can help. Unrolling also helps to avoid delays caused by fetching data. For example, load everything at the start of the function, so that by the time the last load is executed the data for the first load has arrived, and then start manipulating the data.

Large (unrolled) functions aren't suitable for inlining, but (assuming these functions would be used from different places) inlining can prevent them from being in the trace cache.

Anyway, here's my attempt at shifting left...

Code: Select all

;Shift left ("result = operand1 << operand2")
;
;Input
; rcx	Address of 512-bit operand 2
; rsi	Address of 512-bit operand 1
; rdi	Address of 512-bit result
;
;Output
; rax	Zero if no overflow, non-zero if overflow detected

mp_shr:
	mov rbx,rcx
	and ecx,0x0000003F
	shr rbx,6
	xor edx,edx
	cmp rbx,512/64
	jae .s512
	jmp [.shift_left_table + ebx * 8]

	section .data
.shift_left_table:
	dq .s000, .s064, .s128, .s192, .s256, .s320, .s384, .s448
	section .text

.s512:
	xor r8,r8
	xor r9,r9
	xor r10,r10
	xor r11,r11
	xor r12,r12
	xor r13,r13
	xor r14,r14
	xor r15,r15
	mov rax,[rsi]
	or rax,[rsi+8]
	or rax,[rsi+16]
	or rax,[rsi+24]
	or rax,[rsi+32]
	or rax,[rsi+40]
	or rax,[rsi+48]
	or rax,[rsi+56]
	test ecx,ecx
	jne .donePreShift
	jmp .donePostShift
.s448:
	xor r8,r8
	xor r9,r9
	xor r10,r10
	xor r11,r11
	xor r12,r12
	xor r13,r13
	xor r14,r14
	mov r15,[rsi]
	mov rax,[rsi+8]
	or rax,[rsi+16]
	or rax,[rsi+24]
	or rax,[rsi+32]
	or rax,[rsi+40]
	or rax,[rsi+48]
	or rax,[rsi+56]
	test ecx,ecx
	jne .donePreShift
	jmp .donePostShift
.s384:
	xor r8,r8
	xor r9,r9
	xor r10,r10
	xor r11,r11
	xor r12,r12
	xor r13,r13
	mov r14,[rsi]
	mov r15,[rsi+8]
	mov rax,[rsi+16]
	or rax,[rsi+24]
	or rax,[rsi+32]
	or rax,[rsi+40]
	or rax,[rsi+48]
	or rax,[rsi+56]
	test ecx,ecx
	jne .donePreShift
	jmp .donePostShift
.s320:
	xor r8,r8
	xor r9,r9
	xor r10,r10
	xor r11,r11
	xor r12,r12
	mov r13,[rsi]
	mov r14,[rsi+8]
	mov r15,[rsi+16]
	mov rax,[rsi+24]
	or rax,[rsi+32]
	or rax,[rsi+40]
	or rax,[rsi+48]
	or rax,[rsi+56]
	test ecx,ecx
	jne .donePreShift
	jmp .donePostShift
.s256:
	xor r8,r8
	xor r9,r9
	xor r10,r10
	xor r11,r11
	mov r12,[rsi]
	mov r13,[rsi+8]
	mov r14,[rsi+16]
	mov r15,[rsi+24]
	mov rax,[rsi+32]
	or rax,[rsi+40]
	or rax,[rsi+48]
	or rax,[rsi+56]
	test ecx,ecx
	jne .donePreShift
	jmp .donePostShift
.s192:
	xor r8,r8
	xor r9,r9
	xor r10,r10
	mov r11,[rsi]
	mov r12,[rsi+8]
	mov r13,[rsi+16]
	mov r14,[rsi+24]
	mov r15,[rsi+32]
	mov rax,[rsi+40]
	or rax,[rsi+48]
	or rax,[rsi+56]
	test ecx,ecx
	jne .donePreShift
	jmp .donePostShift
.s128:
	xor r8,r8
	xor r9,r9
	mov r10,[rsi]
	mov r11,[rsi+8]
	mov r12,[rsi+16]
	mov r13,[rsi+24]
	mov r14,[rsi+32]
	mov r15,[rsi+40]
	mov rax,[rsi+48]
	or rax,[rsi+56]
	test ecx,ecx
	jne .donePreShift
	jmp .donePostShift
.s064:
	xor r8,r8
	mov r9,[rsi]
	mov r10,[rsi+8]
	mov r11,[rsi+16]
	mov r12,[rsi+24]
	mov r13,[rsi+32]
	mov r14,[rsi+40]
	mov r15,[rsi+48]
	mov rax,[rsi+56]
	test ecx,ecx
	jne .donePreShift
	jmp .donePostShift
.s000:
	xor eax,eax
	mov r8,[rsi]
	mov r9,[rsi+8]
	mov r10,[rsi+16]
	mov r11,[rsi+24]
	mov r12,[rsi+32]
	mov r13,[rsi+40]
	mov r14,[rsi+48]
	mov r15,[rsi+56]
	test ecx,ecx
	je .donePostShift
.donePreShift:
	shld rdx,r15,cl
	shld r15,r14,cl
	shld r14,r13,cl
	shld r13,r12,cl
	shld r12,r11,cl
	shld r11,r10,cl
	shld r10,r9,cl
	shld r9,r8,cl
	shl r8,cl
.donePostShift:
	mov [rdi],r8
	mov [rdi+8],r9
	mov [rdi+16],r10
	mov [rdi+24],r11
	mov [rdi+32],r12
	mov [rdi+40],r13
	mov [rdi+48],r14
	mov [rdi+56],r15

	or rax,rdx
	ret

And 2 different versions for subtraction:

Code: Select all

;Subtract ("result = operand1 - operand2")
;
;Input
; rdx	Address of 512-bit operand 1
; rsi	Address of 512-bit operand 2
; rdi	Address of 512-bit result
;
;Output
; rax	Zero if no overflow, non-zero if overflow detected

mp_sub_3operand:
	mov r8,[rdx]
	mov r9,[rdx+8]
	mov r10,[rdx+16]
	mov r11,[rdx+24]
	mov r12,[rdx+32]
	mov r13,[rdx+40]
	mov r14,[rdx+48]
	mov r15,[rdx+56]
	sub r8,[rsi]
	sbb r9,[rsi+8]
	sbb r10,[rsi+16]
	sbb r11,[rsi+24]
	sbb r12,[rsi+32]
	sbb r13,[rsi+40]
	sbb r14,[rsi+48]
	sbb r15,[rsi+56]
	sbb eax,0
	mov [rdi],r8
	mov [rdi+8],r9
	mov [rdi+16],r10
	mov [rdi+24],r11
	mov [rdi+32],r12
	mov [rdi+40],r13
	mov [rdi+48],r14
	mov [rdi+56],r15
	ret


;Subtract ("result -= operand2")
;
;Input
; rsi	Address of 512-bit operand 2
; rdi	Address of 512-bit operand 1 (replaced by result)
;
;Output
; rax	Zero if no overflow, non-zero if overflow detected

mp_sub_2operand:
	mov r8,[rsi]
	mov r9,[rsi+8]
	mov r10,[rsi+16]
	mov r11,[rsi+24]
	mov r12,[rsi+32]
	mov r13,[rsi+40]
	mov r14,[rsi+48]
	mov r15,[rsi+56]
	sub [rdi],r8
	sbb [rdi+8],r9
	sbb [rdi+16],r10
	sbb [rdi+24],r11
	sbb [rdi+32],r12
	sbb [rdi+40],r13
	sbb [rdi+48],r14
	sbb [rdi+56],r15
	sbb eax,0
	ret

Also, something makes me think your inline assembly versions aren't right, but I didn't spend much time thinking about them - my brain has a built-in "NASM syntax" parser, and I assume it's easier to compile them and see if they produce the right results that way...

Cheers,

Brendan

Re: float/double/long double Implementation

Posted: Thu Oct 16, 2008 8:30 am

by Combuster

;Output
; rax Zero if no overflow, non-zero if overflow detected
(...)
sbb rax, 0

RAX hasn't previously been set, so if you do a SBB RAX, 0 (RAX := RAX - 0 - carry) then the answer is still undefined.
You probably meant to use a SBB RAX, RAX (RAX := RAX - RAX - carry = -carry) where the result only depends on the carry flag

Re: float/double/long double Implementation

Posted: Thu Oct 16, 2008 9:55 am

by Craze Frog

I couldn't understand what your multiplication does, so maybe this is a related algorithm, but I'll post it just in case. It pushes all the bit manipulations down to already made (let's hope so) comparison, shifting and addition functions. I don't know how fast it is, but it sure is simple.

Code: Select all

Procedure Mul(A, B)
  If A > B    ; Whether this swap improves performance
    Swap A, B ; must simply be tested with a few numbers
  EndIf
  C = B
  While A > 0
    A >> 1
    B << 1
    If A & 1 ; A is odd
      C + B
    EndIf
  Wend
  ProcedureReturn C
EndProcedure

Re: float/double/long double Implementation

Posted: Sun Dec 28, 2008 8:55 pm

by 01000101

I know this is massive thread necromancy, but I think that it's better to extend off of this than create a new thread for the same topic and such.

I've been working on my multi-precision code, and decided to trash it as it was horrid lol. I re-wrote some of the core functions (add, sub, mul, pow, and shl). I'd like to get some feedback about these as the feedback from before (about the previously posted functions) were incredibly helpful. I'd also like to hear any thoughts on how to optimize this code while keeping it arbitrary-precision and without defining sections to pre-defined bit counts (if the operation is using 512-bits, execute this...).

I tried to do a decent job of commenting, and I'm not posting all of them, just a few for now.

I'm currently working on a division functions (and a modulus function spawned from division) so that I can complete the final functions to use in RSA.

I've tested these functions with 4096 bits max (as my print function prints *all* bits instead of what's used, it won't fit 8192 bit on the screen haha) and they work good from what I can tell. The add and multiply functions seem to work at a decent speed and don't have noticeable calculation times.

mp_add:

Code: Select all

global mp_add ; dst + val -> dst
; void mp_add(uint8 * dst, uint8 * val, uint64 bit_size); 
; mp_add(rdi, rsi, rdx); // function argument translations
; trashed: rax, rcx, rdx, rdi, rsi, r8 
mp_add:
    cmp rdx, 0          ; see if bit_size > 0
    ja .checkmod        ; if more than 0 bits, continue; else, return
    ret                    
  
 .checkmod:
    push rdx            ; save bit_size
    mov rax, rdx        ; place bit_size into RAX (for division)
    mov rdx, 0          ; set RDX to 0 (DIV uses RDX:RAX/r64)
    mov r8, 64          ; place 64 (bit alignment) into R8
    div r8              ; divide bit_size/64 (RAX == quotient, RDX == remainder)
    mov rax, rdx        ; move the remainder into RAX
    pop rdx             ; restore bit_size
    
    cmp rax, 0          ; see if remainder == 0
    je .bits2qwords     ; if the remainder is equal to 0, continue; else, return
    ret
    
 .bits2qwords:          ; convert the count in bits to bytes
    shr rdx, 6          ; shift by 6 to get the number of qwords from the given bit_size
    mov rcx, rdx        ; restore count register (RCX) with the count in bytes
    
    cmp rcx, 0          ; is the number of qwords 0?
    je .exit            ; if so, return
    
    clc                 ; clear the carry flag
    pushf               ; push the rflags (to keep the carry flag)
 .addqword:
    popf                ; pop the rflags (to restore the carry flag)
    mov rax, [rdi]      ; put dst qword into the accumulator           
    adc rax, [rsi]      ; add the src qword to the accumulator (with carry)
    mov [rdi], rax      ; store the result into dst
    pushf               ; push the rflags (to keep the carry flag)
    
    add rdi, 8          ; add sizeof(qword) to dst
    add rsi, 8          ; add sizeof(qword) to src
    dec rcx             ; decrease the counter

    cmp rcx, 0          ; if the counter is > 0, loop to .addbyte
    ja .addqword
    
 .exit:
    popf                ; pop the rflags (to restore the carry flag)
    ret                 ; return

mp_mul:

Code: Select all

global mp_mul; dst * val -> dst
; void mp_mul(uint8 * dst, uint8 * val, uint64 bit_size);
; mp_mul(rdi, rsi, rdx) // function argument translations
mp_mul:
    cmp rdx, 64         ; compare the bit_size to 64 bits
    jae .bits2qwords    ; if more than 64 bits, continue; else, return
    ret                    
    
 .bits2qwords:          ; convert the count in bits to bytes
    shr rdx, 6          ; shift by 6 to get the number of qwords from the given bit_size
    mov rcx, rdx        ; restore count register (RCX) with the count in bytes
    
    cmp rcx, 0          ; is the number of qwords 0?
    je .exit            ; if so, return
    
    clc                 ; clear the carry flag
     
    ; qword count (static)                  (rcx)   comparison value
    ; counter for multiplicand (small) loop (r14)   top muliplication value (alternates frequently)
    ; counter for muliplier (large) loop    (r15)   bottom multiplication value (alternates infrequently)
    ; pointer to the temp buffer            (r8)    pointer to (mulbuf)
    ; pointer to the temp buffer (static)   (r9)    incremental pointer to (mulbuf)
    ; pointer to the muliplicand            (rdi)   
    ; pointer to the muliplier              (rsi)

    mov r14, 0          ; when r14 == rcx, we're done with the current loop
    mov r15, 0          ; when r15 == rcx, we're done completely
    
    mov r8, mulbuf      ; r8 is our buffer pointer
    mov r9, r8          ; r9 is our temporary buffer pointer
    
    push rdi            ; save for later (rep movsq)
    
    ; this sub-section just zeros mulbuf
 .zerotmp:
    push rdi            
    push rax
    push rcx
    mov rdi, mulbuf
    mov rax, 0
    rep stosq
    pop rcx
    pop rax
    pop rdi
    
 .mulloop:
    mov rax, [rdi]      ; get multiplicand
    
    cmp rax, 0          ; skip if the multiplicand is 0
    je .rdiis0             
    cmp dword [rsi], 0  ; skip if the multiplier is 0
    je .rsiis0             
    
    mul qword [rsi]     ; multiply by the multiplier
    add rax, [r8]       ; add previous integer from buffer
    mov [r8], rax       ; and store it back to the buffer
    adc rdx, 0          ; get our carry value
    add r8, 8           ; increase our buffer position by 8
    add [r8], rdx       ; and add in the carry value

 .rdiis0:    
    add rdi, 8          ; increase our multiplicand buffer pointer
    inc r14             ; increase the small loop counter
    
    cmp r14, rcx        ; is the small loop counter reached its limit?
    jb .mulloop         ; if no, keep looping
    
 .rsiis0:
    xor r14, r14        ; reset the small loop counter
    inc r15             ; increase the large loop counter
    
    add r9, 8           ; increase our temporary buffer position by 8
    mov r8, r9          ; and place it as our actual buffer pointer

    add rsi, 8          ; increase the multiplier buffer position by 8
    pop rdi             ; restore our multiplicand buffer position
    push rdi            ; and save it again
    cmp r15, rcx        ; has the large loop counter reached its limit?
    jb .mulloop         ; if no, keep looping
    
    pop rdi             ; restore our multiplicand buffer position
    mov rsi, mulbuf     ; set RSI as the buffer pointer
    rep movsq           ; copy our buffer to the multiplicand buffer (for return use)
        
 .exit:
    ret                 ; return

[section .bss]
align 16
mulbuf: resb 8192/8; max bit_size for now is 8192  
[section .text]

mp_shl:

Code: Select all

global mp_shl; dst << count -> dst
; void mp_shl(uint8 * dst, uint64 count, uint64 bit_size);
; mp_shl(rdi, rsi, rdx) // function argument translations
mp_shl:
    cmp rdx, 64         ; compare the bit_size to 64 bits
    jae .bits2qwords    ; if more than 64 bits, continue; else, return
    ret                    
    
 .bits2qwords:          ; convert the count in bits to bytes
    shr rdx, 6          ; shift by 6 to get the number of qwords from the given bit_size
    mov rcx, rdx        ; restore count register (RCX) with the count in bytes
    
    cmp rcx, 0          ; is the number of qwords 0?
    je .exit            ; if so, return
    
    clc                 ; clear the carry flag
    
    push rdi            ; start of integer buffer (save)
    pushfq              ; save the RFLAGS (carry flag)
 .shlloop:
    popfq               ; restore the RFLAGS (carry flag)
    mov rcx, rdx        ; restore the counter to bit_size/8
    pop rdi             ; restore the integer buffer pointer
    push rdi            ; ... and save it again
    
    shl qword [rdi], 1  ; do one SHL (shift left, no carry restoration)
    jmp .compare        ; jump over the RCL
 .wcarry:
    popfq               ; restore the RFLAGS (carry flag)
    rcl qword [rdi], 1  ; do one RCL (shift left with carry)
    
 .compare:
    pushfq              ; restore the RFLAGS (carry flag)
    add rdi, 8          ; increase our buffer position by sizeof(qword)
    dec rcx             ; decrease our counter

    cmp rcx, 0          ; if the counter != 0, keep RCL'ing
    jne .wcarry
    
    dec rsi             ; decrease the bit_size/8 count
    cmp rsi, 0          ; if that is zero, we're done
    jne .shlloop        ; else, SHL and keep looping
 
 .exit:                 
    popfq               ; restore the RFLAGS
    pop rdi             ; restore RDI
    ret                 ; return

mp_pow is just a multiplication loop and mp_sub looks like mp_add just with different key instructions, so if one is flawed, I'm positive the other will be. =)

There isn't a better way to do mp_pow is there? I can only think of multiplying again and again as the solution.

BTW, if anyone wants it, this code is free... in all senses, or if it makes you feel better, it's Public Domain. Once I get a real library of this going, I plan on releasing it as free.

Re: float/double/long double Implementation

Posted: Sun Dec 28, 2008 9:17 pm

by JohnnyTheDon

If you're only using this for RSA, there is a simpler method for exponentiation. It works especially well when a modulus is involved (as there is in RSA) because you can reduce after each step.

In pseudo code:

Code: Select all

base = base % n;
accumulator = base;
for(i = index of first 1 in exponent minus 1, starting from the left; i > 0; i--)
{
     if(exponent[i] == 1)
     {
              accumulator = accumulator^2 * base;

     } else {
              accumulator = accumulator^2;
     }
     accumulator = accumulator % n
}

If you're not using this only for exponentiation with a modulus, sorry for the OT post.

Re: float/double/long double Implementation

Posted: Mon Dec 29, 2008 9:05 am

by Brendan

Hi,

01000101 wrote:There isn't a better way to do mp_pow is there? I can only think of multiplying again and again as the solution.

Brendan wrote:To work out powers, there's the simple way like this:
Code: Select all
int pow(int a, int b) {
    int result = 1;

    while(b < 0) {
        result /= a;
        b++;
    }
    while(b > 0) {
        result /= a;
        b--;
    }
    return result;
}
A faster way might go something like this:
Code: Select all
int pow(int a, int b) {
    int result;

    if(b == 0) return 1;
    if(b > 0) {
        temp = a;
    } else {
        temp = 1 / a;
    }
    do {
        if( (b & 1) != 0) result *= temp;
        b = b >> 1;
        temp *= temp;
    } while(b != 0);
    return result;
}
I haven't tested this, but in theory I think it should work, and if it does work it'd be a lot faster than the simple way (for signed 32-bit integers you'd do a maximum of 31 multiplications, instead of a maximum of over 2 billion multiplications)...

Note: I did test a few positive exponents using this method (on paper) and got the right answer; and there is a bug for negative exponents in the code I posted (for negative exponents, you'd need to negate the exponent before doing the final loop). I'd also assume that (for e.g.) a maximum of 31 multiplications instead of a maximum of over 2 billion multiplications would make a significant performance difference (e.g. enough to make it worth researching/testing my theory).

However, this code won't work in some cases - the exponent must be an integer (e.g. something like "5 ** 2.25" won't work). I'm not sure if there's a trick I'm missing or some other way to handle this case. I do realize that (for e.g.) "5 ** 2.25 = (5 ** 2) * (5 ** 0.25)" and that "5 ** (-2.25) = (5 ** (-2) ) * (5 ** (-0.25) )", but I can't think of an easy way to calculate the result when the exponent is between -1 and 0 or between 0 and 1.

[EDIT]Doh.

I just noticed there's a stupid bug in the code I posted for the "simple method" too. For positive exponents it should be:

Code: Select all

    while(b > 0) {
        result *= a;
        b--;
    }

[/i][/EDIT]

Cheers,

Brendan