float/double/long double Implementation
Re: float/double/long double Implementation
Thanks Brendan for the info, it helped me out quite a bit with the multiplication functions. I have implemented add/sub/mov/clr/cmp/shr/shl/pow/mul so far, but I am having real troubles with the modulus function. I had a small one made that was able to handle smaller numbers, but it cracks under larger numbers. I was also using the subtraction function for this, which was EXTREMELY slow. How exactly am I supposed to calculate the modulus of a number without using a subtraction loop? I am currently implementing a division function, so if the mod function requires it, it will be there in the near future.
I tested my functions using keys of > 1024 and the seem to work fine, I'll be doing more testing soon though. How fast should a multiplication or power function take on integers of > 512 bits? should it be noticeable, or are my functions just too crude and unoptimized in comparison to time-tested designs? I can see a definite ~.5 sec pause when performing large multiplication or power operations.
I tested my functions using keys of > 1024 and the seem to work fine, I'll be doing more testing soon though. How fast should a multiplication or power function take on integers of > 512 bits? should it be noticeable, or are my functions just too crude and unoptimized in comparison to time-tested designs? I can see a definite ~.5 sec pause when performing large multiplication or power operations.
Website: https://joscor.com
Re: float/double/long double Implementation
I'm no mathematician, but I can certainly see a way by using your division routine (once that's in place).01000101 wrote:How exactly am I supposed to calculate the modulus of a number without using a subtraction loop?
* Perform the division.
* Multiply the answer back up by the divisor.
* Subtract the answer of the multiplication from the original dividend.
A lot of integer division routines use a method that arrives with the modulus and division simultaneously. I have written an OS library that I am using with my boot loader. One of the features of this library is that it allows GCC to perform 64 bit modulus/division functions without unresolved references to __divdi3/__moddi3 and __udivdi3/__umoddi3. I found the implementation for this after some googling. As I say, I am not a mathematician and it may be that this implementation is not efficient. Also, these functions assume the availability of existing 32 bit division/mod routines and I don't know if they can be applied to floats in the same way:
Code: Select all
uint64_t __div_64_32(uint64_t* dividend, uint32_t divisor)
{
uint32_t dividendLow, dividendLow2, dividendHigh, remainder;
dividendLow = (uint32_t)(*dividend & 0xFFFFFFFF);
dividendHigh = (uint32_t)(*dividend >> 32);
remainder = dividendHigh % divisor;
dividendHigh = dividendHigh / divisor;
dividendLow2 = dividendLow >> 16;
dividendLow2+= remainder << 16;
remainder = dividendLow2 % divisor;
dividendLow2 = dividendLow2 / divisor;
dividendLow = dividendLow & 0xFFFF;
dividendLow+= remainder << 16;
remainder = dividendLow % divisor;
dividendLow = dividendLow / divisor;
*dividend = dividendLow + ((uint64_t)dividendLow2 << 16) + ((uint64_t)dividendHigh << 32);
return(remainder);
}
HTH,
Adam
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
Re: float/double/long double Implementation
And division itself is still a long subtraction loop. Just look at how long division works...
Re: float/double/long double Implementation
Hi,
For an example of this, here's the "long division" example from my previous post:
From this example you can see that "14809 % 12 = 1" (and also that "14809 / 12 = 1234")...
For a number with a lot of digits you might be able to improve performance by caching previous results. This would work a lot better with small number bases though. For e.g. with "base 8" (octal) there's a good chance that a digit will be repeated, but for "base 4294967296" there's almost no chance a digit will be repeated. Hopefully you'll be using the largest base you can to reduce the number of smaller multiplications needed, so this probably won't help (I just thought I'd mention it in case you can think of a way of using it).
I'm hoping at least some of this made sense. Here's an example that might help:
To work out powers, there's the simple way like this:
A faster way might go something like this:
I haven't tested this, but in theory I think it should work, and if it does work it'd be a lot faster than the simple way (for signed 32-bit integers you'd do a maximum of 31 multiplications, instead of a maximum of over 2 billion multiplications)...
Cheers,
Brendan
Modulus is the remainder of a division, so once you've got division sorted out modulus should be easy - the same function should give you the quotient and remainder (and therefore, the same function should be useful for both division and modulus).01000101 wrote:I had a small one made that was able to handle smaller numbers, but it cracks under larger numbers.
For an example of this, here's the "long division" example from my previous post:
Code: Select all
1 2 3 4 <-answer
1 2 ) 1 4 8 0 9
- 1 2
----
2 8
- 2 4
----
4 0
- 3 6
----
4 9
- 4 8
----
0 1 <- remainder
For multiplication, you can skip entire "rows" if a digit is zero; and avoid doing any multiplication for a row if the digit is one. You can also test if all remaining digits in the second operand are zero and stop if they are.01000101 wrote:I tested my functions using keys of > 1024 and the seem to work fine, I'll be doing more testing soon though. How fast should a multiplication or power function take on integers of > 512 bits? should it be noticeable, or are my functions just too crude and unoptimized in comparison to time-tested designs? I can see a definite ~.5 sec pause when performing large multiplication or power operations.
For a number with a lot of digits you might be able to improve performance by caching previous results. This would work a lot better with small number bases though. For e.g. with "base 8" (octal) there's a good chance that a digit will be repeated, but for "base 4294967296" there's almost no chance a digit will be repeated. Hopefully you'll be using the largest base you can to reduce the number of smaller multiplications needed, so this probably won't help (I just thought I'd mention it in case you can think of a way of using it).
I'm hoping at least some of this made sense. Here's an example that might help:
Code: Select all
1 2 3 4 5 6 7 <-first operand
* 0 0 0 1 0 2 2 <-second operand
---------------------------
2 4 6 9 1 3 4 <- row needed to be calculated
+ 2 4 6 9 1 3 4 0 <- row obtained from recycling results from previous row
+ 0 0 0 0 0 0 0 0 0 <- entire row can be skipped because digit was zero
+ 1 2 3 4 5 6 7 0 0 0 <- entire row copied from first operand because digit was one
+ 0 0 0 0 0 0 0 0 0 0 0 <- entire row can be skipped because remaining digits in second operand are all zero
+ 0 0 0 0 0 0 0 0 0 0 0 0 <- entire row can be skipped because remaining digits in second operand are all zero
+ 0 0 0 0 0 0 0 0 0 0 0 0 0 <- entire row can be skipped because remaining digits in second operand are all zero
--------------------------
0 0 0 1 2 6 1 7 2 7 4 7 4 <- answer
To work out powers, there's the simple way like this:
Code: Select all
int pow(int a, int b) {
int result = 1;
while(b < 0) {
result /= a;
b++;
}
while(b > 0) {
result /= a;
b--;
}
return result;
}
Code: Select all
int pow(int a, int b) {
int result;
if(b == 0) return 1;
if(b > 0) {
temp = a;
} else {
temp = 1 / a;
}
do {
if( (b & 1) != 0) result *= temp;
b = b >> 1;
temp *= temp;
} while(b != 0);
return result;
}
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
-
- Member
- Posts: 368
- Joined: Sun Sep 23, 2007 4:52 am
Re: float/double/long double Implementation
This is how I do Pow().
Code: Select all
function POW(Num as real, Exp as int) as real
new Result as real
Result = 1
while Exp > 0
if Exp & 1
Result = Result * Num
Exp = Exp - 1
else
Num = Num * Num
Exp = Exp >> 1
endif
wend
return Result
endfunction
Re: float/double/long double Implementation
Thanks guys!
Those were great examples, very useful.
Hopefully I'll be able to extract the necessary info from them to formulate a function that can be used on two 128-bit+ integers.
I believe I will purchase the Art of Computer Programming, Volume 2: Seminumerical Algorithms (3rd Edition) by Donald E. Knuth to aid in this as well. It seems to be the 'classic' book on basic algorithms implemented in multi/arbitrary precision designs.
Those were great examples, very useful.
Hopefully I'll be able to extract the necessary info from them to formulate a function that can be used on two 128-bit+ integers.
I believe I will purchase the Art of Computer Programming, Volume 2: Seminumerical Algorithms (3rd Edition) by Donald E. Knuth to aid in this as well. It seems to be the 'classic' book on basic algorithms implemented in multi/arbitrary precision designs.
Website: https://joscor.com
Re: float/double/long double Implementation
OK, so I've been working on some of these functions more and trying to do some more optimizations and such. Here is what I have so far, I'd really like to get these functions faster and as quick as possible as they are 'core' mp functions in essence and will determine higher-level multi-precision functions' speeds.
They are a bit bulky, but I believe it is necessary, I'd be very happy to get proven wrong though.
These are public domain as I believe these are very far from being implemented in any stable version of my OS. I would like them to at least have a reference to me though if implemented in the wiki or this forum. I think it would be great for this forum to tackle something like multi/arbitrary precision integers as I myself had trouble finding documentation or simple explanations/source code (that was PD/LGPL of course). Maybe if these get worked out, a wiki article could be in order?
I know these are crude, but the point is to get them leaner.
BTW, some/most of these are reliant on the d_bits variable that declares the width of the largest amount of bits to be used. Also, these are for positive integers only, my goal was not to get decimal numbers, but large primes and keys for encryption algorithms.
basic addition with integer overflow alert (mp integer >= max for bit width).
basic subtraction between two mp integers. How can I catch negative results?
mp_clr (clear/set to 0) and mp_cmp (compare) may or may not rely on standard functions as I kind of doubt my versions of memset and memcmp use the same function variables as the 'standards' do.
mp_shl (SHift Left) and mp_shr (SHift Right) pad the appropriate ends with 0's, unlike most examples where it takes a value from a second mp variable.
Here is the mp_mul (multiply) and mp_pow (power^) functions. I took advice from Brendan about catching 0's or 1's in the multiplier to speed things up. The mp_mul is a beast of a function, but the comparisons take up a lot of lines.
Once I write the mp_div (divide) and mp_mod (modulus) functions, I'll get back about that. I'll soon whip up the xor/or/and/not functions in a bit.
Don't be to harsh on these as I have never declared to be a mathematician.
They are a bit bulky, but I believe it is necessary, I'd be very happy to get proven wrong though.
These are public domain as I believe these are very far from being implemented in any stable version of my OS. I would like them to at least have a reference to me though if implemented in the wiki or this forum. I think it would be great for this forum to tackle something like multi/arbitrary precision integers as I myself had trouble finding documentation or simple explanations/source code (that was PD/LGPL of course). Maybe if these get worked out, a wiki article could be in order?
I know these are crude, but the point is to get them leaner.
BTW, some/most of these are reliant on the d_bits variable that declares the width of the largest amount of bits to be used. Also, these are for positive integers only, my goal was not to get decimal numbers, but large primes and keys for encryption algorithms.
Code: Select all
#define d_bits 512
Code: Select all
void mp_add(uint32 *num1, uint32 *num2, uint32 *result, uint32 bits)
{
static uint64 sum64 = 0;
static uint32 i = 0, carry = 0, sum32 = 0;
static uint32 out[d_bits/32];
for(i = 0; i < bits/32; i++)
{
sum64 = (uint64)((uint64)num1[i] + (uint64)num2[i] + carry);
sum32 = sum64 & 0xFFFFFFFF;
carry = ((sum64 >> 32) & 0xFFFFFFFF);
if(carry)
{if(i == bits-1){printf("mp - integer overflow (add) \n");}}
out[i] = sum32;
}
mp_mov(out, result, bits);
}
Code: Select all
void mp_sub(uint32 *num1, uint32 *num2, uint32 *result, uint32 bits)
{
static uint64 sum64 = 0;
static uint32 i = 0, carry = 0, sum32 = 0;
static uint32 out[d_bits/32];
for(i = 0; i < bits/32; i++)
{
if(num2[i])
{
sum64 = (uint64)((uint64)num1[i] - (uint64)num2[i] - carry);
sum32 = sum64 & 0xFFFFFFFF;
carry = ((sum64 >> 32) & 0xFFFFFFFF);
carry = ~carry + 1;
out[i] = sum32;
}
else
{
out[i] = num1[i] + carry;
carry = 0;
}
}
mp_mov(out, result, bits);
}
Code: Select all
void mp_clr(uint32 *out, uint32 bits)
{memset((char *)out, 0, (bits/8));}
uint8 mp_cmp(uint32 *num1, uint32 *num2, uint32 bits)
{return memcmp((char *)num1, (char *)num2, (bits/8));}
Code: Select all
void mp_shl(uint32 *num1, uint32 *result, uint32 count, uint32 bits)
{
static uint32 i = 0, x = 0, carry = 0, carry_tmp = 0;
static uint32 out[d_bits/32];
mp_mov(num1, out, bits);
while(count--)
{
for(x = 0; x < (bits/32); x++)
{
carry_tmp = (out[x] >> 31) & 0x00000001;
out[x] <<= 1;
out[x] += carry;
carry = carry_tmp;
}
}
mp_mov(out, result, bits);
}
void mp_shr(uint32 *num1, uint32 *result, uint32 count, uint32 bits)
{
static uint32 i = 0, x = 0, carry = 0;
static uint32 out[d_bits/32];
mp_mov(num1, out, bits);
while(count--)
{
for(x = 0; x < (bits/32); x++)
{
if(x == (bits/32)-1){out[x] >>= 1;}
else
{
carry = (out[x+1] << 31) & 0x80000000;
out[x] >>= 1;
out[x] += carry;
}
}
}
mp_mov(out, result, bits);
}
Code: Select all
void mp_mul(uint32 *num1, uint32 *num2, uint32 *result, const uint32 bits)
{
static uint64 sum64 = 0;
static uint32 i = 0, x = 0, carry = 0, sum32 = 0;
static uint32 buf[d_bits/32], out[d_bits/32];
mp_clr(buf, bits); // init mp integers
mp_clr(out, bits);
// see if either mp integers are 0, if so, return 0
if(mp_cmp(num1, buf, bits)){mp_mov(out, result, bits); return;}
if(mp_cmp(num2, buf, bits)){mp_mov(out, result, bits); return;}
buf[0] = 1;
// see if either mp integers are 1, if so return the other mp integer
if(mp_cmp(num1, buf, bits)){mp_mov(num2, result, bits); return;}
if(mp_cmp(num2, buf, bits)){mp_mov(num1, result, bits); return;}
buf[0] = 0;
for(i = 0; i < bits/32; i++)
{
if(num2[i] == 0){mp_clr(buf, bits); buf[0] = carry; carry = 0;}
else if(num2[i] == 1){mp_mov(num1, buf, bits); buf[0] += carry; carry = 0;}
else
{
for(x = 0; x < bits/32; x++)
{
if(num1[x] == 0){buf[x] = carry; carry = 0;}
else if(num1[x] == 1){buf[x] = num2[i] + carry; carry = 0;}
else
{
sum64 = (uint64)num2[i] * (uint64)num1[x];
sum64 += carry; // add back in previous carry integer
carry = (uint32)((uint64)sum64 >> 32) & 0xFFFFFFFF; // get HI_32
buf[x] = (uint32)((uint64)sum64 & 0xFFFFFFFF); // get LO_32
}
}
}
mp_shl(buf, buf, i*32, bits); // (in, out, shift_count, size)
mp_add(buf, out, out, bits); // (in, +in, out, size)
}
mp_mov(out, result, bits); // (from, to, size)
// the result+out mp integers are there for the ability to do
// something like mp_mul(num1, num2, num1, bits) and not have
// num1 get screwed up because it is both an input and output
// integer.
}
void mp_pow(uint32 *num1, uint32 *num2, uint32 *result, const uint32 bits)
{
static uint32 buf[d_bits/32], num[d_bits/32], exp[d_bits/32], out[d_bits/32];
mp_mov(num1, num, bits);
mp_mov(num2, exp, bits);
mp_clr(buf, bits);
mp_clr(out, bits);
if(mp_cmp(exp, out, bits)){out[0] = 1; mp_mov(out, result, bits); return;}
out[0] = 1;
while(1)
{
buf[0] = 0;
if(mp_cmp(exp, buf, bits)){break;}
if(exp[0] & 0x01)
{
mp_mul(out, num, out, bits);
buf[0] = 1;
mp_sub(exp, buf, exp, bits);
}
else
{
mp_shr(exp, exp, 1, bits);
mp_mul(num, num, num, bits);
}
}
mp_mov(out, result, bits);
}
Don't be to harsh on these as I have never declared to be a mathematician.
Website: https://joscor.com
Re: float/double/long double Implementation
Hi,
The problem with assembly is that it's machine specific - you'd need to use some conditional code with a different version for each target architecture (for e.g. a 32-bit 80x86 version and a 64-bit 80x86 version). The difference between assembly and C could be significant though.
For example, to detect if subtraction overflows you could compare the numbers first (if "a < b" then "a - b" will overflow) but that'll double the time it'd take to do the subtraction.
For addition, you're calculating your own carry flag and that would probably double the time addition takes.
For shifts, I have a theory that it's possible to do the initial data move and the shift at the same time, something like this (but repeated or in a loop):
Note: I also think your C code for shifts is broken - the initial data move needs to move the data by "bits / 32" dwords while the post-shift needs to use "bits % 32".
For multiplication, when the CPU does multiplication the result is twice as wide as the operands (for e.g. multiplying a pair of 32-bit values gives you a 64-bit result, or multiplying a pair of 64-bit values gives you a 128-bit result). You can't use this in C and you end up with a situation where multiplying a pair of 32-bit values gives you a 32-bit result that may have overflowed; and to get around that you'd need to work on 16-bit values (so you'd be doing "16-bit * 16-bit = 32-bit with no chance of overflow" instead of "32-bit * 32-bit = 32-bit with likely overflow problems"; or "32-bit * 32-bit = 64-bit with no chance of overflow" instead of "64-bit * 64-bit = 64-bit with likely overflow problems"). Because of the nature of the operation this will make your code about 4 times slower - for 512-bit multiplication, on 32-bit computers you'd end up doing 1024 multiplications instead of 256 multiplications (or on 64-bit computers, 256 multiplications instead of 64 multiplications).
For the power function, the performance you get will mostly depend on the performance of multiplication. If multiplication is 4 times slower, then the power function will be 4 times slower too.
Then there's smaller things. When setting a 512-bit number to zero, the code should be unrolled so it uses 16 "mov [edi+?],eax" instructions (or 8 "mov [rdi+?], rax" instructions, or maybe 4 "MOVDQA" instructions). For comparisons, I'd use "rep cmpsd".
Mostly what I'm saying here is that if you care about performance you need to use assembly, because C isn't very good at this sort of thing (partly because C won't let you access the carry flag).
For an example, here's some code for 512-bit addition on 64-bit computers:
Here's a modified version of your code:
Compiling this version with "gcc -O3" gave me this disassembly:
My assembly only does 8 64-bit additions. GCC's loop does 32 64-bit additions (a pair of 64-bit additions for each 32-bits of results), which is 4 times as many additions. Including loop overhead, shifting, etc I'd guess that GCC's code will be at least 6 times slower.
Cheers,
Brendan
The problem with C is that you can't get any access to the carry flag, so you need to do extra/unnecessary work to detect overflows (where it's extremely unlikely that the compiler's optimizer can remove this extra/unnecessary work and just use the carry). Switching to assembly or inline assembly would solve that...01000101 wrote:OK, so I've been working on some of these functions more and trying to do some more optimizations and such. Here is what I have so far, I'd really like to get these functions faster and as quick as possible as they are 'core' mp functions in essence and will determine higher-level multi-precision functions' speeds.
The problem with assembly is that it's machine specific - you'd need to use some conditional code with a different version for each target architecture (for e.g. a 32-bit 80x86 version and a 64-bit 80x86 version). The difference between assembly and C could be significant though.
For example, to detect if subtraction overflows you could compare the numbers first (if "a < b" then "a - b" will overflow) but that'll double the time it'd take to do the subtraction.
For addition, you're calculating your own carry flag and that would probably double the time addition takes.
For shifts, I have a theory that it's possible to do the initial data move and the shift at the same time, something like this (but repeated or in a loop):
Code: Select all
mov eax,[esi+?]
shld ebx,eax,ecx
mov [edi+edx+?],ebx
For multiplication, when the CPU does multiplication the result is twice as wide as the operands (for e.g. multiplying a pair of 32-bit values gives you a 64-bit result, or multiplying a pair of 64-bit values gives you a 128-bit result). You can't use this in C and you end up with a situation where multiplying a pair of 32-bit values gives you a 32-bit result that may have overflowed; and to get around that you'd need to work on 16-bit values (so you'd be doing "16-bit * 16-bit = 32-bit with no chance of overflow" instead of "32-bit * 32-bit = 32-bit with likely overflow problems"; or "32-bit * 32-bit = 64-bit with no chance of overflow" instead of "64-bit * 64-bit = 64-bit with likely overflow problems"). Because of the nature of the operation this will make your code about 4 times slower - for 512-bit multiplication, on 32-bit computers you'd end up doing 1024 multiplications instead of 256 multiplications (or on 64-bit computers, 256 multiplications instead of 64 multiplications).
For the power function, the performance you get will mostly depend on the performance of multiplication. If multiplication is 4 times slower, then the power function will be 4 times slower too.
Then there's smaller things. When setting a 512-bit number to zero, the code should be unrolled so it uses 16 "mov [edi+?],eax" instructions (or 8 "mov [rdi+?], rax" instructions, or maybe 4 "MOVDQA" instructions). For comparisons, I'd use "rep cmpsd".
Mostly what I'm saying here is that if you care about performance you need to use assembly, because C isn't very good at this sort of thing (partly because C won't let you access the carry flag).
For an example, here's some code for 512-bit addition on 64-bit computers:
Code: Select all
;Addition
;
;Input
; rdx Address of first operand
; rsi Address of second operand
; rdi Address to put result
;
;Output
; rax Zero if no overflow, 0xFFFFFFFFFFFFFFF if overflowed
;
;Trashed
; r8, r9, r10, r11, r12, r13, r14
mp_add:
mov rax,[rdx]
mov r8,[rdx+8]
mov r9,[rdx+16]
mov r10,[rdx+24]
mov r11,[rdx+32]
mov r12,[rdx+40]
mov r13,[rdx+48]
mov r14,[rdx+56]
add rax,[rsi]
adc r8,[rsi+8]
adc r9,[rsi+16]
adc r10,[rsi+24]
adc r11,[rsi+32]
adc r12,[rsi+40]
adc r13,[rsi+48]
adc r14,[rsi+56]
mov [rdi],eax
mov [rdi+8],r8
mov [rdi+16],r9
mov [rdi+24],r10
mov [rdi+32],r11
mov [rdi+40],r12
mov [rdi+48],r13
mov [rdi+56],r14
mov rax,0
sbc rax,0
ret
Code: Select all
int mp_add(uint32 *num1, uint32 *num2, uint32 *result, uint32 bits) {
static uint64 sum64 = 0;
static uint32 i = 0, carry = 0, sum32 = 0;
for(i = 0; i < bits/32; i++)
{
sum64 = (uint64)((uint64)num1[i] + (uint64)num2[i] + carry);
sum32 = sum64 & 0xFFFFFFFF;
carry = ((sum64 >> 32) & 0xFFFFFFFF);
result[i] = sum32;
}
return carry;
}
Code: Select all
00000000004004d0 <mp_add>:
4004d0: 8d 41 1f lea 0x1f(%rcx),%eax
4004d3: 85 c9 test %ecx,%ecx
4004d5: 41 89 ca mov %ecx,%r10d
4004d8: 49 89 d3 mov %rdx,%r11
4004db: c7 05 4b 0b 20 00 00 movl $0x0,0x200b4b(%rip) # 601030 <i.2240>
4004e2: 00 00 00
4004e5: 44 0f 48 d0 cmovs %eax,%r10d
4004e9: 41 c1 fa 05 sar $0x5,%r10d
4004ed: 45 85 d2 test %r10d,%r10d
4004f0: 7e 51 jle 400543 <mp_add+0x73>
4004f2: 8b 0d 34 0b 20 00 mov 0x200b34(%rip),%ecx # 60102c <carry.2241>
4004f8: 45 31 c9 xor %r9d,%r9d
4004fb: 45 31 c0 xor %r8d,%r8d
4004fe: 66 90 xchg %ax,%ax
400500: 4a 63 14 87 movslq (%rdi,%r8,4),%rdx
400504: 4a 63 04 86 movslq (%rsi,%r8,4),%rax
400508: 48 63 c9 movslq %ecx,%rcx
40050b: 41 ff c1 inc %r9d
40050e: 48 01 ca add %rcx,%rdx
400511: 48 01 d0 add %rdx,%rax
400514: 48 89 c2 mov %rax,%rdx
400517: 43 89 04 83 mov %eax,(%r11,%r8,4)
40051b: 49 ff c0 inc %r8
40051e: 48 c1 fa 20 sar $0x20,%rdx
400522: 45 39 d1 cmp %r10d,%r9d
400525: 89 d1 mov %edx,%ecx
400527: 75 d7 jne 400500 <mp_add+0x30>
400529: 44 89 15 00 0b 20 00 mov %r10d,0x200b00(%rip) # 601030 <i.2240>
400530: 89 05 f2 0a 20 00 mov %eax,0x200af2(%rip) # 601028 <sum32.2242>
400536: 48 89 05 fb 0a 20 00 mov %rax,0x200afb(%rip) # 601038 <sum64.2239>
40053d: 89 15 e9 0a 20 00 mov %edx,0x200ae9(%rip) # 60102c <carry.2241>
400543: 8b 05 e3 0a 20 00 mov 0x200ae3(%rip),%eax # 60102c <carry.2241>
400549: c3 retq
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: float/double/long double Implementation
You have some very good points concerning the inefficiency of C under these circumstances. As I'm not all that concerned about architecture compatibility, inline assembly is fine.
I changed a couple functions and hoped we could continue this critiquing until a polished (or at least less embarrassing) code-set emerges as I believe it could be useful later on for both OS and library developers.
new mp_shr and mp_shl.
I was unable to find what the SBC assembly instruction that you used earlier does or how it is used, so I will need a little more info about handling overflows/negation.
add+sub+mov
these are all 64-bit instructions now and have been reduced in size significantly. Also, does the Xeon 'pause' optimization work in this case, or is that only for extremely tight 'spinlock-like' loops?
I changed a couple functions and hoped we could continue this critiquing until a polished (or at least less embarrassing) code-set emerges as I believe it could be useful later on for both OS and library developers.
new mp_shr and mp_shl.
Code: Select all
void mp_shr(uint64 *out)
{
__asm__ __volatile__(" addq %%rcx, %0; "
" subq $0x08, %0; "
" shrq $0x01, (%0); "
" subq $0x08, %0; "
" decq %%rcx; "
"shrloop: "
" rcrq $0x01, (%0); "
" subq $0x08, %0; "
" loopnz shrloop; "
:
: "r"((uint64)out), "c"((uint64)(d_bits/64))
: "memory");
}
void mp_shl(uint64 *out)
{
__asm__ __volatile__(" shlq $0x01, (%0); "
" addq $0x08, %0; "
" decq %%rcx; "
"shlloop: "
" rclq $0x01, (%0); "
" addq $0x08, %0; "
" loopnz shlloop; "
:
: "r"((uint64)out), "c"((uint64)(d_bits/64))
: "memory");
}
add+sub+mov
Code: Select all
void mp_mov(uint64 *from, uint64 *to)
{
__asm__ __volatile__("movloop: "
" movq (%0), %%rax; "
" movq %%rax, (%1); "
" addq $0x08, %0; "
" addq $0x08, %1; "
" loopnz movloop; "
:
: "r"((uint64)from), "r"((uint64)to), "c"((uint64)(d_bits/64))
: "memory");
}
void mp_add_op(uint64 *num1, uint64 *num2, uint64 *result)
{
__asm__ __volatile__(" movq (%0), %%rax; "
" addq (%1), %%rax; "
" movq %%rax, (%2); "
" decq %%rcx; "
"addloop: "
" movq (%0), %%rax; "
" adcq (%1), %%rax; "
" movq %%rax, (%2); "
" addq $0x08, %0; "
" addq $0x08, %1; "
" addq $0x08, %2; "
" loopnz addloop; "
:
: "r" ((uint64)num1), "r" ((uint64)num2),
"r" ((uint64)result), "c" ((uint64)(d_bits/64))
: "memory");
}
void mp_add(uint64 *num1, uint64 *num2, uint64 *result)
{
static uint64 out[d_bits/64];
mp_add_op(num1, num2, out);
mp_mov(out, result);
}
void mp_sub_op(uint64 *num1, uint64 *num2, uint64 *result)
{
__asm__ __volatile__(" movq (%0), %%rax; "
" subq (%1), %%rax; "
" movq %%rax, (%2); "
" decq %%rcx; "
"subloop: "
" movq (%0), %%rax; "
" sbbq (%1), %%rax; "
" movq %%rax, (%2); "
" addq $0x08, %0; "
" addq $0x08, %1; "
" addq $0x08, %2; "
" loopnz subloop; "
:
: "r" ((uint64)num1), "r" ((uint64)num2),
"r" ((uint64)result), "c" ((uint64)(d_bits/64))
: "memory");
}
void mp_sub(uint64 *num1, uint64 *num2, uint64 *result)
{
static uint64 out[d_bits/64];
mp_sub_op(num1, num2, out);
mp_mov(out, result);
}
Website: https://joscor.com
Re: float/double/long double Implementation
Hi,
I was thinking SBC (SuBtract with Carry) when I should've written SBB (SuBtract with Borrow).
Basically, if RAX is zero and the carry flag is clear then "sbb rax,0" will leave RAX set to zero, and if the carry flag is set then "sbb rax,0" will set RAX to 0xFFFFFFFFFFFFFFFF (or -1 if you look at it as a signed number). Mostly it's the same as "if (carry == clear) { RAX = 0 } else { RAX = 0xFFFFFFFFFFFFFFFF }" except there is no branch.
If you want the function to return "RAX = 1" if an overflow was detected then there's plenty of alternatives - "setc al" (which sets AL to zero if carry is clear, and sets AL to one if carry is set); or maybe "rcl rax,1" (which just shifts the carry flag into the lowest bit of RAX), or "adc rax, 0" (which works in a similar way to "sbb"), or "cmovc eax,1" (which does "eax = 1" if carry is set). I'm not sure which way would be faster...
The PAUSE instruction fixes this by telling the CPU to wait until all instructions are retired/completed before continuing (which also uses less CPU resources and let's the other logical CPU use those resources).
For a normal loop (anything that isn't a "polling loop"), you want the CPU to execute as much as it can at the same time to improve performance (e.g. so that by the time the CPU figures out which way the "loopnz" will branch, most of the work for the next iteration of the loop is already done). In this case a PAUSE instruction will hurt performance.
Also note that using any form of loop adds overhead (extra calculations and a branch mis-prediction on exit), so unrolling the loop can help. Unrolling also helps to avoid delays caused by fetching data. For example, load everything at the start of the function, so that by the time the last load is executed the data for the first load has arrived, and then start manipulating the data.
Large (unrolled) functions aren't suitable for inlining, but (assuming these functions would be used from different places) inlining can prevent them from being in the trace cache.
Anyway, here's my attempt at shifting left...
And 2 different versions for subtraction:
Also, something makes me think your inline assembly versions aren't right, but I didn't spend much time thinking about them - my brain has a built-in "NASM syntax" parser, and I assume it's easier to compile them and see if they produce the right results that way...
Cheers,
Brendan
That's because I wrote the wrong thing!01000101 wrote:I was unable to find what the SBC assembly instruction that you used earlier does or how it is used, so I will need a little more info about handling overflows/negation.
I was thinking SBC (SuBtract with Carry) when I should've written SBB (SuBtract with Borrow).
Basically, if RAX is zero and the carry flag is clear then "sbb rax,0" will leave RAX set to zero, and if the carry flag is set then "sbb rax,0" will set RAX to 0xFFFFFFFFFFFFFFFF (or -1 if you look at it as a signed number). Mostly it's the same as "if (carry == clear) { RAX = 0 } else { RAX = 0xFFFFFFFFFFFFFFFF }" except there is no branch.
If you want the function to return "RAX = 1" if an overflow was detected then there's plenty of alternatives - "setc al" (which sets AL to zero if carry is clear, and sets AL to one if carry is set); or maybe "rcl rax,1" (which just shifts the carry flag into the lowest bit of RAX), or "adc rax, 0" (which works in a similar way to "sbb"), or "cmovc eax,1" (which does "eax = 1" if carry is set). I'm not sure which way would be faster...
Most modern CPUs do several instructions at the same time. For tight loops, the CPU can be doing several iterations of the same loop at the same time (e.g. some instructions at the end of the loop being retired/completed while more instructions at the start of the loop are being speculatively executed). For spinlocks on systems with hyper-threading this creates problems because many iterations of the same loop may be being executed, which wastes execution resources that could be used by the other logical CPU. It also increases the costs to exit the loop, because those unnecessary instructions being speculatively executed need to be cleared from the pipeline.01000101 wrote:Also, does the Xeon 'pause' optimization work in this case, or is that only for extremely tight 'spinlock-like' loops?
The PAUSE instruction fixes this by telling the CPU to wait until all instructions are retired/completed before continuing (which also uses less CPU resources and let's the other logical CPU use those resources).
For a normal loop (anything that isn't a "polling loop"), you want the CPU to execute as much as it can at the same time to improve performance (e.g. so that by the time the CPU figures out which way the "loopnz" will branch, most of the work for the next iteration of the loop is already done). In this case a PAUSE instruction will hurt performance.
Also note that using any form of loop adds overhead (extra calculations and a branch mis-prediction on exit), so unrolling the loop can help. Unrolling also helps to avoid delays caused by fetching data. For example, load everything at the start of the function, so that by the time the last load is executed the data for the first load has arrived, and then start manipulating the data.
Large (unrolled) functions aren't suitable for inlining, but (assuming these functions would be used from different places) inlining can prevent them from being in the trace cache.
Anyway, here's my attempt at shifting left...
Code: Select all
;Shift left ("result = operand1 << operand2")
;
;Input
; rcx Address of 512-bit operand 2
; rsi Address of 512-bit operand 1
; rdi Address of 512-bit result
;
;Output
; rax Zero if no overflow, non-zero if overflow detected
mp_shr:
mov rbx,rcx
and ecx,0x0000003F
shr rbx,6
xor edx,edx
cmp rbx,512/64
jae .s512
jmp [.shift_left_table + ebx * 8]
section .data
.shift_left_table:
dq .s000, .s064, .s128, .s192, .s256, .s320, .s384, .s448
section .text
.s512:
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
xor r12,r12
xor r13,r13
xor r14,r14
xor r15,r15
mov rax,[rsi]
or rax,[rsi+8]
or rax,[rsi+16]
or rax,[rsi+24]
or rax,[rsi+32]
or rax,[rsi+40]
or rax,[rsi+48]
or rax,[rsi+56]
test ecx,ecx
jne .donePreShift
jmp .donePostShift
.s448:
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
xor r12,r12
xor r13,r13
xor r14,r14
mov r15,[rsi]
mov rax,[rsi+8]
or rax,[rsi+16]
or rax,[rsi+24]
or rax,[rsi+32]
or rax,[rsi+40]
or rax,[rsi+48]
or rax,[rsi+56]
test ecx,ecx
jne .donePreShift
jmp .donePostShift
.s384:
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
xor r12,r12
xor r13,r13
mov r14,[rsi]
mov r15,[rsi+8]
mov rax,[rsi+16]
or rax,[rsi+24]
or rax,[rsi+32]
or rax,[rsi+40]
or rax,[rsi+48]
or rax,[rsi+56]
test ecx,ecx
jne .donePreShift
jmp .donePostShift
.s320:
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
xor r12,r12
mov r13,[rsi]
mov r14,[rsi+8]
mov r15,[rsi+16]
mov rax,[rsi+24]
or rax,[rsi+32]
or rax,[rsi+40]
or rax,[rsi+48]
or rax,[rsi+56]
test ecx,ecx
jne .donePreShift
jmp .donePostShift
.s256:
xor r8,r8
xor r9,r9
xor r10,r10
xor r11,r11
mov r12,[rsi]
mov r13,[rsi+8]
mov r14,[rsi+16]
mov r15,[rsi+24]
mov rax,[rsi+32]
or rax,[rsi+40]
or rax,[rsi+48]
or rax,[rsi+56]
test ecx,ecx
jne .donePreShift
jmp .donePostShift
.s192:
xor r8,r8
xor r9,r9
xor r10,r10
mov r11,[rsi]
mov r12,[rsi+8]
mov r13,[rsi+16]
mov r14,[rsi+24]
mov r15,[rsi+32]
mov rax,[rsi+40]
or rax,[rsi+48]
or rax,[rsi+56]
test ecx,ecx
jne .donePreShift
jmp .donePostShift
.s128:
xor r8,r8
xor r9,r9
mov r10,[rsi]
mov r11,[rsi+8]
mov r12,[rsi+16]
mov r13,[rsi+24]
mov r14,[rsi+32]
mov r15,[rsi+40]
mov rax,[rsi+48]
or rax,[rsi+56]
test ecx,ecx
jne .donePreShift
jmp .donePostShift
.s064:
xor r8,r8
mov r9,[rsi]
mov r10,[rsi+8]
mov r11,[rsi+16]
mov r12,[rsi+24]
mov r13,[rsi+32]
mov r14,[rsi+40]
mov r15,[rsi+48]
mov rax,[rsi+56]
test ecx,ecx
jne .donePreShift
jmp .donePostShift
.s000:
xor eax,eax
mov r8,[rsi]
mov r9,[rsi+8]
mov r10,[rsi+16]
mov r11,[rsi+24]
mov r12,[rsi+32]
mov r13,[rsi+40]
mov r14,[rsi+48]
mov r15,[rsi+56]
test ecx,ecx
je .donePostShift
.donePreShift:
shld rdx,r15,cl
shld r15,r14,cl
shld r14,r13,cl
shld r13,r12,cl
shld r12,r11,cl
shld r11,r10,cl
shld r10,r9,cl
shld r9,r8,cl
shl r8,cl
.donePostShift:
mov [rdi],r8
mov [rdi+8],r9
mov [rdi+16],r10
mov [rdi+24],r11
mov [rdi+32],r12
mov [rdi+40],r13
mov [rdi+48],r14
mov [rdi+56],r15
or rax,rdx
ret
Code: Select all
;Subtract ("result = operand1 - operand2")
;
;Input
; rdx Address of 512-bit operand 1
; rsi Address of 512-bit operand 2
; rdi Address of 512-bit result
;
;Output
; rax Zero if no overflow, non-zero if overflow detected
mp_sub_3operand:
mov r8,[rdx]
mov r9,[rdx+8]
mov r10,[rdx+16]
mov r11,[rdx+24]
mov r12,[rdx+32]
mov r13,[rdx+40]
mov r14,[rdx+48]
mov r15,[rdx+56]
sub r8,[rsi]
sbb r9,[rsi+8]
sbb r10,[rsi+16]
sbb r11,[rsi+24]
sbb r12,[rsi+32]
sbb r13,[rsi+40]
sbb r14,[rsi+48]
sbb r15,[rsi+56]
sbb eax,0
mov [rdi],r8
mov [rdi+8],r9
mov [rdi+16],r10
mov [rdi+24],r11
mov [rdi+32],r12
mov [rdi+40],r13
mov [rdi+48],r14
mov [rdi+56],r15
ret
;Subtract ("result -= operand2")
;
;Input
; rsi Address of 512-bit operand 2
; rdi Address of 512-bit operand 1 (replaced by result)
;
;Output
; rax Zero if no overflow, non-zero if overflow detected
mp_sub_2operand:
mov r8,[rsi]
mov r9,[rsi+8]
mov r10,[rsi+16]
mov r11,[rsi+24]
mov r12,[rsi+32]
mov r13,[rsi+40]
mov r14,[rsi+48]
mov r15,[rsi+56]
sub [rdi],r8
sbb [rdi+8],r9
sbb [rdi+16],r10
sbb [rdi+24],r11
sbb [rdi+32],r12
sbb [rdi+40],r13
sbb [rdi+48],r14
sbb [rdi+56],r15
sbb eax,0
ret
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
Re: float/double/long double Implementation
RAX hasn't previously been set, so if you do a SBB RAX, 0 (RAX := RAX - 0 - carry) then the answer is still undefined.;Output
; rax Zero if no overflow, non-zero if overflow detected
(...)
sbb rax, 0
You probably meant to use a SBB RAX, RAX (RAX := RAX - RAX - carry = -carry) where the result only depends on the carry flag
-
- Member
- Posts: 368
- Joined: Sun Sep 23, 2007 4:52 am
Re: float/double/long double Implementation
I couldn't understand what your multiplication does, so maybe this is a related algorithm, but I'll post it just in case. It pushes all the bit manipulations down to already made (let's hope so) comparison, shifting and addition functions. I don't know how fast it is, but it sure is simple.
Code: Select all
Procedure Mul(A, B)
If A > B ; Whether this swap improves performance
Swap A, B ; must simply be tested with a few numbers
EndIf
C = B
While A > 0
A >> 1
B << 1
If A & 1 ; A is odd
C + B
EndIf
Wend
ProcedureReturn C
EndProcedure
Re: float/double/long double Implementation
I know this is massive thread necromancy, but I think that it's better to extend off of this than create a new thread for the same topic and such.
I've been working on my multi-precision code, and decided to trash it as it was horrid lol. I re-wrote some of the core functions (add, sub, mul, pow, and shl). I'd like to get some feedback about these as the feedback from before (about the previously posted functions) were incredibly helpful. I'd also like to hear any thoughts on how to optimize this code while keeping it arbitrary-precision and without defining sections to pre-defined bit counts (if the operation is using 512-bits, execute this...).
I tried to do a decent job of commenting, and I'm not posting all of them, just a few for now.
I'm currently working on a division functions (and a modulus function spawned from division) so that I can complete the final functions to use in RSA.
I've tested these functions with 4096 bits max (as my print function prints *all* bits instead of what's used, it won't fit 8192 bit on the screen haha) and they work good from what I can tell. The add and multiply functions seem to work at a decent speed and don't have noticeable calculation times.
mp_add:
mp_mul:
mp_shl:
mp_pow is just a multiplication loop and mp_sub looks like mp_add just with different key instructions, so if one is flawed, I'm positive the other will be. =)
There isn't a better way to do mp_pow is there? I can only think of multiplying again and again as the solution.
BTW, if anyone wants it, this code is free... in all senses, or if it makes you feel better, it's Public Domain. Once I get a real library of this going, I plan on releasing it as free.
I've been working on my multi-precision code, and decided to trash it as it was horrid lol. I re-wrote some of the core functions (add, sub, mul, pow, and shl). I'd like to get some feedback about these as the feedback from before (about the previously posted functions) were incredibly helpful. I'd also like to hear any thoughts on how to optimize this code while keeping it arbitrary-precision and without defining sections to pre-defined bit counts (if the operation is using 512-bits, execute this...).
I tried to do a decent job of commenting, and I'm not posting all of them, just a few for now.
I'm currently working on a division functions (and a modulus function spawned from division) so that I can complete the final functions to use in RSA.
I've tested these functions with 4096 bits max (as my print function prints *all* bits instead of what's used, it won't fit 8192 bit on the screen haha) and they work good from what I can tell. The add and multiply functions seem to work at a decent speed and don't have noticeable calculation times.
mp_add:
Code: Select all
global mp_add ; dst + val -> dst
; void mp_add(uint8 * dst, uint8 * val, uint64 bit_size);
; mp_add(rdi, rsi, rdx); // function argument translations
; trashed: rax, rcx, rdx, rdi, rsi, r8
mp_add:
cmp rdx, 0 ; see if bit_size > 0
ja .checkmod ; if more than 0 bits, continue; else, return
ret
.checkmod:
push rdx ; save bit_size
mov rax, rdx ; place bit_size into RAX (for division)
mov rdx, 0 ; set RDX to 0 (DIV uses RDX:RAX/r64)
mov r8, 64 ; place 64 (bit alignment) into R8
div r8 ; divide bit_size/64 (RAX == quotient, RDX == remainder)
mov rax, rdx ; move the remainder into RAX
pop rdx ; restore bit_size
cmp rax, 0 ; see if remainder == 0
je .bits2qwords ; if the remainder is equal to 0, continue; else, return
ret
.bits2qwords: ; convert the count in bits to bytes
shr rdx, 6 ; shift by 6 to get the number of qwords from the given bit_size
mov rcx, rdx ; restore count register (RCX) with the count in bytes
cmp rcx, 0 ; is the number of qwords 0?
je .exit ; if so, return
clc ; clear the carry flag
pushf ; push the rflags (to keep the carry flag)
.addqword:
popf ; pop the rflags (to restore the carry flag)
mov rax, [rdi] ; put dst qword into the accumulator
adc rax, [rsi] ; add the src qword to the accumulator (with carry)
mov [rdi], rax ; store the result into dst
pushf ; push the rflags (to keep the carry flag)
add rdi, 8 ; add sizeof(qword) to dst
add rsi, 8 ; add sizeof(qword) to src
dec rcx ; decrease the counter
cmp rcx, 0 ; if the counter is > 0, loop to .addbyte
ja .addqword
.exit:
popf ; pop the rflags (to restore the carry flag)
ret ; return
Code: Select all
global mp_mul; dst * val -> dst
; void mp_mul(uint8 * dst, uint8 * val, uint64 bit_size);
; mp_mul(rdi, rsi, rdx) // function argument translations
mp_mul:
cmp rdx, 64 ; compare the bit_size to 64 bits
jae .bits2qwords ; if more than 64 bits, continue; else, return
ret
.bits2qwords: ; convert the count in bits to bytes
shr rdx, 6 ; shift by 6 to get the number of qwords from the given bit_size
mov rcx, rdx ; restore count register (RCX) with the count in bytes
cmp rcx, 0 ; is the number of qwords 0?
je .exit ; if so, return
clc ; clear the carry flag
; qword count (static) (rcx) comparison value
; counter for multiplicand (small) loop (r14) top muliplication value (alternates frequently)
; counter for muliplier (large) loop (r15) bottom multiplication value (alternates infrequently)
; pointer to the temp buffer (r8) pointer to (mulbuf)
; pointer to the temp buffer (static) (r9) incremental pointer to (mulbuf)
; pointer to the muliplicand (rdi)
; pointer to the muliplier (rsi)
mov r14, 0 ; when r14 == rcx, we're done with the current loop
mov r15, 0 ; when r15 == rcx, we're done completely
mov r8, mulbuf ; r8 is our buffer pointer
mov r9, r8 ; r9 is our temporary buffer pointer
push rdi ; save for later (rep movsq)
; this sub-section just zeros mulbuf
.zerotmp:
push rdi
push rax
push rcx
mov rdi, mulbuf
mov rax, 0
rep stosq
pop rcx
pop rax
pop rdi
.mulloop:
mov rax, [rdi] ; get multiplicand
cmp rax, 0 ; skip if the multiplicand is 0
je .rdiis0
cmp dword [rsi], 0 ; skip if the multiplier is 0
je .rsiis0
mul qword [rsi] ; multiply by the multiplier
add rax, [r8] ; add previous integer from buffer
mov [r8], rax ; and store it back to the buffer
adc rdx, 0 ; get our carry value
add r8, 8 ; increase our buffer position by 8
add [r8], rdx ; and add in the carry value
.rdiis0:
add rdi, 8 ; increase our multiplicand buffer pointer
inc r14 ; increase the small loop counter
cmp r14, rcx ; is the small loop counter reached its limit?
jb .mulloop ; if no, keep looping
.rsiis0:
xor r14, r14 ; reset the small loop counter
inc r15 ; increase the large loop counter
add r9, 8 ; increase our temporary buffer position by 8
mov r8, r9 ; and place it as our actual buffer pointer
add rsi, 8 ; increase the multiplier buffer position by 8
pop rdi ; restore our multiplicand buffer position
push rdi ; and save it again
cmp r15, rcx ; has the large loop counter reached its limit?
jb .mulloop ; if no, keep looping
pop rdi ; restore our multiplicand buffer position
mov rsi, mulbuf ; set RSI as the buffer pointer
rep movsq ; copy our buffer to the multiplicand buffer (for return use)
.exit:
ret ; return
[section .bss]
align 16
mulbuf: resb 8192/8; max bit_size for now is 8192
[section .text]
Code: Select all
global mp_shl; dst << count -> dst
; void mp_shl(uint8 * dst, uint64 count, uint64 bit_size);
; mp_shl(rdi, rsi, rdx) // function argument translations
mp_shl:
cmp rdx, 64 ; compare the bit_size to 64 bits
jae .bits2qwords ; if more than 64 bits, continue; else, return
ret
.bits2qwords: ; convert the count in bits to bytes
shr rdx, 6 ; shift by 6 to get the number of qwords from the given bit_size
mov rcx, rdx ; restore count register (RCX) with the count in bytes
cmp rcx, 0 ; is the number of qwords 0?
je .exit ; if so, return
clc ; clear the carry flag
push rdi ; start of integer buffer (save)
pushfq ; save the RFLAGS (carry flag)
.shlloop:
popfq ; restore the RFLAGS (carry flag)
mov rcx, rdx ; restore the counter to bit_size/8
pop rdi ; restore the integer buffer pointer
push rdi ; ... and save it again
shl qword [rdi], 1 ; do one SHL (shift left, no carry restoration)
jmp .compare ; jump over the RCL
.wcarry:
popfq ; restore the RFLAGS (carry flag)
rcl qword [rdi], 1 ; do one RCL (shift left with carry)
.compare:
pushfq ; restore the RFLAGS (carry flag)
add rdi, 8 ; increase our buffer position by sizeof(qword)
dec rcx ; decrease our counter
cmp rcx, 0 ; if the counter != 0, keep RCL'ing
jne .wcarry
dec rsi ; decrease the bit_size/8 count
cmp rsi, 0 ; if that is zero, we're done
jne .shlloop ; else, SHL and keep looping
.exit:
popfq ; restore the RFLAGS
pop rdi ; restore RDI
ret ; return
mp_pow is just a multiplication loop and mp_sub looks like mp_add just with different key instructions, so if one is flawed, I'm positive the other will be. =)
There isn't a better way to do mp_pow is there? I can only think of multiplying again and again as the solution.
BTW, if anyone wants it, this code is free... in all senses, or if it makes you feel better, it's Public Domain. Once I get a real library of this going, I plan on releasing it as free.
Website: https://joscor.com
-
- Member
- Posts: 524
- Joined: Sun Nov 09, 2008 2:55 am
- Location: Pennsylvania, USA
Re: float/double/long double Implementation
If you're only using this for RSA, there is a simpler method for exponentiation. It works especially well when a modulus is involved (as there is in RSA) because you can reduce after each step.
In pseudo code:
If you're not using this only for exponentiation with a modulus, sorry for the OT post.
In pseudo code:
Code: Select all
base = base % n;
accumulator = base;
for(i = index of first 1 in exponent minus 1, starting from the left; i > 0; i--)
{
if(exponent[i] == 1)
{
accumulator = accumulator^2 * base;
} else {
accumulator = accumulator^2;
}
accumulator = accumulator % n
}
Re: float/double/long double Implementation
Hi,
However, this code won't work in some cases - the exponent must be an integer (e.g. something like "5 ** 2.25" won't work). I'm not sure if there's a trick I'm missing or some other way to handle this case. I do realize that (for e.g.) "5 ** 2.25 = (5 ** 2) * (5 ** 0.25)" and that "5 ** (-2.25) = (5 ** (-2) ) * (5 ** (-0.25) )", but I can't think of an easy way to calculate the result when the exponent is between -1 and 0 or between 0 and 1.
[EDIT]Doh.
I just noticed there's a stupid bug in the code I posted for the "simple method" too. For positive exponents it should be:[/i][/EDIT]
Cheers,
Brendan
01000101 wrote:There isn't a better way to do mp_pow is there? I can only think of multiplying again and again as the solution.
Note: I did test a few positive exponents using this method (on paper) and got the right answer; and there is a bug for negative exponents in the code I posted (for negative exponents, you'd need to negate the exponent before doing the final loop). I'd also assume that (for e.g.) a maximum of 31 multiplications instead of a maximum of over 2 billion multiplications would make a significant performance difference (e.g. enough to make it worth researching/testing my theory).Brendan wrote:To work out powers, there's the simple way like this:
A faster way might go something like this:Code: Select all
int pow(int a, int b) { int result = 1; while(b < 0) { result /= a; b++; } while(b > 0) { result /= a; b--; } return result; }
I haven't tested this, but in theory I think it should work, and if it does work it'd be a lot faster than the simple way (for signed 32-bit integers you'd do a maximum of 31 multiplications, instead of a maximum of over 2 billion multiplications)...Code: Select all
int pow(int a, int b) { int result; if(b == 0) return 1; if(b > 0) { temp = a; } else { temp = 1 / a; } do { if( (b & 1) != 0) result *= temp; b = b >> 1; temp *= temp; } while(b != 0); return result; }
However, this code won't work in some cases - the exponent must be an integer (e.g. something like "5 ** 2.25" won't work). I'm not sure if there's a trick I'm missing or some other way to handle this case. I do realize that (for e.g.) "5 ** 2.25 = (5 ** 2) * (5 ** 0.25)" and that "5 ** (-2.25) = (5 ** (-2) ) * (5 ** (-0.25) )", but I can't think of an easy way to calculate the result when the exponent is between -1 and 0 or between 0 and 1.
[EDIT]Doh.
I just noticed there's a stupid bug in the code I posted for the "simple method" too. For positive exponents it should be:
Code: Select all
while(b > 0) {
result *= a;
b--;
}
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.