float conversion not identical to GCC version

Posted: Sat Feb 05, 2005 2:51 pm
by Candy
I'm using my own bignumber library to convert ints (really big ones) to floats. At the moment I'm still testing it, but I've noticed that for ints from 16777216 and up its result differs from the compiler's for one number in every four. No matter how I compensate, the difference shows. After printing every number my function generates next to the one the compiler/CPU generates itself, I've noticed that the computer's results are always rounded to the nearest even number, unless the number itself is also representable. In other words:

16777216 -> 16777216 / 16777216
16777217 -> 16777216 / 16777218
16777218 -> 16777218 / 16777218
16777219 -> 16777220 / 16777220
16777220 -> 16777220 / 16777220
16777221 -> 16777220 / 16777222
16777222 -> 16777222 / 16777222
16777223 -> 16777224 / 16777224
16777224 -> 16777224 / 16777224

Is this a known quirk of IEEE 754, or is this something I'm doing wrong? In each row, the first number is the input, the second is the compiler's result and the third is mine. The CPU used was a K6-2 at 366 MHz.
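
For anyone who wants to reproduce the compiler column, a minimal sketch in C follows. The helper name via_float is made up for illustration, and the expected values assume the default rounding mode (round to nearest) is in effect.

```c
#include <stdint.h>

/* Hypothetical helper: push an integer through a 32-bit float and back,
   so the rounding applied by the compiler/CPU becomes visible.
   The volatile forces a real store to single precision, which matters
   on x87-style FPUs that otherwise keep extra bits in registers. */
uint64_t via_float(uint32_t n)
{
    volatile float f = (float)n;   /* rounded to a 24-bit significand here */
    return (uint64_t)f;            /* exact: f holds an integer value */
}
```

With round to nearest in effect, via_float(16777217) gives 16777216 and via_float(16777219) gives 16777220, matching the second column above.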

[edit]
Found out where I disagree with my processor. It takes a number whose left-out part is a single one bit followed only by zeroes, which in my opinion should be the border case at which rounding up starts, and rounds it down. I.e., IMO it does the border case wrong. Am I wrong, or is it?
[/edit]
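
The difference between the two columns can be sketched with plain integer arithmetic. This is not the library code from this thread, only an illustration: round_to_24_bits and its ties_to_even flag are hypothetical names. With the flag set, a tie goes to the neighbour whose significand is even (which reproduces the compiler column). With it cleared, a tie always rounds up (which reproduces the third column).

```c
#include <stdint.h>

/* Hypothetical helper: round n to at most 24 significant bits, the
   precision of an IEEE 754 single. ties_to_even != 0 selects
   round-half-to-even, 0 selects round-half-up. */
uint64_t round_to_24_bits(uint32_t n, int ties_to_even)
{
    int shift = 0;
    uint64_t kept, dropped, half;

    /* Find how many low bits have to be discarded so that the
       remaining value is below 2^24. */
    while ((n >> shift) >= (UINT32_C(1) << 24))
        shift++;
    if (shift == 0)
        return n;                               /* fits, nothing to round */

    kept    = n >> shift;                       /* bits that are kept */
    dropped = n & ((UINT32_C(1) << shift) - 1); /* the part left out */
    half    = UINT64_C(1) << (shift - 1);       /* a one followed by zeroes */

    if (dropped > half) {
        kept++;                                 /* clearly closer to the upper neighbour */
    } else if (dropped == half) {
        /* The border case: round up always (half-up), or only when
           the kept part is odd (half-to-even). */
        if (!ties_to_even || (kept & 1))
            kept++;
    }
    /* kept may now be 2^24; shifting back still yields the right value. */
    return kept << shift;
}
```

For 16777217 the kept part is even, so round_to_24_bits(16777217, 1) returns 16777216 while round_to_24_bits(16777217, 0) returns 16777218. For 16777219 the kept part is odd, so both variants return 16777220.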

Re:float conversion not identical to GCC version

Posted: Sun Feb 06, 2005 10:59 am
by Solar
Check out <float.h>'s FLT_ROUNDS, as well as <fenv.h>, especially the functions fegetround() and fesetround(). A compiler is basically allowed to define border-case rounding any which way it likes, unless you set it explicitly.

Re:float conversion not identical to GCC version

Posted: Sun Feb 06, 2005 11:20 am
by Candy
Solar wrote: Check out <float.h>'s FLT_ROUNDS, as well as <fenv.h>, especially the functions fegetround() and fesetround(). A compiler is basically allowed to define border-case rounding any which way it likes, unless you set it explicitly.
Hate that... it rounds it almost logically...

Also, I now have working versions of this code, one in assembly (which is considerably faster than the C++ one) and one in C++, which both work AFAIK for long double, double and float. It's nearly C except that the functions live in a class :)

I'm going to release these into the public domain. Right now they have a skew function that adjusts them for what the processor does differently from my function, so that I can check the results with == on my computer.

Thanks for the explanation.

They'll be PD when the entire huge-num library is complete, of course. Till then,

Candy