float conversion not identical to GCC version

Candy · Post by **Candy** » Sat Feb 05, 2005 2:51 pm

I'm using my own bignumber library for converting ints (really big ones) to floats. ATM I'm still testing it, but I've noticed that for ints from 16777216 and up my result differs from the compiler's once every 4 numbers. No matter how I compensate, it shows. After printing all the numbers my function generates next to those the compiler/CPU generates itself, I've noticed that the computer's results are always rounded to the nearest even number, unless the number itself is exactly representable. In other words:

16777216 -> 16777216 / 16777216
16777217 -> 16777216 / 16777218
16777218 -> 16777218 / 16777218
16777219 -> 16777220 / 16777220
16777220 -> 16777220 / 16777220
16777221 -> 16777220 / 16777222
16777222 -> 16777222 / 16777222
16777223 -> 16777224 / 16777224
16777224 -> 16777224 / 16777224

Is this a known awkwardness in IEEE 754 or is this something I'm doing wrong? The second column is the compiler's result, the third is mine. The CPU used was a K6-2 at 366 MHz.

[edit]
Found out where I disagree with my processor. When the part left out is a single one bit followed only by zeroes, which in my opinion is the border case where rounding up should start, the processor rounds down (16777217 becomes 16777216). I.e., IMO, it does the border case wrong. Am I wrong or is it wrong?
[/edit]
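
The pattern in the table matches round-to-nearest with ties going to the even significand, which is the default rounding mode in IEEE 754. A minimal sketch of that rule follows, assuming a 24-bit significand (single precision) and a 32-bit unsigned input. The function name `round_to_24_bits` and the `ties_to_even` flag are made up for this illustration and are not taken from Candy's library.

```c
#include <stdint.h>

/*
 * Round n to 24 significant bits (the significand width of an IEEE 754
 * single-precision float) and return the rounded value as an integer.
 *
 * ties_to_even != 0: an exact half goes to the neighbour whose last
 *                    kept bit is 0 (what the compiler/CPU column shows).
 * ties_to_even == 0: an exact half always goes up (what the third
 *                    column in the table shows).
 */
uint64_t round_to_24_bits(uint32_t n, int ties_to_even)
{
    int shift = 0;

    /* Count how many low bits do not fit into 24 bits. */
    while (((uint64_t)n >> shift) >= (UINT64_C(1) << 24))
        shift++;
    if (shift == 0)
        return n;               /* fits exactly, nothing to round */

    uint64_t kept    = (uint64_t)n >> shift;
    uint64_t dropped = n & ((UINT64_C(1) << shift) - 1);
    uint64_t half    = UINT64_C(1) << (shift - 1);

    /* Above half: always up. Exactly half: depends on the tie rule. */
    if (dropped > half ||
        (dropped == half && (!ties_to_even || (kept & 1))))
        kept++;

    return kept << shift;
}
```

With `ties_to_even` set, 16777217 and 16777221 come out as 16777216 and 16777220; with it cleared they come out as 16777218 and 16777222, which are the two columns above.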

Solar · Post by **Solar** » Sun Feb 06, 2005 10:59 am

Check out <float.h>'s FLT_ROUNDS, as well as <fenv.h>, especially the functions fegetround() and fesetround(). A compiler is basically allowed to define border-case rounding any which way it likes, unless you set it explicitly.

Candy · Post by **Candy** » Sun Feb 06, 2005 11:20 am

Solar wrote: Check out <float.h>'s FLT_ROUNDS, as well as <fenv.h>, especially the functions fegetround() and fesetround(). A compiler is basically allowed to define border-case rounding any which way it likes, unless you set it explicitly.

Hate that... it rounds it almost logically...

Also, I now have working versions of this code, one in assembly (which is considerably faster than the C++ one) and one in C++; both work, AFAIK, for long double, double and float. The C++ version is nearly C, except that the functions live in a class.

I'm going to release these into the public domain. Right now they have a skew function that adjusts for what the processor does differently from my function, so that I can check the results with == on my computer.

Thanks for the explanation.

They'll be PD when the entire huge-num library is complete, of course. Till then,

Candy

OSDev.org
