Opened 17 months ago
Last modified 15 months ago
#12527 reopened Bugs
cpp_bin_float: Anal fixation. Part 3. Double rounding when result of convert_to<double>() is a subnormal
Reported by: | Michael Shatz | Owned by: | John Maddock |
---|---|---|---|
Milestone: | To Be Determined | Component: | multiprecision |
Version: | Boost 1.62.0 | Severity: | Problem |
Keywords: | Cc: |
Description
When convert_to<double>() applied to numbers in subnormal range the value is initially rounded to 53-bit precision and then rounded again, by ldexp() routine, to the target precision which is lower than 53 bits. The problem is exactly the same as the well-known problem that makes it virtually impossible to produce 100% IEEE-754 compliant results with Intel x87 FPU.
Because of double rounding, resulting subnormals are not always the closest representable double-precision numbers to the original value or, in case of tie, they are not always even.
Exactly the same problem applies to convert_to<float>()
The attached files demonstrate the problem and one possible workaround, not necessarily that fastest, but likely the simplest.
Attachments (3)
Change History (24)
Changed 17 months ago by
Attachment: | af_p3d.cpp added |
---|
Changed 17 months ago by
Attachment: | af_p3f.cpp added |
---|
demonstrates double rounding in cpp_bin_float convert_to<float> when the result is subnormal
comment:1 Changed 16 months ago by
That's a reasonable fix as it only kicks in for denorms. I really need to device a better/more comprehensive test case for convert-to-float in general though...
comment:2 Changed 16 months ago by
My code above produces wrong sign of zero at -(2**-1075). Here is a fix:
double my_convert_to_double(const boost_quadfloat_t& x) { if (isnormal(x)) { if (x.backend().exponent() < -1023+1) { // subnormal or zero if (x.backend().exponent() < -1023 - 52) { return x.backend().sign() ? -0.0 : 0.0; // underflow } boost_quadfloat_t tmp = abs(x) + DBL_MIN; double res = tmp.convert_to<double>() - DBL_MIN; return x.backend().sign() ? -res : res; } } return x.convert_to<double>(); }
comment:3 follow-up: 4 Changed 16 months ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
Fixed in https://github.com/boostorg/multiprecision/commit/059cb63649efae0ec0062fa68f58cee357beb7c9
Thanks again for tracking these issues down!
comment:4 Changed 16 months ago by
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Replying to johnmaddock:
Fixed in https://github.com/boostorg/multiprecision/commit/059cb63649efae0ec0062fa68f58cee357beb7c9
Thanks again for tracking these issues down!
It looks like you implemented my original workaround. Please see a post above. It explains why it is imperfect and how to do it better.
comment:5 follow-up: 6 Changed 16 months ago by
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
I was about to say that it's a non-issue because things that return zero are handled separately. However, there was a bug in that (separate) case, see https://github.com/boostorg/multiprecision/commit/51686cae8739741f808f58be4cdc2059ad1f2502
comment:6 Changed 16 months ago by
Replying to johnmaddock:
I was about to say that it's a non-issue because things that return zero are handled separately. However, there was a bug in that (separate) case, see https://github.com/boostorg/multiprecision/commit/51686cae8739741f808f58be4cdc2059ad1f2502
I think, you didn't understand my case.
If I correctly understood the voodoo of numeric_limits, your condition if(e < min_exp_limit) is the same as if (x.backend().exponent() < -1077).
But the case that I am talking about happens when (x.backend().exponent() == -1075) for the smallest absolute value in this range. The rest of the range is rounded away from zero, i.e. to smallest negative number == -(2**-1074), but this single point on the edge of the range, is a tie, and as a good tie it has to be rounded to nearest even, i.e. to negative zero. However my original workaround rounds this point to positive zero. It happens due to same IEEE rule that we discussed few weeks ago - difference of equals numbers is always a positive zero.
That's why a new workaround is better than the original and why any change of min_exp_limit would not help this particular unique point.
comment:7 Changed 16 months ago by
Resolution: | fixed |
---|---|
Status: | closed → reopened |
comment:8 Changed 16 months ago by
Right the case denorm_min()/2 is the only one with that exponent that doesn't round up, I though I had this tested already but apparently not. The fix is trivial, but in the process of generating tests there are two other issues:
- When we add numeric_limits::min() we may loose important digits (that would otherwise break a tie).
- When we add numeric_limits::min() we may round up, leading to double rounding again (or at least that's what I think is happening).
I may need to just bite the bullet and write a custom conversion routine...
comment:9 Changed 16 months ago by
Unfortunately, you are right. Still, I can think about another workaround to keep my workaround alive:
- Convert to cpp_bin_float<N+52>
- add numeric_limits::min(). Now addition is exact!
- convert to double
- subtract numeric_limits::min().
Practically it's probably faster to convert to cpp_bin_float<N+64>
BTW, special-casing of (x.backend().exponent() < -1023 - 52) in my code is a performance optimization. This step is not necessary for correction.
comment:10 follow-up: 11 Changed 16 months ago by
I've updated the core rounding code to work at arbitrary locations, and changed the conversion code to make use of it. So hopefully, this time fixe din https://github.com/boostorg/multiprecision/commit/0006227416f0107693fcce4d8c15d127dc51e437
comment:11 Changed 16 months ago by
Replying to johnmaddock:
I've updated the core rounding code to work at arbitrary locations, and changed the conversion code to make use of it. So hopefully, this time fixe din https://github.com/boostorg/multiprecision/commit/0006227416f0107693fcce4d8c15d127dc51e437
So far i see one bug: numbers in range (-INF..-DBL_MAX] are converted to +inf instead of -inf.
Also one defect: new version is almost twice slower than the previous version, which by itself was pretty slow, considering a simplicity of the function.
comment:12 Changed 16 months ago by
I've fixed the obvious bug, the speed drop comes from making an additional copy of the bit array - it shouldn't be *twice* as slow though, will investigate.
comment:13 Changed 16 months ago by
I measured on my favorite size - 128 bits, where time to copy a bit array is practically zero, even when copied byte-by-byte rather than 64-bit-by-64-bit. It should be something else.
comment:14 Changed 16 months ago by
Try this:
template <typename T> double my_convert_to_double(const T& x) { double ret = 0; if (isfinite(x)) { if (x.backend().exponent() >= -1023 - 52 && x != 0) { if (x.backend().exponent() <= 1023) { int e = x.backend().exponent(); T y = ldexp(abs(x), 55 - e); T t = trunc(y); int64_t ti = t.template convert_to<int64_t>(); if ((ti & 1)==0) { if (t < y) ti |= 1; } if (e >= -1023 + 1) { ret = ldexp(double(ti), e - 55); } else { // subnormal typedef boost::multiprecision::number<boost::multiprecision::cpp_bin_float<128, boost::multiprecision::backends::digit_base_2> > cpp_bin_float128_t; cpp_bin_float128_t sx = ldexp(cpp_bin_float128_t(ti), e - 55); sx += DBL_MIN; e = -1023 + 1; cpp_bin_float128_t sy = ldexp(sx, 55 - e); cpp_bin_float128_t st = trunc(sy); ti = st.convert_to<int64_t>(); if ((ti & 1)==0) { if (st < sy) ti |= 1; } ret = ldexp(double(ti), e - 55) - DBL_MIN; } } else { // overflow ret = HUGE_VAL; } } } else { if (isnan(x)) return nan(""); // inf ret = HUGE_VAL; } return x.backend().sign() ? -ret : ret; }
For 128-bit cpp_bin_float on 64-bit x64 platform it is approximately 4 times faster than your last variant. Also, to me it looks much simpler.
Unfortunately, I am not too good in complex template programming, so I don't want to code template <class Float, unsigned Digits, digit_base_type DigitBase?, class Allocator, class Exponent, Exponent MinE, Exponent MaxE> inline typename boost::enable_if_c<boost::is_float<Float>::value>::type eval_convert_to(Float *res, const cpp_bin_float<Digits, DigitBase?, Allocator, Exponent, MinE, MaxE> &original_arg) { ... } Too many things in that code look to me as a black magic.
But, hopefully, the ideas are obvious from the example above.
comment:15 Changed 16 months ago by
I did a little more research and now I know why your new version is slow. That's because on both of platforms that I did the tests (x64/gcc 5.3.0 on MINGW and x64/msvc 19.00.24210) library function std::ldexp() is much slower than can be reasonably expected. So, for not to wide cpp_bin_float types, execution time of convert_to<>() is almost directly proportional to number of ldexp() calls.
There are circumstances when ldexp() is inlined by compiler and then it runs several times faster, but so far I was unable to figure out what they are.
Below is a variant of eval_convert_to() that is ideologically similar to your code - all rounding is done with integer arithmetic, but unlike you code it calls ldexp() exactly once. On my test platforms it is ~4 times faster than yours. Unfortunately, it is not complete: it only handles cases in which std::numeric_limits<Float>::digits <= 62. Thus, works very well for IEEE double and float, but even x87 'long double' is too much, as well as anything wider than that.
Nevertheless, look at it. Even if you can't extend it to wider types, IMHO, IEEE double and float are sufficiently important to be special-cased.
template <class Float, unsigned Digits, digit_base_type DigitBase, class Allocator, class Exponent, Exponent MinE, Exponent MaxE> inline typename boost::enable_if_c<boost::is_float<Float>::value>::type eval_convert_to(Float *res, const cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE> &original_arg) { Float ret = 0; switch (eval_fpclassify(original_arg)) { case FP_INFINITE: ret = std::numeric_limits<Float>::infinity(); break; case FP_NAN: *res = std::numeric_limits<Float>::quiet_NaN(); return; case FP_ZERO: break; default: { // == normal, because cpp_bin_float does not support subnormal if (original_arg.exponent() >= std::numeric_limits<Float>::min_exponent - std::numeric_limits<Float>::digits - 1) { if (original_arg.exponent() <= std::numeric_limits<Float>::max_exponent-1) { int e = original_arg.exponent(); cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE> y = original_arg; y.exponent() = 63; y.sign() = 0; uint64_t bits; eval_convert_to(&bits, y); int nhbits = std::numeric_limits<Float>::digits; if (e < std::numeric_limits<Float>::min_exponent-1) nhbits = e - std::numeric_limits<Float>::min_exponent + 1 + std::numeric_limits<Float>::digits; // round uint64_t lbits = bits << nhbits; uint64_t hbits = (bits >> 1) >> (63-nhbits); // shift by (64-nhbits) that works for nhbits in range [0..63] lbits |= (hbits & 1); // assure that tie rounded to even const uint64_t BIT63 = uint64_t(1) << 63; if (lbits == BIT63) { cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE> truncY; truncY = bits; if (!eval_eq(truncY, y)) lbits |= 1; } hbits += (lbits > BIT63); ret = std::ldexp(static_cast<Float>(static_cast<int64_t>(hbits)), e + 1 - nhbits); } else { // overflow ret = std::numeric_limits<Float>::infinity(); } } } break; } *res = original_arg.sign() ? -ret : ret; }
comment:16 Changed 16 months ago by
Nod. I started experimenting with an improved version last night - like yours it calls ldexp exactly once for float and double conversions (at least on 64-bit GCC), and it's *much* faster. Unfortunately I ran out of time before tracking down a few bugs. There's also a strange issue when the bit count is between 65 and 128 when it runs approx 10x slower! I suspect __int128 -> double
conversion is the bottleneck, but I'll investigate more later in the week. Other than that, it's more or less constant time regardless of the bit count.
comment:17 Changed 16 months ago by
It seems to me, one simple change is sufficient to get to acceptable speed: Replace:
cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE> arg;
With
cpp_bin_float<((std::numeric_limits<Float>::digits-1)/64 + 1)*64, DigitBase, Allocator, Exponent, MinE, MaxE> arg;
That's enough to get speed of convert_to<double>() and convert_to<float>() to about 80% of the speed of my variant.
Changed 15 months ago by
Attachment: | cpp_bin_float_conversion_performance.cpp added |
---|
comment:18 Changed 15 months ago by
I've just pushed some changes to develop which address this, using the attached test program and comparing to your routine in comment 14 here are the results I see:
MSVC:
Testing times for type: cpp_bin_float_double Reference time: 0.519 seconds (total sum = 0.000) Boost time: 0.345 seconds (total sum = 383586723786.860) Testing times for type: cpp_bin_float_quad Reference time: 1.450 seconds (total sum = 383870655523.786) Boost time: 0.506 seconds (total sum = 383870655523.786) Testing times for type: cpp_bin_float_50 Reference time: 1.261 seconds (total sum = 383509869056.583) Boost time: 0.495 seconds (total sum = 383509869056.583) Testing times for type: cpp_bin_float_100 Reference time: 1.762 seconds (total sum = 383488757195.157) Boost time: 0.508 seconds (total sum = 383488757195.157)
GCC Mingw64:
Testing times for type: cpp_bin_float_double Reference time: 0.292 seconds (total sum = 0.000) Boost time: 0.121 seconds (total sum = 383586723786.860) Testing times for type: cpp_bin_float_quad Reference time: 0.417 seconds (total sum = 383870655523.786) Boost time: 0.271 seconds (total sum = 383870655523.786) Testing times for type: cpp_bin_float_50 Reference time: 2.416 seconds (total sum = 383509869056.583) Boost time: 0.303 seconds (total sum = 383509869056.583) Testing times for type: cpp_bin_float_100 Reference time: 1.669 seconds (total sum = 383488757195.157) Boost time: 0.307 seconds (total sum = 383488757195.157)
GCC Ubuntu times are similar.
So the new Boost code is nearly twice as fast for you preferred 128-bit float, and much more than that for 100 decimal digit precision.
The remaining issue was that the bit-scanning/testing code wasn't properly optimised for __int128
, fixing that and handling ldexp better led to the speedup. I've attached the test program I used to this bug case, can you test at your end as well? BTW you will need all the develop branch of Multiprecision, not just cpp_bin_float.hpp as it's some of the core integer ops that needed work.
comment:19 Changed 15 months ago by
Thank you
Unfortunately, tonight I have more important things to do. Like watching Carlsen vs Karjakin ;-)
Most likely, I wouldn't test it until Saturday night.
Sorry
P.S.
Variant in comment 15 is non-trivially faster than comment 14. Especially for cpp_bin_float_100.
But yes, even without measurement I can believe that using low-level cpp_bin_float primitives can be faster than using relatively high-level eval_convert_to(uint64_t*, cpp_bin_float*).
On the other hand, if the difference (aginst variant in comment 15) is as significant as you suggest, it means that eval_convert_to(uint64_t*, cpp_bin_float*) is undertuned.
comment:20 Changed 15 months ago by
I finally came to testing your new version.
It is not bad. Appears to produce correct results. It is not quite as fast as the one, I suggested in comment 14. But the difference (on gcc 5.3.0/Mingw x64/Intel Ivy Bridge) is only 30-35 clocks, so it's not mighty important.
In the process I figured out why ldexp() was painfully slow in part of my measurement and rather acceptable in the others. The reason is a performance bug in my version of gcc when the code is compiled with -mavx flag (my personal default).
gcc/mingw64 implementation of ldexp() is the same for -mavx and -mno-avx. It uses "legacy" SSE instructions. So, when the function is called from AVX code, compiler should have been inserting zeroupper instruction before the call. But gcc forgot to do it, which leads to the switching penalty of 150-200 clocks on all Intel processors starting from Sandy Bridge and up to Broadwell. I heard that on Skylake the penalty is much smaller, but still exists.
gcc maintainers practically never fix anything in their old versions, not even obvious mistakes in documentation, so it does not make much practical sense reporting bugs in 5.3.0. But if you see similar behavior in 6.2.0 then it's worth reporting.
I can't do it easily myself, because right now I how no machine with 6.2 installed.
Regard, Michael
comment:21 Changed 15 months ago by
Just for fun, I coded "light-speed" variant of conversion routine, that detects and specializes cases of 'Float' equivalent to IEEE-754 binary32 and binary64 and uses bit-level knowledge of this formats in order to implement conversion that does not use FPU at all. Enjoy:
template <class Float, unsigned Digits, digit_base_type DigitBase, class Allocator, class Exponent, Exponent MinE, Exponent MaxE> inline typename boost::enable_if_c<boost::is_float<Float>::value>::type eval_convert_to(Float *res, const cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE> &original_arg) { typedef cpp_bin_float<std::numeric_limits<Float>::digits, digit_base_2, void, Exponent, MinE, MaxE> conv_type; typedef typename common_type<typename conv_type::exponent_type, int>::type common_exp_type; // // Special cases first: // switch(original_arg.exponent()) { case cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE>::exponent_zero: *res = 0; if(original_arg.sign()) *res = -*res; return; case cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE>::exponent_nan: *res = std::numeric_limits<Float>::quiet_NaN(); return; case cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE>::exponent_infinity: *res = (std::numeric_limits<Float>::infinity)(); if(original_arg.sign()) *res = -*res; return; } if (std::numeric_limits<Float>::radix==2) { if (original_arg.exponent() >= std::numeric_limits<Float>::max_exponent || original_arg.exponent() < std::numeric_limits<Float>::min_exponent - std::numeric_limits<Float>::digits -1) { Float ret = original_arg.exponent() >= std::numeric_limits<Float>::max_exponent ? (std::numeric_limits<Float>::has_infinity ? std::numeric_limits<Float>::infinity() : std::numeric_limits<Float>::max() ): 0; *res = original_arg.sign() ? -ret : ret; return; } // fast handling of mundane cases if (std::numeric_limits<Float>::digits <= 62) { // extract 64 MS bits of significand typename cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE>::rep_type bits(original_arg.bits()); if (original_arg.bit_count > 64) eval_right_shift(bits, original_arg.bit_count - 64); uint64_t msbits; eval_convert_to(&msbits, bits); if (original_arg.bit_count < 64) msbits <<= (64 - original_arg.bit_count) % 64; int digits_to_round_to = std::numeric_limits<Float>::digits; int e = original_arg.exponent(); if (e < std::numeric_limits<Float>::min_exponent-1) { digits_to_round_to += e - (std::numeric_limits<Float>::min_exponent-1); } const uint64_t offmask = uint64_t(-1) >> digits_to_round_to; const uint64_t offmsbit = uint64_t(1) << (63-digits_to_round_to); uint64_t offbits = msbits & offmask; uint64_t onbits = (msbits >> 1) >> (63-digits_to_round_to); offbits |= (onbits & 1); // tie rounded toward even if (original_arg.bit_count > 64 && offbits == offmsbit) { // Suspect for a tie. We have to examine lower bits of original // The method used below is certainly not the fastest possible, // but the code is short and does not rely on internal representation of cpp_int // Taking into account limits of my template-fu, that's not a small advantage. // I am feeling less guilty about it because in typical cases of conversion (double/float) // we will not arrive here too often bits = original_arg.bits(); eval_left_shift(bits, 64); if (!eval_eq(bits, limb_type(0))) offbits |= 1; } if (((std::numeric_limits<Float>::digits==53 && sizeof(Float)==sizeof(uint64_t)) || (std::numeric_limits<Float>::digits==24 && sizeof(Float)==sizeof(uint32_t))) && std::numeric_limits<Float>::has_denorm == std::denorm_present && std::numeric_limits<Float>::is_iec559) { // 'Float' is IEEE 754 binary64 or IEEE 754 binary32 // exploit knowledge of bit-level representation of IEEE 754 binaryXX // for extremely fast replacement of ldexp() if (e < std::numeric_limits<Float>::min_exponent-1) e = std::numeric_limits<Float>::min_exponent-2; onbits &= uint64_t(-1) >> (65-std::numeric_limits<Float>::digits); // remove hidden bit onbits += (offbits > offmsbit); const int exp_bias = std::numeric_limits<Float>::digits==24 ? 127 : 1023; onbits += uint64_t(exp_bias+e) << (std::numeric_limits<Float>::digits-1); // '+' rather than '|' , so case of carry from significand into exponent handled automatically onbits |= uint64_t(original_arg.sign() ? 1 : 0) << (std::numeric_limits<Float>::digits==24 ? 31 : 63); if (sizeof(Float)==sizeof(uint32_t)) { uint32_t u32 = static_cast<uint32_t>(onbits); memcpy(res, &u32, sizeof(*res)); } else { memcpy(res, &onbits, sizeof(*res)); } return; } onbits += (offbits > offmsbit); if (std::numeric_limits<double>::radix==2 && sizeof(uint64_t)==sizeof(double) && std::numeric_limits<double>::digits==53 && std::numeric_limits<double>::has_denorm == std::denorm_present && std::numeric_limits<double>::is_iec559 && // platform's 'double' is IEEE 754 binary64 std::numeric_limits<Float>::digits <= std::numeric_limits<double>::digits && std::numeric_limits<Float>::min_exponent >= std::numeric_limits<double>::min_exponent && std::numeric_limits<Float>::max_exponent <= std::numeric_limits<double>::max_exponent) { // 'Float' is a subset of 'double' // exploit knowledge of bit-level representation of IEEE 754 binary64 // for replacement of ldexp() which is not necessarily faster than good library implementation, // but have more robust performance int ee = e+1-digits_to_round_to; if (ee & 1) { onbits <<= 1; ee -= 1; } double ret = static_cast<Float>(static_cast<int64_t>(onbits)); ret = original_arg.sign() ? -ret : ret; int eh = ee / 2; uint64_t u_scale = uint64_t(1023 + eh) << 52; double d_scale; memcpy(&d_scale, &u_scale, sizeof(d_scale)); // d_scale = 2**(ee/2) *res = static_cast<Float>((ret * d_scale) * d_scale); return; } Float ret = static_cast<Float>(static_cast<int64_t>(onbits)); ret = original_arg.sign() ? -ret : ret; *res = std::ldexp(ret, e+1-digits_to_round_to); return; } } // // Check for super large exponent that must be converted to infinity: // if(original_arg.exponent() > std::numeric_limits<Float>::max_exponent) { *res = std::numeric_limits<Float>::has_infinity ? std::numeric_limits<Float>::infinity() : (std::numeric_limits<Float>::max)(); if(original_arg.sign()) *res = -*res; return; } // // Figure out how many digits we will have in our result, // allowing for a possibly denormalized result: // common_exp_type digits_to_round_to = std::numeric_limits<Float>::digits; if(original_arg.exponent() < std::numeric_limits<Float>::min_exponent - 1) { common_exp_type diff = original_arg.exponent(); diff -= std::numeric_limits<Float>::min_exponent - 1; digits_to_round_to += diff; } if(digits_to_round_to < 0) { // Result must be zero: *res = 0; if(original_arg.sign()) *res = -*res; return; } // // Perform rounding first, then afterwards extract the digits: // cpp_bin_float<std::numeric_limits<Float>::digits, digit_base_2, Allocator, Exponent, MinE, MaxE> arg; typename cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE>::rep_type bits(original_arg.bits()); arg.exponent() = original_arg.exponent(); copy_and_round(arg, bits, (int)digits_to_round_to); common_exp_type e = arg.exponent(); e -= cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE>::bit_count - 1; static const unsigned limbs_needed = std::numeric_limits<Float>::digits / (sizeof(*arg.bits().limbs()) * CHAR_BIT) + (std::numeric_limits<Float>::digits % (sizeof(*arg.bits().limbs()) * CHAR_BIT) ? 1 : 0); unsigned first_limb_needed = arg.bits().size() - limbs_needed; *res = 0; e += first_limb_needed * sizeof(*arg.bits().limbs()) * CHAR_BIT; while(first_limb_needed < arg.bits().size()) { *res += std::ldexp(static_cast<Float>(arg.bits().limbs()[first_limb_needed]), static_cast<int>(e)); ++first_limb_needed; e += sizeof(*arg.bits().limbs()) * CHAR_BIT; } if(original_arg.sign()) *res = -*res; }
demonstrates double rounding in cpp_bin_float convert_to<double> when the result is subnormal