Modify

Ticket #12527 (reopened Bugs)

Opened 7 months ago

Last modified 5 months ago

cpp_bin_float: Anal fixation. Part 3. Double rounding when result of convert_to<double>() is a subnormal

Reported by: Michael Shatz Owned by: johnmaddock
Milestone: To Be Determined Component: multiprecision
Version: Boost 1.62.0 Severity: Problem
Keywords: Cc:

Description

When convert_to<double>() applied to numbers in subnormal range the value is initially rounded to 53-bit precision and then rounded again, by ldexp() routine, to the target precision which is lower than 53 bits. The problem is exactly the same as the well-known problem that makes it virtually impossible to produce 100% IEEE-754 compliant results with Intel x87 FPU.

Because of double rounding, resulting subnormals are not always the closest representable double-precision numbers to the original value or, in case of tie, they are not always even.

Exactly the same problem applies to convert_to<float>()

The attached files demonstrate the problem and one possible workaround, not necessarily that fastest, but likely the simplest.

Attachments

af_p3d.cpp (1.7 KB) - added by Michael Shatz 7 months ago.
demonstrates double rounding in cpp_bin_float convert_to<double> when the result is subnormal
af_p3f.cpp (1.7 KB) - added by Michael Shatz 7 months ago.
demonstrates double rounding in cpp_bin_float convert_to<float> when the result is subnormal
cpp_bin_float_conversion_performance.cpp (3.9 KB) - added by johnmaddock 5 months ago.

Change History

Changed 7 months ago by Michael Shatz

demonstrates double rounding in cpp_bin_float convert_to<double> when the result is subnormal

Changed 7 months ago by Michael Shatz

demonstrates double rounding in cpp_bin_float convert_to<float> when the result is subnormal

comment:1 Changed 6 months ago by anonymous

That's a reasonable fix as it only kicks in for denorms. I really need to device a better/more comprehensive test case for convert-to-float in general though...

comment:2 Changed 6 months ago by Michael Shatz

My code above produces wrong sign of zero at -(2**-1075). Here is a fix:

double my_convert_to_double(const boost_quadfloat_t& x)
{
  if (isnormal(x)) {
    if (x.backend().exponent() < -1023+1) {
      // subnormal or zero
      if (x.backend().exponent() < -1023 - 52) {
        return x.backend().sign() ? -0.0 : 0.0; // underflow
      }
      boost_quadfloat_t tmp = abs(x) + DBL_MIN;
      double res = tmp.convert_to<double>() - DBL_MIN;
      return x.backend().sign() ? -res : res;
    }
  }
  return x.convert_to<double>();
}

comment:3 follow-up: ↓ 4 Changed 6 months ago by johnmaddock

  • Status changed from new to closed
  • Resolution set to fixed

comment:4 in reply to: ↑ 3 Changed 6 months ago by Michael Shatz

  • Status changed from closed to reopened
  • Resolution fixed deleted

Replying to johnmaddock:

Fixed in https://github.com/boostorg/multiprecision/commit/059cb63649efae0ec0062fa68f58cee357beb7c9

Thanks again for tracking these issues down!

It looks like you implemented my original workaround. Please see a post above. It explains why it is imperfect and how to do it better.

comment:5 follow-up: ↓ 6 Changed 6 months ago by johnmaddock

  • Status changed from reopened to closed
  • Resolution set to fixed

I was about to say that it's a non-issue because things that return zero are handled separately. However, there was a bug in that (separate) case, see https://github.com/boostorg/multiprecision/commit/51686cae8739741f808f58be4cdc2059ad1f2502

comment:6 in reply to: ↑ 5 Changed 6 months ago by Michael Shatz

Replying to johnmaddock:

I was about to say that it's a non-issue because things that return zero are handled separately. However, there was a bug in that (separate) case, see https://github.com/boostorg/multiprecision/commit/51686cae8739741f808f58be4cdc2059ad1f2502

I think, you didn't understand my case.

If I correctly understood the voodoo of numeric_limits, your condition if(e < min_exp_limit) is the same as if (x.backend().exponent() < -1077).

But the case that I am talking about happens when (x.backend().exponent() == -1075) for the smallest absolute value in this range. The rest of the range is rounded away from zero, i.e. to smallest negative number == -(2**-1074), but this single point on the edge of the range, is a tie, and as a good tie it has to be rounded to nearest even, i.e. to negative zero. However my original workaround rounds this point to positive zero. It happens due to same IEEE rule that we discussed few weeks ago - difference of equals numbers is always a positive zero.

That's why a new workaround is better than the original and why any change of min_exp_limit would not help this particular unique point.

comment:7 Changed 6 months ago by anonymous

  • Status changed from closed to reopened
  • Resolution fixed deleted

comment:8 Changed 6 months ago by johnmaddock

Right the case denorm_min()/2 is the only one with that exponent that doesn't round up, I though I had this tested already but apparently not. The fix is trivial, but in the process of generating tests there are two other issues:

  • When we add numeric_limits::min() we may loose important digits (that would otherwise break a tie).
  • When we add numeric_limits::min() we may round up, leading to double rounding again (or at least that's what I think is happening).

I may need to just bite the bullet and write a custom conversion routine...

comment:9 Changed 6 months ago by Michael Shatz

Unfortunately, you are right. Still, I can think about another workaround to keep my workaround alive:

  1. Convert to cpp_bin_float<N+52>
  2. add numeric_limits::min(). Now addition is exact!
  3. convert to double
  4. subtract numeric_limits::min().

Practically it's probably faster to convert to cpp_bin_float<N+64>

BTW, special-casing of (x.backend().exponent() < -1023 - 52) in my code is a performance optimization. This step is not necessary for correction.

comment:10 follow-up: ↓ 11 Changed 6 months ago by johnmaddock

I've updated the core rounding code to work at arbitrary locations, and changed the conversion code to make use of it. So hopefully, this time fixe din https://github.com/boostorg/multiprecision/commit/0006227416f0107693fcce4d8c15d127dc51e437

comment:11 in reply to: ↑ 10 Changed 6 months ago by Michael Shatz

Replying to johnmaddock:

I've updated the core rounding code to work at arbitrary locations, and changed the conversion code to make use of it. So hopefully, this time fixe din https://github.com/boostorg/multiprecision/commit/0006227416f0107693fcce4d8c15d127dc51e437

So far i see one bug: numbers in range (-INF..-DBL_MAX] are converted to +inf instead of -inf.

Also one defect: new version is almost twice slower than the previous version, which by itself was pretty slow, considering a simplicity of the function.

comment:12 Changed 6 months ago by johnmaddock

I've fixed the obvious bug, the speed drop comes from making an additional copy of the bit array - it shouldn't be *twice* as slow though, will investigate.

comment:13 Changed 6 months ago by Michael Shatz

I measured on my favorite size - 128 bits, where time to copy a bit array is practically zero, even when copied byte-by-byte rather than 64-bit-by-64-bit. It should be something else.

comment:14 Changed 6 months ago by Michael Shatz

Try this:

template <typename T>
double my_convert_to_double(const T& x)
{
  double ret = 0;
  if (isfinite(x)) {
    if (x.backend().exponent() >= -1023 - 52 && x != 0) {
      if (x.backend().exponent() <= 1023) {
        int e = x.backend().exponent();
        T y = ldexp(abs(x), 55 - e);
        T t = trunc(y);
        int64_t ti = t.template convert_to<int64_t>();
        if ((ti & 1)==0) {
          if (t < y)
            ti |= 1;
        }
        if (e >= -1023 + 1) {
          ret = ldexp(double(ti), e - 55);
        } else {
          // subnormal
          typedef boost::multiprecision::number<boost::multiprecision::cpp_bin_float<128, boost::multiprecision::backends::digit_base_2> > cpp_bin_float128_t;
          cpp_bin_float128_t sx = ldexp(cpp_bin_float128_t(ti), e - 55);
          sx += DBL_MIN;
          e = -1023 + 1;
          cpp_bin_float128_t sy = ldexp(sx, 55 - e);
          cpp_bin_float128_t st = trunc(sy);
          ti = st.convert_to<int64_t>();
          if ((ti & 1)==0) {
            if (st < sy)
              ti |= 1;
          }
          ret = ldexp(double(ti), e - 55) - DBL_MIN;
        }
      } else {
        // overflow
        ret = HUGE_VAL;
      }
    }
  } else {
    if (isnan(x))
      return nan("");
    // inf
    ret = HUGE_VAL;
  }
  return x.backend().sign() ? -ret : ret;
}

For 128-bit cpp_bin_float on 64-bit x64 platform it is approximately 4 times faster than your last variant. Also, to me it looks much simpler.

Unfortunately, I am not too good in complex template programming, so I don't want to code template <class Float, unsigned Digits, digit_base_type DigitBase?, class Allocator, class Exponent, Exponent MinE, Exponent MaxE> inline typename boost::enable_if_c<boost::is_float<Float>::value>::type eval_convert_to(Float *res, const cpp_bin_float<Digits, DigitBase?, Allocator, Exponent, MinE, MaxE> &original_arg) { ... } Too many things in that code look to me as a black magic.

But, hopefully, the ideas are obvious from the example above.

comment:15 Changed 6 months ago by Michael Shatz

I did a little more research and now I know why your new version is slow. That's because on both of platforms that I did the tests (x64/gcc 5.3.0 on MINGW and x64/msvc 19.00.24210) library function std::ldexp() is much slower than can be reasonably expected. So, for not to wide cpp_bin_float types, execution time of convert_to<>() is almost directly proportional to number of ldexp() calls.

There are circumstances when ldexp() is inlined by compiler and then it runs several times faster, but so far I was unable to figure out what they are.

Below is a variant of eval_convert_to() that is ideologically similar to your code - all rounding is done with integer arithmetic, but unlike you code it calls ldexp() exactly once. On my test platforms it is ~4 times faster than yours. Unfortunately, it is not complete: it only handles cases in which std::numeric_limits<Float>::digits <= 62. Thus, works very well for IEEE double and float, but even x87 'long double' is too much, as well as anything wider than that.

Nevertheless, look at it. Even if you can't extend it to wider types, IMHO, IEEE double and float are sufficiently important to be special-cased.

template <class Float, unsigned Digits, digit_base_type DigitBase, class Allocator, class Exponent, Exponent MinE, Exponent MaxE>
inline typename boost::enable_if_c<boost::is_float<Float>::value>::type eval_convert_to(Float *res, const cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE> &original_arg)
{
  Float ret = 0;
  switch (eval_fpclassify(original_arg)) {
    case FP_INFINITE:
      ret = std::numeric_limits<Float>::infinity();
      break;

    case FP_NAN:
      *res = std::numeric_limits<Float>::quiet_NaN();
      return;

    case FP_ZERO:
      break;

    default:
    { // == normal, because cpp_bin_float does not support subnormal
      if (original_arg.exponent() >= std::numeric_limits<Float>::min_exponent - std::numeric_limits<Float>::digits - 1) {
        if (original_arg.exponent() <= std::numeric_limits<Float>::max_exponent-1) {
          int e = original_arg.exponent();
          cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE> y = original_arg;
          y.exponent() = 63;
          y.sign()     = 0;

          uint64_t bits;
          eval_convert_to(&bits, y);

          int nhbits = std::numeric_limits<Float>::digits;
          if (e < std::numeric_limits<Float>::min_exponent-1)
            nhbits = e - std::numeric_limits<Float>::min_exponent + 1 + std::numeric_limits<Float>::digits;
          // round
          uint64_t lbits = bits << nhbits;
          uint64_t hbits = (bits >> 1) >> (63-nhbits); // shift by (64-nhbits) that works for nhbits in range [0..63]
          lbits |= (hbits & 1);                        // assure that tie rounded to even
          const uint64_t BIT63 = uint64_t(1) << 63;
          if (lbits == BIT63) {
            cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE> truncY;
            truncY = bits;
            if (!eval_eq(truncY, y))
              lbits |= 1;
          }
          hbits += (lbits > BIT63);
          ret = std::ldexp(static_cast<Float>(static_cast<int64_t>(hbits)), e + 1 - nhbits);
        } else {
          // overflow
          ret = std::numeric_limits<Float>::infinity();
        }
      }
    }
      break;
  }
  *res = original_arg.sign() ? -ret : ret;
}

comment:16 Changed 6 months ago by anonymous

Nod. I started experimenting with an improved version last night - like yours it calls ldexp exactly once for float and double conversions (at least on 64-bit GCC), and it's *much* faster. Unfortunately I ran out of time before tracking down a few bugs. There's also a strange issue when the bit count is between 65 and 128 when it runs approx 10x slower! I suspect __int128 -> double conversion is the bottleneck, but I'll investigate more later in the week. Other than that, it's more or less constant time regardless of the bit count.

comment:17 Changed 6 months ago by Michael Shatz

It seems to me, one simple change is sufficient to get to acceptable speed: Replace:

   cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE> arg;

With

   cpp_bin_float<((std::numeric_limits<Float>::digits-1)/64 + 1)*64, DigitBase, Allocator, Exponent, MinE, MaxE> arg;

That's enough to get speed of convert_to<double>() and convert_to<float>() to about 80% of the speed of my variant.

Changed 5 months ago by johnmaddock

comment:18 Changed 5 months ago by johnmaddock

I've just pushed some changes to develop which address this, using the attached test program and comparing to your routine in comment 14 here are the results I see:

MSVC:

Testing times for type: cpp_bin_float_double
Reference time:   0.519 seconds (total sum = 0.000)
Boost time:       0.345 seconds (total sum = 383586723786.860)
Testing times for type: cpp_bin_float_quad
Reference time:   1.450 seconds (total sum = 383870655523.786)
Boost time:       0.506 seconds (total sum = 383870655523.786)
Testing times for type: cpp_bin_float_50
Reference time:   1.261 seconds (total sum = 383509869056.583)
Boost time:       0.495 seconds (total sum = 383509869056.583)
Testing times for type: cpp_bin_float_100
Reference time:   1.762 seconds (total sum = 383488757195.157)
Boost time:       0.508 seconds (total sum = 383488757195.157)

GCC Mingw64:

Testing times for type: cpp_bin_float_double
Reference time:   0.292 seconds (total sum = 0.000)
Boost time:       0.121 seconds (total sum = 383586723786.860)
Testing times for type: cpp_bin_float_quad
Reference time:   0.417 seconds (total sum = 383870655523.786)
Boost time:       0.271 seconds (total sum = 383870655523.786)
Testing times for type: cpp_bin_float_50
Reference time:   2.416 seconds (total sum = 383509869056.583)
Boost time:       0.303 seconds (total sum = 383509869056.583)
Testing times for type: cpp_bin_float_100
Reference time:   1.669 seconds (total sum = 383488757195.157)
Boost time:       0.307 seconds (total sum = 383488757195.157)

GCC Ubuntu times are similar.

So the new Boost code is nearly twice as fast for you preferred 128-bit float, and much more than that for 100 decimal digit precision.

The remaining issue was that the bit-scanning/testing code wasn't properly optimised for __int128, fixing that and handling ldexp better led to the speedup. I've attached the test program I used to this bug case, can you test at your end as well? BTW you will need all the develop branch of Multiprecision, not just cpp_bin_float.hpp as it's some of the core integer ops that needed work.

comment:19 Changed 5 months ago by Michael Shatz

Thank you

Unfortunately, tonight I have more important things to do. Like watching Carlsen vs Karjakin ;-)

Most likely, I wouldn't test it until Saturday night.

Sorry

P.S.

Variant in comment 15 is non-trivially faster than comment 14. Especially for cpp_bin_float_100.

But yes, even without measurement I can believe that using low-level cpp_bin_float primitives can be faster than using relatively high-level eval_convert_to(uint64_t*, cpp_bin_float*).

On the other hand, if the difference (aginst variant in comment 15) is as significant as you suggest, it means that eval_convert_to(uint64_t*, cpp_bin_float*) is undertuned.

comment:20 Changed 5 months ago by Michael Shatz

I finally came to testing your new version.

It is not bad. Appears to produce correct results. It is not quite as fast as the one, I suggested in comment 14. But the difference (on gcc 5.3.0/Mingw x64/Intel Ivy Bridge) is only 30-35 clocks, so it's not mighty important.

In the process I figured out why ldexp() was painfully slow in part of my measurement and rather acceptable in the others. The reason is a performance bug in my version of gcc when the code is compiled with -mavx flag (my personal default).

gcc/mingw64 implementation of ldexp() is the same for -mavx and -mno-avx. It uses "legacy" SSE instructions. So, when the function is called from AVX code, compiler should have been inserting zeroupper instruction before the call. But gcc forgot to do it, which leads to the switching penalty of 150-200 clocks on all Intel processors starting from Sandy Bridge and up to Broadwell. I heard that on Skylake the penalty is much smaller, but still exists.

gcc maintainers practically never fix anything in their old versions, not even obvious mistakes in documentation, so it does not make much practical sense reporting bugs in 5.3.0. But if you see similar behavior in 6.2.0 then it's worth reporting.

I can't do it easily myself, because right now I how no machine with 6.2 installed.

Regard, Michael

comment:21 Changed 5 months ago by Michael Shatz

Just for fun, I coded "light-speed" variant of conversion routine, that detects and specializes cases of 'Float' equivalent to IEEE-754 binary32 and binary64 and uses bit-level knowledge of this formats in order to implement conversion that does not use FPU at all. Enjoy:

template <class Float, unsigned Digits, digit_base_type DigitBase, class Allocator, class Exponent, Exponent MinE, Exponent MaxE>
inline typename boost::enable_if_c<boost::is_float<Float>::value>::type eval_convert_to(Float *res, const cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE> &original_arg)
{
   typedef cpp_bin_float<std::numeric_limits<Float>::digits, digit_base_2, void, Exponent, MinE, MaxE>  conv_type;
   typedef typename common_type<typename conv_type::exponent_type, int>::type                           common_exp_type;
   //
   // Special cases first:
   //
   switch(original_arg.exponent())
   {
   case cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE>::exponent_zero:
      *res = 0;
      if(original_arg.sign())
         *res = -*res;
      return;
   case cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE>::exponent_nan:
      *res = std::numeric_limits<Float>::quiet_NaN();
      return;
   case cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE>::exponent_infinity:
      *res = (std::numeric_limits<Float>::infinity)();
      if(original_arg.sign())
         *res = -*res;
      return;
   }

   if (std::numeric_limits<Float>::radix==2) {
     if (original_arg.exponent() >= std::numeric_limits<Float>::max_exponent ||
         original_arg.exponent() < std::numeric_limits<Float>::min_exponent - std::numeric_limits<Float>::digits -1)
      {
        Float ret =
          original_arg.exponent() >= std::numeric_limits<Float>::max_exponent ?
            (std::numeric_limits<Float>::has_infinity ?
              std::numeric_limits<Float>::infinity() :
              std::numeric_limits<Float>::max() ):
            0;
        *res = original_arg.sign() ? -ret : ret;
        return;
      }

      // fast handling of mundane cases
      if (std::numeric_limits<Float>::digits <= 62) {
        // extract 64 MS bits of significand
        typename cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE>::rep_type bits(original_arg.bits());
        if (original_arg.bit_count > 64)
          eval_right_shift(bits, original_arg.bit_count - 64);
        uint64_t msbits;
        eval_convert_to(&msbits, bits);
        if (original_arg.bit_count < 64)
          msbits <<= (64 - original_arg.bit_count) % 64;

        int digits_to_round_to = std::numeric_limits<Float>::digits;
        int e = original_arg.exponent();
        if (e < std::numeric_limits<Float>::min_exponent-1) {
          digits_to_round_to += e - (std::numeric_limits<Float>::min_exponent-1);
        }

        const uint64_t offmask  = uint64_t(-1) >> digits_to_round_to;
        const uint64_t offmsbit = uint64_t(1) << (63-digits_to_round_to);
        uint64_t offbits = msbits & offmask;
        uint64_t onbits  = (msbits >> 1) >> (63-digits_to_round_to);
        offbits |= (onbits & 1); // tie rounded toward even

        if (original_arg.bit_count > 64 && offbits == offmsbit) {
          // Suspect for a tie. We have to examine lower bits of original
          // The method used below is certainly not the fastest possible,
          // but the code is short and does not rely on internal representation of cpp_int
          // Taking into account limits of my template-fu, that's not a small advantage.
          // I am feeling less guilty about it because in typical cases of conversion (double/float)
          // we will not arrive here too often
          bits = original_arg.bits();
          eval_left_shift(bits, 64);
          if (!eval_eq(bits, limb_type(0)))
            offbits |= 1;
        }

        if (((std::numeric_limits<Float>::digits==53 && sizeof(Float)==sizeof(uint64_t)) ||
             (std::numeric_limits<Float>::digits==24 && sizeof(Float)==sizeof(uint32_t))) &&
            std::numeric_limits<Float>::has_denorm == std::denorm_present &&
            std::numeric_limits<Float>::is_iec559)
        {
          // 'Float' is IEEE 754 binary64 or  IEEE 754 binary32
          // exploit knowledge of bit-level representation of IEEE 754 binaryXX
          // for extremely fast replacement of ldexp()
          if (e < std::numeric_limits<Float>::min_exponent-1)
            e = std::numeric_limits<Float>::min_exponent-2;
          onbits &= uint64_t(-1) >> (65-std::numeric_limits<Float>::digits); // remove hidden bit
          onbits += (offbits > offmsbit);
          const int exp_bias = std::numeric_limits<Float>::digits==24 ? 127 : 1023;
          onbits += uint64_t(exp_bias+e) << (std::numeric_limits<Float>::digits-1); // '+' rather than '|' , so case of carry from significand into exponent handled automatically
          onbits |= uint64_t(original_arg.sign() ? 1 : 0) <<
            (std::numeric_limits<Float>::digits==24 ? 31 : 63);
          if (sizeof(Float)==sizeof(uint32_t)) {
            uint32_t u32 = static_cast<uint32_t>(onbits);
            memcpy(res, &u32, sizeof(*res));
          } else {
            memcpy(res, &onbits, sizeof(*res));
          }
          return;
        }

        onbits += (offbits > offmsbit);
        if (std::numeric_limits<double>::radix==2 &&
            sizeof(uint64_t)==sizeof(double) &&
            std::numeric_limits<double>::digits==53 &&
            std::numeric_limits<double>::has_denorm == std::denorm_present &&
            std::numeric_limits<double>::is_iec559 && // platform's 'double' is IEEE 754 binary64
            std::numeric_limits<Float>::digits <= std::numeric_limits<double>::digits &&
            std::numeric_limits<Float>::min_exponent >= std::numeric_limits<double>::min_exponent &&
            std::numeric_limits<Float>::max_exponent <= std::numeric_limits<double>::max_exponent) {
          // 'Float' is a subset of 'double'
          // exploit knowledge of bit-level representation of IEEE 754 binary64
          // for replacement of ldexp() which is not necessarily faster than good library implementation,
          // but have more robust performance
          int ee = e+1-digits_to_round_to;
          if (ee & 1) {
            onbits <<= 1;
            ee -= 1;
          }
          double ret = static_cast<Float>(static_cast<int64_t>(onbits));
          ret = original_arg.sign() ? -ret : ret;
          int eh = ee / 2;
          uint64_t u_scale = uint64_t(1023 + eh) << 52;
          double d_scale;
          memcpy(&d_scale, &u_scale, sizeof(d_scale)); // d_scale = 2**(ee/2)
          *res = static_cast<Float>((ret * d_scale) * d_scale);
          return;
        }

        Float ret = static_cast<Float>(static_cast<int64_t>(onbits));
        ret = original_arg.sign() ? -ret : ret;
        *res = std::ldexp(ret, e+1-digits_to_round_to);
        return;
      }
   }

   //
   // Check for super large exponent that must be converted to infinity:
   //
   if(original_arg.exponent() > std::numeric_limits<Float>::max_exponent)
   {
      *res = std::numeric_limits<Float>::has_infinity ? std::numeric_limits<Float>::infinity() : (std::numeric_limits<Float>::max)();
      if(original_arg.sign())
         *res = -*res;
      return;
   }
   //
   // Figure out how many digits we will have in our result,
   // allowing for a possibly denormalized result:
   //
   common_exp_type digits_to_round_to = std::numeric_limits<Float>::digits;
   if(original_arg.exponent() < std::numeric_limits<Float>::min_exponent - 1)
   {
      common_exp_type diff = original_arg.exponent();
      diff -= std::numeric_limits<Float>::min_exponent - 1;
      digits_to_round_to += diff;
   }
   if(digits_to_round_to < 0)
   {
      // Result must be zero:
      *res = 0;
      if(original_arg.sign())
         *res = -*res;
      return;
   }


   //
   // Perform rounding first, then afterwards extract the digits:
   //
   cpp_bin_float<std::numeric_limits<Float>::digits, digit_base_2, Allocator, Exponent, MinE, MaxE> arg;
   typename cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE>::rep_type bits(original_arg.bits());
   arg.exponent() = original_arg.exponent();
   copy_and_round(arg, bits, (int)digits_to_round_to);
   common_exp_type e = arg.exponent();
   e -= cpp_bin_float<Digits, DigitBase, Allocator, Exponent, MinE, MaxE>::bit_count - 1;
   static const unsigned limbs_needed = std::numeric_limits<Float>::digits / (sizeof(*arg.bits().limbs()) * CHAR_BIT)
      + (std::numeric_limits<Float>::digits % (sizeof(*arg.bits().limbs()) * CHAR_BIT) ? 1 : 0);
   unsigned first_limb_needed = arg.bits().size() - limbs_needed;
   *res = 0;
   e += first_limb_needed * sizeof(*arg.bits().limbs()) * CHAR_BIT;
   while(first_limb_needed < arg.bits().size())
   {
      *res += std::ldexp(static_cast<Float>(arg.bits().limbs()[first_limb_needed]), static_cast<int>(e));
      ++first_limb_needed;
      e += sizeof(*arg.bits().limbs()) * CHAR_BIT;
   }
   if(original_arg.sign())
      *res = -*res;
}

View

Add a comment

Modify Ticket

Change Properties
<Author field>
Action
as reopened
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.