Modify

Ticket #12039 (closed Bugs: fixed)

Opened 14 months ago

Last modified 13 months ago

cpp_bin_float convert_to<double>() rounding mistake

Reported by: Michael Shatz <shatz@…> Owned by: johnmaddock
Milestone: To Be Determined Component: multiprecision
Version: Boost 1.61.0 Severity: Problem
Keywords: Cc:

Description

There are cases where cpp_bin_float convert_to<double>() does not produce the nearest 'double' result. It happens when the source argument differs form the middle point between two double-precision values in 65th bit or further. I didn't look at boost source code, but pretty sure that the mistakes occurs due to double rounding. I.e. instead of direct rounding to 53-bit precision of 'double' the number is initially converted to 64-bit 'long double' and then converted from 'long double' to 'double'.

In particular, on Microsoft compilers 'long double' and 'double' refer to the same type and the mistake does not happen. But on the same machine/OS with gcc compiler the mistake does happen.

Below attached a simple test case that prints 'good' on platforms without bug and prints bad on platforms with bug.

Attachments

convert_test.cpp (516 bytes) - added by Michael Shatz <shatz@…> 14 months ago.
convert_to<double>() bug reproducer
convert32_test.cpp (719 bytes) - added by Michael Shatz <shatz@…> 14 months ago.
convert_to_double_core.cpp (716 bytes) - added by shatz@… 14 months ago.

Change History

Changed 14 months ago by Michael Shatz <shatz@…>

convert_to<double>() bug reproducer

comment:1 Changed 14 months ago by johnmaddock

  • Owner set to johnmaddock
  • Component changed from None to multiprecision

comment:2 Changed 14 months ago by johnmaddock

  • Status changed from new to closed
  • Resolution set to fixed

Changed 14 months ago by Michael Shatz <shatz@…>

comment:3 Changed 14 months ago by Michael Shatz <shatz@…>

Hi John

There exists another more obscure but very similar problem: double rounding in conversion to 'float' due to initial conversion to 'double'. Practically, due to huge difference in precision between 'double' and 'float', this problem is far less severe than the previous one. It is very unlikely that it negatively affect real-world applications. Still, when you know where to look, it can be easily demonstrated.

May be, when you are at it, it makes sense to fix this minor problem as well?

A test case (convert32_test.cpp) attached above.

comment:4 Changed 14 months ago by johnmaddock

That case was already fixed by the patch above - however there was a different one involving rounding of ties - test case and patch for that in: https://github.com/boostorg/multiprecision/commit/a96bea66e191ba827626a75c9721f3018c7fb1f3

Changed 14 months ago by shatz@…

comment:5 Changed 14 months ago by shatz@…

May be, I don't know how fetch your patch (my github skills are below basic), but it seems to me that instead of fixing bad case you had broken good case. Since my stupid test only compares two results of conversion to each other, it is happy. But it shouldn't.

I can't say that I fully understood your cpp_bin_float format (in particular, I can't figure out the business with guards) but assuming that I didn't misunderstood too badly, I recommend the attached core routine for conversion to double. In this routine rounding/ties handling is done by compiler/hardware rather than by us. Sometimes it does a better job. As additional advantage, it's likely several times (or many times for wide numbers) faster than your variant. Of course, it only works when arg.backend().bits().limbs() has a type 'uint64_t*' or its equivalent. I didn't figure out if it's a case on all supported platforms or only on mine (x64). But even if it's the later, it still makes sense to specialize, because I think it's safe to assume that x64 platform is by far the most important for your customers.

Another thing that I didn't figure out is what happens when # of binary digits is not an integer multiple of 64. But I would believe, that you will have no trouble to handle this case as well. At worst it will take a simple mask applied to the first (i.e. least significant) word.

Best regards, Michael

comment:6 Changed 14 months ago by shatz@…

Hi,

My previous comment was wrong. Because of my unfamiliarity with github, I was testing the version after the first patch, not after the second patch. The second version looks o.k. Actually, it is pretty close to my proposed variant above. A bit slower than mine, but a difference in speed is not much above noise level.

Good job, John. Thank you.

Regards, Michael

comment:7 Changed 13 months ago by anonymous

OK good, I just pushed a few additional commits, adds an exhaustive convert-to-float test program (takes about half a day to run, so not part of the regular tests). This uncovered a fencepost error in the subtraction code, plus a double-rounding error in mpfr_float_backend (both also fixed).

The main advantage of the new code is that it uses cpp_bin_float's existing rounding code to ensure a correct result, and is completely agnostic as to the widest integer size, or the size of the float being converted to. In trivial cases it basically degenerates to the same principle as yours - the temporary cpp_bin_float becomes a trivial wrapper around a single unsigned integer, and the bit-extraction loop executes just the once.

View

Add a comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
The resolution will be deleted. Next status will be 'reopened'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.