Modify

Opened 21 months ago

Closed 21 months ago

Last modified 20 months ago

#12039 closed Bugs (fixed)

cpp_bin_float convert_to<double>() rounding mistake

Reported by: Michael Shatz <shatz@…> Owned by: John Maddock
Milestone: To Be Determined Component: multiprecision
Version: Boost 1.61.0 Severity: Problem
Keywords: Cc:

Description

There are cases where cpp_bin_float convert_to<double>() does not produce the nearest 'double' result. It happens when the source argument differs form the middle point between two double-precision values in 65th bit or further. I didn't look at boost source code, but pretty sure that the mistakes occurs due to double rounding. I.e. instead of direct rounding to 53-bit precision of 'double' the number is initially converted to 64-bit 'long double' and then converted from 'long double' to 'double'.

In particular, on Microsoft compilers 'long double' and 'double' refer to the same type and the mistake does not happen. But on the same machine/OS with gcc compiler the mistake does happen.

Below attached a simple test case that prints 'good' on platforms without bug and prints bad on platforms with bug.

Attachments (3)

convert_test.cpp (516 bytes) - added by Michael Shatz <shatz@…> 21 months ago.
convert_to<double>() bug reproducer
convert32_test.cpp (719 bytes) - added by Michael Shatz <shatz@…> 21 months ago.
convert_to_double_core.cpp (716 bytes) - added by shatz@… 20 months ago.

Download all attachments as: .zip

Change History (10)

Changed 21 months ago by Michael Shatz <shatz@…>

Attachment: convert_test.cpp added

convert_to<double>() bug reproducer

comment:1 Changed 21 months ago by John Maddock

Component: Nonemultiprecision
Owner: set to John Maddock

comment:2 Changed 21 months ago by John Maddock

Resolution: fixed
Status: newclosed

Changed 21 months ago by Michael Shatz <shatz@…>

Attachment: convert32_test.cpp added

comment:3 Changed 21 months ago by Michael Shatz <shatz@…>

Hi John

There exists another more obscure but very similar problem: double rounding in conversion to 'float' due to initial conversion to 'double'. Practically, due to huge difference in precision between 'double' and 'float', this problem is far less severe than the previous one. It is very unlikely that it negatively affect real-world applications. Still, when you know where to look, it can be easily demonstrated.

May be, when you are at it, it makes sense to fix this minor problem as well?

A test case (convert32_test.cpp) attached above.

comment:4 Changed 21 months ago by John Maddock

That case was already fixed by the patch above - however there was a different one involving rounding of ties - test case and patch for that in: https://github.com/boostorg/multiprecision/commit/a96bea66e191ba827626a75c9721f3018c7fb1f3

Changed 20 months ago by shatz@…

Attachment: convert_to_double_core.cpp added

comment:5 Changed 20 months ago by shatz@…

May be, I don't know how fetch your patch (my github skills are below basic), but it seems to me that instead of fixing bad case you had broken good case. Since my stupid test only compares two results of conversion to each other, it is happy. But it shouldn't.

I can't say that I fully understood your cpp_bin_float format (in particular, I can't figure out the business with guards) but assuming that I didn't misunderstood too badly, I recommend the attached core routine for conversion to double. In this routine rounding/ties handling is done by compiler/hardware rather than by us. Sometimes it does a better job. As additional advantage, it's likely several times (or many times for wide numbers) faster than your variant. Of course, it only works when arg.backend().bits().limbs() has a type 'uint64_t*' or its equivalent. I didn't figure out if it's a case on all supported platforms or only on mine (x64). But even if it's the later, it still makes sense to specialize, because I think it's safe to assume that x64 platform is by far the most important for your customers.

Another thing that I didn't figure out is what happens when # of binary digits is not an integer multiple of 64. But I would believe, that you will have no trouble to handle this case as well. At worst it will take a simple mask applied to the first (i.e. least significant) word.

Best regards, Michael

comment:6 Changed 20 months ago by shatz@…

Hi,

My previous comment was wrong. Because of my unfamiliarity with github, I was testing the version after the first patch, not after the second patch. The second version looks o.k. Actually, it is pretty close to my proposed variant above. A bit slower than mine, but a difference in speed is not much above noise level.

Good job, John. Thank you.

Regards, Michael

comment:7 Changed 20 months ago by anonymous

OK good, I just pushed a few additional commits, adds an exhaustive convert-to-float test program (takes about half a day to run, so not part of the regular tests). This uncovered a fencepost error in the subtraction code, plus a double-rounding error in mpfr_float_backend (both also fixed).

The main advantage of the new code is that it uses cpp_bin_float's existing rounding code to ensure a correct result, and is completely agnostic as to the widest integer size, or the size of the float being converted to. In trivial cases it basically degenerates to the same principle as yours - the temporary cpp_bin_float becomes a trivial wrapper around a single unsigned integer, and the bit-extraction loop executes just the once.

Modify Ticket

Change Properties
Set your email in Preferences
Action
as closed The owner will remain John Maddock.
The resolution will be deleted.

Add Comment


E-mail address and name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.