"A Question Of Rounding" in issue #143

Paul Sephton [paul at inet.co.za]

Sat, 06 Oct 2007 19:53:57 +0200

Hello, all

In order to clarify a few points, set some issues to rest, and update the listing of the source code that appeared in "A Question Of Rounding", the following may be of interest.

First of all, I have to state unequivocally that there is no bug. There never was a bug. The printf() family of functions completely follow the letter of both the C99 and IEEE specifications (at least for GLibC). The matter has been completely explored and put to rest.

The issues brought to light in the article are, however very pertinent to those of us who make a living producing code. The listing below addresses some of the real life issues some of us face when having to deal with floating point arithmetic.

Please refer to the discussion which follows for an explanation as to why printf() behaves the way it does, and what this code does to address some of the problems I highlighted in my article. The discussion is also a summary of some of the many points raised in the bug report discussion, and the information provided to me during that process by a few very smart people.

#include <math.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
 
//______________________________________________________________________
// Utility function converts an IEEE double precision number to a
// fixed precision decimal format stored in a buffer.
void tobuf(size_t max, unsigned int *len, char *buf, 
           double x, int decimals, double max_prec)
{
  int    l_dec = 0, sign  = x < 0;                // remember the sign
  double q     = pow(10,decimals);                // current mask
  double i;                                       // Integer portion
  double y = modf(x*q, &i) / q;                   // Fractional portion
  double l_div = round(y*max_prec)/max_prec;      // significant digit
  if (q <= max_prec)
    l_dec = abs((int)round(l_div*10*q))%10;       // this decimal
  
  if (l_dec) x = i / q;
  
  if (fabs(x) > 0)                            // recurse while |x| > 0
    tobuf(max, len, buf, x, decimals-1, max_prec);
  else {                                      // x == 0 - first digit
    if (*len+1 < max && sign) buf[(*len)++] = '-';
    if (*len+2 < max && decimals >= 0) { 
      buf[(*len)++] = '0'; 
      buf[(*len)++] = '.'; 
    }
    while (*len+1 < max && decimals-- > 0) 
      buf[(*len)++] = '0';
  }
  if (*len+1 < max && decimals == 0)
    buf[(*len)++] = '.'; 
  
  // for first and subsequent digits, add the digit to the buffer
  if (*len+1 >= max) return;
  buf[(*len)++] = '0' + l_dec;
}
 
 
//______________________________________________________________________
// Convert the value x to a decimal representation stored in a buffer
unsigned int dbl2buf(size_t max, char *buf, double x, int decimals) {
  const int DECIMALS=15;                         // max significant
digits
  int    max_dec = DECIMALS-(int)(trunc(log10(fabs(x)))+1); 
  double max_prec = pow(10,max_dec);             // magnitude for
precision loss
  unsigned int len = 0;                          // buffer length init
  
  if (x != x) { strncpy(buf, "NAN", max); return 0; }
  if ((x-x) != (x-x)) { strncpy(buf, "INF", max); return 0; }
  
  tobuf(max, &len, buf, x, decimals-1, max_prec); // fill in buffer
  buf[len] = 0;                                   // terminate buffer
  return len;                                     // return buffer
length used
}
 
//______________________________________________________________________
//  Test the dbl2buf function.
int main (void)
{
  int n, nfails=0, nspfails=0;
  char buf1[128];
  char buf2[64];
  char buf3[64];
  
  for(n=0;n<10000000;n++){
    // generate a random floating point number with random exponent
    double x = random() / (double)random() * pow(10, 15-random()%30);
    x *= (random()%2==0)?1:-1;          // gen negative numbers too
    dbl2buf(sizeof(buf1), buf1, x, random()%20);  // constraint
    
    x = atof(buf1);         // initialise test value from decimal.
    dbl2buf(sizeof(buf2), buf2, x, 20);          // IEEE->decimal
    snprintf(buf3, sizeof(buf3), "%-.20f", x);   // ditto (sprintf)
    if (x != atof(buf2)) {                       // decimal->IEEE
      nfails++;
    }
    if (x != atof(buf3)) {                       // decimal=>IEEE
      nspfails++;
    }    
  }
  printf("%i random floating point values tested\n", n);
  printf("  Number of dbl2buf failures: %i\n", nfails);
  printf("  Number of sprintf failures: %i\n", nspfails);
  return 0;
}

Output:

10000000 random floating point values tested
  Number of dbl2buf failures: 0
  Number of sprintf failures: 0

The dbl2buf procedure implements a IEEE to decimal conversion similar to snprintf(). The main differences are: a) Usage of the "common rounding" rules where precision is lost, in a conversion as opposed to "bankers rounding" as implemented in sprintf() b) Recognises the loss of accuracy when a number was first converted from decimal=>IEEE, and consequently limits the allowable precision when converting back to decimal. This internal precision is fixed at 15 significant decimal digits, for which decimal=>IEEE=>decimal is the identity. c) Only supports fixed point output at present d) It's slower :-)

Discussion:

The C functions atof() and strtod() convert to a number from a string containing a decimal representation. IEEE often cannot store the decimal number accurately, as:

   a) Binary and decimal radix do not share the same set of irrational
      numbers
   b) The size of the floating point unit is constrained to a fixed 
      size.

The above is untrue for the conversion from IEEE format to decimal, where the width of the decimal number is unconstrained. It is always possible to convert IEEE=>decimal in such a way that the exact same IEEE number can be reproduced from the decimal buffer, provided the buffer is big enough.

Therefore, it is always possible to have IEEE=>decimal=>IEEE as the identity function, but with decimal=>IEEE=>decimal it is only possible to retain identity by constraining the precision of the decimal number you wish to reproduce to a maximum as dictated by the size of the floating point unit.

The following examples all involve decimal=>IEEE=>decimal conversions. The statement printf("%.5f", 1.234) tells the compiler to produce the binary equivalent of 1.234 and store it on stack, or in an FP register (decimal=>binary) before calling the printf() function to convert it back to decimal, and printing it to the screen.

printf() is not specified to allow for artificial precision constraints when doing the binary=>decimal conversion; as a result there is no loss in accuracy for it's output. This allows for reproducing an IEEE value from it's result, and so you get for example:

    printf("%.64f\n", 0.15);
    =>0.1499999999999999944488848768742172978818416595458984375000000000

such that

    x = 0.15;
    sprintf(buf, "%64f", x);
    y = atof(buf);
    return x == y; // is the identity

which is very nice if you want atof() to reproduce the original binary number, but not so good if what you really wanted to see was

    dbl2buf(sizeof(buf1), buf1, 0.15, 64);  printf("%s\n", buf1);
    =>0.1500000000000000000000000000000000000000000000000000000000000000

Of course, we all know that

  printf("%.2f\n", 0.15); would have given us
  =>0.15

since printf() would round the number to the nearest value at decimal place 2. This is exactly what we would expect to see given the circumstance. But, how many people would have expected

  printf("%.1f\n", 0.15);  to have given us
  =>0.1  as a result?

There is a perfectly logical reasons for this: The stored number, being less than 1.5 (having lost accuracy in the decimal=>IEEE conversion) is accurately and correctly rounded to 0.1 since it is closer to 0.1 than to 0.2.

what one might have preferred to see, is

  dbl2buf(sizeof(buf1), buf1, 0.15, 2); printf("%s\n", buf1);
  =>0.15
  dbl2buf(sizeof(buf1), buf1, 0.15, 1); printf("%s\n", buf1);
  =>0.2

We can correctly round the number in the above example only if we are willing to dispense with some of the accuracy in the Floating Point Unit. There are a whole bunch of Floating Point numbers which would produce the value 0.2 when rounded to a single decimal digit of precision, or indeed the decimal number 0.150000000000000 rounded to 15 decimal digits of precision.

Constraining the precision of the target decimal value, though, effectively makes it impossible to ensure that subsequent decimal=>IEEE conversion will produce the exact same IEEE number.

For this reason amongst many others, printf() cannot be altered to use the above (precision constraint) logic, as the C specification explicitly states that IEEE=>decimal=>IEEE must be the identity function.

One more surprise awaits (some of) us, though:

  printf("%.64f\n", 1.25);
    =>1.2500000000000000000000000000000000000000000000000000000000000000
  printf("%.2f\n", 1.25);
    =>1.25
  printf("%.1f\n", 1.25);
    =>1.2

This has a perfectly good explanation too. Although we cannot blame the accuracy of the floating point unit for this, we are looking at a situation where 1.25 is no closer to 1.3 than it is to 1.2. Either 1.2 or 1.3 would therefore suffice, and we fall back to the rounding rule used by printf().

For the GNU C library, the rounding rule used by printf() is "bankers rounding" or "round to even". This is more correct than some other C libraries, as the C99 specification says that conversion to decimal should use the currently selected IEEE rounding mode (default bankers rounding).

The dbl2buf() procedure though, is free to implement what is known as "common rounding" or "symmetric arithmetic rounding" which is commonly tought in schools. It therefore produces

    dbl2buf(sizeof(buf1), buf1, 1.25, 2); printf("%s\n", buf1);
      =>1.25
    dbl2buf(sizeof(buf1), buf1, 1.25, 1); printf("%s\n", buf1);
      =>1.3

It should be stressed that neither "bankers rounding" nor "common rounding" is necessarily the more "correct" way to round numbers. Indeed, these are far from being the only two choices of rounding rules.

Bankers rounding produces less error when the number is used in further calculations, as the error averages out. However, common rounding is so named, because it is the most common and well known form of rounding.

It is relatively simple to adapt the dbl2buf() procedure for whichever form of rounding is most suitable.

IEEE does not specify a "common rounding" or "round away from zero" option, so again printf() would be wrong to implement common rounding- even as an option. For those of us who, like myself, need this sort of rounding, we need to code our own binary to decimal conversions.

Frankly, doing so has the distinct advantage of not being reliant on a particular library's implementation of printf() and family. If this is your case, I hope the provided method will be of some use.

Paul Sephton

Top Back