c++ – 1 bit lost in long double

Question:

This follows up on the question about the bitwise representation of real numbers and my answer to it.

I want to determine programmatically, for any floating-point type, how many bits are allocated to the mantissa and how many to the exponent. To do this, I wrote the following code (the sign bit is counted separately from the mantissa, so the numbers are 1 less):

https://ideone.com/YuIWNc – C code (float, double, long double)
https://ideone.com/342B4S – C++ code (float, double, long double)
https://ideone.com/VURQnw – C++ code (float, double, long double, __float128)

#include <cstdio>

// Note: __float128 is a GCC extension, not a standard C++ type.
template <typename typed> void count(unsigned *result_m, unsigned *result_e)
{
  typed x = 1;
  unsigned res, exp, e;
  // res: number of halvings needed to take 1 down to 0
  for (res=0; x!=0; ++res) x/=2;
  // exp: largest power of two below res; e: its base-2 logarithm
  for (exp=1,e=0; exp*2<res; ++e) exp*=2;
  *result_e = e+1;
  *result_m = res-exp+1;
}

int main()
{
  unsigned f_m, f_e, d_m, d_e, ld_m, ld_e, f128_m, f128_e;

  count<float>(&f_m, &f_e);
  count<double>(&d_m, &d_e);
  count<long double>(&ld_m, &ld_e);
  count<__float128>(&f128_m, &f128_e);

  // sizeof yields size_t, so the size column uses %zu rather than %u
  printf("              S    M   E   SZ\n");
  printf("float:        1  %3u  %2u  %3zu\n",    f_m,    f_e, 8 * sizeof(float));
  printf("double:       1  %3u  %2u  %3zu\n",    d_m,    d_e, 8 * sizeof(double));
  printf("long double:  1  %3u  %2u  %3zu\n",   ld_m,   ld_e, 8 * sizeof(long double));
  printf("__float128:   1  %3u  %2u  %3zu\n", f128_m, f128_e, 8 * sizeof(__float128));
}

It turns out like this:

              S    M   E   SZ
float:        1   23   8   32
double:       1   52  11   64
long double:  1   63  15  128
__float128:   1  112  15  128

For float, double and even __float128 everything matches (Wikipedia, IEEE 754-2008).
But with long double, problems arise:

  1. 1+63+15 = 79 bits instead of 80. Where is the missing bit?
  2. long double represents 10-byte numbers, but sizeof returned 16.
    How can I get 10?

Answer:

One bit is lost because, on the x86 platform, the 80-bit floating-point format differs in one fundamental way from the 32- and 64-bit IEEE 754 formats (float and double).

float and double store the mantissa with an implicit leading 1. That is, in a normalized value the most significant 1 of the mantissa is not stored explicitly; it is only implied. In the 80-bit extended type long double, however, this leading bit of the mantissa is always stored explicitly.

Because of this, the difference arises.
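The difference in storage can be seen directly in the bit patterns of the value 1. The following is only a sketch: the helper names float_bits and long_double_low64 are made up for this illustration, and the long double part assumes a little-endian x86 target where long double is the 80-bit x87 format, so that the lowest 8 bytes of the object hold the 64-bit mantissa field.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Raw bits of a float (assumes a 32-bit IEEE 754 binary32 float).
std::uint32_t float_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Lowest 8 bytes of a long double. ASSUMPTION: little-endian x86 with the
// 80-bit x87 format, where these bytes are the whole mantissa field,
// including the explicitly stored leading bit. On other platforms the
// result has a different meaning.
std::uint64_t long_double_low64(long double v) {
    static_assert(sizeof(long double) >= sizeof(std::uint64_t), "expects at least 8 bytes");
    std::uint64_t u;
    std::memcpy(&u, &v, sizeof u);
    return u;
}
```

Under those assumptions, float_bits(1.0f) & 0x007FFFFF is 0 (the leading 1 is not stored), while long_double_low64(1.0L) has its top bit set (the leading 1 is stored). Checking std::numeric_limits<long double>::digits == 64 first is one way to confirm that the x87 format is in use.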

For float and double, your first loop starts by stepping through the normalized representations of the number, in which the stored mantissa is always zero and the exponent field decreases from half its maximum value (127 for float) down to 1:

// For `float`

// Normalized representations: the mantissa is 0 and the exponent decreases from 127 to 1

0x3F800000
...
0x00800000  <- after 126 divisions

After that, your loop continues through the denormalized representations of the number, in which the exponent is 0 and a lone 1 bit moves to the right along the mantissa. When this bit falls off the right edge of the mantissa, x becomes zero and the loop ends:

// Denormalized representations: the exponent is 0 and the mantissa consists
// of a single 1 bit moving to the right

0x00400000
0x00200000
...
0x00000001
0x00000000  <- after 150 divisions

Note that in float and double a 1 first appears in the stored mantissa at the very first denormalized value, in the most significant position, and from there it passes through every bit of the mantissa. The number of nonzero denormalized values visited is therefore equal to the number of bits in the mantissa.
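The float trace above can be reproduced with a few lines of code. This is a minimal sketch with a hypothetical helper name, bits_after_halvings; it assumes float is IEEE 754 binary32 and that denormalized values are not flushed to zero (so no -ffast-math or similar settings).

```cpp
#include <cstdint>
#include <cstring>

// Bit pattern of 1.0f after `halvings` divisions by 2.
// ASSUMPTION: 32-bit IEEE 754 float with gradual underflow enabled.
std::uint32_t bits_after_halvings(unsigned halvings) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    float x = 1.0f;
    for (unsigned i = 0; i < halvings; ++i) x /= 2;
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}
```

Calling it with 126, 127, 149 and 150 should give 0x00800000, 0x00400000, 0x00000001 and 0x00000000 respectively: the last normalized value, the first denormalized value, the last nonzero value, and zero.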

With long double, however, the 1 in the most significant bit of the mantissa is explicitly present from the very beginning. When the exponent of the long double reaches zero and your loop starts stepping through denormalized long double values, the 1 does not appear "out of nowhere" in the top position of the mantissa (as it does in float and double); it is already in the top position and starts moving from there. Because of this, the part of the loop that counts denormalized values does one iteration fewer than the mantissa has bits: 63 nonzero denormalized values for a 64-bit mantissa field.
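One way to check this reasoning is to split the iterations of the halving loop into normalized and denormalized values with std::fpclassify. The sketch below uses hypothetical names (HalvingCounts, count_halvings) and again assumes that denormalized values are not flushed to zero.

```cpp
#include <cmath>

struct HalvingCounts {
    unsigned normal;     // normalized values visited, starting with 1
    unsigned subnormal;  // nonzero denormalized values visited
};

// Repeats the question's first loop, classifying every value visited.
template <typename T>
HalvingCounts count_halvings() {
    HalvingCounts c{0, 0};
    for (T x = 1; x != 0; x /= 2) {
        if (std::fpclassify(x) == FP_SUBNORMAL) ++c.subnormal;
        else ++c.normal;
    }
    return c;
}
```

For float the denormalized count should be 23 and for double 52, matching the stored mantissa widths. On x86 with the 80-bit long double the count should be 63, one less than the 64-bit stored mantissa, which is exactly the bit missing from the question's table.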


By the way, the odd approach of adding half of the exponent range and the width of the mantissa into a single sum, res, is fraught with problems. You then compute log2 of res and expect that value to describe the number of bits in the exponent correctly. This only works while the mantissa width is small compared with the exponent range, so that the largest power of two below res is determined by the exponent alone. If some hypothetical floating-point type had a very wide mantissa, the value of log2 res could be wrong.
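A possible alternative that keeps the two quantities separate is to read them from std::numeric_limits. This is a sketch rather than a drop-in replacement: the function names fraction_bits and exponent_bits are made up here, and exponent_bits assumes an IEEE-style layout in which the all-zeros and all-ones exponent codes are reserved.

```cpp
#include <limits>

// Mantissa bits excluding the leading 1, whether that bit is implicit
// (float, double) or stored explicitly (x87 long double).
template <typename T>
unsigned fraction_bits() {
    return static_cast<unsigned>(std::numeric_limits<T>::digits - 1);
}

// Width of the exponent field.
// ASSUMPTION: IEEE-style encoding with two reserved exponent codes
// (zero/denormalized values and infinity/NaN).
template <typename T>
unsigned exponent_bits() {
    // number of normal exponent values plus the two reserved codes
    unsigned codes = static_cast<unsigned>(std::numeric_limits<T>::max_exponent
                                           - std::numeric_limits<T>::min_exponent + 3);
    unsigned bits = 0;
    while ((1u << bits) < codes) ++bits;
    return bits;
}
```

For float this gives 23 and 8, for double 52 and 11. For the x87 long double it gives 63 and 15; numeric_limits does not say whether the leading 1 is stored, so the explicitly stored bit still has to be added by hand to reach 1 + 15 + 64 = 80.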
