Tech Tip: It's Just Numbers, All the Way Down!

To represent integer values on a digital computer is relatively easy - you can just interpret a sequence of ones and zeros as the desired value in binary notation, so for example "00101101" is usually understood to mean 45 decimal - but when you want to deal with fractional values, things get a little more involved.

One simple way is to store an integer but interpret it as a scaled representation of the real value. With a scaling factor of 100, for example, a stored value of 100 means a true value of 1.0, and 50 means 0.5. This is a fixed-point system and, provided the range meets your needs, it is very easy to use and extremely space-efficient. You can use any integer type as your base type, depending on the range you need, and any scale factor you choose. With our present compilers you will see the following:

Base type        Scale      Precision      Range
unsigned char    100        1.0/100        0.0 to 2.55
unsigned long    1000000    1.0/1000000    0.0 to 4294.967295
unsigned int     65536      1.0/65536      0.0 to 0.9999847412109375
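
As a rough sketch of how the first row of the table works, the helpers below split a stored value into its whole and fractional parts and build one back up again. The names SCALE, fixed_whole, fixed_frac and fixed_make are made up for this illustration and are not part of any library.

```c
typedef unsigned char fixed;   /* base type, as in the first table row */
#define SCALE 100u             /* stored value = true value * SCALE */

/* Whole-number part of the true value, e.g. stored 255 -> 2 */
unsigned fixed_whole(fixed v)
{
    return v / SCALE;
}

/* Fractional part in units of 1/SCALE, e.g. stored 255 -> 55 */
unsigned fixed_frac(fixed v)
{
    return v % SCALE;
}

/* Build a stored value from its parts. The caller is assumed to keep
   the result within the range of the base type (0.00 to 2.55 here). */
fixed fixed_make(unsigned whole, unsigned hundredths)
{
    return (fixed)(whole * SCALE + hundredths);
}
```

So fixed_make(1, 20) gives a stored value of 120, which reads back as 1 whole and 20 hundredths, that is 1.20.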

Using a scale factor that is a power of the base you read and write values in (usually 10) will greatly simplify your I/O code. Using a scale factor that is a power of 2 will (on a binary computer) maximize your data storage efficiency, at the cost of more complex I/O code.
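
To give a feel for the extra I/O work a power-of-2 scale involves, here is a minimal sketch that prints a value with scale 65536 in decimal. It assumes a C99 <stdint.h> is available; the function name frac16_to_decimal is hypothetical, and the digits are truncated rather than rounded.

```c
#include <stdint.h>

/* Format a value with scale 65536 (true value = v / 65536) as "0.dddd".
   'out' must have room for digits + 3 characters. */
void frac16_to_decimal(uint16_t v, char *out, int digits)
{
    uint32_t rem = v;   /* remaining fraction, in units of 1/65536 */
    int i;

    *out++ = '0';
    *out++ = '.';
    for (i = 0; i < digits; i++) {
        rem *= 10;                            /* shift one decimal place left */
        *out++ = (char)('0' + (rem >> 16));   /* integer part is the next digit */
        rem &= 0xFFFFu;                       /* keep only the fraction */
    }
    *out = '\0';
}
```

With a scale of 100 none of this is needed, because the stored digits already are the decimal digits.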

The great thing about using an integer-based fixed-point system is that all arithmetic operations are no more expensive, computationally, than the underlying integer operations - to add 1.2 to 0.75 (in the first example above) and print it out requires something like:

#include <stdio.h>

typedef unsigned char fixed; /* scale factor of 100 */

int main(void)
{
    char buffer[5]; /* room for "2.55\0" */

    fixed a = 120; /* 1.20 */
    fixed b =  75; /* 0.75 */
    /* don't use 075 here, or octal will bite you */

    /* the sum wraps around if it exceeds the range of the base type */
    fixed c = (fixed)(a + b);
    if (c < a || c < b) {
        printf("overflow in addition\n");
    }
    else if (sprintf(buffer+1, "%03u", (unsigned)c) > 2) {
        /* move the units digit */
        buffer[0] = buffer[1];

        /* insert the decimal point */
        buffer[1] = '.';

        printf("%s\n", buffer); /* prints 1.95 */
    }
    else {
        printf("sprintf() failed\n");
    }
    return 0;
}
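
Addition and subtraction carry over directly, as above. Multiplication needs one extra step, because the raw integer product carries the scale factor twice. The sketch below shows one way to handle that for a scale of 100; fixed_mul is a hypothetical helper, and rounding to nearest is a choice made here for illustration.

```c
#include <limits.h>

typedef unsigned char fixed;
#define SCALE 100u

/* Multiply two scaled values.
   Returns 1 and stores the product on success,
   or 0 if the product does not fit in 'fixed'. */
int fixed_mul(fixed a, fixed b, fixed *result)
{
    /* the intermediate product needs more bits than the base type */
    unsigned long wide = (unsigned long)a * b;

    /* divide once by SCALE to get back to the original scale,
       adding half the divisor first to round to nearest */
    wide = (wide + SCALE / 2) / SCALE;

    if (wide > UCHAR_MAX)
        return 0;

    *result = (fixed)wide;
    return 1;
}
```

Multiplying 1.20 by 0.75 this way (stored values 120 and 75) gives a stored result of 90, that is 0.90, while 2.00 times 2.00 is reported as out of range.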

If a fixed-point system does not meet your needs then you may want to move to a floating-point system, which provides a second, variable and usually exponential, scale factor allowing the range to be greatly expanded when necessary, at the expense of precision. The floating-point system in most common use is defined in IEEE standard 754, and our compilers provide easy access to a simplified implementation of this with the float and double types.

Our float uses the following:

1 bit of sign information (==1 for negative)
8 bits of exponent
16 bits of mantissa


The mantissa is a binary fixed-point value, and to get the true value it is multiplied by 2 raised to the power of the exponent. Observant readers will notice that the total number of bits adds up to 25, which doesn't fit neatly in a group of 8-bit bytes. The most significant bit of the mantissa is always 1 (and any calculation that ends up otherwise is then "normalized" by adjusting the exponent to ensure this remains true), so it need not actually be stored, which leaves 24 bits, or exactly three bytes. The exponent is stored "excess 127", so a true exponent of 5 would be represented by a stored exponent of 132. A true value of 0.0 is represented by a stored exponent of 0 (with the sign and mantissa parts ignored).
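
The description above can be sketched as a small decoder. The bit layout used here is an assumption made for illustration (sign in bit 23, exponent in bits 22 to 15, the 15 stored mantissa bits in bits 14 to 0), as is reading the mantissa as 1.xxx in binary, which is the IEEE 754 convention. The function name decode_float24 is hypothetical, and the result is returned in the host's double purely so that it can be inspected.

```c
/* Decode a 24-bit pattern, assuming this layout:
   bit 23      sign (1 means negative)
   bits 22..15 exponent, stored excess 127
   bits 14..0  mantissa without its leading 1 */
double decode_float24(unsigned long bits)
{
    int stored_e = (int)((bits >> 15) & 0xFFu);
    int negative = (int)((bits >> 23) & 0x1u);
    unsigned long frac = bits & 0x7FFFu;
    double value;
    int e;

    if (stored_e == 0)
        return 0.0;   /* stored exponent 0 means a true value of 0.0 */

    /* put the implied leading 1 back to get all 16 mantissa bits,
       then treat them as a fixed-point value with scale 32768 */
    value = (double)(0x8000u | frac) / 32768.0;

    /* apply the true exponent (stored exponent minus 127) */
    for (e = stored_e - 127; e > 0; e--)
        value *= 2.0;
    for (e = stored_e - 127; e < 0; e++)
        value *= 0.5;

    return negative ? -value : value;
}
```

Under these assumptions the pattern 0x423400 has a stored exponent of 132 (true exponent 5) and a mantissa of 1.01101 binary, which decodes to 45.0.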

Our double is either the same as float, or (selectable at compile time) it uses an extra 8 bits of mantissa for extra precision - the range remains the same.
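
To put a number on that extra precision, the sketch below computes the gap between neighbouring representable values just above 1.0 for a given mantissa width, counting the implied leading bit. The function name step_near_one is made up for this example.

```c
/* Spacing between adjacent representable values in the range 1.0 to 2.0
   for a mantissa of 'mantissa_bits' bits, including the implied leading 1. */
double step_near_one(int mantissa_bits)
{
    double step = 1.0;
    int i;

    /* one bit sits to the left of the binary point; each of the
       remaining bits halves the step */
    for (i = 1; i < mantissa_bits; i++)
        step *= 0.5;

    return step;
}
```

For the 16-bit mantissa this is 1/32768 (about 0.00003), and for the 24-bit mantissa it is 1/8388608 (about 0.00000012).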


 

It is still possible for a calculation to produce a result that does not fit within a floating-point value, because the exponent needed to represent it is either too large or too small. The full IEEE 754 specification includes a number of tools to help reduce the impact of overflows and underflows, but avoiding such problems in the first place (by proper numerical analysis) will simplify your coding and debugging.
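
As one small illustration of avoiding the problem up front, the sketch below checks whether a product will stay in range before computing it. It assumes a standard <float.h> that provides FLT_MAX and FLT_MIN, and it only handles non-negative operands; checked_mul is a hypothetical helper, not something our libraries provide.

```c
#include <float.h>

/* Multiply two non-negative floats.
   Returns 1 and stores the product if it stays within range,
   or 0 if it would overflow or underflow. */
int checked_mul(float a, float b, float *out)
{
    /* overflow: a * b > FLT_MAX, rearranged to avoid doing the multiply */
    if (b > 1.0f && a > FLT_MAX / b)
        return 0;

    /* underflow: a non-zero product smaller than the smallest normal value */
    if (a != 0.0f && b != 0.0f && b < 1.0f && a < FLT_MIN / b)
        return 0;

    *out = a * b;
    return 1;
}
```

A check like this costs an extra division, so it is best kept for the few places where your numerical analysis says the range is genuinely at risk.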

 

January 2006.