Floating Point Representation
Programming with Real Numbers and their Floating Point Representation
Thus far in class we have dealt only with integer values. We have seen a simple, direct unsigned binary representation for positive integers, and the Two's Complement (2C) representation for signed integers (both positive and negative).
Higher-level programming languages offer us the ability to use real numbers, simply because many problems need to calculate fractional values, not just integers. It is extremely useful for a programmer to understand how these real numbers are stored, so that they understand what might go wrong when they use real values.
This tutorial and exercise will take you through several steps of how real values might be supported, and will end up explaining the true method that languages such as C/C++/Java use.
Idea 1: Fixed point representation
One possible method for storing and using fractional values at the binary level is to have a fixed point representation. That is, with the bits you have in your data value, just pick a fixed position: the bits to the right of that position are to the right of the decimal point, and the bits to the left of it are to the left of the decimal point.
Thus, if we have 8 bits in our values, we could pick a fixed point representation of:
iiiii.fff
That is, 5 bits to the left, and 3 bits to the right of the decimal point. Conversion to a decimal value is done exactly like we did for integers, with the positions to the right of the decimal point being negative exponents.
For example, the value 01011.101 would be 2^3 + 2^1 + 2^0 + 2^-1 + 2^-3 == 11.625 in decimal. With our representation, we can have values from 0.0 (it is unsigned) to 31.875, in steps of 0.125 (0.001 binary).
[Recall that negative exponents are reciprocals of positive exponents, so 2^-2 = 1/4, 2^-3 = 1/8, etc.]
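Here is a minimal C sketch of that conversion (the function name and the choice of an 8-bit unsigned integer to hold the pattern are just illustrations, not part of any standard): going from the iiiii.fff format back to a real value is simply a division by 2^3 = 8.

    #include <stdio.h>
    #include <stdint.h>

    /* Sketch: interpret an 8-bit pattern as the unsigned iiiii.fff fixed point
       format described above.  The stored bits are just the real value scaled
       by 2^3 = 8, so converting back is a single division. */
    double fixed_to_real(uint8_t bits)
    {
        return bits / 8.0;
    }

    int main(void)
    {
        uint8_t x = 0x5D;                      /* 0101 1101, i.e. 01011.101 */
        printf("%.3f\n", fixed_to_real(x));    /* prints 11.625             */
        return 0;
    }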
This is useful, and it has been used in computer design, but the problem is that it doesn't scale to different problems. What if all the values we had were smaller than 1? This representation would waste all of the five bits in front of the decimal point, and we wouldn't have very much accuracy with only three bits.
Idea 2: Floating point representation
A better solution would be to allow the decimal point to float -- in other words have a floating point representation. Actually, this is already familiar to us in the form of scientific notation. Think of a scientific notation where you are only allowed to write down, let's say, 8 digits (not including the 10 that is the base of the exponent) -- and that 3 of those digits are in the exponent, and five in the value.
With this, we could write down a number like 4.3275 * 10^182, or a number like 8.4832*10^003, or even (if we can have a negative exponent) 6.3491*10^-821.
The important notion here is that we always have five digits of accuracy, even though our numbers range from very big to extremely small. We are always able to use all our digits of accuracy. That is the advantage of floating point numbers.
Ok, we can do the same thing in binary. But we need some names. The power that the base is raised to is called the exponent, and the significant digits of our number (the part written without the exponent) are called the mantissa. That is, in the first example in the above paragraph, 4.3275 is the mantissa, and 182 is the exponent. 10 is the base of the exponent, and in binary, the base will be 2.
Let's take our 8 bits and pick the same dividing point for the mantissa and exponent -- that is:
mmmmm eee
So we have 5 mantissa bits and 3 exponent bits. For now, let's assume nothing is signed, and that we put a decimal point to the right of the first mantissa bit. So, for example, the 8-bit value 10101011 is interpreted as 1.0101 * 2^011. All of the conversions we have learned still apply, so this number is just (2^0 + 2^-2 + 2^-4) * 2^3, which is (1.3125 * 8) == 10.5. Notice that the mantissa 1.0101 just gets shifted over 3 places, since the exponent is 3. This is just like decimal scientific notation, there is nothing new going on here -- it just happens to all be in binary!
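To see the same decoding in code, here is a small C sketch (the names are invented for this example) that pulls our toy mmmmm eee format apart the way we just did by hand:

    #include <stdio.h>
    #include <stdint.h>

    /* Sketch: decode the toy unsigned format mmmmm eee -- 5 mantissa bits
       (with the binary point after the first bit) followed by 3 exponent bits. */
    double toy_to_real(uint8_t bits)
    {
        unsigned mantissa_bits = bits >> 3;             /* top 5 bits: 1.xxxx  */
        unsigned exponent      = bits & 0x7;            /* bottom 3 bits       */
        double   mantissa      = mantissa_bits / 16.0;  /* divide by 2^4       */
        return mantissa * (1u << exponent);             /* shift by exponent   */
    }

    int main(void)
    {
        printf("%.4f\n", toy_to_real(0xAB));    /* 10101011 -> prints 10.5000 */
        return 0;
    }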
With our representation, the biggest number we can represent is the bit value 11111111, or 1.1111 * 2^7, which is decimal 248. Notice that the next representable number below it is 11110111, or 1.1110 * 2^7, which is 240. So we skipped some integer values!
Well, this should be no surprise -- it is the same as with decimal scientific notation. If we limit ourselves to 2 mantissa digits in base-10 scientific notation, we can write 1.3*10^3 and 1.2*10^3, which are the numbers 1,300 and 1,200 respectively. But without more mantissa digits, we cannot write any of the numbers between 1,200 and 1,300. So it is the same with binary floating point.
This is an important lesson, because it underscores the need for the programmer to understand the accuracy of the floating point representation. It is not infinitely accurate.
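The same kind of gap shows up in the real 32-bit float type described in the next section, which carries about 24 significant binary digits. A small C demonstration (assuming IEEE single precision, which is what essentially all current machines use):

    #include <stdio.h>

    /* A 32-bit float carries about 24 significant binary digits (the IEEE
       section below explains exactly why), so above 2^24 it can no longer
       represent every integer -- just as our toy format skips integers
       between 240 and 248. */
    int main(void)
    {
        float a = 16777216.0f;               /* 2^24                          */
        float b = 16777217.0f;               /* 2^24 + 1 rounds back to 2^24  */
        printf("a = %.1f, b = %.1f, a == b is %d\n", a, b, a == b);
        return 0;
    }

Both variables print as 16777216.0, and the comparison reports that they are equal.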
Lastly, the only thing we haven't dealt with is signs. We will talk about this in the next section, but just notice that there are two signs needed -- one on the mantissa, signalling a negative number, and one on the exponent, signalling a very small number (the decimal point is moving far left).
IEEE Floating Point Representation
In the olden days (oh, 15 years ago and more), computer manufacturers (actually, the CPU chip designers) basically picked their own floating point representation. They would pick different sized mantissas and exponents, and handle signs in different ways, and the end result of this would be that a program that ran on different machines might get different results!
Well, a standard representation was devised by the IEEE, and now essentially all computers use the same formats for floating point representation. This is called the IEEE floating point standard (big surprise!).
The standard is actually a class of several representations of different sizes: single precision is 32 bits, double precision is 64 bits, and quad precision is 128 bits. In the C programming language, single and double precision correspond to the float and double data types; long double is usually a wider format than double, though not necessarily full quad precision.
In single precision the IEEE format has 1 sign bit, 8 exponent bits, and 23 mantissa bits, in that order from left to right (the 32 bits are seeeeeeeemmmmmmmmmmmmmmmmmmmmmmm). The sign bit is the sign of the mantissa (i.e., the sign of the overall number). Exponents of all 1's and all 0's are reserved, and the exponent stored in the bits is the actual exponent + 127. That way, the exponent is stored as a positive unsigned number, but it represents exponent values from -126 to +127. The exponent is a power of 2, of course.
In addition to the mantissa, there is a hidden bit that is a 1 bit tacked onto the front of the mantissa. If you think about it, a binary mantissa always begins with 1 since we don't write leading 0's on numbers. So the IEEE format just assumes that a 1 is there, and doesn't store it. It is a free extra bit of accuracy.
So, the number represented by a single precision IEEE number is
Value = (-1)^s * 1.mmmmmmmmmmmmmmmmmmmmmmm * 2 ^ (eeeeeeee - 127)
(Here (-1)^s is +1 when the sign bit is 0 and -1 when it is 1, and eeeeeeee is the stored exponent read as an unsigned integer.)
In decimal terms, this gives a number with about 7 digits of accuracy, and magnitudes from about 10^-38 to 10^38.
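Here is a minimal sketch, in C, of that formula applied to the raw bits of a float. It views the bit pattern through memcpy and, for simplicity, ignores the reserved all-0's and all-1's exponent patterns:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <math.h>

    /* Sketch: pull apart the 32 bits of a float (s eeeeeeee mmm...m) and
       rebuild the value as (-1)^s * 1.m * 2^(e - 127).  Reserved exponent
       patterns (all 0's and all 1's) are not handled here. */
    double decode_single(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);          /* view the raw bit pattern */

        unsigned sign     = bits >> 31;          /* 1 sign bit               */
        unsigned exponent = (bits >> 23) & 0xFF; /* 8 exponent bits          */
        uint32_t mantissa = bits & 0x7FFFFF;     /* 23 mantissa bits         */

        double m = 1.0 + mantissa / 8388608.0;   /* hidden 1 + m / 2^23      */
        double value = ldexp(m, (int)exponent - 127);  /* m * 2^(e - 127)    */
        return sign ? -value : value;
    }

    int main(void)
    {
        printf("%.6f\n", decode_single(10.5f));  /* prints 10.500000 */
        return 0;
    }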
In double precision (64 bit) IEEE format, the mantissa is 52 bits, and the exponent is 11 bits (with an offset of 1023). It gives us almost 16 decimal digits of precision, and magnitudes from 10^-308 to 10^308. This is a much larger range than single precision. Quad precision is 112 bits of mantissa, 15 of exponent.
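You can check these precision and range figures for whatever machine you compile on with the standard C header <float.h>:

    #include <stdio.h>
    #include <float.h>

    /* Print the guaranteed decimal digits of precision and the range limits
       of float and double, as reported by the compiler's <float.h>. */
    int main(void)
    {
        printf("float : %2d digits, range about %e to %e\n", FLT_DIG, FLT_MIN, FLT_MAX);
        printf("double: %2d digits, range about %e to %e\n", DBL_DIG, DBL_MIN, DBL_MAX);
        return 0;
    }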
We said earlier that exponents of all 1's and all 0's are reserved. The all-1's pattern is used for special conditions, like trying to divide by 0 or taking the square root of a negative number; the all-0's pattern is used for zero itself and for very small "denormalized" numbers that have no hidden 1 bit.
An exponent of all 1's with a mantissa of all 0's is infinity -- positive infinity if the sign bit is 0, negative infinity if the sign bit is 1. Dividing a non-zero number by zero results in infinity.
An exponent of all 1's with a non-zero mantissa is not-a-number, or NaN for short. Dividing 0 by 0, or taking the square root of a negative number, will result in a NaN value.
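A short C example of producing and testing these special values (isinf and isnan come from the standard <math.h> header):

    #include <stdio.h>
    #include <math.h>

    /* Demonstrate the reserved encodings: dividing a non-zero number by zero
       gives infinity, and dividing 0 by 0 gives NaN. */
    int main(void)
    {
        double zero = 0.0;
        double pos_inf      = 1.0 / zero;     /* exponent all 1's, mantissa 0     */
        double not_a_number = zero / zero;    /* exponent all 1's, mantissa non-0 */

        printf("%f  %f\n", pos_inf, not_a_number);   /* typically "inf  nan"       */
        printf("isinf = %d, isnan = %d\n", isinf(pos_inf), isnan(not_a_number));
        return 0;
    }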
Floating Point Accuracy
Floating point rounding errors accumulate over the course of a computation. [I will be adding more here.] Your end result might not be right -- in terms of the actual physics or underlying theory.
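As a small illustration of the kind of accumulation meant here: 0.1 has no exact binary representation, so every addition contributes a tiny rounding error.

    #include <stdio.h>

    /* 0.1 cannot be represented exactly in binary, so each addition introduces
       a small rounding error, and those errors accumulate across the loop. */
    int main(void)
    {
        float sum = 0.0f;
        for (int i = 0; i < 10000; i++)
            sum += 0.1f;
        printf("%f\n", sum);    /* close to, but not exactly, 1000.000000 */
        return 0;
    }

The printed sum is noticeably off from 1000, even though each individual addition looks harmless.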
Scientists used to have to be very careful to write their programs so that they would get the same results, since different CPUs used different FP formats. Now that all computers use the IEEE format, programmers and scientists are being less careful -- the result is that even though all the different computers will get the same result for a program, that result might be wrong!