Floating Point

Number Base Conversions


Binary Fractions

First, remember what digits to the right of the decimal point mean in decimal: they're the "tenths" place, the "hundredths" place, and so on. In binary, they're the "halves" place, the "quarters" place, and so on.

Let's suppose we've got a number like 5/8 that we want to convert to a binary fraction. We start with the binary for 5 (that's 0101), and right-shift it three places to divide by 8. That gives us 0.101.

If we think about that number, we've got a 1 in the halves place, and a 1 in the eighths place: so we've got 1/2 + 1/8, which is 5/8.
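The place-value reading can be checked with a quick Python sketch (just arithmetic, nothing from the notes themselves):

```python
# Place values to the right of the binary point: halves, quarters, eighths.
value = 1 * (1/2) + 0 * (1/4) + 1 * (1/8)   # the digits of 0.101
print(value)   # 0.625, i.e. 5/8
```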

Converting fractions from/to binary

Naturally, just as for integers, we've got more "algorithmically friendly" ways to convert numbers; especially for the important case of converting a decimal fraction to binary.

Recall that we converted integers from decimal to binary using the division method, and from binary to decimal using the multiplication method.

Fraction conversion is the other way around: we start with a decimal fraction, and keep multiplying by two. The result of each multiplication is one bit of the result.

Key point in converting decimal to binary: remember that in converting ints from decimal to binary, we worked by continually picking off the least significant bit by asking if the number is odd, and then dividing the number by two to get the next bit. We'll work here by continually asking if the number is less than 1/2, and grabbing a 0 if it is or a 1 if it isn't.

Converting decimal fractions to binary

A good way to organize the problem is to use a table, with columns old, bit, and new. We start with the number we're planning to convert in the top row old entry.

Now, we work across each row, and multiply the old entry by 2 to get a new value. Now we take the integer part of the new value and put it in the bit entry, and put the fraction part in the new entry. If the new entry is non-zero, we copy it into the old entry in the next row down and continue (if it is 0, we're done).

The result is the contents of the bit column, with the most significant bit in the top row.

Here's an example: let's convert .375 from decimal to binary.

  old    bit   new
  .375   0     .75      Multiply .375 by 2, get 0.75
  .75    1     .5       Multiply .75 by 2, get 1.5
  .5     1     0        Multiply .5 by 2, get 1.0

So the result is .011.

You keep it up until you either have converted the number, or you've generated a bit string that is as long as the space you have available to put the result. To see an example of a number where you will have to quit early, try converting 0.2. The result, in binary, repeats.
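The table procedure above can be sketched in Python. The helper name `frac_to_binary` is ours, not from the notes; it stops after a fixed number of bits so that repeating fractions like 0.2 terminate:

```python
def frac_to_binary(x, max_bits=23):
    """Convert a decimal fraction 0 <= x < 1 to a string of binary digits."""
    bits = []
    while x != 0 and len(bits) < max_bits:
        x *= 2                  # work across one row of the table
        bit = int(x)            # the integer part goes in the "bit" column
        bits.append(str(bit))
        x -= bit                # the fraction part goes in the "new" column
    return "".join(bits)

print(frac_to_binary(0.375))    # '011'
print(frac_to_binary(0.2))      # repeats '0011...' until max_bits is reached
```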

Converting binary fractions to decimal

Likewise, we can convert binary fractions to decimal by using a division method. We'll build a table like before; this time the columns are old, bit, and new. The top-row old value is 0, and we put one bit in each row of the bits column with the least-significant bit first.

Now, we go across each row, and add the old value to the bit value and divide by two to get the new value. Then we copy the new value to the old value of the next row, and keep it up until we've filled in all the rows. The answer is in the bottom new value.

Here's an example, using .011₂.

old    bit   (old + bit)/2
0      1     .5
.5     1     .75
.75    0     .375

So the result is .375₁₀.
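The division-method table translates directly to code. `binary_frac_to_decimal` is a hypothetical helper name of ours:

```python
def binary_frac_to_decimal(bits):
    """Value of a binary fraction digit string such as '011' (meaning .011)."""
    value = 0.0                         # the top-row "old" entry
    for bit in reversed(bits):          # least-significant bit first
        value = (value + int(bit)) / 2  # (old + bit)/2 becomes the next "old"
    return value

print(binary_frac_to_decimal("011"))    # 0.375
```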

Converting between bases that are a power of two

As with integer conversions, we can do this by converting groups of bits. But, instead of starting from the right, we start from the left. A way to remember this is that in both cases we start grouping from the radix point; another way is that with integers leading zeros are insignificant but with fractions it's the trailing zeros that are insignificant.
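Grouping from the radix point can be sketched for the binary-to-hex case; `binary_frac_to_hex` is our own illustrative helper, which pads with the insignificant trailing zeros before grouping:

```python
def binary_frac_to_hex(bits):
    """Convert a binary fraction (the digits after the point) to hex digits,
    grouping by fours starting at the radix point."""
    bits += "0" * (-len(bits) % 4)      # trailing zeros are insignificant
    groups = [bits[i:i+4] for i in range(0, len(bits), 4)]
    return "".join(format(int(g, 2), "x") for g in groups)

print(binary_frac_to_hex("011"))     # '6'   (.0110 binary = .6 hex)
print(binary_frac_to_hex("10111"))   # 'b8'  (.1011 1000)
```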

IEEE Floating Point Format

Floating point notation is essentially the same as scientific notation, only translated to binary. There are three fields: the sign (which is the sign of the number), the exponent (some representations have used a separate exponent sign and exponent magnitude; IEEE format does not), and a significand.

As we discuss the details of the format, you'll find that the motivations used to select some features seem like they should have driven other features in directions other than what was actually used. It seems inconsistent to me, too...

One other thing to mention is that the IEEE floating point format actually specifies several different formats: a ``single-precision'' format that takes 32 bits (i.e., one word on most machines) to represent a value, and a ``double-precision'' format that allows for both greater precision and greater range, but uses 64 bits. We'll be talking about single-precision here.


The one-bit sign is 0 for positive, or 1 for negative. The representation is sign-magnitude.

In integers, we use 2's complement for negative numbers because it makes the arithmetic ``just work;'' we can add two numbers together without regard to whether they are positive or negative, and get the right answer. This won't work for floating point numbers because the exponents need to be manipulated; if we used a 2's complement representation for the entire word we'd have to reconstruct the exponent any time we wanted to add or subtract, so it wouldn't gain us anything; in fact, trying to do arithmetic involving a negative number would involve converting it to positive first.

All the same, using the same negative-representation for integer and floating point has been done: the CDC 6600, which used 1's complement arithmetic for integers, also represented floating point numbers by taking the 1's complement of the entire word. The CDC Cyber 205 left the exponent alone, and represented negatives by taking the 2's complement of the mantissa.


The exponent gives a power of two, rather than a power of ten as in scientific notation (again, there have been floating point formats using a power of eight or sixteen; IEEE uses two).

The eight-bit exponent uses excess 127 notation. What this means is that the exponent is represented in the field by a number 127 greater than its value. Why? Because it lets us use an integer comparison to tell if one floating point number is larger than another, so long as both are the same sign.

Of course, this is only a benefit if we use the same registers for both integers and floating point numbers, which has become quite rare today. By the time you've moved two operands from floating point registers to integer registers and then performed a comparison, you might as well have just done a floating point compare. Also, an integer compare will fail to give the right answer for comparisons involving NaN's (see later). This really seems to me like a vestige of older formats, with no particularly good reason for its use in a modern computer.

The use of excess-127, instead of excess-128, is also a head-scratcher. Most previous floating point formats using an excess representation for the exponent used an excess that was a power of two; this allowed conversion from exponent representation to exponent value (and vice versa) by simply inverting a bit. I have yet to come across a good explanation for the use of excess-127.
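The ordering property that motivated excess-127 can be observed with a small Python sketch using the standard `struct` module (the helper name `bits` is ours):

```python
import struct

def bits(x):
    """The 32-bit pattern of a single-precision float, as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

# With an excess-127 exponent and sign-magnitude representation, the bit
# patterns of two positive floats order the same way as the values do.
for a, b in [(1.5, 2.5), (0.001, 1000.0), (3.25, 3.5)]:
    assert (bits(a) < bits(b)) == (a < b)
print("bit patterns order like the values")
```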


Using a binary exponent gives us an unexpected benefit. In scientific notation, we always work with a ``normalized'' number: a number whose mantissa is at least 1 and less than 10. If a binary floating point number is normalized, it must have the form 1.f -- the most significant bit must be a 1. Well, if we know what it is, we don't need to explicitly represent it, right? So we just store the fraction part in the word, and put in the ``1.'' when we're actually inside the floating point unit. Sometimes this is called using a ``phantom bit'' or a ``hidden bit.''

Since we're going to fill a 32-bit word, the fraction is 23 bits, but it represents a 24-bit significand.

A note on mantissas: a ``mantissa'' is the fractional part of the logarithm of a number. For instance, if we take log₁₀ 73.2, we get 1.864511. The mantissa is .864511. I've also seen the word used to mean the fractional part of any decimal number -- in the above example, using this definition, the mantissa would be .2. The term is also frequently used to mean the significand of a floating point number; we're going to try to be consistent and use the term ``significand.''

Putting It All Together

The value represented by an IEEE floating point number is

(-1)^s * 1.f * 2^(exp - 127)
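That formula can be applied directly to a 32-bit pattern. This is a sketch of our own (using Python integer bit operations), valid for normalized numbers only:

```python
def decode_single(word):
    """Value of a 32-bit pattern via (-1)^s * 1.f * 2^(exp - 127).
    Handles normalized numbers only (exp field not 0 and not ff)."""
    s = (word >> 31) & 1
    exp = (word >> 23) & 0xff
    f = word & 0x7fffff
    significand = 1 + f / 2**23     # restore the phantom "1."
    return (-1)**s * significand * 2.0**(exp - 127)

print(decode_single(0x40200000))    # 2.5
```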

Floating Point Operations

Let's think a minute about just how we do arithmetic operations in scientific notation:

Addition and subtraction:

  1. Align the exponents
  2. Add (subtract) the significands
  3. Renormalize

Multiplication and division:

  1. Add (subtract) the exponents
  2. Multiply (divide) the significands
  3. Renormalize
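The addition steps above can be sketched on (exponent, significand) pairs. This is our own illustration of the three steps, not the bit-level IEEE algorithm:

```python
def fp_add(a, b):
    """Add two numbers given as (exponent, significand) pairs, following
    the three steps: align exponents, add significands, renormalize."""
    (ea, sa), (eb, sb) = a, b
    # 1. Align the exponents: shift the smaller-exponent significand right
    if ea < eb:
        sa, ea = sa / 2**(eb - ea), eb
    else:
        sb, eb = sb / 2**(ea - eb), ea
    # 2. Add the significands
    s, e = sa + sb, ea
    # 3. Renormalize so that 1 <= s < 2
    while s >= 2:
        s, e = s / 2, e + 1
    while 0 < s < 1:
        s, e = s * 2, e - 1
    return (e, s)

# 2.5 = 1.01 x 2^1 -> (1, 1.25); 4.75 = 1.0011 x 2^2 -> (2, 1.1875)
print(fp_add((1, 1.25), (2, 1.1875)))   # (2, 1.8125), i.e. 7.25
```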

A Complete Example

Let's add 2.5 + 4.75

  1. Convert 2.5 to IEEE floating point format
    1. Convert 2 to binary (use the division method)

      2/2 = 1 remainder 0; 1/2 = 0 remainder 1 (read the remainders last to first)

      So we get 10

    2. Convert .5 to binary (use the multiplication method)

      .5 × 2 = 1.0, so the bit is 1 and the remaining fraction is 0

      So the fraction part is .1

    3. Calculate the exponent and fraction fields

      The number we're converting is 10.1, which is 1.01 × 2^1. The exponent is 127+1 = 128₁₀, or 10000000₂, and the fraction is 010…0₂.

    4. Put it all together
    5. We get 0 10000000 010…0₂, or 40200000₁₆.
  2. Convert 4.75 to IEEE floating point format. Following the same steps, we get 40980000₁₆.
  3. Add the numbers.
    1. Determine the values of all of the fields
      (with the phantom bit restored: 2.5 is 0 10000000 1.0100…0, and 4.75 is 0 10000001 1.0011…0)
    2. Adjust the number with the smaller exponent to make the exponents the same. In the example, the first significand becomes .101.
    3. Add (or subtract, as appropriate) the significands
    4. Renormalize the result. In the case of this example it's already normalized, so we don't need to do anything.
    5. Put the result together: 0 10000001 11010…0₂, or 40e80000₁₆.
    6. One small point to notice here is that I didn't ever have to figure out what the exponents meant; I just had to compare them.
    7. Convert the result back to decimal.
      1. Since the exponent field is 10000001, its value is 129-127=2. So the number's value is 1.1101 × 2^2, or 111.01.
      2. The integer part, 111₂, converts to 7 (by the multiplication method).
      3. The fraction part, .01₂, converts to .25 (by the division method), and we get a final result of 7.25.
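We can sanity-check the hex values in this example with Python's `struct` module (a sketch; the helper name `to_hex` is ours):

```python
import struct

def to_hex(x):
    """Single-precision bit pattern of x, as eight hex digits."""
    return struct.pack(">f", x).hex()

print(to_hex(2.5))          # '40200000'
print(to_hex(4.75))         # '40980000'
print(to_hex(2.5 + 4.75))   # '40e80000'
```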


Let's run an example of multiplication in floating point. We'll use the same two numbers that we used for addition: 40200000 * 40980000.

First, we find the contents of the sign, exponent, and significand fields. As before, this gives us 0 10000000 1.0100…0 for 40200000, and 0 10000001 1.0011…0 for 40980000 (phantom bits restored).
So now we apply the standard multiplication algorithm.

  1. Determine the sign. If the signs of the two operands are the same the result will be positive; if they are different, the result will be negative.
  2. Determine the exponent by adding the operands' exponents together. The only catch here is that we've left the exponents in excess-127 notation; if we just add them, we'll get

    e1 + 127 + e2 + 127 = e1 + e2 + 254
    so we have to add the exponents and subtract 127 (yes, we could have subtracted 127 from the exponent fields, added them, and added 127 to the result. But the answer would have been the same, and we would have gone to some extra work).

    10000000 + 10000001 - 01111111 = 10000010

  3. Multiply the significands using the standard multiplication algorithm

    1.01 × 1.0011 = 1.011111
  4. Renormalize. If we'd wound up with two places to the left of the binary point we would have had to shift one place to the right, and add one to the exponent.

  5. Reconstruct the answer as an IEEE floating point number:

    0 10000010 0111110…0 = 413e0000₁₆
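As a check on the worked multiplication, Python agrees (`to_hex` is our own helper built on the standard `struct` module):

```python
import struct

def to_hex(x):
    """Single-precision bit pattern of x, as eight hex digits."""
    return struct.pack(">f", x).hex()

print(to_hex(2.5 * 4.75))   # '413e0000'
```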


This time let's divide 42340000 / 41100000. We break the numbers up into fields as before: 42340000 is 0 10000100 1.01101…0, and 41100000 is 0 10000010 1.0010…0 (phantom bits restored).
  1. Determine the sign. If the signs of the two operands are the same the result will be positive; if they are different, the result will be negative. In this case the signs are the same, so the result will be positive.
  2. Determine the exponent by subtracting the operands' exponents. This time the excesses will cancel out, so we need to add them back in; we get

    10000100 - 10000010 + 01111111 = 10000001

  3. Perform the standard fractional division operation. Note: check my math here!

    1.01101 ÷ 1.001 = 1.01

    So, our 24-bit significand is 1.010…0

  4. Renormalize. Our result is already normalized, so we don't need to do this.

  5. Reconstruct the answer as an IEEE floating point number:

    0 10000001 010…0 = 40a00000₁₆
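The division example can be checked the same way (`from_hex` and `to_hex` are our own helpers over the standard `struct` module):

```python
import struct

def from_hex(h):
    """Single-precision value of an eight-hex-digit bit pattern."""
    return struct.unpack(">f", bytes.fromhex(h))[0]

def to_hex(x):
    """Single-precision bit pattern of x, as eight hex digits."""
    return struct.pack(">f", x).hex()

print(from_hex("42340000"), from_hex("41100000"))           # 45.0 9.0
print(to_hex(from_hex("42340000") / from_hex("41100000")))  # '40a00000'
```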

Extra Features

IEEE FP uses a normalized representation where possible, and also extends its range at the expense of normalization with denormalized numbers.

The range of the representation is extended (at a cost in precision for really small numbers) with ``denormals.'' These have an exp field of 0, and represent

(-1)^s * 0.f * 2^(-126)

An exp field of ff₁₆ is used for other goodies: if the fraction field is 0, the value is ±infinity; any other fraction is Not a Number (NaN).

So we can express everything possible in the format like this:

sign  exp    fraction  value
1     ff     != 0      NaN
1     ff     0         -Infinity
1     01-fe  anything  -1.f * 2^(exp-127)
1     00     != 0      -0.f * 2^(-126)
1     00     0         -0 (special case of last line)
0     00     0         0 (special case of next line)
0     00     != 0      0.f * 2^(-126)
0     01-fe  anything  1.f * 2^(exp-127)
0     ff     0         +Infinity
0     ff     != 0      NaN
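The table's case analysis is easy to mechanize; this sketch (our own helper name `classify`) looks only at the exp and fraction fields:

```python
def classify(word):
    """Classify a 32-bit single-precision bit pattern per the table above."""
    exp = (word >> 23) & 0xff
    frac = word & 0x7fffff
    if exp == 0xff:
        return "infinity" if frac == 0 else "NaN"
    if exp == 0:
        return "zero" if frac == 0 else "denormal"
    return "normalized"

print(classify(0x7f800000))   # 'infinity'
print(classify(0x7fc00000))   # 'NaN'
print(classify(0x00000001))   # 'denormal'
print(classify(0x40200000))   # 'normalized'
```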

There are actually two classes of NaNs: if the most significant fraction bit is 1, it's a "Quiet NaN" (QNaN), identifying an indeterminate result. QNaNs can be used in arithmetic, and propagate freely (so nothing breaks, but when you're done you get a NaN result). If the most significant fraction bit is a 0, it's a "Signalling NaN" (SNaN), identifying an invalid result. Using an SNaN will raise an exception. SNaNs are handy for initializing variables, so that use before initialization can be recognized.

Operations on the "special cases" are well defined by the IEEE standard. Any operation involving a QNaN results in a QNaN; other operations give results of:

operation                result
n / ±Infinity            0
±Infinity × ±Infinity    ±Infinity
±nonzero / 0             ±Infinity
Infinity + Infinity      Infinity
±0 / ±0                  NaN
Infinity - Infinity      NaN
±Infinity / ±Infinity    NaN
±Infinity × 0            NaN
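Most of these rules can be observed directly in Python, whose floats are IEEE doubles (a sketch; note that Python raises an exception for float division by zero rather than returning infinity, so those rows can't be shown with the / operator):

```python
import math

inf = math.inf
assert 1.0 / inf == 0.0            # n / ±Infinity -> 0
assert inf + inf == inf            # Infinity + Infinity -> Infinity
assert inf * inf == inf            # ±Infinity × ±Infinity -> ±Infinity
assert math.isnan(inf - inf)       # Infinity - Infinity -> NaN
assert math.isnan(inf / inf)       # ±Infinity / ±Infinity -> NaN
assert math.isnan(inf * 0.0)       # ±Infinity × 0 -> NaN
print("special-case rules behave as the table says")
```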

Double Precision

Double precision works just like single precision, except it's 64 bits. The exponent is 11 bits (excess 1023), and the fraction is 52 bits.
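The double-precision layout can be seen with `struct` as well (our own sketch, using the known fact that the double exponent uses excess 1023):

```python
import struct

# 1 sign bit, 11 exponent bits (excess 1023), 52 fraction bits.
# 2.5 = 1.01 x 2^1, so the exponent field is 1023 + 1 = 1024 = 10000000000.
print(struct.pack(">d", 2.5).hex())   # '4004000000000000'
```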

Last modified: Fri Jan 21 08:59:56 MST 2005