IEEE Standard Floating Point

Format

Floating point notation is essentially the same as scientific notation, only translated to binary. There are three fields: the sign (which is the sign of the number), the exponent (some representations have used a separate exponent sign and exponent magnitude; IEEE format does not), and a mantissa (which IEEE format calls a significand).

As we discuss the details of the format, you'll find that the motivations used to select some features seem like they should have driven other features in directions other than what was actually used. It seems inconsistent to me, too...

One other thing to mention is that the IEEE floating point standard actually specifies several different formats, including a ``single-precision'' format that takes 32 bits (i.e., one word in most machines) to represent a value, and a ``double-precision'' format that allows for both greater precision and greater range, but uses 64 bits. We'll be talking about single-precision here.
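To make the size and precision difference concrete, here is a minimal Python sketch using the standard struct module, whose "f" and "d" format codes pack IEEE single and double precision values. The variable names are just for illustration.

```python
import struct

# Pack the same decimal value in both formats (big-endian byte order).
single = struct.pack(">f", 0.1)
double = struct.pack(">d", 0.1)
print(len(single) * 8, len(double) * 8)   # 32 64

# Round-tripping through single precision loses bits that double keeps.
print(struct.unpack(">f", single)[0] == 0.1)   # False
print(struct.unpack(">d", double)[0] == 0.1)   # True
```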

Sign

The one-bit sign is 0 for positive, or 1 for negative. The representation is sign-magnitude.

Exponent

The exponent gives a power of two, rather than a power of ten as in scientific notation (again, there have been floating point formats using a power of eight or sixteen; IEEE uses two).

The eight-bit exponent uses excess 127 notation. What this means is that the exponent is represented in the field by a number 127 greater than its value. Why? Because it lets us use an integer comparison to tell if one floating point number is larger than another, so long as both are the same sign.
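Here's a small Python sketch of the excess 127 idea, assuming we get at the bits through the standard struct module. The helper names float_to_bits and exponent_field are made up for this example.

```python
import struct

def float_to_bits(x):
    """Return the 32-bit single-precision pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def exponent_field(x):
    """Return the raw 8-bit exponent field (the true exponent plus 127)."""
    return (float_to_bits(x) >> 23) & 0xFF

# 2.5 is 1.01 x 2^1 in binary, so the field holds 1 + 127 = 128.
print(exponent_field(2.5))          # 128
print(exponent_field(2.5) - 127)    # 1

# For two positive numbers, comparing the bit patterns as integers
# gives the same answer as comparing the floating point values.
a, b = 2.5, 4.75
print((float_to_bits(a) < float_to_bits(b)) == (a < b))   # True
```

The integer comparison works because the exponent sits above the fraction in the word, and the excess notation makes a larger exponent a larger unsigned field.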

Significand

Using a binary exponent gives us an unexpected benefit. In scientific notation, we always work with a ``normalized'' number: a number whose mantissa is at least 1 and less than 10. If a binary floating point number is normalized, it must have the form 1.f -- the most significant bit must be a 1. Well, if we *know* what it is, we don't need to explicitly represent it, right? So we just store the fraction part in the word, and put in the 1. when we're actually inside the floating point unit. Sometimes this is called using a ``phantom bit'' or a ``hidden bit.''

Since we're going to fill a 32-bit word (1 sign bit, 8 exponent bits, and whatever is left), the stored field is 23 bits, but represents a 24 bit value. The 23 bits actually stored are called the ``fraction,'' and the full 24 bits including the hidden bit are the ``significand.''

The value represented by an IEEE floating point number is

(-1)^s * 1.f * 2^(exp-127)
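As a sanity check on the formula, here is a minimal Python sketch that pulls the three fields out of a 32-bit pattern and evaluates it. The function name decode_normalized is my own, and it only handles the normalized case (exponent field 01 through fe).

```python
def decode_normalized(bits):
    """Evaluate (-1)^s * 1.f * 2^(exp-127) for a 32-bit pattern.

    Only valid when the exponent field is 01 through fe (hex).
    """
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if not 1 <= exp <= 0xFE:
        raise ValueError("not a normalized number")
    significand = 1 + frac / 2**23      # put the hidden 1. back
    return (-1) ** sign * significand * 2.0 ** (exp - 127)

print(decode_normalized(0x40200000))   # 2.5
print(decode_normalized(0xC0980000))   # -4.75
```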

Extra Features

IEEE FP uses a normalized representation where possible, and also extends its range at the expense of normalization with denormalized numbers.

Denormals extend the range of the representation, at a cost in precision for really small numbers. They have an exp field of 0, and represent

(-1)^s * 0.f * 2^(-126)

So we can actually represent numbers as small as 2^(-149), though with less and less precision as they get smaller. This also gives us a way to precisely represent 0, and to use the same representation for floating point 0 as for integer 0.

An exp field of ff (hex) is used for other goodies: if the fraction field is 0, the value is +- infinity; any other fraction is Not A Number (NaN).

So we can express everything possible in the format like this:

Sign  Exponent  Fraction  Represents           Notes
1     ff        != 0      NaN
1     ff        0         -infinity
1     01-fe     anything  -1.f * 2^(exp-127)
1     00        != 0      -0.f * 2^(-126)
1     00        0         -0                   (special case of last line)
0     00        0         0                    (special case of next line)
0     00        != 0      0.f * 2^(-126)
0     01-fe     anything  1.f * 2^(exp-127)
0     ff        0         infinity
0     ff        != 0      NaN
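The table translates almost line for line into a classifier. This is only a sketch; the function name classify and the strings it returns are my own choices, not anything from the standard.

```python
def classify(bits):
    """Say which row of the table a 32-bit pattern falls in."""
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp == 0xFF:
        if frac != 0:
            return "NaN"
        return "-infinity" if sign else "infinity"
    if exp == 0:
        if frac == 0:
            return "-0" if sign else "0"
        return "denormal"
    return "normalized"

print(classify(0x7F800000))   # infinity
print(classify(0x80000000))   # -0
print(classify(0x40200000))   # normalized
```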

IEEE FP has many, many more features -- this is just scratching the surface. For example, it's possible to specify the rounding behavior, the use of guard bits, behavior on overflow and underflow....

A Complete Example

Let's add 2.5 + 4.75

  1. Convert 2.5 to IEEE floating point format
    1. Convert 2 to binary (use the division method)

      Old  Old/2  Bit
      2    1      0
      1    0      1

      So we get 10 (reading the Bit column from the bottom up)

    2. Convert .5 to binary (use the multiplication method)

      Old  Bit  New
      .5   1    0

      So the fraction part is .1

    3. Calculate the exponent and fraction fields

      The number we're converting is 10.1 (binary), which is 1.01 x 2^1. The exponent is 127+1 = 128 (decimal), or 10000000 (binary), and the fraction is 010...0 (binary).

    4. Put it all together. We get 0 10000000 010...0 (binary), or 40200000 (hex).
  2. Convert 4.75 to IEEE floating point format. Following the same steps, we get 40980000 (hex).
  3. Add the numbers.
    1. Determine the values of all of the fields

      2.5
        Sign:      0
        Exponent:  10000000
        Mantissa:  1.01

      4.75
        Sign:      0
        Exponent:  10000001
        Mantissa:  1.0011

      (notice that the leading 1 in each mantissa is the phantom bit)
    2. Adjust the number with the smaller exponent to make the exponents the same. In the example, the first mantissa becomes .101.
    3. Add (or subtract, as appropriate) the mantissas

        0.1010
        1.0011
        ------
        1.1101

    4. Renormalize the result. In the case of this example it's already normalized, so we don't need to do anything.
    5. Put the result together: 0 10000001 11010...0 (binary), or 40e80000 (hex).
    6. One small point to notice here is that I didn't ever have to figure out what the exponents meant; I just had to compare them.
    7. Convert the result back to decimal.
      1. Since the exponent field is 10000001, its value is 129-127=2. So the number's value is 1.1101 x 2^2, or 111.01.
      2. The integer part is found with

        Old  Bit  New
        0    1    1
        1    1    3
        3    1    7

      3. The fraction part is found with

        Old  Bit  New
        0    1    1
        .5   0    0.5
        .25

    and we get a final result of 7.25.
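The addition steps above can be sketched in Python on the raw fields. This is only an illustration under some loud assumptions: both inputs are positive and normalized, bits shifted out during alignment are simply dropped (no rounding or guard bits), and exponent overflow is ignored. The names fields and add_positive are made up for the example.

```python
import struct

def fields(x):
    """Split a positive normalized single into (exponent field, 24-bit mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    exp = (bits >> 23) & 0xFF
    sig = (bits & 0x7FFFFF) | 0x800000   # restore the phantom bit
    return exp, sig

def add_positive(x, y):
    """Add two positive normalized singles; return the result's bit pattern.

    Assumption: no rounding and no overflow handling.
    """
    ex, sx = fields(x)
    ey, sy = fields(y)
    # Adjust the number with the smaller exponent (I only compare the fields).
    if ex < ey:
        sx >>= ey - ex
        ex = ey
    else:
        sy >>= ex - ey
    # Add the mantissas.
    total = sx + sy
    # Renormalize if the sum carried into a 25th bit.
    if total & 0x1000000:
        total >>= 1
        ex += 1
    # Drop the phantom bit and put the result together (sign is 0).
    return (ex << 23) | (total & 0x7FFFFF)

result = add_positive(2.5, 4.75)
print(hex(result))                                          # 0x40e80000
print(struct.unpack(">f", struct.pack(">I", result))[0])    # 7.25
```

Notice that the code, like the hand calculation, never subtracts 127 from either exponent; it only compares the two fields and shifts by their difference.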

Last modified: Wed Jan 16 09:09:48 MST 2002