First, remember what digits to the right of the decimal point mean in decimal: they're the "tenths" place, the "hundredths" place, and so on. In binary, they're the "halves" place, the "quarters" place, and so on.
Let's suppose we've got a number like 5/8 that we want to convert to a binary fraction. We start with the binary for 5 (that's 0101), and right-shift it three places to divide by 8. That gives us 0.101.
If we think about that number, we've got a 1 in the halves place, and a 1 in the eighths place: so we've got 1/2 + 1/8, which is 5/8.
Naturally, just as for integers, we've got more "algorithmically friendly" ways to convert numbers; especially for the important case of converting a decimal fraction to binary.
Recall that we converted decimal integers to binary using a division method, and binary integers to decimal using a multiplication method.
Fraction conversion is the other way around: we start with a decimal fraction, and keep multiplying by two. The result of each multiplication is one bit of the result.
Key point in converting decimal to binary: remember that in converting ints from decimal to binary, we worked by continually picking off the least significant bit by asking if the number is odd, and then dividing the number by two to get the next bit. We'll work here by continually asking if the number is less than 1/2, and grabbing a 0 if it is or a 1 if it isn't.
A good way to organize the problem is to use a table, with columns old, bit, and new. We start with the number we're planning to convert in the top row old entry.
Now, we work across each row, and multiply the old entry by 2 to get a new value. Now we take the integer part of the new value and put it in the bit entry, and put the fraction part in the new entry. If the new entry is non-zero, we copy it into the old entry in the next row down and continue (if it is 0, we're done).
The result is the contents of the bit column, with the most significant bit in the top row.
Here's an example: let's convert .375 from decimal to binary.
| old  | bit | new |                              |
|------|-----|-----|------------------------------|
| .375 | 0   | .75 | Multiply .375 by 2, get 0.75 |
| .75  | 1   | .5  | Multiply .75 by 2, get 1.5   |
| .5   | 1   | 0   | Multiply .5 by 2, get 1.0    |
So the result is .011.
You keep it up until you either have converted the number, or you've generated a bit string that is as long as the space you have available to put the result. To see an example of a number where you will have to quit early, try converting 0.2. The result, in binary, repeats.
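The multiply-and-take-the-integer-part procedure is easy to sketch in code. Here's a minimal Python version (the function name and the 23-bit cutoff are my choices, the cutoff matching a single-precision fraction field):

```python
def frac_to_binary(x, max_bits=23):
    """Convert a decimal fraction 0 <= x < 1 to binary by repeatedly
    multiplying by 2; each integer part picked off is the next bit."""
    bits = []
    while x != 0 and len(bits) < max_bits:
        x *= 2
        bit = int(x)      # the "bit" column: integer part of old * 2
        bits.append(str(bit))
        x -= bit          # the "new" column: fraction part carries down
    return "." + "".join(bits)

print(frac_to_binary(0.375))   # .011
print(frac_to_binary(0.2))     # stops at max_bits; 0.2 repeats in binary
```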
Likewise, we can convert binary fractions to decimal by using a division method. We'll build a table like before; this time the columns are old, bit, and new. The top-row old value is 0, and we put one bit in each row of the bits column with the least-significant bit first.
Now, we go across each row, and add the old value to the bit value and divide by two to get the new value. Then we copy the new value to the old value of the next row, and keep it up until we've filled in all the rows. The answer is in the bottom new value.
Here's an example, using .011₂.

| old | bit | (old + bit)/2 |
|-----|-----|---------------|
| 0   | 1   | .5            |
| .5  | 1   | .75           |
| .75 | 0   | .375          |

So the result is .375₁₀.
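The reverse table can also be sketched in a few lines of Python (again, the function name is my own):

```python
def binary_frac_to_decimal(bits):
    """Convert the fraction bits of a binary fraction (e.g. "011" for
    .011) to decimal, working from the least significant bit up:
    new = (old + bit) / 2, exactly as in the table."""
    value = 0.0
    for bit in reversed(bits):   # least-significant bit first
        value = (value + int(bit)) / 2
    return value

print(binary_frac_to_decimal("011"))   # 0.375
```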
As with integer conversions, we can do this by converting groups of bits. But, instead of starting from the right, we start from the left. A way to remember this is that in both cases we start grouping from the radix point; another way is that with integers leading zeros are insignificant but with fractions it's the trailing zeros that are insignificant.
Floating point notation is essentially the same as scientific notation, only translated to binary. There are three fields: the sign (which is the sign of the number), the exponent (some representations have used a separate exponent sign and exponent magnitude; IEEE format does not), and a significand.
As we discuss the details of the format, you'll find that the motivations used to select some features seem like they should have driven other features in directions other than what was actually used. It seems inconsistent to me, too...
One other thing to mention is that the IEEE floating point format actually specifies several different formats: a ``single-precision'' format that takes 32 bits (ie one word in most machines) to represent a value, and a ``double-precision'' format that allows for both greater precision and greater range, but uses 64 bits. We'll be talking about single-precision here.
The one-bit sign is 0 for positive, or 1 for negative. The representation is sign-magnitude.
In integers, we use 2's complement for negative numbers because it makes the arithmetic ``just work;'' we can add two numbers together without regard to whether they are positive or negative, and get the right answer. This won't work for floating point numbers because the exponents need to be manipulated; if we used a 2's complement representation for the entire word we'd have to reconstruct the exponent any time we wanted to add or subtract, so it wouldn't gain us anything; in fact, trying to do arithmetic involving a negative number would involve converting it to positive first.
All the same, using the same negative-representation for integer and floating point has been done: the CDC 6600, which used 1's complement arithmetic for integers, also represented floating point numbers by taking the 1's complement of the entire word. The CDC Cyber 205 left the exponent alone, and represented negatives by taking the 2's complement of the mantissa.
The exponent gives a power of two, rather than a power of ten as in scientific notation (again, there have been floating point formats using a power of eight or sixteen; IEEE uses two).
The eight-bit exponent uses excess 127 notation. What this means is that the exponent is represented in the field by a number 127 greater than its value. Why? Because it lets us use an integer comparison to tell if one floating point number is larger than another, so long as both are the same sign.
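You can see this property directly by comparing the bit patterns as integers. A quick check in Python (using the struct module to get at the underlying bits; `bits` is a helper name I made up):

```python
import struct

def bits(x):
    """Return the 32-bit pattern of x rounded to single precision."""
    (w,) = struct.unpack(">I", struct.pack(">f", x))
    return w

# For positive floats, bit patterns order the same way as the values,
# because the excess-127 exponent sits above the fraction:
print(bits(1.5) < bits(2.5) < bits(1000.0))   # True
```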
Of course, this is only a benefit if we use the same registers for both integers and floating point numbers, which has become quite rare today. By the time you've moved two operands from floating point registers to integer registers and then performed a comparison, you might as well have just done a floating point compare. Also, an integer compare will fail to give the right answer for comparisons involving NaN's (see later). This really seems to me like a vestige of older formats, with no particularly good reason for its use in a modern computer.
The use of excess-127, instead of excess-128, is also a head-scratcher. Most previous floating point formats using an excess representation for the exponent used an excess that was a power of two; this allowed conversion from exponent representation to exponent value (and vice versa) by simply inverting a bit. I have yet to come across a good explanation for the use of excess-127.
Using a binary exponent gives us an unexpected benefit. In scientific notation, we always work with a ``normalized'' number: a number whose mantissa is at least 1 and less than 10. If a binary floating point number is normalized, it must have the form 1.f -- the most significant bit must be a 1. Well, if we know what it is, we don't need to explicitly represent it, right? So we just store the fraction part in the word, and put in the ``1.'' when we're actually inside the floating point unit. Sometimes this is called using a ``phantom bit'' or a ``hidden bit.''
Since we're going to fill a 32-bit word, the fraction is 23 bits, but represents a 24 bit significand.
A note on mantissas: a ``mantissa'' is the fractional part of the logarithm of a number. For instance, if we take log₁₀ 73.2, we get 1.864511. The mantissa is .864511. I've also seen the word used to mean the fractional part of any decimal number -- in the above example, using this definition, the mantissa would be .2. The term is also frequently used to mean the significand of a floating point number; we're going to try to be consistent and use the term ``significand.''
The value represented by an IEEE floating point number is
(-1)^s * 1.f * 2^(exp - 127)
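We can pull the three fields out of a real bit pattern and evaluate that formula directly. A sketch in Python (the `fields` helper is a name of my choosing):

```python
import struct

def fields(x):
    """Split a float (rounded to single precision) into the IEEE
    sign, exponent, and fraction fields."""
    (w,) = struct.unpack(">I", struct.pack(">f", x))
    return w >> 31, (w >> 23) & 0xFF, w & 0x7FFFFF

sign, exp, frac = fields(2.5)
print(sign, exp, hex(frac))   # 0 128 0x200000

# Reconstruct the value: (-1)^s * 1.f * 2^(exp - 127)
value = (-1) ** sign * (1 + frac / 2 ** 23) * 2.0 ** (exp - 127)
print(value)   # 2.5
```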
Let's think a minute about just how we do arithmetic operations in scientific notation:
Addition and subtraction: rewrite the number with the smaller exponent so that both exponents match, add or subtract the mantissas, then renormalize the result.
Multiplication and division: multiply (or divide) the mantissas, add (or subtract) the exponents, then renormalize the result.
Let's add 2.5 + 4.75. First we need both operands in IEEE format. Converting 2.5: the integer part is 2, which converts to 10, and the fraction part is .5, which converts to .1. So the number we're converting is 10.1, which is 1.01 × 2^1. The exponent field is 127+1 = 128₁₀, or 10000000₂, and the fraction is 010…0₂, so 2.5 is represented as 40200000. Converting 4.75 the same way: 4.75 is 100.11, or 1.0011 × 2^2, so the exponent field is 127+2 = 129₁₀ (10000001₂), the fraction is 00110…0₂, and 4.75 is represented as 40980000.

Now we can do the addition. Align the binary points by rewriting the operand with the smaller exponent: 1.01 × 2^1 becomes 0.101 × 2^2. Adding the significands gives 0.101 + 1.0011 = 1.1101, so the sum is 1.1101 × 2^2: sign 0, exponent field 10000001, fraction 11010…0₂, or 40e80000 (which is 7.25).
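Python's struct module will do the decimal-to-IEEE conversion for us, which makes a handy check on hand conversions like the ones above (`to_hex` is a name I'm introducing):

```python
import struct

def to_hex(x):
    """Bit pattern, in hex, of a float rounded to single precision."""
    return struct.pack(">f", x).hex()

print(to_hex(2.5))    # 40200000
print(to_hex(4.75))   # 40980000
```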
Let's run an example of multiplication in floating point. We'll use
the same two numbers that we used for addition:
40200000 * 40980000.
First, we find the contents of the sign, exponent, and significand fields. As before, this gives us signs of 0, exponent fields of 10000000 and 10000001, and significands of 1.01 and 1.0011.
So now we apply the standard multiplication algorithm.
Determine the exponent by adding the operands' exponents together. The only catch here is that we've left the exponents in excess-127 notation; if we just add the fields, we get

(e1 + 127) + (e2 + 127) = e1 + e2 + 254

so we have to add the exponent fields and subtract 127. (Yes, we could have subtracted 127 from each exponent field, added them, and added 127 back to the result. But the answer would have been the same, and we would have gone to extra work.)

10000000 + 10000001 - 01111111 = 10000010
Multiply the significands using the standard multiplication algorithm:

    1.0011
  x   1.01
  --------
     .010011
     .00000
    1.0011
  ----------
    1.011111
Renormalize. If we'd wound up with two places to the left of the binary point we would have had to shift one place to the right, and add one to the exponent.
Reconstruct the answer as an IEEE floating point number:
0 10000010 0111110…0 = 413e0000
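The whole multiplication procedure can be sketched in Python working directly on the bit patterns. This is a toy version (names are mine; it truncates the product instead of rounding, which is fine for this example but not in general):

```python
def fmul(a_hex, b_hex):
    """Multiply two IEEE single-precision bit patterns: add the
    excess-127 exponent fields and subtract 127, multiply the 24-bit
    significands (hidden bit restored), then renormalize."""
    def unpack(h):
        w = int(h, 16)
        return w >> 31, (w >> 23) & 0xFF, (w & 0x7FFFFF) | (1 << 23)

    s1, e1, m1 = unpack(a_hex)
    s2, e2, m2 = unpack(b_hex)
    sign = s1 ^ s2
    exp = e1 + e2 - 127          # remove the doubled excess
    prod = m1 * m2               # binary point sits after bit 46
    if prod & (1 << 47):         # two places left of the point?
        prod >>= 1               # shift right one, bump the exponent
        exp += 1
    frac = (prod >> 23) & 0x7FFFFF   # drop the hidden bit, keep 23 bits
    return "%08x" % ((sign << 31) | (exp << 23) | frac)

print(fmul("40200000", "40980000"))   # 413e0000 (2.5 * 4.75 = 11.875)
```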
This time let's divide 42340000 / 41100000. We break the numbers up into fields as before: the dividend 42340000 has sign 0, exponent field 10000100, and significand 1.01101; the divisor 41100000 has sign 0, exponent field 10000010, and significand 1.001.
Determine the exponent by subtracting the divisor's exponent field from the dividend's. This time the excesses cancel out, so we need to add 127 back in; we get

10000100 - 10000010 + 01111111 = 10000001
Perform the standard fractional division operation:

        1.01
        -------
  1.001)1.01101
        1.001
        -----
         .0100
         .0000
         -----
         .01001
           1001
           ----
           0000
So, our 24-bit significand is 1.010…0
Renormalize. Our result is already normalized, so we don't need to do this.
Reconstruct the answer as an IEEE floating point number:
0 10000001 010…0 = 40a00000
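Division can be sketched the same way (again a toy version with names of my choosing; the quotient is truncated, which happens to be exact here):

```python
def fdiv(a_hex, b_hex):
    """Divide two IEEE single-precision bit patterns: subtract the
    excess-127 exponent fields and add 127 back, then divide the
    24-bit significands (hidden bits restored)."""
    def unpack(h):
        w = int(h, 16)
        return w >> 31, (w >> 23) & 0xFF, (w & 0x7FFFFF) | (1 << 23)

    s1, e1, m1 = unpack(a_hex)
    s2, e2, m2 = unpack(b_hex)
    sign = s1 ^ s2
    exp = e1 - e2 + 127              # the excesses cancel; add one back
    q = (m1 << 23) // m2             # quotient with point after bit 23
    if not q & (1 << 23):            # quotient below 1: renormalize
        q <<= 1
        exp -= 1
    return "%08x" % ((sign << 31) | (exp << 23) | (q & 0x7FFFFF))

print(fdiv("42340000", "41100000"))   # 40a00000 (45.0 / 9.0 = 5.0)
```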
IEEE FP uses a normalized representation where possible, and also extends its range at the expense of normalization with denormalized numbers.
Extend range of representation (at cost in precision of really small numbers) with ``denormals.'' These have an exp field of 0, and represent
(-1)^s * 0.f * 2^-126
An exp field of ff is used for other goodies: if the fraction field is 0, the value is ±infinity; any other fraction means Not a Number (NaN).
So we can express everything possible in the format like this:

| sign | exp field | fraction | value                                  |
|------|-----------|----------|----------------------------------------|
| 1    | 01-fe     | anything | -1.f * 2^(exp-127)                     |
| 1    | 00        | != 0     | -0.f * 2^-126                          |
| 1    | 00        | 0        | -0 (special case of the previous line) |
| 0    | 00        | 0        | 0 (special case of the next line)      |
| 0    | 00        | != 0     | 0.f * 2^-126                           |
| 0    | 01-fe     | anything | 1.f * 2^(exp-127)                      |
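The table translates directly into a classifier over bit patterns. A sketch in Python (`classify` is my own name):

```python
def classify(w):
    """Name the class of a 32-bit IEEE single-precision pattern,
    following the table above."""
    sign = "-" if w >> 31 else "+"
    exp = (w >> 23) & 0xFF
    frac = w & 0x7FFFFF
    if exp == 0x00:
        return sign + ("zero" if frac == 0 else "denormal")
    if exp == 0xFF:
        return sign + "infinity" if frac == 0 else "NaN"
    return sign + "normal"

print(classify(0x40200000))   # +normal (2.5)
print(classify(0x00000001))   # +denormal (smallest positive value)
print(classify(0xFF800000))   # -infinity
print(classify(0x7FC00000))   # NaN
print(classify(0x80000000))   # -zero
```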
There are actually two classes of NaNs: if the most significant fraction bit is 1, it's a "Quiet NaN" (QNaN), identifying an indeterminate result. QNaNs can be used in arithmetic, and propagate freely (so nothing breaks, but when you're done you get a NaN result). If the most significant fraction bit is a 0, it's a "Signalling NaN" (SNaN), identifying an invalid result. Using an SNaN will raise an exception. This makes SNaNs handy for initializing variables, so that use before assignment can be recognized.
Operations on the "special cases" are well defined by the IEEE standard. Any operation involving a QNaN results in a QNaN; other operations give results of:
| operation               | result    |
|-------------------------|-----------|
| n / ±Infinity           | 0         |
| ±Infinity × ±Infinity   | ±Infinity |
| ±nonzero / 0            | ±Infinity |
| Infinity + Infinity     | Infinity  |
| ±0 / ±0                 | NaN       |
| Infinity - Infinity     | NaN       |
| ±Infinity / ±Infinity   | NaN       |
| ±Infinity × 0           | NaN       |
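Most of these rules can be demonstrated with ordinary Python floats (one caveat: Python raises ZeroDivisionError for float division by zero rather than returning the IEEE infinity; C or numpy would give infinity directly):

```python
import math

inf = math.inf
print(1.0 / inf)     # 0.0
print(inf * inf)     # inf
print(inf + inf)     # inf
print(inf - inf)     # nan
print(inf / inf)     # nan
print(inf * 0.0)     # nan
```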
Double precision works just like single precision, except it's 64 bits: the exponent field is 11 bits (excess 1023), and the fraction is 52 bits.