Error Detection, Correction, and Related Topics

The basic problem we have to resolve is that memory and communications technology aren't totally reliable; we have to expect, and be ready to deal with, errors in the hardware. This document describes two very different technologies for detecting, and maybe correcting, errors that occur in data storage and transmission.

The first approach to be described is more appropriate for environments like memory: a relatively small amount of data is fetched in parallel. This approach, called "error detecting and correcting codes," is based on defining a distance between two bit strings in terms of the number of bits that have to change to get from the first string to the second. Extra bits are added to each string, which are set so that some minimum number of bits must change to get from one valid string to another. If the received string isn't valid, it is assumed that the correct string is the one "closest" to the received string.

The second approach is more appropriate to environments in which relatively large amounts of data are to be transferred, but they are transferred serially. In this approach a "signature" is appended to the data string; the number of bits in the signature is much less than the number of bits that would be required to do an error correcting code. This approach will lead to adding checksums or cyclic redundancy checks to the string.

Error Detecting and Correcting Codes

R. W. Hamming wrote the paper that both opened and closed this field in 1950. His interest was in providing a means of self-checking in computers, which were just being developed at the time. The paper appeared in the Bell System Technical Journal, April, 1950. It is definitely worth tracking down in the library and reading.

Bit Strings as Addresses in Binary Hypercubes

The best starting point for understanding ECC codes is to consider bit strings as addresses in a binary hypercube. A hypercube is a generalization of a cube to various dimensions; we're probably most familiar with the notion of a four-dimensional hypercube. Here's a picture of binary hypercubes for several different dimensionalities:

[Figure: binary hypercubes of several dimensionalities]

Each of them was created by making two copies of the one to its left and connecting corresponding vertices.

We can assign each vertex in a hypercube a location in a coordinate space determined by the dimensionality of the hypercube.

This can continue through as many dimensions as you want.

The Hamming distance between two bit strings is the number of bits you have to change to convert one to the other: this is the same as the number of edges you have to traverse in a binary hypercube to get from one of the vertices to the other. The basic idea of an error correcting code is to use extra bits to increase the dimensionality of the hypercube, and make sure the Hamming distance between any two valid points is greater than one.
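As a minimal sketch of the definition above, here is one way to compute the Hamming distance in Python. The function name hamming_distance and the choice to represent bit strings as text are assumptions made for illustration.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the bit positions in which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Two vertices of the cube that are two edges apart.
print(hamming_distance("000", "011"))  # prints 2
```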

So... now we have to think about how to increase the Hamming distance between valid strings.

Parity

The simplest way is to add a parity bit. Suppose we have a three-bit word (so the bit strings define points in a cube). If we add a fourth bit, we can decree that any time we want to switch a bit in the original three-bit string, we also have to switch the parity bit. If we start with 000 in the left cube, so the full string is 0000, changing any one of the original three bits requires us to change to the other cube: 1001, 1010, and 1100. Now if we change a second bit, we have to move back to the left cube: 0011, 0101, 0110. And if we change the third bit, we move back to the right cube: 1111.

So, there is a Hamming distance of two between any two valid strings. If we get a one-bit error, we know it is an error because it's on one of the invalid vertices.

This can be computed by counting the number of 1's, and making sure it's always even (so this is called even parity). We could have selected exactly the opposite set of vertices as the valid ones, which would have given us odd parity. We picked even parity because we'll be using it in the next step.
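A short sketch of the even parity calculation, assuming (as in the cube example) that the parity bit is written in front of the data bits. The function names are hypothetical.

```python
def add_even_parity(data: str) -> str:
    """Prefix the data with a parity bit so the total number of 1's is even."""
    parity = data.count("1") % 2
    return str(parity) + data


def parity_ok(word: str) -> bool:
    """A received word is valid only if it contains an even number of 1's."""
    return word.count("1") % 2 == 0


print(add_even_parity("001"))  # prints 1001
print(add_even_parity("011"))  # prints 0011
print(parity_ok("1011"))       # prints False: a one-bit error is detected
```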

Error Correction

The weakness of the parity scheme is that we can tell we had an error, but we can't know which bit is wrong. If we use enough extra bits, we can tell not only that a bit is wrong, but which one it is. Since we need to have enough check bits to spot both an error in the data and in the check bits themselves (after all, they aren't going to be perfect either), we need roughly (log2 n) + 1 check bits for n data bits (Hamming derives this result much, much more carefully in his paper). The basic idea in what follows is that we'll define one subset of the bits per check bit, where each subset contains roughly half of all the bits, and compute the even parity of each subset. If we have an error, we'll be able to tell which bit has the error because it will be uniquely determined by the set of subsets that turn up with bad parity.

(Note: in Hamming's paper, the following appears just as unmotivated as it does here. I really have no idea how he derived this technique; he does show that it actually establishes the needed distance between valid bit strings.) We'll put the check bits in bit positions which are powers of two, and intersperse the data bits between them. Here's what it looks like if we have eight data bits:
Bit Position   Position Number   Check Bit   Data Bit
     12             1100                        M8
     11             1011                        M7
     10             1010                        M6
      9             1001                        M5
      8             1000            C8
      7             0111                        M4
      6             0110                        M3
      5             0101                        M2
      4             0100            C4
      3             0011                        M1
      2             0010            C2
      1             0001            C1

Here's how we find the subsets: the data bit positions which contain a 1 in the bit corresponding to a check bit number are used in calculating that check bit. So, looking at the table, data bits M1, M2, M4, M5, and M7 are in rows 3, 5, 7, 9, and 11; those row numbers all contain 2^0 (their lowest-order bit is 1), so those data bits are used in calculating check bit C1. We simply set C1 to the parity of its data bits.

Looking at all the check bits, we get:
C1 = M1 ^ M2 ^ M4 ^ M5 ^ M7
C2 = M1 ^ M3 ^ M4 ^ M6 ^ M7
C4 = M2 ^ M3 ^ M4 ^ M8
C8 = M5 ^ M6 ^ M7 ^ M8

Now: if we get a single-bit error, the parity will be wrong for every subset that contains that bit. Adding up the numbers of the check bits that turn up wrong gives the bit number of the error!
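The scheme can be sketched in Python as follows. This is only an illustration for the eight-data-bit layout in the table; the names hamming_encode and syndrome, and the choice of a list whose index is the bit position (index 0 unused), are assumptions.

```python
CHECK_POSITIONS = (1, 2, 4, 8)  # powers of two hold the check bits
WORD_LENGTH = 12                # 8 data bits + 4 check bits


def hamming_encode(data_bits):
    """Place [M1, ..., M8] in the non-power-of-two positions and fill in C1, C2, C4, C8."""
    if len(data_bits) != 8:
        raise ValueError("this sketch handles exactly eight data bits")
    word = [0] * (WORD_LENGTH + 1)  # word[pos] is the bit at position pos
    data = iter(data_bits)
    for pos in range(1, WORD_LENGTH + 1):
        if pos not in CHECK_POSITIONS:
            word[pos] = next(data)
    for check in CHECK_POSITIONS:
        # A check bit covers every other position whose number contains it.
        for pos in range(1, WORD_LENGTH + 1):
            if pos != check and pos & check:
                word[check] ^= word[pos]
    return word


def syndrome(word):
    """Add up the numbers of the check bits whose subset has bad (odd) parity."""
    total = 0
    for check in CHECK_POSITIONS:
        parity = 0
        for pos in range(1, WORD_LENGTH + 1):
            if pos & check:
                parity ^= word[pos]
        if parity:
            total += check
    return total


word = hamming_encode([1, 0, 1, 1, 0, 0, 1, 0])
word[6] ^= 1            # corrupt bit 6 (data bit M3)
print(syndrome(word))   # prints 6, the position of the bad bit
```

A syndrome of 0 means every subset has good parity; any other value names the position to flip back.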

We can combine ECC with parity. We do this by taking the parity over all the bits in the word (including the check bits). In our bit numbering scheme, we consider the parity bit to be bit 0000.

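So, when we look at the parity and check bits, we get the following results:

Parity    Check Bits     Meaning
correct   all correct    no error
wrong     some wrong     single-bit error; the check bits give its position, so it can be corrected
wrong     all correct    the parity bit itself is in error
correct   some wrong     double-bit error; it is detected, but it cannot be corrected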
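The four possible outcomes of combining the overall parity with the check bits can be sketched like this. The function name classify and the 13-element list layout (index 0 is the parity bit, indices 1 to 12 follow the table above) are assumptions. The sketch also uses a shortcut: exclusive-oring together the position numbers of all the 1 bits gives the same value as adding up the failing check bits.

```python
def classify(word):
    """word[0] is the overall parity bit; word[1]..word[12] use the Hamming layout."""
    syndrome = 0
    for position in range(1, len(word)):
        if word[position]:
            syndrome ^= position          # equivalent to summing the failing check bits
    overall_parity_ok = sum(word) % 2 == 0
    if syndrome == 0 and overall_parity_ok:
        return "no error"
    if syndrome != 0 and not overall_parity_ok:
        return "single-bit error at position {}".format(syndrome)
    if syndrome == 0:
        return "error in the parity bit itself"
    return "double-bit error (detected, not correctable)"


valid = [0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0]
damaged = list(valid)
damaged[3] ^= 1
damaged[9] ^= 1
print(classify(valid))    # prints no error
print(classify(damaged))  # prints double-bit error (detected, not correctable)
```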

Checksums and Cyclic Redundancy Checks

This technique has seen a lot more development, by a lot more authors, than error correcting codes have. The basic technique as described here appeared in a paper by Peterson and Brown, which appeared in the January, 1961 issue of the Proceedings of the Institute of Radio Engineers (the IRE was, of course, a predecessor organization to the IEEE). Much has been done since on selecting good CRCs.

Once again, we'll start by defining a simple technique, and then define a more complex one that works better.

Checksums

Suppose we have a fairly long message, which can reasonably be divided into shorter words (a 128 byte message, for instance). We can introduce an accumulator with the same width as a word (one byte, for instance), and as each word comes in, add it to the accumulator. When the last word has been added, the contents of the accumulator are appended to the message (as a 129th byte, in this case). The added word is called a checksum.

Now, the receiver performs the same operation, and checks the checksum. If the checksums agree, we assume the message was sent without error.
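Here is a minimal sketch of the sender and receiver sides, assuming one-byte words and an accumulator that simply discards any carry out of its eight bits. The function names are hypothetical.

```python
def checksum8(message: bytes) -> int:
    """Add every byte into a one-byte accumulator, discarding carries."""
    accumulator = 0
    for byte in message:
        accumulator = (accumulator + byte) & 0xFF
    return accumulator


def append_checksum(message: bytes) -> bytes:
    """Sender: append the accumulator contents as one extra byte."""
    return message + bytes([checksum8(message)])


def verify(frame: bytes) -> bool:
    """Receiver: recompute the checksum over the data and compare."""
    return checksum8(frame[:-1]) == frame[-1]


frame = append_checksum(bytes(range(128)))  # a 128 byte message
print(len(frame), verify(frame))            # prints 129 True
```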

A related approach is to do a bitwise exclusive-or of the new word with the accumulator instead of performing an actual addition. If we do this, we calculate a vertical parity on the data. Notice that in the special case of a one-bit word, this is equivalent to calculating the parity of the buffer!

Performing a vertical parity has two advantages over a real checksum: it can be performed with less hardware if the data is serial, and it will lead us into performing a CRC.

To see how a vertical parity can be performed with less hardware than a checksum, take a look at the next figure:

[Figure: 8-bit shift register with an exclusive-or gate]

This figure shows an eight bit shift register and an exclusive-or gate. Initially, the shift register is filled with 0's. As each bit is put into it, the new bit is exclusive-ored with the contents of the eighth cell in the register. When the entire message has been passed through the shift register, it contains the vertical parity.

Here's an example of passing a 32 bit message through the unit:

00000000 11010110101010010100011101101010
(as we start, the shift register is filled with 0's)
00000001 1010110101010010100011101101010
00000011 010110101010010100011101101010   
00000110 10110101010010100011101101010   
00001101 0110101010010100011101101010   
00011010 110101010010100011101101010   
00110101 10101010010100011101101010   
01101011 0101010010100011101101010   
11010110 101010010100011101101010   
(at this point, the shift register contains the first byte of the message)
10101100 01010010100011101101010   
01011001 1010010100011101101010   
10110011 010010100011101101010   
01100111 10010100011101101010   
11001111 0010100011101101010   
10011111 010100011101101010   
00111111 10100011101101010   
01111111 0100011101101010   
(it now contains the vertical parity of the first two bytes of the message)
11111110 100011101101010
11111100 00011101101010   
11111001 0011101101010   
11110011 011101101010   
11100111 11101101010   
11001110 1101101010   
10011100 101101010   
00111000 01101010   
(first three bytes)
01110000 1101010   
11100001 101010   
11000010 01010   
10000101 1010   
00001010 010   
00010100 10   
00101001 0   
01010010
(and the vertical parity of the whole 32 bit message)
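The trace above can be reproduced with a small simulation of the shift register. This is a sketch: the function name and the choice to model the register as a list, with the leftmost element being the cell that feeds the exclusive-or gate, are assumptions.

```python
def vertical_parity_serial(bits: str, width: int = 8) -> str:
    """Shift a serial bit string through the register, one bit at a time."""
    register = [0] * width                   # initially filled with 0's
    for ch in bits:
        new_bit = int(ch) ^ register[0]      # xor with the cell about to fall off
        register = register[1:] + [new_bit]  # shift left, insert on the right
    return "".join(str(b) for b in register)


message = "11010110101010010100011101101010"
print(vertical_parity_serial(message))  # prints 01010010
```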

Notice that a checksum or a vertical parity is much more efficient than ECC (in the sense that it doesn't need as many added bits), but it isn't capable of correcting errors.

The problem with checksums is that a 1-bit error turns into a 1-bit change in the code: with vertical parity, for instance, a second error in the same bit position of another word flips that bit right back. If you have a burst of noise, the odds are far too good that you'll end up with something that still looks correct, even though it isn't. The next approach, the CRC, "smears" the results of the parity calculations through the signature, reducing the likelihood of that happening.
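To make the weakness concrete, here is a small sketch using the same four bytes as the 32 bit example message. Two errors in the same bit position of different bytes cancel, so the vertical parity doesn't notice them. The function name is an assumption.

```python
from functools import reduce


def vertical_parity(data: bytes) -> int:
    """Bitwise exclusive-or of all the bytes in the message."""
    return reduce(lambda a, b: a ^ b, data, 0)


message = bytes([0xD6, 0xA9, 0x47, 0x6A])
# Flip the same bit (0x10) in the first two bytes.
corrupted = bytes([0xD6 ^ 0x10, 0xA9 ^ 0x10, 0x47, 0x6A])

print(format(vertical_parity(message), "08b"))                 # prints 01010010
print(vertical_parity(message) == vertical_parity(corrupted))  # prints True
```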

Mathematical Digression: Modulo-2 Arithmetic

Taking a bitwise exclusive-or in place of performing an addition is an example of "Modulo-2 Arithmetic," which is one form of "polynomial arithmetic." I've seen one author call it "CRC arithmetic."

Modulo-2 arithmetic is an arithmetic scheme; like most of the oddities that mathematicians like to study it seems completely useless to a non-mathematician at first glance but turns out to have some very practical applications. In this case, the practical application is in developing CRC checks.

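The basic idea of modulo-2 arithmetic is just that we are working in binary, but we don't have a carry in addition or a borrow in subtraction. This means:

0 + 0 = 0     0 - 0 = 0
0 + 1 = 1     0 - 1 = 1
1 + 0 = 1     1 - 0 = 1
1 + 1 = 0     1 - 1 = 0

Addition and subtraction therefore give the same result, and both are the same as a bitwise exclusive-or.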

It'll probably help to show examples of modulo-2 multiplication and division:

Multiplication

   1101
   0110
   ----
   0000
  11010
 110100
0000000
-------
0101110

Division

        1101
    --------
0110)0101110
     0110
     ----
      0111
      0110
      ----
       0011
       0000
       ----
        0110
        0110
        ----
        0000

Notice that the first subtraction is possible in modulo-2 arithmetic, while it wouldn't be possible in normal arithmetic.
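Both worked examples can be checked with a short sketch that stores each bit string as a Python integer. The function names are assumptions; the only operations used are shifts and exclusive-ors.

```python
def mod2_multiply(a: int, b: int) -> int:
    """Multiply without carries: xor together shifted copies of a."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def mod2_divmod(dividend: int, divisor: int):
    """Long division without borrows; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("modulo-2 division by zero")
    quotient = 0
    divisor_length = divisor.bit_length()
    while dividend.bit_length() >= divisor_length:
        shift = dividend.bit_length() - divisor_length
        quotient ^= 1 << shift
        dividend ^= divisor << shift   # subtraction is just exclusive-or
    return quotient, dividend


print(format(mod2_multiply(0b1101, 0b0110), "07b"))  # prints 0101110
print(mod2_divmod(0b0101110, 0b0110))                # prints (13, 0); 13 is 1101
```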

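One last thing to say here: most of the time, "modulo-2 addition" of two numbers means adding them and reducing the whole sum modulo 2, which gives an answer of 0 or 1. Here, instead, we treat each bit as a coefficient of a polynomial and perform the arithmetic modulo 2 on each coefficient separately, so the answer has as many bits as the operands. Easy to get confused....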

Cyclic Redundancy Checks

I'm going to be following some of Peterson & Brown's notation here...

To calculate a CRC, we establish some predefined n-k+1 bit number P (called the Polynomial, for reasons relating to the fact that modulo-2 arithmetic is a special case of polynomial arithmetic). Here k is the number of bits in the message and n is the total number of bits sent, so n-k bits are added. Now we append n-k 0's to our message, and divide the result by P using modulo-2 arithmetic. The remainder is called the Frame Check Sequence (FCS). Now we ship off the message with the remainder appended in place of the 0's. The receiver can either recompute the FCS and see if it gets the same answer, or it can just divide the whole message (including the FCS) by P and see if it gets a remainder of 0!

As an example, let's set a 5-bit polynomial of 11001, and compute the CRC of a 16 bit message:

     ---------------------
11001)10011101010101100000
      11001
      -----
       1010101010101100000
       11001
        ----
        110001010101100000
        11001
         ----
         00011010101100000
            11001
             ----
             0011101100000
               11001
                ----
                 100100000
                 11001
                 -----
                  10110000
                  11001
                  -----
                   1111000
                   11001
                   -----
                     11100
                     11001
                     -----
                      0101

Notice that when I did the division, I didn't bother to keep track of the quotient; we don't care about the quotient. Our only goal here is to get the remainder (0101), which is the FCS.
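The long division above can be sketched in Python as follows, working directly on bit strings so that it mirrors the hand calculation. The function name mod2_remainder is an assumption, and the sketch assumes the polynomial's leading bit is 1.

```python
def mod2_remainder(bits: str, polynomial: str) -> str:
    """Remainder of modulo-2 long division of bits by polynomial; the quotient is ignored."""
    check_length = len(polynomial) - 1
    work = [int(b) for b in bits]
    poly = [int(b) for b in polynomial]
    for i in range(len(work) - check_length):
        if work[i]:                      # a leading 1: subtract (xor) the polynomial
            for j, p in enumerate(poly):
                work[i + j] ^= p
    return "".join(str(b) for b in work[-check_length:])


message = "1001110101010110"
polynomial = "11001"

# Sender: append n-k = 4 zeros and keep the remainder as the FCS.
fcs = mod2_remainder(message + "0000", polynomial)
print(fcs)                                        # prints 0101

# Receiver: dividing message plus FCS leaves a remainder of 0.
print(mod2_remainder(message + fcs, polynomial))  # prints 0000
```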

CRCs can actually be computed in hardware using a shift register and some number of exclusive-or gates (sounds a bit like the vertical parity calculation, doesn't it?).

The key insight is that we can perform a subtraction any time there is a 1 in the bit that lines up with the most significant bit of the polynomial, and we can perform that subtraction by performing an exclusive-or of the bits corresponding to 1's in all the other places of the polynomial. This lets us implement the CRC calculation by using a shift register similar to the one for vertical parity.

You can see how it's done by comparing the division we performed above to the circuit in the next figure. The figure shows a shift register; the string to be checked is inserted from the right. Whenever a "1" exits the left side of the shift register, it means there is a 1 in the most significant bit of the part of the dividend we're working with; since we're working in modulo-2 arithmetic, this means we can do a subtraction. What this works out to is:

  1. The most significant bit will be xored away, so it falls off to the left.
  2. For every other bit with a "1" in the divisor, perform an exclusive-or with the corresponding bit in the number being checked.
  3. For bits with a "0" in the divisor, do nothing.

The figure below attempts to show this for the example CRC polynomial. Each of the square boxes is a position in the shift register, where a value can be stored. Every round box is a position where we may or may not perform an exclusive-or, depending on the polynomial we're using. You can see the value of the CRC polynomial written above the round boxes.

[Figure: shift register for the 11001 CRC polynomial]
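The three steps above can be simulated in software. This is a sketch, assuming a register with one cell for each bit of the polynomial except the most significant one (four cells for 11001); the function name is hypothetical.

```python
def crc_shift_register(bits: str, polynomial: str) -> str:
    """Feed bits in from the right; xor in the polynomial whenever a 1 exits the left."""
    width = len(polynomial) - 1          # the most significant bit is never stored
    taps = int(polynomial[1:], 2)        # the remaining bits say where to xor
    mask = (1 << width) - 1
    register = 0
    for ch in bits:
        exiting = (register >> (width - 1)) & 1   # bit about to fall off to the left
        register = ((register << 1) & mask) | int(ch)
        if exiting:
            register ^= taps             # steps 2 and 3: xor only where the divisor has a 1
    return format(register, "0{}b".format(width))


message = "1001110101010110"
print(crc_shift_register(message + "0000", "11001"))  # prints 0101
print(crc_shift_register(message + "0101", "11001"))  # prints 0000
```

The register ends up holding the same remainder as the long division, without ever building the quotient.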

I keep calling this a polynomial, and writing it as a binary number. Frequently, you'll find a CRC polynomial written in polynomial form; the one we've been using would be written as x^4 + x^3 + x^0.
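Converting between the two notations is mechanical: each exponent in the polynomial form marks a 1 in the binary number. A minimal sketch, with hypothetical function names:

```python
def polynomial_to_bits(exponents) -> str:
    """Set one bit for every exponent that appears in the polynomial."""
    value = 0
    for exponent in exponents:
        value |= 1 << exponent
    return format(value, "b")


def bits_to_exponents(bits: str):
    """List the exponents of the terms whose coefficient is 1, highest first."""
    degree = len(bits) - 1
    return [degree - i for i, b in enumerate(bits) if b == "1"]


print(polynomial_to_bits([4, 3, 0]))  # prints 11001
print(bits_to_exponents("11001"))     # prints [4, 3, 0]
```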

So, just a little bit more. There is quite a bit of theory behind choosing a "good" CRC polynomial; the choice of polynomial can be tuned to make sure that any burst of some given length can be caught.

Properties of Cyclic Redundancy Checks

The paper lists a few properties of CRCs, which deserve mention:

There are a few "classic" CRC polynomials of given lengths which are so well established that they've been given names.

Name        Definition
            As Polynomial                       As Number
CRC12       X^12 + X^11 + X^3 + X^2 + X + 1     1100000001111
CRC16       X^16 + X^15 + X^2 + 1               11000000000000101
CRC-CCITT   X^16 + X^12 + X^5 + 1               10001000000100001
CRC32       X^32 + X^26 + X^23 + X^22 + X^16 + X^12 + X^11 + X^10 + X^8 + X^7 + X^5 + X^4 + X^2 + X + 1
                                                100000100110000010001110110110111


Last modified: Mon May 3 08:18:56 MDT 2010