Testing C++ Programs

Roger Hartley

The need for testing

Assuming your program has compiled successfully with no warning messages, what comes next? The answer is testing, but this is not as simple as it seems. Any program needs testing in order to show that it works in the desired fashion. Most programs need some form of input, and most of these can accept a potentially infinite variety of inputs. Even simple programs like the example below show this behavior. It is a good guess that we cannot test our programs on all inputs, and in fact it can be shown that no amount of testing can prove, in general, that a program works, i.e. that it meets the requirements laid down in the specification, or task description. We can, however, prove that certain kinds of programs are correct (they work correctly), but this is a hard job, involving complex mathematics and special-purpose computer tools. The rest of us need to test our programs by choosing appropriate inputs, observing the behavior of the program with these inputs, and comparing the outputs with the ones expected by examining the specification.

The test suite

Choosing appropriate tests can be an iterative, trial-and-error process, like much of software development. However, by examining the program specification, and following some simple guidelines, an initial set of tests can be assembled. We will only do testing of whole programs here, but often individual modules or even sub-programs (classes, functions, etc.) need to be tested independently of the whole program. The same principles can be used in these cases. Inputs can be classified as follows:

Unique or very special values. For instance a program that expects to read a name might get an empty string; a program that expects a number of inputs of the same kind may get none; a program that stores up to 100 names may be given 101. The program should be able to cater for these, without failing, so they become part of the test suite.
Many inputs have range restrictions, so each range should be tested at its extreme ends and, to be safe, beyond them: names of one character; ages less than zero, or greater than 150; car speeds over 500 mph, and so on.
Some programs do not place restrictions on the input values, but even if they don't, programs need to be tested on typical values. In some cases, they will be sets of typical values, perhaps corresponding to ranges of values, but in any case it is wise to choose several typical inputs, not just one. The problem with testing on a single value is that the program may work perfectly on your chosen typical representative value, but fail on another because it handles the "typical" value specially without your knowledge. Choosing more than one typical value can minimize this problem. These typical inputs also become part of the test suite.
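
The three categories above can be written down as a small table of inputs and expected results. The sketch below is only an illustration: `parseAge`, `AgeCase`, `ageSuite` and `runAgeSuite` are hypothetical names, and the 0 to 150 age range is borrowed from the examples in the text rather than from any real specification.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical function under test: reads an age from text and accepts it
// only if it is a number in the assumed range 0..150.
bool parseAge(const std::string& text, int& age) {
    std::istringstream in(text);
    int value = 0;
    if (!(in >> value)) {
        return false;               // empty or non-numeric input
    }
    if (value < 0 || value > 150) {
        return false;               // outside the assumed range
    }
    age = value;
    return true;
}

struct AgeCase {
    const char* input;      // text handed to parseAge
    bool expected;          // result required by the (assumed) specification
    const char* category;   // which kind of input this case represents
};

const AgeCase ageSuite[] = {
    {"",    false, "special: empty input"},
    {"0",   true,  "extreme: lower end of the range"},
    {"150", true,  "extreme: upper end of the range"},
    {"-1",  false, "beyond the lower end"},
    {"151", false, "beyond the upper end"},
    {"7",   true,  "typical"},
    {"35",  true,  "typical"},
    {"82",  true,  "typical"},
};

// Runs every case; assert stops the program at the first mismatch.
// Returns the number of cases that were run.
int runAgeSuite() {
    const std::size_t count = sizeof ageSuite / sizeof ageSuite[0];
    for (std::size_t i = 0; i < count; ++i) {
        int age = 0;
        const bool accepted = parseAge(ageSuite[i].input, age);
        assert(accepted == ageSuite[i].expected);
    }
    return static_cast<int>(count);
}
```

Calling `runAgeSuite()` from `main` runs the whole table. Keeping the category next to each case makes it easy to see at a glance whether special, extreme and several typical values are all represented.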

Having assembled the suite of inputs (and this could be as small as a handful of values for small programs) the program should be executed using the inputs and its behavior observed, and recorded. This is rather like experimentation. If your program succeeds on all inputs, i.e. its observed behavior matches the expected behavior, as extracted from the specification, then you are done! Mostly, however, and especially for programs of any size, disasters of various sorts will occur.

Types of errors in program execution

There are three main kinds of error that programs show when they run. They are:

The "crash". Sometimes a program violates the constraints that the operating system puts on it. If it oversteps its allotted space, usually by a wayward pointer being dereferenced, then the program will crash, i.e. it will be terminated by the operating system with a nasty message like "segmentation violation" (UNIX) or "General Protection Fault" (Microsoft Windows). Sometimes the machine will just lock solid and be unresponsive to keyboard and/or mouse (DOS).
The program runs, but produces unexpected results. These errors can show themselves as garbled messages or displays of data; or by terminating too early or not at all; or by producing results that are clearly out of the expected range (such as negative speeds, or ages in the thousands).
The program runs and behaves "normally", but some part of the specification is not satisfied. This can happen when an array that should be sorted is not sorted, or when a table is full of zeros, or similar "non-events".

After a test or tests have been run, either the program works (it passed the tests) or it does not, producing one of the above kinds of error. Debugging is the process of isolating the fault in the program source code. This will be the subject of another discussion. Until then debugging can be done by doing a hand simulation with the input data that produced the error. With small programs this technique is sure to work, and is not too unwieldy. However, with medium and large programs it is too time-consuming, and subject to its own errors. In this case a debugging method is needed, and we will look at several appropriate ways later.

Strategies for altering the test suite

When a test fails, the program will behave in one of the three ways mentioned above. The first thing to try is to reproduce the error with different input values. This will tell you whether the error is caused by particular input values or, rather, by a type of input, or, in bad cases, by any input. Second, take the input values that caused the error and do a simulation trace on paper. If the program is not too complex this method will guarantee success. However, in more complex cases we must use more advanced debugging techniques.

Different kinds of error naturally lead to different strategies for testing. However, as stated before, it is an unfortunate fact that no amount of testing (of reasonably sized programs) can prove that the program is correct, i.e. that its output is correct for every size and content of input. Most practical programs have an infinite, or at least very large, number of possible inputs. It is impossible to test them all, and to check that the output is correct in each case. Instead possible inputs must be split into categories, and this breakdown is the subject of a testing strategy. All strategies should follow the "divide and conquer" method. This assumes that an error is produced by a particular input or set of inputs, and that isolating this set will enable the error to be isolated. There are, however, additional methods that will help in this endeavor. Each of the above categories of error leads to a particular strategy and testing method:

When a program crashes it is because some operation has violated the operating system constraints, usually by trying to access memory outside the space allocated to your program. In C (less so in C++) this is almost always (probably 85%) due to pointer problems. Often these errors can only be found by a debugging strategy, involving tracing, and, if possible, single stepping through statements. However, some can be caught by changing the length of input strings, where the program reads character strings as input data. If the error goes away when the strings are shorter, then the culprit is an array whose length is being exceeded. Common causes are a destination array for the string library functions strcpy and strcat that is too small; strings not terminated by the null character; and indexing array elements past the end of the array. As mentioned above, a sure way of uncovering these errors is to do a simulation trace with the input that produced the bad result.
If the program does not crash, but still does not work, then it must be producing unexpected results. Unexpected output can be tested for by stretching the limits on the input data. Typically a program will fail when presented with data that is zero or absent, or too small. It can also be that the data is too big, or too long. A good strategy in these cases is to present the program with the simplest possible case (typically zero or absent data) and then to present it with a complex case involving large values, long strings, or large numbers of items. If the program passes these tests, it may still fail on "typical" input data, and several tests should be run on data that represents the typical case. It is useless to rely on one representative input, and even when several tests have succeeded, suspicion is still appropriate, and more probing is often necessary before feeling happy about the program.
The last kind of error is the hardest of all to catch, because the specification of the program is a moving target; it is very rare to have a complete and exhaustive specification for a program. More usual is an incomplete specification, or problem statement that is difficult to test for. A program may appear to work, but inadequate testing has failed to uncover possible mismatches between the requirements in the program specification and outputs produced by the program. The strategy here is to make the specification clearer by adding constraints or more definite kinds of statements. Every clarification should lead to a test, or group of tests. A typical example is that of a sorting program that fails to mention what should happen in the case of repeated values in the input. Unless tests are made with these repeated values, the sort program cannot be said to be correct. Just because the specification did not mention the possibility does not mean that the program should not handle the repetitions. Very often common sense will lead you to a test that can reveal inadequacies in the specification. Of course if an error is uncovered by this thinking, so much the better.
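
The sorting example can be turned into a concrete test once the specification is clarified. The sketch below assumes the clarified requirement is "the output is in non-decreasing order and contains exactly the input values, repeats included". `sortUnderTest`, `isCorrectSort` and `checkSortWithRepeats` are hypothetical names, and `std::sort` merely stands in for whatever routine is actually being tested.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in for the sorting routine under test; std::sort is used here only
// so that the sketch compiles and runs.
std::vector<int> sortUnderTest(std::vector<int> values) {
    std::sort(values.begin(), values.end());
    return values;
}

// A result is accepted if it is in non-decreasing order and holds exactly
// the same values as the input, repeated values included.
bool isCorrectSort(const std::vector<int>& input,
                   const std::vector<int>& output) {
    return input.size() == output.size() &&
           std::is_sorted(output.begin(), output.end()) &&
           std::is_permutation(input.begin(), input.end(), output.begin());
}

// Tests that follow from the clarification: repeated values, all values
// equal, and the simplest case of no values at all.
void checkSortWithRepeats() {
    const std::vector<int> repeats = {3, 1, 3, 2, 1, 3};
    assert(isCorrectSort(repeats, sortUnderTest(repeats)));

    const std::vector<int> allSame(5, 7);   // five copies of 7
    assert(isCorrectSort(allSame, sortUnderTest(allSame)));

    const std::vector<int> empty;
    assert(isCorrectSort(empty, sortUnderTest(empty)));
}
```

Checking that the output is a permutation of the input, and not merely sorted, is what catches a routine that silently drops or duplicates repeated values.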

An Example: The miles per gallon problem

Problem statement: Print the average gas consumption in miles per gallon given a list of trips, each consisting of the number of miles traveled and the number of gallons used on that trip.

Design: the TDSR diagram shows a while loop, terminated by a sentinel value, that accumulates total miles and total gallons. When the loop terminates the average mpg is calculated and printed.

[The dashed lines indicate more refinement in terms of variables and operations]
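
One possible rendering of this design in C++ is sketched below. Several details are assumptions, since the problem statement does not fix them: the sentinel is taken to be a negative miles value, inputs are read as `int`, a negative gallons value simply ends the input, and the names `averageMpg` and `averageMpgFromText` are invented for the illustration.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Reads "miles gallons" pairs from `in` until the sentinel (assumed here to
// be a negative miles value). On success stores the average in `average`
// and returns true; returns false when no gallons were recorded.
bool averageMpg(std::istream& in, double& average) {
    long long totalMiles = 0;
    long long totalGallons = 0;
    int miles = 0;
    int gallons = 0;

    // The loop also stops on unreadable input or a negative gallons value
    // (a simplification; a real program might report these separately).
    while (in >> miles && miles >= 0 && in >> gallons && gallons >= 0) {
        totalMiles += miles;
        totalGallons += gallons;
    }

    // Only non-negative values were added, so the totals cannot be negative.
    assert(totalMiles >= 0 && totalGallons >= 0);

    if (totalGallons == 0) {
        return false;   // empty list: avoid dividing by zero
    }
    average = static_cast<double>(totalMiles) /
              static_cast<double>(totalGallons);
    return true;
}

// Convenience wrapper so the loop can be tried on a string instead of
// keyboard input.
bool averageMpgFromText(const std::string& text, double& average) {
    std::istringstream in(text);
    return averageMpg(in, average);
}
```

A `main` function would call `averageMpg(std::cin, average)` and print either the average or a message saying that no trips were read. Passing the stream as a parameter is a design choice that lets the same loop be driven from prepared test input.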

Testing this program should follow this sequence:

Try a sentinel alone, i.e. an empty list of trips. Since the totals of miles and gallons will both be zero, a message should tell the user that no trips have been read. At the very least the number of gallons should be tested as greater than zero, since dividing a number by zero (finding the average) will not work in this case.
Try one trip, with a number of miles and gallons that gives an easy average. 100 miles and 10 gallons would do. The average printed should be 10 mpg. Do this several times with different values, such as 1000 and 100, 10000 and 1000 etc. You could try very large numbers, such as 1000000000000, but the results will be unpredictable. You know that integers, even 32 bits long, have a finite, but very large maximum value. C++ will work with very large values, but produce erroneous results due to overflowing the size of the integer variables.
Try a few trips, again with values that you can work with in your head, or on paper (perhaps with a calculator). "A few" means whatever you can handle, but at least two.
If all these tests succeed, then prepare a representative list of input values, either by collecting real data, or by using informed guesswork. In this case, assume you are going on a trip to Washington DC from Las Cruces, and work out some typical distances between stopping places and typical gas consumptions. Again there should be an independent way of checking the program's answer. The calculator will rescue you in this case, but you may have to use other manual methods.
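
The first three steps of this sequence can be recorded as a table-driven suite so that they are easy to rerun after every change. This is a sketch under stated assumptions: the program under test is taken to be a function with the hypothetical signature `MpgProgram`, the sentinel is assumed to be -1, and the expected averages were worked out by hand. The representative data of the last step is left out because it has to be collected or estimated first.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <istream>
#include <sstream>

// Hypothetical interface for the program under test: reads trips from `in`,
// stores the average in `average`, and returns false if no trips were read.
typedef bool (*MpgProgram)(std::istream& in, double& average);

struct MpgCase {
    const char* input;     // text fed to the program (sentinel assumed: -1)
    bool expectAverage;    // false when "no trips" should be reported
    double expected;       // average computed independently, by hand
};

// Runs the sequence of tests and returns the number of failing cases.
int runMpgSuite(MpgProgram program) {
    static const MpgCase cases[] = {
        {"-1",                  false,  0.0},   // sentinel alone
        {"100 10 -1",           true,  10.0},   // one trip, easy average
        {"1000 100 -1",         true,  10.0},
        {"10000 1000 -1",       true,  10.0},
        {"100 10 200 10 -1",    true,  15.0},   // two trips: 300 / 20
        {"120 4 90 3 150 5 -1", true,  30.0},   // three trips: 360 / 12
    };
    int failures = 0;
    for (std::size_t i = 0; i < sizeof cases / sizeof cases[0]; ++i) {
        std::istringstream in(cases[i].input);
        double average = 0.0;
        const bool gotAverage = program(in, average);
        const bool pass =
            gotAverage == cases[i].expectAverage &&
            (!gotAverage || std::fabs(average - cases[i].expected) < 1e-9);
        if (!pass) {
            ++failures;
        }
    }
    return failures;
}

// Convenience for a test driver: stop at once if any case fails.
void requireMpgSuitePasses(MpgProgram program) {
    assert(runMpgSuite(program) == 0);
}
```

Any function that matches `MpgProgram` can be passed to `runMpgSuite`. A result of zero means every observed output matched the expected one, which, as noted earlier, is evidence that the program works but not proof.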