Testing C++ Programs
Roger Hartley
The need for testing
Assuming your program has compiled successfully with no warning messages,
what comes next? The answer is testing, but this is not as simple as it seems.
Any program needs testing in order to show that it works in the desired fashion.
Most programs need some form of input, and most of these can accept a
potentially infinite variety of inputs. Even simple programs like the example below
show this behavior. It is a good guess that we cannot test our programs on all
inputs, and in fact it can be proved that no amount of testing can prove, in
general, that a program works, i.e. that it meets the requirements laid
down in the specification, or task description. We can, however, prove that
certain kinds of programs are correct (they work correctly), but this is a hard
job, involving complex mathematics and special-purpose computer tools. The rest
of us need to test our programs by choosing appropriate inputs,
observing the behavior of the program with these inputs, and comparing the
outputs with the ones expected by examining the specification.
The test suite
Choosing appropriate tests can be an iterative, trial-and-error process,
like much of software development. However, by examining the program
specification, and following some simple guidelines, an initial set of tests can
be assembled. We will only do testing of whole programs here, but often
individual modules or even sub-programs (classes, functions, etc.) need to be
tested independently of the whole program. The same principles can be used in
these cases. Inputs can be classified as follows:
- Unique or very special values. For instance, a program that expects to read
a name might get an empty string; a program that expects a number of inputs of
the same kind may get none; a program that stores up to 100 names may be given
101. The program should be able to cater for these, without failing, so they
become part of the test suite.
- Many inputs have range restrictions, so the extreme ends of the range, and
values just beyond them, should be tested: names of one character; ages less
than zero, or greater than 150; car speeds over 500 mph, and so on. Testing
beyond the extremes, as well as at them, is the safe course.
- Some programs do not place restrictions on the input values, but even if
they don't, programs need to be tested on typical values. In some cases, they
will be sets of typical values, perhaps corresponding to ranges of values, but
in any case it is wise to choose several typical inputs, not just one. The
problem with testing on a single value is that the program may work perfectly on
your chosen typical representative value, but fail on another because it handles
the "typical" value specially without your knowledge. Choosing more
than one typical value can minimize this problem. These typical inputs also
become part of the test suite.
Having assembled the suite of inputs (and this could be as small as a
handful of values for small programs) the program should be executed using the
inputs and its behavior observed, and recorded. This is rather like
experimentation. If your program succeeds on all inputs, i.e. its observed
behavior matches the expected behavior, as extracted from the specification,
then you are done! Mostly, however, and especially for programs of any size,
disasters of various sorts will occur.
Types of errors in program execution
There are three main kinds of error that programs show when they run.
They are:
- The "crash". Sometimes a program violates the constraints that
the operating system puts on it. If it oversteps its allotted space, usually by
a wayward pointer being dereferenced, then the program will crash, i.e. it will
be terminated by the operating system with a nasty message like "segmentation
violation" (UNIX) or "General Protection Fault" (Microsoft
Windows). Sometimes the machine will just lock solid and be unresponsive to
keyboard and/or mouse (DOS).
- The program runs, but produces unexpected results. These errors can show
themselves as garbled messages or displays of data; or by terminating too early
or not at all; or by producing results that are clearly out of the expected
range (such as negative speeds, or ages in the thousands).
- The program runs and behaves "normally", but some part of the
specification is not satisfied. This can happen when an array that should be
sorted is not sorted, or when a table is full of zeros, or similar "non-events".
After a test or tests have been run, either the program works (it passed the
tests) or it does not, producing one of the above kinds of error. Debugging
is the process of isolating the fault in the program source code. This will be
the subject of another discussion. Until then debugging can be done by doing
a hand simulation with the input data that produced the error. With small
programs this technique is sure to work, and is not too unwieldy. However, with
medium and large programs it is too time-consuming, and subject to its own
errors. In this case a debugging method is needed, and we will look at several
appropriate ways later.
Strategies for altering the test suite
When a test fails, the program will behave in one of the three ways
mentioned above. The first thing to try is to reproduce the error with
different input values. This will tell you whether the error is caused by
particular input values or rather by a type of input, or, in bad cases, by
any input. Second, take the input values
that caused the error and do a simulation trace on paper. If the program is not
too complex this method will guarantee success. However, in more complex cases
we must use more advanced debugging techniques.
Different kinds of error naturally lead to different strategies for testing.
However, as stated before, it is an unfortunate fact that no amount of testing
(of reasonably sized programs) can prove that the program is correct, i.e. that
its output is correct for every size and content of input. Most practical
programs have an infinite, or at least very large, number of possible inputs.
It is impossible to test them all, and to check that the output is correct in
each case. Instead possible inputs must be split into categories, and this
breakdown is the subject of a testing strategy. All strategies should follow
the "divide and conquer" method. This assumes that an error is
produced by a particular input or set of inputs, and that isolating this set
will enable the error to be isolated. There are, however, additional methods
that will help in this endeavor. Each of the above categories of error leads to
a particular strategy and testing method:
- When a program crashes it is because some operation has violated the
operating system constraints, usually by trying to access memory outside the
space allocated to your program. In C (less so in C++) this is almost always
(perhaps 85% of the time) due to pointer problems. Often these errors can only be found by
a debugging strategy, involving tracing, and, if possible, single stepping
through statements. However, some can be caught by changing the length of input
strings, where the program reads character strings as input data. If the error
goes away when strings have smaller length, then the culprit is an array whose
length is being exceeded. Common causes are a destination array for the string
library functions strcpy and strcat that is too small;
strings not terminated by the null character; indexing array elements
past the end of the array. As mentioned above, a sure way of uncovering these
errors is to do a simulation trace with the input that produced the bad result.
- If the program does not crash, but still does not work, then it must be
producing unexpected results. Unexpected output can be tested for by stretching
the limits on the input data. Typically a program will fail when presented with
data that is zero or absent, or too small. It can also be that the data is too
big, or too long. A good strategy in these cases is to present the program with
the simplest possible case (typically zero or absent data) and then to present
it with a complex case involving large values, long strings, or large numbers of
items. If the program passes these tests, it may still fail on "typical"
input data, and several tests should be run on data that represents the typical
case. It is useless to rely on one representative input, and even when several
tests have succeeded, suspicion is still appropriate, and more probing is often
necessary before feeling happy about the program.
- The last kind of error is the hardest of all to catch, because the
specification of the program is a moving target; it is very rare to have a
complete and exhaustive specification for a program. More usual is an
incomplete specification, or problem statement that is difficult to test for. A
program may appear to work, but inadequate testing has failed to uncover
possible mismatches between the requirements in the program specification and
outputs produced by the program. The strategy here is to make the specification
clearer by adding constraints or more definite kinds of statements. Every
clarification should lead to a test, or group of tests. A typical example is
that of a sorting program that fails to mention what should happen in the case
of repeated values in the input. Unless tests are made with these repeated
values, the sort program cannot be said to be correct. Just because the
specification did not mention the possibility does not mean that the program
should not handle the repetitions. Very often common sense will lead you to a
test that can reveal inadequacies in the specification. Of course if an error
is uncovered by this thinking, so much the better.
An Example: The miles per gallon problem
Problem statement: Print the average gas consumption in miles per gallon
given a list of trips, each consisting of the number of miles traveled and the
number of gallons used on that trip.
Design: the TDSR diagram shows a while loop, terminated by a sentinel value,
that accumulates total miles and total gallons. When the loop terminates, the
average mpg is calculated and printed.
[Figure: TDSR diagram. The dashed lines indicate further refinement in terms of
variables and operations.]
Testing this program should follow this sequence:
- Try a sentinel alone, i.e. an empty list of trips. Since the totals of
miles and gallons will both be zero, a message should tell the user that no
trips have been read. At the very least, the program should check that the
total number of gallons is greater than zero before finding the average, since
dividing a number by zero will not work.
- Try one trip, with a number of miles and gallons that gives an easy
average. 100 miles and 10 gallons would do. The average printed should be 10
mpg. Do this several times with different values, such as 1000 and 100, 10000
and 1000 etc. You could try very large numbers, such as 1000000000000, but the
results will be unpredictable. Integers, even 32 bits long, have a very large
but finite maximum value (2147483647 for a signed 32-bit integer). C++ will
accept larger values without complaint, but will produce erroneous results
when they overflow the integer variables.
- Try a few trips, again with values that you can work with in your head, or
on paper (perhaps with a calculator). "A few" means whatever you can
handle, but at least two.
- If all these tests succeed, then prepare a representative list of input
values, either by collecting real data, or by using informed guesswork. In this
case, assume you are going on a trip to Washington DC from Las Cruces, and work
out some typical distances between stopping places and typical gas
consumptions. Again there should be an independent way of checking the
program's answer. The calculator will rescue you in this case, but you may have
to use other manual methods.