Software Testing
lecture #1 began here
- Class Notes: http://www.cs.nmsu.edu/~jeffery/courses/579/testing.html
Software testing is the art and discipline of looking for bugs.
As Dr. Cook said of ESC/Java, passing a bunch of software tests
means only "sorry, didn't find any more bugs"; it does not mean
that there actually aren't any bugs in your program.
Interesting Software Testing Books
Some of the material from this lecture (and the next few) might come from
one of:
- The Art of Software Testing, by Glenford Myers
- Testing Computer Software, by Cem Kaner, Jack Falk, Hung Quoc Nguyen
- The Complete Guide to Software Testing, by Bill Hetzel
- Testing Object-Oriented Software, by David Kung, Pei Hsia, Jerry Gao
- How to Break Software, by James Whittaker
Glen Myers' classic book is one of the highest-rated; it dates back to the
'70s but still has much useful information. Cem Kaner's book is a highly
rated recent book. Bill Hetzel is said to have convened the first conference
on software testing. The Kung/Hsia/Gao book is a collection of papers by
many important authors. James Whittaker is a software testing researcher
at Florida Tech.
Motivational Speeches
"approximately 50% of the elapsed time and over 50% of the total cost are
expended in testing a program or system being developed." - Hetzel,
quoting Myers
"We've interviewed and hired a lot of testers. We have yet to meet a
computer science graduate who learned anything useful about testing at
a university." - Kaner et al
Practical Considerations
For the Mars trip, the goal might be 100% reliability, and in that case
the development schedule and testing budget had better be large enough to
support it. Even with "near-infinite" resources, high-reliability
software projects still regularly see rocket crashes, fatal radiation accidents, etc.
For "normal" commercial software, however, the applications may be much larger,
the pressure of time to market is more intense, and the testing budget isn't
even close to sufficient to aim for fixing all the bugs.
Being Objective
Testing is best done by someone other than the person who wrote the code.
The author of the code tends to write tests that reflect the assumptions
and perspective they brought to the code in the first place, and so
cannot be fully objective. Of course, coders must still do their own testing
and have the program "working" before they even hand it over to the testers.
Kinds of Errors
- Syntax & semantics errors
- typos, language misuse
- Logic errors
- a broad category including:
- < instead of >; missing an = somewhere; getting precedence wrong
- off-by-one errors
- deadlock/livelock
- accumulated numeric errors; not using enough bits/precision
- performing a calculation in the wrong place
- not terminating a loop or recursion
- not handling null/0, singleton/1, or multi-hit conditions
- I/O errors
- not checking inputs; formatting & parsing failures, network handshake trouble, ...
- Design errors
- misinheritance; using a bad algorithm; not establishing preconditions
- Requirements errors
- failing to specify a requirement, such as real-time response, or a needed ability to run on low-end machines or various operating systems
- Portability errors
- program crashes or behaves incorrectly or differently on some platforms
- Resource errors
- program crashes after running a long time; runs out of memory due to leaks; runs out of OS handles for files, threads, or graphics device contexts...
- Fault-tolerance errors
- program can't recover after a crash; system fails due to a single subprocess failure
- Documentation errors
- program does one (reasonable) thing, document says another
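As a small illustration of the "accumulated numeric errors" item in the list above, here is a Python sketch. The function names are made up for this example. It shows how adding 0.1 ten times in binary floating point drifts away from 1.0, and how a more careful summation avoids the drift.

```python
import math

def naive_total(values):
    """Add values one at a time; rounding error can accumulate at each step."""
    total = 0.0
    for v in values:
        total += v
    return total

def careful_total(values):
    """math.fsum keeps track of lost low-order bits while summing."""
    return math.fsum(values)

tenths = [0.1] * 10
print(naive_total(tenths) == 1.0)    # False
print(careful_total(tenths) == 1.0)  # True
```

A test that compares the naive total against exactly 1.0 fails even though the code "looks" right, which is why this category of logic error is easy to miss.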
Testing Universals (Whittaker)
All testing starts from: a description of a program's behavior, such as
a specification. Expected behavior may be easy, or very difficult, to
extract from a program's documentation. But in order for a test to be
meaningful, you generally have to be able to tell whether the output
was correct (test passed) or not (test failed).
All testing involves running the program, in either a real or a simulated
environment.
Kinds of Software Testing
- black box
- tests written from specifications, cast in terms of inputs
and their expected outputs.
- white box
- tests written with the program source code in hand, for example, to
guarantee that every line of code has been executed in one or more tests.
- unit testing
- test each individual class or module to demonstrate that it works in isolation
- integration testing
- test progressively larger collections of classes/modules, especially their interfaces
- system testing
- test the whole system after primary coding (alpha test, beta test, golden beta, etc.)
- regression testing
- retest old bugs; they have a habit of reappearing after other changes
Unit Testing
Whole books have been written about methods of writing good tests, much of
which boils down to: write tests to challenge the boundary conditions and
assumptions that programmers typically make when writing code.
Example of (white box) testing: Testing Loops
If your job is to write tests for some procedure that has a loop in it,
you can write tests that:
- skip the loop entirely
- execute only one iteration of the loop
- execute two iterations
- execute M iterations, for some random 2 < M < N-1
- execute (N-1), N, and (N+1) iterations
where N is the maximum number of allowable passes.
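The loop checklist above might be sketched in Python as follows. The function under test and the limit MAX_ITEMS (playing the role of N) are hypothetical and exist only for this illustration.

```python
MAX_ITEMS = 8  # hypothetical N: the maximum number of allowable passes

def total_up_to_limit(items):
    """Sum the items, refusing inputs that would need more than MAX_ITEMS passes."""
    if len(items) > MAX_ITEMS:
        raise ValueError("too many items")
    total = 0
    for item in items:  # the loop under test
        total += item
    return total

def run_loop_tests():
    """Exercise the loop 0, 1, 2, M, N-1, N and N+1 times; return the counts tried."""
    exercised = []
    for count in (0, 1, 2, 5, MAX_ITEMS - 1, MAX_ITEMS):
        assert total_up_to_limit([1] * count) == count
        exercised.append(count)
    # N+1 passes must be rejected rather than silently processed.
    try:
        total_up_to_limit([1] * (MAX_ITEMS + 1))
    except ValueError:
        exercised.append(MAX_ITEMS + 1)
    return exercised

print(run_loop_tests())
```

Here M is fixed at 5 so the run is repeatable; a randomly chosen M would need its seed recorded for the same reason.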
Coverage Testing
Coverage means: writing tests that execute all the code. Since a significant
portion of errors are due to simple typos and logic mistakes, if we execute
every line of code we are likely to catch many such "easy" errors.
There are at least two useful kinds of coverage: statement coverage
(executing every statement), and path coverage (executing every path
through the code). Statement coverage is not sufficient to catch
all bugs, but path coverage tends to suffer from a combinatorial
explosion of possibilities. Exhaustive path coverage may not be
an option, but some weaker forms of path coverage are useful.
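To see why statement coverage is not sufficient, consider this small Python sketch. The function is invented for the example. Two tests execute every statement and both pass, yet one of the four paths through the code still crashes.

```python
def adjust(x, reset, invert):
    """Hypothetical function with two independent branches, hence four paths."""
    if reset:
        x = 0
    if invert:
        x = 1 / x
    return x

# These two tests together execute every statement, and both pass.
assert adjust(4, True, False) == 0
assert adjust(4, False, True) == 0.25

# The untested path (reset and invert both true) divides by zero.
try:
    adjust(4, True, True)
    path_bug_found = False
except ZeroDivisionError:
    path_bug_found = True
print(path_bug_found)
```

With two branches there are only four paths to try; each additional independent branch doubles that number, which is the combinatorial explosion mentioned above.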
Example coverage tools:
- Clover
- Hansel
Testing is like detective work?
Lethbridge makes an unfortunate analogy between programmers and criminals;
they have a modus operandi, and once you find what type of bugs a
programmer is writing in one place, the programmer may well repeat similar
bugs elsewhere in the code.
Whittaker's Attacks (black box examples)
Testers have long been told to adopt the role of "adversary" and try to
break code. Whittaker rightly suggests that testers try to "hack" the
program; since hackers will be trying this later on, it makes great sense.
Software testing and software security have a clear overlap.
- force every error message to occur. Error handling code is seldom
executed and programmers often fail to test it, so it is
especially likely to have bugs.
- try inputs which force the default values. Programmers often
fail to initialize properly under default conditions.
- explore allowable character sets and data types. Using weird control
characters or programming-language operators may cause failures
if the programmer has not considered all possibilities.
- overflow input buffers
- find inputs that interact and test combinations of their values
- repeat the same input (or series of inputs) multiple times.
Programmers tend to assume resources are unlimited.
- force invalid outputs to be generated
- force properties of an output (font? color? size?) to change
- force the screen to refresh
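A few of the attacks above (forcing every error message, trying odd character sets, oversized inputs) can be sketched in Python against a toy input routine. Everything here, including parse_age and its messages, is hypothetical and only meant to show the shape of such a test.

```python
ERRORS = {
    "empty": "age is required",
    "not_number": "age must be a whole number",
    "range": "age must be between 0 and 150",
}

def parse_age(text):
    """Return (age, None) on success or (None, message) on failure."""
    text = text.strip()
    if not text:
        return None, ERRORS["empty"]
    if not text.isdigit():
        return None, ERRORS["not_number"]
    age = int(text)
    if age > 150:
        return None, ERRORS["range"]
    return age, None

# Attack inputs: empty, blank, letters, negative, operator characters,
# control characters, a very long digit string, and a just-out-of-range value.
ATTACKS = ["", "   ", "abc", "-5", "12; DROP TABLE", "\x00\x07", "9" * 1000, "151"]

def messages_triggered():
    """Collect the distinct error messages produced by the attack inputs."""
    return {parse_age(attack)[1] for attack in ATTACKS} - {None}

print(messages_triggered() == set(ERRORS.values()))
```

If the final comparison were false, some error message would never have been forced, and its handling code would be going untested.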
In selecting test cases, look for equivalence classes
You usually cannot test all the possible inputs to a program or parameters
to a procedure that you wish to test. If you can identify what
ranges of values ought to evoke different kinds of responses,
it will help you minimize test cases to: one representative from each
class of expected answer, plus extra tests at the boundaries of the
equivalence classes to make sure the ranges are nonoverlapping.
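As a sketch of the idea, suppose a hypothetical function classifies a score as "invalid" (outside 0..100), "fail" (0..59), or "pass" (60..100). The function and its thresholds are assumptions made up for this example. The test table below uses one representative per class plus the values on either side of each boundary.

```python
def classify(score):
    """Hypothetical unit under test with three equivalence classes."""
    if score < 0 or score > 100:
        return "invalid"
    if score < 60:
        return "fail"
    return "pass"

CASES = [
    # one representative from each class
    (-50, "invalid"), (30, "fail"), (80, "pass"), (150, "invalid"),
    # boundaries between the classes
    (-1, "invalid"), (0, "fail"),
    (59, "fail"), (60, "pass"),
    (100, "pass"), (101, "invalid"),
]

def failures():
    """Return (input, expected, actual) for every case that does not match."""
    return [(s, e, classify(s)) for s, e in CASES if classify(s) != e]

print(failures())
```

Ten cases stand in for every possible integer input; an off-by-one mistake in either threshold would show up in the boundary rows.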
A Testing Process
Kaner's "First Cycle"
- Start with an obvious and simple test.
- Make some notes about what else needs testing.
- Check the valid cases and see what happens
- Do some testing "on the fly"
- Summarize what you know about the program and its problems.
Second Cycle
- Review responses to the problem reports, to see what needs to be done, and what doesn't.
- Review comments on problems that won't be fixed; they may suggest further tests.
- Pull out your notes from that time, add your new notes to them, and start testing.
Some Rules of Thumb (Kaner)
- Submit one problem report for each error
- If you expect the same result from two tests, use only one of them.
- When you choose representatives of a class of values for a test,
always pick the ones you think the program is most likely to fail on.
- The reason the program is "in testing" is that it probably doesn't work.
- Always write down what you do and what happens when you run exploratory
tests.
- The most effective way to get a bug fixed is to find test cases that make
it appear so likely to occur, under such innocent conditions, that no
one would be willing to tolerate such a bug.
- If the programmers were careful, most of your tests won't find errors,
even clever ones you spent a lot of time thinking about.
- The best tester isn't the one who finds the most bugs or embarrasses
the programmer the most; it is the one who gets the most bugs fixed.
Bug Tracking
There is a need to document and track problems in a standardized way. In some
organizations there is a paper Bug Report Form. Whether they are committed to
paper or not, bugs should also be entered into a database, where management can
see what is getting fixed and what isn't. Bug reports include:
- submitter contact information (for questions)
- a description of the problem
- instructions on how to reproduce it (if possible)
- supporting files (sample data, logfiles, screenshots...)
- person(s) assigned to the bug
- classification and estimation of severity/importance
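The fields listed above map naturally onto a record type. The following Python sketch is one possible shape, not the schema of any particular bug tracker; the class and field names are assumptions chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BugReport:
    """Hypothetical bug record holding the fields a report should include."""
    submitter: str                       # contact information, for questions
    description: str                     # what went wrong
    steps_to_reproduce: List[str] = field(default_factory=list)
    attachments: List[str] = field(default_factory=list)   # sample data, logs, screenshots
    assignees: List[str] = field(default_factory=list)
    severity: str = "unclassified"

    def is_reproducible(self):
        """A report with no reproduction steps is much harder to act on."""
        return bool(self.steps_to_reproduce)

report = BugReport(
    submitter="tester@example.org",
    description="crash when opening an empty file",
    steps_to_reproduce=["create an empty file", "open it from the File menu"],
    severity="high",
)
print(report.is_reproducible())
```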
A couple sample bug trackers:
lecture #2 began here
Pooley/Stevens on Testing (read Section 19.5)
In addition to finding bugs, P/S claim testing serves two more purposes:
- to convince the customer that there are no (important) bugs.
For example, a beta test may have to "pass" before a customer will "pay".
- to provide information that will help with system evolution.
Sometimes the outcome of a test is: that will work in version 2.
Sometimes testing produces more precise performance information
than can be measured earlier in analysis, design, or implementation.
Tests have to be repeatable, documented, and precise. Repeatability
requires more than just the test inputs; one has to know exactly which
version of the software (say, by CVS version numbers, or last update date),
and what OS/compiler/library versions were used.
Many test cases are often needed; often they can be generated
automatically by some script. Scripts are also good for rerunning tests.
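A rough Python sketch of both points: a script can generate many test cases from a few lists of interesting values, and each run can record the environment it ran in so that the run is repeatable. The parameter values and the function names are hypothetical.

```python
import itertools
import platform

def environment_record(software_version):
    """Capture what is needed to repeat a run besides the inputs themselves."""
    return {
        "software_version": software_version,   # e.g. a version-control revision
        "os": platform.platform(),
        "python": platform.python_version(),
    }

def generate_cases():
    """Build every combination of some interesting widths and texts."""
    widths = [0, 1, 80]
    texts = ["", "a", "a b"]
    return list(itertools.product(widths, texts))

cases = generate_cases()
print(len(cases), "cases generated")
print(sorted(environment_record("1.0").keys()))
```

Storing the environment record next to the test results makes it possible to tell later whether a changed result came from the software or from its surroundings.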
Special Problems of Testing OO Software
A class is harder to test than a function. It is often harder to get the
object into the state in which a given method does its thing, than to set
up the circumstances needed to test a function.
OO systems tend to be complex, which makes testing a challenge. For example,
to test code based on a statechart, one wants to write tests to exercise
every transition in the statechart (as with the previous lecture's suggestion
of testing every error message, this sounds like "design coverage" testing).
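A minimal sketch of "exercise every transition", assuming a made-up three-state door statechart. The Door class, its states, and its events are hypothetical. The helper reports which transitions a set of test sequences has not yet covered.

```python
# Hypothetical statechart: (current state, event) -> next state
TRANSITIONS = {
    ("closed", "open"): "opened",
    ("opened", "close"): "closed",
    ("closed", "lock"): "locked",
    ("locked", "unlock"): "closed",
}

class Door:
    def __init__(self):
        self.state = "closed"

    def fire(self, event):
        """Apply an event, rejecting anything the statechart does not allow."""
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError("illegal event %r in state %r" % (event, self.state))
        self.state = TRANSITIONS[key]

def untested_transitions(event_sequences):
    """Run each sequence on a fresh Door and return the transitions never taken."""
    covered = set()
    for sequence in event_sequences:
        door = Door()
        for event in sequence:
            before = door.state
            door.fire(event)
            covered.add((before, event))
    return set(TRANSITIONS) - covered

print(untested_transitions([["open", "close"]]))
print(untested_transitions([["open", "close", "lock", "unlock"]]))
```

Note that reaching the "locked" state requires firing other events first, which is the object-state setup cost mentioned earlier in this section.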
Encapsulation, so vital in OO systems, also makes it harder to test them.
Extracting private variable information is a pain. It may involve writing
lots of extra methods.
Testing in the presence of inheritance and polymorphism is complicated by
the fact that any given polymorphic code may be working with any of several
different object types; the potential interactions are such that the same
chunk of code may need to be tested many times with each combination of
parameterized types.
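One common way to cope is to write a check once and run it against every concrete type and parameter combination. The Python sketch below assumes two invented shape classes; the property being checked (scaling by k multiplies area by k*k) is only an example.

```python
import math

class Circle:
    def __init__(self, radius):
        self.radius = radius
    def area(self):
        return math.pi * self.radius ** 2
    def scaled(self, factor):
        return Circle(self.radius * factor)

class Square:
    def __init__(self, side):
        self.side = side
    def area(self):
        return self.side ** 2
    def scaled(self, factor):
        return Square(self.side * factor)

def scaling_law_holds(shape, factor):
    """Polymorphic check: works on any object with area() and scaled()."""
    return math.isclose(shape.scaled(factor).area(),
                        factor * factor * shape.area())

def failing_combinations():
    """Run the same check over every (type, factor) combination."""
    shapes = [Circle(1.5), Square(2), Square(0.5)]
    factors = [1, 2, 0.5]
    return [(type(s).__name__, k)
            for s in shapes for k in factors
            if not scaling_law_holds(s, k)]

print(failing_combinations())
```

The number of combinations is the product of the list lengths, so the lists have to be chosen with some care as more types are added.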
Validation Testing: an old example
Prologue: This is approximately what I learned about testing from
a university, so according to Kaner, it should not be useful, or I should
not have learned anything from it.
The Unicon test suite attempts to validate, in a general way, the major
functions of the Unicon language; it is used by folks who build Unicon
from sources, especially those who build it on a new OS platform. The
unicon/tests/README file divides the testing into categories as follows:
The sub-directories here contain various test material for
Version 11.0 of Unicon and Version 9.4 of Icon.
bench       benchmarking suite
calling     calling C functions from Icon
general     main test suite
graphics    tests of graphic features
preproc     tests of the rtt (not Icon) preprocessor
samples     sample programs for quick tests
special     tests of special features
Each subdirectory has a suite of tests and sample data, and a Makefile for
building and running tests. The master test/Makefile automates execution of
the general and posix tests, which are routinely run on new Unicon builds.
The general/ directory contains tests "inherited" from the Icon programming
language (50 files, 5K LOC):
augment.icn collate.icn gc1.icn mem01c.icn prefix.icn struct.icn
btrees.icn concord.icn gc2.icn mem01x.icn prepro.icn tracer.icn
cfuncs.icn diffwrds.icn gener.icn mem02.icn proto.icn transmit.icn
checkc.icn endetab.icn helloc.icn mffsol.icn recent.icn var.icn
checkfpc.icn env.icn hellox.icn mindfa.icn recogn.icn wordcnt.icn
checkfpx.icn errors.icn ilib.icn numeric.icn roman.icn
checkx.icn evalx.icn kross.icn others.icn scan.icn
ck.icn fncs.icn large.icn over.icn sieve.icn
coexpr.icn fncs1.icn meander.icn pdco.icn string.icn
Some of these tests were introduced when new language features were
introduced and may constitute unit tests; many others were introduced when
a bug was reported and fixed (and hence, are regression tests). A
semi-conscious attempt has been made to use pretty much every language
feature, thus, the test suite forms somewhat of a validation of a Unicon
build.
The tests are all run from a script, which looks about like the following.
Each test is run from a for-loop, and its output diff'ed against an
expected output. Some differences are expected, such as
the test which prints out what operating system, version and so forth.
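# Run each named test: compile it, run it, and diff the output
# against the expected output stored in the .std file.
for F in $*; do
   # strip a .std or .icn suffix to get the bare test name
   F=`basename $F .std`
   F=`basename $F .icn`
   rm -f $F.out
   echo "Testing $F"
   # compile; skip this test if compilation fails
   $IC -s $F.icn || continue
   # feed the test its data file if one exists, else empty input
   if test -r $F.dat
   then
      ./$F <$F.dat >$F.out 2>&1
   else
      ./$F </dev/null >$F.out 2>&1
   fi
   # any differences from the expected output are printed
   diff $F.std $F.out
   rm -f $F
done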
Sample test (diffwrds.icn):
#
# D I F F E R E N T W O R D S
#
# This program lists all the different words in the input text.
# The definition of a "word" is naive.
procedure main()
   words := set()
   while text := read() do
      text ? while tab(upto(&letters)) do
         insert(words,tab(many(&letters)))
   every write(!sort(words))
end
Sample data file (diffwrds.dat):
procedure main()
local limit, s, i
limit := 100
s := set([])
every insert(s,1 to limit)
every member(s,i := 2 to limit) do
every delete(s,i + i to limit by i)
primes := sort(s)
write("There are ",*primes," primes in the first ",limit," integers.")
write("The primes are:")
every write(right(!primes,*limit + 1))
end
Sample expected output (diffwrds.std):
The
There
are
by
delete
do
end
every
first
i
in
insert
integers
limit
local
main
member
primes
procedure
right
s
set
sort
the
to
write
What I Have Learned About Testing
Remember, this was in an academic environment, so Kaner would dismiss it.
- Software changes, so testing is largely about re-testing.
- To reduce the cost of testing, Automate
- Floating point numbers print out differently on different platforms
- Not every difference between expected and actual output is a bug
- Automated test scripts, as with make, are only as portable as your shell,
but since test scripts aren't as complicated as a large system build
process, test scripts are easier to reconstruct on odd platforms.
- Automated test scripts only help when you use them.
- Bug tracking systems only help when you use them. Putting them
up and not using them is negative advertising.
- Properties like "coverage" must be reestablished after changes
- Graphics programs are harder to test. GUI sessions can be recorded,
but it's harder to "diff" two computer screens than two text files.
- Testing is only half of the maintenance job: testing without bug fixing
is like holding an election and then keeping the results secret and not
using them.
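The floating-point and "not every difference is a bug" lessons above suggest comparing output more tolerantly than a plain diff does. Here is a minimal Python sketch, assuming that only numbers written with a decimal point should be compared approximately; the function name and the tolerance are assumptions for this example.

```python
import math
import re

# Matches simple decimal literals such as 3.14 or -0.5 (an assumption: no
# exponent forms, which would need extra handling).
NUMBER = re.compile(r"[-+]?\d+\.\d+")

def lines_match(expected, actual, rel_tol=1e-9):
    """True if the lines agree, with decimal numbers compared within rel_tol."""
    # The non-numeric text must be identical.
    if NUMBER.sub("#", expected) != NUMBER.sub("#", actual):
        return False
    exp_nums = NUMBER.findall(expected)
    act_nums = NUMBER.findall(actual)
    if len(exp_nums) != len(act_nums):
        return False
    return all(math.isclose(float(e), float(a), rel_tol=rel_tol)
               for e, a in zip(exp_nums, act_nums))

print(lines_match("x = 0.30000000000000004", "x = 0.3"))  # True
print(lines_match("x = 0.3", "x = 0.4"))                  # False
```

Integers and all other text are still compared exactly, so a genuine change in output is still reported.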
Application: Develop a Test Strategy
So you have written a set of Java classes that constitute one or more
Eclipse plugins. How do you develop a test strategy to find all the bugs?...
More Unit Testing
Beyond "testing a unit", what are the goals of Unit Testing?
Can we come up with a more concrete definition?
Suppose you want to find "as many bugs as possible" within
class X; how do you go about it?
- If you are whole-program testing, you are writing test cases as data files (inputs)
for a program run. If you are unit testing, you are writing test cases as calls
(inputs) to individual functions, or sequences of calls to related functions in the unit.
- If you are whole-program testing, you are comparing (maybe using "diff") the
execution output against expected execution output. If you are unit-testing,
you are comparing (probably not using diff...) actual output of the function call(s)
with expected output. What outputs may you need to compare? Primarily function
return value, I/O output, and side effects to globals and parameters. Anything
else?
...and this is almost exactly what JUnit does. You write test methods to test other
methods, calling them with some artificial environment (parameters, etc.) in which
a particular output is expected, and checking whether the output matches or not.
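To make the three kinds of output concrete, here is a sketch using Python's unittest module, which follows the same xUnit style as JUnit. The function under test, record_score, is hypothetical. The single test checks the return value, the printed output, and the side effect on a parameter.

```python
import io
import unittest
from contextlib import redirect_stdout

def record_score(scores, name, points):
    """Hypothetical unit under test: updates scores, prints, returns the new total."""
    scores[name] = scores.get(name, 0) + points
    print(f"{name} now has {scores[name]} points")
    return scores[name]

class RecordScoreTest(unittest.TestCase):
    def test_return_output_and_side_effect(self):
        scores = {"ann": 5}                 # artificial environment
        captured = io.StringIO()
        with redirect_stdout(captured):     # capture I/O output
            result = record_score(scores, "ann", 3)
        self.assertEqual(result, 8)                                     # return value
        self.assertEqual(captured.getvalue(), "ann now has 8 points\n") # I/O output
        self.assertEqual(scores, {"ann": 8})                            # side effect
```

Saved in a file, this can be run with `python -m unittest`, which reports each test method as passed or failed.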
lecture #3 began here
McCabe's Principles
- Since 10% of the code is responsible for 90% of the execution time,
90% of your testing may only be exercising 10% of your code!
- Typical large projects spend 50% of their design/coding time on
error detection and correction -- that half of the code is often poorly
tested, if at all.
- Error distributions differ between large and small projects:
- large projects
- interface inconsistencies
- incomplete problem statements
- ambiguous specifications
- inconsistent assumptions
- management visibility
- small projects
- computational accuracy
- intraroutine control
- syntax
- indexing and control flow
- design errors versus coding errors: on many large projects,
most of the errors (about 2/3rds?) are design errors. But
most of the errors uncovered by testing are the coding errors.
How to test for design errors?
More Coverage Testing
Coverage Testing was mentioned briefly last lecture. It is
in some sense meta-testing: testing your tests.
There is a paper at
http://www.bullseye.com/coverage.html
that provides a good introduction and defines several
forms of coverage testing. You can combine, or perform multiple forms,
of these coverage tests. Note that although these are phrased as
yes/no questions, a coverage tool doesn't just answer yes/no or even
give you a percentage: it gives you a percentage and shows in detail
each location or case in which the coverage property did not hold.
- function coverage
- did you execute every function?
- call coverage
- did you execute every call to every function?
- loop coverage
- did you execute every loop 0, 1, and >1 times?
- relational coverage
- did you test every relational operator with equal, lesser-, and greater-than
values, in order to avoid off-by-one errors and logic inversions?
- data coverage
- did every variable actually get referenced? every array element?
was every pointer followed?
- statement coverage
- measure % of executable lines that were executed by the tests.
A good starting point with many weaknesses. For example,
not all of a short-circuit boolean expression may have been tested
in 100% statement-coverage. If an if statement has no else clause,
100% coverage does not include: is the "then" ever not taken?
- decision coverage
- evaluate every (outermost level) boolean expression to both
true and false. May find things missed
in statement coverage, but still doesn't cover short circuit code
- condition coverage
- check whether individual boolean subexpressions were evaluated to both
true and false. Handles short circuits. Does not guarantee decision
coverage!
- multiple condition coverage
- check whether all possible combinations of boolean subexpressions were
executed. Scoring well requires a lot of test cases.
- condition/decision coverage
- union of condition coverage and decision coverage
- modified condition/decision coverage
- verify every condition can affect the result of its encompassing
decision. By definition, short-circuit conditions fail this test.
Invented at Boeing; required for certain aviation software.
- path coverage
- check whether each possible path in each function has been followed.
A path is a unique sequence of branches from function entry to exit.
Loops introduce an unbounded number of paths. The number of paths is
exponential in the number of branches. Some paths may be impossible.
Many variations have been introduced to try to develop "practical" path
coverage methods.
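The difference between decision, condition, and multiple condition coverage in the list above can be checked by hand on a two-operand decision. The Python sketch below does the bookkeeping for the decision "a and b". It is a simplification that treats both operands as always evaluated, so it ignores short-circuit effects, and the function names are made up for this example.

```python
from itertools import product

def decision(a, b):
    return a and b

def coverage_report(test_vectors):
    """Report which coverage criteria a list of (a, b) test inputs satisfies."""
    outcomes = {decision(a, b) for a, b in test_vectors}
    a_values = {a for a, _ in test_vectors}
    b_values = {b for _, b in test_vectors}
    both = {True, False}
    return {
        # the whole expression was both true and false
        "decision": outcomes == both,
        # each operand was both true and false
        "condition": a_values == both and b_values == both,
        # every combination of operand values was tried
        "multiple_condition": set(test_vectors) == set(product([True, False], repeat=2)),
    }

print(coverage_report([(True, False), (False, True)]))
print(coverage_report([(True, True), (False, False)]))
```

The first pair of tests gives each operand both values, yet the decision is never true: condition coverage without decision coverage. The second pair satisfies both criteria with two tests, while multiple condition coverage needs all four combinations.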
Other useful articles are "A Buyer's Guide to Code Coverage Terminology"
and the related "How to Misuse Code Coverage".
A couple more words on our sample coverage tools, Clover and Hansel.
Clover is a commercial product which works by instrumenting the source
code. It does statement and branch coverage, but not most of the
other forms of coverage. It might actually be cool.
Another commercial coverage tool is JCover, which does more types of
coverage tests. There are no doubt dozens of others.
Hansel is an open source extension to JUnit, based on code developed at
the University of Oregon. It works with bytecode not source code.
It appears to just do statement coverage. It's not much, but it's free
and it's better than nothing.
Apologies to Whittaker
Earlier I indicated Whittaker was from Microsoft. He's not. He's
from Florida Tech. He's spoken at Microsoft before. Mea culpa.
Bug hunts
Bug hunts employ sets of test teams consisting of two participants at a
computer in a testing lab; plus a referee, a recorder, and a coach.
Participants apply attacks and test the application in a direct attempt to
break the program. One participant usually drives the application, the
other sits back and thinks about the big picture. When the team finds a
bug, they ring a bell (a nice tension breaker) which summons the referee and
the recorder. The participants reproduce the bug for the referee, who
judges whether it is a bug or not. Video recordings are often replayed
later as a postgame analysis of technique, explanation, and style.
When to hold bug hunts? Immediately after a major new build or a new
feature is added. Limit the hunt to specific areas of the software,
and keep them to a strict time limit (say, two hours). At the end,
referee and coaches should pick the best test team and award prizes.
Bugfests
Earlier we said testing finds bugs, and that finding bugs doesn't
by itself solve any problems.
How to translate testing into positive gain among your programmers?
(a) get the bugs fixed, (b) find ways to educate bug authors so they
don't repeat their mistakes.
For (a) the answer is sometimes easy, sometimes hard, but usually
an individual enterprise. For (b) you can
meet to discuss bugs; Whittaker calls such meetings bugfests. The
goal of a bugfest is to learn by discussing bugs.
I believe that bugs are corporate assets and that we can learn a
great deal about how to improve our software by studying the very
things that make our software need improvement.
Want to know what types of errors your programmers are making?
Study their bug reports. Want to know what testing techniques
are finding the best bugs? Study bug reports. Want to get new
testers testing like your best veterans? Make them study bug
reports.
Studying bug reports is boring you say? The report doesn't contain
the insights you really need? I agree. That's exactly why we
instituted bug fests.
Whittaker holds bugfests weekly at a local pizza parlor within walking
distance of his office. By emphasizing describing the bug well enough
to be worth conversation, testers tend to study the bug harder, make
a careful note of how to reproduce it, etc. Testers bring laptops and
demo the bugs for developers. After the demo, you get inside of the
head of the tester to find out how they came up with the test case
(helping other testers learn), and analyze the bug: what is wrong,
why is it wrong, how to fix it, does it occur in other places in
the code (helping developers avoid repeating the mistake).