Software Testing

lecture #1 began here

Software testing is the art and discipline of looking for bugs. As Dr. Cook said of ESC/Java, passing a bunch of software tests means "sorry, didn't find any more bugs"; it doesn't mean there actually aren't any bugs in your program.

Interesting Software Testing Books

Some of the material from this lecture (and the next few) might come from one of the following:
Glen Myers' classic book is one of the highest-rated; it dates back to the '70s but still has much useful information.
Cem Kaner's book is a highly rated recent book.
Bill Hetzel is said to have convened the first conference on software testing.
The Kung/Hsia/Gao book is a collection of papers by many important authors.
James Whittaker is a software testing researcher at Florida Tech.

Motivational Speeches

"approximately 50% of the elapsed time and over 50% of the total cost are expended in testing a program or system being developed." - Hetzel, quoting Myers
"We've interviewed and hired a lot of testers. We have yet to meet a computer science graduate who learned anything useful about testing at a university." - Kaner et al

Practical Considerations

For the Mars trip, the goal might be 100% reliability, and in that case the development schedule and testing budget had better be large enough to support it. Even with "near-infinite" resources, high-reliability software projects still regularly see rocket crashes, fatal radiation accidents, etc. For "normal" commercial software, however, the applications may be much larger, the pressure of time to market is more intense, and the testing budget isn't even close to sufficient to aim for fixing all the bugs.

Being Objective

Testing is best done by someone other than the person who wrote the code. The person who wrote the code writes tests that reflect the assumptions they have already made and the perspective they already hold, and so cannot be objective. Of course, coders must do their own testing and have the program "working" before they even hand it over to the testers.

Kinds of Errors

Syntax & semantics errors
typos, language misuse
Logic errors
a broad category including:
I/O errors
not checking inputs; formatting & parsing failures, network handshake trouble, ...
Design errors
misinheritance; using a bad algorithm; not establishing preconditions
Requirements errors
failing to specify a requirement, such as real-time response, or a needed ability to run on low-end machines or various operating systems
Portability errors
program crashes or behaves incorrectly or differently on some platforms
Resource errors
program crashes after running a long time; runs out of memory due to leaks; runs out of OS handles for files, threads, or graphics device contexts...
Fault-tolerance errors
program can't recover after a crash; system fails due to a single subprocess failure
Documentation errors
program does one (reasonable) thing, document says another

Testing Universals (Whittaker)

All testing starts from a description of a program's behavior, such as a specification. Expected behavior may be easy, or very difficult, to extract from a program's documentation. But in order for a test to be meaningful, you generally have to be able to tell whether the output was correct (test passed) or not (test failed). All testing involves running the program, either in a real or simulated environment.

Kinds of Software Testing

black box
tests written from specifications, cast in terms of inputs and their expected outputs.
white box
tests written with the program source code in hand, for example, to guarantee that every line of code has been executed in one or more tests.
unit testing
test each individual class or module to demonstrate that it works in isolation
integration testing
test progressively larger collections of classes/modules, especially their interfaces
system testing
test the whole system after primary coding (alpha test, beta test, golden beta, etc.)
regression testing
retest old bugs; they have a habit of reappearing after other changes

Unit Testing

Whole books have been written about methods of writing good tests, much of which boils down to: write tests to challenge the boundary conditions and assumptions that programmers typically make when writing code.

Junit HowTo

Example of (white box) testing: Testing Loops

If your job is to write tests for some procedure that has a loop in it, you can write tests that:
  1. skip the loop entirely
  2. execute only one iteration of the loop
  3. execute two iterations
  4. execute M iterations, for some random 2 < M < N-1
  5. execute (N-1), N, and (N+1) iterations
where N is the maximum number of allowable passes.
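
The loop cases above can be sketched as a small table-driven test. This is only a sketch, written in Python for brevity (the same structure carries over to JUnit). The function sum_first and the limit MAX_ITEMS are hypothetical, invented for illustration; they are not from any of the books cited.

```python
MAX_ITEMS = 10  # hypothetical N: the maximum number of allowable passes


def sum_first(values):
    """Hypothetical code under test: sum a list of at most MAX_ITEMS values."""
    if len(values) > MAX_ITEMS:
        raise ValueError("too many values")
    total = 0
    for v in values:  # the loop we want to exercise
        total += v
    return total


def loop_test_sizes(n, m):
    """Iteration counts from the list above: 0, 1, 2, some M, N-1, N, N+1."""
    return [0, 1, 2, m, n - 1, n, n + 1]


def run_loop_tests():
    """Run sum_first once per iteration count; return the sizes that failed."""
    failures = []
    for size in loop_test_sizes(MAX_ITEMS, m=5):
        data = [1] * size  # each pass adds 1, so the expected sum is `size`
        try:
            result = sum_first(data)
            ok = size <= MAX_ITEMS and result == size
        except ValueError:
            ok = size > MAX_ITEMS  # rejection is correct only beyond the limit
        if not ok:
            failures.append(size)
    return failures
```

Calling run_loop_tests() returns an empty list when every case behaves as expected. The N+1 case is the interesting one: it checks that the code refuses the extra pass rather than silently performing it.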

Coverage Testing

Coverage means: writing tests that execute all the code. Since a significant portion of errors are due to simple typos and logic mistakes, if we execute every line of code we are likely to catch all such "easy" errors.

There are at least two useful kinds of coverage: statement coverage (executing every statement), and path coverage (executing every path through the code). Statement coverage is not sufficient to catch all bugs, but path coverage tends to suffer from a combinatorial explosion of possibilities. Exhaustive path coverage may not be an option, but some weaker forms of path coverage are useful. Example coverage tools:

Clover
Hansel

Testing is like detective work?

Lethbridge makes an unfortunate analogy between programmers and criminals; they have a modus operandi, and once you find what type of bugs a programmer is writing in one place, the programmer may well repeat similar bugs elsewhere in the code.

Whittaker's Attacks (black box examples)

Testers have long been told to adopt the role of "adversary" and try to break code. Whittaker rightly suggests that testers try to "hack" the program; since hackers will be trying this later on, it makes great sense. Software testing and software security have a clear overlap.

In selecting test cases, look for equivalence classes

You usually cannot test all the possible inputs to a program or parameters to a procedure that you wish to test. If you can identify what ranges of values ought to evoke different kinds of responses, it will help you minimize test cases to: one representative from each class of expected answer, plus extra tests at the boundaries of the equivalence classes to make sure the ranges are nonoverlapping.
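
As a sketch of the idea, assume a hypothetical function letter_grade whose specification defines three valid ranges (0..59, 60..89, 90..100) and two invalid ones (below 0, above 100). The function and its ranges are made up for illustration. One representative per class plus the values on either side of each boundary gives thirteen test inputs instead of every possible integer.

```python
def letter_grade(score):
    """Hypothetical code under test: map a 0..100 score to a grade."""
    if score < 0 or score > 100:
        raise ValueError("score out of range")
    if score >= 90:
        return "A"
    if score >= 60:
        return "pass"
    return "fail"


# One representative from each equivalence class (None means "rejected").
REPRESENTATIVES = [(-50, None), (30, "fail"), (75, "pass"), (95, "A"), (150, None)]

# Values on each side of every boundary between classes.
BOUNDARIES = [(-1, None), (0, "fail"), (59, "fail"), (60, "pass"),
              (89, "pass"), (90, "A"), (100, "A"), (101, None)]


def check(cases):
    """Return the inputs whose actual result differs from the expected one."""
    wrong = []
    for score, expected in cases:
        try:
            actual = letter_grade(score)
        except ValueError:
            actual = None  # None stands for "rejected as invalid"
        if actual != expected:
            wrong.append(score)
    return wrong
```

If someone later changes `score >= 60` to `score > 60`, the representative tests still pass, but check(BOUNDARIES) reports 60. That is the reason for the extra boundary cases.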

A Testing Process

Kaner's "First Cycle"
  1. Start with an obvious and simple test.
  2. Make some notes about what else needs testing.
  3. Check the valid cases and see what happens.
  4. Do some testing "on the fly".
  5. Summarize what you know about the program and its problems.
Second Cycle
  1. Review responses to the problem reports, to see what needs to be done, and what doesn't.
  2. Review comments on problems that won't be fixed; they may suggest further tests.
  3. Pull out your notes from the first cycle, add your new notes to them, and start testing.

Some Rules of Thumb (Kaner)

Bug Tracking

There is a need to document and track problems in a standardized way. In some organizations there is a paper Bug Report Form. Whether they are committed to paper or not, bugs should also be entered into a database, where management can see what is getting fixed and what isn't. Bug reports include:
  1. submitter contact information (for questions)
  2. a description of the problem
  3. instructions on how to reproduce it (if possible)
  4. supporting files (sample data, logfiles, screenshots...)
  5. person(s) assigned to the bug
  6. classification and estimation of severity/importance
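
The six fields above map directly onto a record type. The sketch below is one possible shape, not the schema of any particular bug tracker; the class name, field names, and the is_actionable helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BugReport:
    """Hypothetical record holding the six items listed above."""
    submitter_contact: str                                  # 1. for questions
    description: str                                        # 2. the problem
    steps_to_reproduce: Optional[str] = None                # 3. if possible
    attachments: List[str] = field(default_factory=list)    # 4. file names
    assignees: List[str] = field(default_factory=list)      # 5. who owns it
    severity: str = "unclassified"                          # 6. importance

    def is_actionable(self):
        """True when there are reproduction steps or supporting files."""
        return bool(self.steps_to_reproduce or self.attachments)
```

A report with neither reproduction steps nor supporting files gives the assignee very little to work with, which is what is_actionable is meant to flag.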
A couple sample bug trackers:

lecture #2 began here

Pooley/Stevens on Testing (read Section 19.5)

In addition to finding bugs, P/S claim testing serves two more purposes: Tests have to be repeatable, documented, and precise. Repeatability requires more than just the test inputs; one has to know exactly which version of the software (say, by CVS version numbers, or last update date), and what OS/compiler/library versions were used.

Many test cases are often needed; often they can be generated automatically by some script. Scripts are also good for rerunning tests.
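
A minimal sketch of script-generated test cases follows. It assumes a hypothetical hand-written insertion_sort as the code under test and uses Python's built-in sorted() as the oracle that says what the right answer is. The fixed seed is what makes the generated cases repeatable from one run to the next.

```python
import random


def insertion_sort(items):
    """Hypothetical code under test: a hand-written insertion sort."""
    result = list(items)
    for i in range(1, len(result)):
        key = result[i]
        j = i - 1
        while j >= 0 and result[j] > key:
            result[j + 1] = result[j]  # shift larger elements right
            j -= 1
        result[j + 1] = key
    return result


def generate_cases(count, seed):
    """Generate `count` random integer lists; same seed, same cases."""
    rng = random.Random(seed)
    return [[rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
            for _ in range(count)]


def run_generated_tests(count=200, seed=1234):
    """Return every generated case on which the code disagrees with the oracle."""
    return [case for case in generate_cases(count, seed)
            if insertion_sort(case) != sorted(case)]
```

Recording the seed alongside the software version is enough to regenerate the exact same inputs later, so a failure found this way can be reproduced.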

Special Problems of Testing OO Software

A class is harder to test than a function. It is often harder to get the object into the state in which a given method does its thing than to set up the circumstances needed to test a function.

OO systems tend to be complex, which makes testing a challenge. For example, to test code based on a statechart, one wants to write tests to exercise every transition in the statechart (as with the previous lecture's suggestion of testing every error message, this sounds like "design coverage" testing).
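
Transition coverage can be sketched as follows. The statechart here (a door that can be opened, closed, locked, and unlocked) and the Door class are hypothetical, made up for illustration. The helper runs event sequences against fresh objects and reports which transitions of the statechart no test has exercised yet.

```python
# Hypothetical statechart: (state, event) -> next state.
TRANSITIONS = {
    ("closed", "open"): "opened",
    ("opened", "close"): "closed",
    ("closed", "lock"): "locked",
    ("locked", "unlock"): "closed",
}


class Door:
    """Hypothetical class under test, implementing the statechart above."""

    def __init__(self):
        self.state = "closed"

    def handle(self, event):
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event!r} not allowed in state {self.state!r}")
        self.state = TRANSITIONS[key]


def uncovered_transitions(event_sequences):
    """Run each event sequence on a fresh Door; return transitions never taken."""
    taken = set()
    for events in event_sequences:
        door = Door()
        for event in events:
            before = door.state
            door.handle(event)
            taken.add((before, event))
    return set(TRANSITIONS) - taken
```

Note that reaching the "locked" state requires first sending "lock" from "closed": each test has to drive the object into the right state before the transition of interest can fire, which is the extra setup cost mentioned above.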

Encapsulation, so vital in OO systems, also makes it harder to test them. Extracting private variable information is a pain. It may involve writing lots of extra methods.

Testing in the presence of inheritance and polymorphism is complicated by the fact that any given polymorphic code may be working with any of several different object types; the potential interactions are such that the same chunk of code may need to be tested many times with each combination of parameterized types.
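
To illustrate the combinatorics, here is a sketch with a hypothetical Shape hierarchy and a polymorphic function total_area (all names invented for this example). The same two-argument check is repeated for every ordered pair of concrete types, so three types already mean nine runs of one test.

```python
import itertools


class Shape:
    def area(self):
        raise NotImplementedError


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


class Rectangle(Shape):
    def __init__(self, width, height):
        self.width, self.height = width, height

    def area(self):
        return self.width * self.height


class Triangle(Shape):
    def __init__(self, base, height):
        self.base, self.height = base, height

    def area(self):
        return self.base * self.height / 2


def total_area(shapes):
    """Polymorphic code under test: accepts any mix of Shape subclasses."""
    return sum(shape.area() for shape in shapes)


# One sample object per concrete type, paired with its known area.
SAMPLES = [(Square(2), 4), (Rectangle(2, 3), 6), (Triangle(4, 3), 6)]


def failing_type_pairs():
    """Check total_area on every ordered pair of types; return failing pairs."""
    failures = []
    for (a, area_a), (b, area_b) in itertools.product(SAMPLES, repeat=2):
        if total_area([a, b]) != area_a + area_b:
            failures.append((type(a).__name__, type(b).__name__))
    return failures
```

Adding a fourth subclass raises the count to sixteen pairs without any change to total_area itself, which is why polymorphic code tends to need generated or table-driven tests.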

Validation Testing: an old example

Prologue: This is approximately what I learned about testing from a university, so according to Kaner, it should not be useful, or I should not have learned anything from it.

The Unicon test suite attempts to validate, in a general way, the major functions of the Unicon language; it is used by folks who build Unicon from sources, especially those who build it on a new OS platform. The unicon/tests/README file divides the testing into categories as follows:

The sub-directories here contain various test material for
Version 11.0 of Unicon and Version 9.4 of Icon.

	bench		benchmarking suite
	calling		calling C functions from Icon
	general		main test suite
	graphics	tests of graphic features
	preproc		tests of the rtt (not Icon) preprocessor
	samples		sample programs for quick tests
	special		tests of special features

Each subdirectory has a suite of tests and sample data, and a Makefile for building and running tests. The master test/Makefile automates execution of the general and posix tests, which are routinely run on new Unicon builds. The general/ directory contains tests "inherited" from the Icon programming language (50 files, 5K LOC):

augment.icn   collate.icn   gc1.icn      mem01c.icn   prefix.icn  struct.icn
btrees.icn    concord.icn   gc2.icn      mem01x.icn   prepro.icn  tracer.icn
cfuncs.icn    diffwrds.icn  gener.icn    mem02.icn    proto.icn   transmit.icn
checkc.icn    endetab.icn   helloc.icn   mffsol.icn   recent.icn  var.icn
checkfpc.icn  env.icn       hellox.icn   mindfa.icn   recogn.icn  wordcnt.icn
checkfpx.icn  errors.icn    ilib.icn     numeric.icn  roman.icn
checkx.icn    evalx.icn     kross.icn    others.icn   scan.icn
ck.icn        fncs.icn      large.icn    over.icn     sieve.icn
coexpr.icn    fncs1.icn     meander.icn  pdco.icn     string.icn

Some of these tests were introduced when new language features were introduced and may constitute unit tests; many others were introduced when a bug was reported and fixed (and hence are regression tests). A semi-conscious attempt has been made to use pretty much every language feature; thus the test suite forms somewhat of a validation of a Unicon build.

The tests are all run from a script, which looks about like the following. Each test is run from a for-loop, and its output is diff'ed against an expected output. Some differences are expected, such as in the test that prints out the operating system, version, and so forth.

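# $IC is assumed to name the Icon/Unicon compiler command; it is set outside this excerpt.
for F in "$@"; do
   # Strip either suffix, so a test can be named as foo, foo.icn, or foo.std
   F=$(basename "$F" .std)
   F=$(basename "$F" .icn)
   rm -f "$F.out"
   echo "Testing $F"
   # Compile the test; skip it if compilation fails
   $IC -s "$F.icn" || continue
   # Feed the test its data file if one exists, otherwise empty input
   if test -r "$F.dat"
   then
      "./$F" <"$F.dat" >"$F.out" 2>&1
   else
      "./$F" </dev/null >"$F.out" 2>&1
   fi
   # Print any difference between expected and actual output
   diff "$F.std" "$F.out"
   rm -f "$F"
done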

Sample test (diffwrds.icn):

#
#          D I F F E R E N T   W O R D S
#

#  This program lists all the different words in the input text.
#  The definition of a "word" is naive.

procedure main()
   words := set()
   while text := read() do
      text ? while tab(upto(&letters)) do
         insert(words,tab(many(&letters)))
   every write(!sort(words))
end

Sample data file (diffwrds.dat):

procedure main()
   local limit, s, i
   limit := 100
   s := set([])
   every insert(s,1 to limit)
   every member(s,i := 2 to limit) do
      every delete(s,i + i to limit by i)
   primes := sort(s)
   write("There are ",*primes," primes in the first ",limit," integers.")
   write("The primes are:")
   every write(right(!primes,*limit + 1))
end

Sample expected output (diffwrds.std):

The
There
are
by
delete
do
end
every
first
i
in
insert
integers
limit
local
main
member
primes
procedure
right
s
set
sort
the
to
write

What I Have Learned About Testing

Remember, this was in an academic environment, so Kaner would dismiss it.

Application: Develop a Test Strategy

So you have written a set of Java classes that constitute a set of 1+ Eclipse plugin(s). How do you develop a test strategy to find all the bugs?...

More Unit Testing

Beyond "testing a unit", what are the goals of Unit Testing? Can we come up with a more concrete definition? Suppose you want to find "as many bugs as possible" within class X; how do you go about it?

...and this is almost exactly what JUnit does. You write test methods to test other methods, calling them with some artificial environment (parameters, etc.) in which a particular output is expected, and checking whether the output matches or not.
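
The same pattern is sketched below using Python's unittest module, which follows the JUnit design. The Stack class stands in for "class X" and is hypothetical. Each test method sets up an artificial environment, calls a method, and checks the output against what is expected.

```python
import io
import unittest


class Stack:
    """Hypothetical class X under test."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()

    def size(self):
        return len(self._items)


class StackTest(unittest.TestCase):
    def setUp(self):
        # The artificial environment: a fresh stack before every test method.
        self.stack = Stack()

    def test_push_then_pop_returns_last_item(self):
        self.stack.push(1)
        self.stack.push(2)
        self.assertEqual(self.stack.pop(), 2)
        self.assertEqual(self.stack.size(), 1)

    def test_pop_on_empty_stack_raises(self):
        # Boundary condition: the programmer may have assumed a non-empty stack.
        with self.assertRaises(IndexError):
            self.stack.pop()


def run_stack_tests():
    """Run StackTest quietly and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(StackTest)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

The result object returned by run_stack_tests() reports how many tests ran and whether all of them passed, which is the pass/fail judgment every test needs.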

Now, Let's Look at your "Software Testing Homework"

lecture #3 began here

McCabe's Principles

More Coverage Testing

Coverage testing was mentioned briefly last lecture. It is in some sense meta-testing: testing your tests. The paper at http://www.bullseye.com/coverage.html provides a good introduction and defines several forms of coverage testing. You can combine these forms, or perform several of them. Note that although they are phrased here as yes/no questions, a coverage tool doesn't just answer yes/no, or even just give you a percentage: it gives you a percentage and shows in detail each location or case in which the coverage property did not hold.
function coverage
did you execute every function?
call coverage
did you execute every call to every function?
loop coverage
did you execute every loop 0, 1, and >1 times?
relational coverage
did you test every relational operator with equal, lesser-, and greater-than values, in order to avoid off-by-one errors and logic inversions?
data coverage
did every variable actually get referenced? every array element? was every pointer followed?
statement coverage
measure the % of executable lines that were executed by the tests. A good starting point with many weaknesses. For example, not all of a short-circuit boolean expression may have been tested at 100% statement coverage. If an if statement has no else clause, 100% statement coverage does not tell you whether the "then" branch is ever skipped.
decision coverage
evaluate every (outermost level) boolean expression to both true and false. May find things missed in statement coverage, but still doesn't cover short circuit code
condition coverage
check whether individual boolean subexpressions were evaluated to both true and false. Handles short circuits. Does not guarantee decision coverage!
multiple condition coverage
check whether all possible combinations of boolean subexpressions were executed. Scoring well requires a lot of test cases.
condition/decision coverage
union of condition coverage and decision coverage
modified condition/decision coverage
verify every condition can affect the result of its encompassing decision. Short-circuit operators are a known difficulty for this measure. Invented at Boeing; required for certain aviation software.
path coverage
check whether each possible path in each function has been followed. A path is a unique sequence of branches from function entry to exit. Loops introduce an unbounded number of paths. The number of paths is exponential in the number of branches. Some paths may be impossible. Many variations have been introduced to try to develop "practical" path coverage methods.
Other useful articles are "A Buyer's Guide to Code Coverage Terminology" and the related "How to Misuse Code Coverage".
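
The difference between statement, decision, and condition coverage can be seen on a three-line function. In the sketch below, ship_cost is a hypothetical function with an if that has no else and a short-circuit `or`. The `seen` bookkeeping is a hand-written stand-in for what a real coverage tool records automatically; it is not how Clover, Hansel, or any other tool is implemented.

```python
def ship_cost(weight, express, seen):
    """Hypothetical function under test; `seen` records observed outcomes."""
    cost = 5
    heavy = weight > 10
    seen.add(("weight > 10", heavy))
    # `or` short-circuits: `express` is only consulted when heavy is False.
    if not heavy:
        seen.add(("express", bool(express)))
    decision = heavy or express
    seen.add(("decision", bool(decision)))
    if decision:  # an `if` with no `else`
        cost += 10
    return cost


def missing_outcomes(tests):
    """Return the (expression, outcome) pairs that no test produced."""
    seen = set()
    for weight, express in tests:
        ship_cost(weight, express, seen)
    wanted = {(name, value)
              for name in ("weight > 10", "express", "decision")
              for value in (True, False)}
    return wanted - seen
```

The single test (20, False) runs every statement of the original logic, yet the decision is never false and `express` is never evaluated at all. Adding (5, False) makes the decision take both values, which is decision coverage, but `express` has still never been true. Only after adding (5, True) does missing_outcomes return an empty set.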

A couple more words on our sample coverage tools, Clover and Hansel. Clover is a commercial product which works by instrumenting the source code. It does statement and branch coverage, but not most of the other forms of coverage. It might actually be cool. Another commercial coverage tool is JCover, which does more types of coverage tests. There are no doubt dozens of others.

Hansel is an open source extension to JUnit, based on code developed at the University of Oregon. It works with bytecode, not source code. It appears to just do statement coverage. It's not much, but it's free and it's better than nothing.

Apologies to Whittaker

Earlier I indicated Whittaker was from Microsoft. He's not. He's from Florida Tech. He's spoken at Microsoft before. Mea culpa.

Bug hunts

Bug hunts employ sets of test teams consisting of two participants at a computer in a testing lab; plus a referee, a recorder, and a coach. Participants apply attacks and test the application in a direct attempt to break the program. One participant usually drives the application, the other sits back and thinks about the big picture. When the team finds a bug, they ring a bell (a nice tension breaker) which summons the referee and the recorder. The participants reproduce the bug for the referee, who judges whether it is a bug or not. Video recordings are often replayed later as a postgame analysis of technique, explanation, and style.

When to hold bug hunts? Immediately after a major new build or after a new feature is added. Limit the hunt to specific areas of the software, and keep it to a strict time limit (say, two hours). At the end, the referee and coaches should pick the best test team and award prizes.

Bugfests

Earlier we said testing finds bugs, and that finding bugs doesn't by itself solve any problems. How to translate testing into positive gain among your programmers? (a) get the bugs fixed, (b) find ways to educate bug authors so they don't repeat their mistakes.

For (a) the answer is sometimes easy, sometimes hard, but usually an individual enterprise. For (b) you can meet to discuss bugs; Whittaker calls such meetings bugfests. The goal of a bugfest is to learn by discussing bugs.

I believe that bugs are corporate assets and that we can learn a great deal about how to improve our software by studying the very things that make our software need improvement.

Want to know what types of errors your programmers are making? Study their bug reports. Want to know what testing techniques are finding the best bugs? Study bug reports. Want to get new testers testing like your best veterans? Make them study bug reports.

Studying bug reports is boring, you say? The report doesn't contain the insights you really need? I agree. That's exactly why we instituted bugfests.

Whittaker holds bugfests weekly at a local pizza parlor within walking distance of his office. Because the emphasis is on describing the bug well enough to be worth a conversation, testers tend to study the bug harder, make a careful note of how to reproduce it, etc. Testers bring laptops and demo the bugs for developers. After the demo, you get inside the head of the tester to find out how they came up with the test case (helping other testers learn), and analyze the bug: what is wrong, why is it wrong, how to fix it, and does it occur in other places in the code (helping developers avoid repeating the mistake).