Unit Testing: Black Box and White Box

This page is my own shorter description of a few of the basic unit testing ideas found in the paper Software unit test coverage and adequacy, by Zhu et al., ACM Computing Surveys 29(4), pp 366-427, 1997. Although old, this paper is a good foundational survey of unit testing ideas.

Unit Testing

The Internet has myriad descriptions of unit testing. Basically we are testing an isolated software component apart from a whole system. The test needs a driver, often a framework like JUnit, and may need stubs (or mocks) to provide context.

Test, Test Set, and Test Suite

A test is an execution of the unit on a specific input in a specific environment, checking the output against a known expected output. Sometimes students will, when asked for a test, say “I’ll test an input value less than 10.” That is not a test! It is a description of a class of tests. A test would be “I’ll test an input value of 7, and expect an output of 21.”

Where does the known expected output come from? The unit’s specifications. Even if they are not fully written down, the expected output is based on what we understand that the unit should do. A test passes if the observed output matches the expected output.
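For example, here is what that single test might look like as a JUnit 5 test method, assuming a hypothetical unit Tripler whose specification says triple(x) returns 3 * x (the class and method names are invented for illustration):

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    class TriplerTest {

        // One test: a specific input (7) and a specific expected output (21),
        // derived from the specification "triple(x) returns 3 * x".
        // Tripler is a hypothetical unit used only for illustration.
        @Test
        void tripleOfSevenIsTwentyOne() {
            assertEquals(21, new Tripler().triple(7));
        }
    }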

A test set is a set of tests. Suite is an older term that is synonymous with set. I will try to always say “test set”, but sometimes will lapse into “test suite”.

Test Set Adequacy

The main question in testing is: Did we test our software well enough? The other way to say this is: Is our test set adequate, or do we need to create more tests?

In other words, when do we stop creating more tests?

All of this assumes that:

  • we have created a test set;
  • the test set is automated (we can rerun them); and
  • all the tests in the test set pass!

The last one is important: when thinking about test set adequacy and when to stop creating tests, we are assuming all tests are passing. If we have any failing tests, we need to fix the code so that they pass!

Note that we do not talk about the adequacy of a single test, but always about the adequacy of a test set.

Abstract Metric of Test Set Adequacy

To decide whether a test set is adequate or not, we need to have a metric of adequacy. This page will present several metrics of adequacy. But for now, think of a metric as a function:

ADQ(unit,spec,testSet) -> [0,1]

That is, ADQ is a function that takes a unit, a specification, and a test set, and maps them to a metric value between 0 and 1, where 1 is perfectly adequate.

When we get to specific adequacy metrics, we will see that test set adequacy is generally thought of as coverage of some code or problem concept. It is not bad to think: adequacy is coverage.

We can then specify a stopping criterion as a value C between 0 and 1.

If our test set measures at C or above, we decide that the test set is adequate, and we stop creating more tests. If it measures below C, we need to create more tests and add them to the test set, run the tests, and re-measure ADQ.
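As a minimal sketch of that decision (assuming a coverage-style metric that counts "covered items" out of a total; the counts below are invented), the stopping rule looks like this:

    // Illustrative sketch only: the counts are made up. ADQ here is simply
    // the fraction of covered items, whatever the chosen metric counts.
    public class StoppingCriterion {
        public static void main(String[] args) {
            double C = 0.95;                        // chosen stopping criterion
            int covered = 181;                      // items covered by the test set
            int total = 200;                        // total items the metric counts
            double adq = (double) covered / total;  // ADQ value in [0, 1]

            if (adq >= C) {
                System.out.println("Adequate (" + adq + "): stop creating tests.");
            } else {
                System.out.println("Not adequate (" + adq + "): create more tests, rerun, re-measure.");
            }
        }
    }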

Remember, we always assume that we have a test set where all tests are passing!

White Box Code-Based Adequacy

In white box testing, we look at and consider the source code of the unit being tested. Sometimes this is called clear box testing.

Statement Coverage

The most common adequacy metric for white box testing is statement coverage: the fraction of statements that are executed by the test set. This metric does not count how many times a statement is executed, but is binary: did any test in the test set execute the statement at least once?

It would be wonderful to set a stopping criterion at 1 (all code must be executed by at least one test), but for real systems with exception handlers and other hard-to-reach code, it is often unrealistic. Usually something high, like 0.95, is more realistic. In fact, whether or not a statement is feasibly executable in any execution is a known undecidable problem!

How do we measure statement coverage? With tools, of course! JaCoCo is an open-source coverage-measuring tool for Java. These tools work by constructing the Control Flow Graph of the code and then instrumenting the basic blocks and edges.

It is crazy to automate testing without also measuring coverage!

Statement coverage is equivalent to basic block coverage, so the tool only needs to instrument basic blocks to measure statement coverage (although it has to map it back to source code lines).

Branch Coverage

Even in a pretty simple control flow graph, it is possible to execute all basic blocks but not traverse all edges.

Edges capture the branches in the code, and so a stronger white box adequacy metric is edge (or branch) coverage. Edge coverage subsumes basic block coverage: if you cover all edges, you will cover all basic blocks. But the reverse is not always true.
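A small hypothetical example of the gap: in the method below, the single test clamp(150) executes every statement, yet it never takes the false edge of the if (the path that skips the body), so branch coverage is incomplete.

    // Hypothetical example. The one test clamp(150) executes every statement,
    // but the edge where the condition is false is never traversed.
    public class Limiter {
        public static int clamp(int x) {
            if (x > 100) {
                x = 100;        // executed by clamp(150)
            }
            return x;           // executed by clamp(150)
        }
    }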

The place where branch coverage gets tricky is when we have compound conditions in our code, with logical AND and OR operators. In virtually all programming languages, these operators are short-circuited, meaning the right-hand side is not evaluated if the left-hand side already determines the outcome. To fully cover all edges out of such a condition, we must cause each atomic condition to produce a true and false result in some test.
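For instance, in this hypothetical method the compound condition short-circuits: when age >= 18 is false, hasLicense is never evaluated. To cover every branch outcome, the test set must make each atomic condition produce both true and false.

    // Hypothetical example of a short-circuited compound condition.
    // To exercise all edges we need tests where:
    //   age >= 18 is false                   (hasLicense never evaluated)
    //   age >= 18 is true, hasLicense false
    //   age >= 18 is true, hasLicense true
    public class Driving {
        public static boolean mayDrive(int age, boolean hasLicense) {
            if (age >= 18 && hasLicense) {
                return true;
            }
            return false;
        }
    }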

Tools like JaCoCo also measure edge (branch) coverage, and so we could also choose a stopping criterion for branch coverage.

Path Coverage

Statement and branch coverage take a very atomic view of the code, treating each individual piece by itself. But code interacts with itself, and what really matters is the entire path through the code!

Unfortunately, the number of possible paths through code explodes astronomically (literally!). Every branch doubles the possible number of paths through code, so if there are B branches in a unit, there are 2^B unique paths through the code. Scientists estimate that the universe has about 10^80 atoms, which is roughly 2^270. So code with 270 branches has a number of unique paths comparable to the number of atoms in the universe.
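To see the doubling, consider this hypothetical method with three independent branches in sequence: there are 2 * 2 * 2 = 8 distinct paths through it, and every additional if doubles that number.

    // Hypothetical example: three sequential, independent branches.
    // Each if contributes a true and a false outcome, so there are
    // 2 * 2 * 2 = 8 distinct paths through this one small method.
    public class Pricing {
        public static int price(int base, boolean member, boolean coupon, boolean holiday) {
            int p = base;
            if (member)  { p -= 5; }
            if (coupon)  { p -= 3; }
            if (holiday) { p += 2; }
            return p;
        }
    }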

Now, this is theoretical, and many branches in code interact with each other to limit the feasible paths. On the other hand, the analysis above did not take into account the fact that many loops in code do not have a built-in bound, so most real programs have a theoretically infinite number of paths through them.

In reality, we just cannot think about testing at the path level. Some have tried to simplify it with ideas such as Cyclomatic Complexity, but we mostly just use statement and branch coverage.

Black Box Specification-Based Adequacy

In black box testing we are not allowed to consider the code of the unit under test. It is a “black box”.

In this description we are only going to talk about Equivalence Partitioning and Boundary Value Analysis. The main point to remember is that measuring adequacy is still about coverage, but it is coverage of concepts in the specification rather than in the code.

Equivalence Partitioning

The basic idea is simple: using your understanding of the unit’s specification, partition the entire input space of the unit into equivalence classes, where all inputs in an equivalence class elicit the same behavior in the unit (though each may have its own output expectation). Note that this must be a true partitioning: each possible input is in exactly one equivalence class. This includes “invalid” inputs: e.g., if a Java method takes an integer parameter, and the specification only defines “valid” behavior for positive values, then negative values must be included in some partition(s).

With complex data, partitioning could get difficult, but the idea is simple.

What is test set adequacy for equivalence partitioning? It is the fraction of partitions that have at least one test in them. For this, we really would want a stopping criterion of 1: at the minimum, we should have at least one test in every equivalence partition. It would be crazy to skip any partition entirely!

Indeed, as we go further, we change this rule slightly: we should have at least one average test in each partition. It should be somewhere “in the middle” of the partition.
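As a hypothetical illustration, suppose a unit Grader.grade(int score) is specified as: scores 0 to 59 yield "F", 60 to 79 yield "C", 80 to 100 yield "A", and anything outside 0 to 100 throws IllegalArgumentException. That gives five partitions (three valid, two invalid), and one average (mid-partition) test per partition might look like this sketch (Grader is invented for this example):

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import static org.junit.jupiter.api.Assertions.assertThrows;
    import org.junit.jupiter.api.Test;

    // One "average" test per equivalence partition of the hypothetical Grader.
    class GraderPartitionTest {

        @Test void invalidLow()  { assertThrows(IllegalArgumentException.class, () -> Grader.grade(-40)); }
        @Test void failing()     { assertEquals("F", Grader.grade(30)); }   // middle of 0..59
        @Test void average()     { assertEquals("C", Grader.grade(70)); }   // middle of 60..79
        @Test void excellent()   { assertEquals("A", Grader.grade(90)); }   // middle of 80..100
        @Test void invalidHigh() { assertThrows(IllegalArgumentException.class, () -> Grader.grade(140)); }
    }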

Boundary Value Analysis

An average test in an equivalence partition tests whether or not the unit has the correct computation for that partition. But it is well known that most programming mistakes are not mistakes in computation, but mistakes in boundaries: that is, we write code that misplaces the boundaries between equivalence partitions. This is captured in the “off by one” concept of coding mistakes.

So, once we define our equivalence partitions and test one average case in each partition, we should heavily test the boundaries.

Note that a boundary is itself inside one equivalence partition. E.g., if my unit has an integer input named V and partition A is defined as V<5 and partition B is defined as V>=5, then the boundary is V==5 and it is in partition B.
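For instance, with the hypothetical Grader from the equivalence partitioning sketch above, the boundary between the "F" and "C" partitions is at score 60, and the boundary value itself lies in the "C" partition. Boundary tests would sit on the boundary and as close as possible on the other side of it:

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    // Hypothetical boundary tests around the F/C boundary of Grader.grade.
    class GraderBoundaryTest {
        @Test void lastFailingScore()  { assertEquals("F", Grader.grade(59)); } // just below the boundary, in F
        @Test void firstPassingScore() { assertEquals("C", Grader.grade(60)); } // the boundary value itself, in C
    }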

One adequacy criterion defined for boundary value analysis is: Nx1 domain adequacy: if a unit has N input variables, then for each equivalence partition P and each boundary B, create N tests on the boundary and one test off the boundary. The off test must be as close to the boundary as possible but in the partition that does not contain the boundary. More details can be added, such as: you should “space out” the N tests as much as possible. If the input is numerical and the boundary is linear, it takes N points in N-dimensional space to establish linearity. The off test would detect a parallel linear shift in the boundary.

A stronger adequacy criterion is NxN domain adequacy: here we must have, for a unit with N inputs, N tests on the boundary and N tests off the boundary (as close as possible to the boundary, in the other partition, and spread out). If the input is numerical and the boundary is linear, this would also detect a rotation of the boundary.
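A small hypothetical illustration with N = 2 inputs: suppose Shipping.fee(w, s) is specified to return 0 when w + s < 10 (partition P) and 7 when w + s >= 10 (partition Q), so the boundary is the line w + s = 10 and belongs to Q. An Nx1 test set places two spread-out tests on the line and one test just off it, inside P (Shipping is invented for this sketch):

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    // Nx1 domain tests for the hypothetical linear boundary w + s == 10.
    class ShippingNx1Test {
        // N = 2 inputs, so: 2 on-boundary tests, spread out, plus 1 off-boundary
        // test as close as possible to the line but inside partition P.
        @Test void onBoundarySpreadOut1() { assertEquals(7, Shipping.fee(0, 10)); }
        @Test void onBoundarySpreadOut2() { assertEquals(7, Shipping.fee(10, 0)); }
        @Test void justOffBoundaryInP()   { assertEquals(0, Shipping.fee(5, 4)); } // 5 + 4 = 9 < 10
    }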

Another good view of equivalence partition boundaries is that of vertices, where boundaries come together and multiple partitions touch. A vertex is a unique point in the input space and so defines a specific test itself. So we can define Vertex domain adequacy: for each boundary vertex V where N partitions meet, have N tests: one test on the vertex and N-1 tests off the vertex, one in each of the equivalence partitions that does not contain the vertex, each as close as possible to the vertex. (TODO: This should be extended to have more than one test in each partition, depending on the number of inputs to the unit.)

Black Box Summary

  1. Figure out your equivalence partitions;
  2. Create one average test for each partition;
  3. Create many more tests that heavily test the boundaries;
  4. Pay attention to vertices, where multiple partitions meet.

Further White Box Coverage Ideas

The initial presentation centered on code coverage, and this is the most common view of white box testing. However, it is worthwhile to consider data coverage.

The data view centers on data flow ideas: where is data created, and where is it used?

For this we have two foundational concepts: a definition (or def) of a variable, which is a place in the code where a new value is assigned to the variable, and a use of a variable, a place in the code where a variable is read and used in an expression. Note that the concept of a definition is not just the variable’s declaration, but rather each place in the code where a new value is assigned to it.

A def-use pair for a variable is a two-tuple of code line numbers, where the first number is a definition of the variable, and the second is a use of the variable. If we list all definitions of a variable and then list all uses, we can simply combine them in a Cartesian product to produce all def-use pairs for that variable.

A test covers a def-use pair for a variable if, during the test, a value produced by the definition reaches and is used by the use.

An easy idea for test set adequacy then is: for each variable, measure the fraction of def-use pairs covered by the test set.
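A hypothetical illustration, with the defs and uses of x marked in comments; covering the (def, use) pairs of x is what the metric measures:

    // Hypothetical example. Defs and uses of x:
    //   d1: definition of x    (x = a)
    //   d2: definition of x    (x = -a)
    //   u1: use of x           (return x * 2)
    // Def-use pairs for x: (d1, u1) and (d2, u1).
    // A test with a >= 0 covers (d1, u1); a test with a < 0 covers (d2, u1).
    public class DefUseExample {
        public static int doubled(int a) {
            int x = a;          // d1: definition of x
            if (a < 0) {
                x = -a;         // d2: definition of x (overwrites d1 on this path)
            }
            return x * 2;       // u1: use of x
        }
    }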

But there’s a problem: in most programs, a very large fraction of def-use pairs are infeasible: that is, no possible execution can cover them.

Because of this, data flow coverage is not really used in industry for test adequacy metrics. However, data flow ideas are very central to many static analysis techniques, for both verification and security analyses.

Further adequacy criteria have been defined, such as:

  1. All definitions: for every definition of a variable that has at least one feasibly reachable use, have at least one test that covers a def-use pair starting at that definition.
  2. All uses: for all definitions and all uses of a variable, if the def-use pair is feasible, have a test that covers it.
  3. All D-U paths: for every unique simple path on which a definition reaches a use, have a test that covers the def-use on that path. Simple means cycle-free or at most one cycle.

More Data Flow Complications

A further issue is that we should recognize that feasibility of a def-use chain is essentially the same as feasibility of executing a statement, and so it is an undecidable problem, in general. That does not mean it cannot be decided for a given program, but in general we cannot expect to compute it for large programs.

Another issue is programs that use pointers or references, especially in programming languages that are not very safe with pointers, such as C and C++. In a language with pointers and references, a statement can be a definition or a use of some variable without ever referring to that variable by name! If a pointer can potentially point to a variable, then any statement that dereferences that pointer might be a definition or use of that variable. This makes data flow analysis terribly complicated!
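The same effect shows up with references. In this minimal hypothetical Java sketch, the assignment through b is, in data flow terms, a definition of the array element that is later read through a, even though the statement never names a:

    // Hypothetical aliasing example: a and b refer to the same array object.
    public class Aliasing {
        public static int example() {
            int[] a = new int[1];
            int[] b = a;        // b now aliases a
            b[0] = 42;          // a definition of a[0], though "a" is never named here
            return a[0];        // a use reached by the definition above
        }
    }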

Data flow analysis of programs with pointers and references requires a points-to analysis, to decide what can point to where. In general this is, again, undecidable, but many researchers have created many algorithms that try to do a good approximate calculation. Usually the approximation is guaranteed to be safe, meaning all true possible points-to’s are included, but false positives may also be included (and probably are). Indeed, in a C program with a bit of pointer arithmetic, an analysis may have to reach the conclusion that all pointers can point to anything.

Unit Testing Practicalities

Coming…