Introduction to Software Engineering
Lectures 7 and 8 - Software verification and validation
Jim Briggs, 30 November 1998
Verification and validation
See Sommerville (5th ed.) chaps 21-23.
Check at each stage of the process using documents produced during previous stage. Do it early to catch problems early.
Two types of V&V
Neither is sufficient by itself.
Dynamic testing is the "traditional" approach, but static techniques are becoming more sophisticated. Some people believe static techniques make testing unnecessary, but this is not a widely held view.
Testing involves executing the program (or part of it) using sample data and inferring from the output whether the software performs correctly or not. This can be done either during module development (unit testing) or when several modules are combined (system testing).
Defect testing is testing for situations where the program does not meet its functional specification. Performance testing (aka statistical testing) tests a system's performance or reliability under realistic loads. This may go some way to ensuring that the program meets its non-functional requirements.
Debugging is a cycle of detection, location, repair and test. Debugging is a hypothesis testing process. When a bug is detected, the tester must form a hypothesis about the cause and location of the bug. Further examination of the execution of the program (possibly including many reruns of it) will usually take place to confirm the hypothesis. If the hypothesis is demonstrated to be incorrect, a new hypothesis must be formed. Debugging tools that show the state of the program are useful for this, but inserting print statements is often the only approach. Experienced debuggers use their knowledge of common and/or obscure bugs to facilitate the hypothesis testing process.
After fixing a bug, the system must be retested to ensure that the fix has worked and that no other bugs have been introduced. This is called regression testing. In principle, all tests should be performed again but this is often too expensive to do.
Best testing process is to test each subsystem separately. Best done during implementation. Best done after small sub-steps of the implementation rather than large chunks. Incremental process.
Once each lowest level unit has been tested, units should be combined with related units and retested in combination. This should proceed hierarchically bottom-up until the entire system is tested as a whole.
Typical levels of testing:
Alpha testing is acceptance testing with a single client (common for bespoke systems).
Beta testing involves distributing system to potential customers to use and provide feedback. Exposes system to situations and errors that might not be anticipated by the developers.
Often large proportion of development budget spent on testing. Needs to be planned to be cost effective.
Planning is setting out standards for tests. Test plans set out the context in which individual engineers can place their own work.
Typical test plan contents:
Prepare plan at same time as requirements are being analysed. Develop as design proceeds. Relate to units, modules and sub-systems in design. Revise as needed.
Figure 1 Davis's symmetrical waterfall model (Davis, p10) [From Lecture 2]
Who is responsible for doing testing?
Large systems usually tested using a mixture of strategies. Different strategies may be needed for different parts of the system or at different stages of the process. Whatever strategy is used, do testing incrementally.
Top-down testing tests the high levels of the system before its detailed components. Sub-components are represented by stubs with the same interface but limited functionality (e.g. printf("function X called\n");)
Appropriate when developing the system top-down. Likely to show up structural design errors early (and therefore cheaply). Has advantage that a limited, working system available early on. Psychologically valuable? Validation (as distinct from verification) can begin early.
Disadvantage is that stubs need to be generated (extra effort) and might be impracticable if component is complex (e.g. converting an array into a linked list; unrealistic to generate random list; therefore end up implementing unit anyway). Test output may be difficult to observe (needs creation of artificial environment). Not appropriate for OO systems (except within a class).
Bottom-up testing is the opposite of top-down testing. Test low-level units then work up the hierarchy. Advantages and disadvantages of bottom-up mirror those of top-down.
Need to write test drivers for each unit. These are as reusable as the unit itself.
Combining top-down development with bottom-up testing means that all parts of system must be implemented before testing can begin, therefore does not accord with incremental approach recommended above.
Bottom-up testing less likely to reveal architectural faults early on. However, bottom-up testing of critical low-level components is almost always necessary. Appropriate for OO systems.
Event-based approach suitable for real-time and OO systems. Aka transaction-flow testing.
Usually performed on sub-systems rather than units/modules. Generate event and follow it through the thread of modules and units called by its invocation. Threads may be associated with combinations of input events as well as individual ones. [In CORE methodology, CVMs may represent threads].
For completeness, must identify every possible "thread" (but may be impracticable to do). Often reduced to testing common or critical threads. Multiple event testing is also required to ensure full coverage.
Test system's ability to cope with a specified load (e.g. transactions per second).
Plan tests to increase load incrementally. Go beyond the design limit until the system fails (this tests that failure behaviour is fail-safe and may cause defects to come to light).
Particularly important for distributed systems (check degradation as network exchanges data).
Comparison of test results from different versions of the system (e.g. compare with prototype, previous version or different configuration).
Process - Run first system, saving test case results. Run second system, also saving its results. Compare results files.
Note that no differences doesn't imply no bugs. Both systems may have made the same mistake.
A successful defect test is a test that causes the system to behave incorrectly.
Defect testing is not intended to show that a program meets its specification. If tests don't show up defects it may mean that the tests are not exhaustive enough. Exhaustive testing is not always practicable. Subset has to be defined (this should be part of the test plan, not left to the individual programmer). Possible methods:
Three approaches to defect testing. Each is most appropriate to different types of component.
Studies have shown that black-box testing is more effective in discovering faults than white-box testing. However, the rate of fault detection (faults detected per unit time) was similar for each approach. The same studies showed that static code reviewing was more effective and less expensive than defect testing. Sommerville predicts that defect testing will gradually be replaced by program inspections and code reviews.
Black-box (functional) testing
Testing against specification of system or component. Study it by examining its inputs and related outputs.
Key is to devise inputs that have a higher likelihood of causing outputs that reveal the presence of defects. Use experience and knowledge of domain to identify such test cases. Failing this a systematic approach may be necessary.
Equivalence partitioning is based on the idea that the input to a program falls into a number of classes, e.g. positive numbers vs. negative numbers. Programs normally behave the same way for each member of a class. Partitions exist for both input and output. Partitions may be disjoint or may overlap. Invalid data (i.e. outside the normal partitions) forms one or more partitions that should also be tested.
Test cases are chosen to exercise each partition. Also test boundary cases (atypical, extreme, zero) since these frequently show up defects. For completeness, test all combinations of partitions.
Sommerville's example: a program accepts four to ten inputs which are 5-digit integers greater than 10,000. Partitions are: (for number of input values) less than 4 inputs, 4-10 inputs, more than 10 inputs; (for input values themselves) less than 10000, between 10000 and 99999, more than 99999. Test cases could be: (for number of input values) 3 (invalid boundary), 4 (valid boundary), 7 (typical), 10 (valid boundary), 11 (invalid boundary); (for input values) 9999, 10000, 50000, 99999, 100000.
Useful guidelines for testing arrays, strings, lists, etc.:
Black-box testing is rarely exhaustive (because one doesn't test every value in an equivalence partition) and sometimes fails to reveal defects caused by "weird" combinations of inputs.
Black-box testing should not be used to try to reveal corruption defects caused, for example, by assigning a pointer to point to an object of the wrong type. Static inspection (or using a better programming language!) is preferable for this.
White-box (structural) testing
Testing based on knowledge of structure of component (e.g. by looking at source code). Advantage is that structure of code can be used to find out how many test cases need to be performed. Knowledge of the algorithm (examination of the code) can be used to identify the equivalence partitions.
Path testing is where the tester aims to exercise every independent execution path through the component. All conditional statements are tested for both true and false cases. If a unit has n control statements, there will be up to 2^n possible paths through it. This demonstrates that it is much easier to test small program units than large ones.
Flow graphs are a pictorial representation of the paths of control through a program (ignoring assignments, procedure calls and I/O statements). Use flow graph to design test cases that execute each path. Static tools may be used to make this easier in programs that have a complex branching structure.
Tool support. Dynamic program analysers instrument a program with additional code. Typically this will count how many times each statement is executed. At end, print out report showing which statements have and have not been executed.
Problems with flow graph derived testing:
Interface testing is usually done at the integration stage when modules or sub-systems are combined. Objective is to detect errors or invalid assumptions about interfaces between modules. Reason these are not shown up in unit testing is that the test case may perpetuate the same incorrect assumption made by the module designer. Particularly important when OO development has been used.
Four types of interface:
Three common kinds of interface error:
Common manifestations are when each unit assumes the other one is checking for invalid data (failure to check return status) and the consequences of when such a fault is propagated to other units.
Guidelines for interface testing:
Many interface errors will be shown up statically by using strongly typed languages like Ada. C/C++ are notoriously expensive to develop in because they give relatively little support for interface testing.
Problem with defect testing is that each test reveals only one fault (or a small number of faults). Faults are difficult to locate even when detected.
Pioneer work in static testing done by Fagan at IBM in 1970s, 1980s. He reckoned that 60% of faults could be detected using informal program inspections. Another study (Mills et al) reckoned 90% if using formal techniques (mathematical verification, e.g. pre/post condition analysis). Static inspections can also be used to check compliance with code quality standards and portability, maintainability, etc.
Static verification cannot completely replace testing, but if it significantly reduces the number of defects left to be found, it may be very cost effective. It cannot replace testing for reliability assessment, performance analysis, user interface validation or validating requirements. It is also applicable to other outputs of the software engineering process such as requirements documents, specifications, user documentation, test plans, etc.
Principal objective is to detect defects by looking at code.
Need to have a checklist of likely types of error to look for (updated periodically). Static verification will "front-load" development costs because it is carried out earlier in each stage of the lifecycle than testing. However should result in overall costs being lower. Culture must be that inspections are part of the verification process, not personnel appraisals. Need to train inspection team leaders.
Conducted by small team. Fagan proposed four roles: author, reader, tester, moderator. Reader reads the code out loud. Tester looks at it from testing perspective. Moderator organises process and keeps order. Grady and Van Slack (Hewlett-Packard) proposed six roles: author/owner, inspector (finds errors), reader (optional), scribe (takes notes), moderator and chief moderator (responsible for improvements to inspection process).
Members of team should see code in advance of meeting so they have time to prepare, including looking for defects.
Inspection should identify defects but leave to author to decide how best to fix them. A further inspection may or may not be required after that. Inspection should not recommend changes to other components.
Typical faults to look for (which should make up the inspection checklist):
As organisation gains experience in use of inspection, analysis of defects detected can be used to improve process.
Efficiency. Fagan reports about 90-125 statements per hour can be inspected (max. two hours, thereafter efficiency drops). Therefore do frequently on small components rather than infrequently on large ones. With four people involved, cost of inspecting code is approx. 1 person-day per 100 lines (including preparation time).
Mathematically based verification
Prove, using mathematical arguments, that program meets its specification.
Semantics of programming language must be formally defined (few languages are). Program must be specified in a notation that is consistent with the mathematical verification techniques being used. Developing mathematical proofs is time-consuming and expensive. Need complex theorem provers to support process.
Tends to be applied to language subsets (e.g. without pointers). SPARK (see Barnes's book) is such a subset of Ada. Tends to be concentrated on critical sections of program.
Rigorous techniques (hybrid of formal and inspection) often used as a compromise.
Most common approach is the axiomatic approach. Program contains assertions about state of system at that point. Verifier checks that code to be executed does not violate these assertions. (Can be done dynamically as well.)
Static analysis tools
Some tools that are available:
Cleanroom software development
Philosophy based on static verification techniques (Mills et al) to ensure "zero defects".
Relies on strict inspection process.
Typically uses formal specification, incremental development, structured programming, static verification and statistical testing.
Separate specification, development and certification teams.
Results show not significantly more expensive than conventional development.