Introduction to Software Engineering

Lectures 7 and 8 - Software verification and validation

Jim Briggs, 30 November 1998

Verification and validation

See Sommerville (5th ed.) chaps 21-23.

Definitions (Boehm):

Verification: Are we building the product right? (i.e. does it meet the requirements specification?)
Validation: Are we building the right product? (i.e. does the requirements specification describe what the customer wants?)

Check at each stage of the process using documents produced during previous stage. Do it early to catch problems early.

Two types of V&V

Static. Analysis and checking of documents (including source code). Performed at all stages of the process.
Dynamic. Exercise the implementation ("testing"). Obviously only performed when executable available (which might be prototype).

Neither is sufficient by itself.

Dynamic testing is the "traditional" approach, but static techniques are becoming more sophisticated. Some people believe static techniques make testing unnecessary, but this is not a widely held view.

Other terminology

Testing involves executing the program (or part of it) using sample data and inferring from the output whether the software performs correctly or not. This can be done either during module development (unit testing) or when several modules are combined (system testing).

Defect testing is testing for situations where the program does not meet its functional specification. Performance testing (aka statistical testing) tests a system's performance or reliability under realistic loads. This may go some way to ensuring that the program meets its non-functional requirements.

Debugging is a cycle of detection, location, repair and test. Debugging is a hypothesis testing process. When a bug is detected, the tester must form a hypothesis about the cause and location of the bug. Further examination of the execution of the program (possibly including many reruns of it) will usually take place to confirm the hypothesis. If the hypothesis is demonstrated to be incorrect, a new hypothesis must be formed. Debugging tools that show the state of the program are useful for this, but inserting print statements is often the only approach. Experienced debuggers use their knowledge of common and/or obscure bugs to facilitate the hypothesis testing process.

After fixing a bug, the system must be retested to ensure that the fix has worked and that no other bugs have been introduced. This is called regression testing. In principle, all tests should be performed again but this is often too expensive to do.

Testing process

Best testing process is to test each subsystem separately. Best done during implementation. Best done after small sub-steps of the implementation rather than large chunks. Incremental process.

Once each lowest level unit has been tested, units should be combined with related units and retested in combination. This should proceed hierarchically bottom-up until the entire system is tested as a whole.

Typical levels of testing:

unit - procedure, function, method
module - package, abstract data type, class
sub-system - collection of related modules, cluster of classes, method-message paths, thread testing
system - whole system
acceptance testing - whole system with real data (involve customer, user, etc.)

Alpha testing is acceptance testing with a single client (common for bespoke systems).

Beta testing involves distributing system to potential customers to use and provide feedback. Exposes system to situations and errors that might not be anticipated by the developers.

Test planning

Often large proportion of development budget spent on testing. Needs to be planned to be cost effective.

Test planning means setting out standards for the testing process. Test plans set out the context in which individual engineers can place their own work.

Typical test plan contents:

Overview of testing process
Requirements traceability (to ensure that all requirements are tested)
List of items to be tested
Schedule
Recording procedures so that test results can be audited
Hardware and software requirements
Constraints
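
The requirements traceability item can be checked mechanically. A minimal sketch in Python, assuming hypothetical requirement identifiers (R1 to R3) and test names; the only point is that every requirement should be covered by at least one test.

```python
# Hypothetical requirement identifiers and test names, for illustration only.
REQUIREMENTS = ["R1", "R2", "R3"]
TESTS = {
    "test_login": ["R1"],
    "test_logout": ["R1", "R2"],
}


def untested_requirements(requirements, tests):
    """Return the requirements that no test claims to cover."""
    covered = {req for reqs in tests.values() for req in reqs}
    return [req for req in requirements if req not in covered]


missing = untested_requirements(REQUIREMENTS, TESTS)
print("Requirements without tests:", missing)
```

With this data R3 is reported, which would prompt either a new test or a revision of the plan.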

Prepare plan at same time as requirements are being analysed. Develop as design proceeds. Relate to units, modules and sub-systems in design. Revise as needed.

Figure 1 Davis's symmetrical waterfall model (Davis, p10) [From Lecture 2]

Who is responsible for doing testing?

Unit and module testing often done by programmers who develop them. However they may have "blind spots" about their work and units should be re-tested by someone else. This could be another member of the same team (may be too close to the work or the original programmer) or an independent tester. Should the retester use different test cases?
Sub-system and system testing often done by independent team of testers (QA dept.?)

Testing strategies

Large systems usually tested using a mixture of strategies. Different strategies may be needed for different parts of the system or at different stages of the process. Whatever strategy is used, do testing incrementally.

Top-down testing

Tests high levels of system before detailed components. Sub-components represented by stubs with same interface but limited functionality (e.g. printf("function X called\n");)
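
A minimal sketch of a stub in Python, equivalent in spirit to the printf example. The names report_total and fetch_prices_stub and the canned price are hypothetical; the stub has the same interface as the real sub-component would have, but only limited functionality.

```python
def fetch_prices_stub(item_ids):
    """Stub: same interface as the (hypothetical) real lookup, canned result."""
    print("fetch_prices called")  # plays the role of the printf in the text
    return {item_id: 1.0 for item_id in item_ids}


def report_total(item_ids, fetch_prices):
    """High-level unit under test; the sub-component is passed in."""
    prices = fetch_prices(item_ids)
    return sum(prices[item_id] for item_id in item_ids)


total = report_total(["a", "b", "c"], fetch_prices_stub)
print("total:", total)
```

Passing the sub-component in as a parameter is one possible design choice; it lets the real implementation replace the stub later without changing report_total.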

Appropriate when developing the system top-down. Likely to show up structural design errors early (and therefore cheaply). Has advantage that a limited, working system available early on. Psychologically valuable? Validation (as distinct from verification) can begin early.

Disadvantage is that stubs need to be generated (extra effort) and might be impracticable if component is complex (e.g. converting an array into a linked list; unrealistic to generate random list; therefore end up implementing unit anyway). Test output may be difficult to observe (needs creation of artificial environment). Not appropriate for OO systems (except within a class).

Bottom-up testing

Opposite of top-down testing. Test low-level units then work up hierarchy. Advantages and disadvantages of bottom-up mirror those of top-down.

Need to write test drivers for each unit. These are as reusable as the unit itself.
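
A test driver can be very small. The sketch below assumes a hypothetical low-level unit, clamp, and a table of hand-chosen cases; it is an illustration of the idea rather than a prescribed format.

```python
def clamp(value, low, high):
    """Low-level unit under test (hypothetical example)."""
    return max(low, min(value, high))


def run_driver(unit, cases):
    """Test driver: call the unit with each input and collect mismatches."""
    failures = []
    for args, expected in cases:
        actual = unit(*args)
        if actual != expected:
            failures.append((args, expected, actual))
    return failures


CLAMP_CASES = [
    ((5, 0, 10), 5),    # inside the range
    ((-3, 0, 10), 0),   # below the range
    ((42, 0, 10), 10),  # above the range
]

failures = run_driver(clamp, CLAMP_CASES)
print("failures:", failures)
```

Because the driver and its case table are kept with the unit, they can be rerun unchanged whenever the unit is reused or modified.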

Combining top-down development with bottom-up testing means that all parts of system must be implemented before testing can begin, therefore does not accord with incremental approach recommended above.

Bottom-up testing less likely to reveal architectural faults early on. However, bottom-up testing of critical low-level components is almost always necessary. Appropriate for OO systems.

Thread testing

Event-based approach suitable for real-time and OO systems. Aka transaction-flow testing.

Usually performed on sub-systems rather than units/modules. Generate event and follow it through the thread of modules and units called by its invocation. Threads may be associated with combinations of input events as well as individual ones. [In CORE methodology, CVMs may represent threads].

For completeness, must identify every possible "thread" (but may be impracticable to do). Often reduced to testing common or critical threads. Multiple event testing is also required to ensure full coverage.
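
A sketch of following one event through its thread, assuming a hypothetical three-unit sub-system (validate, reserve_stock, send_invoice) and a single "order_placed" event. Each unit records its name so the test can check which units the event passed through, and in what order.

```python
trace = []  # names of the units visited by the current event


def validate(order):
    trace.append("validate")
    return reserve_stock(order)


def reserve_stock(order):
    trace.append("reserve_stock")
    return send_invoice(order)


def send_invoice(order):
    trace.append("send_invoice")
    return "invoiced:" + order


def handle_event(event, payload):
    """Dispatch an input event to the first unit of its thread."""
    handlers = {"order_placed": validate}
    trace.clear()
    return handlers[event](payload)


result = handle_event("order_placed", "order-1")
print(result, trace)
```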

Stress testing

Test system's ability to cope with a specified load (e.g. transactions per second).

Plan tests to increase load incrementally. Go beyond the design limit until the system fails (this tests that failure behaviour is fail-safe and may bring defects to light).

Particularly important for distributed systems (check for degradation as the volume of data exchanged over the network grows).
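
A sketch of an incremental stress test, assuming a hypothetical component (BoundedQueue) with a design limit of 100 pending transactions. The load is raised in steps past the limit, and the report shows whether the excess is rejected cleanly rather than corrupting state.

```python
class BoundedQueue:
    """Hypothetical component with a design limit on pending transactions."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def submit(self, item):
        if len(self.items) >= self.capacity:
            return False  # fail-safe: reject rather than corrupt state
        self.items.append(item)
        return True


def stress(capacity, loads):
    """Apply each load to a fresh queue; report (load, accepted, rejected)."""
    report = []
    for load in loads:
        queue = BoundedQueue(capacity)
        accepted = sum(queue.submit(n) for n in range(load))
        report.append((load, accepted, load - accepted))
    return report


report = stress(capacity=100, loads=[50, 100, 150, 200])
for row in report:
    print(row)
```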

Back-to-back testing

Comparison of test results from different versions of the system (e.g. compare with prototype, previous version or different configuration).

Process - Run first system, saving test case results. Run second system, also saving its results. Compare results files.

Note that an absence of differences does not imply an absence of bugs: both systems may have made the same mistake.
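
The process can be sketched as follows, assuming two hypothetical versions of the same function (mean_v1 standing in for the prototype or previous version, mean_v2 for the new one). Results are held in lists here rather than saved to files.

```python
def mean_v1(values):
    """'First system': e.g. a prototype or the previous version."""
    total = 0
    for v in values:
        total += v
    return total / len(values)


def mean_v2(values):
    """'Second system': the new version under test."""
    return sum(values) / len(values)


def back_to_back(first, second, test_cases):
    """Run both versions on the same cases and return the differing cases."""
    first_results = [first(case) for case in test_cases]
    second_results = [second(case) for case in test_cases]
    return [
        (case, a, b)
        for case, a, b in zip(test_cases, first_results, second_results)
        if a != b
    ]


differences = back_to_back(mean_v1, mean_v2, [[1, 2, 3], [10], [2, 4]])
print("differences:", differences)
```

An empty list is deliberately not among the cases: both versions divide by zero on it, which is a shared mistake of the kind that comparing the two versions cannot attribute to either one.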

Defect testing

A successful defect test is a test that causes the system to behave incorrectly.

Defect testing is not intended to show that a program meets its specification. If tests don't show up defects it may mean that the tests are not exhaustive enough. Exhaustive testing is rarely practicable, so a subset of the possible test cases has to be defined (this should be part of the test plan, not left to the individual programmer). Possible methods:

Usual method is to ensure that every line of code is executed at least once.
Test capabilities rather than components (e.g. concentrate on tests for data loss over ones for screen layout).
Test old in preference to new (users less affected by failure of new capabilities).
Test typical cases rather than boundary ones (ensure normal operation works properly).

Three approaches to defect testing: black-box, white-box and interface testing. Each is most appropriate to different types of component.

Studies have shown that black-box testing is more effective in discovering faults than white-box testing, although the rate of fault detection (faults detected per unit time) was similar for each approach. They also showed that static code reviewing was more effective and less expensive than defect testing. Sommerville predicts that defect testing will gradually be replaced by program inspections and code reviews.

Black-box (functional) testing

Testing against specification of system or component. Study it by examining its inputs and related outputs.

Key is to devise inputs that have a higher likelihood of causing outputs that reveal the presence of defects. Use experience and knowledge of domain to identify such test cases. Failing this a systematic approach may be necessary.

Equivalence partitioning relies on the input to a program falling into a number of classes, e.g. positive numbers vs. negative numbers. Programs normally behave the same way for each member of a class. Partitions exist for both input and output. Partitions may be disjoint or may overlap. Invalid data (i.e. outside the normal partitions) forms one or more further partitions that should also be tested.

Test cases are chosen to exercise each partition. Also test boundary cases (atypical, extreme, zero) since these frequently show up defects. For completeness, test all combinations of partitions.

Sommerville's example: a program accepts four to ten inputs, each a 5-digit integer of at least 10000. Partitions for the number of input values: fewer than 4, 4 to 10, more than 10. Partitions for the input values themselves: less than 10000, between 10000 and 99999, more than 99999. Test cases for the number of input values could be 3 (invalid boundary), 4 (valid boundary), 7 (typical), 10 (valid boundary) and 11 (invalid boundary); for the input values, 9999, 10000, 50000, 99999 and 100000.
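
The partitions and test cases of this example can be written down directly. In the sketch below the function names and the partition labels are hypothetical; following the partitions listed above, 10000 is treated as a valid value.

```python
def classify_count(n_inputs):
    """Partition for the number of input values."""
    if n_inputs < 4:
        return "too few"
    if n_inputs <= 10:
        return "valid"
    return "too many"


def classify_value(value):
    """Partition for an individual input value."""
    if value < 10000:
        return "too small"
    if value <= 99999:
        return "valid"
    return "too large"


# Boundary and typical cases taken from the example in the text.
COUNT_CASES = {3: "too few", 4: "valid", 7: "valid", 10: "valid", 11: "too many"}
VALUE_CASES = {
    9999: "too small",
    10000: "valid",
    50000: "valid",
    99999: "valid",
    100000: "too large",
}

count_ok = all(classify_count(n) == want for n, want in COUNT_CASES.items())
value_ok = all(classify_value(v) == want for v, want in VALUE_CASES.items())
print(count_ok, value_ok)
```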

Useful guidelines for testing arrays, strings, lists, etc.:

test with an object of length 0 (if permitted) and of length 1
use different sizes in different tests
test the first, middle and last elements of the object
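
These guidelines translate into a small table of cases. The sketch below applies them to a binary search over a sorted list; the choice of unit and the particular values are illustrative.

```python
def binary_search(items, key):
    """Return the index of key in the sorted list items, or -1 if absent."""
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2
        if items[mid] == key:
            return mid
        if items[mid] < key:
            low = mid + 1
        else:
            high = mid - 1
    return -1


CASES = [
    ([], 5, -1),              # length 0
    ([5], 5, 0),              # length 1, key present
    ([5], 7, -1),             # length 1, key absent
    ([1, 3, 5, 7, 9], 1, 0),  # first element
    ([1, 3, 5, 7, 9], 5, 2),  # middle element
    ([1, 3, 5, 7, 9], 9, 4),  # last element
    ([1, 3, 5, 7], 7, 3),     # a different (even) size
]

results = [binary_search(items, key) == want for items, key, want in CASES]
print(all(results))
```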

Black-box testing is rarely exhaustive (because one doesn't test every value in an equivalence partition) and sometimes fails to reveal defects caused by "weird" combinations of inputs.

Black-box testing should not be used to try to reveal corruption defects caused, for example, by assigning a pointer to point to an object of the wrong type. Static inspection (or using a better programming language!) is preferable for this.

White-box (structural) testing

Testing based on knowledge of structure of component (e.g. by looking at source code). Advantage is that structure of code can be used to find out how many test cases need to be performed. Knowledge of the algorithm (examination of the code) can be used to identify the equivalence partitions.

Path testing is where the tester aims to exercise every independent execution path through the component. All conditional statements tested for both true and false cases. If a unit has n control statements, there will be up to 2ⁿ possible paths through it. This demonstrates that it is much easier to test small program units than large ones.
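
A small illustration of the 2ⁿ bound, assuming a hypothetical unit with n = 2 conditions in sequence. The unit records which branches it took (instrumentation added purely for this sketch), and choosing one input for each outcome of each condition exercises all four paths.

```python
from itertools import product


def shipping_cost(weight_kg, express):
    """Hypothetical unit with n = 2 conditions; also reports the path taken."""
    path = []
    cost = 5
    if weight_kg > 10:  # condition 1
        cost += 7
        path.append("heavy")
    else:
        path.append("light")
    if express:         # condition 2
        cost *= 2
        path.append("express")
    else:
        path.append("standard")
    return cost, tuple(path)


# One input per outcome of each condition; their product covers every path.
weights = [2, 20]
express_flags = [False, True]
paths_seen = {shipping_cost(w, e)[1] for w, e in product(weights, express_flags)}
print(len(paths_seen), "paths exercised")
```

Each further condition in sequence doubles the number of cases needed, which is the practical argument for keeping units small.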

Flow graphs are a pictorial representation of the paths of control through a program (ignoring assignments, procedure calls and I/O statements). Use flow graph to design test cases that execute each path. Static tools may be used to make this easier in programs that have a complex branching structure.

Tool support. Dynamic program analysers instrument a program with additional code. Typically this will count how many times each statement is executed. At end, print out report showing which statements have and have not been executed.
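
A toy version of such an analyser can be built on Python's standard sys.settrace hook. This is a sketch of the idea only; count_lines and sign are hypothetical names, and real tools do considerably more. Line numbers are reported relative to the def line of the function being measured.

```python
import sys
from collections import Counter


def count_lines(func, *args):
    """Run func(*args) and count how often each of its lines executes."""
    counts = Counter()
    code = func.__code__

    def tracer(frame, event, arg):
        # Only count 'line' events that belong to the function under test.
        if frame.f_code is code and event == "line":
            counts[frame.f_lineno - code.co_firstlineno] += 1
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)  # always switch tracing off again
    return counts


def sign(x):
    if x < 0:
        return "negative"
    return "non-negative"


counts = count_lines(sign, 5)
for offset in (1, 2, 3):
    print("line", offset, "executed", counts[offset], "time(s)")
```

For the input 5 the report shows that the line returning "negative" was never executed, which tells the tester that another test case is needed.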

Problems with flow graph derived testing:

Data complexity not taken into account.
Does not test all paths in combination.
Really only possible at unit and module testing stages because beyond that complexity is too high.

Interface testing

Usually done at integration stage when modules or sub-systems are combined. Objective is to detect errors or invalid assumptions about interfaces between modules. Reason these are not shown up in unit testing is that test case may perpetuate same incorrect assumption made by module designer. Particularly important when OO development has been used.

Four types of interface:

Parameter: data (or occasionally function references) passed from one unit to another.
Shared memory: block of memory shared between units (e.g. global variable). One places data there and the other retrieves it.
Procedural: object-oriented or abstract data type form of interface, encapsulating several procedures.
Message passing: one sub-system requests a service by passing a message. The result of the service is returned in another message. Client-server interface also used by some OO architectures.

Three common kinds of interface error:

Interface misuse: caller gives wrong number/type/order of parameters or sends invalid message.
Interface misunderstanding: caller misunderstands specification of called component and provides or receives data in legal but unexpected form.
Timing errors: producer/consumer of data operate at different speeds and data is accessed before being ready. "Race conditions".

A common manifestation is that each unit assumes the other is checking for invalid data (e.g. the caller fails to check a return status), so nothing checks it and the fault is propagated to other units.

Guidelines for interface testing:

Devise tests which "stretch" external components (boundary conditions, etc.).
Test interface with null pointers.
Design tests that should cause component to fail. Differing failure models are one of the most common specification misunderstandings.
Stress test message passing systems (e.g. many more messages than are likely). May reveal timing problems.
Vary the order of actions on shared memory. May reveal implicit assumptions about ordering relationships between events.
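
Two of these guidelines (null pointers, and tests that should cause the component to fail) can be sketched as follows. The component average and its documented failure model (raising ValueError) are assumptions made for the illustration; None plays the role of a null pointer.

```python
def average(values):
    """Hypothetical component whose documented failure model is ValueError."""
    if values is None or len(values) == 0:
        raise ValueError("average() needs at least one value")
    return sum(values) / len(values)


def fails_as_specified(component, bad_input):
    """True if the component rejects bad_input with the documented ValueError."""
    try:
        component(bad_input)
    except ValueError:
        return True
    except Exception:
        return False  # failed, but not in the documented way
    return False      # silently accepted invalid input


null_ok = fails_as_specified(average, None)
empty_ok = fails_as_specified(average, [])
print(null_ok, empty_ok)
```

The helper distinguishes "fails in the documented way" from "fails in some other way", since differing failure models are exactly what the caller and the component may disagree about.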

Many interface errors will be shown up statically by using strongly typed languages such as Ada. C/C++ are notoriously expensive to develop in because they give relatively little support for static interface checking.

Static verification

Problem with defect testing is that each test reveals only one (or a small number of) faults. Faults are difficult to locate even when detected.

Pioneering work in static testing was done by Fagan at IBM in the 1970s and 1980s. He reckoned that 60% of faults could be detected using informal program inspections. Another study (Mills et al) reckoned 90% if using formal techniques (mathematical verification, e.g. pre/post condition analysis). Static inspections can also be used to check compliance with code quality standards and portability, maintainability, etc.

Cannot completely replace testing, but if it significantly reduces the number of defects left to be found by testing, it may be very cost effective. In particular, it cannot replace testing for reliability assessment, performance analysis, user interface validation or validating requirements. Static verification is also applicable to other outputs of the software engineering process such as requirements documents, specifications, user documentation, test plans, etc.

Program inspections

Principal objective is to detect defects by looking at code.

Need to have a checklist of likely types of error to look for (updated periodically). Static verification will "front-load" development costs because it is carried out earlier in each stage of the lifecycle than testing. However should result in overall costs being lower. Culture must be that inspections are part of the verification process, not personnel appraisals. Need to train inspection team leaders.

Conducted by small team. Fagan proposed four roles: author, reader, tester, moderator. Reader reads the code out loud. Tester looks at it from testing perspective. Moderator organises process and keeps order. Grady and Van Slack (Hewlett-Packard) proposed six roles: author/owner, inspector (finds errors), reader (optional), scribe (takes notes), moderator and chief moderator (responsible for improvements to inspection process).

Before starting:

Have precise specification of what code should do.
Ensure all members of inspection team know organisational standards.
Code being inspected must be complete, up-to-date and syntactically correct (otherwise waste of time).

Members of team should see code in advance of meeting so they have time to prepare, including looking for defects.

Inspection should identify defects but leave to author to decide how best to fix them. A further inspection may or may not be required after that. Inspection should not recommend changes to other components.

Typical faults to look for (which should make up the inspection checklist):

Data faults. Initialisation of variables. Naming of constants. Bounds of arrays, etc.
Control faults. Correctness of conditions. Termination of loops. Bracketing of compound statements. Accounting for all cases in a case-statement.
Input/output faults. All input gets used. All output values are defined.
Interface faults. Number, type and order of parameters (in languages that don't enforce that). Shared memory used correctly.
Storage management faults. Reassignment of links in a linked structure. Dynamic allocation of memory allocates the right size. Space de-allocated when no longer used.
Exception management faults. Are all error conditions taken into account? Are status returns always checked?
Quality issues. Layout. Names of objects and procedures.

As organisation gains experience in use of inspection, analysis of defects detected can be used to improve process.

Efficiency. Fagan reports about 90-125 statements per hour can be inspected (max. two hours, thereafter efficiency drops). Therefore do frequently on small components rather than infrequently on large ones. With four people involved, cost of inspecting code is approx. 1 person-day per 100 lines (including preparation time).
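
The person-day figure can be reproduced with some simple arithmetic. In the sketch below the meeting rate of 100 statements per hour lies within the quoted 90-125 range; the overview and preparation rates (500 and 125 statements per hour) are illustrative assumptions, not figures from these notes.

```python
def inspection_cost_person_hours(statements, team_size=4,
                                 overview_rate=500,     # assumed, statements/hour
                                 preparation_rate=125,  # assumed, statements/hour
                                 meeting_rate=100):     # within the 90-125 range
    """Estimate the total effort of inspecting a component, in person-hours."""
    hours_per_person = (statements / overview_rate
                        + statements / preparation_rate
                        + statements / meeting_rate)
    return team_size * hours_per_person


cost = inspection_cost_person_hours(100)
print("person-hours for 100 statements:", cost)
```

Under these assumptions each of the four people spends about two hours on 100 statements, i.e. about eight person-hours or roughly one person-day in total.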

Mathematically based verification

Prove, using mathematical arguments, that program meets its specification.

Semantics of programming language must be formally defined (few languages are). Program must be specified in a notation that is consistent with the mathematical verification techniques being used. Developing mathematical proofs is time-consuming and expensive. Need complex theorem provers to support process.

Tends to be applied to language subsets (e.g. without pointers). SPARK (see Barnes's book) is such a subset of Ada. Tends to be concentrated on critical sections of program.

Rigorous techniques (hybrid of formal and inspection) often used as a compromise.

Most common approach is the axiomatic approach. Program contains assertions about state of system at that point. Verifier checks that code to be executed does not violate these assertions. (Can be done dynamically as well.)
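
The dynamic variant can be illustrated with ordinary assert statements. The sketch below uses a hypothetical integer_sqrt function, with its precondition, a loop invariant and its postcondition written as assertions that are checked at run time rather than proved.

```python
def integer_sqrt(n):
    """Return the largest r with r*r <= n, checking assertions as it runs."""
    assert isinstance(n, int) and n >= 0, "precondition: n is a non-negative int"
    r = 0
    while (r + 1) * (r + 1) <= n:
        assert r * r <= n  # loop invariant
        r += 1
    assert r * r <= n < (r + 1) * (r + 1), "postcondition"
    return r


roots = [integer_sqrt(n) for n in (0, 1, 15, 16, 17)]
print(roots)
```

A static verifier would aim to show that the postcondition follows from the precondition and the code for every n; running the assertions only checks the inputs actually tried.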

Static analysis tools

Some tools that are available:

control flow analysis (unreachable code)
data use analysis (used without initialisation, never used, etc.)
interface analysis (already done in e.g. Ada)
information flow analysis (which variables contribute to the ultimate calculation of some output)
path analysis (which statements make up a path through the program)
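
As a taste of control flow analysis, the sketch below uses Python's standard ast module to flag statements that directly follow a return, raise, break or continue in the same block. It is a deliberately simplified check, not a description of how any particular tool works; SOURCE is a made-up fragment.

```python
import ast

SOURCE = '''
def f(x):
    if x > 0:
        return 1
        x = x + 1
    return 0
'''

JUMPS = (ast.Return, ast.Raise, ast.Break, ast.Continue)


def unreachable_lines(source):
    """Line numbers of statements that directly follow a jump in their block."""
    found = []
    for node in ast.walk(ast.parse(source)):
        for field in ("body", "orelse", "finalbody"):
            block = getattr(node, field, None)
            if not isinstance(block, list):
                continue
            for prev, stmt in zip(block, block[1:]):
                if isinstance(prev, JUMPS):
                    found.append(stmt.lineno)
    return sorted(found)


lines = unreachable_lines(SOURCE)
print("unreachable statements at lines:", lines)
```

The assignment after "return 1" is reported without the program ever being run, which is what distinguishes this from the dynamic analysers described earlier.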

Cleanroom software development

Philosophy based on static verification techniques (Mills et al) to ensure "zero defects".

Relies on strict inspection process.

Typically uses formal specification, incremental development, structured programming, static verification and statistical testing.

Separate specification, development and certification teams.

Results show not significantly more expensive than conventional development.