Do Automatically Generated Unit Tests Find Real Faults? An Empirical Study of Effectiveness and Challenges (T)

@article{Shamshiri2015DoAG,
  title={Do Automatically Generated Unit Tests Find Real Faults? An Empirical Study of Effectiveness and Challenges (T)},
  author={Sina Shamshiri and Ren{\'e} Just and Jos{\'e} Miguel Rojas and Gordon Fraser and Phil McMinn and Andrea Arcuri},
  journal={2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE)},
  year={2015},
  pages={201-211}
}
  • Sina Shamshiri, René Just, José Miguel Rojas, Gordon Fraser, Phil McMinn, Andrea Arcuri
  • Published 9 November 2015
  • Computer Science
  • 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE)
Rather than tediously writing unit tests manually, tools can be used to generate them automatically - sometimes even resulting in higher code coverage than manual testing. But how good are these tests at actually finding faults? To answer this question, we applied three state-of-the-art unit test generation tools for Java (Randoop, EvoSuite, and Agitar) to the 357 real faults in the Defects4J dataset and investigated how well the generated test suites perform at detecting these faults. Although… 
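
To make the evaluation setup concrete, the sketch below shows the pattern these studies rely on; the class and values are hypothetical, not from the paper. The tools generate tests against the fixed version of each Defects4J subject and capture its observed behavior as assertions; a fault counts as detected when at least one generated test fails on the buggy version.

    // Hypothetical sketch of a generated regression test (class and values
    // are invented). The assertion pins down the fixed version's behavior;
    // running the same test against the buggy version makes it fail, which
    // is counted as "fault detected".
    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class GeneratedRegressionTest {

        @Test
        public void test0() {
            // Inputs chosen by the generator; the expected value is whatever
            // the fixed version returned at generation time.
            int result = Sample.clamp(42, 0, 10);
            assertEquals(10, result); // a faulty clamp returning 42 fails here
        }

        // Stand-in for the class under test, to keep the sketch self-contained.
        static class Sample {
            static int clamp(int v, int lo, int hi) {
                return Math.max(lo, Math.min(hi, v));
            }
        }
    }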

Citations

How Do Automatically Generated Unit Tests Influence Software Maintenance?
TLDR
An empirical study in which participants were presented with an automatically generated or manually written failing test, and were asked to identify and fix the cause of the failure, found developers to be equally effective with manually written and automatically generated tests.
A Large Scale Study On the Effectiveness of Manual and Automatic Unit Test Generation
TLDR
An empirical study, using ten Java programs that already have manual test suites (MTSs) and applying two sophisticated tools that automatically generate test cases (Randoop and EvoSuite), indicates that MTSs are, in general, more effective than the automatically generated test suites (ATSs) regarding the investigated metrics.
A Systematic Evaluation of Problematic Tests Generated by EvoSuite
  • Zhiyu Fan
  • Computer Science
    2019 IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE-Companion)
  • 2019
TLDR
A comprehensive study of EvoSuite on Defects4J is presented, and a detailed analysis of the reasons behind these automatically generated problematic tests is performed.
Can automated test case generation cope with extract method validation?
TLDR
An empirical study applies the Randoop and EvoSuite tools to generate regression test suites, focusing on detecting Extract Method faults, and identifies factors that may influence the tools' performance in effectively testing the edits.
How effective are mutation testing tools? An empirical analysis of Java mutation testing tools with manual analysis and real faults
TLDR
There are large differences between the tools' effectiveness, and it is demonstrated that no tool is able to subsume the others; overall, PITRV achieves the best results, finding 6% more faults than the other tools combined.
On the Effectiveness of Manual and Automatic Unit Test Generation: Ten Years Later
TLDR
This paper revisits an initial case study comparing automatically and manually generated test suites using current tools, and complements the original research method by evaluating these tools' ability to find regressions.
An Industrial Evaluation of Unit Test Generation: Finding Real Faults in a Financial Application
TLDR
Challenges that need to be addressed in order to improve fault detection in test generation tools are demonstrated, such as the need to integrate with popular build tools and to improve the readability of the generated tests.
Classifying generated white-box tests: an exploratory study
TLDR
This paper recommends a conceptual framework to describe the classification task and suggests taking this problem into account when using or evaluating white-box test generators.
Branch coverage prediction in automated testing
TLDR
It is argued that knowing a priori the branch coverage that can be achieved with test-data generation tools can help developers make informed decisions, and the possibility of using source-code metrics to predict the coverage achieved by such tools is investigated.
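
As a rough illustration of that prediction idea, a model maps a handful of static metrics to an expected coverage value; the features and weights below are invented for the sketch, not taken from the paper, which learns such models from data.

    // Toy sketch of predicting branch coverage from source-code metrics.
    public class CoveragePredictor {

        // Hypothetical per-class metrics a static analyzer could extract.
        record ClassMetrics(int linesOfCode, int branches, int maxNesting) {}

        // An invented linear model, clamped to [0, 1].
        static double predictBranchCoverage(ClassMetrics m) {
            double raw = 0.95
                    - 0.0004 * m.linesOfCode()
                    - 0.0020 * m.branches()
                    - 0.0300 * m.maxNesting();
            return Math.max(0.0, Math.min(1.0, raw));
        }

        public static void main(String[] args) {
            ClassMetrics m = new ClassMetrics(420, 35, 4);
            System.out.printf("predicted branch coverage: %.2f%n",
                    predictBranchCoverage(m));
        }
    }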
Automatic Unit Test Generation for Machine Learning Libraries: How Far Are We?
TLDR
An empirical study on five widely used machine learning libraries with two popular unit test case generation tools, i.e., EvoSuite and Randoop, finds that most of the machine learning libraries do not maintain a high-quality unit test suite with respect to commonly applied quality metrics such as code coverage and mutation score.
…

References

SHOWING 1-10 OF 48 REFERENCES
Does automated white-box test generation really help software testers?
TLDR
A controlled experiment comparing a total of 49 subjects, split between writing tests manually and writing tests with the aid of an automated unit test generation tool (EvoSuite), found that tool support leads to clear improvements in commonly applied quality metrics such as code coverage; however, there was no measurable improvement in the number of bugs actually found by developers.
Augmenting Automatically Generated Unit-Test Suites with Regression Oracle Checking
TLDR
Results show that an automatically generated test suite's fault-detection capability can be effectively improved after being augmented by Orstra, and the augmented test suite has an improved capability of guarding against regression faults.
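
The augmentation step can be pictured with a small before/after sketch; the call sequence below is invented, and Orstra itself works by instrumenting and observing real generated suites rather than hand-writing assertions.

    // Hypothetical sketch of regression-oracle augmentation: run a generated
    // call sequence once, observe the resulting values, and emit those
    // observations back into the test as assertions.
    import static org.junit.Assert.assertEquals;
    import java.util.ArrayDeque;
    import org.junit.Test;

    public class AugmentedTest {

        @Test
        public void generatedSequenceWithRegressionOracle() {
            // Original generated sequence (no oracle beyond "does not crash"):
            ArrayDeque<String> queue = new ArrayDeque<>();
            queue.add("a");
            queue.add("b");
            String head = queue.poll();

            // Assertions added from the observed (assumed-correct) behavior;
            // a later regression that changes this behavior fails the test.
            assertEquals("a", head);
            assertEquals(1, queue.size());
        }
    }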
Are mutants a valid substitute for real faults in software testing?
TLDR
This paper investigates whether mutants are indeed a valid substitute for real faults, i.e., whether a test suite's ability to detect mutants is correlated with its ability to detect real faults that developers have fixed, and shows a statistically significant correlation between mutant detection and real fault detection, independently of code coverage.
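
To make "detecting a mutant" concrete, here is a minimal invented example: a single relational-operator mutation of the kind mutation tools apply, and the boundary-value test that kills it.

    // Toy mutant example (invented for illustration).
    import static org.junit.Assert.assertTrue;
    import org.junit.Test;

    public class MutantKillTest {

        static boolean isAdult(int age) {        // original
            return age >= 18;
        }

        static boolean isAdultMutant(int age) {  // mutant: >= replaced by >
            return age > 18;
        }

        @Test
        public void boundaryInputKillsTheMutant() {
            assertTrue(isAdult(18));             // passes on the original
            // The same assertion against the mutant would fail, because
            // isAdultMutant(18) == false, so the mutant is detected (killed).
        }
    }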
Defects4J: a database of existing faults to enable controlled testing studies for Java programs
TLDR
Defects4J is a database and extensible framework providing real bugs to enable reproducible studies in software testing research; it provides a high-level interface to common tasks, making it easy to conduct and reproduce empirical studies.
Whole Test Suite Generation
TLDR
This work proposes a novel paradigm in which whole test suites are evolved with the aim of covering all coverage goals at the same time while keeping the total size as small as possible; this approach is implemented in the EvoSuite tool.
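
The core idea can be sketched as a fitness function over an entire suite; this is a simplification with stub types, and the real EvoSuite implementation combines normalized branch distances with approach levels and multiple criteria.

    import java.util.List;

    // Simplified sketch of whole-suite fitness: instead of targeting one
    // branch at a time, score a whole suite by summing, over ALL branch
    // goals, the best (minimal) normalized distance any test achieves.
    // A fitness of 0 means every goal is covered.
    public class WholeSuiteFitness {

        interface BranchGoal {
            double distance(int[] testInputs); // 0 when this test covers the branch
        }

        static double normalize(double d) {    // map [0, inf) into [0, 1)
            return d / (d + 1.0);
        }

        static double fitness(List<int[]> suite, List<BranchGoal> goals) {
            double total = 0.0;
            for (BranchGoal goal : goals) {
                double best = 1.0;             // worst normalized distance
                for (int[] test : suite) {
                    best = Math.min(best, normalize(goal.distance(test)));
                }
                total += best;                 // residual distance for this goal
            }
            return total;                      // the genetic search minimizes this
        }
    }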
Precise identification of problems for structural test generation
TLDR
A novel approach, called Covana, is proposed, which precisely identifies and reports problems that prevent the tools from achieving high structural coverage, primarily by determining whether branch statements containing not-covered branches have data dependencies on problem candidates.
Automated unit test generation for classes with environment dependencies
Automated test generation for object-oriented software typically consists of producing sequences of calls aiming at high code coverage. In practice, the success of this process may be inhibited when
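
A one-method example, invented here, shows the kind of dependency meant: the branch outcome is controlled by the file system rather than by any parameter a generated call sequence could set, which is why such tools resort to mocking or virtualizing the environment.

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;

    // Invented example of an environment-dependent branch: no sequence of
    // ordinary method calls controls whether the file exists, so a plain
    // call-sequence generator cannot reliably cover both branches.
    public class ConfigReader {

        static String readConfig(String path) throws IOException {
            File f = new File(path);
            if (!f.exists()) {          // outcome decided by the environment
                return "default";
            }
            return new String(Files.readAllBytes(f.toPath()));
        }
    }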
DART: directed automated random testing
TLDR
DART is a new tool for automatically testing software that combines three main techniques: automated extraction of the interface of a program with its external environment using static source-code parsing; dynamic analysis of how the program behaves under random testing; and automatic generation of new test inputs to systematically direct the execution along alternative program paths.
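
A toy target, invented here, shows why the dynamic part matters: random inputs almost never satisfy the equality below, whereas a DART-style tool records the path constraint from one concrete run, negates it, and solves it to steer execution into the branch.

    // Invented toy target: the error branch is a needle in a haystack for
    // random testing (roughly a 2^-32 chance per trial with random ints),
    // but solving x == 3 * y + 7 from the collected path constraint yields
    // a directing input immediately.
    public class HardBranch {

        static void underTest(int x, int y) {
            if (x == 3 * y + 7) {
                throw new IllegalStateException("hard-to-reach fault");
            }
        }

        public static void main(String[] args) {
            underTest(10, 1);  // 3*1 + 7 == 10: the directed input triggers it
        }
    }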
OCAT: object capture-based automated testing
TLDR
This work proposes a novel approach called Object Capture based Automated Testing (OCAT), which captures object instances dynamically from program executions and helps an existing automated test-generation tool, such as a random testing tool, to achieve higher code coverage.
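
The capture-and-reuse idea in miniature; the class and method names below are hypothetical, not OCAT's actual API. Instances observed during normal runs are persisted, then handed to a test generator as ready-made inputs that random construction would rarely reach.

    import java.io.*;

    // Hypothetical sketch of object capture: persist instances observed at
    // runtime so a test generator can later replay them as method inputs.
    public class ObjectStore {

        // Called from instrumented code when an interesting instance flows by.
        static void capture(Serializable obj, File store) throws IOException {
            try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream(store))) {
                out.writeObject(obj);
            }
        }

        // Later, the generator loads the instance and uses it as a test input.
        static Object replay(File store) throws IOException, ClassNotFoundException {
            try (ObjectInputStream in =
                     new ObjectInputStream(new FileInputStream(store))) {
                return in.readObject();
            }
        }
    }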
Is mutation an appropriate tool for testing experiments?
TLDR
It is concluded that, based on the data available thus far, the use of mutation operators is yielding trustworthy results (generated mutants are similar to real faults); mutants appear, however, to be different from hand-seeded faults, which seem to be harder to detect than real faults.
…