On the relative value of cross-company and within-company data for defect prediction

Abstract

We propose a practical defect prediction approach for companies that do not track defect-related data. Specifically, we investigate the applicability of cross-company (CC) data for building localized defect predictors using static code features. First, we analyze the conditions under which CC data can be used as is. These conditions turn out to be quite few. We then apply principles of analogy-based learning (i.e., nearest neighbor (NN) filtering) to CC data in order to fine-tune these models for localization. We compare the performance of these models with that of defect predictors learned from within-company (WC) data. As expected, defect predictors learned from WC data outperform those learned from CC data. However, our analyses also yield defect predictors learned from NN-filtered CC data whose performance is close to, but still not better than, that of predictors learned from WC data. Therefore, we perform a final analysis to determine the minimum number of local defect reports needed to learn WC defect predictors. We demonstrate that the minimum number of data samples required to build effective defect predictors can be quite small and can be collected quickly, within a few months. Hence, for companies with no local defect data, we recommend a two-phase approach that allows them to start the defect prediction process immediately. In phase one, companies should use NN-filtered CC data to initiate the defect prediction process and simultaneously start collecting WC (local) data. Once enough WC data has been collected (i.e., after a few months), organizations should switch to phase two and use predictors learned from WC data.
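The NN-filtering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the details, so everything concrete here is an assumption made for illustration: the function name `nn_filter`, the use of Euclidean distance, the default `k=10`, and the representation of each module as a tuple of numeric static code features. The idea sketched is to keep only those CC rows that are among the `k` nearest neighbors of at least one local (WC) module, and to train the predictor on that subset.

```python
import math


def nn_filter(cc_rows, wc_rows, k=10):
    """Keep the CC rows that are among the k nearest neighbors of any WC row.

    cc_rows: list of (features, label) pairs from other companies.
    wc_rows: list of feature tuples from the local project (no labels needed).
    Distance metric and k are illustrative assumptions, not taken from the abstract.
    """
    selected = set()
    for target in wc_rows:
        # Rank CC rows by Euclidean distance to this local module.
        ranked = sorted(
            range(len(cc_rows)),
            key=lambda i: math.dist(cc_rows[i][0], target),
        )
        selected.update(ranked[:k])
    # Each selected row appears once, in its original CC order.
    return [cc_rows[i] for i in sorted(selected)]


# Hypothetical toy data: two features per module, label 1 = defective.
cc = [((1.0, 1.0), 0), ((1.2, 0.9), 1), ((10.0, 10.0), 0),
      ((9.5, 10.2), 1), ((50.0, 50.0), 0)]
wc = [(1.1, 1.0), (9.8, 10.0)]
filtered = nn_filter(cc, wc, k=2)
```

In this toy data the outlying CC row at `(50.0, 50.0)` is not close to any local module, so it is dropped. Real static code features differ widely in scale, so some normalization before computing distances would likely be needed; that step is omitted here.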

DOI: 10.1007/s10664-008-9103-7


Empir Software Eng

  • 2009
