The Impact of Feature Importance Methods on the Interpretation of Defect Classifiers

Gopi Krishnan Rajbahadur, Shaowei Wang, Gustavo Ansaldi Oliva, Yasutaka Kamei, Ahmed E. Hassan
IEEE Transactions on Software Engineering
Classifier-specific (CS) and classifier-agnostic (CA) feature importance methods are widely used (often interchangeably) by prior studies to derive feature importance ranks from a defect classifier. However, different feature importance methods are likely to compute different feature importance ranks even for the same dataset and classifier. Hence, such interchangeable use of feature importance methods can lead to conclusion instabilities unless there is a strong agreement among different…
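As a rough illustration of the CS/CA distinction (a minimal scikit-learn sketch, not the paper's actual experimental setup), one can compare a classifier-specific importance (random-forest impurity importance) against a classifier-agnostic one (permutation importance) on the same data and classifier, then measure how well the two rankings agree with Kendall's tau:

```python
# Sketch: compute CS (impurity-based) vs CA (permutation) feature importance
# for the same fitted classifier and compare the resulting ranks.
from scipy.stats import kendalltau
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

cs_scores = clf.feature_importances_            # CS: built into the classifier
ca_scores = permutation_importance(             # CA: model-agnostic
    clf, X, y, n_repeats=10, random_state=0
).importances_mean

tau, _ = kendalltau(cs_scores, ca_scores)
print(f"Kendall tau between CS and CA ranks: {tau:.2f}")
```

A tau well below 1.0 means the two methods would lead an analyst to different conclusions about which metrics matter most, which is the instability the abstract warns about.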


A Comprehensive Investigation of the Impact of Class Overlap on Software Defect Prediction

It is suggested that future work in SDP should apply the proposed KNN method to check whether the overlap ratio of a defect dataset exceeds 12.5% before building SDP models, and remove the overlapping instances to obtain more consistent guidance metrics.

A Machine Learning-Based Method for Content Verification in the E-Commerce Domain

The main concept of this approach is to gather different types of information referring to persons, compare person instances, and predict whether they are similar. Using the Jaro algorithm for person-attribute similarity calculation, recommendations can be provided to users regarding the similarity between two person instances.

Community Smell Occurrence Prediction on Multi-Granularity by Developer-Oriented Features and Process Metrics

Community smells are sub-optimal developer community structures that hinder productivity. Prior studies performed smell prediction and provided refactoring guidelines from a top-down aspect to help…

Evaluating Simple and Complex Models’ Performance When Predicting Accepted Answers on Stack Overflow

Examining the performance and quality of two modelling methods used for predicting accepted answers to Java and JavaScript questions on Stack Overflow reveals significant differences in model performance and quality, depending on the type of features and the complexity of the models used.

Towards a consistent interpretation of AIOps models

This study investigates the consistency of AIOps model interpretation along three dimensions: internal consistency, external consistency, and time consistency. It finds that the randomness from learners, hyperparameter tuning, and data sampling should be controlled to generate consistent interpretations.

Revisiting reopened bugs in open source software systems

This study revisits reopened bugs and provides new insights into developers’ bug-reopening activities, using modern techniques such as SMOTE and permutation importance together with seven different machine learning models.

Rejection Analysis of Cast wheel by CRISP-DM and Machine Learning

The pandemic, global competition, customer demand for high-quality products, a wide variety of products, shortened delivery times, and declining profit margins have all had a huge impact on the…

Why Don’t XAI Techniques Agree? Characterizing the Disagreements Between Post-hoc Explanations of Defect Predictions

This study first investigates three disagreement metrics between LIME and SHAP explanations of 10 defect predictors and shows that disagreements about the rankings of feature importance are the most frequent. This leads to a method of aggregating LIME and SHAP explanations that puts less emphasis on these disagreements while highlighting the aspects on which the explanations agree.
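One simple way to quantify this kind of ranking disagreement (a hypothetical helper for illustration, not one of the metrics defined in the study) is the overlap between the top-k features of two post-hoc explanations, e.g., one from LIME and one from SHAP:

```python
# Fraction of the top-k features shared by two feature-importance scorings.
# Scores are dicts mapping feature name -> importance value.
def top_k_agreement(scores_a, scores_b, k=3):
    def top(scores):
        ranked = sorted(scores.items(), key=lambda kv: -kv[1])
        return {feature for feature, _ in ranked[:k]}
    return len(top(scores_a) & top(scores_b)) / k

# Illustrative (made-up) importance scores for four software metrics.
lime_scores = {"loc": 0.9, "churn": 0.5, "fanin": 0.3, "cbo": 0.1}
shap_scores = {"loc": 0.8, "cbo": 0.6, "churn": 0.4, "fanin": 0.2}

agreement = top_k_agreement(lime_scores, shap_scores)  # 2 of top-3 shared
```

An agreement of 2/3 here means the two explanations would point a practitioner to different "most important" metrics for one slot in three, which is exactly the ranking disagreement the study characterizes.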

The impact of automated feature selection techniques on the interpretation of defect models

It is found that the subsets of metrics produced by commonly-used feature selection techniques (except for AutoSpearman) are often inconsistent and correlated; hence, these techniques should be avoided when interpreting defect models.

The Impact of Using Regression Models to Build Defect Classifiers

It is found that random-forest-based classifiers outperform other classifiers (best AUC) for both classifier-building approaches, and it is suggested that future defect classification studies consider building regression-based classifiers, in particular when the defective ratio of the modeled dataset is low.

The Impact of Automated Parameter Optimization on Defect Prediction Models

It is found that traditionally overlooked techniques like C5.0 and neural networks can actually outperform widely-used techniques after optimization is applied, highlighting the importance of exploring the parameter space when using parameter-sensitive classification techniques.

Revisiting the Impact of Classification Techniques on the Performance of Defect Prediction Models

The results suggest that some classification techniques tend to produce defect prediction models that outperform others, contrary to earlier research.

Predicting Fault-Prone Software Modules with Rank Sum Classification

This work presents a new approach to predictive modelling of fault proneness in software modules, introducing a new feature representation to overcome some of these issues, and offers improved or, at worst, comparable performance to earlier approaches on standard data sets.

Researcher Bias: The Use of Machine Learning in Software Defect Prediction

A meta-analysis of all relevant, high-quality primary studies of defect prediction, conducted to determine which factors influence predictive performance, finds that the choice of classifier has little impact on performance and that the major explanatory factor is the researcher group.

Comments on “Researcher Bias: The Use of Machine Learning in Software Defect Prediction”

The relationship between the researcher group and the performance of a defect prediction model is more likely due to the tendency of researchers to reuse experimental components (e.g., datasets and metrics).

The Impact of Correlated Metrics on the Interpretation of Defect Models

It is found that correlated metrics have the largest impact on the consistency, the level of discrepancy, and the direction of the ranking of metrics, especially for ANOVA techniques, and that removing all correlated metrics improves the consistency of the produced rankings regardless of the ordering of metrics.

AutoSpearman: Automatically Mitigating Correlated Software Metrics for Interpreting Defect Models

To automatically mitigate correlated metrics when interpreting defect models, it is recommended that future studies use AutoSpearman, an automated metric selection approach based on correlation analyses, in lieu of commonly-used feature selection techniques.
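The core idea can be sketched in a few lines (a simplified illustration of correlation-based metric filtering, not the authors' exact AutoSpearman algorithm, which also applies VIF analysis): iteratively drop one metric from each highly correlated pair until the surviving metrics are pairwise non-correlated.

```python
# Drop one metric from each pair whose |Spearman rho| exceeds a threshold,
# so the remaining metrics are pairwise non-correlated.
import pandas as pd
from scipy.stats import spearmanr

def drop_correlated(df, threshold=0.7):
    kept = list(df.columns)
    changed = True
    while changed:
        changed = False
        for i, a in enumerate(kept):
            for b in kept[i + 1:]:
                rho, _ = spearmanr(df[a], df[b])
                if abs(rho) >= threshold:
                    kept.remove(b)   # keep the earlier metric, drop the later
                    changed = True
                    break
            if changed:
                break
    return df[kept]

# Toy metrics: loc_sq is a monotone transform of loc (Spearman rho = 1.0).
df = pd.DataFrame({"loc": [10, 20, 30, 40],
                   "loc_sq": [100, 400, 900, 1600],
                   "churn": [5, 1, 4, 2]})
survivors = drop_correlated(df)  # loc_sq is dropped, loc and churn remain
```

Which member of a correlated pair to keep is a design choice; this sketch keeps the first-listed metric, whereas a more careful approach would keep the metric that is less correlated with everything else.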

Impact of Discretization Noise of the Dependent Variable on Machine Learning Classifiers in Software Engineering

This paper proposes a framework to help researchers and practitioners systematically estimate the impact of discretization noise in the dependent variable on classifiers, in terms of both various performance measures and the interpretation of the classifiers, and finds that such noise affects different performance measures differently across datasets.