Data Balancing Improves Self-Admitted Technical Debt Detection

  title={Data Balancing Improves Self-Admitted Technical Debt Detection},
  author={Murali Sridharan and Mika M{\"a}ntyl{\"a} and Leevi Rantala and Ma{\"e}lick Claes},
  journal={2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)},
A high imbalance exists between technical debt and non-technical debt source code comments. Such imbalance affects Self-Admitted Technical Debt (SATD) detection performance, and existing literature lacks empirical evidence on the choice of balancing technique. In this work, we evaluate the impact of multiple balancing techniques, including Data level, Classifier level, and Hybrid, for SATD detection in Within-Project and Cross-Project setups. Our results show that the Data level balancing…
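
To make the distinction between Data level and Classifier level balancing concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's pipeline: the abstract does not name the specific techniques evaluated, so random oversampling stands in for a Data level method and inverse-frequency class weights stand in for a Classifier level method. The function names and the toy comments are hypothetical.

```python
import random
from collections import Counter


def random_oversample(comments, labels, seed=0):
    """Data level balancing (assumed example): duplicate randomly chosen
    examples of the smaller classes until every class has as many
    examples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for comment, label in zip(comments, labels):
        by_class.setdefault(label, []).append(comment)
    target = max(len(items) for items in by_class.values())

    out_comments, out_labels = [], []
    for label, items in by_class.items():
        # Draw (with replacement) the number of extra examples this class needs.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for comment in items + extra:
            out_comments.append(comment)
            out_labels.append(label)
    return out_comments, out_labels


def balanced_class_weights(labels):
    """Classifier level balancing (assumed example): leave the data alone
    and weight each class by n_samples / (n_classes * class_count), so
    errors on the rare class cost more during training."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


# Hypothetical toy data: 1 = SATD comment, 0 = non-SATD comment.
comments = [
    "TODO: hack, fix later",
    "returns the user id",
    "loop over items",
    "open the file",
    "close the stream",
]
labels = [1, 0, 0, 0, 0]

balanced_comments, balanced_labels = random_oversample(comments, labels)
weights = balanced_class_weights(labels)
```

In a real evaluation, resampling would be applied to the training split only, so that duplicated minority comments cannot leak into the test data. A Hybrid approach would combine both ideas, for example by resampling first and then training a weighted classifier.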

Predictive Models in Software Engineering: Challenges and Opportunities
This paper describes the key models and approaches used, classifies the different models, summarizes the range of key application areas, and analyzes research results.


Neural Network-based Detection of Self-Admitted Technical Debt
A Convolutional Neural Network (CNN)-based approach for classifying code comments as SATD or non-SATD is proposed, and its superior performance, generalizability, adaptability, and explainability over current state-of-the-art traditional text-mining-based methods for SATD classification are confirmed.
Identifying self-admitted technical debt in open source projects using text mining
This paper proposes an automated approach to detect SATD in source code comments using text mining. The approach uses feature selection to select useful features for classifier training and combines multiple classifiers from different source projects to build a composite classifier that identifies SATD comments in a target project.
Using Natural Language Processing to Automatically Detect Self-Admitted Technical Debt
This paper presents an approach to automatically identify design and requirement self-admitted technical debt using Natural Language Processing (NLP), and shows that the proposed approach can achieve good accuracy even with a relatively small training dataset.
On the role of data balancing for machine learning-based code smell detection
This study investigates several approaches able to mitigate data unbalancing issues to understand their impact on ML-based approaches for code smell detection and highlights a number of limitations and open issues.
Recommending when Design Technical Debt Should be Self-Admitted
This paper investigates the extent to which previously self-admitted technical debt can be used to provide recommendations to developers when they write new source code, suggesting to them when to "self-admit" design technical debt, or possibly when to improve the code being written.
Prevalence, Contents and Automatic Detection of KL-SATD
Using KL-SATD offers the potential to bootstrap a complete SATD detector, and it is demonstrated that machine learning can identify comments that currently lack a SATD keyword but should have one.
Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning
Two new minority over-sampling methods are presented, borderline-SMOTE1 and borderline-SMOTE2, in which only the minority examples near the borderline are over-sampled; they achieve a better TP rate and F-value than SMOTE and random over-sampling methods.
Data imbalance in classification: Experimental evaluation
Detecting bad smells with machine learning algorithms: an empirical study
An evaluation of seven different machine learning algorithms on the task of detecting four types of bad smells and an analysis of the impact of software metrics for bad smell detection using a unified approach for interpreting the models' decisions are provided.
Hybrid Sampling and Random Forest Based Machine Learning Approach for Software Defect Prediction
A new model based on a machine learning approach is proposed to predict software defects and identify the key factors that may help software engineers identify the most defect-prone parts of the system.