Data Balancing Improves Self-Admitted Technical Debt Detection

Murali Sridharan, Mika Mäntylä, Leevi Rantala, Maëlick Claes
2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)
A high imbalance exists between technical debt and non-technical debt source code comments. Such imbalance affects Self-Admitted Technical Debt (SATD) detection performance, and existing literature lacks empirical evidence on the choice of balancing technique. In this work, we evaluate the impact of multiple balancing techniques, including Data level, Classifier level, and Hybrid, for SATD detection in Within-Project and Cross-Project setups. Our results show that the Data level balancing…
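
The abstract names three families of balancing techniques but does not say which concrete methods were evaluated. As a rough illustration only, the sketch below shows one common instance of Data level balancing (random oversampling of the minority class) and one common instance of Classifier level balancing (inverse-frequency class weights, the same formula scikit-learn uses for its "balanced" class weights). The function names, the toy comments, and the choice of these two methods are assumptions for illustration, not the paper's setup.

```python
import random
from collections import Counter


def random_oversample(comments, labels, seed=0):
    """Data-level balancing (illustrative): duplicate randomly chosen
    samples of each smaller class until it matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_comments, out_labels = list(comments), list(labels)
    for cls, n in counts.items():
        pool = [c for c, y in zip(comments, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_comments.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_comments, out_labels


def balanced_class_weights(labels):
    """Classifier-level balancing (illustrative): weight each class by
    n_samples / (n_classes * class_count), so rare classes cost more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


# Toy, made-up comments (1 = SATD, 0 = non-SATD); not data from the paper.
comments = [
    "TODO: fix this hack",
    "returns the user id",
    "loop over rows",
    "close the stream",
    "parse the header",
    "init the cache",
]
labels = [1, 0, 0, 0, 0, 0]

bal_comments, bal_labels = random_oversample(comments, labels)
weights = balanced_class_weights(labels)
```

A Hybrid approach would combine both ideas, for example resampling the training data and also passing class weights to the classifier. In either case, resampling would normally be applied to the training split only, so that evaluation still reflects the original class distribution.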


Predictive Models in Software Engineering: Challenges and Opportunities
This paper describes the key models and approaches used, classifies the different models, summarizes the range of key application areas, and analyzes research results.


Identifying self-admitted technical debt in open source projects using text mining
This paper proposes an automated approach to detect SATD in source code comments using text mining. It uses feature selection to pick useful features for classifier training, and combines multiple classifiers from different source projects into a composite classifier that identifies SATD comments in a target project.
Using Natural Language Processing to Automatically Detect Self-Admitted Technical Debt
This paper presents an approach to automatically identify design and requirement self-admitted technical debt using Natural Language Processing (NLP), and shows that the proposed approach can achieve good accuracy even with a relatively small training dataset.
On the role of data balancing for machine learning-based code smell detection
This study investigates several approaches for mitigating data imbalance to understand their impact on ML-based code smell detection, and highlights a number of limitations and open issues.
Prevalence, Contents and Automatic Detection of KL-SATD
KL-SATD offers the potential to bootstrap a complete SATD detector, and the authors demonstrate that machine learning can identify comments that currently lack a SATD keyword but should have one.
Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning
Two new minority over-sampling methods are presented, borderline-SMOTE1 and borderline-SMOTE2, in which only the minority examples near the borderline are over-sampled. Both achieve a better TP rate and F-value than SMOTE and random over-sampling methods.
Data imbalance in classification: Experimental evaluation
Detecting bad smells with machine learning algorithms: an empirical study
An evaluation of seven different machine learning algorithms on the task of detecting four types of bad smells and an analysis of the impact of software metrics for bad smell detection using a unified approach for interpreting the models' decisions are provided.
Hybrid Sampling and Random Forest Based Machine Learning Approach for Software Defect Prediction
A new machine learning-based model is proposed to predict software defects and to identify the key factors that may help software engineers find the most defect-prone parts of a system.
An Exploratory Study on Self-Admitted Technical Debt
  • A. Potdar, Emad Shihab
  • Computer Science
    2014 IEEE International Conference on Software Maintenance and Evolution
  • 2014
Throughout a software development life cycle, developers knowingly commit code that is either incomplete, requires rework, produces errors, or is a temporary workaround. Such incomplete or temporary
Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning
Imbalanced-learn is an open-source Python toolbox aiming at providing a wide range of methods to cope with the problem of imbalanced datasets frequently encountered in machine learning and pattern