Foundations of data imbalance and solutions for a data democracy

@article{Kulkarni2021FoundationsOD,
  title={Foundations of data imbalance and solutions for a data democracy},
  author={Ajay Kulkarni and Deri Chong and Feras A. Batarseh},
  journal={ArXiv},
  year={2021},
  volume={abs/2108.00071}
}
The Impact of Partial Balance of Imbalanced Dataset on Classification Performance
TLDR
The concept of partial balance is proposed, consisting of Class Number of Partial Balance (β) and Balance Degree of Partial Samples (μ); combined with Global Slope (α), these establish a parameterized model that describes the differences among imbalanced datasets.
Automated Valuation Modelling: Analysing Mortgage Behavioural Life Profile Models Using Machine Learning Techniques
TLDR
These factors could provide a solid basis for the sustainable development of the mortgage market, and the approach in this research is a starting point for identifying the best decisions taken by banking institutions to contribute to the sustainable development of mortgage lending.
Integration of a machine learning model into a decision support tool to predict absenteeism at work of prospective employees
TLDR
A web-based decision tool allows hiring managers to make more informed decisions before hiring a potential employee, thus reducing time, financial loss and reducing the probability of economic insolvency.
Normalization and outlier removal in class center-based firefly algorithm for missing value imputation
TLDR
Combining normalization and outlier removal in C3-FA (ON + C3FA) was an efficient technique for obtaining actual data when handling missing values, and it also outperformed the methods of previous studies in terms of r and RMSE values.
Normalization and Outlier Removal in Class Center-Based Firefly Algorithm for Missing Value Imputation
TLDR
This study proposes a combination of normalization and outlier removal before imputing missing values using several methods (mean, random value, regression, multiple imputation, KNN, and C3-FA), and shows that the proposed method is able to reproduce the real values of the data (the prediction accuracy) and maintain the distribution accuracy.
A survey on artificial intelligence assurance
TLDR
This manuscript provides a systematic review of research works that are relevant to AI assurance, between years 1985 and 2021, and aims to provide a structured alternative to the landscape.
Air Temperature Prediction Using Different Datamining Approaches In Sulaymaniyah City In Iraq
TLDR
This paper investigates the use of various data mining approaches such as Support Vector Machine, Decision Tree, and Naïve Bayes for air temperature prediction within Sulaymaniyah City in Kurdistan, Iraq, and finds that the Support Vector Machine achieved promising performance among the algorithms used.
User Perception Analysis of Online Learning Platform “Zenius” During the Coronavirus Pandemic Using Text Mining Techniques
TLDR
Data from user reviews of the Zenius platform on Google Play Store is explored to determine the priorities for service improvement by the provider and reveals that the service aspects that should be prioritized to improve the online learning platform are related to tryouts and user accounts.
Machine Learning and Its Applications for Protozoal Pathogens and Protozoal Infectious Diseases
TLDR
This work presents a brief overview of important concepts in ML, with a focus on basic workflows, popular algorithms, feature extraction and selection, and model evaluation metrics, and provides forward-looking insights for perspectives and opportunities in future advances in ML techniques in this field.

References

SHOWING 1-10 OF 24 REFERENCES
Data Mining for Imbalanced Datasets: An Overview
  • N. Chawla
  • Computer Science
    The Data Mining and Knowledge Discovery Handbook
  • 2005
TLDR
In this Chapter, some of the sampling techniques used for balancing the datasets, and the performance measures more appropriate for mining imbalanced datasets are discussed.
Learning From Imbalanced Data
TLDR
This chapter aims to highlight the existence of imbalance in all real world data and the need to focus on the inherent characteristics present in imbalanced data that can degrade the performance of classifiers.
The class imbalance problem: A systematic study
TLDR
The assumption that the class imbalance problem does not only affect decision tree systems but also affects other classification systems such as Neural Networks and Support Vector Machines is investigated.
Learning from Imbalanced Data Sets
TLDR
Data Science can be considered as a discipline for discovering new and significant relationships, patterns and trends in the examination of large amounts of data in the search for knowledge contained in the information stored in large databases.
SMOTE: Synthetic Minority Over-sampling Technique
TLDR
A combination of over-sampling the minority (abnormal) class and under-sampling the majority class can achieve better classifier performance (in ROC space); the method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
Machine Learning from Imbalanced Data Sets 101
TLDR
What the problem is not is discussed, which will lead to a profitable discussion of what the problem indeed is, and how it might be addressed most effectively.
Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning
Imbalanced-learn is an open-source python toolbox aiming at providing a wide range of methods to cope with the problem of imbalanced dataset frequently encountered in machine learning and pattern
Addressing the Curse of Imbalanced Training Sets: One-Sided Selection
TLDR
Criteria to evaluate the utility of classifiers induced from such imbalanced training sets are discussed, explanation of the poor behavior of some learners under these circumstances is given, and a simple technique called one-sided selection of examples is suggested.
ADASYN: Adaptive synthetic sampling approach for imbalanced learning
TLDR
Simulation analyses on several machine learning data sets show the effectiveness of the ADASYN sampling approach across five evaluation metrics.
Learning Goal Oriented Bayesian Networks for Telecommunications Risk Management
TLDR
It is argued and demonstrated that current Bayesian network learning methods may fail to perform satisfactorily in real life applications since they do not learn models tailored to a specific goal or purpose.