Corpus ID: 237513734

A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

Authors: Ga Young Lee, Lubna Alzamil, Bakhtiyar Doskenov, Arash Termehchy
Data cleaning is the initial stage of any machine learning project and one of the most critical processes in data analysis. It ensures that the dataset is free of incorrect or erroneous data. It can be done manually with data wrangling tools or automatically with a computer program. Data cleaning entails a series of procedures that, once complete, make the data ready for analysis. Given its significance in numerous fields, there is a growing interest…


ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning
This work proposes ActiveClean, a progressive framework for training Machine Learning models with data cleaning, which updates a model iteratively as the analyst cleans small batches of data, and includes numerous optimizations such as importance weighting and dirty data detection.
AlphaClean: Automatic Generation of Data Cleaning Pipelines
AlphaClean is a framework that rethinks parameter tuning for data cleaning pipelines. It is significantly more robust to straggling data cleaning methods and to redundancy in the data cleaning library, and it can incorporate state-of-the-art cleaning systems such as HoloClean as cleaning operators.
SampleClean: Fast and Reliable Analytics on Dirty Data
The SampleClean project has developed a new suite of techniques to estimate the results of queries when only a sample of data can be cleaned, and a gradient-descent algorithm is described that extends the key ideas to the increasingly common Machine Learning-based analytics.
Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions
It is shown that the proposed CPClean approach, built on certain predictions (CP), can often significantly outperform existing techniques in classification accuracy with mild manual cleaning effort.
HoloClean: Holistic Data Repairs with Probabilistic Inference
A series of optimizations is introduced to ensure that inference over HoloClean's probabilistic model scales to instances with millions of tuples, and HoloClean yields an average F1 improvement of more than 2× against state-of-the-art methods.
The Staggering Impact of Dirty Data
  • 2021
Data Cleaning
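
The SampleClean entry above describes estimating query results when only a sample of the data can be cleaned. The following is a minimal sketch of that general idea, not the SampleClean algorithm itself: it draws a uniform random sample, cleans only those records, and reports a mean with an approximate 95% interval based on the normal approximation. The function names, the record layout, and the "cents recorded instead of dollars" error model are all hypothetical and chosen only for illustration.

```python
import math
import random
import statistics


def estimate_mean_from_cleaned_sample(records, clean, sample_size, seed=0):
    """Estimate the mean of the cleaned data by cleaning only a random sample.

    records     -- list of dirty records (any type accepted by `clean`)
    clean       -- hypothetical user-supplied function: dirty record -> clean float
    sample_size -- number of records to clean (must be >= 2)
    Returns (estimated_mean, half_width_of_approx_95_percent_interval).
    """
    rng = random.Random(seed)
    sample = rng.sample(records, sample_size)   # uniform, without replacement
    cleaned = [clean(r) for r in sample]        # only the sample is cleaned
    mean = statistics.fmean(cleaned)
    std_err = statistics.stdev(cleaned) / math.sqrt(sample_size)
    return mean, 1.96 * std_err                 # normal-approximation interval


def clean_record(record):
    """Hypothetical cleaning rule: convert amounts recorded in cents to dollars."""
    value, unit = record
    return value / 100.0 if unit == "cents" else value


# Hypothetical dirty dataset: amounts 1..100 dollars, every 10th stored in cents.
records = [
    (float(i * 100), "cents") if i % 10 == 0 else (float(i), "usd")
    for i in range(1, 101)
]

estimate, half_width = estimate_mean_from_cleaned_sample(
    records, clean_record, sample_size=30
)
print(f"estimated clean mean: {estimate:.2f} +/- {half_width:.2f}")
```

Cleaning a larger sample narrows the interval at the cost of more cleaning effort. The interval here ignores the finite-population correction, so it is slightly conservative when the sample is a large fraction of the data.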