Big Data Cleaning

@inproceedings{Tang2014BigDC,
  title={Big Data Cleaning},
  author={Nan Tang},
  booktitle={APWeb},
  year={2014}
}
Data cleaning is, in fact, a lively subject that has played an important part in the history of data management and data analytics, and it is still undergoing rapid development. Moreover, data cleaning is considered a major challenge in the era of big data, due to the increasing volume, velocity and variety of data in many applications. This paper aims to provide an overview of recent work in different aspects of data cleaning: error detection methods, data repairing algorithms, and a…
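The error detection methods the survey covers are often rule-based. As an illustrative sketch only (not code from the paper; the data and names are hypothetical), the following flags pairs of records that violate a functional dependency ZIP → City, a classic integrity rule used to detect dirty values:

```python
# Illustrative sketch of rule-based error detection: find record pairs that
# agree on the left-hand side of a functional dependency (zip) but disagree
# on the right-hand side (city). Data and attribute names are hypothetical.

records = [
    {"id": 1, "zip": "10001", "city": "New York"},
    {"id": 2, "zip": "10001", "city": "NYC"},      # conflicts with record 1
    {"id": 3, "zip": "60601", "city": "Chicago"},
]

def fd_violations(rows, lhs, rhs):
    """Return (id, id) pairs of rows that agree on `lhs` but disagree on `rhs`."""
    seen = {}          # first row observed for each lhs value
    violations = []
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key][rhs] != row[rhs]:
            violations.append((seen[key]["id"], row["id"]))
        seen.setdefault(key, row)
    return violations

print(fd_violations(records, "zip", "city"))  # [(1, 2)]
```

Detected violations like these are the input to the repairing algorithms the survey discusses, which decide which of the conflicting values to change.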
Big RDF data cleaning
  • N. Tang
  • Computer Science
  • 2015 31st IEEE International Conference on Data Engineering Workshops
  • 2015
TLDR
This paper revisits data quality problems that appear in RDF data, a standard model for data interchange on the Semantic Web, and describes possible solutions that shed light on (semi-)automatically cleaning (big) RDF data.
A Survey on Big Data Pre-processing
  • Zhi-bin Guan, Tongkai Ji, Xu Qian, Y. Ma, Xuehai Hong
  • Computer Science
  • 2017 5th Intl Conf on Applied Computing and Information Technology/4th Intl Conf on Computational Science/Intelligence and Applied Informatics/2nd Intl Conf on Big Data, Cloud Computing, Data Science (ACIT-CSII-BCD)
  • 2017
TLDR
This survey discusses the four phases of data pre-processing (data cleansing, data integration, data reduction, and data transformation) and presents approaches for a variety of purposes, showing that current methods and techniques need further refinement to improve the quality of data before analysis.
Data Cleaning Optimization for Grain Big Data Processing using Task Merging
  • X. Ju, F. Lian, Yuan Zhang
  • Computer Science
  • 2019 6th International Conference on Information Science and Control Engineering (ICISCE)
  • 2019
TLDR
This paper optimizes several data cleaning modules, such as entity identification, inconsistent-data restoration, and missing-value filling, using a new optimization technique based on task merging that increases the efficiency of grain big data cleaning.
A Data Cleaning Method on Massive Spatio-Temporal Data
TLDR
Time-based clustering and rule-based filtering are proposed for cleaning massive bus IC card data, guaranteeing consistency and legality among spatio-temporal attributes.
Big Data Validation Case Study
  • Chunli Xie, J. Gao, Chuanqi Tao
  • Computer Science
  • 2017 IEEE Third International Conference on Big Data Computing Service and Applications (BigDataService)
  • 2017
TLDR
This paper studies original big data quality, data quality dimensions, and data validation processes and tools, presenting an important process for recognizing and improving data quality.
A general perspective of Big Data: applications, tools, challenges and trends
TLDR
This paper provides a comprehensive review of the Big Data literature of the last 4 years, identifying the main challenges, application areas, tools, and emergent trends of Big Data.
Big Data Pre-processing: A Quality Framework
TLDR
A QBD model incorporating processes to support data quality profile selection and adaptation is proposed; it tracks and registers on a data provenance repository the effect of every data transformation that occurs in the pre-processing phase.
A Data Cleaning Service on Massive Spatio-Temporal Data in Highway Domain
TLDR
A data cleaning service based on business rules is proposed that can efficiently clean raw toll data with spatio-temporal attributes, including the calibration of erroneous and invalid data, the repair of erroneous data, and the filtering of duplicate data.
Big Data Quality Framework: Pre-Processing Data in Weather Monitoring Application
  • A. Juneja, N. N. Das
  • Computer Science
  • 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon)
  • 2019
TLDR
A pre-processing framework is proposed to address data quality in a weather monitoring and forecasting application; it also takes global warming parameters into account and raises alerts/notifications to warn users and scientists in advance.
Big data quality framework: a holistic approach to continuous quality management
TLDR
A BDQ Management Framework for enhancing pre-processing activities while strengthening data control is proposed; it uses a new concept called the Big Data Quality Profile, which captures quality outlines, requirements, attributes, dimensions, scores, and rules.

References

Showing 1-10 of 33 references
NADEEF: a commodity data cleaning system
TLDR
NADEEF is presented, an extensible, generalized, and easy-to-deploy data cleaning platform designed to allow cleaning algorithms to cope with multiple rules holistically, i.e., detecting and repairing data errors without differentiating between various types of rules.
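The "holistic rules" idea can be sketched as a uniform detection interface that treats heterogeneous rule types the same way. This is a hypothetical illustration of the concept only, not NADEEF's actual API; all class and attribute names are invented:

```python
# Hypothetical sketch of holistic, rule-agnostic error detection: every rule
# exposes detect(), and the engine collects violations from all rule types
# uniformly. Names and interface are illustrative, not NADEEF's API.

class Rule:
    def detect(self, rows):
        raise NotImplementedError

class NotNullRule(Rule):
    def __init__(self, column):
        self.column = column
    def detect(self, rows):
        return [i for i, r in enumerate(rows) if r.get(self.column) in (None, "")]

class RangeRule(Rule):
    def __init__(self, column, lo, hi):
        self.column, self.lo, self.hi = column, lo, hi
    def detect(self, rows):
        return [i for i, r in enumerate(rows)
                if not (self.lo <= r[self.column] <= self.hi)]

def detect_all(rows, rules):
    """Run every rule and map each rule's class name to violating row indices."""
    return {type(rule).__name__: rule.detect(rows) for rule in rules}

data = [{"name": "a", "age": 34}, {"name": "", "age": 210}]
print(detect_all(data, [NotNullRule("name"), RangeRule("age", 0, 120)]))
# {'NotNullRule': [1], 'RangeRule': [1]}
```

Because all rule types report violations through one interface, a repair step can reason about them jointly instead of fixing one rule type at a time.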
Dependencies revisited for improving data quality
  • W. Fan
  • Computer Science
  • PODS
  • 2008
TLDR
This paper provides an overview of recent advances in revising classical dependencies for improving data quality, in response to the increasing demand for data quality technology.
NADEEF: A Generalized Data Cleaning System
TLDR
NADEEF, an extensible, generic, and easy-to-deploy data cleaning system, distinguishes between a programming interface and a core in order to achieve generality and extensibility.
Holistic data cleaning: Putting violations into context
TLDR
Experimental results on real datasets show that the holistic approach outperforms previous algorithms in terms of quality and efficiency of the repair.
Potter's Wheel: An Interactive Data Cleaning System
TLDR
Potter's Wheel is presented, an interactive data cleaning system that tightly integrates transformation and discrepancy detection; users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.
Modeling and Querying Possible Repairs in Duplicate Detection
TLDR
A novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings is proposed; it is shown how to efficiently support relational queries under the model and to allow new types of queries on the set of possible repairs.
ERACER: a database approach for statistical inference and data cleaning
Real-world databases often contain syntactic and semantic errors, in spite of integrity constraints and other safety measures incorporated into modern DBMSs. We present ERACER, an iterative…
Incremental Detection of Inconsistencies in Distributed Data
TLDR
It is shown that the incremental detection problem is NP-complete for a database D that is partitioned either vertically or horizontally, even when Σ and D are fixed, and that it is bounded: there exist algorithms to detect errors such that their computational cost and data shipment are both linear in the size of ΔD and ΔV.
Interaction between Record Matching and Data Repairing
TLDR
This article provides a uniform framework that seamlessly unifies repairing and matching operations to clean a database based on integrity constraints, matching rules, and master data, and proposes efficient algorithms to clean data via both matching and repairing.
Towards dependable data repairing with fixing rules
TLDR
This work introduces an automated approach for dependably repairing data errors based on a novel class of fixing rules, develops efficient algorithms to check whether a set of fixing rules is consistent, and discusses approaches to resolve inconsistent fixing rules.