Handling Duplicate Data in Data Warehouse for Data Mining

@article{Tamilselvi2011HandlingDD,
  title={Handling Duplicate Data in Data Warehouse for Data Mining},
  author={J. Tamilselvi and C. B. Gifta},
  journal={International Journal of Computer Applications},
  year={2011},
  volume={15},
  pages={7--15}
}
Detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality in data warehouses. Often, the same logical real-world entity has multiple representations in the data warehouse. Duplicate elimination is hard because duplicates are caused by several types of errors, such as typographical errors and different representations of the same logical value. Also, it is important to detect and clean equivalence errors because an…
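As a rough illustration of why typographical errors and differing representations make duplicates hard to catch with exact comparison, here is a minimal sketch of approximate duplicate detection. It is not the method of the paper: the helper names (`normalize`, `similarity`, `find_duplicates`), the sample records, and the 0.85 threshold are all assumptions chosen for illustration, and the similarity measure is the generic one from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(value: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in value.lower())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_duplicates(records, threshold=0.85):
    """Return index pairs whose similarity reaches the (assumed) threshold.

    Compares every pair, so it is only suitable for small inputs.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if similarity(a, b) >= threshold:
            pairs.append((i, j))
    return pairs


# Hypothetical records: the first two describe the same entity with a typo.
records = [
    "John Smith, 12 Main Street",
    "Jon Smith, 12 Main Street",
    "Mary Jones, 4 Oak Avenue",
]
duplicate_pairs = find_duplicates(records)
```

An exact-match comparison would treat all three records as distinct, whereas the similarity threshold flags the first two as likely duplicates. The all-pairs loop is quadratic in the number of records, which is why blocking methods such as those in the citing papers below are used on larger data sets.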
Duplicates Detection Within Incomplete Data Sets Using Blocking and Dynamic Sorting Key Methods
TLDR
This paper proposes a method that deals with the impact of missing values by using a dynamic sorting key, an extension of the blocking method that relies on two functions: a uniqueness calculation function (UF) and a completeness function (CF), which searches for missing values.
Unsupervised record matching with noisy and incomplete data
TLDR
The problem of duplicate detection in noisy and incomplete data is considered, and a vectorized soft term frequency-inverse document frequency method is introduced, with an optional refinement step, for automatically determining the number of groups.
Feature Extraction and Duplicate Detection for Text Mining : A Survey
Text mining, also known as Intelligent Text Analysis, is an important research area. It is very difficult to focus on the most appropriate information due to the high dimensionality of data. Feature…
A Review of Unsupervised and Semi-supervised Blocking Methods for Record Linkage
TLDR
This chapter reviews existing blocking techniques and discusses their advantages and disadvantages, details other research areas that have recently arisen, and discusses unresolved issues that are still to be addressed.
Lossless Data Deduplication: Alternatif Solusi untuk Mengatasi Duplicated Record
TLDR
This paper seeks to provide data deduplication solutions without losing the historical value of the transaction by using a mapping table between the dimension table and the fact table to save all duplicated records that relate to the transactions.
A framework for data-driven informatization of the construction company
TLDR
The proposed informatization solution provides a theoretical basis for realizing data sharing and interoperability between business management and project management and will help construction companies to improve the efficiency of both company operations and project delivery by optimizing the business process and supporting decision making.
Fault Prediction using Metric Threshold Value of Object Oriented Systems
Software metrics help in analyzing many factors of software quality such as fault proneness, reusability, and maintenance effort. Software metrics are values collected from software source code to…
A Summarisation Tool for Hotel Reviews
  • N. H. A. Rahim, M. Hasnan
  • Computer Science
  • 2018 14th International Conference on Semantics, Knowledge and Grids (SKG)
  • 2018
TLDR
In order to summarise the hotel reviews, a method named Featured Noun Pairing has been chosen, which associates features (nouns) and adjectives to represent a whole review sentence.
Social Media Analytics, Types and Methodology
TLDR
This chapter discusses concepts elaborating on and categorizing various mining tasks (supervised and unsupervised) while presenting the required process and its steps to analyze data retrieved from the Social Media (SM) ecosystem.
Advance on large scale near-duplicate video retrieval
TLDR
A comprehensive survey and an updated review of the advance on large-scale NDVR to supply guidance for researchers and present the development trends and research directions of this topic.

References

SHOWING 1-10 OF 16 REFERENCES
A knowledge-based approach for duplicate elimination in data cleaning
TLDR
An experimental study with two real-world datasets shows that the generic knowledge-based framework for effective data cleaning can accurately identify duplicates and anomalies with high recall and precision, thus effectively resolving the recall–precision dilemma.
Duplicate Detection with PMC -- A Parallel Approach to Pattern Matching
TLDR
This project presents an approach to the problem that takes advantage of a multiple instruction stream, multiple data stream (MIMD) architecture called a Pattern Matching Chip (PMC), which performs large numbers of character comparisons in parallel and will allow fuzzy matching against the entire data set very quickly.
Duplicate Detection in Biological Data using Association Rule Mining
Recent advancement in biotechnology has produced a massive amount of raw biological data, which is accumulating at an exponential rate. Errors, redundancy and discrepancies are prevalent in the raw…
Duplicate Record Detection: A Survey
TLDR
This paper presents an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database and covers similarity metrics that are commonly used to detect similar field entries.
Data Mining: Concepts and Techniques
TLDR
This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
Unsupervised Duplicate Detection Using Sample Non-duplicates
TLDR
This paper presents an unsupervised, domain-independent approach to duplicate detection that starts with a broad alignment of potential duplicates and analyses the distribution of observed similarity values among these potential duplicates and among representative sample non-duplicates to improve the initial alignment.
Data Preparation for Data Mining
TLDR
A twenty-five-year veteran of what has become the data mining industry, Pyle shares his own successful data preparation methodology, offering both a conceptual overview for managers and complete technical details for IT professionals.
Adaptive duplicate detection using learnable string similarity measures
TLDR
This paper proposes to employ learnable text distance functions for each database field, and shows that such measures are capable of adapting to the specific notion of similarity that is appropriate for the field's domain.
Improved Approximate Detection of Duplicates for Data Streams Over Sliding Windows
TLDR
This paper presents a novel data structure, Decaying Bloom Filter (DBF), as an extension of the Counting Bloom Filter, that effectively removes stale elements as new elements continuously arrive over sliding windows on the DBF basis.
Enhancing data analysis with noise removal
TLDR
Four techniques intended for noise removal to enhance data analysis in the presence of high noise levels are explored, including a hyperclique-based data cleaner (HCleaner), which generally leads to better clustering performance and higher quality association patterns as the amount of noise being removed increases.