A fast approach for parallel deduplication on multicore processors

@inproceedings{Bianco2011AFA,
  title={A fast approach for parallel deduplication on multicore processors},
  author={Guilherme Dal Bianco and Renata de Matos Galante and Carlos Alberto Heuser},
  booktitle={Proceedings of the 2011 ACM Symposium on Applied Computing},
  year={2011}
}
In this paper, we propose a fast approach that parallelizes the deduplication process on multicore processors. Our approach, named MD-Approach, combines an efficient blocking method with a robust data-parallel programming model. The blocking phase is composed of two steps. The first step generates large blocks by grouping records with a low degree of similarity. The second step segments large blocks, which may otherwise cause load imbalance, into more precise sub-blocks. A parallel data programming model…
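The two-step blocking phase described in the abstract can be sketched as follows. This is an illustrative toy in Python, not the MD-Approach implementation: the prefix-based keys, the `name` field, and the `max_block_size` threshold are all assumptions made for the example.

```python
from collections import defaultdict

def coarse_block(records, key_len=1):
    """Step 1: group records into coarse blocks by a short key prefix."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["name"][:key_len].lower()].append(rec)
    return blocks

def refine_blocks(blocks, max_block_size=2):
    """Step 2: recursively split oversized blocks with a longer prefix key,
    so no block dominates the load handed to a worker."""
    refined = {}
    for key, recs in blocks.items():
        # stop if the block is small enough or the key cannot grow further
        if len(recs) <= max_block_size or all(len(r["name"]) <= len(key) for r in recs):
            refined[key] = recs
        else:
            refined.update(refine_blocks(coarse_block(recs, key_len=len(key) + 1),
                                         max_block_size))
    return refined

records = [{"name": "Alice"}, {"name": "Alan"}, {"name": "Albert"}, {"name": "Bob"}]
blocks = refine_blocks(coarse_block(records))
```

The resulting sub-blocks are small and independent, which is what makes per-block matching easy to distribute across cores.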
G-Paradex: GPU-Based Parallel Indexing for Fast Data Deduplication
TLDR
G-Paradex is a novel deduplication framework that significantly reduces duplicate-detection time by organizing chunk fingerprints in a prefix tree, achieving a 2-4X speedup for duplicate detection.
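As a rough illustration of the prefix-tree indexing G-Paradex is described as using, the sketch below stores hex fingerprints in a nested-dict trie and flags re-inserted fingerprints as duplicates. The class, the SHA-1 choice, and the `"$"` terminator are assumptions for this example, not G-Paradex code (which offloads this work to the GPU).

```python
import hashlib

class FingerprintTrie:
    """Nested-dict prefix tree over hex fingerprint strings."""
    def __init__(self):
        self.root = {}

    def insert(self, fp):
        """Insert a fingerprint; return True if it was already present."""
        node = self.root
        for ch in fp:
            node = node.setdefault(ch, {})
        seen = node.get("$", False)   # "$" marks a complete fingerprint
        node["$"] = True
        return seen

def fingerprint(chunk):
    """Chunk fingerprint (SHA-1, as many dedup systems use)."""
    return hashlib.sha1(chunk).hexdigest()

trie = FingerprintTrie()
first = trie.insert(fingerprint(b"chunk-1"))   # new chunk
second = trie.insert(fingerprint(b"chunk-1"))  # duplicate detected
```

A lookup touches only the nodes along one fingerprint's prefix, which is what makes the structure attractive for parallel traversal.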
Improving storage capacity by distributed exact deduplication systems
TLDR
In a world where data deduplication storage systems continuously struggle to provide the throughput and disk capacity needed to store and retrieve data within reasonable times, this paper presents a proof-of-concept design for implementing such a system: a Distributed Exact Deduplication System.
HARENS: Hardware Accelerated Redundancy Elimination in Network Systems
TLDR
The results indicate that throughput can be increased by a factor of 14 compared to a native implementation of a network deduplication algorithm, providing a net transmission increase of up to 10.7 Gigabits per second (Gbps).
A Data Deduplication Framework of Disk Images with Adaptive Block Skipping
TLDR
Under the framework, deduplication operations are skipped for data chunks that heuristic prediction deems likely non-duplicates; a hit-and-matching extension process identifies duplicates within the skipped blocks, and a hysteresis-based hash-indexing process updates the hash indices for re-encountered skipped chunks.
Towards task-based parallelization for entity resolution
TLDR
This paper presents a framework for task-parallelization of ER, supporting in particular ER of large amounts of semi-structured and heterogeneous data, and discusses a possible implementation of the framework.
Efficient sequential and parallel algorithms for record linkage
TLDR
Efficient sequential and parallel algorithms for record linkage are reported that handle any number of datasets and outperform previous algorithms, including TPA (FCED).
Deduplication in Databases using Locality Sensitive Hashing and Bloom filter
TLDR
This paper proposes a similarity-based data deduplication scheme that integrates Bloom filters and Locality Sensitive Hashing (LSH), significantly reducing computation overhead by performing deduplication operations only for similar texts.
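A toy version of the two ingredients named in the summary, assuming standard textbook formulations rather than the paper's exact scheme: a Bloom filter for a cheap membership pre-check, and MinHash signatures whose bands would route similar texts into shared LSH buckets. Sizes and hash counts are illustrative.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: fast membership test with no false negatives."""
    def __init__(self, size=1024, num_hashes=3):
        self.size, self.num_hashes = size, num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

def minhash_signature(text, num_hashes=8):
    """MinHash over 3-character shingles; similar texts tend to share
    signature entries, so banding them yields shared LSH buckets."""
    shingles = {text[i:i + 3] for i in range(len(text) - 2)}
    return tuple(
        min(int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16) for s in shingles)
        for i in range(num_hashes)
    )

bf = BloomFilter()
bf.add("record-42")
```

Only texts that land in the same bucket would reach the expensive field-by-field comparison, which is the source of the overhead reduction claimed above.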
Adaptive sorted neighborhood blocking for entity matching with MapReduce
TLDR
This paper investigates how the MapReduce programming model can be used to perform efficient parallel EM using a variation of the Sorted Neighborhood Method (SNM) with an adaptively varying window size.
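The (non-adaptive) Sorted Neighborhood core that the paper builds on can be sketched as below: sort records on a blocking key, then compare each record only with its neighbors inside a fixed window. The adaptive variant would grow or shrink the window based on key similarity; the window size and threshold here are invented for the example.

```python
from difflib import SequenceMatcher

def sorted_neighborhood_pairs(records, window=3, threshold=0.8):
    """Sort records on a blocking key (here the record string itself),
    then compare each record only with the next window-1 records."""
    ordered = sorted(records)
    pairs = []
    for i, rec in enumerate(ordered):
        for j in range(i + 1, min(i + window, len(ordered))):
            # candidate pair: similar enough neighbors count as matches
            if SequenceMatcher(None, rec, ordered[j]).ratio() >= threshold:
                pairs.append((rec, ordered[j]))
    return pairs

matches = sorted_neighborhood_pairs(["jon smith", "mary jones", "john smith"])
```

Because each record is compared with at most `window - 1` neighbors, the quadratic all-pairs cost drops to a linear number of comparisons, and sorted runs partition cleanly across MapReduce workers.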
GHOST: GPGPU-offloaded high performance storage I/O deduplication for primary storage system
TLDR
This work thoroughly analyzes the performance bottlenecks of previous deduplication systems to meet the requirements of primary storage, and proposes a new deduplication system utilizing GPGPU.
A Practical Approach for Scalable Record Linkage on Hadoop
TLDR
This paper proposes a practical 3-phase MapReduce approach that performs blocking, filtering, and linking in three consecutive processes on a Hadoop cluster, and shows that it functions efficiently and effectively while keeping high recall, in contrast to traditional methods.

References

SHOWING 1-10 OF 18 REFERENCES
A Scalable Parallel Deduplication Algorithm
  • W. Santos, T. Teixeira, +4 authors A. Silva
  • Computer Science
  • 19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07)
  • 2007
TLDR
By using probabilistic record linkage, the parallel deduplication algorithm FERAPARDA was able to successfully detect replicas in synthetic datasets with more than 1 million records in about 7 minutes using a 20-computer cluster, achieving an almost linear speedup.
Efficient parallel set-similarity joins using MapReduce
TLDR
This paper proposes a 3-stage approach for end-to-end set-similarity joins in parallel using the popular MapReduce framework, and reports results from extensive experiments on real datasets to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.
Evaluating MapReduce for Multi-core and Multiprocessor Systems
TLDR
It is established that, given a careful implementation, MapReduce is a promising model for scalable performance on shared-memory systems with simple parallel code.
MapReduce: Simplified Data Processing on Large Clusters
TLDR
This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets, which runs on large clusters of commodity machines and is highly scalable.
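The programming model can be illustrated with a single-process word count, the canonical MapReduce example. This sketch only mimics the map / shuffle / reduce structure; the real system distributes these phases across a cluster with fault tolerance.

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Run user-supplied map and reduce functions; the 'shuffle'
    (grouping intermediate values by key) is handled here."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):   # map phase: emit (key, value) pairs
            groups[key].append(value)     # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}  # reduce

counts = map_reduce(
    ["to be or", "not to be"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda key, values: sum(values),
)
```

The appeal of the model, on clusters and multicores alike, is that only `mapper` and `reducer` are user code; partitioning, grouping, and scheduling belong to the framework.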
Parallel linkage
TLDR
This work shows that the intricate interplay between match and merge can exploit the characteristics of each scenario to achieve good parallelization of the (record) linkage problem.
D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution
TLDR
A family of algorithms, D-Swoosh, is presented for distributing the ER workload across multiple processors that use generic match and merge functions, and ensure that new merged records are distributed to processors that may have matching records.
Robust record linkage blocking using suffix arrays
TLDR
This work designs and evaluates an efficient and highly scalable blocking approach based on suffix arrays, which exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in much higher accuracy while retaining the high scalability of the base suffix-array method.
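Suffix-based blocking, the idea underlying this approach, can be sketched as follows: index each record under every sufficiently long suffix of its blocking key, so records sharing a long suffix fall into a common block. The `min_len` and `max_block` parameters, and the omission of the paper's index-order block merging, are simplifications for this example.

```python
from collections import defaultdict

def suffix_blocks(keys, min_len=4, max_block=5):
    """Index each key under all of its suffixes of length >= min_len."""
    blocks = defaultdict(set)
    for key in keys:
        for i in range(len(key) - min_len + 1):
            blocks[key[i:]].add(key)
    # keep only shared blocks, and drop oversized (uninformative) ones,
    # as suffix-array blocking does
    return {s: ks for s, ks in blocks.items() if 1 < len(ks) <= max_block}

blocks = suffix_blocks(["catherine", "katherine", "robert"])
```

Here "catherine" and "katherine" meet in every shared suffix block (e.g. "atherine"), despite differing in their first character, which is why suffix blocking is robust to errors early in the key.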
Febrl - A Parallel Open Source Data Linkage System: http://datamining.anu.edu.au/linkage.html
TLDR
This paper presents an innovative data linkage system called Febrl, which includes a new probabilistic approach for improved data cleaning and standardisation, innovative indexing methods, a parallelisation approach implemented transparently to the user, and a data set generator that allows the random creation of records containing names and addresses.
Duplicate Record Detection: A Survey
TLDR
This paper presents an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database and covers similarity metrics that are commonly used to detect similar field entries.