Efficient Exact Set-Similarity Joins

Abstract

Given two input collections of sets, a set-similarity join (SSJoin) identifies all pairs of sets, one from each collection, that have high similarity. Recent work has identified SSJoin as a useful primitive operator in data cleaning. In this paper, we propose new algorithms for SSJoin. Our algorithms have two important features: They are exact, i.e., they always produce the correct answer, and they carry precise performance guarantees. We believe our algorithms are the first to have both features; previous algorithms with performance guarantees are only probabilistically approximate. We demonstrate the effectiveness of our algorithms using a thorough experimental evaluation over real-life and synthetic data sets.
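
To make the operator concrete, the following is a minimal brute-force sketch of an exact set-similarity join. It is a baseline illustration only, not one of the algorithms proposed in the paper. It assumes Jaccard similarity as the similarity function and a user-supplied threshold, neither of which the abstract specifies; the names `jaccard` and `naive_ssjoin` are hypothetical.

```python
from typing import Hashable, Iterable, List, Set, Tuple


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b| (assumed similarity function)."""
    if not a and not b:
        # Convention chosen here: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


def naive_ssjoin(
    left: Iterable[Set[Hashable]],
    right: Iterable[Set[Hashable]],
    threshold: float,
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) with jaccard(left[i], right[j]) >= threshold.

    Every pair is compared, so the result is exact, but the cost is
    quadratic in the input sizes.
    """
    right_sets = [frozenset(s) for s in right]
    pairs = []
    for i, r in enumerate(left):
        r_set = frozenset(r)
        for j, s_set in enumerate(right_sets):
            if jaccard(r_set, s_set) >= threshold:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    left = [{"a", "b", "c"}, {"x", "y"}]
    right = [{"a", "b", "d"}, {"x", "y"}, {"q"}]
    print(naive_ssjoin(left, right, threshold=0.5))
```

Comparing every pair gives exactness trivially but no useful performance guarantee, which is the gap the abstract says the proposed algorithms address.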

Semantic Scholar estimates that this publication has 409 citations based on the available data.

Cite this paper

@inproceedings{Arasu2006EfficientES,
  title={Efficient Exact Set-Similarity Joins},
  author={Arvind Arasu and Venkatesh Ganti and Raghav Kaushik},
  booktitle={VLDB},
  year={2006}
}