Efficient Exact Set-Similarity Joins


Given two input collections of sets, a set-similarity join (SSJoin) identifies all pairs of sets, one from each collection, that have high similarity. Recent work has identified SSJoin as a useful primitive operator in data cleaning. In this paper, we propose new algorithms for SSJoin. Our algorithms have two important features: They are exact, i.e., they always produce the correct answer, and they carry precise performance guarantees. We believe our algorithms are the first to have both features; previous algorithms with performance guarantees are only probabilistically approximate. We demonstrate the effectiveness of our algorithms using a thorough experimental evaluation over real-life and synthetic data sets.
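To make the operator concrete, here is a minimal brute-force sketch of an exact set-similarity join. It is not the algorithm proposed in the paper: the abstract names neither a similarity measure nor a threshold, so the use of Jaccard similarity, the threshold value, and the names `jaccard` and `naive_ssjoin` are all assumptions chosen for illustration. The sketch compares every pair, so it is exact but quadratic in the input sizes, which is the kind of cost a more efficient algorithm would aim to avoid.

```python
from typing import List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b| (assumed measure, not from the abstract)."""
    if not a and not b:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)


def naive_ssjoin(
    left: List[Set[str]], right: List[Set[str]], threshold: float
) -> List[Tuple[int, int, float]]:
    """Return (i, j, similarity) for every pair left[i], right[j] whose
    similarity is at least `threshold`. Every pair is checked, so the
    result is exact, at a cost of len(left) * len(right) comparisons."""
    results = []
    for i, a in enumerate(left):
        for j, b in enumerate(right):
            sim = jaccard(a, b)
            if sim >= threshold:
                results.append((i, j, sim))
    return results


# Hypothetical input: each set holds the word tokens of a short string.
left = [{"efficient", "exact", "set", "joins"}, {"data", "cleaning"}]
right = [{"exact", "set", "joins"}, {"data", "cleaning", "operator"}, {"synthetic"}]
pairs = naive_ssjoin(left, right, threshold=0.6)
```

With this input, `pairs` holds two matches: `left[0]` with `right[0]` (3 shared tokens out of 4 distinct, similarity 0.75) and `left[1]` with `right[1]` (2 out of 3).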

Cite this paper

@inproceedings{Arasu2006EfficientES, title={Efficient Exact Set-Similarity Joins}, author={Arvind Arasu and Venkatesh Ganti and Raghav Kaushik}, booktitle={VLDB}, year={2006} }