PASS-JOIN: A Partition-based Method for Similarity Joins
TLDR
In this paper, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold.
  • 169
  • 21
MassJoin: A MapReduce-based method for scalable string similarity joins
String similarity join is an essential operation in data integration. The era of big data calls for scalable algorithms to support large-scale string similarity joins. In this paper, we study…
  • 101
  • 12
An Efficient Partition Based Method for Exact Set Similarity Joins
TLDR
We study the exact set similarity join problem, which, given two collections of sets, finds all similar set pairs across the collections.
  • 56
  • 11
The Data Civilizer System
TLDR
In many organizations, it is often challenging for users to find relevant data for specific tasks, since the data is usually scattered across the enterprise and often inconsistent.
  • 85
  • 4
A pivotal prefix based filtering algorithm for string similarity search
TLDR
We study the string similarity search problem with edit-distance constraints, which, given a set of data strings and a query string, finds the data strings similar to the query.
  • 53
  • 4
Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction
TLDR
We propose a unified framework to support many similarity/dissimilarity functions, such as Jaccard similarity, cosine similarity, Dice similarity, edit similarity, and edit distance.
  • 43
  • 4
Efficient Similarity Join and Search on Multi-Attribute Data
TLDR
In this paper, we study similarity join and search on multi-attribute data.
  • 30
  • 4
String similarity search and join: a survey
TLDR
String similarity search and join are two important operations in data cleaning and integration, which extend traditional exact search.
  • 73
  • 3
A unified framework for approximate dictionary-based entity extraction
TLDR
In this paper, we propose a unified framework to support various similarity/dissimilarity functions, such as Jaccard similarity, cosine similarity, Dice similarity, edit similarity, and edit distance.
  • 22
  • 3
Approximate String Joins with Abbreviations
TLDR
We study approximate string joins with abbreviations, which are a frequent type of term variation.
  • 18
  • 3