Publications
PASS-JOIN: A Partition-based Method for Similarity Joins
TLDR
This paper studies string similarity joins with edit-distance constraints, which find pairs of similar strings from two large sets of strings whose edit distance is within a given threshold, and proposes a partition-based method called Pass-Join.
MassJoin: A mapreduce-based method for scalable string similarity joins
TLDR
A MapReduce-based framework, called MASSJOIN, is proposed, which supports both set-based and character-based similarity functions and extends the existing partition-based signature scheme to support set-based similarity functions.
An Efficient Partition Based Method for Exact Set Similarity Joins
TLDR
This paper proposes a partition scheme that partitions the sets into several subsets and guarantees that two sets are similar only if they share a common subset, and an adaptive grouping mechanism that can reduce the complexity to O(s log s).
Detecting Data Errors: Where are we and what needs to be done?
TLDR
A holistic multi-tool strategy is proposed that orders the invocations of the available tools to maximize their benefit while minimizing the human effort needed to verify results.
The Data Civilizer System
TLDR
Initial positive experiences are described, showing that the preliminary DATA CIVILIZER system shortens the time and effort required to find, prepare, and analyze data.
JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes
TLDR
The new algorithm, JOSIE (Joining Search using Intersection Estimation), minimizes the cost of set reads and inverted index probes used in finding the top-k sets and outperforms the state-of-the-art overlap set similarity search techniques on data lakes.
Efficient Similarity Join and Search on Multi-Attribute Data
TLDR
It is proved that constructing an optimal prefix tree is NP-complete, and a greedy algorithm is developed to achieve high performance. The cost model is then extended to support similarity search, and a budget-based algorithm is devised to construct multiple high-quality prefix trees.
Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction
TLDR
This paper proposes a unified framework to support many similarity/dissimilarity functions, such as Jaccard similarity, cosine similarity, Dice similarity, edit similarity, and edit distance, and shows that the method achieves high performance and outperforms state-of-the-art methods.
Approximate String Joins with Abbreviations
TLDR
This paper studies approximate string joins (ASJ) with abbreviations, which are a frequent type of term variation, and proposes an end-to-end workflow that outputs accurate join results, scales well as input size grows, and greatly outperforms state-of-the-art approaches in both accuracy and efficiency.
String similarity search and join: a survey
TLDR
A comprehensive survey of string similarity search and join is presented. Widely used similarity functions for quantifying similarity are introduced, and related topics such as approximate entity extraction, type-ahead search, and approximate substring matching are covered.