LSH Ensemble: Internet-Scale Domain Search

@article{Zhu2016LSHEI,
  title={LSH Ensemble: Internet-Scale Domain Search},
  author={Erkang Zhu and Fatemeh Nargesian and Ken Q. Pu and Ren{\'e}e J. Miller},
  journal={ArXiv},
  year={2016},
  volume={abs/1603.07410}
}
We study the problem of domain search where a domain is a set of distinct values from an unspecified universe. We use the Jaccard set containment score, defined as |Q ∩ X| / |Q|, as the measure of relevance of a domain X to a query domain Q. Our choice of Jaccard set containment over Jaccard similarity as a measure of relevance makes our work particularly suitable for searching Open Data and data on the web, as Jaccard similarity is known to have poor performance over sets with…
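The containment score defined in the abstract can be sketched in a few lines of Python and compared with ordinary Jaccard similarity. This is a minimal, exact computation for illustration only; it is not the LSH Ensemble index itself. The function names and the example data are hypothetical. The choice of a small query set against a much larger domain is an assumption made to show one way the two measures can diverge, and is not taken from the truncated abstract.

```python
def containment(query: set, domain: set) -> float:
    """Jaccard set containment of `query` in `domain`: |Q ∩ X| / |Q|."""
    if not query:
        raise ValueError("containment is undefined for an empty query")
    return len(query & domain) / len(query)


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        raise ValueError("Jaccard similarity is undefined for two empty sets")
    return len(a & b) / len(union)


# Hypothetical example: a 3-value query fully contained in a 100-value domain.
query = {"ON", "QC", "BC"}
large_domain = query | {f"value_{i}" for i in range(97)}

print(containment(query, large_domain))  # every query value appears in the domain
print(jaccard(query, large_domain))      # shrinks as the domain grows
```

In this sketch the containment score depends only on the query size, while the Jaccard similarity also depends on the size of the candidate domain.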
LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set Similarity Under Skew
TLDR
Theoretically, it is shown that LSF-Join efficiently finds most close pairs, even for small similarity thresholds and for skewed input sets, and guarantees on the communication, work, and maximum load of the algorithm are proved.
Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment
TLDR
LAZO is a method to simultaneously estimate both the similarity and containment of datasets, based on a redefinition of Jaccard similarity which takes into account the cardinality of each set.
GB-KMV: An Augmented KMV Sketch for Approximate Containment Similarity Search
TLDR
This paper proposes a novel augmented KMV sketch technique, namely GB-KMV, which is data-dependent and can achieve a good trade-off between the sketch size and the accuracy, and shows that it outperforms the state-of-the-art technique LSH-E in terms of estimation accuracy under practical assumptions.
Adaptive Top-k Overlap Set Similarity Joins
TLDR
A solution to the top-k overlap set similarity join (TkOSSJ) that returns k pairs of sets with the highest overlap similarities is proposed and an adaptive step size algorithm is presented that is capable of automatically adjusting the step size, thus reducing redundant computations.
Selectivity Estimation on Set Containment Search
TLDR
Ot-Sampling partitions records based on element frequency and occurrence patterns and is significantly more accurate than the simple random sampling method and IL-GKMV. The work also considers weighted set containment selectivity estimation and devises a stratified random sampling approach named StrRS.
Scalable Data Discovery Using Profiles
TLDR
This work defines a novel notion of join quality that relies on a metric considering both the containment and cardinality proportions between candidate attributes, and implements this approach in a system called NextiaJD, and presents extensive experiments to show the predictive performance and computational efficiency of this method.
High-Dimensional Similarity Search for Scalable Data Science
TLDR
This tutorial revisits the similarity search problem in light of the recent advances in the field and the new big data landscape, and surveys the state-of-the-art high-dimensional similarity search approaches and shares surprising insights about their strengths and weaknesses.
JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes
TLDR
The new algorithm, JOSIE (Joining Search using Intersection Estimation), minimizes the cost of set reads and inverted index probes used in finding the top-k sets and outperforms the state-of-the-art overlap set similarity search techniques on data lakes.
Similarity query processing for high-dimensional data
TLDR
This tutorial reviews exact and approximate methods such as cover tree, locality sensitive hashing, product quantization, and proximity graphs, and discusses the selectivity estimation problem and shows how researchers are bringing in state-of-the-art ML techniques to address the problem.
Table Union Search on Open Data
TLDR
This work defines the table union search problem and presents a probabilistic solution for finding tables that are unionable with a query table within massive repositories, and proposes a data-driven approach that automatically determines the best model to use for each pair of attributes.

References

SHOWING 1-10 OF 30 REFERENCES
LSH forest: self-tuning indexes for similarity search
TLDR
This index uses the well-known technique of locality-sensitive hashing (LSH), but improves upon previous designs by eliminating the different data-dependent parameters for which LSH must be constantly hand-tuned, and improving on LSH's performance guarantees for skewed data distributions while retaining the same storage and query overhead.
On indexing error-tolerant set containment
TLDR
This paper studies the indexing problem for the asymmetric Jaccard containment similarity function that is an error-tolerant variation of set containment and enhances this similarity function to also account for string transformations that reflect synonyms such as "Bob" and "Robert" referring to the same first name.
Similarity Search in High Dimensions via Hashing
TLDR
Experimental results indicate that the novel scheme for approximate similarity search based on hashing scales well even for a relatively large number of dimensions, and provide experimental evidence that the method gives improvement in running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition.
MeanKS: meaningful keyword search in relational databases with complex schema
TLDR
This work demonstrates MeanKS, a new system for meaningful keyword search over relational databases that uses schema-based ranking to rank join trees that cover the keyword roles and uses the relevance of relations and foreign-key relationships in the schema over the information content of the database.
Locality-sensitive hashing scheme based on p-stable distributions
TLDR
A novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under the l_p norm, based on p-stable distributions, that improves the running time of the earlier algorithm and yields the first known provably efficient approximate NN algorithm for the case p<1.
Answering Table Queries on the Web using Column Keywords
TLDR
The design of a structured search engine which returns a multi-column table in response to a query consisting of keywords describing each of its columns is presented and a novel query segmentation model for matching keywords to table columns is defined.
WebTables: exploring the power of tables on the web
TLDR
The WEBTABLES system develops new techniques for keyword search over a corpus of tables, and shows that they can achieve substantially higher relevance than solutions based on a traditional search engine.
Instance-Based Matching of Large Ontologies Using Locality-Sensitive Hashing
TLDR
This paper proposes the use of state-of-the-art locality-sensitive hashing techniques to vastly improve the scalability of instance matching across multiple types, and describes how these techniques can be used to estimate containment or equivalence relations between two type systems.
Summarizing data using bottom-k sketches
TLDR
It is shown that k-mins sketches can be derived from respective bottom-k sketches, which enables the use of bottom-k sketches with off-the-shelf k-mins estimators, and data structures that incrementally construct bottom-k sketches and all-distances bottom-k sketches are developed and analyzed.
InfoGather: entity augmentation and attribute discovery by holistic matching with web tables
TLDR
A novel architecture that leverages preprocessing in MapReduce to achieve extremely fast response times at query time is proposed and has significantly higher precision and coverage and four orders of magnitude faster response times compared with the state-of-the-art approach.