Searching Web Data using MinHash LSH

  • BiChen Rao, Erkang Zhu
  • Published 26 June 2016
  • Computer Science
  • Proceedings of the 2016 International Conference on Management of Data
In this extended abstract, we explore the use of MinHash Locality Sensitive Hashing (MinHash LSH) to address the problem of indexing and searching Web data. We discuss a statistical tuning strategy for MinHash LSH and experimentally evaluate its accuracy and performance against an inverted index. In addition, we describe an online demo of the index built on real Web data.
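The abstract does not spell out the index itself, so the following is only a minimal sketch of the general MinHash LSH idea (signatures split into bands, one hash table per band), not the authors' implementation. The names `minhash_signature`, `BandedLSHIndex`, and `candidate_probability`, and the parameter values, are hypothetical. Salted BLAKE2b hashes stand in for random permutations, which is an assumption of this sketch.

```python
import hashlib
from collections import defaultdict


def minhash_signature(tokens, num_perm=64):
    """MinHash signature of a non-empty set of string tokens.

    Each seed defines one salted hash function (an assumed stand-in for a
    random permutation); the signature keeps the minimum hash per seed.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for t in tokens
        ))
    return signature


class BandedLSHIndex:
    """Hypothetical banded index: `bands` hash tables keyed by `rows` signature values."""

    def __init__(self, bands=16, rows=4):
        self.bands = bands
        self.rows = rows
        self.tables = [defaultdict(set) for _ in range(bands)]

    def _band_keys(self, signature):
        if len(signature) < self.bands * self.rows:
            raise ValueError("signature is shorter than bands * rows")
        for b in range(self.bands):
            yield b, tuple(signature[b * self.rows:(b + 1) * self.rows])

    def insert(self, key, signature):
        for b, band in self._band_keys(signature):
            self.tables[b][band].add(key)

    def query(self, signature):
        """Return keys that share at least one identical band with the query."""
        candidates = set()
        for b, band in self._band_keys(signature):
            candidates |= self.tables[b].get(band, set())
        return candidates


def candidate_probability(similarity, bands, rows):
    """Probability that two sets with the given Jaccard similarity collide in some band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands
```

Under this banding scheme, the choice of `bands` and `rows` moves the similarity threshold at which `candidate_probability` rises steeply, which is the kind of trade-off a tuning strategy has to balance between false positives and false negatives.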

PM-LSH is a fast and accurate LSH framework that computes c-ANN queries on large-scale, high-dimensional datasets; it develops a tunable confidence interval to achieve accurate distance estimation and guarantee high result quality.
Fast Eclat Algorithms Based on Minwise Hashing for Large Scale Transactions
The theoretical analysis and experimental results show that the proposed Eclat algorithms can obtain almost all frequent itemsets with higher speed and less memory usage than other algorithms.
Locality-Sensitive Hashing for Earthquake Detection: A Case Study of Scaling Data-Driven Science (Extended Version)
In this work, we report on a novel application of Locality Sensitive Hashing (LSH) to seismic data at scale. Based on the high waveform similarity between reoccurring earthquakes, our application
Locality-Sensitive Hashing for Earthquake Detection: A Case Study Scaling Data-Driven Science
Improved scalability enabled seismologists to perform seismic analysis on more than ten years of continuous time series data from over ten seismic stations, and has directly enabled the discovery of 597 new earthquakes near the Diablo Canyon nuclear power plant in California and 6123 new earthquakes in New Zealand.
Web-Scale Web Table to Knowledge Base Matching
The first publicly available web table corpus containing millions of web tables is introduced, T2K Match++ is the only system that achieves F-measure scores above 0.8 for all tasks, and the T2D gold standard, which covers a wide variety of challenges, is introduced.
Abstractive Snippet Generation
A bidirectional abstractive snippet generation model is proposed, and the quality of both the corpus and the generated abstractive snippets is assessed with standard measures, crowdsourcing, and in comparison to the state of the art.


LSH forest: self-tuning indexes for similarity search
This index uses the well-known technique of locality-sensitive hashing (LSH), but improves upon previous designs by eliminating the different data-dependent parameters for which LSH must be constantly hand-tuned, and improving on LSH's performance guarantees for skewed data distributions while retaining the same storage and query overhead.
In Defense of Minhash over Simhash
A theoretical answer is provided (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as is common in practice, for example in search.
Modeling LSH for performance tuning
A statistical performance model of Multi-probe LSH, a state-of-the-art variant of LSH, is presented, which can accurately predict the average search quality and latency given a small sample dataset, and an adaptive LSH search algorithm is devised to determine the probing parameter dynamically for each query.
LSH Ensemble: Internet-Scale Domain Search
It is proved that there exists an optimal partitioning for any data distribution, and that for datasets following a power-law distribution, as observed in Open Data and Web data corpora, it can be approximated using equi-depth.
Approximate nearest neighbors: towards removing the curse of dimensionality
Two algorithms for the approximate nearest neighbor problem in high-dimensional spaces are presented, which require space that is only polynomial in n and d, while achieving query times that are sub-linear in n and polynomial in d.
A Large Public Corpus of Web Tables containing Time and Context Metadata
A large public corpus of Web tables which contains over 233 million tables and has been extracted from the July 2015 version of the CommonCrawl is presented to provide a common ground for evaluating Web table systems.
On the resemblance and containment of documents
  • A. Broder
  • Computer Science
    Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171)
  • 1997
The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that could be done independently for each document.
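As a rough illustration of the set-intersection view described above, the sketch below computes resemblance and containment exactly on word shingles, then estimates resemblance by sampling one minimum per hash function. The function names, the shingle width, and the use of salted BLAKE2b hashes in place of random permutations are assumptions of this sketch, not details taken from the paper.

```python
import hashlib


def shingles(text, w=3):
    """Set of contiguous w-word shingles of a text (the whole text if it is shorter than w)."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}


def resemblance(a, b):
    """|A ∩ B| / |A ∪ B| for two non-empty sets."""
    return len(a & b) / len(a | b)


def containment(a, b):
    """|A ∩ B| / |A|: the fraction of A that also appears in B."""
    return len(a & b) / len(a)


def _hash(item, seed):
    # Salted 64-bit hash; an assumed stand-in for a random permutation.
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def estimate_resemblance(a, b, num_hashes=128):
    """Fraction of hash functions under which both sets have the same minimum."""
    matches = sum(
        min(_hash(x, s) for x in a) == min(_hash(x, s) for x in b)
        for s in range(num_hashes)
    )
    return matches / num_hashes
```

Because each document's minima depend only on that document, the samples can be computed once per document and compared later, which matches the independence property the summary mentions.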
Similarity estimation techniques from rounding algorithms
It is shown that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects.
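One well-known instance of this idea is random-hyperplane rounding, where each hash bit is the sign of a projection onto a random Gaussian vector and two vectors agree on a bit with probability 1 − θ/π, with θ the angle between them. The sketch below is a minimal illustration under that assumption; the function names and parameters are hypothetical.

```python
import math
import random


def random_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random Gaussian normal vectors of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def simhash_bits(vec, planes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return [
        1 if sum(r * x for r, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    ]


def estimated_cosine(bits_u, bits_v):
    """Invert Pr[bit agrees] = 1 - theta/pi to estimate cos(theta)."""
    agree = sum(a == b for a, b in zip(bits_u, bits_v)) / len(bits_u)
    theta = math.pi * (1.0 - agree)
    return math.cos(theta)
```

More bits tighten the estimate at the cost of a longer sketch per vector.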
In Defense of Minhash over Simhash
  • AISTATS, volume 33 of JMLR Proceedings
  • 2014