Efficient distributed locality sensitive hashing

@inproceedings{Bahmani2012EfficientDL,
  title={Efficient distributed locality sensitive hashing},
  author={Bahman Bahmani and Ashish Goel and Rajendra Shinde},
  booktitle={Proceedings of the 21st ACM international conference on Information and knowledge management},
  year={2012}
}
Distributed frameworks are gaining increasingly widespread use in applications that process large amounts of data. One important example application is large-scale similarity search, for which Locality Sensitive Hashing (LSH) has emerged as the method of choice, especially when the data is high-dimensional. To guarantee high search quality, the LSH scheme needs a rather large number of hash tables. This entails a large space requirement, and in the distributed setting, with each query requiring… 
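
As a rough illustration of why the number of hash tables drives both the space requirement and the per-query work, the following is a minimal single-machine sketch. It is not the scheme proposed in the paper. It assumes a random-projection hash of the form floor((a·x + b) / w), in the style of p-stable LSH; the class name `SimpleLSH` and all of its parameters are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


class SimpleLSH:
    """Toy multi-table LSH index (illustrative sketch, not the paper's method)."""

    def __init__(self, dim, num_tables=8, hashes_per_table=4, bucket_width=4.0, seed=0):
        rng = random.Random(seed)
        self.bucket_width = bucket_width
        # Each table gets its own set of Gaussian projection vectors and offsets.
        self.tables = []
        for _ in range(num_tables):
            projections = [
                ([rng.gauss(0.0, 1.0) for _ in range(dim)],
                 rng.uniform(0.0, bucket_width))
                for _ in range(hashes_per_table)
            ]
            self.tables.append((projections, defaultdict(list)))

    def _key(self, projections, point):
        # Concatenate the quantized projections into a single bucket key.
        return tuple(
            int((sum(a * x for a, x in zip(vec, point)) + offset) // self.bucket_width)
            for vec, offset in projections
        )

    def insert(self, label, point):
        # The point is stored once per table, so space grows with num_tables.
        for projections, buckets in self.tables:
            buckets[self._key(projections, point)].append((label, point))

    def query(self, point):
        # One bucket lookup per table; candidates are the union of those buckets.
        candidates = {}
        for projections, buckets in self.tables:
            for label, stored in buckets.get(self._key(projections, point), []):
                candidates[label] = stored
        return candidates


# Usage: two stored points, then a query identical to the first one.
index = SimpleLSH(dim=3, num_tables=8)
index.insert("a", [1.0, 2.0, 3.0])
index.insert("b", [50.0, -40.0, 10.0])
hits = index.query([1.0, 2.0, 3.0])
stored_copies = sum(len(v) for _, buckets in index.tables for v in buckets.values())
```

In this sketch every inserted point is copied into each of the `num_tables` tables, and every query performs one bucket lookup per table, which is the cost that grows as more tables are added to improve search quality.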

Intelligent Probing for Locality Sensitive Hashing: Multi-Probe LSH and Beyond
TLDR
The problem motivation, the challenges, and the key design considerations of multi-probe LSH are revisited, and recent developments in this space and some questions for further research are discussed.
LSH-based distributed similarity indexing with load balancing in high-dimensional space
TLDR
Two theoretical LSH-based data distribution models in P2P networks are proposed, for datasets with homogeneous and heterogeneous $l_2$ norms, respectively. They focus on load balancing for a single hash table rather than multiple tables, which has not been considered previously.
Towards Load Balancing for LSH-based Distributed Similarity Indexing in High-Dimensional Space
  • Lu Shen, Jiagao Wu, Yongrong Wang, Linfeng Liu
  • Computer Science
    2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS)
  • 2018
TLDR
A novel theoretical model of data distribution is proposed to solve the load balancing problem in a single hash table rather than multiple tables, together with a static distributed indexing scheme based on the theoretical model to predict the distribution of hash results.
Bucket-size balancing locality sensitive hashing using the map reduce paradigm
TLDR
The proposed method extends the hyperplanes to occupy their vicinity so that the data objects in the vicinity of a hyperplane are treated as belonging to both sides of the hyperplane simultaneously.
Scalable Locality-Sensitive Hashing for Similarity Search in High-Dimensional, Large-Scale Multimedia Datasets
TLDR
This work proposes a distributed, efficient, and scalable index based on Locality-Sensitive Hashing (LSH), built on a widely asynchronous dataflow parallelization with a number of optimizations that include a hierarchical parallelization to decouple indexing and data storage, locality-aware data partition strategies to reduce message passing, and multi-probing to limit memory usage.
NetSHa: In-Network Acceleration of LSH-Based Distributed Search
TLDR
This work introduces a heuristic sort-reduce approach to drop potentially poor candidate answers while preserving search quality in Locality Sensitive Hashing and introduces a best-effort replacement mechanism to improve its concurrency.
An evaluation of multi-probe locality sensitive hashing for computing similarities over web-scale query logs
TLDR
This work adopts Locality Sensitive Hashing methods and evaluates four variants in a distributed computing environment (specifically, Hadoop) to identify several optimizations which improve performance, suitable for deployment in very large scale settings.
Dynamic Partition Forest: An Efficient and Distributed Indexing Scheme for Similarity Search based on Hashing
TLDR
A new index structure called Dynamic Partition Forest (DPF) is designed to hierarchically divide the high collision areas with dynamic hashing, so that it adapts automatically to various data distributions, demonstrating the efficiency of the content-based distributed scheme.
Salient Index for Similarity Search Over High Dimensional Vectors
  • Computer Science
  • 2018
TLDR
This thesis proposes a new content-based index called Random Draw Forest (RDF), which not only uses an adaptive tree structure by applying the dynamic length of compound hash functions to meet the different cardinality of data, but also applies the shuffling permutations to solve the MSB problem in the traditional LSH-based index.
...

References

SHOWING 1-10 OF 40 REFERENCES
Distributed similarity search in high dimensions using locality sensitive hashing
TLDR
This paper considers distributed K-Nearest Neighbor (KNN) search and range query processing in high dimensional data and shows how to leverage the linearly aligned data for efficient KNN search and how to efficiently process range queries, which is not possible in existing LSH schemes.
Distributed Locality Sensitivity Hashing
TLDR
Distributed LSH (D-LSH) performs better for finding approximate near neighbors on extremely large scales, as D-LSH distributes close points on single boxes, and far points on different boxes based on projections.
Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search
TLDR
This paper proposes a new indexing scheme called multi-probe LSH, built on the well-known LSH technique, but it intelligently probes multiple buckets that are likely to contain query results in a hash table to achieve the same search quality.
Locality-sensitive hashing scheme based on p-stable distributions
TLDR
A novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under lp norm, based on p-stable distributions that improves the running time of the earlier algorithm and yields the first known provably efficient approximate NN algorithm for the case p<1.
Kernelized locality-sensitive hashing for scalable image search
  • B. Kulis, K. Grauman
  • Computer Science
    2009 IEEE 12th International Conference on Computer Vision
  • 2009
TLDR
It is shown how to generalize locality-sensitive hashing to accommodate arbitrary kernel functions, making it possible to preserve the algorithm's sub-linear time similarity search guarantees for a wide class of useful similarity functions.
Similarity Search in High Dimensions via Hashing
TLDR
Experimental results indicate that the novel scheme for approximate similarity search based on hashing scales well even for a relatively large number of dimensions, and provides experimental evidence that the method gives improvement in running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition.
Entropy based nearest neighbor search in high dimensions
TLDR
The problem of finding the approximate nearest neighbor of a query point in the high dimensional space is studied, focusing on the Euclidean space, and it is shown that the $c$ nearest neighbor can be computed in time $\tilde{O}(n^{\rho})$ and near linear space where $\rho \approx 2.06/c$ as $c$ becomes large.
Bayesian Locality Sensitive Hashing for Fast Similarity Search
TLDR
This paper presents BayesLSH, a principled Bayesian algorithm for the subsequent phase of similarity search - performing candidate pruning and similarity estimation using LSH, which enables significant speedups over baseline approaches.
Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions
  • Alexandr Andoni, P. Indyk
  • Computer Science
    2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06)
  • 2006
We present an algorithm for the c-approximate nearest neighbor problem in a d-dimensional Euclidean space, achieving query time of $O(dn^{1/c^2+o(1)})$ and space $O(dn + n^{1+1/c^2+o(1)})$. This almost matches
A platform for scalable one-pass analytics using MapReduce
TLDR
A new data analysis platform that employs hash techniques to enable fast in-memory processing, and a new frequent key based technique to extend such processing to workloads that require a large key-state space is proposed.
...