• Corpus ID: 236987097

Learning to Hash Robustly, with Guarantees

Alexandr Andoni, Daniel Beaglehole
The indexing algorithms for high-dimensional nearest neighbor search (NNS) with the best worst-case guarantees are based on randomized Locality Sensitive Hashing (LSH) and its derivatives. In practice, many heuristic approaches exist to "learn" the best indexing method in order to speed up NNS, crucially adapting to the structure of the given dataset. Oftentimes, these heuristics outperform the LSH-based algorithms on real datasets but almost always come at the cost of losing the…
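As a rough illustration of the data-oblivious LSH baseline that the abstract refers to, the following is a minimal sketch of random-hyperplane hashing for angular distance. It is not the method of this paper, and every name in it (`make_hyperplane_hash`, `build_index`, `query`) is hypothetical. Each hash bit is the sign of a dot product with a random Gaussian direction, so vectors separated by a small angle are likely to share a bucket.

```python
import random


def make_hyperplane_hash(dim, num_bits, seed=0):
    """Return a function mapping a vector to a tuple of sign bits.

    Hypothetical helper: one random Gaussian direction per bit.
    """
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

    def hash_vector(v):
        # Bit is 1 when v lies on the non-negative side of the hyperplane.
        return tuple(
            1 if sum(p_i * v_i for p_i, v_i in zip(plane, v)) >= 0 else 0
            for plane in planes
        )

    return hash_vector


def build_index(points, hash_vector):
    """Group point indices by hash code (a single hash table)."""
    buckets = {}
    for idx, point in enumerate(points):
        buckets.setdefault(hash_vector(point), []).append(idx)
    return buckets


def query(buckets, hash_vector, q):
    """Return candidate indices from the bucket that q falls into."""
    return buckets.get(hash_vector(q), [])


points = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [2.0, 0.0, 0.0]]
hash_vector = make_hyperplane_hash(dim=3, num_bits=8, seed=1)
index = build_index(points, hash_vector)
candidates = query(index, hash_vector, [1.0, 0.0, 0.0])
```

A practical index would use several independent tables and re-rank the candidates by true distance. A single table is shown only to keep the sketch short.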

LSH Forest: Practical Algorithms Made Theoretical
The end result is the first instance of a simple, practical algorithm that provably leverages data-dependent hashing to improve upon data-oblivious LSH, and is provably better than the best LSH algorithm for the Hamming space.
Learning to Hash for Indexing Big Data—A Survey
A comprehensive survey of the learning-to-hash framework and representative techniques of various types, including unsupervised, semisupervised, and supervised, is provided and recent hashing approaches utilizing the deep learning models are summarized.
Optimal Data-Dependent Hashing for Approximate Near Neighbors
The new bound is not only optimal, but in fact improves over the best LSH data structures (Indyk, Motwani 1998) (Andoni, Indyk 2006) for all approximation factors c>1.
A Heterogeneous High-Dimensional Approximate Nearest Neighbor Algorithm
  • Moshe Dubiner
  • Mathematics, Computer Science
    IEEE Transactions on Information Theory
  • 2012
An old-style probabilistic formulation is introduced instead of the more general locality sensitive hashing (LSH) formulation, and it is shown that, at least for sparse problems, it recognizes much more efficient algorithms than the sparseness-destroying LSH random projections.
Practical and Optimal LSH for Angular Distance
This work shows the existence of a Locality-Sensitive Hashing (LSH) family for the angular distance that yields an approximate Near Neighbor Search algorithm with the asymptotically optimal running time exponent and establishes a fine-grained lower bound for the quality of any LSH family for angular distance.
Refinements to nearest-neighbor searching in k-dimensional trees
  • R. Sproull
  • Mathematics, Computer Science
  • 2005
This note presents a simplification and generalization of an algorithm for searching k-dimensional trees for nearest neighbors reported by Friedman et al. [3], which can be generalized to allow a partition plane to have an arbitrary orientation, rather than insisting that it be perpendicular to a coordinate axis, as in the original algorithm.
Spectral Approaches to Nearest Neighbor Search
In practice, a number of spectral NNS algorithms outperform the random-projection methods that seem otherwise theoretically optimal on worst-case datasets, and theoretical justification for this disparity is provided.
LSH forest: self-tuning indexes for similarity search
This index uses the well-known technique of locality-sensitive hashing (LSH), but improves upon previous designs by eliminating the different data-dependent parameters for which LSH must be constantly hand-tuned, and improving on LSH's performance guarantees for skewed data distributions while retaining the same storage and query overhead.
Learning Space Partitions for Nearest Neighbor Search
A new framework for building space partitions is developed, reducing the problem to balanced graph partitioning followed by supervised classification, and the partitions obtained by Neural LSH consistently outperform partitions found by quantization-based and tree-based methods as well as classic, data-oblivious LSH.
Improved nearest neighbor search using auxiliary information and priority functions
This paper exploits properties of single and multiple random projections, which allows us to store meaningful auxiliary information at internal nodes of a random projection tree as well as to design priority functions to guide the search process that results in improved nearest neighbor search performance.