Proximity in the Age of Distraction: Robust Approximate Nearest Neighbor Search

@inproceedings{HarPeled2017ProximityIT,
  title={Proximity in the Age of Distraction: Robust Approximate Nearest Neighbor Search},
  author={Sariel Har-Peled and Sepideh Mahabadi},
  booktitle={SODA},
  year={2017}
}
We introduce a new variant of the nearest neighbor search problem, which allows for some coordinates of the dataset to be arbitrarily corrupted or unknown. Formally, given a dataset of $n$ points $P=\{ x_1,\ldots, x_n\}$ in high dimensions, and a parameter $k$, the goal is to preprocess the dataset so that, given a query point $q$, one can quickly compute a point $x \in P$ whose distance to the query is minimized when ignoring the "optimal" $k$ coordinates. Note…
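
As a toy illustration of the objective (a brute-force sketch, not the paper's data structure), assuming the Euclidean norm: for each candidate pair, the $k$ coordinates with the largest disagreement are exactly the "optimal" coordinates to ignore, so the robust distance can be evaluated by sorting the coordinate-wise differences.

import numpy as np

def robust_dist(q, x, k):
    # Distance between q and x after discarding the k coordinates with
    # the largest per-coordinate difference, i.e., the "optimal" k
    # coordinates to ignore for this particular pair.
    diff = np.abs(np.asarray(q, float) - np.asarray(x, float))
    keep = np.sort(diff)[:len(diff) - k]
    return float(np.sqrt(np.sum(keep ** 2)))

def robust_nn_bruteforce(P, q, k):
    # Linear scan over the dataset; the point of the paper is a data
    # structure that avoids exactly this O(n) scan.
    return min(P, key=lambda x: robust_dist(q, x, k))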

LSH on the Hypercube Revisited

TLDR
The most basic settings are revisited, where $P$ is a set of points in the binary hypercube under the $L_1$/Hamming metric, and a short description of the LSH scheme is presented, which is inspired by the authors' recent work.
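
For context, a minimal sketch of the classical bit-sampling LSH family for the hypercube (the textbook construction for this setting, not necessarily the note's exact scheme):

import random

def make_bit_sampler(d, t, seed=None):
    # Classical bit-sampling LSH on {0,1}^d under Hamming distance:
    # hash x to its restriction on t random coordinates.  Two points at
    # Hamming distance r collide with probability (1 - r/d)^t.
    rng = random.Random(seed)
    coords = [rng.randrange(d) for _ in range(t)]
    return lambda x: tuple(x[i] for i in coords)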

Sampling a Near Neighbor in High Dimensions — Who is the Fairest of Them All?

TLDR
This work shows that LSH-based algorithms can be made fair, without a significant loss in efficiency, and develops a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality-sensitive filters.
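
A hedged sketch of the rejection-sampling idea behind fair LSH queries, assuming points are hashable ids, a known query radius r, and standard LSH tables (the names here are illustrative): a plain LSH query is biased toward points that collide with the query in many tables, and dividing out that multiplicity restores a uniform sample among the retrieved near points.

import random

def fair_near_neighbor(tables, hashes, q, r, dist, max_tries=10_000):
    # Multiset of candidates: a point appears once for every table in
    # which it falls into the query's bucket.
    cands = [p for T, h in zip(tables, hashes) for p in T.get(h(q), [])]
    if not cands:
        return None
    for _ in range(max_tries):
        p = random.choice(cands)   # proposal prob. proportional to
                                   # p's multiplicity across tables
        if dist(p, q) > r:
            continue               # reject far points outright
        if random.random() < 1.0 / cands.count(p):
            return p               # near candidates are now returned
                                   # uniformly
    return None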

Parameter-free Locality Sensitive Hashing for Spherical Range Reporting

TLDR
This work presents a parameter-free way of using multi-probing for LSH families that support it, and shows that for many such families this approach gives expected query time close to $O(n^\rho + t)$, which is the best the authors can hope to achieve using LSH.
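
To make the mechanism concrete, a hedged sketch of output-sensitive multi-probing against a single bit-sampling table; the parameter-free probing schedule that is the papers' actual contribution is not reproduced here, and `table` is a hypothetical dict from t-bit key tuples to lists of point ids.

from itertools import combinations

def multiprobe(table, key, want, max_radius=None):
    # Probe the query's own bucket first, then every bucket whose key
    # differs in 1 bit, then in 2 bits, and so on, stopping as soon as
    # `want` candidates have been collected (output-sensitive).
    t = len(key)
    max_radius = t if max_radius is None else max_radius
    out = []
    for r in range(max_radius + 1):
        for flips in combinations(range(t), r):
            probe = tuple(b ^ 1 if i in flips else b
                          for i, b in enumerate(key))
            out.extend(table.get(probe, ()))
            if len(out) >= want:
                return out
    return out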

High-dimensional Spherical Range Reporting by Output-Sensitive Multi-Probing LSH

TLDR
This work presents a parameter-free way of using multi-probing for LSH families that support it, and shows that for many such families this approach allows query time close to $O(n^\rho + t)$, which is the best the authors can hope to achieve using LSH.

LSH on the Hypercube Revisited

TLDR
This note revisits the most basic settings, where $P$ is a set of points in the binary hypercube $\{0,1\}^d$ under the $L_1$/Hamming metric, and presents a short description of the LSH scheme in this case.

Sublinear algorithms for massive data problems

TLDR
This thesis presents algorithms and lower bounds for fundamental computational problems in models that address massive data sets, and introduces theoretical problems and concepts that model computational issues arising in databases, computer vision, and other areas.

Analysis of the Period Recovery Error Bound

TLDR
This paper provides the first analysis of the relationship between the error bound and the number of candidates, identifies the error parameters that still guarantee recovery, and gives a hierarchy of increasingly restrictive upper error bounds that asymptotically reduces the size of the potential period candidate set.

Index Structures for Fast Similarity Search for Binary Vectors

TLDR
Index structures for fast similarity search over objects represented by binary vectors are presented, based on hash tables with similarity-preserving hashing, as well as on tree structures, neighborhood graphs, and distributed neural autoassociation memory.

References


Approximate line nearest neighbor in high dimensions

TLDR
This work considers the problem of approximate nearest neighbors in high dimensions, when the queries are lines, and designs a data structure that efficiently supports the following query: given a line L, report the point p closest to L.

Approximate k-flat Nearest Neighbor Search

TLDR
This work presents the first efficient data structure that can handle approximate nearest neighbor queries for arbitrary k, and generalizes the techniques of AIKN for 1-ANN: the authors partition P into clusters of increasing radius, and build a low-dimensional data structure for a random projection of P.

Entropy based nearest neighbor search in high dimensions

TLDR
The problem of finding the approximate nearest neighbor of a query point in high-dimensional space is studied, focusing on the Euclidean space, and it is shown that the $c$-approximate nearest neighbor can be computed in time $O(n^\rho)$ and near-linear space, where $\rho \approx 2.06/c$ as $c$ becomes large.

Two algorithms for nearest-neighbor search in high dimensions

TLDR
A new approach to the nearest-neighbor problem is developed, based on a method for combining randomly chosen one-dimensional projections of the underlying point set, which results in an algorithm for finding $\epsilon$-approximate nearest neighbors with a query time of $O((d \log d)(d + \log n))$.

Approximate Nearest Line Search in High Dimensions

TLDR
The bounds achieved by the data structure match the performance of the best algorithms for the approximate nearest neighbor problem for point sets, and this is the first high-dimensional data structure for this problem with polylogarithmic query time and polynomial space.

Efficient search for approximate nearest neighbor in high dimensional spaces

TLDR
Significantly improving and extending recent results of Kleinberg, this work constructs data structures whose size is polynomial in the size of the database, with search algorithms that run in time nearly linear or nearly quadratic in the dimension.

An Optimal Randomized Cell Probe Lower Bound for Approximate Nearest Neighbor Searching

TLDR
The approximate nearest neighbor search problem on the Hamming cube is considered, and it is shown that a randomized cell probe algorithm that uses polynomial storage and word size $d^{O(1)}$ requires a worst-case query time of $\Omega(\log\log d / \log\log\log d)$, and that considerations of bit complexity alone cannot prove any nontrivial cell probe lower bound for the problem.

Optimal Data-Dependent Hashing for Approximate Near Neighbors

TLDR
The new bound is not only optimal, but in fact improves over the best LSH data structures (Indyk, Motwani 1998; Andoni, Indyk 2006) for all approximation factors $c>1$.

Locality-sensitive hashing scheme based on p-stable distributions

TLDR
A novel locality-sensitive hashing scheme for the approximate nearest neighbor problem under the $\ell_p$ norm, based on $p$-stable distributions, that improves the running time of the earlier algorithm and yields the first known provably efficient approximate NN algorithm for the case $p<1$.
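
A minimal sketch of one hash function from this family for the case p = 2, where the 2-stable distribution is the standard Gaussian (for other values of p one would sample the projection vector from a p-stable law instead; w is the bucket-width parameter):

import numpy as np

def make_pstable_hash(d, w, seed=None):
    # h(x) = floor((a . x + b) / w): project onto a random Gaussian
    # direction a, shift by a uniform offset b, and quantize into
    # buckets of width w; nearby points under l2 likely share a bucket.
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(d)
    b = rng.uniform(0.0, w)
    return lambda x: int(np.floor((a @ x + b) / w))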

Beyond Locality-Sensitive Hashing

TLDR
By a standard reduction, a new data structure is presented for the Hamming space and $\ell_1$ norm with $\rho \le 7/(8c) + O(1/c^{3/2}) + o_c(1)$, which is the first improvement over the result of Indyk and Motwani (STOC 1998).