Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search

@article{Simhadri2022ResultsOT,
  title={Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search},
  author={Harsha Vardhan Simhadri and G. R. Williams and Martin Aum{\"u}ller and Matthijs Douze and Artem Babenko and Dmitry Baranchuk and Qi Chen and Lucas Hosseini and Ravishankar Krishnaswamy and Gopal Srinivasa and Suhas Jayaram Subramanya and Jingdong Wang},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.03763}
}
Despite the broad range of algorithms for Approximate Nearest Neighbor Search, most empirical evaluations of algorithms have focused on smaller datasets, typically of 1 million points (Aumüller et al., 2020). However, deploying recent advances in embedding-based techniques for search, recommendation and ranking at scale requires ANNS indices at billion, trillion or larger scale. Barring a few recent papers, there is limited consensus on which algorithms are effective at this scale vis-à-vis…

Manu: A Cloud Native Vector Database Management System
TLDR
Manu is a cloud native vector database that is extensively optimized for performance and usability, with hardware-aware implementations and support for complex search semantics. It uses multi-version concurrency control (MVCC) and a delta consistency model to simplify communication and cooperation among the system components.

References

DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node
TLDR
It is demonstrated that the SSD-based indices built by DiskANN can meet all three desiderata for large-scale ANNS: high recall, low query latency and high density (points indexed per node).
A Comprehensive Survey and Experimental Comparison of Graph-Based Approximate Nearest Neighbor Search
TLDR
This study provides a thorough comparative analysis and experimental evaluation of 13 representative graph-based ANNS algorithms via a new taxonomy and fine-grained pipeline, and designs an optimized method that outperforms the state-of-the-art algorithms.
Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors
TLDR
It is argued that the potential of the simple inverted index was not fully exploited in previous works, and its usage is advocated both for the highly-entangled deep descriptors and the relatively disentangled SIFT descriptors.
SRS: Solving c-Approximate Nearest Neighbor Queries in High Dimensional Euclidean Space with a Tiny Index
TLDR
Several surprisingly simple methods to answer c-ANN queries with theoretical guarantees requiring only a single tiny index are proposed and demonstrate superior performance against the state-of-the-art LSH-based methods, and scale up well to 1 billion high-dimensional points on a single commodity PC.
ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms
TLDR
ANN-Benchmarks provides a standard interface for measuring the performance and quality achieved by nearest neighbor algorithms on different standard data sets and supports several different ways of integrating k-NN algorithms, and its configuration system automatically tests a range of parameter settings for each algorithm.
HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces
TLDR
This paper proposes a novel yet simple indexing scheme, HD-Index, to solve the problem of approximate k-nearest neighbor queries in massive high-dimensional databases, and uses the Ptolemaic inequality to produce better lower bounds.
Cover trees for nearest neighbor
TLDR
A tree data structure for fast nearest neighbor operations in general n-point metric spaces (where the data set consists of n points) that shows speedups over the brute force search varying between one and several orders of magnitude on natural machine learning datasets.
Efficient Indexing of Billion-Scale Datasets of Deep Descriptors
TLDR
This paper introduces a new dataset of one billion descriptors based on DNNs and reveals the relative inefficiency of IMI-based indexing for such descriptors compared to SIFT data. It also introduces two new indexing structures that provide a considerably better trade-off between retrieval speed and recall, given a similar amount of memory, than the standard Inverted Multi-Index.
FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search
TLDR
This paper presents the first graph-based ANNS index that reflects corpus updates into the index in real-time without compromising on search performance, and designs FreshDiskANN, a system that can index over a billion points on a workstation with an SSD and limited memory.
Billion-Scale Similarity Search with GPUs
TLDR
This paper proposes a novel design for k-selection that enables the construction of a high accuracy, brute-force, approximate and compressed-domain search based on product quantization, and applies it in different similarity search scenarios.