Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors

Dmitry Baranchuk, Artem Babenko, and Yury Malkov. European Conference on Computer Vision.
This work addresses the problem of billion-scale nearest neighbor search. The state-of-the-art retrieval systems for billion-scale databases are currently based on the inverted multi-index, the recently proposed generalization of the inverted index structure. The multi-index provides a very fine-grained partition of the feature space that allows extracting concise and accurate short-lists of candidates for the search queries. In this paper, we argue that the potential of the simple inverted… 
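As an illustration of the inverted-index candidate extraction described above, here is a minimal NumPy sketch, assuming a toy database and random centroids in place of a trained coarse codebook (`shortlist` and `nprobe` are illustrative names, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy database and a small coarse codebook; each centroid defines one
# cell (posting list) of the inverted index.
db = rng.standard_normal((1000, 16)).astype(np.float32)
centroids = rng.standard_normal((32, 16)).astype(np.float32)

# Assign every database vector to its nearest centroid.
assign = np.argmin(((db[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
postings = {c: np.where(assign == c)[0] for c in range(len(centroids))}

def shortlist(query, nprobe=4):
    """Candidate ids from the nprobe cells whose centroids are closest."""
    d = ((centroids - query) ** 2).sum(-1)
    return np.concatenate([postings[c] for c in np.argsort(d)[:nprobe]])

cands = shortlist(db[0])
```

A finer partition (more cells, as in the multi-index) shrinks each posting list, so the same number of probed cells yields a shorter, more concentrated candidate list.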

Inverted Semantic-Index for Image Retrieval

This paper replaces clustering with image classification during codebook construction, and proposes a merging method to address the problem that the number of partitions in the inverted semantic-index is otherwise fixed.

Vector and Line Quantization for Billion-scale Similarity Search on GPUs

SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search

This paper presents a simple but efficient memory-disk hybrid indexing and search system, named SPANN, that follows the inverted index methodology and guarantees both disk-access efficiency and high recall by effectively reducing the number of disk accesses and retrieving high-quality posting lists.
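The memory-disk split can be sketched as follows: only the centroids and per-cell file offsets stay in RAM, while posting lists are read from disk with one seek per probed cell. This is a toy illustration under random data, not SPANN's actual on-disk layout (`probe` is a hypothetical name):

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(6)
db = rng.standard_normal((500, 8)).astype(np.float32)
centroids = rng.standard_normal((16, 8)).astype(np.float32)
assign = np.argmin(((db[:, None] - centroids[None]) ** 2).sum(-1), axis=1)

# Posting lists live on disk; only centroids and (offset, length) stay in RAM.
path = os.path.join(tempfile.mkdtemp(), "postings.bin")
offsets = {}
with open(path, "wb") as f:
    for c in range(len(centroids)):
        ids = np.where(assign == c)[0].astype(np.int32)
        offsets[c] = (f.tell(), len(ids))
        f.write(ids.tobytes())

def probe(query, nprobe=2):
    """One seek + one read per probed cell keeps the disk-access count low."""
    cells = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    out = []
    with open(path, "rb") as f:
        for c in cells:
            off, n = offsets[int(c)]
            f.seek(off)
            out.append(np.frombuffer(f.read(4 * n), dtype=np.int32))
    return np.concatenate(out)

cands = probe(db[0])
```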

DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node

It is demonstrated that the SSD-based indices built by DiskANN can meet all three desiderata for large-scale ANNS: high recall, low query latency, and high density (points indexed per node).

Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval

This work addresses the problem of massive-scale embedding-based retrieval with Bi-Granular Document Representation, where lightweight sparse embeddings are indexed and kept in memory for coarse-grained candidate search, and heavyweight dense embeddings are hosted on disk for fine-grained post-verification.

Efficient Nearest Neighbor Search by Removing Anti-hub

This work empirically finds that unnecessary vectors (anti-hubs) have low hubness scores and can therefore be identified and removed beforehand, achieving a memory-efficient search while preserving accuracy.
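A minimal sketch of the anti-hub idea, assuming the hubness score of a point is its k-occurrence (how often it appears in other points' k-NN lists) and pruning a fixed 25% of the lowest-scoring points; both choices are illustrative, not the paper's exact criterion:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 8)).astype(np.float32)
k = 10

# k-occurrence ("hubness") score: how often each point appears
# in the k-nearest-neighbor lists of the other points.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
np.fill_diagonal(d2, np.inf)
knn = np.argsort(d2, axis=1)[:, :k]
hubness = np.bincount(knn.ravel(), minlength=len(X))

# Drop the 25% of points with the lowest scores (the "anti-hubs"):
# they are rarely anyone's neighbor, so removing them rarely changes results.
keep = np.argsort(hubness)[len(X) // 4:]
pruned = X[keep]
```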

Hybrid Approximate Nearest Neighbor Indexing and Search (HANNIS) for Large Descriptor Databases

A new hybrid method for indexing and searching approximate nearest neighbors in high-dimensional, large deep-descriptor databases retrieves truly similar items even when the retrieval set is large.

FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search

FreshDiskANN is presented, a system that can index over a billion points on a workstation with an SSD and limited memory, and support thousands of concurrent real-time inserts, deletes, and searches per second each, while achieving a 5-10x reduction in the cost of maintaining index freshness compared to existing methods.

HQANN: Efficient and Robust Similarity Search for Hybrid Queries with Structured and Unstructured Constraints

HQANN is a simple yet highly efficient hybrid query processing framework that can be easily embedded into existing proximity-graph-based ANNS algorithms, guaranteeing both low latency and high recall by leveraging navigation sense among attributes and fusing vector similarity search with attribute filtering.
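One simple way to fuse vector similarity with a structured constraint is sketched below as a distance penalty for attribute mismatch; this is an assumed illustration of the general idea, not HQANN's actual fusion scheme:

```python
import numpy as np

rng = np.random.default_rng(7)
vecs = rng.standard_normal((100, 8)).astype(np.float32)
attrs = rng.integers(0, 3, size=100)      # one structured attribute per item

def fused_distance(q_vec, q_attr, penalty=10.0):
    """Vector distance plus a penalty for attribute mismatch, so the search
    is steered toward items that satisfy the structured constraint."""
    d = ((vecs - q_vec) ** 2).sum(-1)
    return d + penalty * (attrs != q_attr)

scores = fused_distance(vecs[0], int(attrs[0]))
best = int(np.argmin(scores))
```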

Efficient Indexing of Billion-Scale Datasets of Deep Descriptors

This paper introduces a new dataset of one billion DNN-based descriptors, reveals the relative inefficiency of IMI-based indexing for such descriptors compared to SIFT data, and proposes two new indexing structures that provide a considerably better trade-off between retrieval speed and recall for a similar amount of memory, compared to the standard Inverted Multi-Index.

The Inverted Multi-Index

Inverted multi-indices significantly improved the speed of approximate nearest neighbor search on a dataset of 1 billion SIFT vectors compared to the best previously published systems, while achieving better recall and incurring only a few percent of memory overhead.
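The fine-grained partition comes from taking the Cartesian product of two sub-codebooks over the two halves of each vector, so K centroids per half yield K*K cells. A minimal sketch with random codebooks (a real index would train them with k-means):

```python
import numpy as np

rng = np.random.default_rng(2)
D, K = 16, 8                      # vector dimension, sub-codebook size

# Two codebooks, one per half of the vector: K*K = 64 cells total,
# where a flat inverted index would need 64 full-dimensional centroids.
cb1 = rng.standard_normal((K, D // 2)).astype(np.float32)
cb2 = rng.standard_normal((K, D // 2)).astype(np.float32)

def multi_index_cell(x):
    """Cell id in the product grid: (nearest in cb1, nearest in cb2)."""
    i = np.argmin(((cb1 - x[: D // 2]) ** 2).sum(-1))
    j = np.argmin(((cb2 - x[D // 2:]) ** 2).sum(-1))
    return int(i), int(j)

cell = multi_index_cell(rng.standard_normal(D).astype(np.float32))
```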

Improving Bilayer Product Quantization for Billion-Scale Approximate Nearest Neighbors in High Dimensions

This work introduces and evaluates two approximate nearest neighbor search systems that exploit the synergy of product quantization processes more efficiently, providing significantly better recall for the same runtime at the cost of a small increase in memory footprint.

Object retrieval with large vocabularies and fast spatial matching

To improve query performance, this work adds an efficient spatial verification stage to re-rank the results returned from the bag-of-words model and shows that this consistently improves search quality, though by less of a margin when the visual vocabulary is large.

Sparse composite quantization

Sparse composite quantization is developed, which constructs sparse dictionaries so that distance evaluation between the query and a dictionary element (a sparse vector) is accelerated via efficient sparse vector operations, substantially reducing the cost of distance-table computation.
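The speed-up from sparse dictionary elements is easy to see in isolation: an inner product with a sparse element touches only its nonzero coordinates instead of all of them. A small self-contained check of that equivalence:

```python
import numpy as np

rng = np.random.default_rng(5)
D = 64

# A "sparse dictionary element": only its nonzero coordinates are stored.
idx = rng.choice(D, size=4, replace=False)
val = rng.standard_normal(4).astype(np.float32)

q = rng.standard_normal(D).astype(np.float32)

# Inner product via the sparse representation: touches 4 coordinates, not 64.
ip_sparse = float(q[idx] @ val)

# The dense computation gives the same value at 16x the work.
dense = np.zeros(D, np.float32)
dense[idx] = val
ip_dense = float(q @ dense)
```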

Efficient Large-Scale Approximate Nearest Neighbor Search on the GPU

This work proposes a two-level product and vector quantization tree that reduces the number of vector comparisons required during tree traversal, and includes a novel, highly parallelizable re-ranking method for candidate vectors that efficiently reuses already computed intermediate values.

Polysemous Codes

Polysemous codes are introduced, which offer both the distance estimation quality of product quantization and the efficient comparison of binary codes with Hamming distance, and their design is inspired by algorithms introduced in the 90's to construct channel-optimized vector quantizers.
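The dual reading of one code can be sketched as below: the same code bytes are first compared by Hamming distance as a cheap binary filter, then reused as PQ indices for distance estimation. This toy version skips the index-assignment optimization that makes the two readings agree in the actual method, and uses random stand-ins for trained codebooks and database codes:

```python
import numpy as np

rng = np.random.default_rng(3)
M, K = 4, 16                      # 4 subquantizers, 16 centroids each (4 bits)
codebooks = rng.standard_normal((M, K, 4)).astype(np.float32)
db_codes = rng.integers(0, K, size=(500, M))   # stand-in PQ codes

def code_hamming(c1, c2):
    """Bit-level Hamming distance between two codes (the binary reading)."""
    x = np.bitwise_xor(c1.astype(np.uint8), c2.astype(np.uint8))
    return sum(bin(int(v)).count("1") for v in x)

def search(query, threshold=6):
    sub = query.reshape(M, 4)
    qcode = np.array([np.argmin(((codebooks[m] - sub[m]) ** 2).sum(-1))
                      for m in range(M)])
    # Pass 1: cheap binary filter, comparing codes by Hamming distance.
    surv = [i for i, c in enumerate(db_codes) if code_hamming(qcode, c) <= threshold]
    # Pass 2: the very same codes reread as PQ indices for distance estimation.
    tables = np.stack([((codebooks[m] - sub[m]) ** 2).sum(-1) for m in range(M)])
    scored = [(i, float(tables[np.arange(M), db_codes[i]].sum())) for i in surv]
    return sorted(scored, key=lambda t: t[1])

results = search(rng.standard_normal(16).astype(np.float32))
```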

Searching in one billion vectors: Re-rank with source coding

This paper releases a new public dataset of one billion 128-dimensional vectors, proposes an experimental setup to evaluate high-dimensional indexing algorithms at a realistic scale, and accurately and efficiently re-ranks neighbor hypotheses using little memory compared to the full-vector representation.
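The re-ranking idea can be sketched as a two-stage search: shortlist with a coarse code, then re-rank the shortlist from a compact reconstruction (coarse centroid plus a quantized residual) rather than the full vectors. Codebooks here are random stand-ins for trained ones, and `search` is an illustrative name:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((300, 8)).astype(np.float32)
coarse = rng.standard_normal((16, 8)).astype(np.float32)        # stage-1 codebook
res_cb = 0.5 * rng.standard_normal((32, 8)).astype(np.float32)  # residual codebook

# Encode every vector as a coarse centroid id plus a code for its residual.
cid = np.argmin(((X[:, None] - coarse[None]) ** 2).sum(-1), axis=1)
resid = X - coarse[cid]
rid = np.argmin(((resid[:, None] - res_cb[None]) ** 2).sum(-1), axis=1)

def search(q, shortlist=50, topk=5):
    # Stage 1: shortlist using the coarse reconstruction only.
    d1 = ((coarse[cid] - q) ** 2).sum(-1)
    cand = np.argsort(d1)[:shortlist]
    # Stage 2: re-rank the shortlist with coarse + residual reconstruction,
    # a few bytes per vector instead of the full 32-byte float vector.
    recon = coarse[cid[cand]] + res_cb[rid[cand]]
    d2 = ((recon - q) ** 2).sum(-1)
    return cand[np.argsort(d2)[:topk]]

top = search(X[0])
```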

Fast Neighborhood Graph Search Using Cartesian Concatenation

Experimental results on searching large-scale datasets (SIFT, GIST, and HOG) show that the proposed new data structure for approximate nearest neighbor search outperforms state-of-the-art ANN search algorithms in terms of efficiency and accuracy.

Composite Quantization for Approximate Nearest Neighbor Search

This paper presents a novel compact coding approach, composite quantization, for approximate nearest neighbor search. The idea is to use the composition of several elements selected from the…