Improved Residual Vector Quantization for High-dimensional Approximate Nearest Neighbor Search

Abstract

Quantization methods have been introduced to perform large-scale approximate nearest neighbor search tasks. Residual Vector Quantization (RVQ) is one of the effective quantization methods. RVQ uses a multi-stage codebook learning scheme to lower the quantization error stage by stage. However, RVQ has two major limitations when applied to high-dimensional approximate nearest neighbor search: 1. The performance gain diminishes quickly with added stages. 2. Encoding a vector with RVQ is actually NP-hard. In this paper, we propose an improved residual vector quantization (IRVQ) method. IRVQ learns the codebook on each stage with a hybrid method of subspace clustering and warm-started k-means to keep the performance gain from dropping, and uses a multi-path encoding scheme to encode a vector with lower distortion. Experimental results on benchmark datasets show that our method substantially improves RVQ and delivers better performance than the state-of-the-art.

Introduction

Nearest neighbor search is a fundamental problem in many computer vision applications such as image retrieval (Rui, Huang, and Chang 1999) and image recognition (Lowe 1999). In high-dimensional data space, nearest neighbor search becomes very expensive due to the curse of dimensionality (Indyk and Motwani 1998). Approximate nearest neighbor (ANN) search is a much more practical approach. Quantization-based algorithms have recently been developed to perform ANN search tasks, and they achieve superior performance compared with other ANN search methods (Jegou, Douze, and Schmid 2011).

Product Quantization (PQ) (Jegou, Douze, and Schmid 2011) is a representative quantization algorithm. PQ splits the original d-dimensional data vector into M disjoint sub-vectors and learns M codebooks {C1, ..., CM}, where each codebook contains K codewords Cm = {cm(1), ..., cm(K)}, m ∈ 1...M. The original data vector is then approximated by the Cartesian product of the codewords it has been assigned to. PQ allows fast distance computation between a quantized vector x and an input query vector q via asymmetric distance computation (ADC): the distances between q and all codewords cm(k), m ∈ 1...M, k ∈ 1...K, are precomputed; the approximate distance between q and x can then be computed as the sum of the distances between q and the codewords of x in O(M) time. Compared to the exact distance computation taking O(d) time, the time complexity is drastically reduced.

Product Quantization is based on the assumption that the sub-vectors are statistically mutually independent, so that the original vector can be effectively represented by the Cartesian product of the quantized sub-vectors. However, vectors in real data do not always meet this assumption. Optimized Product Quantization (OPQ) (Ge et al. 2013) and Cartesian K-means (Norouzi and Fleet 2013) were proposed to find an optimal subspace decomposition that overcomes this issue.

Residual Vector Quantization (RVQ) (Chen, Guan, and Wang 2010) is an alternative approach to the approximate nearest neighbor search task. Similar to Additive Quantization (AQ) (Babenko and Lempitsky 2014) and Composite Quantization (Ting Zhang 2014), RVQ approximates the original vector as a sum of codewords instead of a Cartesian product. Asymmetric distance computation can also be applied to data quantized by RVQ.
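The lookup-table computation behind ADC can be made concrete with a short sketch. The version below is for the PQ case described above, with assumed function names and an assumed (M, K, d/M) codebook array; for sum-of-codewords codes such as RVQ and AQ, the same table trick is applied to the inner products between q and the codewords, with the encoded vector's squared norm stored alongside its codes.

```python
import numpy as np

def build_adc_tables(query, codebooks):
    """Precompute an (M, K) table of squared distances between each query
    sub-vector and every codeword of the corresponding codebook.

    codebooks: array of shape (M, K, d // M), one codebook per sub-space.
    """
    M, K, sub_dim = codebooks.shape
    sub_queries = query.reshape(M, sub_dim)
    return ((codebooks - sub_queries[:, None, :]) ** 2).sum(axis=2)

def adc_distance(tables, codes):
    """Approximate ||query - x||^2 for one encoded vector in O(M) table lookups.

    codes: length-M integer array giving the codeword index chosen per codebook.
    """
    return tables[np.arange(len(codes)), codes].sum()
```

Building the tables costs O(Kd) per query; after that, every database vector is scored with M additions, which is what makes quantization-based search fast.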
RVQ adopts a multi-stage clustering scheme: on each stage the residual vectors are clustered, rather than a segment of the original vector. Compared to PQ, RVQ naturally produces mutually independent codebooks. However, the gain of adding an additional stage drops quickly as the residual vectors become more and more random, limiting the effectiveness of multi-stage methods to only a few stages (Gersho and Gray 1992). A direct observation is that the codes produced by codebooks learned on the later stages have low information entropy. Moreover, encoding a vector with dictionaries learned by RVQ is essentially a high-order Markov random field problem, which is NP-hard.

In this paper, we propose the Improved Residual Vector Quantization (IRVQ). IRVQ uses a hybrid method of subspace clustering and warm-started k-means to obtain high information entropy for each codebook, and uses a multi-path search method to obtain a better encoding. The basic idea behind IRVQ is rather simple: 1. Subspace clustering generally produces a codebook with high information entropy. Although we seek a clustering of the whole feature space, such a codebook is still useful: we exploit this information by warm-starting k-means with it. 2. The norms of the codewords shrink stage by stage. Although the naive "greedy" encoding fails to produce the optimal encoding, a less "greedy" encoding is more likely to find it. We propose a multi-path encoding algorithm for learning the codebooks (sketched below).

The codebooks learned by IRVQ are mutually independent, and each codebook has high information entropy. A significantly lower quantization error is observed compared to RVQ and other state-of-the-art methods. We have validated our method on two datasets commonly used for evaluating ANN search performance: SIFT-1M and GIST-1M (Jegou, Douze, and Schmid 2011). The empirical results show that our IRVQ improves RVQ significantly. Our IRVQ also outperforms other state-of-the-art quantization methods such as PQ, OPQ, and AQ.

Residual Vector Quantization

Residual vector quantization (RVQ) (Juang and Gray Jr 1982) is a common technique for approximating the original data with several low-complexity quantizers instead of one prohibitively high-complexity quantizer. RVQ reduces the quantization error by learning quantizers on the residues. RVQ was introduced for ANN search in (Chen, Guan, and Wang 2010). The gain of adding an additional stage relies on the commonality among residual vectors from different cluster centers; hence this approach performs poorly on high-dimensional data.

Information Entropy

It has been observed that the residual vectors become very random with increasing stages, limiting the effectiveness of RVQ to a small number of stages. To begin with, we examine a dataset encoded by RVQ from the point of view of information entropy. For hashing-based approximate nearest neighbor search methods, e.g. Spectral Hashing (Weiss, Torralba, and Fergus 2009), we seek a code in which each bit has a 50% chance of being one or zero and different bits are mutually independent. Similarly, we would like to obtain maximum information entropy S(Cm), defined below, for each codebook, and no mutual information between different codebooks.
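With p_m(k) denoting the fraction of database vectors whose m-th code is k (a notation introduced here for concreteness), a standard way to measure this is the empirical entropy of the code assignments,

S(C_m) = -\sum_{k=1}^{K} p_m(k) \log_2 p_m(k),

which attains its maximum value \log_2 K when all K codewords of the m-th codebook are used equally often.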

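As an illustration of the multi-path encoding idea mentioned in the Introduction, the sketch below keeps the best few partial encodings after every stage instead of the single greedy choice. It is a generic beam-search formulation under assumed names (beam_encode, beam_width), not the authors' exact algorithm.

```python
import numpy as np

def beam_encode(x, codebooks, beam_width=4):
    """Encode x as a sum of one codeword per stage, keeping several candidates.

    codebooks: list of (K, d) arrays, one per stage.  A greedy (RVQ-style)
    encoder keeps only the best residual after each stage; keeping beam_width
    candidates makes it more likely to reach a low-distortion encoding.
    """
    beam = [(x.astype(float).copy(), [])]        # (current residual, chosen indices)
    for cb in codebooks:
        candidates = []
        for residual, codes in beam:
            # Squared error of every codeword against this partial residual.
            errs = np.sum((residual[None, :] - cb) ** 2, axis=1)
            for k in np.argsort(errs)[:beam_width]:
                candidates.append((residual - cb[k], codes + [int(k)]))
        # Prune to the beam_width candidates with the smallest residual norm.
        candidates.sort(key=lambda rc: np.sum(rc[0] ** 2))
        beam = candidates[:beam_width]
    best_residual, best_codes = beam[0]
    return np.array(best_codes), float(np.sum(best_residual ** 2))
```

With beam_width=1 this reduces to the greedy RVQ encoder; larger beams trade extra encoding time for lower distortion.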