- Published 2016 in NIPS

In supervised binary hashing, one wants to learn a function that maps a high-dimensional feature vector to a vector of binary codes, for application to fast image retrieval. This typically results in a difficult optimization problem, nonconvex and nonsmooth, because of the discrete variables involved. Much work has simply relaxed the problem during training, solving a continuous optimization, and truncating the codes a posteriori. This gives reasonable results but is quite suboptimal. Recent work has tried to optimize the objective directly over the binary codes and achieved better results, but the hash function was still learned a posteriori, which remains suboptimal. We propose a general framework for learning hash functions using affinity-based loss functions that uses auxiliary coordinates. This closes the loop and optimizes jointly over the hash functions and the binary codes so that they gradually match each other. The resulting algorithm can be seen as an iterated version of the procedure of optimizing first over the codes and then learning the hash function. Compared to this, our optimization is guaranteed to obtain better hash functions while being not much slower, as demonstrated experimentally in various supervised datasets. In addition, our framework facilitates the design of optimization algorithms for arbitrary types of loss and hash functions.
Information retrieval arises in several applications, most obviously web search. For example, in image retrieval, a user is interested in finding images similar to a query image. Computationally, this essentially involves defining a high-dimensional feature space where each relevant image is represented by a vector, and then finding the closest points (nearest neighbors) to the vector for the query image, according to a suitable distance. For example, one can use features such as SIFT or GIST [23] and the Euclidean distance for this purpose.
Finding nearest neighbors in a dataset of N images (where N can be millions), each a vector of dimension D (typically in the hundreds) is slow, since exact algorithms run essentially in time O(ND) and space O(ND) (to store the image dataset). In practice, this is approximated, and a successful way to do this is binary hashing [12]. Here, given a high-dimensional vector $\mathbf{x} \in \mathbb{R}^D$, the hash function $\mathbf{h}$ maps it to a b-bit vector $\mathbf{z} = \mathbf{h}(\mathbf{x}) \in \{-1,+1\}^b$, and the nearest neighbor search is then done in the binary space. This now costs O(Nb) time and space, which is orders of magnitude faster because typically b < D and, crucially, (1) operations with binary vectors (such as computing Hamming distances) are very fast because of hardware support, and (2) the entire dataset can fit in (fast) memory rather than slow memory or disk. The disadvantage is that the results are inexact, since the neighbors in the binary space will not be identical to the neighbors in the original space. However, the approximation error can be controlled by using sufficiently many bits and by learning a good hash function. This has been the topic of much work in recent years. The general approach consists of defining a supervised objective that has a small value for good hash functions and minimizing it. Ideally, such an objective function should be minimal when the neighbors of any given image are the same in both original and binary spaces. Practically, in information retrieval, this is often evaluated using precision and recall. However, this ideal objective cannot be easily optimized over hash functions, and one uses approximate objectives instead. Many such objectives have been proposed in the literature. We focus here on affinity-based loss functions, which directly try to preserve the original similarities in the binary space.
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
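As an illustration of the search in the binary space described above, the following minimal Python sketch hashes vectors with thresholded linear projections and ranks a dataset by Hamming distance. The projection matrix `W` is random rather than learned, and the names `hash_code`, `hamming` and `nearest` are hypothetical, chosen only for this example. Codes are stored as 0/1 bits packed into an integer, with bit 1 standing for +1 and bit 0 for −1.

```python
import random


def hash_code(x, W, t):
    """Map a real vector x to a b-bit code packed into a Python int.

    Bit i is set when the i-th linear projection of x exceeds its threshold
    (bit 1 plays the role of +1, bit 0 the role of -1).
    """
    code = 0
    for i, (w, ti) in enumerate(zip(W, t)):
        if sum(wj * xj for wj, xj in zip(w, x)) > ti:
            code |= 1 << i
    return code


def hamming(a, b):
    """Hamming distance between two packed codes: XOR, then count set bits."""
    return bin(a ^ b).count("1")


def nearest(query_code, codes, k):
    """Indices of the k database codes closest to the query in Hamming distance."""
    order = sorted(range(len(codes)), key=lambda n: hamming(query_code, codes[n]))
    return order[:k]


# Toy setup (assumed sizes): N vectors of dimension D hashed to b bits.
random.seed(0)
D, b, N = 16, 8, 100
W = [[random.gauss(0, 1) for _ in range(D)] for _ in range(b)]  # random, not learned
t = [0.0] * b
X = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]

codes = [hash_code(x, W, t) for x in X]          # O(Nb) storage
top5 = nearest(hash_code(X[0], W, t), codes, k=5)
```

Packing the bits into an integer is what lets the distance reduce to an XOR followed by a bit count, which is the kind of operation that hardware supports directly.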
Specifically, we consider objective functions of the form
$$\min_{\mathbf{h}} \mathcal{L}(\mathbf{h}) = \sum_{n,m=1}^{N} L(\mathbf{h}(\mathbf{x}_n), \mathbf{h}(\mathbf{x}_m); y_{nm}) \qquad (1)$$
where $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_N)$ is the high-dimensional dataset of feature vectors, $\min_{\mathbf{h}}$ means minimizing over the parameters of the hash function $\mathbf{h}$ (e.g. over the weights of a linear SVM), and $L(\cdot)$ is a loss function that compares the codes for two images (often through their Hamming distance $\|\mathbf{h}(\mathbf{x}_n) - \mathbf{h}(\mathbf{x}_m)\|$) with the ground-truth value $y_{nm}$ that measures the affinity in the original space between the two images $\mathbf{x}_n$ and $\mathbf{x}_m$ (distance, similarity or other measure of neighborhood; [12]). The sum is often restricted to a subset of image pairs $(n,m)$ (for example, within the k nearest neighbors of each other in the original space), to keep the runtime low. Examples of these objective functions (described below) include models developed for dimension reduction, be they spectral such as Laplacian Eigenmaps [2] and Locally Linear Embedding [24], or nonlinear such as the Elastic Embedding [4] or t-SNE [26]; as well as objective functions designed specifically for binary hashing, such as Supervised Hashing with Kernels (KSH) [19], Binary Reconstructive Embeddings (BRE) [14] or sequential Projection Learning Hashing (SPLH) [29]. If the hash function $\mathbf{h}$ was a continuous function of its input $\mathbf{x}$ and its parameters, one could simply apply the chain rule to compute derivatives over the parameters of $\mathbf{h}$ of the objective function (1) and then apply a nonlinear optimization method such as gradient descent. This would be guaranteed to converge to an optimum under mild conditions (for example, Wolfe conditions on the line search), which would be global if the objective is convex and local otherwise [21]. Hence, optimally learning the function $\mathbf{h}$ would be in principle doable (up to local optima), although it would still be slow because the objective can be quite nonlinear and involve many terms.
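To make objective (1) concrete, the sketch below evaluates it for one particular choice of loss, assumed here purely for illustration: a KSH-style loss $(\mathbf{z}_n^T \mathbf{z}_m - b\,y_{nm})^2$ with $y_{nm} \in \{-1,+1\}$. For codes in $\{-1,+1\}^b$ the inner product equals b minus twice the Hamming distance, so this loss depends on the codes only through their Hamming distance. The function names and the toy codes and affinities are hypothetical.

```python
def hamming_pm(zn, zm):
    """Number of positions where two {-1,+1} codes disagree."""
    return sum(1 for p, q in zip(zn, zm) if p != q)


def ksh_loss(zn, zm, y):
    """KSH-style pairwise loss (z_n . z_m - b*y)^2 (an assumed example of L)."""
    b = len(zn)
    inner = sum(p * q for p, q in zip(zn, zm))
    return (inner - b * y) ** 2


def objective(Z, Y, loss=ksh_loss):
    """Sum the pairwise loss over the pairs (n, m) that have an affinity in Y.

    Y is a dict {(n, m): y_nm}; listing only some pairs corresponds to
    restricting the sum in (1) to a subset of pairs.
    """
    return sum(loss(Z[n], Z[m], y) for (n, m), y in Y.items())


# Three 4-bit codes; images 0 and 1 are similar, image 2 is dissimilar to both.
Z = [[+1, +1, -1, +1],
     [+1, +1, -1, -1],
     [-1, -1, +1, -1]]
Y = {(0, 1): +1, (0, 2): -1, (1, 2): -1}

value = objective(Z, Y)  # 4 + 0 + 4 = 8
```

Here the pair (0, 2) contributes zero because the two codes are exact opposites, which is the best a dissimilar pair can do under this loss.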
In binary hashing, the optimization is much more difficult, because in addition to the previous issues, the hash function must output binary values, hence the problem is not just generally nonconvex, but also nonsmooth. In view of this, much work has sidestepped the issue and settled on a simple but suboptimal solution. First, one defines the objective function (1) directly on the b-dimensional codes of each image (rather than on the hash function parameters) and optimizes it assuming continuous codes (in $\mathbb{R}^b$). Then, one binarizes the codes for each image. Finally, one learns a hash function given the codes. Optimizing the affinity-based loss function (1) can be done using spectral methods or nonlinear optimization as described above. Binarizing the codes has been done in different ways, from simply rounding them to $\{-1,+1\}$ using zero as threshold [18, 19, 30, 33], to optimally finding a threshold [18], to rotating the continuous codes so that thresholding introduces less error [11, 32]. Finally, learning the hash function for each of the b output bits can be considered as a binary classification problem, where the resulting classifiers collectively give the desired hash function, and can be solved using various machine learning techniques. Several works (e.g. [16, 17, 33]) have used this approach, which does produce reasonable hash functions (in terms of retrieval measures such as precision/recall). In order to do better, one needs to take into account during the optimization (rather than after the optimization) the fact that the codes are constrained to be binary. This implies attempting directly the discrete optimization of the affinity-based loss function over binary codes. This is a daunting task, since this is usually an NP-complete problem with Nb binary variables altogether, and practical applications could make this number as large as millions or beyond.
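The three-step procedure (continuous codes, binarization, hash function) can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the method of any cited work: the relaxed step runs plain gradient descent on a KSH-style loss, binarization thresholds at zero, and a perceptron stands in for the per-bit classifier (a linear SVM would be a more usual choice). All function names and the data are hypothetical.

```python
import random


def relax_codes(Y, N, b, steps=200, lr=0.01, seed=0):
    """Step 1: minimise sum (z_n . z_m - b*y_nm)^2 over real-valued codes."""
    rng = random.Random(seed)
    Z = [[rng.uniform(-1, 1) for _ in range(b)] for _ in range(N)]
    for _ in range(steps):
        G = [[0.0] * b for _ in range(N)]  # gradient with respect to each code
        for (n, m), y in Y.items():
            r = 2.0 * (sum(p * q for p, q in zip(Z[n], Z[m])) - b * y)
            for i in range(b):
                G[n][i] += r * Z[m][i]
                G[m][i] += r * Z[n][i]
        for n in range(N):
            for i in range(b):
                Z[n][i] -= lr * G[n][i]
    return Z


def binarize(Z):
    """Step 2: threshold each relaxed coordinate at zero to get {-1,+1} codes."""
    return [[1 if v >= 0 else -1 for v in z] for z in Z]


def fit_bit(X, targets, epochs=50):
    """Step 3 for one bit: a perceptron, assumed here as a simple stand-in classifier."""
    w = [0.0] * (len(X[0]) + 1)  # last entry is the bias
    for _ in range(epochs):
        for x, t in zip(X, targets):
            xa = list(x) + [1.0]
            if t * sum(wi * xi for wi, xi in zip(w, xa)) <= 0:
                w = [wi + t * xi for wi, xi in zip(w, xa)]
    return w


def fit_hash(X, Zb):
    """Step 3: one binary classifier per bit, trained on the binarized codes."""
    return [fit_bit(X, [z[i] for z in Zb]) for i in range(len(Zb[0]))]


def apply_hash(W, x):
    """Evaluate the learned hash function on a single vector."""
    xa = list(x) + [1.0]
    return [1 if sum(wi * xi for wi, xi in zip(w, xa)) >= 0 else -1 for w in W]


# Toy data: two well-separated groups; y_nm = +1 within a group, -1 across groups.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
labels = [0, 0, 0, 1, 1, 1]
Y = {(n, m): (1 if labels[n] == labels[m] else -1)
     for n in range(6) for m in range(n + 1, 6)}

Z_relaxed = relax_codes(Y, N=6, b=2)      # continuous codes
Z_binary = binarize(Z_relaxed)            # truncated a posteriori
W = fit_hash(X, Z_binary)                 # hash function learned last
codes = [apply_hash(W, x) for x in X]
```

Note that each step only sees the output of the previous one: the relaxed step ignores the binary constraint, and the classifier never influences the codes it is asked to reproduce.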
Recent works have applied alternating optimization (with various refinements) to this, where one optimizes over a usually small subset of binary variables given fixed values for the remaining ones [16, 17], and this did result in very competitive precision/recall compared with the state-of-the-art. This is still slow and future work will likely improve it, but as of now it provides an option to learn better binary codes. Of the three-step suboptimal approach mentioned (learn continuous codes, binarize them, learn hash function), these works manage to join the first two steps and hence learn binary codes [16, 17]. Then, one learns the hash function given these binary codes. Can we do better? Indeed, in this paper we show that all elements of the problem (binary codes and hash function) can be incorporated in a single algorithm that optimizes jointly over them. Hence, by initializing it from binary codes from the previous approach, this algorithm is guaranteed to achieve a lower error and learn better hash functions. Our framework can be seen as an iterated version of the two-step approach: learn binary codes given the current hash function, learn hash functions given codes, iterate (note the emphasis).
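One way to picture the iterated scheme is the sketch below. It is an illustration under several assumptions rather than the authors' implementation: the codes and the hash function are coupled through a quadratic penalty `mu * ||Z - h(X)||^2`, the step over the codes uses greedy single-bit flips, the loss is KSH-style, and a perceptron stands in for the hash function learner. All names, the schedule `mus` and the toy data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pair_loss(zn, zm, y):
    """KSH-style loss, assumed here as the affinity-based loss L."""
    return (dot(zn, zm) - len(zn) * y) ** 2


def apply_hash(W, x):
    """Linear hash function: one thresholded projection per bit (last weight is the bias)."""
    xa = list(x) + [1.0]
    return [1 if dot(w, xa) >= 0 else -1 for w in W]


def fit_hash(X, Z, epochs=50):
    """Hash function step: fit one perceptron per bit to the current codes."""
    W = []
    for i in range(len(Z[0])):
        w = [0.0] * (len(X[0]) + 1)
        for _ in range(epochs):
            for x, z in zip(X, Z):
                xa = list(x) + [1.0]
                if z[i] * dot(w, xa) <= 0:
                    w = [wi + z[i] * xi for wi, xi in zip(w, xa)]
        W.append(w)
    return W


def penalized(Z, H, Y, mu):
    """Affinity loss on the codes plus mu times the squared mismatch with h(X)."""
    fit = sum(pair_loss(Z[n], Z[m], y) for (n, m), y in Y.items())
    gap = sum((a - c) ** 2 for z, h in zip(Z, H) for a, c in zip(z, h))
    return fit + mu * gap


def code_step(Z, H, Y, mu):
    """Code step: greedy single-bit flips, kept only if they lower the penalized objective."""
    Z = [list(z) for z in Z]
    best = penalized(Z, H, Y, mu)
    improved = True
    while improved:
        improved = False
        for n in range(len(Z)):
            for i in range(len(Z[n])):
                Z[n][i] = -Z[n][i]
                trial = penalized(Z, H, Y, mu)
                if trial < best:
                    best, improved = trial, True
                else:
                    Z[n][i] = -Z[n][i]  # undo the flip
    return Z


def joint_fit(X, Y, Z0, mus=(0.0, 0.5, 1.0, 2.0, 4.0)):
    """Alternate code and hash function steps while the penalty weight mu grows."""
    Z = [list(z) for z in Z0]
    W = fit_hash(X, Z)
    for mu in mus:
        H = [apply_hash(W, x) for x in X]
        Z = code_step(Z, H, Y, mu)   # learn codes given the current hash function
        W = fit_hash(X, Z)           # learn the hash function given the codes
        if Z == [apply_hash(W, x) for x in X]:
            break                    # codes and hash function outputs agree
    return W, Z


# Toy data: two groups of points; start from deliberately poor codes.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
labels = [0, 0, 0, 1, 1, 1]
Y = {(n, m): (1 if labels[n] == labels[m] else -1)
     for n in range(6) for m in range(n + 1, 6)}
Z0 = [[1, 1] for _ in X]
W, Z = joint_fit(X, Y, Z0)
```

The difference from a single pass of the two-step approach is the term weighted by `mu`: it lets the current hash function influence the codes, so that the two can gradually match each other.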

@inproceedings{Raziperchikolaei2016OptimizingAB,
title={Optimizing affinity-based binary hashing using auxiliary coordinates},
author={Ramin Raziperchikolaei and Miguel {\'A}. Carreira-Perpi{\~n}{\'a}n},
booktitle={NIPS},
year={2016}
}