Accounting for the Relative Importance of Objects in Image Retrieval


Images tagged with human-provided keywords are a valuable source of data, and are increasingly available thanks to community photo-sharing sites such as Flickr and various labeling projects in the vision community. Often the keywords reflect the objects and events of significance, and can thus be exploited as a loose form of labels and context. Researchers have explored a variety of ways to leverage images with associated text, including learning the correspondence between them for auto-annotation of regions, objects, and scenes, and building richer image representations based on the two simultaneous "views" for retrieval.

Existing approaches largely assume that image tags' value is purely in indicating the presence of certain objects. However, this ignores the relative importance of the different objects composing a scene, and the impact that this importance can have on a user's perception of relevance. For example, if a system were to auto-tag the bottom right image in Figure 1(c) with 'mud', 'fence', 'pole', or 'cow', not all responses would be equally useful. Arguably, it is more critical to name those objects that appear most prominent or best define the scene (say, 'cow' in this example). Likewise, in image retrieval, the system should prefer to retrieve images that are similar not only in terms of their total object composition, but also in terms of those objects' relative importance to the scene.

How can we learn the relative importance of objects and use this knowledge to improve image retrieval? Our approach rests on the assumption that humans name the most prominent or interesting items first when asked to summarize an image. Thus, rather than treating tags simply as a set of names, we consider them as an ordered list. Specifically, we record a tag-list's nouns, their absolute ordering, and their relative rank compared to their typical placement.
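The ordered-list view of a tag-list can be made concrete with a minimal sketch (illustrative code, not the authors' implementation): each noun is recorded together with its 1-based position of mention, treating earlier mention as a cue of higher importance.

```python
def tag_records(tag_list):
    """Record each distinct noun with its 1-based order of mention.

    Earlier mention is taken as a cue of higher importance; repeated
    mentions keep the first (earliest) position.
    """
    records = {}
    for pos, noun in enumerate(tag_list, start=1):
        records.setdefault(noun, pos)  # keep only the first mention
    return records

# e.g. a tagger who notices the cow first:
tag_records(["cow", "fence", "mud"])  # {'cow': 1, 'fence': 2, 'mud': 3}
```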
We propose an unsupervised approach based on Kernel Canonical Correlation Analysis (KCCA) to discover a "semantic space" that captures the relationship between these tag cues and the image content itself, and show how it can be used to more effectively process novel queries. The three tag cues are defined as follows:

Word Frequency is a traditional bag-of-words feature that records the presence and count of each object. Each tag-list is mapped to a V-dimensional vector W = [w1, ..., wV], where wi is the number of times the i-th word is mentioned and V is the vocabulary size. This feature serves to help learn the connection between the low-level image features and the objects they refer to.

Relative Tag Rank encodes the relative rank of each word compared to its typical rank: R = [r1, ..., rV], where ri is the percentile of the i-th word's rank relative to all of its ranks observed in the training data. This feature captures the order of mention, which hints at relative importance.

Absolute Tag Rank encodes the absolute rank of each word: A = [1/log2(1 + a1), ..., 1/log2(1 + aV)], where ai is the average absolute rank of the i-th word in the tag-list. In contrast to the relative rank, this feature more directly captures the importance of each object within the same scene.

For the image features, we use a diverse set of standard descriptors: Gist, color histograms, and bag-of-visual-words (BOW) histograms. To leverage the extracted features to improve image retrieval, we use KCCA to construct a common representation (or semantic space) for both views.

[Figure: Object counts and scales (PASCAL)]
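The three tag cues can be sketched in a few lines of Python. This is a toy illustration under stated assumptions (a tiny hypothetical vocabulary, 1-based ranks, zeros for absent words, and a simple "fraction of training ranks at or below the current rank" percentile), not the authors' code:

```python
import math

# Hypothetical toy vocabulary; names are illustrative only.
VOCAB = ["cow", "fence", "mud", "pole"]

def word_frequency(tags):
    """W: bag-of-words vector of per-word counts over the vocabulary."""
    return [tags.count(w) for w in VOCAB]

def absolute_tag_rank(tags):
    """A: 1 / log2(1 + a_i), where a_i is the word's average 1-based
    rank in this tag-list; 0 if the word is absent."""
    feat = []
    for w in VOCAB:
        ranks = [i + 1 for i, t in enumerate(tags) if t == w]
        if ranks:
            a = sum(ranks) / len(ranks)
            feat.append(1.0 / math.log2(1 + a))
        else:
            feat.append(0.0)
    return feat

def relative_tag_rank(tags, history):
    """R: percentile of each word's rank relative to the ranks it
    received in training tag-lists (`history`: word -> list of ranks).

    A high value means the word was mentioned earlier than usual,
    hinting that the object is unusually important in this image.
    """
    feat = []
    for w in VOCAB:
        if w in tags and history.get(w):
            r = tags.index(w) + 1
            prev = history[w]
            feat.append(sum(1 for p in prev if p >= r) / len(prev))
        else:
            feat.append(0.0)
    return feat
```

For example, for the tag-list ["cow", "fence", "cow"], `word_frequency` yields [2, 1, 0, 0]; and if 'cow' was historically ranked 2-3 but appears first here, its relative-rank entry is high.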

DOI: 10.5244/C.24.58




@inproceedings{Hwang2010AccountingFT,
  title     = {Accounting for the Relative Importance of Objects in Image Retrieval},
  author    = {Sung Ju Hwang and Kristen Grauman},
  booktitle = {BMVC},
  year      = {2010}
}