Using Text to Teach Image Retrieval

@inproceedings{dong2021using,
  title={Using Text to Teach Image Retrieval},
  author={Haoyu Dong and Ze Wang and Qiang Qiu and Guillermo Sapiro},
  booktitle={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
  year={2021}
}
Image retrieval relies heavily on the quality of the data modeling and of the distance measure in the feature space. Building on the concept of the image manifold, we first propose to represent the feature space of images, learned via neural networks, as a graph. Neighborhoods in the feature space are then defined by the geodesic distance between images, represented as graph vertices or manifold samples. When only limited images are available, this manifold is sparsely sampled, making the geodesic…
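The graph construction described in the abstract can be sketched as follows: each embedding becomes a vertex of a k-nearest-neighbour graph, and geodesic distance is approximated by shortest paths along the graph edges. The function name, the choice of k, and the use of SciPy shortest paths are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def geodesic_distances(features, k=3):
    """Approximate geodesic distances on a sampled image manifold.

    Each feature vector becomes a graph vertex connected to its k
    nearest neighbours; geodesic distance between two images is the
    shortest-path distance along the resulting graph edges.
    """
    d = cdist(features, features)            # pairwise Euclidean distances
    n = len(features)
    graph = np.zeros((n, n))                 # 0 = "no edge" for csgraph
    for i in range(n):
        nbrs = np.argsort(d[i])[1:k + 1]     # skip self at position 0
        graph[i, nbrs] = d[i, nbrs]          # assumes distinct points
    return shortest_path(graph, method="D", directed=False)
```

On points sampled along a curve, the geodesic distance between the endpoints exceeds their straight-line Euclidean distance, which is exactly the neighbourhood structure the graph is meant to capture.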

Figures and Tables from this paper


References

Deep Image Retrieval: Learning Global Representations for Image Search
This work proposes an approach for instance-level image retrieval that produces a compact, fixed-length global representation for each image by aggregating many region-wise descriptors, leveraging a ranking framework and learned projection weights to build the region features.
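A minimal numpy sketch of the aggregation idea (sum of l2-normalized region descriptors, renormalized); the learned projection weights and the ranking framework from the paper are omitted, and the function name is invented for illustration.

```python
import numpy as np

def aggregate_regions(region_descs):
    """Aggregate per-region descriptors into one global image descriptor.

    l2-normalize each region descriptor, sum them, and l2-normalize the
    result, yielding a compact fixed-length representation.
    """
    r = region_descs / np.linalg.norm(region_descs, axis=1, keepdims=True)
    g = r.sum(axis=0)
    return g / np.linalg.norm(g)
```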
Composing Text and Image for Image Retrieval - an Empirical Odyssey
  • Nam S. Vo, Lu Jiang, et al., with James Hays
  • 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
This paper proposes a new way to combine image and text through a residual connection that outperforms existing approaches on three datasets: Fashion-200k, MIT-States, and a new synthetic dataset the authors create based on CLEVR.
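The residual-connection idea can be sketched as a gated combination of the image feature and a text-conditioned residual; the gating form, weight shapes, and function names below are toy assumptions rather than the paper's exact model.

```python
import numpy as np

def compose(img_feat, txt_feat, W_gate, W_res):
    """Residual-style composition: the text modifies the image feature,
    while the gated skip connection keeps the output close to it."""
    joint = np.concatenate([img_feat, txt_feat])
    gate = 1.0 / (1.0 + np.exp(-(W_gate @ joint)))   # sigmoid gate
    residual = np.tanh(W_res @ joint)                # text-conditioned change
    return gate * img_feat + residual
```

With all-zero weights the gate is 0.5 and the residual vanishes, so the output stays proportional to the original image feature.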
Learning Joint Visual Semantic Matching Embeddings for Language-Guided Retrieval
This paper proposes a unified Joint Visual Semantic Matching model that learns image-text compositional embeddings by jointly associating the visual and textual modalities in a shared discriminative embedding space via compositional losses.
Learning Two-Branch Neural Networks for Image-Text Matching Tasks
This paper investigates two-branch neural networks for learning the similarity in two image-text matching tasks, image-sentence matching and region-phrase matching, and proposes two network structures that produce different output representations.
Learning Deep Structure-Preserving Image-Text Embeddings
This paper proposes a method for learning joint embeddings of images and text using a two-branch neural network with multiple layers of linear projections followed by nonlinearities.
Deep Cross-Modal Projection Learning for Image-Text Matching
A cross-modal projection matching (CMPM) loss and a cross-modal projection classification (CMPC) loss for learning discriminative image-text embeddings are proposed, and extensive analysis and experiments demonstrate the superiority of the proposed approach.
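In spirit, a projection-matching loss softmaxes the projections of image features onto normalized text features and pulls that distribution toward the true match distribution with a KL term. The sketch below is a simplified, hypothetical rendering, not the paper's exact CMPM formulation.

```python
import numpy as np

def cmpm_loss(img, txt, labels, eps=1e-8):
    """KL(p || q): p = softmax of image-to-text projections,
    q = normalized ground-truth match indicator."""
    txt_n = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    proj = img @ txt_n.T                         # scalar projections
    proj -= proj.max(axis=1, keepdims=True)      # numerically stable softmax
    p = np.exp(proj)
    p /= p.sum(axis=1, keepdims=True)
    match = (labels[:, None] == labels[None, :]).astype(float)
    q = match / match.sum(axis=1, keepdims=True)
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))
```

The loss shrinks as matched image-text pairs project more sharply onto each other than mismatched pairs.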
Semi-Supervised Learning on Riemannian Manifolds
An algorithmic framework to classify a partially labeled data set in a principled manner; it models the manifold using the adjacency graph of the data and approximates the Laplace-Beltrami operator by the graph Laplacian.
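The graph Laplacian mentioned here is simple to state: L = D - A, where A is the adjacency matrix and D its diagonal degree matrix. The quadratic form f @ L @ f measures how much a labeling f varies across edges, which is what such semi-supervised frameworks regularize. A small numpy sketch:

```python
import numpy as np

def graph_laplacian(A):
    """Unnormalized graph Laplacian L = D - A for a symmetric adjacency A.

    f @ L @ f equals 0.5 * sum_ij A_ij * (f_i - f_j)**2, so small values
    mean f varies little between connected vertices (a smooth labeling).
    """
    return np.diag(A.sum(axis=1)) - A
```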
Deep Residual Learning for Image Recognition
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
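The residual formulation y = x + F(x) is easy to illustrate; the two-layer ReLU transform below is a generic stand-in for the paper's convolutional blocks.

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = x + F(x): identity shortcut plus a two-layer ReLU transform,
    so the block only needs to learn the residual F, not the full mapping."""
    return x + W2 @ np.maximum(0.0, W1 @ x)
```

When the residual weights are zero the block reduces to an identity mapping, which is why very deep stacks of such blocks remain easy to optimize.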
Automatic Spatially-Aware Fashion Concept Discovery
This paper proposes an automatic spatially-aware concept discovery approach using weakly labeled image-text data from shopping websites, and decomposes the visual-semantic embedding space into multiple concept-specific subspaces to facilitate structured browsing and attribute-feedback product retrieval.
Visual Grounding in Video for Unsupervised Word Translation
The key idea is to establish a common visual representation between two languages by learning embeddings from unpaired instructional videos narrated in the native language, forming the basis for the proposed hybrid visual-text mapping algorithm, MUVE.