Corpus ID: 204575762

Tell-the-difference: Fine-grained Visual Descriptor via a Discriminating Referee

  • Shuangjie Xu, Feng Xu, Yu Cheng, Pan Zhou
In this paper, we investigate the novel problem of describing the difference between an image pair in natural language. Compared with previous approaches to single-image captioning, it is challenging to derive a linguistic representation from two independent sources of visual information. To this end, we propose an effective encoder-decoder captioning framework based on a Hyper Convolution Net. In addition, a series of novel feature fusing techniques for pairwise visual information are introduced and a…
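The pairwise fusing mentioned in the abstract can be illustrated loosely. Below is a minimal sketch, assuming a simple concatenation-plus-difference fusion scheme (one common choice for pairwise features); the function name and the scheme itself are illustrative assumptions, not the paper's actual Hyper Convolution Net technique:

```python
# Hypothetical sketch of pairwise visual-feature fusion for difference
# captioning. The fusion scheme here (concatenate both features with their
# element-wise difference) is an illustrative assumption, not the paper's
# actual method. A caption decoder could then condition on the fused vector.

def fuse_pair(feat_a, feat_b):
    """Fuse two equal-length feature vectors into [a, b, a - b]."""
    assert len(feat_a) == len(feat_b), "features must have the same length"
    diff = [x - y for x, y in zip(feat_a, feat_b)]
    # Concatenation preserves both views; the difference term makes the
    # changed dimensions explicit for the decoder.
    return list(feat_a) + list(feat_b) + diff

# Example: two 2-dimensional feature vectors yield a 6-dimensional fusion.
fused = fuse_pair([2, 5], [1, 9])
```

In practice the features would come from a CNN backbone and the fusion would feed an attention-equipped language decoder; this sketch only shows the shape of the pairwise combination.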
Actor-Critic Sequence Generation for Relative Difference Captioning
  • Z. Fei
  • Computer Science
  • ICMR
  • 2020
This paper proposes a reinforcement learning-based model that uses a policy network and a value network in a decision procedure to collaboratively produce a difference caption, and leverages a visual-linguistic similarity-based reward function as feedback.
Scene Graph with 3D Information for Change Captioning
  • Zeming Liao, Qingbao Huang, Yu Liang, Mingyi Fu, Yi Cai, Qing Li
  • Computer Science
  • ACM Multimedia
  • 2021
A three-dimensional information aware Scene Graph based Change Captioning (SGCC) model is proposed that helps observers locate changed objects quickly and is, to some extent, immune to viewpoint change.


Show and tell: A neural image caption generator
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and can be used to generate natural sentences describing an image.
Boosting Image Captioning with Attributes
This paper presents Long Short-Term Memory with Attributes (LSTM-A), a novel architecture that integrates attributes into the successful Convolutional Neural Network plus Recurrent Neural Network (RNN) image captioning framework by training them in an end-to-end manner.
Context-Aware Captions from Context-Agnostic Supervision
An inference technique is introduced that produces discriminative, context-aware image captions using only generic context-agnostic training data, generating language that uniquely refers to one of two semantically similar images in the COCO dataset.
Semantic Jitter: Dense Supervision for Visual Comparisons via Synthetic Images
  • A. Yu, K. Grauman
  • Computer Science
  • 2017 IEEE International Conference on Computer Vision (ICCV)
  • 2017
This work proposes to overcome the problem of sparse supervision via synthetically generated images, bootstrapping imperfect image generators to counteract sample sparsity for learning to rank.
Fine-Grained Visual Comparisons with Local Learning
  • A. Yu, K. Grauman
  • Computer Science
  • 2014 IEEE Conference on Computer Vision and Pattern Recognition
  • 2014
This work proposes a local learning approach for fine-grained visual comparisons that outperforms state-of-the-art methods for relative attribute prediction, and shows how to identify analogous pairs using learned metrics.
Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data
This work proposes an image captioning framework with a self-retrieval module as training guidance, which encourages the generation of discriminative captions. The effectiveness of the proposed retrieval-guided method is demonstrated on the COCO and Flickr30k captioning datasets, where it achieves superior performance with more discriminative captions.
Image Captioning with Semantic Attention
This paper proposes a new algorithm that combines top-down and bottom-up approaches to natural language description through a model of semantic attention, and significantly outperforms state-of-the-art approaches consistently across different evaluation metrics.
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
An attention-based model that automatically learns to describe the content of images is introduced; it can be trained deterministically using standard backpropagation techniques or stochastically by maximizing a variational lower bound.
Generation and Comprehension of Unambiguous Object Descriptions
This work proposes a method that can generate an unambiguous description of a specific object or region in an image, and that can also comprehend or interpret such an expression to infer which object is being described. The method outperforms previous approaches that generate object descriptions without taking other potentially ambiguous objects in the scene into account.
Deep visual-semantic alignments for generating image descriptions
A model is presented that generates natural language descriptions of images and their regions, based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding.