Corpus ID: 218486781

Clue: Cross-modal Coherence Modeling for Caption Generation

Malihe Alikhani, Piyush Kumar Sharma, Shengjie Li, Radu Soricut, M. Stone
We use coherence relations inspired by computational models of discourse to study the information needs and goals of image captioning. Using an annotation protocol specifically devised for capturing image--caption coherence relations, we annotate 10,000 instances from publicly-available image--caption pairs. We introduce a new task for learning inferences in imagery and text, coherence relation prediction, and show that these coherence annotations can be exploited to learn relation classifiers… 


Human-like Controllable Image Captioning with Verb-specific Semantic Roles
A new control signal for controllable image captioning (CIC) is proposed: Verb-specific Semantic Roles (VSR), which consist of a verb and a set of semantic roles and represent a targeted activity together with the roles of the entities involved in it.
Image-text discourse coherence relation discoveries on multi-image and multi-text documents
This work develops a multimodal approach that first establishes the links between image and text in an unsupervised way, then discovers their coherence relations through computational models of discourse, with the aim of improving the consistency and quality of the images and their corresponding texts.
Learning to Overcome Noise in Weak Caption Supervision for Object Detection
This work proposes the first mechanism to train object detection models from weak supervision in the form of captions at the image level, and proposes several complementary mechanisms to extract image-level pseudo labels for training from the caption.
A Rapid Review of Image Captioning
  • A. Adriyendi
  • Computer Science
    Journal of Information Technology and Computer Science
  • 2021
This work reviews image captioning research in four categories (input model, process model, output model, and lingual image caption) and offers opinions on trends and directions for future research in image caption generation.
Emerging Trends of Multimodal Research in Vision and Language
A detailed overview of the latest trends in research pertaining to visual and language modalities is presented, looking at its applications in their task formulations and how to solve various problems related to semantic perception and content generation.


CITE: A Corpus of Image-Text Discourse Relations
A novel crowd-sourced resource that characterizes inferences in image-text contexts in the domain of cooking recipes in the form of coherence relations aids in establishing a better understanding of natural communication and common-sense reasoning.
Understanding, Categorizing and Predicting Semantic Image-Text Relations
This paper derives a categorization of eight semantic image-text classes and shows how they can systematically be characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text.
Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts
A multimodal dataset of 1299 Instagram posts labeled for three orthogonal taxonomies is introduced, showing that employing both text and image improves intent detection by 9.6% compared to using only the image modality, demonstrating the commonality of non-intersective meaning multiplication.
Can Neural Image Captioning be Controlled via Forced Attention?
This paper takes a standard neural image captioning model that uses attention, and fixes the attention to predetermined areas in the image, and evaluates whether the resulting output is more likely to mention the class of the object in that area than the normally generated caption.
Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions
This paper introduces a novel framework for image captioning which can generate diverse descriptions by allowing both grounding and controllability, and generates the corresponding caption through a recurrent architecture which predicts textual chunks explicitly grounded on regions, following the constraints of the given control.
MSCap: Multi-Style Image Captioning With Unpaired Stylized Text
An adversarial learning network is proposed for the task of multi-style image captioning (MSCap) with a standard factual image caption dataset and a multi-stylized language corpus without paired images to enable more natural and human-like captions.
Universal Sentence Encoder for English
Transfer learning using sentence-level embeddings is shown to outperform models without transfer learning and often those that use only word-level transfer.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
This work proposes to use the visual denotations of linguistic expressions to define novel denotational similarity metrics, which are shown to be at least as beneficial as distributional similarities for two tasks that require semantic inference.
Ultra Fine-Grained Image Semantic Embedding
Graph-Regularized Image Semantic Embedding (Graph-RISE), a web-scale neural graph learning framework deployed at Google, makes it possible to train image embeddings that discriminate an unprecedented O(40M) ultra-fine-grained semantic labels; it effectively captures semantics and differentiates nuances at levels closer to human perception.