Pragmatically Informative Image Captioning with Character-Level Inference

@inproceedings{CohnGordon2018PragmaticallyII,
  title={Pragmatically Informative Image Captioning with Character-Level Inference},
  author={Reuben Cohn-Gordon and Noah D. Goodman and Christopher Potts},
  booktitle={North American Chapter of the Association for Computational Linguistics},
  year={2018}
}
We combine a neural image captioner with a Rational Speech Acts (RSA) model to make a system that is pragmatically informative: its objective is to produce captions that are not merely true but also distinguish their inputs from similar images. Previous attempts to combine RSA with neural image captioning require an inference which normalizes over the entire set of possible utterances. This poses a serious problem of efficiency, previously solved by sampling a small subset of possible… 
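The core move is easiest to see in code. Below is a minimal sketch of character-level incremental RSA decoding, assuming a base captioner exposed as a hypothetical `char_lm(prefix, image)` that returns a next-character distribution; the vocabulary, the rationality parameter `alpha`, and greedy decoding are illustrative simplifications, not the authors' exact implementation.

```python
import numpy as np

VOCAB = list("abcdefghijklmnopqrstuvwxyz _.")

def s0(prefix, image, char_lm):
    """Literal speaker: the base captioner's next-character distribution."""
    return char_lm(prefix, image)  # array of shape (len(VOCAB),)

def s1_step(prefix, target, distractors, char_lm, alpha=5.0):
    """One character step of the pragmatic speaker."""
    images = [target] + distractors
    # S0 next-character distributions for every image in the context.
    probs = np.stack([s0(prefix, img, char_lm) for img in images])  # (I, V)
    # Literal listener: L0(image | prefix + c), uniform prior over images.
    l0 = probs / probs.sum(axis=0, keepdims=True)                   # (I, V)
    # Pragmatic speaker: S1(c) ~ S0(c | target) * L0(target | c)^alpha.
    s1 = probs[0] * l0[0] ** alpha
    return s1 / s1.sum()

def decode(target, distractors, char_lm, max_len=60):
    """Greedy character-by-character pragmatic decoding."""
    prefix = ""
    for _ in range(max_len):
        dist = s1_step(prefix, target, distractors, char_lm)
        c = VOCAB[int(np.argmax(dist))]
        if c == ".":
            break
        prefix += c
    return prefix
```

The point of the character-level formulation is visible in `s1_step`: the normalization runs over a character vocabulary of a few dozen symbols at each step, rather than over the unboundedly large set of full captions.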

Citations

Pragmatic Issue-Sensitive Image Captioning

Built on top of state-of-the-art pretrained neural image captioners, the Issue-Sensitive Image Captioning (ISIC) model explicitly uses image partitions to control caption generation and produces captions that are both descriptive and issue-sensitive.

Pragmatically Informative Color Generation by Grounding Contextual Modifiers

This paper proposes a computational pragmatics model that formulates color generation as a recursive game between speakers and listeners, generating a modified color that is maximally informative in helping the listener recover the original referents.

Decoding, Fast and Slow: A Case Study on Balancing Trade-Offs in Incremental, Character-level Pragmatic Reasoning

This work proposes a simple but highly effective relaxation of fully rational decoding, based on an existing incremental and character-level approach to pragmatically informative neural image captioning, and implements a mixed speaker that applies pragmatic reasoning occasionally (only word-initially) while unrolling the language model.
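A hedged sketch of that relaxation, reusing the `VOCAB`, `s0`, and `s1_step` helpers from the sketch above: the pragmatic ("slow") step fires only at word boundaries, and the base language model ("fast") is unrolled everywhere else. Names and the boundary test are illustrative.

```python
import numpy as np

def decode_mixed(target, distractors, char_lm, max_len=60):
    """Mixed speaker: pragmatic reasoning word-initially, literal elsewhere."""
    prefix = ""
    for _ in range(max_len):
        word_initial = prefix == "" or prefix.endswith(" ")
        if word_initial:
            # "Slow" step: full pragmatic reranking over the next character.
            dist = s1_step(prefix, target, distractors, char_lm)
        else:
            # "Fast" step: just unroll the base captioner.
            dist = s0(prefix, target, char_lm)
        c = VOCAB[int(np.argmax(dist))]
        if c == ".":
            break
        prefix += c
    return prefix
```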

Informativity in Image Captions vs. Referring Expressions

At the intersection of computer vision and natural language processing, there has been recent progress on two natural language generation tasks: Dense Image Captioning and Referring Expression Generation.

Referring Expressions with Rational Speech Act Framework: A Probabilistic Approach

Experimental results show that, while achieving lower accuracy than SOTA deep learning methods, the approach outperforms a similar RSA approach in human comprehension and has an advantage over end-to-end deep learning in limited-data scenarios.

Switching to Discriminative Image Captioning by Relieving a Bottleneck of Reinforcement Learning

This work drastically recasts discriminative image captioning as the much simpler task of encouraging low-frequency word generation, and proposes methods that switch off-the-shelf RL models to discriminativeness-aware models with only a single epoch of fine-tuning on part of the parameters.

Expressing Visual Relationships via Language

This work introduces a new language-guided image editing dataset containing a large number of real image pairs with corresponding editing instructions, and proposes a new relational speaker model based on an encoder-decoder architecture with static relational attention and sequential multi-head attention, further extended with dynamic relational attention.

What Topics Do Images Say: A Neural Image Captioning Model with Topic Representation

A topic-guided neural image captioning model incorporates a topic model into the CNN-RNN framework; experiments verify that the topic features are effective at representing high-level semantic information in images.

Color Overmodification Emerges from Data-Driven Learning and Pragmatic Reasoning

Because human language acquisition cannot be systematically manipulated, this paper adopts neural networks (NNs) as learning agents, systematically varying the environments in which the agents are trained while keeping the NN architecture constant.

Harnessing the linguistic signal to predict scalar inferences

This work shows that an LSTM-based sentence encoder trained on an English dataset of human inference strength ratings is able to predict ratings with high accuracy, and probes the model’s behavior using manually constructed minimal sentence pairs and corpus data.
...

References

Showing 1-10 of 14 references

Context-Aware Captions from Context-Agnostic Supervision

An inference technique is introduced that produces discriminative, context-aware image captions from only generic, context-agnostic training data, generating language that uniquely refers to one of two semantically similar images in the COCO dataset.
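One common way to realize this kind of discriminative decoding is an emitter-suppressor style score that trades off fluency on the target image against a likelihood ratio with the distractor. A minimal sketch, assuming per-token log-probabilities from the same captioner run on both images; the function name and the trade-off parameter `lam` are illustrative, not the paper's exact code.

```python
import numpy as np

def discriminative_step(logp_target, logp_distractor, lam=0.7):
    """Score next tokens so the caption fits the target image but not the
    distractor. lam=1 recovers ordinary captioning; lower values weight
    the target/distractor log-ratio more heavily."""
    scores = lam * logp_target + (1.0 - lam) * (logp_target - logp_distractor)
    return scores - np.logaddexp.reduce(scores)  # renormalized log-probs
```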

Learning in the Rational Speech Acts Model

This work shows how to define and optimize a trained statistical classifier that uses the intermediate agents of RSA as hidden layers of representation, which opens up new application domains and new possibilities for learning effectively from data.
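The RSA recursion itself is small enough to write out. Below is a toy reference game in numpy with a made-up two-utterance lexicon; the `l0` and `s1` stages are the intermediate agents that the work above treats as hidden layers.

```python
import numpy as np

def normalize(m, axis):
    return m / m.sum(axis=axis, keepdims=True)

# Rows are utterances ("glasses", "hat"); columns are referents.
# "glasses" is true of both referents, "hat" only of the second.
lexicon = np.array([[1.0, 1.0],
                    [0.0, 1.0]])
alpha = 4.0  # speaker rationality (illustrative value)

l0 = normalize(lexicon, axis=1)      # literal listener L0(world | utterance)
s1 = normalize(l0 ** alpha, axis=0)  # pragmatic speaker S1(utterance | world)
l1 = normalize(s1, axis=1)           # pragmatic listener L1(world | utterance)

# l1[0] puts ~94% of its mass on the first referent: hearing "glasses",
# the listener reasons that a speaker meaning the second referent would
# have said "hat". This is the implicature pattern RSA is built to capture.
print(l1)
```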

Reasoning about Pragmatics with Neural Listeners and Speakers

A model for pragmatically describing scenes, in which contrastive behavior results from a combination of inference-driven pragmatics and learned semantics, that succeeds 81% of the time in human evaluations on a referring expression game.
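This is the utterance-level sample-and-rerank strategy that the character-level paper above contrasts with: draw candidate captions from the base speaker, then let a literal listener pick the most discriminative one. A minimal sketch, assuming hypothetical `sample_caption(image)` and `score_caption(caption, image)` interfaces to a trained captioner.

```python
import numpy as np

def rerank_pragmatic(target, distractors, sample_caption, score_caption, n=20):
    """Sample n candidate captions for the target, then return the one the
    literal listener finds most discriminative of the target image."""
    images = [target] + distractors
    candidates = [sample_caption(target) for _ in range(n)]

    def listener_prob(caption):
        # L0(target | caption): softmax over images of caption log-scores.
        scores = np.array([score_caption(caption, img) for img in images])
        probs = np.exp(scores - scores.max())
        return probs[0] / probs.sum()

    return max(candidates, key=listener_prob)
```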

Every Picture Tells a Story: Generating Sentences from Images

A system that can compute a score linking an image to a sentence, which can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence.

Deep Visual-Semantic Alignments for Generating Image Descriptions

  • A. Karpathy, L. Fei-Fei
  • Computer Science
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2017
A model that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding is presented.

Generation and Comprehension of Unambiguous Object Descriptions

This work proposes a method that can generate an unambiguous description of a specific object or region in an image and can also comprehend or interpret such an expression to infer which object is being described; it is shown to outperform previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene.

Knowledge and implicature: Modeling language understanding as social cognition

This work applies rational speech-act theory to model scalar implicature, predicting an interaction between the speaker's knowledge state and the listener's interpretation, and finds a good fit between model predictions and human judgments.

Predicting Pragmatic Reasoning in Language Games

This model provides a close, parameter-free fit to human judgments, suggesting that the use of information-theoretic tools to predict pragmatic reasoning may lead to more effective formal models of communication.

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

The Visual Genome dataset is presented, containing over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects; it represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question-answer pairs.

Microsoft COCO Captions: Data Collection and Evaluation Server

The Microsoft COCO Caption dataset and evaluation server are described, and several popular metrics, including BLEU, METEOR, ROUGE, and CIDEr, are used to score candidate captions.