Pragmatically Informative Image Captioning with Character-Level Inference

Reuben Cohn-Gordon, Noah D. Goodman, Christopher Potts
North American Chapter of the Association for Computational Linguistics
We combine a neural image captioner with a Rational Speech Acts (RSA) model to make a system that is pragmatically informative: its objective is to produce captions that are not merely true but also distinguish their inputs from similar images. Previous attempts to combine RSA with neural image captioning require an inference which normalizes over the entire set of possible utterances. This poses a serious problem of efficiency, previously solved by sampling a small subset of possible… 
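A minimal sketch of how character-level RSA inference might look, under stated assumptions. The hand-written `CAPTIONS` table and the `toy_s0` function stand in for a neural captioner and are hypothetical. The update S1 ∝ S0 · L0^alpha is one common RSA formulation and is not taken from the truncated abstract. The listener prior is reset to uniform at every step, which is a simplification. What the sketch illustrates is that each inference step normalizes over the next character only, not over the full set of possible utterances.

```python
# Toy stand-in for a captioner: caption probabilities per image (hypothetical data).
CAPTIONS = {
    "red_bus": {"bus": 0.6, "red bus": 0.4},
    "blue_bus": {"bus": 0.6, "blue bus": 0.4},
}
END = "$"  # end-of-caption marker


def toy_s0(image, prefix):
    """Literal speaker S0(ch | image, prefix): next-character distribution."""
    dist = {}
    for caption, p in CAPTIONS[image].items():
        full = caption + END
        if full.startswith(prefix) and len(full) > len(prefix):
            ch = full[len(prefix)]
            dist[ch] = dist.get(ch, 0.0) + p
    total = sum(dist.values())
    return {ch: v / total for ch, v in dist.items()} if total else {}


def literal_listener(s0, images, prefix, ch):
    """L0(image | ch, prefix), proportional to S0(ch | image, prefix) with a uniform prior."""
    prior = 1.0 / len(images)
    scores = {img: s0(img, prefix).get(ch, 0.0) * prior for img in images}
    total = sum(scores.values())
    if total == 0.0:
        return {img: prior for img in images}
    return {img: v / total for img, v in scores.items()}


def pragmatic_speaker(s0, images, target, prefix, alpha=1.0):
    """S1(ch | target, prefix), proportional to S0(ch | target, prefix) * L0(target | ch, prefix)**alpha."""
    scores = {}
    for ch, p in s0(target, prefix).items():
        l0 = literal_listener(s0, images, prefix, ch)[target]
        scores[ch] = p * (l0 ** alpha)
    total = sum(scores.values())
    # Normalization runs over candidate characters only.
    return {ch: v / total for ch, v in scores.items()}


def greedy_caption(s0, images, target, alpha=1.0, max_len=20):
    """Unroll S1 greedily, one character at a time."""
    prefix = ""
    for _ in range(max_len):
        dist = pragmatic_speaker(s0, images, target, prefix, alpha)
        if not dist:
            break
        ch = max(dist, key=dist.get)
        if ch == END:
            break
        prefix += ch
    return prefix


if __name__ == "__main__":
    images = ["red_bus", "blue_bus"]
    print(toy_s0("red_bus", ""))             # literal speaker favors "b"
    print(greedy_caption(toy_s0, images, "red_bus"))
```

In this toy setup the literal speaker prefers to start with "b" (as in "bus"), which is true of both images, while the pragmatic speaker starts with "r" because that character already rules out the distractor, and greedy decoding yields "red bus".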

Pragmatically Informative Color Generation by Grounding Contextual Modifiers

This paper proposes a computational pragmatics model that formulates this color generation task as a recursive game between speakers and listeners, and generates a modified color that is maximally informative to help the listener recover the original referents.

Decoding, Fast and Slow: A Case Study on Balancing Trade-Offs in Incremental, Character-level Pragmatic Reasoning

This work proposes a simple but highly effective relaxation of fully rational decoding, based on an existing incremental and character-level approach to pragmatically informative neural image captioning, and implements a mixed speaker that applies pragmatic reasoning occasionally (only word-initially) while unrolling the language model.

Referring Expressions with Rational Speech Act Framework: A Probabilistic Approach

Experimental results show that, while achieving lower accuracy than SOTA deep learning methods, the approach outperforms a similar RSA approach in human comprehension and has an advantage over end-to-end deep learning in limited-data scenarios.

Switching to Discriminative Image Captioning by Relieving a Bottleneck of Reinforcement Learning

This work drastically recasts discriminative image captioning as a much simpler task of encouraging low-frequency word generation and proposes methods that easily switch off-the-shelf RL models to discriminativeness-aware models with only a single-epoch fine-tuning on part of the parameters.

Expressing Visual Relationships via Language

This work introduces a new language-guided image editing dataset that contains a large number of real image pairs with corresponding editing instructions and proposes a new relational speaker model based on an encoder-decoder architecture with static relational attention and sequential multi-head attention and extended with dynamic relational attention.

What Topics Do Images Say: A Neural Image Captioning Model with Topic Representation

A topic-guided neural image captioning model that incorporates a topic model into the CNN-RNN framework and verifies that the topic features are effective for representing high-level semantic information of images.

Harnessing the linguistic signal to predict scalar inferences

This work shows that an LSTM-based sentence encoder trained on an English dataset of human inference strength ratings is able to predict ratings with high accuracy, and probes the model’s behavior using manually constructed minimal sentence pairs and corpus data.

Robust Change Captioning

A novel Dual Dynamic Attention Model (DUDA) to perform robust Change Captioning, which learns to distinguish distractors from semantic changes, localize the changes via Dual Attention over “before” and “after” images, and accurately describe them in natural language via Dynamic Speaker.

Know What You Don’t Know: Modeling a Pragmatic Speaker that Refers to Objects of Unknown Categories

This work extends a neural generator to become a pragmatic speaker reasoning about uncertain object categories, and shows that this conversational strategy for dealing with novel objects often improves communicative success, in terms of resolution accuracy of an automatic listener.

AACR: Feature Fusion Effects of Algebraic Amalgamation Composed Representation on (De)Compositional Network for Caption Generation for Images

C. Sur. SN Comput. Sci., 2020.
This work defines a representation called Algebraic Amalgamation-based Composed Representation (AACR), which generalizes the scheme of language modeling and structures linguistic attributes (related to grammar and parts of speech) with the aim of producing better-structured, grammatically correct sentences.



Show and tell: A neural image caption generator

This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.

Colors in Context: A Pragmatic Neural Model for Grounded Language Understanding

We present a model of pragmatic referring expression interpretation in a grounded communication task (identifying colors from descriptions) that draws upon predictions from two recurrent neural

Context-Aware Captions from Context-Agnostic Supervision

An inference technique is introduced to produce discriminative context-aware image captions using only generic context-agnostic training data that generates language that uniquely refers to one of two semantically-similar images in the COCO dataset.

Learning in the Rational Speech Acts Model

This work shows how to define and optimize a trained statistical classifier that uses the intermediate agents of RSA as hidden layers of representation forming a non-linear activation function, which opens up new application domains and new possibilities for learning effectively from data.

Reasoning about Pragmatics with Neural Listeners and Speakers

A model for pragmatically describing scenes, in which contrastive behavior results from a combination of inference-driven pragmatics and learned semantics, that succeeds 81% of the time in human evaluations on a referring expression game.

Computational Interpretations of the Gricean Maxims in the Generation of Referring Expressions

A recommended algorithm is described, along with a specification of the resources a host system must provide in order to make use of the algorithm, and an implementation used in the natural language generation component of the IDAS system.

Every Picture Tells a Story: Generating Sentences from Images

A system that can compute a score linking an image to a sentence, which can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence.

Deep Visual-Semantic Alignments for Generating Image Descriptions

A. Karpathy, Li Fei-Fei. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
A model that generates natural language descriptions of images and their regions, based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding.

Knowledge and implicature: Modeling language understanding as social cognition

This work applies the rational speech-act theory to model scalar implicature, which predicts an interaction between the speaker's knowledge state and the listener's interpretation and finds good fit between model predictions and human judgments.

Predicting Pragmatic Reasoning in Language Games

This model provides a close, parameter-free fit to human judgments, suggesting that the use of information-theoretic tools to predict pragmatic reasoning may lead to more effective formal models of communication.