On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization

Shruti Palaskar, Akshita Bhagia, Yonatan Bisk, Florian Metze, Alan W. Black, Ana Marasović
Integrating vision and language has gained notable attention following the success of pretrained language models. Despite that, only a fraction of emerging multimodal models are suitable for text generation conditioned on images. This minority is typically developed and evaluated for image captioning, a text generation task conditioned solely on images with the goal of describing what is explicitly visible in an image. In this paper, we take a step back and ask: How do these models work for more…


Does Self-Rationalization Improve Robustness to Spurious Correlations?

It is suggested that explainability can come at the cost of robustness; thus, appropriate care should be taken when training self-rationalizing models with the goal of creating more trustworthy models.



Unifying Vision-and-Language Tasks via Text Generation

This work proposes a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where the models learn to generate labels in text based on the visual and textual inputs.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks

e-ViL is a benchmark for explainable vision-language tasks that establishes a unified evaluation framework and provides the first comprehensive comparison of existing approaches that generate NLEs for VL tasks.

Microsoft COCO: Common Objects in Context

We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding.

Few-Shot Self-Rationalization with Natural Language Prompts

This work presents FEB, a standardized collection of four existing English-language datasets and associated metrics, identifies the right prompting approach by extensively exploring natural language prompts on FEB, and demonstrates that making progress on few-shot self-rationalization is possible.

Beyond VQA: Generating Multi-word Answers and Rationales to Visual Questions

This work presents a completely generative formulation where a multi-word answer is generated for a visual query, and proposes an end-to-end architecture to solve this task and describes how to evaluate it.

Multimodal Explanations: Justifying Decisions and Pointing to the Evidence

It is quantitatively shown that training with the textual explanations not only yields better textual justification models, but also better localizes the evidence that supports the decision, supporting the thesis that multimodal explanation models offer significant benefits over unimodal approaches.

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

This work balances the popular VQA dataset by collecting complementary images such that every question in the authors' balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question.

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.

Training Vision-Language Transformers from Captions Alone

It is shown that Vision-Language Transformers can be learned without human labels, and a new model, Vision-Language from Captions (VLC), built on top of Masked AutoEncoders and requiring no such supervision, is introduced.