Fusion of Detected Objects in Text for Visual Question Answering

Chris Alberti, Jeffrey Ling, Michael Collins, D. Reitter
To advance models of multimodal context, we introduce a simple yet powerful neural architecture for data that combines vision and natural language. The “Bounding Boxes in Text Transformer” (B2T2) also leverages referential information binding words to portions of the image in a single unified architecture. B2T2 is highly effective on the Visual Commonsense Reasoning benchmark, achieving a new state-of-the-art with a 25% relative reduction in error rate compared to published baselines and… 

Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling

A vision-language (VL) model that unifies text generation and bounding box prediction into a single architecture that achieves comparable performance to task-specific state of the art on 7 VL benchmarks and shows the capability of generalizing to new tasks such as ImageNet object localization.

Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA

A novel model is proposed based on a multimodal transformer architecture accompanied by a rich representation for text in images that enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification.

TAB-VCR: Tags and Attributes based Visual Commonsense Reasoning Baselines

A much simpler model, obtained by ablating and pruning the existing intricate baseline, can perform better on the new visual commonsense reasoning (VCR) task with half the number of trainable parameters; the resulting baseline is called TAB-VCR.

Modality-Agnostic Attention Fusion for visual search with text feedback

The Modality-Agnostic Attention Fusion (MAAF) model combines image and text features and outperforms existing approaches on two datasets for visual search with modifying phrases, Fashion IQ and CSS, and performs competitively on a dataset with only single-word modifications, Fashion200k.

A Simple Baseline for Visual Commonsense Reasoning

It is shown that a much simpler model can perform better with half the number of trainable parameters. Associating visual features with attribute information and improving text-to-image grounding yields further improvements for this simple and effective baseline, TAB-VCR.

Auto-Parsing Network for Image Captioning and Visual Question Answering

A Probabilistic Graphical Model (PGM), parameterized by the attention operations, is imposed on each self-attention layer to incorporate a sparse assumption, and a PGM probability-based parsing algorithm is developed that can discover the hidden structure of the input during inference.

Beyond Language: Learning Commonsense from Images for Reasoning

This paper proposes a novel approach, Loire, to learn commonsense from images, instead of limited raw texts or costly constructed knowledge bases, for the commonsense reasoning problem in NLP, and demonstrates that Loire outperforms traditional language-based methods.

LaTr: Layout-Aware Transformer for Scene-Text VQA

A novel multimodal architecture for Scene Text Visual Question Answering (STVQA), named Layout-Aware Transformer (LaTr), which performs vocabulary-free decoding and generalizes well beyond the training vocabulary, and improves robustness towards OCR errors.

Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

The results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.

MERLOT: Multimodal Neural Script Knowledge Models

This work introduces MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech in an entirely label-free, self-supervised manner, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned.



Yin and Yang: Balancing and Answering Binary Visual Questions

This paper addresses binary Visual Question Answering on abstract scenes as visual verification of concepts inquired in the questions by converting the question to a tuple that concisely summarizes the visual concept to be detected in the image.

VisualBERT: A Simple and Performant Baseline for Vision and Language

Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

This work presents a diagnostic dataset that tests a range of visual reasoning abilities and uses this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

A combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of this approach to VQA.

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

This work balances the popular VQA dataset by collecting complementary images such that every question in the authors' balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question.

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

A new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT), which adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input.

VQA: Visual Question Answering

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

We introduce GQA, a new dataset for real-world visual reasoning and compositional question answering, seeking to address key shortcomings of previous VQA datasets. We have developed a strong and

From Recognition to Cognition: Visual Commonsense Reasoning

To move towards cognition-level understanding, a new reasoning engine is presented, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning.