Corpus ID: 199528533

VisualBERT: A Simple and Performant Baseline for Vision and Language

@article{Li2019VisualBERTAS,
  title={VisualBERT: A Simple and Performant Baseline for Vision and Language},
  author={Liunian Harold Li and Mark Yatskar and Da Yin and Cho-Jui Hsieh and Kai-Wei Chang},
  journal={ArXiv},
  year={2019},
  volume={abs/1908.03557}
}
We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks. VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention. We further propose two visually-grounded language model objectives for pre-training VisualBERT on image caption data. Experiments on four vision-and-language tasks including VQA, VCR, NLVR2, and Flickr30K show that VisualBERT outperforms or rivals state-of-the-art models while being significantly simpler. Further analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
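
To make the single-stream design concrete, the sketch below builds a joint text-plus-region input and runs one forward pass. It uses the HuggingFace transformers port of VisualBERT rather than the authors' original codebase; the checkpoint name and the random tensor standing in for Faster R-CNN region features are illustrative assumptions, not results from the paper.

# Minimal sketch: text tokens and detector region features share one Transformer,
# whose self-attention implicitly aligns words with image regions.
import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")  # assumed checkpoint

# Text side: ordinary BERT tokenization.
inputs = tokenizer("A person is riding a horse on the beach.", return_tensors="pt")

# Visual side: one feature vector per detected region. In practice these come from
# a Faster R-CNN detector; random features are used here purely as a placeholder.
visual_embeds = torch.randn(1, 36, 2048)  # (batch, num_regions, feature_dim)
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)

inputs.update({
    "visual_embeds": visual_embeds,
    "visual_token_type_ids": visual_token_type_ids,
    "visual_attention_mask": visual_attention_mask,
})

# Every text token can attend to every region (and vice versa) in a single stack.
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, num_text_tokens + 36, 768)

The same joint text-plus-region sequence is what the paper's two pre-training objectives (masked language modeling with the image, and sentence-image matching) operate on.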

Citations

Vision-and-Language Pretrained Models: A Survey
TLDR
This survey provides an overview of the major advances achieved in VLPMs for producing joint representations of vision and language, and highlights three future directions to offer guidance for both CV and NLP researchers.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.
VinVL: Making Visual Representations Matter in Vision-Language Models
TLDR
This paper develops an improved object detection model to provide object-centric representations of images, feeds the generated visual features into a Transformer-based VL fusion model, OSCAR, and utilizes an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.
VinVL: Revisiting Visual Representations in Vision-Language Models
TLDR
This paper develops an improved object detection model to provide object-centric representations of images, feeds the generated visual features into a Transformer-based VL fusion model, OSCAR, and utilizes an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.
Adaptive Fine-tuning for Vision and Language Pre-trained Models
TLDR
Compared to previous methods, the proposed AFVL (Adaptive Fine-tuning of Vision and Language pre-trained models) achieves comparable or better results while saving training time and GPU memory by a large margin.
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
TLDR
A new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT), which adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input.
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
TLDR
A minimal VLP model, Vision-and-Language Transformer (ViLT), monolithic in the sense that the processing of visual inputs is drastically simplified to the same convolution-free manner in which textual inputs are processed; ViLT is up to 60 times faster than previous VLP models, yet achieves competitive or better downstream task performance.
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
TLDR
The proposed DiM (short for Disentangled Multimodal-Attention) module can be easily incorporated into existing pre-trained V-L models to boost their performance, up to a 5% increase on the representative task, and the effectiveness of the introduced visual concepts is demonstrated.
MVP: Multimodality-guided Visual Pre-training
TLDR
The proposed approach, named Multimodality-guided Visual Pre-training (MVP), replaces the tokenizer with the vision branch of CLIP, a vision-language model pre-trained on 400 million image-text pairs, and its effectiveness is demonstrated.
Weakly-supervised VisualBERT: Pre-training without Parallel Images and Captions
TLDR
This work proposes Weakly-supervised VisualBERT, with the key idea of conducting "mask-and-predict" pre-training on language-only and image-only corpora, and introduces object tags detected by an object recognition model as anchor points to bridge the two modalities.
…

References

Showing 1-10 of 41 references
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.
MUREL: Multimodal Relational Reasoning for Visual Question Answering
TLDR
This paper proposes MuRel, a multimodal relational network which is learned end-to-end to reason over real images, and introduces the MuRel cell, an atomic reasoning primitive that represents interactions between the question and image regions with a rich vectorial representation and models region relations with pairwise combinations.
Learning Conditioned Graph Structures for Interpretable Visual Question Answering
TLDR
This paper proposes a novel graph-based approach for Visual Question Answering that combines a graph learner module, which learns a question-specific graph representation of the input image, with the recent concept of graph convolutions, aiming to learn image representations that capture question-specific interactions.
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
TLDR
The Visual Genome dataset is presented, which contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question-answer pairs.
Deep Modular Co-Attention Networks for Visual Question Answering
TLDR
A deep Modular Co-Attention Network (MCAN) consisting of modular co-attention layers cascaded in depth; MCAN is quantitatively and qualitatively evaluated on the benchmark VQA-v2 dataset and significantly outperforms previous state-of-the-art models.
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
TLDR
This work balances the popular VQA dataset by collecting complementary images such that every question in the authors' balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question.
Relation-Aware Graph Attention Network for Visual Question Answering
TLDR
A Relation-aware Graph Attention Network (ReGAT), which encodes each image into a graph and models multi-type inter-object relations via a graph attention mechanism, to learn question-adaptive relation representations.
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
Improving Language Understanding by Generative Pre-Training
TLDR
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.
Multimodal Transformer With Multi-View Visual Representation for Image Captioning
TLDR
Inspired by the success of the Transformer model in machine translation, this work extends it to a Multimodal Transformer (MT) model for image captioning that significantly outperforms the previous state-of-the-art methods.
…