Corpus ID: 199453025

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

@inproceedings{Lu2019ViLBERTPT,
  title={ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks},
  author={Jiasen Lu and Dhruv Batra and Devi Parikh and Stefan Lee},
  booktitle={NeurIPS},
  year={2019}
}
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture.
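The co-attentional transformer layer mentioned in the abstract can be summarized in a short sketch. This is an illustrative reconstruction rather than the authors' released implementation: the hidden size, number of heads, the residual and normalization layout, and the use of PyTorch's torch.nn.MultiheadAttention are assumptions, and the actual model also contains within-stream attention and feed-forward sublayers.

import torch
import torch.nn as nn

# Sketch of a co-attentional block: the linguistic stream attends to image
# regions while the visual stream attends to word tokens (assumed layout).
class CoAttentionLayer(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_norm = nn.LayerNorm(dim)
        self.img_norm = nn.LayerNorm(dim)

    def forward(self, txt, img):
        # txt: (batch, n_tokens, dim) word features
        # img: (batch, n_regions, dim) image-region features
        txt_out, _ = self.txt_attends_img(query=txt, key=img, value=img)
        img_out, _ = self.img_attends_txt(query=img, key=txt, value=txt)
        # Residual connection plus layer norm, as in a standard transformer block.
        return self.txt_norm(txt + txt_out), self.img_norm(img + img_out)

layer = CoAttentionLayer()
words, regions = torch.randn(2, 20, 768), torch.randn(2, 36, 768)
w, r = layer(words, regions)   # each stream keeps its own sequence length and width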

Citations

Transitional Adaptation of Pretrained Models for Visual Storytelling
TLDR
This work claims that a transitional adaptation task is required between pretraining and finetuning to harmonize the visual encoder and the language model for challenging downstream target tasks like visual storytelling and shows that the adaptation step significantly improves the performance of multiple language models for sequential video and image captioning tasks.
Unifying Vision-and-Language Tasks via Text Generation
TLDR
This work proposes a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where the model learns to generate labels as text based on the visual and textual inputs (a small text-formatting sketch follows this list).
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
TLDR
It is shown that PEVL enables state-of-the-art performance of detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves performance on position-insensitive tasks with grounded inputs.
Bridging the Gap between Recognition-level Pre-training and Commonsensical Vision-language Tasks
TLDR
The potential of incorporating commonsense knowledge into the conventional recognition-level visual-linguistic pre-training has been demonstrated, and two new tasks are proposed: masked commonsense modeling (MCM) and commonsense type prediction (CTP).
VL-BEiT: Generative Vision-Language Pretraining
TLDR
VL-BEiT, a vision-language foundation model, is introduced: a bidirectional multimodal Transformer learned by generative pretraining that effectively leverages monomodal data such as images and texts as well as multimodal data such as image-text pairs.
12-in-1: Multi-Task Vision and Language Representation Learning
TLDR
This work develops a large-scale, multi-task model trained jointly on 12 datasets from four broad task categories -- visual question answering, caption-based image retrieval, grounding referring expressions, and multimodal verification -- and shows that finetuning task-specific models from this single model leads to further improvements, achieving performance at or above the state of the art.
Adaptive Fine-tuning for Vision and Language Pre-trained Models
TLDR
Compared to previous methods, AFVL (Adaptive Fine-tuning of Vision and Language pre-trained models) achieves comparable or better results while saving training time and GPU memory by a large margin.
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
TLDR
The proposed DiM (short for Disentangled Multimodal-Attention) module can be easily incorporated into existing pre-trained V-L models to boost their performance, up to a 5% increase on the representative task, and the effectiveness of the introduced visual concepts is demonstrated.
Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation
TLDR
The UniVL framework attains performance comparable to recent vision-language pre-training methods on both understanding and generation tasks, and demonstrates that prompt-based finetuning is more data-efficient, outperforming discriminative methods in few-shot scenarios.
VisualBERT: A Simple and Performant Baseline for Vision and Language
TLDR
Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
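As a rough illustration of the single language-modeling objective described for Unifying Vision-and-Language Tasks via Text Generation above, different tasks can be serialized as (input text, target text) pairs conditioned on image features. The task prefixes, field names, and region token below are illustrative assumptions rather than the paper's exact format, and the visual features themselves are omitted.

# Hypothetical serialization: every task becomes "generate the label as text".
examples = [
    # Visual question answering: the label is free-form answer text.
    {"image": "coco_123.jpg",
     "input": "vqa: what is the man riding?",
     "target": "a horse"},
    # Referring-expression grounding: the label is a region identifier emitted as text.
    {"image": "coco_123.jpg",
     "input": "ground: the man on the left wearing a hat",
     "target": "<region_7>"},
    # Captioning: the label is the caption itself.
    {"image": "coco_123.jpg",
     "input": "caption:",
     "target": "a man rides a horse near a fountain"},
]

for ex in examples:
    print(f"{ex['input']:<45} -> {ex['target']}")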

References

Showing 1-10 of 53 references
VisualBERT: A Simple and Performant Baseline for Vision and Language
TLDR
Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
TLDR
Visual-Linguistic BERT (VL-BERT) is a new pre-trainable generic representation for visual-linguistic tasks that adopts the simple yet powerful Transformer model as its backbone and extends it to take both visual and linguistic embedded features as input.
Unified Vision-Language Pre-Training for Image Captioning and VQA
TLDR
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
TLDR
After pretraining on large-scale image-caption pairs, Unicoder-VL is transferred to caption-based image-text retrieval and visual commonsense reasoning with just one additional output layer, demonstrating the power of cross-modal pre-training.
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
TLDR
The LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework, a large-scale Transformer model consisting of three encoders, achieves state-of-the-art results on two visual question answering datasets and shows the generalizability of the pre-trained cross-modality model.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks (a minimal masked-prediction sketch follows this reference list).
Modulating early visual processing by language
TLDR
This paper proposes to modulate the entire visual processing pipeline by linguistic input, conditioning the batch normalization parameters of a pretrained residual network (ResNet) on a language embedding, which significantly improves strong baselines on two visual question answering tasks (a conditional-normalization sketch also follows this reference list).
MAttNet: Modular Attention Network for Referring Expression Comprehension
TLDR
This work proposes to decompose expressions into three modular components related to subject appearance, location, and relationship to other objects, which allows the model to flexibly adapt to expressions containing different types of information in an end-to-end framework.
VideoBERT: A Joint Model for Video and Language Representation Learning
TLDR
This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.
Visual Dialog
TLDR
The task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content, is introduced and the first ‘visual chatbot’ is demonstrated.
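To make the bidirectional masked-prediction idea in the BERT reference above concrete, here is a minimal sketch using the Hugging Face transformers library and the public bert-base-uncased checkpoint; the library choice and the example sentence are my own, not part of the cited paper.

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Both the left context ("The man is") and the right context ("a horse ...")
# inform the prediction at the [MASK] position.
inputs = tokenizer("The man is [MASK] a horse near a fountain.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))   # typically a plausible verb such as "riding"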
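The conditioning mechanism described in Modulating early visual processing by language (conditional batch normalization) can be sketched as follows. This is a simplified variant under my own assumptions: the cited work predicts small changes to the pretrained batch-norm scale and shift, whereas this sketch predicts the full per-channel scale and shift from the language embedding, and all shapes are illustrative.

import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    def __init__(self, num_features: int, lang_dim: int):
        super().__init__()
        # Normalize without a learned affine transform of its own ...
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        # ... and predict per-channel scale and shift from the language embedding.
        self.to_gamma_beta = nn.Linear(lang_dim, 2 * num_features)

    def forward(self, feats, lang):
        # feats: (batch, channels, H, W) visual feature map
        # lang:  (batch, lang_dim) sentence or question embedding
        gamma, beta = self.to_gamma_beta(lang).chunk(2, dim=-1)
        normalized = self.bn(feats)
        return (1 + gamma)[:, :, None, None] * normalized + beta[:, :, None, None]

cbn = ConditionalBatchNorm2d(num_features=64, lang_dim=256)
out = cbn(torch.randn(4, 64, 14, 14), torch.randn(4, 256))   # -> (4, 64, 14, 14)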