Corpus ID: 201317624

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

@article{Su2020VLBERTPO,
  title={VL-BERT: Pre-training of Generic Visual-Linguistic Representations},
  author={Weijie Su and Xizhou Zhu and Yue Cao and Bin Li and Lewei Lu and Furu Wei and Jifeng Dai},
  journal={ArXiv},
  year={2020},
  volume={abs/1908.08530}
}
We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input. In it, each element of the input is either a word from the input sentence or a region-of-interest (RoI) from the input image. It is designed to fit most visual-linguistic downstream tasks. To …
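As a rough illustration of the single-stream design described in the abstract, the sketch below builds one joint Transformer input sequence from word embeddings and projected RoI features, with segment and position embeddings added on top. This is a minimal sketch in PyTorch under assumed names and dimensions (VisualLinguisticEncoder, visual_dim=2048, the use of nn.TransformerEncoder, the segment/position scheme); it is not the authors' released implementation and it omits the pre-training objectives.

```python
# Illustrative only: module names, dimensions, and the plain nn.TransformerEncoder
# backbone are assumptions, not VL-BERT's released code.
import torch
import torch.nn as nn


class VisualLinguisticEncoder(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, visual_dim=2048,
                 num_layers=12, num_heads=12, max_len=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, hidden)       # linguistic tokens
        self.visual_proj = nn.Linear(visual_dim, hidden)          # detector RoI features -> hidden
        self.segment_embed = nn.Embedding(2, hidden)              # 0 = text, 1 = image RoIs
        self.position_embed = nn.Embedding(max_len, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids, roi_features):
        # token_ids: (B, T) word-piece ids; roi_features: (B, R, visual_dim)
        text = self.word_embed(token_ids)
        vision = self.visual_proj(roi_features)
        x = torch.cat([text, vision], dim=1)                      # one joint word+RoI sequence
        seg = torch.cat(
            [torch.zeros_like(token_ids),
             torch.ones(roi_features.shape[:2], dtype=torch.long,
                        device=token_ids.device)],
            dim=1)
        pos = torch.arange(x.size(1), device=token_ids.device).unsqueeze(0)
        x = x + self.segment_embed(seg) + self.position_embed(pos)
        return self.encoder(x)                                    # contextualized word and RoI states


# Usage: 16 text tokens and 10 RoIs per example.
model = VisualLinguisticEncoder()
out = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 10, 2048))
print(out.shape)  # torch.Size([2, 26, 768])
```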
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a …
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
After pretraining on large-scale image-caption pairs, Unicoder-VL is transferred to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer, and the powerful ability of the cross-modal pre-training is shown.
VinVL: Making Visual Representations Matter in Vision-Language Models
This paper develops an improved object detection model to provide object-centric representations of images, feeds the generated visual features into a Transformer-based VL fusion model, OSCAR, and utilizes an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.
Visual Grounding Strategies for Text-Only Natural Language Processing
This work proposes two strategies for applying multimodal models to text-only tasks: using a placeholder to replace image input, and harnessing image retrieval to match texts with related images during both pretraining and text-only downstream tasks.
Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
A new vision-language (VL) pre-training model dubbed Kaleido-BERT is presented, which introduces a novel kaleido strategy for fashion cross-modality representations from transformers and designs alignment-guided masking to jointly focus more on image-text semantic relations.
VinVL: Revisiting Visual Representations in Vision-Language Models
This paper presents a detailed study of improving visual representations for vision-language (VL) tasks, develops an improved object detection model to provide object-centric representations of images, and shows that visual features matter significantly in VL models.
Are we pretraining it right? Digging deeper into visio-linguistic pretraining
Numerous recent works have proposed pretraining generic visio-linguistic representations and then finetuning them for downstream vision and language tasks. While architecture and objective function …
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
This work presents a minimalist pretraining framework, named SimVLM, which significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA, NLVR2, and image captioning tasks.
Unified Vision-Language Pre-Training for Image Captioning and VQA
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0.
KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning
This work incorporates commonsense knowledge into the cross-modal BERT and proposes a novel Knowledge Enhanced Visual-and-Linguistic BERT (KVL-BERT for short), which outperforms other task-specific models and general task-agnostic pre-training models by a large margin.
...

References

Showing 1-10 of 56 references
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a …
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
After pretraining on large-scale image-caption pairs, Unicoder-VL is transferred to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer, and the powerful ability of the cross-modal pre-training is shown.
VisualBERT: A Simple and Performant Baseline for Vision and Language
Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Visual7W: Grounded Question Answering in Images
This work establishes a semantic link between textual descriptions and image regions via object-level grounding, which enables a new type of QA with visual answers in addition to the textual answers used in previous work, and proposes a novel LSTM model with spatial attention to tackle the 7W QA tasks.
Improving Language Understanding by Generative Pre-Training
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied.
VideoBERT: A Joint Model for Video and Language Representation Learning
This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.
Language Models are Unsupervised Multitask Learners
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Fusion of Detected Objects in Text for Visual Question Answering
A detailed ablation analysis shows that the early integration of the visual features into the text analysis is key to the effectiveness of the new architecture.
MAttNet: Modular Attention Network for Referring Expression Comprehension
This work proposes to decompose expressions into three modular components related to subject appearance, location, and relationship to other objects, which allows the model to flexibly adapt to expressions containing different types of information in an end-to-end framework.
...