• Corpus ID: 201317624

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

@article{Su2020VLBERTPO,
  title={VL-BERT: Pre-training of Generic Visual-Linguistic Representations},
  author={Weijie Su and Xizhou Zhu and Yue Cao and Bin Li and Lewei Lu and Furu Wei and Jifeng Dai},
  journal={ArXiv},
  year={2020},
  volume={abs/1908.08530}
}
We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone and extends it to take both visual and linguistic embedded features as input. Each element of the input is either a word from the input sentence or a region-of-interest (RoI) from the input image. The model is designed to fit most visual-linguistic downstream tasks. To… 
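
The abstract's key design choice, a single Transformer over a mixed sequence of word tokens and image RoIs, can be made concrete with a small sketch. This is not the authors' released code: the class, the 2048-d RoI features, and all dimensions are illustrative assumptions, and in practice the RoI features would come from a pre-trained object detector such as Fast(er) R-CNN.

```python
# Minimal sketch (not the released VL-BERT code) of a single-stream
# visual-linguistic encoder: every input element is either a word token or an
# image region (RoI), and all elements are fed to one Transformer.
# Dimensions and the RoI-feature source are assumptions for illustration.
import torch
import torch.nn as nn


class ToyVisualLinguisticEncoder(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, roi_feat_dim=2048,
                 max_len=512, num_layers=4, num_heads=12):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)      # word pieces
        self.visual_proj = nn.Linear(roi_feat_dim, hidden)     # RoI feature -> hidden size
        self.segment_emb = nn.Embedding(2, hidden)              # 0 = text, 1 = image region
        self.position_emb = nn.Embedding(max_len, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids, roi_features):
        # token_ids:    (batch, n_words)         word-piece ids of the sentence
        # roi_features: (batch, n_rois, 2048)    pooled features of detected regions
        text = self.token_emb(token_ids)                   # (B, n_words, H)
        vision = self.visual_proj(roi_features)            # (B, n_rois, H)
        x = torch.cat([text, vision], dim=1)               # one joint sequence

        # Segment embeddings distinguish linguistic from visual elements;
        # position embeddings give every element an index in the joint sequence.
        seg = torch.cat(
            [torch.zeros_like(token_ids),
             torch.ones(roi_features.shape[:2], dtype=torch.long,
                        device=token_ids.device)], dim=1)
        pos = torch.arange(x.size(1), device=x.device).unsqueeze(0)
        x = x + self.segment_emb(seg) + self.position_emb(pos)

        return self.encoder(x)          # contextualized word and RoI representations


# Usage: a 6-word sentence paired with 10 detected regions.
model = ToyVisualLinguisticEncoder()
out = model(torch.randint(0, 30522, (1, 6)), torch.randn(1, 10, 2048))
print(out.shape)  # torch.Size([1, 16, 768])
```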

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a

What BERT Sees: Cross-Modal Transfer for Visual Question Generation

TLDR
The visual capabilities of BERT are evaluated out of the box, indicating an innate capacity of BERT-gen to adapt to multi-modal data and text generation, even with little data available, avoiding expensive pre-training.

VinVL: Revisiting Visual Representations in Vision-Language Models

TLDR
This paper develops an improved object detection model to provide object-centric representations of images, feeds the generated visual features into the Transformer-based VL fusion model OSCAR, and uses an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.

Visual Grounding Strategies for Text-Only Natural Language Processing

TLDR
This work proposes two strategies for applying multimodal models to text-only tasks: using a placeholder to replace the image input, and harnessing image retrieval to match texts with related images during both pretraining and text-only downstream tasks.

Kaleido-BERT: Vision-Language Pre-training on Fashion Domain

  • Mingchen Zhuge, D. Gao, L. Shao
  • Computer Science
    2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2021
TLDR
A new vision-language (VL) pre-training model dubbed Kaleido-BERT is presented, which introduces a novel kaleido strategy for fashion cross-modality representations from transformers and designs alignment-guided masking to focus jointly on image-text semantic relations.

Are we pretraining it right? Digging deeper into visio-linguistic pretraining

Numerous recent works have proposed pretraining generic visio-linguistic representations and then finetuning them for downstream vision and language tasks. While architecture and objective function

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

TLDR
A minimal VLP model, the Vision-and-Language Transformer (ViLT), is presented, monolithic in the sense that the processing of visual inputs is drastically simplified to the same convolution-free manner in which textual inputs are processed; ViLT is up to 60 times faster than previous VLP models, yet achieves competitive or better downstream task performance.

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

TLDR
After pretraining on large-scale image-caption pairs, Unicoder-VL is transferred to caption-based image-text retrieval and visual commonsense reasoning with just one additional output layer, demonstrating the power of cross-modal pre-training.

Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation

TLDR
The UniVL framework attains performance comparable to recent vision-language pre-training methods on both understanding and generation tasks, and demonstrates that prompt-based finetuning is more data-efficient, outperforming discriminative methods in few-shot scenarios.
...

References

SHOWING 1-10 OF 56 REFERENCES

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a

VisualBERT: A Simple and Performant Baseline for Vision and Language

TLDR
Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

TLDR
After pretraining on large-scale image-caption pairs, Unicoder-VL is transferred to caption-based image-text retrieval and visual commonsense reasoning with just one additional output layer, demonstrating the power of cross-modal pre-training.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TLDR
A new language representation model, BERT, is introduced, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Visual7W: Grounded Question Answering in Images

TLDR
A semantic link between textual descriptions and image regions is established by object-level grounding, enabling a new type of QA with visual answers in addition to the textual answers used in previous work, and a novel LSTM model with spatial attention is proposed to tackle the 7W QA tasks.

Improving Language Understanding by Generative Pre-Training

TLDR
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.

Language Models are Unsupervised Multitask Learners

TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

Fusion of Detected Objects in Text for Visual Question Answering

TLDR
A detailed ablation analysis shows that the early integration of the visual features into the text analysis is key to the effectiveness of the new architecture.

MAttNet: Modular Attention Network for Referring Expression Comprehension

TLDR
This work proposes to decompose expressions into three modular components related to subject appearance, location, and relationship to other objects, which allows the model to flexibly adapt to expressions containing different types of information in an end-to-end framework.

Skip-Thought Vectors

We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the
...