  • Computer Science
  • Published on arXiv, 2019

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

@article{Su2019VLBERTPO,
  title={VL-BERT: Pre-training of Generic Visual-Linguistic Representations},
  author={Weijie Su and Xizhou Zhu and Yue Cao and Bin Li and Lewei Lu and Furu Wei and Jifeng Dai},
  journal={ArXiv},
  year={2019},
  volume={abs/1908.08530}
}

We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone and extends it to take both visual and linguistic embedded features as input. Each element of the input is either a word from the input sentence or a region-of-interest (RoI) from the input image. The model is designed to fit most vision-and-language downstream tasks. To…
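
As a concrete illustration of the input scheme described in the abstract, here is a minimal PyTorch sketch (not the authors' implementation) that embeds word tokens and pre-extracted RoI features into a single joint sequence and runs one Transformer encoder over it. The 30522-word vocabulary, the 2048-dimensional RoI features, the module names, and the use of torch.nn.TransformerEncoder are all illustrative assumptions chosen for brevity.

import torch
import torch.nn as nn

class ToyVisualLinguisticTransformer(nn.Module):
    """Toy model: word tokens and image RoIs share one Transformer encoder."""

    def __init__(self, vocab_size=30522, roi_feat_dim=2048, d_model=768,
                 nhead=12, num_layers=12, max_positions=512):
        super().__init__()
        # Linguistic side: ordinary token embeddings.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # Visual side: project pre-extracted RoI features (assumed to come
        # from an external object detector) into the same d_model space.
        self.roi_proj = nn.Linear(roi_feat_dim, d_model)
        # Segment embeddings mark the modality: 0 = word, 1 = RoI.
        self.segment_embed = nn.Embedding(2, d_model)
        self.position_embed = nn.Embedding(max_positions, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, token_ids, roi_features):
        # token_ids:    (batch, num_tokens)             integer word ids
        # roi_features: (batch, num_rois, roi_feat_dim) detector features
        text = self.token_embed(token_ids)
        visual = self.roi_proj(roi_features)
        x = torch.cat([text, visual], dim=1)   # one joint word + RoI sequence
        seg = torch.cat(
            [torch.zeros_like(token_ids),
             torch.ones(roi_features.shape[:2], dtype=torch.long,
                        device=roi_features.device)], dim=1)
        pos = torch.arange(x.size(1), device=x.device).unsqueeze(0)
        x = x + self.segment_embed(seg) + self.position_embed(pos)
        return self.encoder(x)                 # contextualized features

# Usage with random placeholder data (small model for speed).
model = ToyVisualLinguisticTransformer(num_layers=2)
tokens = torch.randint(0, 30522, (1, 8))   # a short "sentence"
rois = torch.randn(1, 4, 2048)             # 4 RoI feature vectors
print(model(tokens, rois).shape)           # torch.Size([1, 12, 768])

In VL-BERT itself, the RoI features are produced by a Fast R-CNN style detector and additional embeddings are attached to every input element; the sketch only conveys the core idea of feeding words and RoIs to a single Transformer backbone.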

Citations

Publications citing this paper. Showing 1-10 of 22 citations.

ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data

  • Cites methods; highly influenced (4 excerpts)

12-in-1: Multi-Task Vision and Language Representation Learning

  • Cites background; highly influenced (13 excerpts)

References

Publications referenced by this paper. Showing 1-10 of 50 references.

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

  • Highly influential (19 excerpts)

Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books

  • Highly influential (5 excerpts)

From Recognition to Cognition: Visual Commonsense Reasoning

  • Highly influential (7 excerpts)

Attention is All you Need

  • Highly influential (8 excerpts)

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

  • Highly influential (5 excerpts)

Fast R-CNN

  • Ross B. Girshick
  • Computer Science
  • 2015 IEEE International Conference on Computer Vision (ICCV)
  • 2015
  • Highly influential (4 excerpts)

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

  • Highly influential (9 excerpts)

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

  • Highly influential (5 excerpts)