Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

@inproceedings{Li2020UnicoderVLAU,
  title={Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training},
  author={Gen Li and Nan Duan and Yuejian Fang and Daxin Jiang and Ming Zhou},
  booktitle={AAAI},
  year={2020}
}
We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner. Key Method: borrowing ideas from cross-lingual pre-trained models such as XLM and Unicoder, both visual and linguistic contents are fed into a multi-layer Transformer for cross-modal pre-training, where three pre-training tasks are employed: masked language modeling, masked object label prediction, and visual-linguistic matching.
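A minimal PyTorch-style sketch of this setup, written from the abstract above rather than from the authors' released code: detected image-region features and word tokens are projected into one sequence, passed through a shared multi-layer Transformer, and trained with three heads for the three pre-training tasks. All module names, dimensions, and the box-encoding details below are illustrative assumptions, not Unicoder-VL's actual hyperparameters.

# Sketch of Unicoder-VL-style cross-modal pre-training (not the authors' code).
import torch
import torch.nn as nn

class CrossModalEncoder(nn.Module):
    def __init__(self, vocab_size=30522, num_obj_labels=1600,
                 region_feat_dim=2048, hidden_dim=768, layers=12, heads=12):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_dim)
        # Project detector region features (e.g. from Faster R-CNN) into the
        # same space as word embeddings; box geometry is added as a position signal.
        self.region_proj = nn.Linear(region_feat_dim, hidden_dim)
        self.box_proj = nn.Linear(5, hidden_dim)  # x1, y1, x2, y2, area (assumed encoding)
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        # Three pre-training heads.
        self.mlm_head = nn.Linear(hidden_dim, vocab_size)      # masked language modeling
        self.obj_head = nn.Linear(hidden_dim, num_obj_labels)  # masked object label prediction
        self.match_head = nn.Linear(hidden_dim, 2)             # visual-linguistic matching

    def forward(self, token_ids, region_feats, region_boxes):
        text = self.word_emb(token_ids)                                         # (B, T, H)
        vision = self.region_proj(region_feats) + self.box_proj(region_boxes)   # (B, R, H)
        joint = torch.cat([text, vision], dim=1)   # single cross-modal sequence
        hidden = self.encoder(joint)
        T = token_ids.size(1)
        return {
            "mlm_logits": self.mlm_head(hidden[:, :T]),    # predict masked words
            "obj_logits": self.obj_head(hidden[:, T:]),    # predict masked region labels
            "match_logits": self.match_head(hidden[:, 0]), # [CLS]-style image-text match score
        }

# Toy usage: 4 caption tokens and 3 detected regions per image.
model = CrossModalEncoder()
tokens = torch.randint(0, 30522, (2, 4))
feats = torch.randn(2, 3, 2048)
boxes = torch.rand(2, 3, 5)
out = model(tokens, feats, boxes)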

Citations

Unified Vision-Language Pre-Training for Image Captioning and VQA
TLDR
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.
E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
TLDR
This paper proposes the first end-to-end vision-language pre-trained model for both V+L understanding and generation, namely E2E-VLP, where a unified Transformer framework is built to jointly learn visual representations and the semantic alignments between image and text.
UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
TLDR
UC2, the first machine translation-augmented framework for cross-lingual cross-modal representation learning, is introduced to tackle the scarcity of multilingual captions for image datasets and facilitate the learning of a joint embedding space of images and all languages of interest.
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training
TLDR
A pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) consisting of three modules: object and sentence encoders that separately learn the representations of each modality, and a sentence decoder that enables both multi-modal reasoning and sentence generation via inter-modal interaction.
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
TLDR
A single UniFied TransfOrmer (UFO), capable of processing either unimodal or multimodal inputs for vision-language (VL) representation learning, achieves new state-of-the-art results on visual question answering, COCO image captioning and nocaps.
VC-GPT: Visual Conditioned GPT for End-to-End Generative Vision-and-Language Pre-training
TLDR
A novel G-VLP framework, Visual Conditioned GPT (VC-GPT), is proposed, which achieves either the best or second-best performance across all evaluation metrics compared with previous works that consume around 30 times more distinct images during cross-modal pre-training.
Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network
TLDR
This paper proposes a primary scheduled sampling strategy that elegantly mitigates such discrepancy by pretraining the encoder-decoder in a two-pass manner, and demonstrates the compelling generalizability of the pretrained encoder-decoder structure by fine-tuning on four VL understanding and generation downstream tasks.
Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training
TLDR
This work proposes a metric named Inter-Modality Flow (IMF) to measure the interaction between vision and language modalities (i.e., inter-modality) and designs a novel masking optimization mechanism named Masked Feature Regression (MFR) in Transformer to further promote inter-modality learning.
A Survey of Vision-Language Pre-Trained Models
TLDR
This paper briefly introduces several ways to encode raw images and texts to single-modal embeddings before pre-training, and dives into the mainstream architectures of VL-PTMs in modeling the interaction between text and image representations.
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation
TLDR
Experimental results show that the proposed Omni-Perception Pre-Trainer can learn strong image-text-audio multi-modal representations and achieve promising results on a variety of cross-modal understanding and generation tasks.

References

Showing 1-10 of 46 references
Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks
TLDR
It is found that doing fine-tuning on multiple languages together can bring further improvement in Unicoder, a universal language encoder that is insensitive to different languages.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.
UNITER: Learning UNiversal Image-TExt Representations
TLDR
UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets is introduced, which can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
TLDR
A new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT), which adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input.
VideoBERT: A Joint Model for Video and Language Representation Learning
TLDR
This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Dual-Path Convolutional Image-Text Embedding
TLDR
This paper builds a convolutional network amenable for fine-tuning the visual and textual representations, where the entire network only contains four components, i.e., convolution layer, pooling layer, rectified linear unit function (ReLU), and batch normalisation.
VisualBERT: A Simple and Performant Baseline for Vision and Language
TLDR
Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Improving Language Understanding by Generative Pre-Training
TLDR
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.