Unified Vision-Language Pre-Training for Image Captioning and VQA

@article{Zhou2020UnifiedVP,
  title={Unified Vision-Language Pre-Training for Image Captioning and VQA},
  author={Luowei Zhou and Hamid Palangi and Lei Zhang and Houdong Hu and Jason J. Corso and Jianfeng Gao},
  journal={ArXiv},
  year={2020},
  volume={abs/1909.11059}
}
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large… 
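The abstract describes a single shared Transformer stack that serves as both encoder and decoder, with the downstream task deciding how it is used. A minimal PyTorch sketch of that idea is given below; it is not the authors' released implementation. The class name, toy dimensions, and random region features/token ids are illustrative assumptions; the actual VLP model uses detector-based region features, BERT-style embeddings, and specific masked-LM and seq2seq pre-training objectives. The only point of the sketch is that the same weights run in a bidirectional "understanding" mode or a causal "generation" mode purely by changing the self-attention mask.

```python
import torch
import torch.nn as nn

class UnifiedTransformer(nn.Module):
    """One shared Transformer stack used for both understanding and generation (illustrative)."""

    def __init__(self, vocab_size=1000, region_dim=2048, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.img_proj = nn.Linear(region_dim, d_model)           # project detected-region features
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers)   # single stack, no separate decoder
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, regions, tokens, causal_text=False):
        img = self.img_proj(regions)          # (B, R, D) image-region embeddings
        txt = self.tok_emb(tokens)            # (B, T, D) text-token embeddings
        x = torch.cat([img, txt], dim=1)      # one joint sequence of regions + tokens
        R, T = img.size(1), txt.size(1)
        L = R + T
        # Boolean self-attention mask: True means "may not attend".
        mask = torch.zeros(L, L, dtype=torch.bool)
        if causal_text:
            # Generation (seq2seq) mode: text attends to all regions and to preceding
            # text only; regions do not attend to the text being generated.
            mask[R:, R:] = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
            mask[:R, R:] = True
        # causal_text=False leaves the mask empty, i.e. fully bidirectional attention
        # over regions and text (understanding mode, e.g. for VQA-style fine-tuning).
        h = self.shared(x, mask=mask)
        return self.lm_head(h[:, R:])         # word predictions at the text positions

# Toy usage with random placeholder inputs.
model = UnifiedTransformer()
regions = torch.randn(2, 10, 2048)            # 10 detected regions, 2048-d features each
tokens = torch.randint(0, 1000, (2, 12))      # 12 caption/question tokens
understanding_logits = model(regions, tokens, causal_text=False)
generation_logits = model(regions, tokens, causal_text=True)
```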

Citations

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
TLDR
BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones, and demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
TLDR
A single UniFied transfOrmer (UFO) that can process either unimodal or multimodal inputs for vision-language (VL) representation learning is proposed, achieving new state-of-the-art results on visual question answering, COCO image captioning, and nocaps.
Unifying Vision-and-Language Tasks via Text Generation
TLDR
This work proposes a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where the models learn to generate labels in text based on the visual and textual inputs.
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
TLDR
A pre-trained VLMo that jointly learns a dual encoder and a fusion encoder with a modular Transformer network is proposed, together with a stagewise pre-training strategy that effectively leverages large-scale image-only and text-only data in addition to image-text pairs.
E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
TLDR
This paper proposes the first end-to-end vision-language pre-trained model for both V+L understanding and generation, namely E2E-VLP, where a unified Transformer framework is built to jointly learn visual representations and semantic alignments between image and text.
DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training
TLDR
DU-VLG, a framework which unifies vision-and-language generation as sequence generation problems, is proposed and a novel commitment loss is designed to bridge the gap between image understanding and generation.
Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation
TLDR
The UniVL framework attains performance comparable to recent vision-language pre-training methods on both understanding and generation tasks, and demonstrates that prompt-based fine-tuning is more data-efficient, outperforming discriminative methods in few-shot scenarios.
A Survey of Vision-Language Pre-Trained Models
TLDR
This paper briefly introduces several ways to encode raw images and texts to single-modal embeddings before pre-training, and dives into the mainstream architectures of VL-PTMs in modeling the interaction between text and image representations.
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
TLDR
A minimal VLP model, the Vision-and-Language Transformer (ViLT), is proposed; it is monolithic in the sense that the processing of visual inputs is drastically simplified to the same convolution-free manner used for textual inputs, and it is shown to be up to 60 times faster than previous VLP models with competitive or better downstream task performance.
Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling
TLDR
A vision-language (VL) model that unifies text generation and bounding box prediction into a single architecture is proposed; it achieves performance comparable to task-specific state of the art on 7 VL benchmarks and shows the capability of generalizing to new tasks such as ImageNet object localization.
...

References

SHOWING 1-10 OF 48 REFERENCES
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
TLDR
After pre-training on large-scale image-caption pairs, Unicoder-VL is transferred to caption-based image-text retrieval and visual commonsense reasoning with just one additional output layer, showing the powerful ability of cross-modal pre-training.
UNITER: UNiversal Image-TExt Representation Learning
TLDR
UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets is introduced, which can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
Unified Language Model Pre-training for Natural Language Understanding and Generation
TLDR
A new Unified pre-trained Language Model (UniLM) that can be fine-tuned for both natural language understanding and generation tasks is presented; it compares favorably with BERT on the GLUE benchmark and on the SQuAD 2.0 and CoQA question answering tasks.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a…
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
TLDR
The LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework, a large-scale Transformer model consisting of three encoders, achieves state-of-the-art results on two visual question answering datasets and shows the generalizability of the pre-trained cross-modality model.
VideoBERT: A Joint Model for Video and Language Representation Learning
TLDR
This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
TLDR
A new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT), which adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input.
VisualBERT: A Simple and Performant Baseline for Vision and Language
TLDR
Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
Neural Baby Talk
TLDR
A novel framework for image captioning that produces natural language explicitly grounded in entities found by object detectors in the image is introduced; it reaches state-of-the-art results on both the COCO and Flickr30k datasets.
...