Unified Vision-Language Pre-Training for Image Captioning and VQA
@article{Zhou2020UnifiedVP,
  title={Unified Vision-Language Pre-Training for Image Captioning and VQA},
  author={Luowei Zhou and Hamid Palangi and Lei Zhang and Houdong Hu and Jason J. Corso and Jianfeng Gao},
  journal={ArXiv},
  year={2020},
  volume={abs/1909.11059}
}
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large…
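The shared-network design has a concrete implementation consequence: one transformer stack is reused for both understanding and generation, and the two regimes differ mainly in the self-attention mask applied over the concatenated image-region and text tokens (bidirectional attention for understanding, UniLM-style seq2seq attention for generation). Below is a minimal PyTorch sketch of that idea, not the authors' code; the names SharedVLPEncoder and build_mask, the layer sizes, and the exact masking layout are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation): one shared Transformer, two attention masks.
# "bidirectional" mode lets every token attend to every other token (understanding tasks);
# "seq2seq" mode lets image regions attend only among themselves while caption tokens attend
# to the regions and to earlier caption tokens (generation tasks).
import torch
import torch.nn as nn

def build_mask(num_regions: int, num_tokens: int, mode: str) -> torch.Tensor:
    """Return a boolean mask where True marks positions that must NOT be attended to."""
    total = num_regions + num_tokens
    mask = torch.zeros(total, total, dtype=torch.bool)
    if mode == "seq2seq":
        # Caption tokens may not look at future caption tokens.
        causal = torch.triu(torch.ones(num_tokens, num_tokens, dtype=torch.bool), diagonal=1)
        mask[num_regions:, num_regions:] = causal
        # Image regions may not look at caption tokens at all.
        mask[:num_regions, num_regions:] = True
    return mask  # "bidirectional" mode keeps the all-False (fully visible) mask

class SharedVLPEncoder(nn.Module):
    """One transformer stack reused for both encoding and decoding (illustrative sizes)."""
    def __init__(self, d_model: int = 256, nhead: int = 4, num_layers: int = 3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, region_feats, token_embs, mode: str):
        x = torch.cat([region_feats, token_embs], dim=1)  # (B, R+T, d)
        mask = build_mask(region_feats.size(1), token_embs.size(1), mode)
        return self.encoder(x, mask=mask)

# Usage: the same weights serve both task families; only the mask changes.
model = SharedVLPEncoder()
regions, caption = torch.randn(2, 10, 256), torch.randn(2, 12, 256)
understanding_out = model(regions, caption, mode="bidirectional")  # e.g. VQA-style tasks
generation_out = model(regions, caption, mode="seq2seq")           # e.g. captioning
```

Switching the mask rather than the model is what allows the same pre-trained weights to be fine-tuned for VQA-style understanding or for caption generation without a separate decoder.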
359 Citations
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- Computer Science, ArXiv
- 2022
BLIP effectively utilizes noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones, and demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
- Computer Science, ArXiv
- 2021
A single UniFied transfOrmer (UFO), capable of processing either unimodal or multimodal inputs for vision-language (VL) representation learning, is proposed, achieving new state of the art on visual question answering, COCO image captioning, and nocaps.
Unifying Vision-and-Language Tasks via Text Generation
- Computer Science, ICML
- 2021
This work proposes a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where the model learns to generate labels as text based on the visual and textual inputs.
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
- Computer Science, ArXiv
- 2021
VLMo, a pre-trained model that jointly learns a dual encoder and a fusion encoder with a modular Transformer network, is proposed, together with a stagewise pre-training strategy that effectively leverages large-scale image-only and text-only data in addition to image-text pairs.
E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
- Computer Science, ACL
- 2021
This paper proposes E2E-VLP, the first end-to-end vision-language pre-trained model for both V+L understanding and generation, in which a unified Transformer framework is built to jointly learn visual representations and semantic alignments between image and text.
DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training
- Computer Science, Findings of ACL
- 2022
DU-VLG, a framework which unifies vision-and-language generation as sequence generation problems, is proposed and a novel commitment loss is designed to bridge the gap between image understanding and generation.
Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation
- Computer Science, ArXiv
- 2021
The UniVL framework attains performance comparable to recent vision-language pre-training methods on both understanding and generation tasks, and demonstrates that prompt-based fine-tuning is more data-efficient, outperforming discriminative methods in few-shot scenarios.
A Survey of Vision-Language Pre-Trained Models
- Computer Science, ArXiv
- 2022
This paper briefly introduces several ways to encode raw images and texts to single-modal embeddings before pre-training, and dives into the mainstream architectures of VL-PTMs in modeling the interaction between text and image representations.
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
- Computer Science, ICML
- 2021
A minimal VLP model, Vision-and-Language Transformer (ViLT), is presented; it is monolithic in the sense that the processing of visual inputs is drastically simplified to the same convolution-free manner used for textual inputs, and ViLT is shown to be up to 60 times faster than previous VLP models with competitive or better downstream task performance.
Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling
- Computer Science, ArXiv
- 2021
A vision-language (VL) model that unifies text generation and bounding box prediction in a single architecture is presented; it achieves performance comparable to task-specific state of the art on 7 VL benchmarks and shows the capability to generalize to new tasks such as ImageNet object localization.
References
Showing 1-10 of 48 references
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
- Computer Science, AAAI
- 2020
After pre-training on large-scale image-caption pairs, Unicoder-VL is transferred to caption-based image-text retrieval and visual commonsense reasoning with just one additional output layer, demonstrating the power of cross-modal pre-training.
UNITER: UNiversal Image-TExt Representation Learning
- Computer Science, ECCV
- 2020
UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
Unified Language Model Pre-training for Natural Language Understanding and Generation
- Computer Science, NeurIPS
- 2019
A new Unified pre-trained Language Model (UniLM) that can be fine-tuned for both natural language understanding and generation tasks is presented; it compares favorably with BERT on the GLUE benchmark and on the SQuAD 2.0 and CoQA question answering tasks.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
- Computer Science, NeurIPS
- 2019
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a…
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
- Computer Science, EMNLP
- 2019
The LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework, a large-scale Transformer model consisting of three encoders, achieves state-of-the-art results on two visual question answering datasets and shows the generalizability of the pre-trained cross-modality model.
VideoBERT: A Joint Model for Video and Language Representation Learning
- Computer Science, 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
- Computer Science, ICLR
- 2020
A new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT), which adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input.
VisualBERT: A Simple and Performant Baseline for Vision and Language
- Computer Science, ArXiv
- 2019
Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
Neural Baby Talk
- Computer Science, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
- 2018
A novel framework for image captioning is introduced that produces natural language explicitly grounded in entities found by object detectors in the image, reaching state-of-the-art results on both the COCO and Flickr30k datasets.