• Corpus ID: 245131381

Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text

Qing Li, Boqing Gong, Yin Cui, D. Kondratyuk, Xianzhi Du, Ming-Hsuan Yang, Matthew Brown
In this paper, we explore the possibility of building a unified foundation model that can be adapted to both vision-only and text-only tasks. Starting from BERT and ViT, we design a unified transformer consisting of modality-specific tokenizers, a shared transformer encoder, and task-specific output heads. To efficiently pre-train the proposed model jointly on unpaired images and text, we propose two novel techniques: (i) we employ the separately trained BERT and ViT models as teachers and apply…
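The three-part layout named in the abstract (modality-specific tokenizers, a shared encoder, task-specific output heads) can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the class name `UnifiedModel`, the width `D`, the single self-attention layer, the mean pooling, and the task names are all assumptions chosen to keep the example small, and the weights are random rather than pre-trained.

```python
import math
import random

D = 8  # hidden width (illustrative assumption)


def rand_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(x, W):
    """Multiply a vector of length len(W) by a len(W) x cols matrix."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


class UnifiedModel:
    """Hypothetical toy model: two tokenizers, one shared encoder, two heads."""

    def __init__(self, vocab_size, patch_dim, num_classes, seed=0):
        rng = random.Random(seed)
        # Modality-specific tokenizers.
        self.text_embed = rand_matrix(vocab_size, D, rng)  # token id -> vector
        self.patch_proj = rand_matrix(patch_dim, D, rng)   # flat patch -> vector
        # Shared encoder: one self-attention layer (an assumption for brevity).
        self.Wq = rand_matrix(D, D, rng)
        self.Wk = rand_matrix(D, D, rng)
        self.Wv = rand_matrix(D, D, rng)
        # Task-specific output heads (task names are hypothetical).
        self.heads = {
            "image_cls": rand_matrix(D, num_classes, rng),
            "text_cls": rand_matrix(D, 2, rng),
        }

    def tokenize_text(self, ids):
        return [list(self.text_embed[i]) for i in ids]

    def tokenize_image(self, patches):
        return [matvec(p, self.patch_proj) for p in patches]

    def encode(self, tokens):
        """Scaled dot-product self-attention with a residual connection."""
        q = [matvec(t, self.Wq) for t in tokens]
        k = [matvec(t, self.Wk) for t in tokens]
        v = [matvec(t, self.Wv) for t in tokens]
        out = []
        for i, qi in enumerate(q):
            scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(D) for kj in k]
            w = softmax(scores)
            attended = [sum(w[j] * v[j][d] for j in range(len(v))) for d in range(D)]
            out.append([t + a for t, a in zip(tokens[i], attended)])
        return out

    def forward(self, task, tokens):
        hidden = self.encode(tokens)
        # Mean pooling over tokens (an assumption; other pooling is possible).
        pooled = [sum(h[d] for h in hidden) / len(hidden) for d in range(D)]
        return softmax(matvec(pooled, self.heads[task]))


model = UnifiedModel(vocab_size=10, patch_dim=4, num_classes=3)
image_tokens = model.tokenize_image([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]])
text_tokens = model.tokenize_text([1, 4, 7])
image_probs = model.forward("image_cls", image_tokens)
text_probs = model.forward("text_cls", text_tokens)
```

Both inputs pass through the same `encode` method; only the tokenizer in front of it and the head behind it differ by modality or task.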


Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
This work presents the Language-Image MoE, LIMoE, a sparse mixture of experts model capable of multimodal learning, and proposes an entropy-based regularization scheme for its training stability and balanced expert utilization.
HighMMT: Towards Modality and Task Generalization for High-Modality Representation Learning
A general multimodal model that enables multitask and transfer learning: multitask learning with shared parameters enables stable parameter counts, and cross-modal transfer learning enables information sharing across modalities and tasks (addressing partial observability).
Multimodal Learning with Transformers: A Survey
A comprehensive survey of Transformer techniques oriented at multimodal data and a discussion of open problems and potential research directions for the community are presented.


UniT: Multimodal Multitask Learning with a Unified Transformer
UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning, achieves strong performance on each task with significantly fewer parameters.
Training data-efficient image transformers & distillation through attention
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
UNITER: UNiversal Image-TExt Representation Learning
UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets is introduced, which can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
The convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks and shows the generalizability of the model despite the domain gap between videos and images.
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
The LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework, a large-scale Transformer model that consists of three encoders, achieves state-of-the-art results on two visual question answering datasets and shows the generalizability of the pre-trained cross-modality model.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
OmniNet: A unified architecture for multi-modal multi-task learning
An extended and unified architecture that can be used for tasks involving a variety of modalities, such as images, text, and video, is introduced, and a spatio-temporal cache mechanism is proposed that enables learning the spatial dimension of the input in addition to the hidden states corresponding to the temporal input sequence.
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a…
12-in-1: Multi-Task Vision and Language Representation Learning
This work develops a large-scale, multi-task model that culminates in a single model on 12 datasets from four broad categories of task including visual question answering, caption-based image retrieval, grounding referring expressions, and multimodal verification and shows that finetuning task-specific models from this model can lead to further improvements, achieving performance at or above the state-of-the-art.