Corpus ID: 244462912

UFO: A UniFied TransfOrmer for Vision-Language Representation Learning

@article{Wang2021UFOAU,
  title={UFO: A UniFied TransfOrmer for Vision-Language Representation Learning},
  author={Jianfeng Wang and Xiaowei Hu and Zhe Gan and Zhengyuan Yang and Xiyang Dai and Zicheng Liu and Yumao Lu and Lijuan Wang},
  journal={ArXiv},
  year={2021},
  volume={abs/2111.10023}
}
In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question), for vision-language (VL) representation learning. Existing approaches typically design an individual network for each modality and/or a specific fusion network for multimodal tasks. To simplify the network architecture, we use a single transformer network and enforce multi-task… 
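A minimal PyTorch sketch may make the design concrete: a single transformer whose weights are shared across the image-only, text-only, and fused cases, with only lightweight modality-specific embedders in front of it. The class name, dimensions, and patch-embedding choice below are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class UnifiedTransformer(nn.Module):
    """Hypothetical sketch: one transformer shared across modalities."""
    def __init__(self, dim=768, depth=12, heads=12, vocab_size=30522, patch_dim=3 * 16 * 16):
        super().__init__()
        # Lightweight modality-specific embedders; the transformer weights are shared.
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.patch_embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, text_ids=None, image_patches=None):
        tokens = []
        if image_patches is not None:           # image-only or fused input
            tokens.append(self.patch_embed(image_patches))
        if text_ids is not None:                # text-only or fused input
            tokens.append(self.text_embed(text_ids))
        x = torch.cat(tokens, dim=1)            # multimodal case: simple concatenation
        return self.encoder(x)

model = UnifiedTransformer(depth=2)             # shallow depth just to keep the demo light
img = torch.randn(2, 196, 3 * 16 * 16)          # 14x14 patches of a 224x224 RGB image
txt = torch.randint(0, 30522, (2, 20))          # tokenized caption or question
image_only = model(image_patches=img)           # unimodal image encoding (e.g. retrieval)
text_only = model(text_ids=txt)                 # unimodal text encoding (e.g. retrieval)
fused = model(text_ids=txt, image_patches=img)  # image + question fusion (e.g. VQA)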
GIT: A Generative Image-to-text Transformer for Vision and Language
TLDR
This paper designs and trains GIT to unify vision-language tasks such as image/video captioning and question answering, and presents a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.
All in One: Exploring Unified Video-Language Pre-training
TLDR
This work introduces an end-to-end video-language model, namely all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture and introduces a novel and effective token rolling operation to encode temporal representations from video clips in a non-parametric manner.
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
TLDR
This work presents FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both coarse-grained (image-level) and fine-grained (region-level) tasks and provides consistent performance improvements over strong baselines across all tasks, often outperforming methods trained on orders of magnitude more data.
Flamingo: a Visual Language Model for Few-Shot Learning
TLDR
It is demonstrated that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples.
xGQA: Cross-Lingual Visual Question Answering
TLDR
This work provides xGQA, a new multilingual evaluation benchmark for the visual question answering task, extending the established English GQA dataset to 7 typologically diverse languages and enabling the detection and exploration of crucial challenges in cross-lingual visual question answering.
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
TLDR
This work presents LIMoE, a Language-Image Mixture of Experts: a sparse mixture-of-experts model capable of multimodal learning, trained with a proposed entropy-based regularization scheme and demonstrating remarkable performance improvements over dense models of equivalent computational cost.
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
TLDR
This work builds on frozen bidirectional language models (BiLM) and shows that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA, also demonstrating competitive performance in the few-shot and fully supervised settings.
REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering
TLDR
This paper revisits visual representation in knowledge-based visual question answering (VQA), demonstrates that better use of regional information can significantly improve performance, and proposes a new knowledge-based VQA method, REVIVE, which exploits the explicit information of object regions not only in the knowledge retrieval stage but also in the answering model.
Multimodal Learning with Transformers: A Survey
TLDR
A comprehensive survey of Transformer techniques oriented at multimodal data and a discussion of open problems and potential research directions for the community are presented.
CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval
TLDR
CommerceMM is introduced, a multimodal model capable of providing a diverse and granular understanding of the commerce topics associated with a given piece of content, with the capability to generalize to a wide range of tasks.
...

References

SHOWING 1-10 OF 61 REFERENCES
Unified Vision-Language Pre-Training for Image Captioning and VQA
TLDR
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.
Unifying Vision-and-Language Tasks via Text Generation
TLDR
This work proposes a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where the models learn to generate labels in text based on the visual and textual inputs.
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
TLDR
This paper introduces a contrastive loss to ALign the image and text representations BEfore Fusing through cross-modal attention, which enables more grounded vision and language representation learning and proposes momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
TLDR
A minimal VLP model, the Vision-and-Language Transformer (ViLT), is presented, monolithic in the sense that the processing of visual inputs is drastically simplified to the same convolution-free manner in which textual inputs are processed; ViLT is up to 60 times faster than previous VLP models, yet delivers competitive or better downstream task performance.
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
TLDR
After pre-training on large-scale image-caption pairs, Unicoder-VL is transferred to caption-based image-text retrieval and visual commonsense reasoning with just one additional output layer, demonstrating the power of cross-modal pre-training.
UNITER: UNiversal Image-TExt Representation Learning
TLDR
UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets is introduced, which can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
TLDR
The LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework, a large-scale Transformer model consisting of three encoders, achieves state-of-the-art results on two visual question answering datasets and shows the generalizability of the pre-trained cross-modality model.
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
TLDR
This paper proposes SOHO ("Seeing Out of tHe bOx"), which takes a whole image as input and learns vision-language representations in an end-to-end manner; it does not require bounding box annotations, which enables inference 10 times faster than region-based approaches.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.
Learning Transferable Visual Models From Natural Language Supervision
TLDR
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
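The pre-training task summarized in this entry, predicting which caption goes with which image over a batch, is commonly implemented as a symmetric contrastive loss over a batch-wise similarity matrix. A minimal sketch under that reading follows; the function name, temperature value, and embedding dimensions are illustrative assumptions rather than the paper's exact recipe.

import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))            # matching pairs sit on the diagonal
    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Random embeddings stand in for the image and text encoder outputs.
loss = image_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))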
...