MixGen: A New Multi-Modal Data Augmentation

Xiaoshuai Hao, Yi Zhu, Srikar Appalaraju, Aston Zhang, Wanqian Zhang, Boyang Li, Mu Li
Data augmentation is a necessity for improving data efficiency in deep learning. For vision-language pre-training, previous works augment data for either images or text, but not both. In this paper, we present MixGen: a joint data augmentation method for vision-language representation learning that further improves data efficiency. It generates new image-text pairs, with semantic relationships preserved, by interpolating images and concatenating text. It is simple, and can be plugged into existing…
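The abstract names the two MixGen operations concretely: interpolate a pair of images and concatenate their captions. A minimal sketch of that idea (the function name, the reverse-order pairing, and the fixed mixing ratio are illustrative assumptions, not details from the paper):

```python
import numpy as np

def mixgen(images, texts, lam=0.5):
    """Joint image-text augmentation in the spirit of MixGen:
    blend pixel values of two images and concatenate their captions,
    so the new pair keeps the semantics of both sources.

    images: (N, H, W, C) float array; texts: list of N strings.
    The reverse-order pairing below is an arbitrary illustrative choice.
    """
    n = len(texts)
    perm = np.arange(n)[::-1]                       # pair example i with example n-1-i
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_texts = [texts[i] + " " + texts[perm[i]] for i in range(n)]
    return mixed_images, mixed_texts
```

Because the augmentation only touches the raw inputs, a sketch like this could sit in front of any existing vision-language training pipeline without architectural changes, which is what the abstract means by "plug-and-play".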



Vision-Language Pre-Training with Triple Contrastive Learning

TCL is the first work that takes into account local structure information for multi-modality representation learning and achieves the new state of the art on various common downstream vision-language tasks such as image-text retrieval and visual question answering.

E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning

This paper proposes the first end-to-end vision-language pre-trained model for both V+L understanding and generation, namely E2E-VLP, where a unified Transformer framework is built to jointly learn visual representation, and semantic alignments between image and text.

VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

VLMo is proposed: a pretrained model that jointly learns a dual encoder and a fusion encoder with a modular Transformer network and a stagewise pre-training strategy, effectively leveraging large-scale image-only and text-only data in addition to image-text pairs.

UNITER: UNiversal Image-TExt Representation Learning

UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; its joint multimodal embeddings can power heterogeneous downstream V+L tasks.

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

This paper introduces a contrastive loss to ALign the image and text representations BEfore Fusing through cross-modal attention, which enables more grounded vision and language representation learning and proposes momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.

Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

The results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

BLIP is proposed, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks, and demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.

Learning Transferable Visual Models From Natural Language Supervision

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
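The "predicting which caption goes with which image" task is typically trained with a symmetric contrastive loss over a batch of paired embeddings. A minimal sketch under stated assumptions (both embedding matrices are L2-normalized, matching pairs share a row index, and the temperature value is illustrative):

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss sketch.
    img_emb, txt_emb: (N, D) arrays of L2-normalized embeddings, where
    row i of each matrix is a matching image-text pair."""
    logits = img_emb @ txt_emb.T / temperature       # (N, N) cosine similarities
    labels = np.arange(len(logits))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)         # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()          # true pairs lie on the diagonal

    # average the image->text and text->image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Each row's correct match is its own diagonal entry, so the loss pushes matching pairs together and all other in-batch pairs apart in both directions.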

Unsupervised Data Augmentation for Consistency Training

A new perspective on how to effectively noise unlabeled examples is presented and it is argued that the quality of noising, specifically those produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning.
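The consistency objective underlying this line of work penalizes divergence between a model's predictions on an unlabeled example and on its noised (augmented) version. A minimal sketch of that term (the KL formulation and epsilon smoothing are common choices, shown here for illustration):

```python
import numpy as np

def consistency_loss(p_clean, p_aug, eps=1e-12):
    """Consistency term for semi-supervised training: KL divergence between
    the predicted class distribution on a clean unlabeled example (p_clean)
    and on its augmented view (p_aug). Both are probability vectors."""
    return float(np.sum(p_clean * (np.log(p_clean + eps) - np.log(p_aug + eps))))
```

The paper's point is that the *quality* of the noise matters: stronger, learned augmentations produce harder but still label-preserving views, making this term far more informative than random perturbations.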

Grounded Language-Image Pre-training

A grounded language-image pretraining model for learning object-level, language-aware, and semantic-rich visual representations that unifies object detection and phrase grounding for pre-training and can leverage massive image-text pairs by generating grounding boxes in a self-training fashion.