MixGen: A New Multi-Modal Data Augmentation
@article{Hao2022MixGenAN,
  title   = {MixGen: A New Multi-Modal Data Augmentation},
  author  = {Xiaoshuai Hao and Yi Zhu and Srikar Appalaraju and Aston Zhang and Wanqian Zhang and Boyang Li and Mu Li},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2206.08358}
}
Data augmentation is a necessity for improving data efficiency in deep learning. In prior work on vision-language pre-training, data is augmented either for images or for text, but not for both jointly. In this paper, we present MixGen: a joint data augmentation for vision-language representation learning that further improves data efficiency. It generates new image-text pairs, with semantic relationships preserved, by interpolating images and concatenating text. It is simple, and can be plugged into existing…
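To make the idea concrete, below is a minimal PyTorch sketch of the augmentation the abstract describes: linear interpolation of image pixels paired with concatenation of the corresponding captions. The function name mixgen, the in-batch pairing scheme, and the interpolation weight lam=0.5 are illustrative assumptions, not the paper's exact formulation; consult the paper for the actual hyperparameters.

    import torch

    def mixgen(images, texts, lam=0.5):
        # images: (B, C, H, W) float tensor; texts: list of B caption strings.
        # Pairs example i with example i-1 (wrap-around) to form new pairs.
        # lam is the interpolation weight; 0.5 is an assumption here.
        B = images.size(0)
        partners = torch.roll(images, shifts=1, dims=0)       # partners[i] == images[i-1]
        mixed_images = lam * images + (1.0 - lam) * partners  # pixel-level interpolation
        mixed_texts = [texts[i] + " " + texts[(i - 1) % B] for i in range(B)]  # caption concat
        return mixed_images, mixed_texts

Because the augmented pairs are formed before tokenization and require no new model components, they can simply replace a fraction of each training batch, which is why the method drops into existing vision-language pipelines without architectural changes.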
References
Vision-Language Pre-Training with Triple Contrastive Learning
- CVPR 2022
TCL is the first work to take local structure information into account for multi-modality representation learning, and it achieves new state-of-the-art results on common downstream vision-language tasks such as image-text retrieval and visual question answering.
E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
- ACL 2021
This paper proposes E2E-VLP, the first end-to-end vision-language pre-trained model for both V+L understanding and generation, in which a unified Transformer framework is built to jointly learn visual representations and semantic alignments between image and text.
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
- arXiv 2021
VLMo, a pretrained model that jointly learns a dual encoder and a fusion encoder with a modular Transformer network, is proposed, along with a stagewise pre-training strategy that effectively leverages large-scale image-only and text-only data in addition to image-text pairs.
UNITER: UNiversal Image-TExt Representation Learning
- ECCV 2020
UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
- NeurIPS 2021
This paper introduces a contrastive loss to ALign the image and text representations BEfore Fusing them through cross-modal attention, which enables more grounded vision and language representation learning; it also proposes momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model.
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
- CVPR 2021
The results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- ICML 2022
BLIP, a new VLP framework that transfers flexibly to both vision-language understanding and generation tasks, is proposed; it demonstrates strong generalization ability when transferred directly to video-language tasks in a zero-shot manner.
Learning Transferable Visual Models From Natural Language Supervision
- ICML 2021
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
Unsupervised Data Augmentation for Consistency Training
- NeurIPS 2020
A new perspective on how to effectively noise unlabeled examples is presented, and it is argued that the quality of noising, specifically the noise produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning.
Grounded Language-Image Pre-training
- CVPR 2022
A grounded language-image pre-training model for learning object-level, language-aware, and semantically rich visual representations is presented; it unifies object detection and phrase grounding for pre-training and can leverage massive image-text pairs by generating grounding boxes in a self-training fashion.