UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

@article{Li2021UNIMOTU,
  title={UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning},
  author={Wei Li and Can Gao and Guocheng Niu and Xinyan Xiao and Hao Liu and Jiachen Liu and Hua Wu and Haifeng Wang},
  journal={ArXiv},
  year={2021},
  volume={abs/2012.15409}
}
Existing pre-training methods focus on either single-modal or multi-modal tasks, and cannot effectively adapt to the other. They can only utilize single-modal data (i.e., text or images) or limited multi-modal data (i.e., image-text pairs). In this work, we propose a UNIfied-MOdal pre-training architecture, namely UNIMO, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks. Large-scale free text corpora and image collections are utilized to… 
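
As illustration of the kind of objective the abstract refers to, here is a minimal sketch of a generic cross-modal contrastive (InfoNCE-style) loss over image-text pairs. It is not UNIMO's exact cross-modal contrastive formulation; the embedding dimension, temperature, and random embeddings standing in for encoder outputs are placeholder assumptions.

import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(image_emb: torch.Tensor,
                                 text_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim); row i of each tensor forms a positive pair."""
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the positive pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

if __name__ == "__main__":
    img = torch.randn(8, 256)   # stand-in for image encoder outputs
    txt = torch.randn(8, 256)   # stand-in for text encoder outputs
    print(cross_modal_contrastive_loss(img, txt).item())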

Transformers in computational visual media: A survey

TLDR
This study comprehensively surveys recent visual transformer works and focuses on visual transformer methods in low-level vision and generation, which use a self-attention mechanism rather than the sequential structure of RNNs.

VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix

TLDR
A new unpaired VLP method, dubbed VLMixer, is presented that integrates cross-modal CutMix (CMC) with contrastive learning to pull together the uni-modal and multi-modal views for better instance-level alignment among different modalities.

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

TLDR
The development in this field is summarized into three time periods, namely task-specific methods, vision-language pre-training (VLP) methods, and larger models empowered by large-scale weakly-labeled data.

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

TLDR
BLIP effectively utilizes noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones, and demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.

KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation

TLDR
This paper proposes an object-aware end-to-end VLP framework, which directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly, and designs two novel pretext tasks by taking object features and their semantic labels from external detectors as supervision.

DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training

TLDR
DU-VLG, a framework which unifies vision-and-language generation as sequence generation problems, is proposed and a novel commitment loss is designed to bridge the gap between image understanding and generation.

Anticipating the Unseen Discrepancy for Vision and Language Navigation

TLDR
A semi-supervised framework, DAVIS, is proposed that leverages visual consistency signals across similar semantic observations and enhances the basic mixture of imitation and reinforcement learning with Momentum Contrast to encourage stable decision-making on similar observations, under a joint training stage and a test-time adaptation stage.

Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training

TLDR
It is discovered that sharing parameters leads to semantic concepts from different modalities being encoded more closely in the embedding space, facilitating the transferring of common semantic structure from language to vision.

Vision-and-Language Pretraining

TLDR
This article categorizes and delineates pretraining approaches, along with a summary of state-of-the-art vision-and-language pretrained models, and supplies a list of training datasets and downstream tasks to further sharpen the perspective on V&L pretraining.

Improving Personalized Explanation Generation through Visualization

TLDR
A visually-enhanced approach named METER is proposed with the help of visualization generation and text–image matching discrimination: the explainable recommendation model is encouraged to visualize what it refers to while incurring a penalty if the visualization is incongruent with the textual explanation.
...

References

SHOWING 1-10 OF 51 REFERENCES

Dynamic Context-guided Capsule Network for Multimodal Machine Translation

TLDR
This paper proposes a novel Dynamic Context-guided Capsule Network (DCCN) for MMT, which represents the input image with global and regional visual features, and introduces two parallel DCCNs to model multimodal context vectors with visual features at different granularities.

UNITER: UNiversal Image-TExt Representation Learning

TLDR
UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets is introduced, which can power heterogeneous downstream V+L tasks with joint multimodal embeddings.

InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining

TLDR
A novel model, InterBERT (BERT for Interaction), is proposed, which has a strong capability of modeling interactions between the information flows of different modalities and is developed as the first Chinese multi-modal pretrained model.

RoBERTa: A Robustly Optimized BERT Pretraining Approach

TLDR
It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

Unified Language Model Pre-training for Natural Language Understanding and Generation

TLDR
A new Unified pre-trained Language Model (UniLM) is presented that can be fine-tuned for both natural language understanding and generation tasks, and that compares favorably with BERT on the GLUE benchmark and on the SQuAD 2.0 and CoQA question answering tasks.

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

TLDR
This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.
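
As a rough illustration of the RPN idea described above, the following sketch shows a small convolutional head sliding over shared backbone features and predicting per-anchor objectness scores and box-regression offsets. Channel sizes and the anchor count are placeholder assumptions, and proposal decoding/NMS is omitted; this is not the reference Faster R-CNN implementation.

import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels: int = 256, num_anchors: int = 9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(in_channels, num_anchors, kernel_size=1)
        self.bbox_deltas = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)

    def forward(self, features: torch.Tensor):
        # features: (N, C, H, W) backbone map shared with the detection network,
        # so proposals come almost for free on top of the backbone computation.
        x = torch.relu(self.conv(features))
        return self.objectness(x), self.bbox_deltas(x)

if __name__ == "__main__":
    feats = torch.randn(1, 256, 38, 50)      # stand-in for a backbone feature map
    scores, deltas = RPNHead()(feats)
    print(scores.shape, deltas.shape)        # (1, 9, 38, 50), (1, 36, 38, 50)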

Microsoft COCO: Common Objects in Context

We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding.

A Simple Framework for Contrastive Learning of Visual Representations

TLDR
It is shown that composition of data augmentations plays a critical role in defining effective predictive tasks, and introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
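
For illustration of the components this summary mentions, here is a minimal NT-Xent sketch in the spirit of SimCLR: two augmented views are mapped through a learnable nonlinear projection head and contrasted against all other in-batch samples. The encoder outputs, dimensions, and temperature below are placeholder assumptions, and the augmentation pipeline itself is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Learnable nonlinear transformation between the representation and the contrastive loss."""
    def __init__(self, in_dim: int = 512, hidden: int = 512, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
                                 nn.Linear(hidden, out_dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """z1, z2: (batch, dim) projections of two views; row i of z1 pairs with row i of z2."""
    batch = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)            # (2B, dim)
    sim = z @ z.t() / temperature                                  # (2B, 2B) cosine similarities
    mask = torch.eye(2 * batch, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                     # exclude self-similarity
    # The positive for sample i is its other augmented view: i+B for i < B, i-B otherwise.
    targets = torch.cat([torch.arange(batch, 2 * batch), torch.arange(batch)]).to(z.device)
    return F.cross_entropy(sim, targets)

if __name__ == "__main__":
    head = ProjectionHead()
    h1, h2 = torch.randn(16, 512), torch.randn(16, 512)  # stand-ins for encoder outputs of two views
    print(nt_xent_loss(head(h1), head(h2)).item())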

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

TLDR
After pre-training on large-scale image-caption pairs, Unicoder-VL is transferred to caption-based image-text retrieval and visual commonsense reasoning with just one additional output layer, demonstrating the powerful ability of cross-modal pre-training.

VisualBERT: A Simple and Performant Baseline for Vision and Language

TLDR
Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
...