• Corpus ID: 235694336

OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation

Jing Liu, Xinxin Zhu, Fei Liu, Longteng Guo, Zijia Zhao, Ming-Ting Sun, Weining Wang, Hanqing Lu, Shiyu Zhou, Jiajun Zhang, Jinqiao Wang
In this paper, we propose an Omni-Perception Pre-Trainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, text and audio resources. OPT is constructed in an encoder-decoder framework, including three single-modal encoders to generate token-based embeddings for each modality, a cross-modal encoder to encode the correlations among the three modalities, and two cross-modal decoders to generate text and image respectively. For OPT’s pre-training, we design a multi…
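
The data flow described in the abstract (three single-modal encoders, one cross-modal encoder, two decoders) can be illustrated with a minimal toy sketch. Everything here is an assumption made for illustration: the function names, the embedding width `DIM`, the mean-mixing fusion, and the pooling decoders are hypothetical stand-ins, not the authors' implementation, which the abstract does not specify.

```python
import random
from typing import Callable, Dict, List

Vector = List[float]
Tokens = List[Vector]

DIM = 8  # hypothetical embedding width, chosen only for this sketch


def make_encoder(seed: int) -> Callable[[List[float]], Tokens]:
    """Build a toy single-modal encoder (stand-in for a real backbone)."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

    def encode(items: List[float]) -> Tokens:
        # One DIM-wide token embedding per raw input item.
        return [[x * w for w in weights] for x in items]

    return encode


def cross_modal_encode(streams: Dict[str, Tokens]) -> Tokens:
    """Toy fusion: concatenate all tokens, then mix each with the global mean."""
    tokens = [tok for name in sorted(streams) for tok in streams[name]]
    mean = [sum(tok[d] for tok in tokens) / len(tokens) for d in range(DIM)]
    return [[0.5 * tok[d] + 0.5 * mean[d] for d in range(DIM)] for tok in tokens]


def decode(fused: Tokens, out_len: int) -> List[float]:
    """Toy decoder: pool the fused tokens and emit out_len scalar outputs."""
    pooled = [sum(tok[d] for tok in fused) / len(fused) for d in range(DIM)]
    return [pooled[i % DIM] for i in range(out_len)]


def opt_forward(image: List[float], text: List[float], audio: List[float],
                text_len: int = 5, image_len: int = 16) -> Dict[str, list]:
    """Run the three encoders, the cross-modal encoder, and both decoders."""
    streams = {
        "vision": make_encoder(0)(image),
        "text": make_encoder(1)(text),
        "audio": make_encoder(2)(audio),
    }
    fused = cross_modal_encode(streams)
    return {
        "fused": fused,
        "text_out": decode(fused, text_len),    # text-generation branch
        "image_out": decode(fused, image_len),  # image-generation branch
    }
```

Calling `opt_forward([0.1, 0.2, 0.3], [1.0, 2.0], [0.5])` yields six fused tokens (one per input item across the three modalities) and two output sequences, one per decoder. The sketch only shows how the components connect; it says nothing about the pre-training tasks.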


MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Techniques
This paper proposes a single-modality pretrained feature fusion technique, composed of a multi-view feature extraction method and a purpose-designed multi-modality feature fusion strategy, which surpasses the state-of-the-art methods on both the MSR-VTT and VATEX datasets.
X-ray imaging meets deep learning
  • Ge Wang
  • Computer Science, Engineering
    Optical Engineering + Applications
  • 2021
Background on where x-ray imaging meets deep learning is provided, representative results on low-dose CT, sparse-data CT, and deep radiomics are described, and opportunities to combine data-driven and model-based methods for x-ray CT, other imaging modalities, and their combinations are discussed so that imaging services can be significantly improved for precision medicine.
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
As humans, we navigate the world through all our senses, using perceptual input from each one to correct the others. We introduce MERLOT Reserve, a model that represents videos jointly over time –


Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
After pretraining on large-scale image-caption pairs, Unicoder-VL is transferred to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer, and the powerful ability of the cross-modal pre-training is shown.
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
The LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework, a large-scale Transformer model that consists of three encoders, achieves the state-of-the-art results on two visual question answering datasets and shows the generalizability of the pre-trained cross-modality model.
Unified Vision-Language Pre-Training for Image Captioning and VQA
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0.
Listen, Look and Deliberate: Visual Context-Aware Speech Recognition Using Pre-Trained Text-Video Representations
Novel VC-ASR approaches that leverage video and text representations extracted by a self-supervised pre-trained text-video embedding model are explored, and a multi-stream attention architecture that leverages signals from both audio and video modalities is proposed.
UNITER: Learning UNiversal Image-TExt Representations
UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
Multi-Format Contrastive Learning of Audio Representations
This work investigates the use of the contrastive learning framework to learn audio representations by maximizing the agreement between the raw audio and its spectral representation and finds a significant gain using this multi-format strategy against the single-format counterparts.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a
Adversarial Cross-Modal Retrieval
Comprehensive experimental results show that the proposed ACMR method is superior in learning effective subspace representation and that it significantly outperforms the state-of-the-art cross-modal retrieval methods.
End-to-end Multimodal Speech Recognition
This paper analyzes the behavior of CTC and S2S models on noisy video data (How-To corpus), and compares it to results on the clean Wall Street Journal corpus, providing insight into the robustness of both approaches.
MASS: Masked Sequence to Sequence Pre-training for Language Generation
This work proposes MAsked Sequence to Sequence pre-training (MASS) for the encoder-decoder based language generation tasks, which achieves the state-of-the-art accuracy on the unsupervised English-French translation, even beating the early attention-based supervised model.