VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
@article{Akbari2021VATTTF, title={VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text}, author={Hassan Akbari and Liangzhe Yuan and Rui Qian and Wei-Hong Chuang and Shih-Fu Chang and Yin Cui and Boqing Gong}, journal={ArXiv}, year={2021}, volume={abs/2104.11178} }
We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event…
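As a rough illustration of the multimodal contrastive objective mentioned in the abstract, the sketch below implements a symmetric InfoNCE-style loss between two batches of modality embeddings (e.g., video and audio projections of the same clips). It is a minimal PyTorch sketch under assumptions, not the paper's exact formulation; the function name `nce_loss` and the temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def nce_loss(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of modality embeddings.

    z_a, z_b: (batch, dim) projections of two modalities; row i of z_a is
    assumed to be the positive pair of row i of z_b (same clip). Illustrative
    sketch only, not VATT's exact loss.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                     # (batch, batch) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)   # diagonal entries are the positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

VATT pairs losses of this general form across modalities, combining an NCE objective for video-audio with a MIL-NCE-style objective for video-text (see the MIL-NCE reference further below).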
102 Citations
Everything at Once - Multi-modal Fusion Transformer for Video Retrieval
- Computer Science, ArXiv
- 2021
This work presents a multi-modal, modality-agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and integrates them into a joint multi-modal representation to obtain an embedding that aggregates multi-modal temporal information.
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
- Computer Science, ArXiv
- 2021
A single UniFied transfOrmer (UFO), capable of processing either unimodal or multi-modality inputs for vision-language (VL) representation learning, achieving new state-of-the-art results on visual question answering, COCO image captioning, and nocaps.
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling
- Computer Science, ArXiv
- 2021
This work presents VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs, and designs a new pretraining task, Masked Visual-token Modeling (MVM), for better video modeling.
All in One: Exploring Unified Video-Language Pre-training
- Computer Science, ArXiv
- 2022
This work introduces an end-to-end video-language model, namely the all-in-one Transformer, which embeds raw video and textual signals into joint representations using a unified backbone architecture, and introduces a novel and effective token rolling operation to encode temporal representations from video clips in a non-parametric manner.
Long-Short Temporal Contrastive Learning of Video Transformers
- Computer Science, ArXiv
- 2021
It is empirically demonstrated that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par with or better than those obtained with supervised pretraining on large-scale image datasets, even massive ones such as ImageNet-21K.
PolyViT: Co-training Vision Transformers on Images, Videos and Audio
- Computer Science
- 2021
Co-training PolyViT on multiple modalities and tasks leads to a model that is even more parameter-efficient and learns representations that generalize across multiple domains, while remaining simple and practical to implement.
Enhancing Contrastive Learning with Temporal Cognizance for Audio-Visual Representation Generation
- Computer Science, ICASSP
- 2022
The proposed modeling approach builds upon recent advances in contrastive-loss-based audio-visual representation learning, and the results indicate that adding temporal information significantly improves the performance of the contrastive framework.
Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text
- Computer Science, ArXiv
- 2021
The experiments show that the resultant unified foundation transformer works surprisingly well on both the vision-only and text-only tasks, and the proposed knowledge distillation and gradient masking strategy can effectively lift the performance to approach the level of separately-trained models.
Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions
- Computer Science, Inf. Fusion
- 2022
Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos
- Computer Science, ArXiv
- 2022
A novel learnable irrelevant modality dropout (IMD) is proposed to completely drop out the irrelevant audio modality and fuse only the relevant modalities; results on several vision-specific annotated datasets validate the framework, which outperforms the most relevant action recognition methods.
References
Showing 1-10 of 124 references
Self-Supervised MultiModal Versatile Networks
- Computer Science, NeurIPS
- 2020
This work learns representations using self-supervision by leveraging three modalities naturally present in videos (vision, audio, and language) and incorporates a novel process of deflation, so that the networks can be effortlessly applied to visual data in the form of either video or a static image.
ViViT: A Video Vision Transformer
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
This work shows how to effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets, and achieves state-of-the-art results on multiple video classification benchmarks.
Self-Supervised Learning by Cross-Modal Audio-Video Clustering
- Computer Science, NeurIPS
- 2020
Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality as a supervisory signal for the other modality, is proposed; it is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
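As a simplified sketch of the cross-modal clustering idea in XDC, the snippet below clusters features of one modality and uses the cluster ids as classification targets for the other modality's encoder. The function names and the use of scikit-learn KMeans are illustrative assumptions; the released XDC training loop alternates and scales this differently.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def cluster_pseudo_labels(features: torch.Tensor, n_clusters: int = 256) -> torch.Tensor:
    """K-means over per-sample features of one modality (e.g., audio).
    In XDC this is re-run periodically over the full dataset, not per batch."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(features.cpu().numpy())
    return torch.as_tensor(km.labels_, dtype=torch.long)

def cross_modal_loss(video_logits: torch.Tensor, audio_pseudo_labels: torch.Tensor) -> torch.Tensor:
    """Train the video head to predict audio-derived cluster ids; the roles
    are swapped for the audio head in the full method."""
    return F.cross_entropy(video_logits, audio_pseudo_labels.to(video_logits.device))
```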
Memory-augmented Dense Predictive Coding for Video Representation Learning
- Computer Science, ECCV
- 2020
A new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), is proposed for self-supervised learning from video, in particular representations for action recognition, trained with a predictive attention mechanism over a set of compressed memories.
End-to-End Learning of Visual Representations From Uncurated Instructional Videos
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
This work proposes a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos, and outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
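One way to write a MIL-NCE-style objective is sketched below: each clip carries several candidate narrations, and the positive score pools over all of them, which softens the misalignment between speech and what is on screen. This is a hedged approximation (negatives taken from a single direction only, illustrative names), not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def mil_nce_loss(video_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """MIL-NCE-style sketch. video_emb: (B, D); text_emb: (B, K, D), holding
    K candidate narrations per clip (e.g., temporally nearby ASR segments)."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    B, K, D = text_emb.shape
    sim = video_emb @ text_emb.reshape(B * K, D).t() / temperature        # (B, B*K)
    sim = sim.view(B, B, K)                                               # [i, j, k]: clip i vs narration k of clip j
    pos = torch.logsumexp(sim[torch.arange(B), torch.arange(B)], dim=-1)  # pooled positives, (B,)
    all_ = torch.logsumexp(sim.reshape(B, -1), dim=-1)                    # positives + negatives, (B,)
    return (all_ - pos).mean()                                            # -log(sum_pos / sum_all)
```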
Training data-efficient image transformers & distillation through attention
- Computer Science, ICML
- 2021
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.
Is Space-Time Attention All You Need for Video Understanding?
- Computer Science, ICML
- 2021
This paper presents a convolution-free approach to video classification built exclusively on self-attention over space and time, and suggests that “divided attention,” where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered.
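A minimal sketch of one "divided attention" block as described: temporal self-attention across frames at each spatial location, then spatial self-attention within each frame, then an MLP. Class-token handling, dropout, and the extra temporal projection of the reference implementation are omitted; shapes and names are assumptions, not the TimeSformer code.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Sketch of a divided space-time attention block. Input: (batch, frames, patches, dim)."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_m = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, p, d = x.shape
        # Temporal attention: each spatial patch attends over the frame axis.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        h = self.norm_t(xt)
        xt = xt + self.attn_t(h, h, h, need_weights=False)[0]
        x = xt.reshape(b, p, t, d).permute(0, 2, 1, 3)
        # Spatial attention: each frame's patches attend to one another.
        xs = x.reshape(b * t, p, d)
        h = self.norm_s(xs)
        xs = xs + self.attn_s(h, h, h, need_weights=False)[0]
        x = xs.reshape(b, t, p, d)
        # Position-wise feed-forward with residual.
        return x + self.mlp(self.norm_m(x))
```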
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
- Computer Science, NeurIPS
- 2018
It is demonstrated that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs.
Learning Video Representations using Contrastive Bidirectional Transformer
- Computer Science
- 2019
This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and…
Evolving Losses for Unsupervised Video Representation Learning
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
An unsupervised representation evaluation metric based on Zipf's law is proposed, using distribution matching to a large unlabeled dataset as a prior constraint; it produces results similar to weakly-supervised, task-specific ones.