Corpus ID: 233346984

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

@article{Akbari2021VATTTF,
  title={VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text},
  author={Hassan Akbari and Liangzhe Yuan and Rui Qian and Wei-Hong Chuang and Shih-Fu Chang and Yin Cui and Boqing Gong},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.11178}
}
We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event…
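The multimodal contrastive losses mentioned in the abstract are pairwise objectives between modality embeddings projected into common spaces (the paper uses NCE between video and audio, and a MIL-NCE-style loss between video and text). As an illustration only, below is a minimal PyTorch sketch of a symmetric InfoNCE loss between two modality embeddings; the function name, tensor shapes, and temperature are assumptions for the sketch, not taken from the VATT code.

import torch
import torch.nn.functional as F

def infonce_loss(z_a, z_b, temperature=0.07):
    # Symmetric InfoNCE over a batch of paired embeddings of shape (B, D).
    # Matching pairs (same clip, different modality) are positives; all other
    # pairs in the batch serve as negatives.
    z_a = F.normalize(z_a, dim=-1)   # unit-normalize both modalities
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                     # (B, B) scaled cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)   # positives lie on the diagonal
    # Cross-entropy in both directions (a -> b and b -> a), then average.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example usage (hypothetical tensors standing in for projected video/audio embeddings):
video_emb = torch.randn(8, 512)
audio_emb = torch.randn(8, 512)
loss = infonce_loss(video_emb, audio_emb)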

Citations

Everything at Once - Multi-modal Fusion Transformer for Video Retrieval
TLDR
This work presents a multi-modal, modality-agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them into a joint multi-modal representation to obtain an embedding that aggregates multi-modal temporal information.
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
TLDR
A single UniFied transfOrmer (UFO) is proposed that can process either unimodal or multimodal inputs for vision-language (VL) representation learning, achieving new state-of-the-art results on visual question answering, COCO image captioning, and nocaps.
VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling
TLDR
This work presents VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs, and designs a new pretraining task, Masked Visual-token Modeling (MVM), for better video modeling.
All in One: Exploring Unified Video-Language Pre-training
TLDR
This work introduces an end-to-end video-language model, namely all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture and introduces a novel and effective token rolling operation to encode temporal representations from video clips in a non-parametric manner.
Long-Short Temporal Contrastive Learning of Video Transformers
TLDR
It is empirically demonstrated that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par with or better than those obtained with supervised pretraining on large-scale image datasets, even massive ones such as ImageNet-21K.
PolyViT: Co-training Vision Transformers on Images, Videos and Audio
TLDR
Co-training PolyViT on multiple modalities and tasks leads to a model that is even more parameter-efficient and learns representations that generalize across multiple domains, while remaining simple and practical to implement.
Enhancing Contrastive Learning with Temporal Cognizance for Audio-Visual Representation Generation
TLDR
The proposed modeling approach builds upon recent advances in contrastive-loss-based audio-visual representation learning, and the results indicate that adding temporal information significantly improves the performance of the contrastive framework.
Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text
TLDR
The experiments show that the resultant unified foundation transformer works surprisingly well on both the vision-only and text-only tasks, and the proposed knowledge distillation and gradient masking strategy can effectively lift the performance to approach the level of separately-trained models.
Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos
TLDR
A novel learnable irrelevant modality dropout (IMD) is proposed to completely drop out the irrelevant audio modality and fuse only the relevant modalities; results on several vision-specific annotated datasets validate the framework, as it outperforms the most relevant action recognition methods.
...

References

SHOWING 1-10 OF 124 REFERENCES
Self-Supervised MultiModal Versatile Networks
TLDR
This work learns representations using self-supervision by leveraging three modalities naturally present in videos: vision, audio, and language. It incorporates a novel process of deflation, so that the networks can be effortlessly applied to visual data in the form of either video or a static image.
ViViT: A Video Vision Transformer
TLDR
This work shows how to effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets, and achieves state-of-the-art results on multiple video classification benchmarks.
Self-Supervised Learning by Cross-Modal Audio-Video Clustering
TLDR
Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality as a supervisory signal for the other modality, is proposed, which is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
Memory-augmented Dense Predictive Coding for Video Representation Learning
TLDR
A new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), is proposed for self-supervised learning from video, in particular of representations for action recognition, trained with a predictive attention mechanism over a set of compressed memories.
End-to-End Learning of Visual Representations From Uncurated Instructional Videos
TLDR
This work proposes a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos, and outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
Training data-efficient image transformers & distillation through attention
TLDR
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.
Is Space-Time Attention All You Need for Video Understanding?
TLDR
This paper presents a convolution-free approach to video classification built exclusively on self-attention over space and time, and suggests that “divided attention,” where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered (a minimal sketch of this divided-attention pattern appears after this reference list).
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
TLDR
It is demonstrated that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs.
Learning Video Representations using Contrastive Bidirectional Transformer
This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and…
Evolving Losses for Unsupervised Video Representation Learning
TLDR
An unsupervised representation evaluation metric is proposed using distribution matching to a large unlabeled dataset as a prior constraint, based on Zipf's law, which produces similar results to weakly-supervised, task-specific ones.
...
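For the “divided attention” design described in the space-time attention reference above, a minimal sketch under assumed tensor shapes (batch, frames, spatial patches per frame, embedding dimension) is given below; the class and variable names are illustrative and this is not the reference implementation.

import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    # Within one block, self-attention runs first across time (per spatial location),
    # then across space (per frame), with a residual connection around each step.
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, S, D) -- batch, frames, spatial patches per frame, embedding dim
        B, T, S, D = x.shape
        # Temporal attention: each spatial location attends over the T frames.
        xt = x.permute(0, 2, 1, 3).reshape(B * S, T, D)
        xt = xt + self.temporal_attn(xt, xt, xt, need_weights=False)[0]
        x = xt.reshape(B, S, T, D).permute(0, 2, 1, 3)
        # Spatial attention: each frame attends over its S patches.
        xs = x.reshape(B * T, S, D)
        xs = xs + self.spatial_attn(xs, xs, xs, need_weights=False)[0]
        return xs.reshape(B, T, S, D)

# Example usage with hypothetical dimensions: 2 clips, 8 frames, 196 patches, 768-dim tokens.
block = DividedSpaceTimeAttention(dim=768)
out = block(torch.randn(2, 8, 196, 768))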