Corpus ID: 246063865

Omnivore: A Single Model for Many Visual Modalities

Rohit Girdhar, Mannat Singh, Nikhil Ravi, Laurens van der Maaten, Armand Joulin, Ishan Misra
Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data. Instead, in this paper, we propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters. Our ‘Omnivore’ model leverages the flexibility of transformer-based architectures and is trained jointly on classification tasks from different modalities. Omnivore is simple to train…
OmniMAE: Single Model Masked Pretraining on Images and Videos
This work shows that masked autoencoding can be used to train a simple Vision Transformer on images and videos, without requiring any labeled data, and learns visual representations that are comparable to or better than single-modality representations on both image and video benchmarks, while using a much simpler architecture.
One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code
It is shown that pretraining improves the performance of SkillNet across modalities, matching or exceeding baselines with modality-specific pretraining, and that the system achieves higher accuracy than existing leading systems, including Wukong ViT-B and Wenlan 2.0, while using fewer activated parameters.
Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens
This work proposes a learning framework StructureViT (SViT for short), which demonstrates how utilizing the structure of a small number of images only available during training can improve a video model.
MultiMAE: Multi-modal Multi-task Masked Autoencoders
MultiMAE, pre-trained with RGB, depth, and semantic segmentation, is a more generalist model that transfers well to a range of downstream tasks and shows an intriguing capability for cross-modal/task predictive coding and transfer.
All in One: Exploring Unified Video-Language Pre-training
This work introduces an end-to-end video-language model, namely all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture and introduces a novel and effective token rolling operation to encode temporal representations from video clips in a non-parametric manner.
CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers
Extensive experiments show that CMX generalizes to diverse multi-modal combinations, achieving state-of-the-art performance on RGB-Depth benchmarks as well as on RGB-Thermal and RGB-Polarization datasets; its generalizability to dense-sparse data fusion is also investigated.
Multimodal Learning with Transformers: A Survey
A comprehensive survey of Transformer techniques oriented at multimodal data and a discussion of open problems and potential research directions for the community are presented.
The Modality Focusing Hypothesis: On the Blink of Multimodal Knowledge Distillation
The modality Venn diagram is presented to understand modality relationships, and the modality focusing hypothesis, which reveals the decisive factor in the efficacy of multimodal KD, is investigated.
One-stage Action Detection Transformer
In this work, we introduce our solution to the EPICKITCHENS-10
M&M Mix: A Multimodal Multiview Transformer Ensemble
This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge. Our approach builds upon our recent work, Multiview Transformer for Video


Perceiver: General Perception with Iterative Attention
This paper introduces the Perceiver – a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets.
Attention Bottlenecks for Multimodal Fusion
This work introduces a novel transformer based architecture that uses ‘fusion bottlenecks’ for modality fusion at multiple layers, and shows that such a strategy improves fusion performance, at the same time reducing computational cost.
ViViT: A Video Vision Transformer
This work shows how to effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets, and achieves state-of-the-art results on multiple video classification benchmarks.
Self-Supervised MultiModal Versatile Networks
This work learns representations using self-supervision by leveraging three modalities naturally present in videos: vision, audio, and language. It incorporates a novel process of deflation so that the networks can be effortlessly applied to visual data in the form of video or a static image.
Two-Stream Convolutional Networks for Action Recognition in Videos
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
Translate-to-Recognize Networks for RGB-D Scene Recognition
A unified framework that integrates the tasks of cross-modal translation and modality-specific recognition, termed Translate-to-Recognize Network (TRecgNet), which achieves superior performance to existing state-of-the-art methods, especially for recognition based solely on a single modality.
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
The convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks and shows the generalizability of the model despite the domain gap between videos and images.
Training data-efficient image transformers & distillation through attention
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token, ensuring that the student learns from the teacher through attention.
Video Swin Transformer
The proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models, and achieves state-of-the-art accuracy on a broad range of video recognition benchmarks.