MultiMAE: Multi-modal Multi-task Masked Autoencoders

Roman Bachmann, David Mizrahi, Andrei Atanov, Amir Roshan Zamir
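We propose a pre-training strategy called Multi-modal Multi-task Masked Autoencoders (MultiMAE). It differs from standard Masked Autoencoding in two key aspects: I) it can optionally accept additional modalities of information in the input besides the RGB image (hence "multi-modal"), and II) its training objective accordingly includes predicting multiple outputs besides the RGB image (hence "multi-task"). We make use of masking (across image patches and input modalities) to make training MultiMAE tractable as well as to ensure cross-modality predictive coding is indeed learned by the network. We show this pre-training strategy leads to a flexible, simple, and efficient framework with improved transfer results to downstream tasks. In particu… 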
Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection
The proposed detector, named MIMDet, enables a MIM pre-trained vanilla ViT to outperform hierarchical Swin Transformer by 2.5 box AP and 2.6 mask AP on COCO, and achieves better results than the previous best adapted vanilla ViT detector while using a more modest fine-tuning recipe.
CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers
Extensive experiments show that CMX generalizes to diverse multi-modal combinations, achieving state-of-the-art performance on RGB-Depth benchmarks as well as on RGB-Thermal and RGB-Polarization datasets; the work also investigates generalizability to dense-sparse data fusion.
Spatial Entropy Regularization for Vision Transformers
This paper explicitly encourages the emergence of this spatial clustering as a form of training regularization, thereby incorporating a self-supervised pretext task into standard supervised learning.
GMML is All you Need
This work proposes group masked model learning (GMML), a self-supervised learning (SSL) mechanism for pretraining vision transformers that can extract the contextual information present in all the concepts in an image.
Masked World Models for Visual Control
This work introduces a visual model-based RL framework that decouples visual representation learning from dynamics learning and achieves state-of-the-art performance on a variety of visual robotic tasks from Meta-world and RLBench.
Saccade Mechanisms for Image Classification, Object Detection and Tracking
The authors' experiments show that learning to mimic human saccades enables intelligent data reduction when used in conjunction with state-of-the-art DNNs for classification, detection, and tracking tasks.
Modality-invariant Visual Odometry for Indoor Navigation
A novel approach to multi-modal Visual Odometry based on Vision Transformers that successfully replaces GPS+compass and can deal with limited availability of modalities during test time by implicitly learning a representation invariant to the availability of input modalities.


Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models
This work proposes a mixture-of-experts multi-modal variational autoencoder (MMVAE) for learning of generative models on different sets of modalities, including a challenging image language dataset, and demonstrates its ability to satisfy all four criteria, both qualitatively and quantitatively.
Masked Autoencoders Are Scalable Vision Learners
It is shown that masked autoencoders (MAE) are scalable self-supervised learners for computer vision; their transfer performance in downstream tasks outperforms supervised pretraining and shows promising scaling behavior.
Self-Supervised MultiModal Versatile Networks
This work learns representations using self-supervision by leveraging three modalities naturally present in videos: vision, audio and language by incorporating a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image.
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
The convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks and shows the generalizability of the model despite the domain gap between videos and images.
Multimodal Generative Models for Scalable Weakly-Supervised Learning
A multimodal variational autoencoder that uses a product-of-experts inference network and a sub-sampled training paradigm to solve the multi-modal inference problem and shares parameters to efficiently learn under any combination of missing modalities, thereby enabling weakly-supervised learning.
UniT: Multimodal Multitask Learning with a Unified Transformer
UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning, achieves strong performance on each task with significantly fewer parameters.
UberNet: Training a Universal Convolutional Neural Network for Low-, Mid-, and High-Level Vision Using Diverse Datasets and Limited Memory
Iasonas Kokkinos. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
In this work we train in an end-to-end manner a convolutional neural network (CNN) that jointly handles low-, mid-, and high-level vision tasks in a unified architecture. Such a network can act like
E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
This paper proposes the first end-to-end vision-language pre-trained model for both V+L understanding and generation, namely E2E-VLP, where a unified Transformer framework is built to jointly learn visual representation, and semantic alignments between image and text.
Are Large-scale Datasets Necessary for Self-Supervised Pre-training?
This study shows that denoising autoencoders, such as BEiT or a variant that is introduced in this paper, are more robust to the type and size of the pre-training data than popular self-supervised methods trained by comparing image embeddings.
Masked Feature Prediction for Self-Supervised Visual Pre-Training
This work presents Masked Feature Prediction (MaskFeat), which first randomly masks out a portion of the input sequence and then predicts the features of the masked regions, and finds that Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, work particularly well in terms of both performance and efficiency.