Corpus ID: 235489838

VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

@article{Tan2021VIMPACVP,
  title={VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning},
  author={Hao Tan and Jie Lei and Thomas Wolf and Mohit Bansal},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.11250}
}
Video understanding relies on perceiving the global content and modeling its internal connections (e.g., causality, movement, and spatio-temporal correspondence). To learn these interactions, we apply a mask-then-predict pre-training task on discretized video tokens generated via VQ-VAE. Unlike language, where the text tokens are more independent, neighboring video tokens typically have strong correlations (e.g., consecutive video frames usually look very similar), and hence uniformly masking… 
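The masking strategy the abstract alludes to (masking contiguous blocks of tokens rather than individual ones, since neighboring video tokens are highly correlated) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the token-cube shape, block size, and mask ratio are hypothetical.

```python
import numpy as np

def block_mask(shape=(8, 16, 16), block=(2, 4, 4), ratio=0.5, rng=None):
    """Sample a boolean mask over a (T, H, W) cube of discrete video tokens.

    Instead of masking tokens independently (too easy to solve by copying
    neighbors), whole spatiotemporal blocks are masked until roughly
    `ratio` of the tokens are covered.
    """
    rng = rng or np.random.default_rng()
    T, H, W = shape
    bt, bh, bw = block
    mask = np.zeros(shape, dtype=bool)
    while mask.mean() < ratio:
        # Pick the front-top-left corner of a block at random.
        t = rng.integers(0, T - bt + 1)
        h = rng.integers(0, H - bh + 1)
        w = rng.integers(0, W - bw + 1)
        mask[t:t + bt, h:h + bh, w:w + bw] = True
    return mask

# Example: mask ~50% of an 8x16x16 cube of VQ-VAE token indices.
tokens = np.random.randint(0, 1024, size=(8, 16, 16))  # hypothetical 1024-entry codebook
mask = block_mask(tokens.shape)
masked_tokens = np.where(mask, -1, tokens)             # -1 stands in for a [MASK] token id
```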
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling
TLDR
This work presents VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs, and designs a new pretraining task, Masked Visual-token Modeling (MVM), for better video modeling.
MAR: Masked Autoencoders for Efficient Action Recognition
TLDR
The proposed Masked Action Recognition (MAR) reduces the computational cost of ViT by 53%, and extensive experiments show that MAR consistently outperforms existing ViT models by a notable margin.
iBOT: Image BERT Pre-Training with Online Tokenizer
TLDR
A self-supervised framework, iBOT, is presented that performs masked prediction with an online tokenizer and underlines emerging local semantic patterns, which helps the models obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation.
Masked Autoencoders As Spatiotemporal Learners
TLDR
It is shown that the MAE method can learn strong representations with almost no inductive bias on spacetime, that spacetime-agnostic random masking performs best, and that the general framework of masked autoencoding can be a unified methodology for representation learning with minimal domain knowledge.
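The "spacetime-agnostic random masking" described here treats every spacetime patch identically and keeps only a small random subset for the encoder. A minimal PyTorch sketch, with the patch count and mask ratio chosen purely for illustration:

```python
import torch

def random_spacetime_mask(num_patches, mask_ratio=0.9, device="cpu"):
    """Spacetime-agnostic random masking: every patch index is treated
    identically regardless of its (t, h, w) position, and a random subset is kept."""
    num_keep = int(num_patches * (1 - mask_ratio))
    noise = torch.rand(num_patches, device=device)   # one random score per patch
    ids_shuffle = torch.argsort(noise)               # random permutation of patch indices
    ids_keep = ids_shuffle[:num_keep]                # visible patches fed to the encoder
    mask = torch.ones(num_patches, dtype=torch.bool, device=device)
    mask[ids_keep] = False                           # True = masked (to be reconstructed)
    return ids_keep, mask

# e.g. 8 frames x 14 x 14 patches = 1568 tokens; 90% masked leaves 156 visible tokens.
ids_keep, mask = random_spacetime_mask(8 * 14 * 14)
```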
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
TLDR
This paper shows that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP), and that data quality is more important than data quantity for SSVP.
Hierarchical Self-supervised Representation Learning for Movie Understanding
TLDR
This paper proposes a novel hierarchical self-supervised pretraining strategy that separately pretrains each level of the authors' hierarchical movie understanding model (based on [37]), and demonstrates the effectiveness of contextualized event features on LVU tasks.
OmniMAE: Single Model Masked Pretraining on Images and Videos
TLDR
This work shows that masked autoencoding can be used to train a simple Vision Transformer on images and videos without requiring any labeled data, and that the resulting model learns visual representations comparable to or better than single-modality representations on both image and video benchmarks, while using a much simpler architecture.
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
TLDR
Surprisingly, experimental results show that this unified VidL framework, LAVENDER, achieves competitive performance on 14 VidL benchmarks, covering video question answering, text-to-video retrieval, and video captioning.
Group Contextualization for Video Recognition
TLDR
An efficient feature refinement method, referred to as group contextualization (GC), is proposed; it decomposes the feature channels into several groups and separately refines them with different axial contexts in parallel, and is expected to be more resilient to diverse types of activities.
BEVT: BERT Pretraining of Video Transformers
TLDR
BEVT is introduced, which decouples video representation learning into spatial representation learning and temporal dynamics learning, and achieves very promising, state-of-the-art results on three challenging video benchmarks.

References

SHOWING 1-10 OF 75 REFERENCES
Spatiotemporal Contrastive Video Representation Learning
TLDR
This work proposes a temporally consistent spatial augmentation method to impose strong spatial augmentations on each frame of the video while maintaining the temporal consistency across frames, and a sampling-based temporal augmentation method to avoid overly enforcing invariance on clips that are distant in time.
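Temporally consistent spatial augmentation amounts to sampling augmentation parameters once per clip and reusing them for every frame. A minimal sketch (the crop size, clip shape, and restriction to crop-plus-flip are illustrative assumptions):

```python
import torch

def consistent_spatial_augment(clip, crop_size=112):
    """Apply the *same* random crop and flip to every frame of a clip.

    clip: (T, C, H, W) tensor. Sampling the augmentation parameters once per
    clip (rather than per frame) keeps the spatial augmentation strong while
    preserving temporal consistency across frames.
    """
    T, C, H, W = clip.shape
    top = torch.randint(0, H - crop_size + 1, (1,)).item()
    left = torch.randint(0, W - crop_size + 1, (1,)).item()
    flip = torch.rand(1).item() < 0.5
    out = clip[:, :, top:top + crop_size, left:left + crop_size]
    if flip:
        out = torch.flip(out, dims=[-1])  # horizontal flip, identical for all frames
    return out

clip = torch.rand(16, 3, 128, 171)       # hypothetical 16-frame clip
aug = consistent_spatial_augment(clip)   # (16, 3, 112, 112)
```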
End-to-End Learning of Visual Representations From Uncurated Instructional Videos
TLDR
This work proposes a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos, which outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
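MIL-NCE handles misalignment by treating several temporally close narrations as a bag of candidate positives for each clip and summing their scores inside an NCE-style contrastive objective. A minimal sketch, with the shapes and the omission of a learned temperature as simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def mil_nce_loss(video_emb, text_emb, pos_mask):
    """MIL-NCE sketch: each video clip has a *bag* of candidate narrations
    (several temporally close captions), since narration and video are often misaligned.

    video_emb: (B, D) clip embeddings; text_emb: (N, D) caption embeddings;
    pos_mask: (B, N) boolean, True where caption j is a candidate positive for clip i.
    """
    scores = video_emb @ text_emb.t()              # (B, N) similarity scores
    exp_scores = torch.exp(scores)
    pos = (exp_scores * pos_mask.float()).sum(dim=1)   # sum over the positive bag
    denom = exp_scores.sum(dim=1)                      # positives + negatives
    return -torch.log(pos / denom).mean()

# Toy usage with random embeddings: 4 clips, 12 captions, 3 candidate positives per clip.
v = F.normalize(torch.randn(4, 256), dim=-1)
t = F.normalize(torch.randn(12, 256), dim=-1)
mask = torch.zeros(4, 12, dtype=torch.bool)
for i in range(4):
    mask[i, 3 * i:3 * i + 3] = True
print(mil_nce_loss(v, t, mask))
```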
ViViT: A Video Vision Transformer
TLDR
This work shows how to effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets, and achieves state-of-the-art results on multiple video classification benchmarks.
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
TLDR
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
VideoGPT: Video Generation using VQ-VAE and Transformers
TLDR
Despite the simplicity in formulation and ease of training, the proposed architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and to generate high-fidelity natural images from UCF-101 and the Tumblr GIF Dataset (TGIF).
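At the core of such VQ-VAE-based pipelines is the quantization step that maps continuous encoder features to discrete codebook indices, which a transformer can then model autoregressively. A minimal sketch of nearest-neighbour quantization (the shapes and codebook size are hypothetical):

```python
import torch

def vector_quantize(z, codebook):
    """Nearest-neighbour vector quantization: the VQ-VAE step that turns
    continuous encoder outputs into discrete token indices.

    z: (..., D) encoder features; codebook: (K, D) learned code vectors.
    Returns the index of the closest code for each feature vector.
    """
    flat = z.reshape(-1, z.shape[-1])                 # (M, D)
    # Squared Euclidean distance from every feature to every codebook entry: (M, K).
    d = (flat.pow(2).sum(1, keepdim=True)
         - 2 * flat @ codebook.t()
         + codebook.pow(2).sum(1))
    idx = d.argmin(dim=1)
    return idx.reshape(z.shape[:-1])

z = torch.randn(4, 8, 16, 16, 64)       # (B, T, H, W, D) hypothetical latent grid
codebook = torch.randn(1024, 64)
tokens = vector_quantize(z, codebook)   # (4, 8, 16, 16) discrete token ids
```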
Is Space-Time Attention All You Need for Video Understanding?
TLDR
This paper presents a convolution-free approach to video classification built exclusively on self-attention over space and time, and suggests that “divided attention,” where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered.
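The "divided attention" design applies temporal self-attention (across frames, per spatial location) and spatial self-attention (within each frame) as two separate steps inside a block. A simplified sketch that omits the MLP sub-layer, class token, and other details of the actual architecture:

```python
import torch
import torch.nn as nn

class DividedAttentionBlock(nn.Module):
    """Sketch of divided space-time attention: temporal self-attention followed
    by spatial self-attention, each with its own normalization and residual."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, N, D) patch tokens, T frames, N patches per frame.
        B, T, N, D = x.shape

        # Temporal attention: each spatial location attends over the T frames.
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        h = self.norm_t(xt)
        xt = xt + self.attn_t(h, h, h, need_weights=False)[0]
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)

        # Spatial attention: each frame attends over its N patches.
        xs = x.reshape(B * T, N, D)
        h = self.norm_s(xs)
        xs = xs + self.attn_s(h, h, h, need_weights=False)[0]
        return xs.reshape(B, T, N, D)

block = DividedAttentionBlock()
out = block(torch.randn(2, 8, 196, 768))   # 2 clips, 8 frames, 14x14 patches each
```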
Predicting Video with VQVAE
TLDR
This paper proposes a novel approach to video prediction with Vector Quantized Variational AutoEncoders (VQ-VAE), which compress high-resolution videos into a hierarchical set of multi-scale discrete latent variables, allowing it to apply scalable autoregressive generative models to predict video.
Deep Video Inpainting
TLDR
This work proposes a novel deep network architecture for fast video inpainting built upon an image-based encoder-decoder model that is designed to collect and refine information from neighbor frames and synthesize still-unknown regions.
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
TLDR
This paper proposes an online algorithm, SwAV, that takes advantage of contrastive methods without requiring pairwise comparisons to be computed, using a swapped prediction mechanism in which the cluster assignment of one view is predicted from the representation of another view.
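The swapped prediction mechanism can be summarized as: compute a cluster assignment ("code") for each view against a set of prototypes, then make each view's prototype scores predict the other view's code. The sketch below replaces the Sinkhorn-Knopp equal-partition step with a plain softmax, so it is only an approximation of the actual method:

```python
import torch
import torch.nn.functional as F

def swapped_prediction_loss(z1, z2, prototypes, temp=0.1):
    """SwAV-style swapped prediction (sketch).

    z1, z2: (B, D) L2-normalized embeddings of two augmented views.
    prototypes: (K, D) learnable cluster centers.
    In the real method the codes q1, q2 come from a Sinkhorn-Knopp
    equal-partition step; a plain softmax stands in for it here.
    """
    p = F.normalize(prototypes, dim=-1)
    s1, s2 = z1 @ p.t(), z2 @ p.t()                  # (B, K) similarity to prototypes
    with torch.no_grad():
        q1 = F.softmax(s1 / temp, dim=-1)            # stand-in for Sinkhorn codes
        q2 = F.softmax(s2 / temp, dim=-1)
    # Swap: view 1's scores must predict view 2's code, and vice versa.
    loss1 = -(q2 * F.log_softmax(s1 / temp, dim=-1)).sum(dim=-1).mean()
    loss2 = -(q1 * F.log_softmax(s2 / temp, dim=-1)).sum(dim=-1).mean()
    return 0.5 * (loss1 + loss2)

z1 = F.normalize(torch.randn(32, 128), dim=-1)
z2 = F.normalize(torch.randn(32, 128), dim=-1)
protos = torch.randn(300, 128)
print(swapped_prediction_loss(z1, z2, protos))
```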
Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition
TLDR
The primary empirical finding is that pre-training at a very large scale (over 65 million videos), despite using noisy social-media videos and hashtags, substantially improves the state-of-the-art on three challenging public action recognition datasets.