MoViNets: Mobile Video Networks for Efficient Video Recognition

D. I. Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew A. Brown, Boqing Gong. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
We present Mobile Video Networks (MoViNets), a family of computation- and memory-efficient video networks that can operate on streaming video for online inference. 3D convolutional neural networks (CNNs) are accurate at video recognition but require large computation and memory budgets and do not support online inference, making them difficult to deploy on mobile devices. We propose a three-step approach to improve computational efficiency while substantially reducing the peak memory usage of 3D…
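The streaming-inference idea can be illustrated with a minimal sketch (the function name and shapes are ours, not the authors' code): a causal temporal convolution carries a small cache of past features across clips, so processing a video clip-by-clip produces the same output as processing it all at once.

```python
import numpy as np

def causal_temporal_conv(frames, kernel, buffer):
    """One streaming step of a causal temporal convolution.

    frames: (T, C) features for the current clip.
    kernel: (k,) temporal filter (shared across channels for brevity).
    buffer: (k-1, C) features cached from the previous clip; carrying
    this state across clips is what makes online inference match
    running the convolution over the whole video at once.
    Returns (outputs, new_buffer).
    """
    k = kernel.shape[0]
    padded = np.concatenate([buffer, frames], axis=0)
    out = np.stack([(padded[t:t + k] * kernel[:, None]).sum(axis=0)
                    for t in range(frames.shape[0])])
    new_buffer = padded[-(k - 1):]  # last k-1 frames seed the next clip
    return out, new_buffer
```

Processing a 6-frame sequence as two 3-frame clips (threading the buffer through) yields exactly the same outputs as one 6-frame pass, which is the equivalence streaming architectures of this kind rely on.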

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

The convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in downstream tasks and shows the generalizability of the model despite the domain gap between videos and images.

Temporal Progressive Attention for Early Action Prediction

A bottleneck-based attention model that captures the evolution of the action, through progressive sampling over progressive scales, is proposed; it is composed of multiple attention towers, one for each scale.

A Survey on Video Action Recognition in Sports: Datasets, Methods and Applications

To understand human behaviors, action recognition based on videos is a common approach. Compared with image-based action recognition, videos provide much more information. Reducing the ambiguity of…

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

MeMViT, a Memory-augmented Multiscale Vision Transformer, is built that has a temporal support 30× longer than existing models with only 4.5% more compute; traditional methods need >3,000% more compute to do the same.

Evaluating Transformers for Lightweight Action Recognition

This study is the first to evaluate the efficiency of action recognition models in depth across multiple devices, training a wide range of video transformers under the same conditions; it shows that composite transformers that augment convolutional backbones are best at lightweight action recognition, despite lagging in accuracy.

Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling

Video Mobile-Former improves the video recognition performance of alternative lightweight baselines, and outperforms other efficient CNN-based models at the low FLOP regime from 500M to 6G total FLOPs on various video recognition tasks.

Real-time Streaming Video Denoising with Bidirectional Buffers

A novel Bidirectional Buffer Block is introduced as the core module of the BSVD, which makes it possible to achieve high-fidelity real-time denoising for streaming videos with both past and future temporal receptive fields and outperforms previous methods in terms of restoration fidelity and runtime.

UniFormer: Unifying Convolution and Self-attention for Visual Recognition

This work proposes a novel Unified transFormer (UniFormer), which can seamlessly integrate the merits of convolution and self-attention in a concise transformer format, and adopts it for various vision tasks from image to video domain, from classification to dense prediction.

Multiview Transformers for Video Recognition

This work presents Multiview Transformers for Video Recognition (MTV), a model that consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views and achieves state-of-the-art results on six standard datasets.

A Novel Self-Knowledge Distillation Approach with Siamese Representation Learning for Action Recognition

A novel self-knowledge distillation approach via Siamese representation learning, which minimizes the difference between two representation vectors of two different views of a given sample, is introduced.

TSM: Temporal Shift Module for Efficient Video Understanding

A generic and effective Temporal Shift Module (TSM) that can achieve the performance of 3D CNNs while maintaining 2D CNN complexity, and is extended to the online setting, which enables real-time low-latency online video recognition and video object detection.
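The shift primitive is simple enough to sketch in a few lines of NumPy (a minimal offline version: the channel fraction and zero-filling follow the paper's description, but the helper name and signature are ours):

```python
import numpy as np

def temporal_shift(x, shift_div=8):
    """Shift a fraction of channels along the temporal axis.

    x: array of shape (T, C, H, W) -- one video clip.
    The first C // shift_div channels are shifted forward in time,
    the next C // shift_div backward; the rest stay in place.
    Vacated positions are zero-filled, as in the offline TSM.
    """
    T, C, H, W = x.shape
    fold = C // shift_div
    out = np.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                # past -> present
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]  # future -> present
    out[:, 2 * fold:] = x[:, 2 * fold:]           # untouched channels
    return out
```

Because the shift itself is free of multiply-adds, temporal mixing comes at essentially zero extra FLOPs on top of the 2D backbone, which is the module's central trade-off.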

FASTER Recurrent Networks for Efficient Video Classification

A novel framework named FASTER (Feature Aggregation for Spatio-TEmporal Redundancy) is proposed, which aims to leverage the redundancy between neighboring clips and reduce the computational cost by learning to aggregate the predictions from models of different complexities.

MotionSqueeze: Neural Motion Feature Learning for Video Understanding

This work proposes a trainable neural module, dubbed MotionSqueeze, for effective motion feature extraction, and demonstrates that the proposed method provides a significant gain on four standard benchmarks for action recognition with only a small amount of additional cost, outperforming the state of the art on Something-Something-V1&V2 datasets.

Beyond short snippets: Deep networks for video classification

This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full length videos.

SlowFast Networks for Video Recognition

This work presents SlowFast networks for video recognition, which achieve strong performance for both action classification and detection in video; large improvements are pinpointed as contributions of the SlowFast concept.
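The two-rate sampling at the heart of the design can be sketched as follows (the default `tau` and `alpha` values match those reported in the paper; the helper itself is an illustrative sketch, not the authors' code):

```python
import numpy as np

def slowfast_sample(num_frames, tau=16, alpha=8):
    """Frame indices for the two SlowFast pathways.

    Slow pathway: one frame every `tau` frames (low temporal rate,
    high channel capacity -- semantics).
    Fast pathway: `alpha` times more frames (stride tau // alpha,
    few channels -- motion).
    """
    slow = np.arange(0, num_frames, tau)
    fast = np.arange(0, num_frames, tau // alpha)
    return slow, fast
```

For a 64-frame clip this gives 4 slow frames and 32 fast frames; lateral connections then fuse the fast pathway's motion features into the slow pathway.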

Tiny Video Networks: Architecture Search for Efficient Video Models

This work uses architecture search to build highly efficient models for video, Tiny Video Networks, which run at unprecedented speeds while remaining effective at video recognition tasks.

X3D: Expanding Architectures for Efficient Video Recognition

Christoph Feichtenhofer. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth, finding that networks with high spatiotemporal resolution can perform well, while being extremely light in terms of network width and parameters.

RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition

This work introduces RubiksNet, a new efficient architecture for video action recognition which is based on a proposed learnable 3D spatiotemporal shift operation instead of a channel-wise shift-based primitive, and analyzes the suitability of the new primitive and explores several novel variations of the approach to enable stronger representational flexibility while maintaining an efficient design.

More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation

A lightweight and memory-friendly architecture for action recognition that performs on par with or better than current architectures while using only a fraction of the resources; a temporal aggregation module is proposed to model temporal dependencies in a video at very small additional computational cost.

MnasNet: Platform-Aware Neural Architecture Search for Mobile

An automated mobile neural architecture search (MNAS) approach is proposed, which explicitly incorporates model latency into the main objective so that the search can identify a model that achieves a good trade-off between accuracy and latency.
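The multi-objective reward the search optimizes is compact enough to write out. The form ACC × (LAT/TAR)^w and the default w ≈ −0.07 are from the MnasNet paper; this standalone function is a sketch, not the search implementation:

```python
def mnas_reward(accuracy, latency_ms, target_ms, w=-0.07):
    """MnasNet's latency-aware objective: ACC(m) * (LAT(m) / TAR) ** w.

    With w < 0, models slower than the target latency are penalized
    and faster ones receive a mild bonus, giving a soft Pareto
    trade-off rather than a hard latency constraint.
    """
    return accuracy * (latency_ms / target_ms) ** w
```

A model exactly at the target keeps its raw accuracy as its reward; one 20% slower scores below it, steering the search toward the accuracy-latency frontier.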