In Defense of Image Pre-Training for Spatiotemporal Recognition

Xianhang Li, Huiyu Wang, Chen Wei, Jieru Mei, Alan Loddon Yuille, Yuyin Zhou, Cihang Xie
Image pre-training, the current de-facto paradigm for a wide range of visual tasks, is generally less favored in the field of video recognition. By contrast, a common strategy is to train spatiotemporal convolutional neural networks (CNNs) directly from scratch. Nonetheless, a closer look at these from-scratch CNNs reveals certain 3D kernels that exhibit much stronger appearance modeling ability than others, arguably suggesting appearance…

Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification

It is shown that many of the 3D convolutions can be replaced by low-cost 2D convolutions, suggesting that temporal representation learning on high-level “semantic” features is more useful.

Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

This paper devises multiple variants of bottleneck building blocks in a residual learning framework by simulating 3 × 3 × 3 convolutions with 1 × 3 × 3 convolutional filters in the spatial domain (equivalent to a 2D CNN) plus 3 × 1 × 1 convolutions that construct temporal connections between adjacent feature maps in time.
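The savings from this spatial/temporal factorization can be illustrated with a quick parameter count; the channel sizes and helper function below are my own illustration, not from the paper:

```python
# Parameter count of a full 3x3x3 convolution versus the factorized
# 1x3x3 (spatial) + 3x1x1 (temporal) pair used in pseudo-3D blocks.
# Channel counts here are hypothetical, chosen only for illustration.

def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a 3D convolution (bias ignored for simplicity)."""
    return c_in * c_out * kt * kh * kw

c_in = c_out = 64

full = conv3d_params(c_in, c_out, 3, 3, 3)          # full 3x3x3 kernel
factored = (conv3d_params(c_in, c_out, 1, 3, 3)     # spatial 1x3x3
            + conv3d_params(c_out, c_out, 3, 1, 1)) # temporal 3x1x1

print(full, factored)  # → 110592 49152, i.e. ~2.25x fewer parameters
```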

Gate-Shift Networks for Video Action Recognition

An extensive evaluation of the proposed Gate-Shift Module is performed to study its effectiveness in video action recognition, achieving state-of-the-art results on the Something-Something-V1 and Diving48 datasets, and competitive results on EPIC-Kitchens with far less model complexity.

Video Modeling With Correlation Networks

This paper proposes an alternative approach based on a learnable correlation operator that can be used to establish frame-to-frame matches over convolutional feature maps in the different layers of the network.

Disentangled Non-Local Neural Networks

This paper first studies the non-local block in depth, where it is found that its attention computation can be split into two terms, a whitened pairwise term accounting for the relationship between two pixels and a unary term representing the saliency of every pixel.

Spatiotemporal Residual Networks for Video Action Recognition

The novel spatiotemporal ResNet is introduced and evaluated using two widely used action recognition benchmarks where it exceeds the previous state-of-the-art.

A Closer Look at Spatiotemporal Convolutions for Action Recognition

A new spatiotemporal convolutional block "R(2+1)D" is designed which produces CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101, and HMDB51.
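R(2+1)D sizes the intermediate channel count of its factorized pair so that the spatial-then-temporal convolutions roughly match the parameter budget of the full 3D convolution they replace. A minimal sketch of that sizing rule, with channel counts chosen purely for illustration:

```python
# Midplane-channel rule from R(2+1)D: pick M so that a 1xdxd spatial
# conv (c_in -> M) followed by a tx1x1 temporal conv (M -> c_out)
# matches the parameter count of a full txdxd 3D conv (c_in -> c_out).
# The concrete numbers below are a hypothetical example.

def midplanes(c_in, c_out, t, d):
    """Intermediate channels for a (2+1)D factorization of a txdxd conv."""
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)

M = midplanes(64, 64, 3, 3)
spatial = 1 * 3 * 3 * 64 * M      # 1x3x3 conv, 64 -> M channels
temporal = 3 * 1 * 1 * M * 64     # 3x1x1 conv, M -> 64 channels
full3d = 3 * 3 * 3 * 64 * 64      # full 3x3x3 conv, 64 -> 64 channels

print(M, spatial + temporal, full3d)  # → 144 110592 110592
```

With equal parameter budgets, the factorized block doubles the number of nonlinearities between input and output, which the paper argues contributes to its accuracy gains.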

Video Swin Transformer

  • Ze Liu, J. Ning, Han Hu
  • 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
This paper advocates an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization.

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.

ViViT: A Video Vision Transformer

This work shows how to effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets, and achieves state-of-the-art results on multiple video classification benchmarks.