Multi-Fiber Networks for Video Recognition

  • Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, Jiashi Feng
  • Computer Science
    European Conference on Computer Vision (ECCV)
  • 2018
In this paper, we aim to reduce the computational cost of spatio-temporal deep neural networks, making them run as fast as their 2D counterparts while preserving state-of-the-art accuracy on video recognition benchmarks. The proposed Multi-Fiber architecture slices a complex neural network into an ensemble of lightweight networks, or fibers, that run through the network. To facilitate information flow between fibers, we further incorporate multiplexer modules and end up with an architecture that reduces the computational cost of 3D networks by an order of magnitude while increasing recognition performance at the same time.
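The core saving of the multi-fiber design comes from slicing a layer's channels into parallel groups ("fibers"), which divides the parameter and FLOP count of the unit by the number of fibers. A toy parameter count illustrates the effect (channel sizes here are illustrative, not the paper's):

```python
# Parameter count of a 1x1 (or 1x1x1) convolution split into parallel fibers.
def conv_params(c_in, c_out, groups=1):
    # each of the `groups` fibers maps c_in/groups -> c_out/groups channels
    return groups * (c_in // groups) * (c_out // groups)

full = conv_params(256, 256)               # one monolithic layer: 65536 weights
sliced = conv_params(256, 256, groups=16)  # 16 independent fibers: 4096 weights

assert sliced == full // 16
```

The multiplexer units that restore cross-fiber information flow are themselves cheap 1x1 layers, so the whole unit stays roughly `groups` times cheaper.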

X3D: Expanding Architectures for Efficient Video Recognition

  • Christoph Feichtenhofer
  • Computer Science
    2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2020
This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth, finding that networks with high spatiotemporal resolution can perform well, while being extremely light in terms of network width and parameters.
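Each X3D expansion step multiplies one axis while keeping the others fixed; a rough cost model makes clear why widening is so much more expensive than lengthening the clip (the proxy below is a simplification for illustration, not the paper's exact FLOP formula):

```python
def cost_proxy(frames, side, width, depth):
    # crude 3D-CNN cost: temporal extent x spatial area x channels^2 x layers
    return frames * side * side * width * width * depth

base = cost_proxy(frames=4, side=112, width=24, depth=10)

# doubling the temporal axis doubles cost...
assert cost_proxy(8, 112, 24, 10) == 2 * base
# ...but doubling width quadruples it, which is why X3D keeps networks narrow
assert cost_proxy(4, 112, 48, 10) == 4 * base
```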

Cross-Fiber Spatial-Temporal Co-enhanced Networks for Video Action Recognition

A novel Cross-Fiber Spatial-Temporal Co-enhanced (CFST) architecture is proposed that aims to reduce the number of parameters dramatically while achieving accurate action recognition; it significantly boosts the performance of existing convolutional networks and achieves state-of-the-art accuracy on three challenging benchmarks.

EAC-Net: Efficient and Accurate Convolutional Network for Video Recognition

A new architecture, EAC-Net, is explored that enjoys both high efficiency and high performance, and Motion Guided Temporal Encode blocks are proposed for temporal modeling, which exploit motion information and temporal relations among neighboring frames.

Gate-Shift Networks for Video Action Recognition

An extensive evaluation of the proposed Gate-Shift Module is performed to study its effectiveness in video action recognition, achieving state-of-the-art results on Something Something-V1 and Diving48 datasets, and obtaining competitive results on EPIC-Kitchens with far less model complexity.
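Gate-Shift builds on the idea of shifting feature channels across time at zero FLOP cost. A minimal NumPy sketch of the underlying temporal-shift primitive follows (the learnable spatial gating that the Gate-Shift Module adds on top is omitted, and `fold_div` is an illustrative choice):

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    # x: (T, C) features per frame; shift the first C//fold_div channels
    # one step forward in time and the next C//fold_div one step back,
    # giving each frame zero-FLOP access to its neighbours.
    t, c = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                   # shift forward in time
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]   # shift backward in time
    out[:, 2 * fold:] = x[:, 2 * fold:]              # remaining channels untouched
    return out

x = np.arange(4 * 8, dtype=float).reshape(4, 8)
y = temporal_shift(x)
assert y.shape == x.shape
assert y[1, 0] == x[0, 0]   # frame 1 now sees frame 0's shifted channel
```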

A Coarse-to-Fine Framework for Resource Efficient Video Recognition

LiteEval is a coarse-to-fine framework that dynamically allocates computation on a per-video basis, adaptively determining on the fly when to read in more discriminative yet computationally expensive features; it can be deployed in both online and offline settings.
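The coarse-to-fine idea can be sketched with a simple confidence gate: run the cheap model first and pay for expensive features only when needed. This is a deliberately simplified stand-in (LiteEval itself learns the gating policy with a recurrent network; the threshold and toy models below are assumptions for illustration):

```python
import numpy as np

def coarse_to_fine(clips, coarse, fine, threshold=0.8):
    # per-clip adaptive compute: fall back to the expensive model only
    # when the cheap model's confidence is below the threshold
    preds, fine_calls = [], 0
    for clip in clips:
        p = coarse(clip)
        if p.max() < threshold:
            p = fine(clip)
            fine_calls += 1
        preds.append(int(p.argmax()))
    return preds, fine_calls

clips = [np.array([0.9, 0.1]), np.array([0.55, 0.45])]  # toy coarse softmaxes
coarse = lambda c: c                                    # toy cheap model
fine = lambda c: np.array([0.2, 0.8])                   # toy expensive model
preds, calls = coarse_to_fine(clips, coarse, fine)

assert preds == [0, 1]
assert calls == 1   # only the uncertain clip paid for the fine model
```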

A2-Nets: Double Attention Networks

This work proposes the "double attention block", a novel component that aggregates and propagates informative global features from the entire spatio-temporal space of input images/videos, enabling subsequent convolution layers to access features from the entire space efficiently.
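The gather-and-distribute mechanism can be sketched in NumPy: a first attention step gathers global descriptors from all positions, and a second attention step distributes them back per position. Weight names (`Wa`, `Wb`, `Wv`) and all sizes are illustrative, and projections are plain linear maps rather than the block's convolutions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def double_attention(x, Wa, Wb, Wv):
    # x: (n, c) with n flattened spatio-temporal positions
    feats = x @ Wa                     # (n, c1) feature maps
    gather = softmax(x @ Wb, axis=0)   # (n, m) attention over positions
    G = feats.T @ gather               # (c1, m) global descriptors (1st attention)
    distrib = softmax(x @ Wv, axis=1)  # (n, m) per-position selection (2nd attention)
    return distrib @ G.T               # (n, c1) globals distributed back

n, c, c1, m = 6, 8, 5, 4
x = rng.normal(size=(n, c))
z = double_attention(x, rng.normal(size=(c, c1)),
                     rng.normal(size=(c, m)), rng.normal(size=(c, m)))
assert z.shape == (n, c1)
```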

Human Action Recognition Based on Dual Correlation Network

A Dual Correlation Network is proposed that models the relationship between channels of a 3D CNN along the time series to address the action recognition task; it achieves performance superior to existing state-of-the-art methods on three benchmark datasets.

Lightweight Action Recognition with Sequence-Specific Global Context

This work aims to make 3D CNNs lightweight without reducing recognition accuracy, proposing two lightweight innovations: the Xwise Separable Convolution and the SS block.

Is Space-Time Attention All You Need for Video Understanding?

This paper presents a convolution-free approach to video classification built exclusively on self-attention over space and time, and suggests that “divided attention,” where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered.
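The divided scheme can be sketched in a few lines: temporal attention within each spatial location, then spatial attention within each frame. This is a single-head toy version with no projections, residuals, or LayerNorm (all illustrative simplifications of the actual transformer block):

```python
import numpy as np

def attend(q, k, v):
    # single-head scaled dot-product attention
    scores = q @ k.T / np.sqrt(q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = e / e.sum(axis=-1, keepdims=True)
    return w @ v

def divided_attention(x):
    # x: (T, S, c) patch tokens. Divided attention applies temporal
    # attention per spatial location, then spatial attention per frame,
    # instead of one joint pass over all T*S tokens: cost falls from
    # O((T*S)^2) comparisons to O(T^2 * S + S^2 * T).
    t, s, c = x.shape
    out = np.empty_like(x)
    for j in range(s):                 # temporal attention
        out[:, j] = attend(x[:, j], x[:, j], x[:, j])
    for i in range(t):                 # spatial attention
        out[i] = attend(out[i], out[i], out[i])
    return out

x = np.random.default_rng(1).normal(size=(3, 4, 8))
y = divided_attention(x)
assert y.shape == x.shape
```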

FASTER Recurrent Networks for Efficient Video Classification

A novel framework named FASTER (Feature Aggregation for Spatio-TEmporal Redundancy) is proposed, which aims to leverage the redundancy between neighboring clips and reduce computational cost by learning to aggregate the predictions from models of different complexities.

Beyond short snippets: Deep networks for video classification

This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full-length videos.

Dual Path Networks

This work reveals the equivalence of the state-of-the-art Residual Network (ResNet) and Densely Connected Convolutional Network (DenseNet) within the HORNN framework, and finds that ResNet enables feature re-usage while DenseNet enables new-feature exploration, both of which are important for learning good representations.
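The two paths can be sketched as one micro-block: the transform's output is split into a residual part that is added back (feature re-use) and a dense part that is concatenated (new-feature exploration). The dimensions and the toy linear transform below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def dual_path_step(res_state, dense_state, f):
    # one dual-path micro-block: split f's output into a residual part
    # (added, ResNet-style re-use) and a dense part (concatenated,
    # DenseNet-style new-feature exploration)
    out = f(np.concatenate([res_state, dense_state]))
    res_out, dense_new = out[:res_state.size], out[res_state.size:]
    return res_state + res_out, np.concatenate([dense_state, dense_new])

res, dense = rng.normal(size=4), rng.normal(size=2)
W = rng.normal(size=(6, 6))   # toy transform: 4 residual dims + 2 new dense dims
res2, dense2 = dual_path_step(res, dense, lambda x: W @ x)

assert res2.shape == (4,)     # residual path keeps its width
assert dense2.shape == (4,)   # dense path grows by 2 each step
```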

Large-Scale Video Classification with Convolutional Neural Networks

This work studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
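At prediction time, the two streams are combined by late fusion of their class scores; simple averaging is one of the fusion schemes the paper evaluates. A toy sketch (the scores and weighting below are illustrative):

```python
import numpy as np

def late_fuse(p_rgb, p_flow, w_flow=0.5):
    # class-score fusion of the spatial (RGB) and temporal (flow) streams;
    # equal weighting is a simple baseline
    fused = (1 - w_flow) * p_rgb + w_flow * p_flow
    return fused / fused.sum()

p_rgb = np.array([0.7, 0.2, 0.1])   # spatial-stream softmax (toy values)
p_flow = np.array([0.3, 0.6, 0.1])  # temporal-stream softmax (toy values)
p = late_fuse(p_rgb, p_flow)

assert np.isclose(p.sum(), 1.0)
assert p.argmax() == 0
```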

Convolutional Two-Stream Network Fusion for Video Action Recognition

A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed, and its performance on standard benchmarks where this architecture achieves state-of-the-art results is evaluated.

Appearance-and-Relation Networks for Video Classification

  • Limin Wang, Wei Li, Wen Li, Luc Van Gool
  • Computer Science
    2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
This paper presents a new architecture, termed as Appearance-and-Relation Network (ARTNet), to learn video representation in an end-to-end manner, constructed by stacking multiple generic building blocks, called SMART, whose goal is to simultaneously model appearance and relation from RGB input in a separate and explicit manner.

Very Deep Convolutional Networks for Large-Scale Image Recognition

This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

Rethinking Spatiotemporal Feature Learning For Video Understanding

Interestingly, it was found that 3D convolutions at the top layers of the network contribute more than 3D convolutions at the bottom layers, while also being computationally more efficient, indicating that I3D is better at capturing high-level temporal patterns than low-level motion signals.

Compressed Video Action Recognition

This work proposes to train a deep network directly on compressed video (e.g., H.264, HEVC), which has a higher information density, and finds the training to be easier.

Aggregated Residual Transformations for Deep Neural Networks

On the ImageNet-1K dataset, it is empirically shown that, even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy, and is more effective than going deeper or wider when capacity is increased.
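"Maintaining complexity" means cardinality C and bottleneck width d are traded off so the block's parameter count stays roughly fixed; the bottleneck-block parameter formula (256-d blocks, bias-free, following the paper's complexity accounting) makes this concrete:

```python
def block_params(cardinality, d, c_in=256):
    # ResNeXt bottleneck: 1x1 reduce, 3x3 transform, 1x1 expand,
    # aggregated over `cardinality` paths of width d
    return cardinality * (c_in * d + 3 * 3 * d * d + d * c_in)

plain = block_params(1, 64)    # ResNet-style: C=1, d=64
resnext = block_params(32, 4)  # ResNeXt: C=32, d=4

# both configurations land near ~70k parameters
assert abs(plain - resnext) / plain < 0.01
```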