Corpus ID: 231879989

AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition

@article{Meng2021AdaFuseAT,
  title={AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition},
  author={Yue Meng and Rameswar Panda and Chung-Ching Lin and Prasanna Sattigeri and Leonid Karlinsky and Kate Saenko and Aude Oliva and Rog{\'e}rio Schmidt Feris},
  journal={ArXiv},
  year={2021},
  volume={abs/2102.05775}
}
Temporal modelling is the key to efficient video action recognition. While understanding temporal information can improve recognition accuracy for dynamic actions, removing temporal redundancy and reusing past features can significantly save computation, leading to efficient action recognition. In this paper, we introduce an adaptive temporal fusion network, called AdaFuse, that dynamically fuses channels from current and past feature maps for strong temporal modelling. Specifically, the…
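To make the fusion idea concrete, here is a minimal PyTorch-style sketch of channel-wise adaptive temporal fusion as described in the abstract. It is an illustrative toy, not the authors' released implementation: the class name, the three-way keep/reuse/skip decision per channel, and the Gumbel-softmax relaxation are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveTemporalFusion(nn.Module):
    """Toy sketch of channel-wise adaptive temporal fusion.

    For every channel, a small policy head picks one of three actions:
    0 = keep the channel computed from the current frame,
    1 = reuse the corresponding channel cached from the previous frame,
    2 = skip the channel (output zeros, saving downstream computation).
    """

    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        self.channels = channels
        # The policy sees globally pooled statistics of both feature maps.
        self.policy = nn.Sequential(
            nn.Linear(2 * channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 3 * channels),
        )

    def forward(self, feat_curr: torch.Tensor, feat_prev: torch.Tensor,
                tau: float = 1.0) -> torch.Tensor:
        # feat_curr, feat_prev: [N, C, H, W] features of consecutive frames.
        n, c, _, _ = feat_curr.shape
        pooled = torch.cat([feat_curr.mean(dim=(2, 3)),
                            feat_prev.mean(dim=(2, 3))], dim=1)       # [N, 2C]
        logits = self.policy(pooled).view(n, c, 3)                    # [N, C, 3]
        # Differentiable discrete decision via Gumbel-softmax (an assumption).
        decision = F.gumbel_softmax(logits, tau=tau, hard=True)       # [N, C, 3]
        keep, reuse, _skip = decision.unbind(dim=-1)                  # each [N, C]
        keep = keep.view(n, c, 1, 1)
        reuse = reuse.view(n, c, 1, 1)
        # The "skip" action contributes zeros, so it needs no explicit term.
        return keep * feat_curr + reuse * feat_prev


# Usage: fuse features of frame t with cached features of frame t-1.
if __name__ == "__main__":
    fuse = AdaptiveTemporalFusion(channels=256)
    f_t, f_prev = torch.randn(2, 256, 14, 14), torch.randn(2, 256, 14, 14)
    print(fuse(f_t, f_prev).shape)  # torch.Size([2, 256, 14, 14])
```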

Citations

Adaptive Focus for Efficient Video Recognition
TLDR
This paper models the patch localization problem as a sequential decision task and proposes a reinforcement learning based approach for efficient spatially adaptive video recognition (AdaFocus), in which global features from a lightweight ConvNet are used by a recurrent policy network to localize the most task-relevant regions.
Adaptive Recursive Circle Framework for Fine-grained Action Recognition
TLDR
An Adaptive Recursive Circle (ARC) framework is proposed: a fine-grained decorator for pure feedforward layers that facilitates fine-grained action recognition by introducing deeply refined features and multi-scale receptive fields at a low cost.
AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition
TLDR
This work reformulates the training of AdaFocus as a simple one-stage algorithm by introducing a differentiable interpolation-based patch selection operation, enabling efficient end-to-end optimization, and presents an improved training scheme to address the issues introduced by the one-stage formulation.
Higher Order Recurrent Space-Time Transformer for Video Action Prediction
TLDR
This paper proposes HORST, a novel higher order recurrent layer design whose core element is a spatial-temporal decomposition of self-attention for video, achieving competitive, state-of-the-art performance on Something-Something early action recognition and EPIC-Kitchens action anticipation.
FrameExit: Conditional Early Exiting for Efficient Video Recognition
TLDR
This paper proposes a conditional early exiting framework for efficient video recognition that learns to process fewer frames for simpler videos and more frames for complex ones, employing a cascade of gating modules to automatically determine the earliest point in processing where an inference is sufficiently reliable (a minimal sketch of such an early-exit loop follows this citation list).
AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition
TLDR
An adaptive multimodal learning framework, called AdaMML, is proposed that selects on-the-fly the optimal modalities for each segment conditioned on the input for efficient video recognition.
VA-RED2: Video Adaptive Redundancy Reduction
TLDR
This work presents a redundancy reduction framework, termed VA-RED2, which uses an input-dependent policy to decide how many features need to be computed along the temporal and channel dimensions, and learns the adaptive policy jointly with the network weights in a differentiable way via a shared-weight mechanism, making it highly efficient.
Dynamic Network Quantization for Efficient Video Inference
TLDR
A dynamic network quantization framework is proposed that selects the optimal precision for each frame conditioned on the input for efficient video recognition, providing significant savings in computation and memory usage while outperforming existing state-of-the-art methods.
IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers
TLDR
It is demonstrated that the interpretability that naturally emerges in the IA-RED2 framework can outperform the raw attention learned by the original vision transformer, as well as interpretations generated by off-the-shelf methods, with both qualitative and quantitative results.
…
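The FrameExit entry above mentions a cascade of gating modules that stops processing as soon as a prediction is reliable. The following is a minimal sketch of such an early-exit loop under assumed names and a simple confidence-threshold gate (the paper's gates are learned modules, so this is an approximation, not its actual method):

```python
import torch
import torch.nn as nn

class EarlyExitVideoClassifier(nn.Module):
    """Toy sketch of conditional early exiting over the frames of one video.

    Frames are processed one at a time, features are aggregated by a running
    average, and a per-step gate decides whether the current prediction is
    already reliable enough to stop. The confidence-threshold gate and all
    names here are illustrative assumptions, not the paper's exact design.
    """

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int,
                 max_steps: int, threshold: float = 0.9):
        super().__init__()
        self.backbone = backbone  # per-frame feature extractor -> [1, feat_dim]
        self.classifiers = nn.ModuleList(
            nn.Linear(feat_dim, num_classes) for _ in range(max_steps))
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, frames: torch.Tensor):
        # frames: [T, 3, H, W] for a single video (batch dimension omitted).
        agg = None
        for t, frame in enumerate(frames):
            feat = self.backbone(frame.unsqueeze(0))                 # [1, feat_dim]
            agg = feat if agg is None else (agg * t + feat) / (t + 1)
            head = self.classifiers[min(t, len(self.classifiers) - 1)]
            probs = head(agg).softmax(dim=-1)                        # [1, num_classes]
            # Gate: exit as soon as the prediction is confident enough,
            # otherwise fall through to the last frame.
            if probs.max().item() >= self.threshold or t == len(frames) - 1:
                return probs, t + 1  # prediction and number of frames used
```

Simpler videos exit after a frame or two, while harder ones consume the full frame budget.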

References

SHOWING 1-10 OF 63 REFERENCES
AR-Net: Adaptive Frame Resolution for Efficient Action Recognition
TLDR
A novel approach, called AR-Net (Adaptive Resolution Network), is proposed that selects on-the-fly the optimal resolution for each frame conditioned on the input, for efficient action recognition in long untrimmed videos (a small sketch of this kind of per-frame resolution selection follows this reference list).
Dynamic Inference: A New Approach Toward Efficient Video Action Recognition
TLDR
This paper proposes a general dynamic inference idea that improves efficiency by leveraging the variation in distinguishability across videos, and introduces an online temporal shift module to alleviate the conflict between progressive computation and video temporal relation modeling.
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.
TEA: Temporal Excitation and Aggregation for Action Recognition
TLDR
This paper proposes a Temporal Excitation and Aggregation block, including a motion excitation module and a multiple temporal aggregation module, specifically designed to capture both short- and long-range temporal evolution, and achieves impressive results at low FLOPs on several action recognition benchmarks.
Gate-Shift Networks for Video Action Recognition
TLDR
An extensive evaluation of the proposed Gate-Shift Module is performed to study its effectiveness in video action recognition, achieving state-of-the-art results on the Something-Something-V1 and Diving48 datasets and competitive results on EPIC-Kitchens with far less model complexity.
Dynamic Motion Representation for Human Action Recognition
TLDR
The experimental results show that training a convolutional neural network with the dynamic motion representation outperforms state-of-the-art action recognition models on the HMDB, JHMDB, UCF-101, and AVA datasets.
ECO: Efficient Convolutional Network for Online Video Understanding
TLDR
A network architecture is introduced that takes long-term content into account and enables fast per-video processing at the same time, achieving competitive performance across all datasets while being 10 to 80 times faster than state-of-the-art methods.
AdaFrame: Adaptive Frame Selection for Fast Video Recognition
TLDR
It is qualitatively demonstrated that learned frame usage can indicate the difficulty of making classification decisions: easier samples need fewer frames while harder ones require more, both at the instance level within the same class and at the class level among different categories.
STM: SpatioTemporal and Motion Encoding for Action Recognition
TLDR
This work proposes an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to present spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features, and replaces the original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network.
Listen to Look: Action Recognition by Previewing Audio
TLDR
A framework for efficient action recognition in untrimmed video is proposed that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies, and an ImgAud2Vid framework is devised that hallucinates clip-level features by distilling from lighter modalities, reducing short-term temporal redundancy for efficient video-level recognition.
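The AR-Net reference above selects a resolution per frame conditioned on the input. Below is a small sketch of that kind of per-frame resolution selection; the candidate resolutions, module names, and the straight-through Gumbel-softmax choice are illustrative assumptions rather than the paper's exact method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveResolutionSelector(nn.Module):
    """Toy sketch: choose one of several input resolutions per frame."""

    def __init__(self, backbone: nn.Module, resolutions=(224, 168, 112)):
        super().__init__()
        # The backbone is assumed to handle variable spatial sizes, e.g. a CNN
        # ending in global average pooling.
        self.backbone = backbone
        self.resolutions = resolutions
        # A cheap policy head looks at a low-resolution preview of the frame.
        self.policy = nn.Sequential(
            nn.AdaptiveAvgPool2d(8),
            nn.Flatten(),
            nn.Linear(3 * 8 * 8, len(resolutions)),
        )

    def forward(self, frame: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # frame: [N, 3, H, W]
        logits = self.policy(frame)                              # [N, K]
        choice = F.gumbel_softmax(logits, tau=tau, hard=True)    # one-hot [N, K]
        outputs = []
        for k, res in enumerate(self.resolutions):
            x = F.interpolate(frame, size=(res, res), mode="bilinear",
                              align_corners=False)
            outputs.append(self.backbone(x) * choice[:, k:k + 1])
        # During training all branches run and the unselected ones are zeroed
        # out; at inference only the chosen resolution would be computed,
        # which is where the savings come from.
        return sum(outputs)
```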