Corpus ID: 231933751

VA-RED2: Video Adaptive Redundancy Reduction

Bowen Pan, Rameswar Panda, Camilo Luciano Fosco, Chung-Ching Lin, Alex Andonian, Yue Meng, Kate Saenko, Aude Oliva, Rogério Schmidt Feris
Performing inference with deep learning models on videos remains a challenge due to the large amount of computational resources required for robust recognition. An inherent property of real-world videos is the high correlation of information across frames, which can translate into redundancy in the temporal or spatial feature maps of the models, or both. The type of redundant features depends on the dynamics and type of events in the video: static videos have more temporal redundancy…
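As a concrete, purely illustrative measure of the temporal redundancy described above (not VA-RED2's actual mechanism; the function names and toy feature vectors are hypothetical), one can score how correlated consecutive frame features are:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

def temporal_redundancy(frames):
    """Mean cosine similarity between consecutive frame feature vectors."""
    sims = [cosine(frames[t], frames[t + 1]) for t in range(len(frames) - 1)]
    return sum(sims) / len(sims)

static = [[1.0, 2.0, 3.0]] * 4                              # near-identical frames
dynamic = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # frames keep changing
print(temporal_redundancy(static))   # → 1.0 (high temporal redundancy)
print(temporal_redundancy(dynamic))  # → 0.0 (no redundancy to exploit)
```

A high score signals that feature maps from adjacent frames carry largely duplicated information, which is exactly the redundancy an adaptive model can skip recomputing.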
Adaptive Focus for Efficient Video Recognition
This paper models patch localization as a sequential decision task and proposes a reinforcement-learning-based approach for efficient spatially adaptive video recognition (AdaFocus), whose features are fed to a recurrent policy network that localizes the most task-relevant regions.
An Image is Worth 16x16 Words, What is a Video Worth?
This work addresses the computational bottleneck by significantly reducing the number of frames required for inference, and relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame.
AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition
An adaptive multi-modal learning framework, called AdaMML, is proposed that selects on-the-fly the optimal modalities for each segment conditioned on the input for efficient video recognition.
IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers
It is demonstrated that the interpretability that naturally emerges in the IA-RED2 framework can outperform the raw attention learned by the original vision transformer, as well as attention maps generated by off-the-shelf interpretation methods, with both qualitative and quantitative results.
HMS: Hierarchical Modality Selection for Efficient Video Recognition
HMS is introduced, a simple yet efficient multimodal learning framework for efficient video recognition that dynamically decides on-the-fly whether to use computationally-expensive modalities, including appearance and motion clues, on a per-input basis.
MutualNet: Adaptive ConvNet via Mutual Learning from Different Model Configurations
This work proposes a general method called MutualNet to train a single network that can run at a diverse set of resource constraints, and trains a cohort of model configurations with various network widths and input resolutions, which greatly reduces the training cost.
AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition
An adaptive temporal fusion network that dynamically fuses channels from current and past feature maps for strong temporal modelling, called AdaFuse, that can achieve about 40% computation savings with comparable accuracy to state-of-the-art methods.
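To make the per-channel fusion concrete, here is a minimal pure-Python sketch; the `adafuse_step` name and string-valued `gate` are illustrative only (AdaFuse learns the policy end-to-end):

```python
def adafuse_step(current, past, gate):
    """Fuse one frame's feature channels with the previous frame's.

    gate[c] is a per-channel decision (in the paper, produced by a
    learned policy); here it is a plain string for illustration.
    """
    out = []
    for c, decision in enumerate(gate):
        if decision == "keep":
            out.append(current[c])  # compute/keep the current-frame channel
        elif decision == "reuse":
            out.append(past[c])     # reuse history, saving computation
        else:                       # "skip"
            out.append(0.0)         # drop the channel entirely
    return out

# one channel kept, one reused from the past frame, one skipped
print(adafuse_step([1.0, 2.0, 3.0], [9.0, 8.0, 7.0], ["keep", "reuse", "skip"]))
# → [1.0, 8.0, 0.0]
```

Reused channels cost no new computation, which is where the reported ~40% savings come from.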
Dynamic Neural Networks: A Survey
This survey comprehensively reviews this rapidly developing area by dividing dynamic networks into three main categories: sample-wise dynamic models that process each sample with data-dependent architectures or parameters; spatial-wise dynamic networks that conduct adaptive computation with respect to different spatial locations of image data; and temporal-wise dynamic networks that perform adaptive inference along the temporal dimension for sequential data.
Dynamic Network Quantization for Efficient Video Inference
A dynamic network quantization framework is proposed that selects the optimal precision for each frame conditioned on the input, providing significant savings in computation and memory usage while outperforming existing state-of-the-art methods for efficient video recognition.
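A toy sketch of per-frame precision selection using plain uniform quantization; the `dynamic_quantize` helper and the hard-coded bit-width policy are illustrative, not the paper's learned policy:

```python
def quantize(x, bits):
    # uniform quantization of a value in [0, 1] to 2**bits levels
    levels = (1 << bits) - 1
    return round(x * levels) / levels

def dynamic_quantize(frames, policy):
    # policy: per-frame bit-width (in the paper, chosen by a lightweight
    # policy network conditioned on the input)
    return [[quantize(v, bits) for v in frame]
            for frame, bits in zip(frames, policy)]

frames = [[0.6, 0.3], [0.6, 0.3]]
# an "easy" frame gets 2 bits, a "hard" frame gets 8 bits
coarse, fine = dynamic_quantize(frames, [2, 8])
print(coarse)  # low precision: visible rounding error
print(fine)    # high precision: values nearly unchanged
```

The savings come from running most frames at low precision and reserving high precision for the few frames that need it.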


More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation
A lightweight and memory-friendly architecture for action recognition that performs on par with or better than current architectures while using only a fraction of the resources, along with a temporal aggregation module that models temporal dependencies in a video at very small additional computational cost.
BlockDrop: Dynamic Inference Paths in Residual Networks
BlockDrop, an approach that learns to dynamically choose which layers of a deep network to execute during inference so as to best reduce total computation without degrading prediction accuracy, is introduced.
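The idea can be sketched in a few lines (a hedged toy version: the real policy is produced by a learned network and the blocks are residual CNN layers, not scalar lambdas):

```python
def blockdrop_forward(x, blocks, policy):
    # policy[i] == 1 -> execute residual block i; 0 -> identity skip
    for block, keep in zip(blocks, policy):
        if keep:
            x = x + block(x)  # residual update
    return x

# three toy "residual blocks" acting on a scalar feature
blocks = [lambda v: v * 0.5, lambda v: v + 1.0, lambda v: -v]
print(blockdrop_forward(2.0, blocks, [1, 0, 1]))  # → 0.0
```

Skipped blocks reduce to the identity, so computation drops roughly in proportion to the number of zeros in the policy.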
DynamoNet: Dynamic Action and Motion Network
A novel unified spatio-temporal 3D-CNN architecture (DynamoNet) that jointly optimizes the video classification and learning motion representation by predicting future frames as a multi-task learning problem is introduced.
Two-Stream Convolutional Networks for Action Recognition in Videos
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
Recurrent Residual Module for Fast Inference in Videos
This work proposes a framework called Recurrent Residual Module (RRM) to accelerate the CNN inference for video recognition tasks, which has a novel design of using the similarity of the intermediate feature maps of two consecutive frames to largely reduce the redundant computation.
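A hedged sketch of the underlying idea (names and threshold are illustrative, and RRM operates on intermediate feature maps inside the CNN rather than flat lists): recompute an output only where the input actually changed between consecutive frames:

```python
def reuse_or_recompute(prev_in, cur_in, prev_out, op, thresh=1e-3):
    """Elementwise: rerun `op` only where the input changed noticeably."""
    out = []
    for pi, ci, po in zip(prev_in, cur_in, prev_out):
        if abs(ci - pi) > thresh:
            out.append(op(ci))   # changed: run the (expensive) operator
        else:
            out.append(po)       # unchanged: reuse the cached result
    return out

double = lambda v: v * 2.0
prev_in, cur_in = [1.0, 2.0, 3.0], [1.0, 2.5, 3.0]
print(reuse_or_recompute(prev_in, cur_in, [2.0, 4.0, 6.0], double))
# → [2.0, 5.0, 6.0]  (only the middle element is recomputed)
```

For largely static video, most positions fall below the threshold, so most of the convolutional work is skipped.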
ECO: Efficient Convolutional Network for Online Video Understanding
A network architecture that takes long-term content into account while enabling fast per-video processing, achieving competitive performance across all datasets while being 10 to 80 times faster than state-of-the-art methods.
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.
Video Modeling With Correlation Networks
This paper proposes an alternative approach based on a learnable correlation operator that can be used to establish frame-to-frame matches over convolutional feature maps in the different layers of the network.
RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition
This work introduces RubiksNet, a new efficient architecture for video action recognition which is based on a proposed learnable 3D spatiotemporal shift operation instead of a channel-wise shift-based primitive, and analyzes the suitability of the new primitive and explores several novel variations of the approach to enable stronger representational flexibility while maintaining an efficient design.
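For context, the channel-wise shift primitive that RubiksNet's learnable 3D shift generalizes can be sketched as follows (integer shifts with zero padding; the toy data and function name are illustrative, not the RubiksNet operator itself):

```python
def temporal_shift(features, shifts):
    """features[t][c]: frame t, channel c; shifts[c]: frames channel c moves."""
    T, C = len(features), len(features[0])
    out = [[0.0] * C for _ in range(T)]  # zero-pad at the sequence boundaries
    for c, s in enumerate(shifts):
        for t in range(T):
            src = t - s                  # a positive shift reaches into the past
            if 0 <= src < T:
                out[t][c] = features[src][c]
    return out

feats = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
# channel 0 shifted one frame forward in time, channel 1 left in place
print(temporal_shift(feats, [1, 0]))
# → [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]]
```

Shifting costs no multiplications, which is why shift-based primitives are attractive for efficient temporal modeling; RubiksNet additionally makes the shift amounts learnable along all three spatiotemporal axes.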
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
It is shown that many of the 3D convolutions can be replaced by low-cost 2D convolutions, suggesting that temporal representation learning on high-level “semantic” features is more useful.