Corpus ID: 214802666

TimeGate: Conditional Gating of Segments in Long-range Activities

@article{Hussein2020TimeGateCG,
  title={TimeGate: Conditional Gating of Segments in Long-range Activities},
  author={Noureldien Hussein and Mihir Jain and Babak Ehteshami Bejnordi},
  journal={arXiv preprint arXiv:2004.01808},
  year={2020}
}
When recognizing a long-range activity, exploring the entire video is exhaustive and computationally expensive, as it can span up to a few minutes. Thus, it is of great importance to sample only the salient parts of the video. We propose TimeGate, along with a novel conditional gating module, for sampling the most representative segments from the long-range activity. TimeGate has two novelties that address the shortcomings of previous sampling methods, such as SCSampler. First, it enables a…
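The gating idea in the abstract can be sketched in a few lines. The following is a minimal illustration only, not the paper's actual module: the linear saliency scorer, the sigmoid gate, the 0.5 threshold, and the feature shapes are all assumptions made for the example.

```python
import numpy as np

def gate_segments(segment_features, scorer_weights, threshold=0.5):
    """Select salient segments of a long video via a learned gate.

    segment_features: (num_segments, dim) cheap per-segment features
    scorer_weights:   (dim,) weights of a hypothetical linear saliency scorer
    Returns a boolean keep/drop mask over segments and the gate probabilities.
    """
    logits = segment_features @ scorer_weights  # one saliency score per segment
    probs = 1.0 / (1.0 + np.exp(-logits))       # sigmoid gate in [0, 1]
    mask = probs > threshold                    # binary keep/drop decision
    return mask, probs

# Toy usage: 8 segments with 4-dim features; only gated segments
# would be forwarded to the expensive backbone classifier.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4))
w = rng.normal(size=4)
mask, probs = gate_segments(feats, w)
selected = feats[mask]
```

In the paper, the gate is made differentiable so that it can be trained end-to-end with the classifier; a hard threshold like the one above would only be used at inference time.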
Citations

FrameExit: Conditional Early Exiting for Efficient Video Recognition
This paper proposes a conditional early exiting framework for efficient video recognition that automatically learns to process fewer frames for simpler videos and more frames for complex ones, and generates on-the-fly supervision signals to provide a dynamic trade-off between accuracy and computational cost.
Long-term Behaviour Recognition in Videos with Actor-focused Region Attention
The Multi-Regional fine-tuned 3D-CNN, topped with Actor Focus and Region Attention, largely improves the performance of baseline 3D architectures, achieving state-of-the-art results on Breakfast, a well-known long-term activity recognition benchmark.
PIC: Permutation Invariant Convolution for Recognizing Long-range Activities
PIC, Permutation Invariant Convolution, is a novel neural layer for modeling the temporal structure of long-range activities and has three desirable properties: unlike standard convolution, PIC is invariant to the temporal permutations of features within its receptive field, qualifying it to model weak temporal structures.
Dynamic Network Quantization for Efficient Video Inference
A dynamic network quantization framework is proposed that selects the optimal precision for each frame conditioned on the input; it provides significant savings in computation and memory usage while outperforming existing state-of-the-art methods for efficient video recognition.
AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition
An adaptive multi-modal learning framework, called AdaMML, is proposed that selects on-the-fly the optimal modalities for each segment conditioned on the input for efficient video recognition.
CoDiNet: Path Distribution Modeling with Consistency and Diversity for Dynamic Routing.
A novel method, termed CoDiNet, is proposed to model the relationship between a sample space and a routing space by regularizing the distribution of routing paths with the properties of consistency and diversity; it achieves higher performance and effectively reduces average computational cost on four widely used datasets.

References

(showing references 1-10 of 39)
PIC: Permutation Invariant Convolution for Recognizing Long-range Activities
PIC, Permutation Invariant Convolution, is a novel neural layer for modeling the temporal structure of long-range activities and has three desirable properties: unlike standard convolution, PIC is invariant to the temporal permutations of features within its receptive field, qualifying it to model weak temporal structures.
VideoGraph: Recognizing Minutes-Long Human Activities in Videos
The graph, its nodes and edges are learned entirely from video datasets, making VideoGraph applicable to problems without node-level annotation, and it is demonstrated that VideoGraph is able to learn the temporal structure of human activities in minutes-long videos.
ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification
A new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video and outperforms other baselines with comparable base architectures on the HMDB51, UCF101, and Charades video classification benchmarks.
SCSampler: Sampling Salient Clips From Video for Efficient Action Recognition
It is demonstrated that the computational cost of action recognition on untrimmed videos can be dramatically reduced by invoking recognition only on the most salient clips, which yields significant gains in recognition accuracy compared to analysis of all clips or of randomly selected clips.
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, their advantage over traditional methods is not so evident.
Large-Scale Video Classification with Convolutional Neural Networks
This work studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggests a multiresolution, foveated architecture as a promising way of speeding up training.
Efficient Video Classification Using Fewer Frames
This work focuses on building compute-efficient video classification models that process fewer frames and hence require fewer FLOPs, and shows that in each of these cases a see-it-all teacher can be used to train a compute-efficient see-very-little student.
You Look Twice: GaterNet for Dynamic Filter Selection in CNNs
This paper investigates input-dependent dynamic filter selection in deep convolutional neural networks (CNNs) and proposes a novel yet simple framework called GaterNet, which involves a backbone and a gater network.
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics; a new Two-Stream Inflated 3D ConvNet (I3D), based on 2D ConvNet inflation, is introduced.
Non-local Neural Networks
This paper presents non-local operations as a generic family of building blocks for capturing long-range dependencies in computer vision; they improve object detection/segmentation and pose estimation on the COCO suite of tasks.