Video Representation Learning Using Discriminative Pooling

@article{Wang2018VideoRL,
  title={Video Representation Learning Using Discriminative Pooling},
  author={Jue Wang and Anoop Cherian and Fatih Murat Porikli and Stephen Gould},
  journal={2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2018},
  pages={1149-1158}
}
Popular deep models for action recognition in videos generate independent predictions for short clips, which are then pooled heuristically to assign an action label to the full video segment. As not all frames may characterize the underlying action (indeed, many are common across multiple actions), pooling schemes that impose equal importance on all frames can be unfavorable. To tackle this problem, we propose discriminative pooling, based on the notion that among the deep features generated on all short clips, there is at least one that characterizes the action.
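The journal follow-up listed below casts this idea as an SVM-based pooling scheme: a classifier is trained to separate a video's clip features (at least one of which is assumed discriminative) from a bag of generic negatives, and the learned decision boundary serves as the video descriptor. As a minimal sketch of that general idea only (not the authors' implementation), the function name, the random stand-in features, and the unit-norm post-processing below are all illustrative assumptions:

import numpy as np
from sklearn.svm import LinearSVC

def discriminative_pool(clip_feats, negative_feats, C=1.0):
    """Pool a video's clip-level CNN features into one descriptor.

    A linear SVM separates the video's clips (positive bag) from a
    generic negative bag; the learned hyperplane emphasizes the
    discriminative clips and acts as the video-level representation.
    """
    X = np.vstack([clip_feats, negative_feats])
    y = np.concatenate([np.ones(len(clip_feats)),
                        -np.ones(len(negative_feats))])
    svm = LinearSVC(C=C, fit_intercept=False)
    svm.fit(X, y)
    w = svm.coef_.ravel()
    return w / (np.linalg.norm(w) + 1e-12)  # unit-normalized descriptor

# Illustrative usage: random stand-ins for precomputed CNN clip features.
rng = np.random.default_rng(0)
video_clips = rng.normal(size=(30, 512))   # 30 clips, 512-D features
background = rng.normal(size=(200, 512))   # generic negative bag
descriptor = discriminative_pool(video_clips, background)  # shape (512,)

Descriptors pooled this way from each video would then be fed to a standard action classifier.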
Citations

Discriminative Video Representation Learning Using Support Vector Classifiers
  • Jue Wang, A. Cherian
  • Computer Science, Medicine
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2021
An end-to-end trainable discriminative pooling scheme, based on the notion that among the deep features generated on all short clips, there is at least one that characterizes the action.
Discriminative Subspace Pooling for Action Recognition
Adversarial perturbations are noise-like patterns that can subtly change the data, while failing an otherwise accurate classifier. In this paper, we propose to use such perturbations for improving …
Second-order Temporal Pooling for Action Recognition
This paper proposes a novel end-to-end learnable feature aggregation scheme, dubbed temporal correlation pooling, that generates an action descriptor for a video sequence by capturing the similarities between the temporal evolution of clip-level CNN features computed across the video.
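As a rough illustration of second-order temporal pooling in general (not necessarily that paper's exact formulation), one can treat each feature channel as a time series over clips and pool the channel-to-channel correlations; the function and its normalization choices below are assumptions for the sketch:

import numpy as np

def temporal_correlation_pool(clip_feats, eps=1e-12):
    # clip_feats: (T, d) clip-level CNN features ordered in time.
    # Each of the d channels is a T-length time series; the descriptor
    # stores the correlations between every pair of channel evolutions.
    X = clip_feats - clip_feats.mean(axis=0, keepdims=True)
    X = X / (np.linalg.norm(X, axis=0, keepdims=True) + eps)
    C = X.T @ X                        # (d, d) temporal correlation matrix
    return C[np.triu_indices_from(C)]  # vectorize the upper triangle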
Learning Discriminative Video Representations Using Adversarial Perturbations
This paper first generates adversarial noise adapted to a well-trained deep model for per-frame video recognition, then develops a binary classification problem that learns a set of discriminative hyperplanes (as a subspace) that will separate the two bags from each other.
Non-linear Temporal Subspace Representations for Activity Recognition
A novel pooling method, kernelized rank pooling, is proposed that represents a given sequence compactly as the pre-image of the parameters of a hyperplane in a reproducing kernel Hilbert space, projections of the data onto which capture their temporal order.
Revisiting Hard Example for Action Recognition
A novel light-weight Voting-based Temporal Correlation (VTC) module is proposed to enhance temporal information, together with a simple and intuitive Similarity Loss (SL) to guide the training of the VTC module and the backbone network.
Action Recognition in Videos Using Multi-stream Convolutional Neural Networks
A different pre-training procedure for the latter stream is developed using visual rhythm images extracted from Kinetics, a large and challenging video dataset, with the aim of classifying trimmed videos based on the action being performed by one or more agents.
Cooperative Cross-Stream Network for Discriminative Action Representation
A novel cooperative cross-stream network that investigates the conjoint information across multiple modalities, enhancing the discriminative power of the deeply learned features and reducing the undesired modality discrepancy by jointly optimizing a modality ranking constraint and a cross-entropy loss for both homogeneous and heterogeneous modalities.
Weakly-supervised temporal attention 3D network for human action recognition
A weakly-supervised temporal attention 3D network for human action recognition, called TA3DNet, is proposed to accelerate 3D convolutional neural networks (3D CNNs) by temporally assigning different importance to each frame.

References

Showing 1-10 of 66 references.
Generalized Rank Pooling for Activity Recognition
A novel pooling method, generalized rank pooling (GRP), is proposed that takes as input features from the intermediate layers of a CNN trained on tiny sub-sequences, and produces as output the parameters of a subspace that provides a low-rank approximation to the features while preserving their temporal order.
AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos
The proposed pooling method consistently improves on baseline pooling methods, with both RGB- and optical-flow-based convolutional networks, and in combination with complementary video representations.
Two-Stream Convolutional Networks for Action Recognition in Videos
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks, and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
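To make the two-stream idea concrete, here is a minimal sketch with a tiny stand-in network; the layer sizes, the 10-frame flow stack (20 channels), the 101-class output, and score-averaging fusion are illustrative assumptions, not the paper's exact architecture:

import torch
import torch.nn as nn

class StreamNet(nn.Module):
    # Tiny stand-in for the much deeper per-stream CNNs.
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

spatial = StreamNet(in_channels=3, num_classes=101)    # one RGB frame
temporal = StreamNet(in_channels=20, num_classes=101)  # 10 stacked (x, y) flow fields
rgb = torch.randn(1, 3, 224, 224)
flow = torch.randn(1, 20, 224, 224)
# Late fusion: average the per-stream class probabilities.
probs = (spatial(rgb).softmax(-1) + temporal(flow).softmax(-1)) / 2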
Action Recognition by Learning Deep Multi-Granular Spatio-Temporal Video Representation
This paper presents a novel framework to boost action recognition by learning a deep spatio-temporal video representation at hierarchical multi-granularity, using 2D or 3D convolutional neural networks to learn both the spatial and temporal representations.
Action recognition with trajectory-pooled deep-convolutional descriptors
This paper presents a new video representation, called the trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted and deep-learned features, and achieves performance superior to the state of the art on standard benchmarks.
Ordered Pooling of Optical Flow Sequences for Action Recognition
This paper introduces a novel ordered representation of consecutive optical flow frames as an alternative to RGB frames, argues that this representation captures the action dynamics more efficiently, and provides intuitions on why such a representation is better for action recognition.
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. …
Human action recognition with graph-based multiple-instance learning
A new approach to human action recognition from realistic videos is presented in this paper. First, an affine motion model is utilized to compensate background motion for the purpose of extracting …
Dynamic Image Networks for Action Recognition
A new approximate rank pooling CNN layer allows existing CNN models to be used directly on video data with fine-tuning, generalizes dynamic images to dynamic feature maps, and demonstrates the power of the new representations on standard action recognition benchmarks, achieving state-of-the-art performance.
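For intuition, approximate rank pooling is commonly described as collapsing a clip into a single "dynamic image" via a fixed, closed-form weighting of the frames; the sketch below follows the commonly cited coefficient formula, though the paper's full pipeline (preprocessing, fine-tuning) is more involved:

import numpy as np

def dynamic_image(frames):
    # frames: (T, H, W, C) clip. Approximate rank pooling weights the
    # frames with closed-form coefficients alpha_t and sums them, giving
    # one image that encodes the clip's temporal evolution.
    T = len(frames)
    # Harmonic numbers H_0..H_T, with H_0 = 0.
    H = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, T + 1))])
    t = np.arange(1, T + 1)
    alpha = 2.0 * (T - t + 1) - (T + 1) * (H[T] - H[t - 1])
    return np.tensordot(alpha, frames.astype(np.float64), axes=1)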
ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification
A new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video, and outperforms other baselines with comparable base architectures on the HMDB51, UCF101, and Charades video classification benchmarks.