Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis

@inproceedings{Le2011LearningHI,
  title={Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis},
  author={Quoc V. Le and Will Y. Zou and Serena Yeung and Andrew Y. Ng},
  booktitle={CVPR 2011},
  year={2011},
  pages={3361-3368}
}
Previous work on action recognition has focused on adapting hand-designed local features, such as SIFT or HOG, from static images to the video domain. In this paper, we propose using unsupervised feature learning as a way to learn features directly from video data. More specifically, we present an extension of the Independent Subspace Analysis algorithm to learn invariant spatio-temporal features from unlabeled video data. We discovered that, despite its simplicity, this method performs…
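The ISA computation the abstract alludes to can be made concrete. Below is a minimal NumPy sketch, assuming whitened input patches, a fixed square-root-pooling second layer, and projected gradient descent with symmetric orthonormalization to maintain the constraint W W^T = I; the dimensions, learning rate, and iteration count are illustrative assumptions, not the paper's settings.

import numpy as np

def isa_activations(X, W, subspace_size):
    # Pooled features p_i(x) = sqrt(sum over subspace i of (W x)^2):
    # square-root pooling over non-overlapping groups of first-layer responses.
    Z = (W @ X) ** 2                                      # (k, T) squared responses
    k, T = Z.shape
    pooled = Z.reshape(k // subspace_size, subspace_size, T).sum(axis=1)
    return np.sqrt(pooled + 1e-8)                         # epsilon avoids sqrt(0)

def symmetric_orthonormalize(W):
    # Project W back onto the constraint W W^T = I via its SVD.
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

def train_isa(X, n_filters=32, subspace_size=2, lr=0.1, n_iters=200, seed=0):
    # Minimize the sum of pooled activations over whitened data X (n_dims, T).
    rng = np.random.default_rng(seed)
    W = symmetric_orthonormalize(rng.standard_normal((n_filters, X.shape[0])))
    for _ in range(n_iters):
        Z = W @ X
        P = isa_activations(X, W, subspace_size)          # (k / s, T)
        scale = np.repeat(P, subspace_size, axis=0)       # broadcast p_i to its filters
        grad = (Z / scale) @ X.T / X.shape[1]             # d/dW of mean_t sum_i p_i(x_t)
        W = symmetric_orthonormalize(W - lr * grad)       # gradient step + projection
    return W

# Toy usage on random stand-ins for whitened, flattened video blocks.
X = np.random.default_rng(1).standard_normal((300, 1000))
W = train_isa(X)
features = isa_activations(X, W, subspace_size=2)         # invariant features per block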
Accelerated learning of discriminative spatio-temporal features for action recognition
TLDR
This work proposes two methods for speeding up feature learning with ISA: using the scalable MapReduce programming model to parallelize the ISA algorithm, and using spatio-temporal interest point detectors to extract "important" blocks from video, which enhances speed and improves classification accuracy.
Action Recognition Based on Efficient Deep Feature Learning in the Spatio-Temporal Domain
TLDR
A simple yet robust 2-D convolutional neural network is extended to a concatenated 3-D network that learns to extract features from the spatio-temporal domain of raw video data and is used for content-based recognition of videos.
DL-SFA: Deeply-Learned Slow Feature Analysis for Action Recognition
TLDR
This paper uses a two-layered SFA learning structure with 3D convolution and max pooling operations to scale up the method to large inputs and capture abstract and structural features from the video.
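For background on the slowness objective this summary refers to, here is a minimal one-layer linear SFA sketch; it is not the paper's two-layer convolutional DL-SFA, and the finite-difference derivative and generalized eigenproblem below are the standard textbook formulation.

import numpy as np
from scipy.linalg import eigh

def linear_sfa(X, n_components=2):
    # X: (T, d) time series. Find projections that vary as slowly as possible
    # over time while keeping unit variance: minimize w^T Cdot w s.t. w^T C w = 1.
    X = X - X.mean(axis=0)
    dX = np.diff(X, axis=0)              # finite-difference temporal derivative
    C = X.T @ X / len(X)                 # data covariance
    Cdot = dX.T @ dX / len(dX)           # derivative covariance
    _, eigvecs = eigh(Cdot, C)           # generalized eigenproblem, ascending order
    return eigvecs[:, :n_components]     # columns ordered slowest first

# Toy usage: recover a slow sinusoid mixed with fast noise dimensions.
t = np.linspace(0, 8 * np.pi, 2000)
rng = np.random.default_rng(0)
X = np.column_stack([np.sin(0.2 * t), rng.standard_normal((2000, 5))])
slow = (X - X.mean(axis=0)) @ linear_sfa(X, n_components=1)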
Two-stream spatiotemporal feature fusion for human action recognition
TLDR
This paper proposes a novel human action recognition method by fusing spatial and temporal features learned from a simple unsupervised convolutional neural network called principal component analysis network (PCANet) in combination with bag-of-features (BoF) and vector of locally aggregated descriptors (VLAD) encoding schemes.
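For reference, the VLAD encoding mentioned in this summary can be sketched in a few lines; the hard assignment and the power and L2 normalizations below are common choices, and the function is an illustrative simplification rather than the paper's exact pipeline.

import numpy as np

def vlad_encode(descriptors, centroids):
    # VLAD: for each centroid, sum the residuals of its assigned descriptors,
    # concatenate the per-centroid sums, then power- and L2-normalize.
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)                       # hard nearest-centroid assignment
    K, dim = centroids.shape
    v = np.zeros((K, dim))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            v[k] = (members - centroids[k]).sum(axis=0)
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))              # power normalization
    return v / (np.linalg.norm(v) + 1e-12)           # L2 normalization

# Toy usage: 500 local descriptors of dimension 64 against 8 k-means centroids.
rng = np.random.default_rng(0)
code = vlad_encode(rng.standard_normal((500, 64)), rng.standard_normal((8, 64)))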
Action recognition via spatio-temporal local features: A comprehensive study
TLDR
A comprehensive study of local methods for human action recognition based on spatio-temporal local features, which implements these techniques and compares them under unified experimental settings on three widely used benchmarks: the KTH, UCF-YouTube, and HMDB51 datasets.
Extracting hierarchical spatial and temporal features for human action recognition
TLDR
A dual-channel model is presented that decouples spatial and temporal feature extraction, capturing complementary static form information from single frames and dynamic motion information from multi-frame differences in two separate channels.
Multimedia Event Detection using Visual Features
Learning spatial features from static images has traditionally involved approaches such as SIFT, HOG and SURF, to name a few. These approaches typically learn low-level hand-designed features which…
Learning motion and content-dependent features with convolutions for action recognition
TLDR
This paper develops a temporal extension of convolutional neural networks to exploit motion-dependent features for recognizing human action in video and proves that motion- and content-dependent features arise simultaneously from the developed architecture, whereas previous works mostly deal with the two separately.
Learning Discriminative Space–Time Action Parts from Weakly Labelled Videos
TLDR
By using local space–time action parts in a weakly supervised setting, this work demonstrates a local deformable spatial bag-of-features in which local discriminative regions are split into a fixed grid of parts that are allowed to deform in both space and time at test time.
Stacked Overcomplete Independent Component Analysis for Action Recognition
TLDR
This paper introduces Overcomplete Independent Component Analysis (OICA) to directly learn structural spatio-temporal features from raw video data and proposes stacking OICA to form a two-layer network for abstracting robust high-level features.

References

Showing 1-10 of 50 references
Evaluation of Local Spatio-temporal Features for Action Recognition
TLDR
It is demonstrated that regular sampling of space-time features consistently outperforms all tested space-time interest point detectors for human actions in realistic settings, and that the ranking of the majority of methods is consistent over different datasets.
Recognizing realistic actions from videos “in the wild”
TLDR
This paper presents a systematic framework for recognizing realistic actions from videos "in the wild": motion statistics are used to acquire stable motion features and clean static features, and PageRank is used to mine the most informative static features.
Learning realistic human actions from movies
TLDR
A new method for video classification is presented that builds upon and extends several recent ideas, including local space-time features, space-time pyramids, and multi-channel non-linear SVMs, and is shown to improve state-of-the-art results on the standard KTH action dataset.
3D Convolutional Neural Networks for Human Action Recognition
TLDR
A novel 3D CNN model for action recognition that extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.
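To make the idea of a 3D convolution concrete, the toy sketch below applies a single hand-built spatio-temporal kernel to a random video tensor; in the actual model such kernels are learned end to end, and the kernel and tensor shapes here are illustrative assumptions.

import numpy as np
from scipy.ndimage import convolve

video = np.random.default_rng(0).standard_normal((16, 64, 64))  # (frames, H, W)

# A temporal-difference kernel: positive on frame t+1, negative on frame t-1,
# so strong responses mark spatial structure that changes over time (motion).
kernel = np.zeros((3, 3, 3))
kernel[0] = -1.0 / 9
kernel[2] = 1.0 / 9

motion_response = convolve(video, kernel, mode="constant")      # (16, 64, 64)
# A 3D CNN learns a bank of such kernels jointly with the classifier.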
Convolutional Learning of Spatio-temporal Features
TLDR
A model that learns latent representations of image sequences from pairs of successive images is introduced, allowing it to scale to realistic image sizes whilst using a compact parametrization.
Measuring Invariances in Deep Networks
TLDR
A number of empirical tests are proposed that directly measure the degree to which learned features are invariant to different input transformations, finding that stacked autoencoders learn modestly more invariant features with depth when trained on natural images, while convolutional deep belief networks learn substantially more invariant features in each layer.
Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words
TLDR
The approach is not only able to classify different actions, but also to localize different actions simultaneously in a novel and complex video sequence.
Actions in context
TLDR
This paper automatically discovers relevant scene classes and their correlation with human actions, shows how to learn selected scene classes from video without manual supervision, and develops a joint framework for action and scene recognition that demonstrates improved recognition of both in natural video.
An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector
TLDR
This paper presents, for the first time, spatio-temporal interest points that are simultaneously scale-invariant (both spatially and temporally), densely cover the video content, and can be computed efficiently.
Detection of human actions from a single example
H. Seo, P. Milanfar. 2009 IEEE 12th International Conference on Computer Vision, 2009.
TLDR
The proposed algorithm is unsupervised and requires no learning, segmentation, or motion estimation; high performance is demonstrated on a challenging set of action data, indicating successful detection of multiple complex actions even in the presence of fast motions.