Action Recognition with Stacked Fisher Vectors

@inproceedings{Peng2014ActionRW,
  title={Action Recognition with Stacked Fisher Vectors},
  author={Xiaojiang Peng and Changqing Zou and Yu Qiao and Qiang Peng},
  booktitle={ECCV},
  year={2014}
}
Representation of video is a vital problem in action recognition. [...] Key Method: In the first layer, we densely sample large subvolumes from input videos, extract local features, and encode them using Fisher vectors (FVs). The second layer compresses the FVs of the subvolumes obtained in the previous layer, and then encodes them again with Fisher vectors. Compared with the standard FV, SFV allows refining the representation and abstracting semantic information in a hierarchical way. Compared with recent mid-level based [...]
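The two-layer pipeline sketched in the abstract can be illustrated in a few lines. The following is a toy sketch, not the paper's implementation: the diagonal-covariance FV form, the feature dimensions, and the use of PCA for the compression step are assumptions for illustration, built on scikit-learn's `GaussianMixture` and `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_vector(desc, gmm):
    """Mean- and variance-gradient Fisher vector of local descriptors under a diagonal GMM."""
    q = gmm.predict_proba(desc)                      # (N, K) soft assignments
    mu, sig = gmm.means_, np.sqrt(gmm.covariances_)  # (K, D) each, diagonal covariances
    w = gmm.weights_                                 # (K,) mixture weights
    N = desc.shape[0]
    d = (desc[:, None, :] - mu) / sig                # (N, K, D) whitened residuals
    g_mu = (q[:, :, None] * d).sum(0) / (N * np.sqrt(w)[:, None])
    g_sig = (q[:, :, None] * (d**2 - 1)).sum(0) / (N * np.sqrt(2 * w)[:, None])
    fv = np.hstack([g_mu.ravel(), g_sig.ravel()])    # 2*K*D-dimensional vector
    fv = np.sign(fv) * np.sqrt(np.abs(fv))           # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)         # L2 normalization

# Layer 1: one FV per densely sampled subvolume (toy random local features here).
rng = np.random.default_rng(0)
subvolumes = [rng.normal(size=(200, 8)) for _ in range(40)]
gmm1 = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
gmm1.fit(np.vstack(subvolumes))
layer1 = np.array([fisher_vector(s, gmm1) for s in subvolumes])  # (40, 64)

# Layer 2: compress the first-layer FVs, then encode them again with a second FV.
pca = PCA(n_components=16).fit(layer1)
compressed = pca.transform(layer1)                   # (40, 16)
gmm2 = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
gmm2.fit(compressed)
sfv = fisher_vector(compressed, gmm2)                # final stacked Fisher vector, (128,)
```

In a real system the first layer would consume local spatio-temporal descriptors (e.g. improved dense trajectories) rather than random data, and the GMM vocabularies would be trained on a held-out set.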
Human action recognition based on multi-layer Fisher vector encoding method
TLDR
Experiments show that more layers produce higher action classification accuracy, which proves the capability of the proposed new multi-layer Fisher vector encoding method.
Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos
TLDR
The proposed ST-VLMPF clearly provides a more reliable video representation than some of the most widely used and powerful encoding approaches (Improved Fisher Vectors and Vector of Locally Aggregated Descriptors), while maintaining a low computational complexity.
Efficient encoding of video descriptor distribution for action recognition
TLDR
Comparison to state-of-the-art shows that the Maclaurin coefficients of the density function and moments of the distribution to encode the distribution of video descriptors is faster, and in some cases, achieves comparable accuracy.
Saliency-Informed Spatio-Temporal Vector of Locally Aggregated Descriptors and Fisher Vectors for Visual Action Recognition
TLDR
A Saliency-Informed Spatio-Temporal VLAD (SST-VLAD) approach which selects the extracted features corresponding to a small number of videos in the data set by considering both spatial and temporal video-wise saliency scores; the same extension principle is also applied to the FV approach.
Discriminative Multi-View Subspace Feature Learning for Action Recognition
TLDR
This paper proposes a discriminative subspace learning model (DSLM) to explore the complementary properties between hand-crafted shallow feature representations and deep features, and is the first work attempting to mine multi-level feature complementarities via a multi-view subspace learning scheme.
Simple, Efficient and Effective Encodings of Local Deep Features for Video Action Recognition
TLDR
The proposed approaches for deep feature encoding encapsulate the features extracted with a convolutional neural network over the entire video, outperforming the most widely used and powerful encoding approaches by a large margin while remaining extremely efficient in computational cost.
A component-based video content representation for action recognition
TLDR
Experimental results demonstrate that the proposed Component-based Multi-stream CNN model (CM-CNN), trained in a weakly supervised learning (WSL) setting, outperforms the state-of-the-art in action recognition, even the fully supervised approaches.
Discriminative convolutional Fisher vector network for action recognition
TLDR
It is shown that the proposed architecture can be used as a replacement for the fully connected layers in popular convolutional networks achieving a comparable classification performance, or even significantly surpassing the performance of similar architectures while reducing the total number of trainable parameters by a factor of 5.
Multilayer deep features with multiple kernel learning for action recognition
TLDR
This paper proposes integrating a novel representation named multilayer deep features (MDF) of both the human region and the whole image area into an extended region-aware multiple kernel learning (ER-MKL) framework to learn a robust classifier for fusing human-region MDF and whole-region MDF.
Heterogeneous Semantic Level Features Fusion for Action Recognition
TLDR
This paper aims to transfer the success of static-image semantic recognition to the video domain by leveraging both static and motion-based descriptors at different stages of the semantic ladder, fused with a scalable method.

References

Showing 1-10 of 39 references
A Comparative Study of Encoding, Pooling and Normalization Methods for Action Recognition
TLDR
The results show the new encoding methods can significantly improve recognition accuracy compared with classical vector quantization (VQ); among them, Fisher kernel encoding and sparse encoding perform best.
Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis
TLDR
This paper presents an extension of the Independent Subspace Analysis algorithm to learn invariant spatio-temporal features from unlabeled video data and discovered that this method performs surprisingly well when combined with deep learning techniques such as stacking and convolution to learn hierarchical representations.
Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice
TLDR
A comprehensive study of all steps in BoVW and different fusion methods is provided, and a simple yet effective representation is proposed, called hybrid supervector, by exploring the complementarity of different BoVW frameworks with improved dense trajectories.
Action bank: A high-level representation of activity in video
TLDR
Inspired by the recent object bank approach to image representation, Action Bank is presented, a new high-level representation of video comprised of many individual action detectors sampled broadly in semantic space as well as viewpoint space that is capable of highly discriminative performance.
Learning Discriminative Space–Time Action Parts from Weakly Labelled Videos
TLDR
Using local space-time action parts in a weakly supervised setting, the paper demonstrates a local deformable spatial bag-of-features in which local discriminative regions are split into a fixed grid of parts that are allowed to deform in both space and time at test time.
Action Recognition with Actons
TLDR
A two-layer structure for action recognition that automatically exploits a mid-level "acton" representation via a new max-margin multi-channel multiple instance learning framework, which yields state-of-the-art classification performance on the YouTube and HMDB51 datasets.
Learning realistic human actions from movies
TLDR
A new method for video classification that builds upon and extends several recent ideas, including local space-time features, space-time pyramids, and multi-channel non-linear SVMs, is presented and shown to improve state-of-the-art results on the standard KTH action dataset.
3D Convolutional Neural Networks for Human Action Recognition
TLDR
A novel 3D CNN model for action recognition that extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.
Better Exploiting Motion for Better Action Recognition
TLDR
It is established that adequately decomposing visual motion into dominant and residual motions, both in the extraction of the space-time trajectories and for the computation of descriptors, significantly improves action recognition algorithms.
Evaluation of Local Spatio-temporal Features for Action Recognition
TLDR
It is demonstrated that regular sampling of space-time features consistently outperforms all tested space-time interest point detectors for human actions in realistic settings, and that the ranking of the majority of methods is consistent across different datasets.