Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

@article{Wang2016TemporalSN,
  title={Temporal Segment Networks: Towards Good Practices for Deep Action Recognition},
  author={Limin Wang and Yuanjun Xiong and Zhe Wang and Yu Qiao and Dahua Lin and Xiaoou Tang and Luc Van Gool},
  journal={ArXiv},
  year={2016},
  volume={abs/1608.00859}
}
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles for designing effective ConvNet architectures for action recognition in videos and for learning these models given limited training samples. Our first contribution is the temporal segment network (TSN), a novel framework for video-based action recognition, which is based on… 
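The core TSN idea described above — splitting a video into segments, sampling one snippet per segment, and aggregating the per-snippet predictions into a video-level score — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names and the average-consensus choice are assumptions for clarity (the paper also studies other aggregation functions).

```python
# Minimal sketch of TSN-style segment sampling and segmental consensus.
# `snippet_scores` stands in for per-snippet ConvNet class scores.
import random

def sample_snippets(num_frames, num_segments=3, seed=None):
    """Split frame indices into equal segments; sample one index per segment."""
    rng = random.Random(seed)
    seg_len = num_frames // num_segments
    return [seg * seg_len + rng.randrange(seg_len) for seg in range(num_segments)]

def segmental_consensus(snippet_scores):
    """Average per-snippet class scores into one video-level prediction."""
    num_classes = len(snippet_scores[0])
    return [sum(scores[c] for scores in snippet_scores) / len(snippet_scores)
            for c in range(num_classes)]
```

For example, a 90-frame video with three segments yields one randomly placed snippet from frames 0-29, one from 30-59, and one from 60-89, so the video-level prediction sees the full temporal extent at a fixed, small compute cost.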
Sequential Segment Networks for Action Recognition
TLDR
This work proposes a deep learning framework, sequential segment networks (SSN), to model video-level temporal structure, and achieves state-of-the-art performance on the UCF101 and HMDB51 datasets.
Dynamic Representation Learning for Video Action Recognition Using Temporal Residual Networks
  • Yongqiang Kong, Jianhui Huang, Shanshan Huang, Zhengang Wei, Shengke Wang
  • Computer Science
    2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI)
  • 2018
TLDR
An Improved Dynamic Image (IDI) that describes videos by applying salient object detection and rank pooling to sequences of still images, and a temporal Residual Network (ResNet) architecture that operates directly on multiple IDIs for long-term video representation learning.
Fully convolutional networks for action recognition
TLDR
A novel two-stream fully convolutional network architecture for action recognition is designed that significantly reduces parameters while maintaining performance, and it achieves state-of-the-art results on two challenging datasets.
Temporal Segment Networks for Action Recognition in Videos
TLDR
The proposed framework, called temporal segment network (TSN), aims to model long-range temporal structure with a new segment-based sampling and aggregation scheme, and won the video classification track at the ActivityNet challenge 2016 among 24 teams.
Action Recognition in Videos with Temporal Segments Fusions
TLDR
The proposed method, which combines multiple segments via a fully connected layer in a deep CNN model for the whole action video, achieves accuracy competitive with the state-of-the-art 3D convolution method, but with far fewer parameters.
More efficient and effective tricks for deep action recognition
TLDR
Techniques are proposed for reducing the computational cost of the temporal stream while achieving the same accuracy, including the selection of the optical flow algorithm, the pre-training datasets/architectures, and the hyper-parameters for model assembly in the action recognition task.
Frame-skip Convolutional Neural Networks for action recognition
  • Yinan Liu, Q. Wu, L. Tang
  • Computer Science
    2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW)
  • 2017
TLDR
A novel video dynamics mining strategy that takes advantage of motion tracking in the video is proposed, and a frame-skip scheme is introduced to the ConvNets that stacks different modalities of optical flow to build a novel motion representation.
Exploiting the ConvLSTM: Human Action Recognition using Raw Depth Video-Based Recurrent Neural Networks
TLDR
Two neural networks based on the convolutional long short-term memory (ConvLSTM) unit, differing in architecture and long-term learning strategy, are proposed and compared, and it is shown that, in the particular case of videos, the rarely used stateful mode of recurrent neural networks significantly improves the accuracy obtained with the standard mode.
A novel recurrent hybrid network for feature fusion in action recognition
TLDR
A recurrent hybrid network architecture is designed for action recognition by fusing multi-source features: a two-stream CNN for learning semantic features, a three-stream single-layer LSTM for learning long-term temporal features, and an Improved Dense Trajectories stream for learning short-term motion features.
Hierarchical Temporal Pooling for Efficient Online Action Recognition
TLDR
This study focuses on improving the accuracy and efficiency of action recognition following the two-stream ConvNets by investigating effective video-level representations. The resulting HTP-Net (RGB) offers competitive action recognition accuracy while being approximately 1-2 orders of magnitude faster than other state-of-the-art single-stream action recognition methods.

References

Showing 1-10 of 43 references
Two-Stream Convolutional Networks for Action Recognition in Videos
TLDR
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
Long-Term Temporal Convolutions for Action Recognition
TLDR
It is demonstrated that LTC-CNN models with increased temporal extents improve the accuracy of action recognition; the work also studies the impact of different low-level representations, such as raw video pixel values and optical flow vector fields, and the importance of high-quality optical flow estimation for learning accurate action models.
Beyond short snippets: Deep networks for video classification
TLDR
This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full length videos.
Large-Scale Video Classification with Convolutional Neural Networks
TLDR
This work studies multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.
A Key Volume Mining Deep Framework for Action Recognition
TLDR
A key volume mining deep framework to identify key volumes and conduct classification simultaneously and an effective yet simple "unsupervised key volume proposal" method for high quality volume sampling are proposed.
Human Action Recognition Using Factorized Spatio-Temporal Convolutional Networks
TLDR
Factorized spatio-temporal convolutional networks (FstCN) are proposed that factorize the original 3D convolution kernel learning as a sequential process of learning 2D spatial kernels in the lower layers, followed by learning 1D temporal kernels in the upper layers.
Real-Time Action Recognition with Enhanced Motion Vector CNNs
TLDR
This paper accelerates the deep two-stream architecture by replacing optical flow with motion vector which can be obtained directly from compressed videos without extra calculation, and introduces three strategies for this, initialization transfer, supervision transfer and their combination.
3D Convolutional Neural Networks for Human Action Recognition
TLDR
A novel 3D CNN model for action recognition that extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.
Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice
TLDR
A comprehensive study of all steps in BoVW and different fusion methods is provided, and a simple yet effective representation is proposed, called hybrid supervector, by exploring the complementarity of different BoVW frameworks with improved dense trajectories.
Action recognition with trajectory-pooled deep-convolutional descriptors
TLDR
This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features and deep-learned features, and achieves superior performance to the state of the art on these datasets.