Temporal Segment Networks for Action Recognition in Videos

@article{Wang2019TemporalSN,
  title={Temporal Segment Networks for Action Recognition in Videos},
  author={Limin Wang and Yuanjun Xiong and Zhe Wang and Yu Qiao and Dahua Lin and Xiaoou Tang and Luc Van Gool},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2019},
  volume={41},
  pages={2740-2755}
}
We present a general and flexible video-level framework for learning action models in videos. [...] The learned models can be easily deployed for action recognition in both trimmed and untrimmed videos, using simple average pooling and multi-scale temporal window integration, respectively. We also study a series of good practices for implementing the TSN framework given limited training samples. Our approach obtains state-of-the-art performance on five challenging action recognition benchmarks.
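
To make the segmental design above concrete, here is a minimal PyTorch sketch of TSN-style snippet sampling and average-pooling consensus; the backbone choice (ResNet-18) and num_segments=3 are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn
import torchvision.models as models

class TSNSketch(nn.Module):
    """Minimal sketch of TSN's video-level design: a video is split into
    K segments, one snippet is sampled per segment, snippet scores come
    from a shared 2D ConvNet, and the segmental consensus is a simple
    average (the 'average pooling' deployment noted in the abstract)."""
    def __init__(self, num_classes, num_segments=3):
        super().__init__()
        self.num_segments = num_segments
        backbone = models.resnet18(weights=None)   # any 2D ConvNet works here
        backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
        self.backbone = backbone

    def forward(self, snippets):
        # snippets: (batch, num_segments, 3, H, W), one RGB frame per segment
        b, k, c, h, w = snippets.shape
        scores = self.backbone(snippets.view(b * k, c, h, w))
        # segmental consensus: average snippet scores into one video score
        return scores.view(b, k, -1).mean(dim=1)

For untrimmed videos, the same snippet scores would instead be aggregated over multi-scale temporal windows, as the abstract notes.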
Citations

Temporal Action Detection with Structured Segment Networks
TLDR
The structured segment network (SSN) is presented, a novel framework which models the temporal structure of each action instance via a structured temporal pyramid and introduces a decomposed discriminative model comprising two classifiers, respectively for classifying actions and determining completeness.
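
As a rough illustration of the structured temporal pyramid and the decomposed two-classifier design mentioned in this summary, the following Python sketch pools snippet features over a proposal's starting / course / ending stages; the stage split ratios and layer sizes are assumptions for illustration only.

import torch
import torch.nn as nn

def structured_temporal_pyramid(feats):
    # feats: (T, D) snippet features over an augmented proposal.
    # Split into starting / course / ending stages (quarter splits are
    # an illustrative assumption); the course stage gets a two-level
    # pyramid (whole stage + its two halves), the flanks are averaged.
    t = feats.shape[0]
    s, e = t // 4, 3 * t // 4
    start, course, end = feats[:s], feats[s:e], feats[e:]
    half = course.shape[0] // 2
    pyramid = [course.mean(0),            # level 1: whole course stage
               course[:half].mean(0),     # level 2: first half
               course[half:].mean(0)]     # level 2: second half
    return torch.cat([start.mean(0), *pyramid, end.mean(0)])  # (5*D,)

# the decomposed discriminative model from the summary: two heads
feat_dim = 256
global_feat = structured_temporal_pyramid(torch.randn(20, feat_dim))
activity_head = nn.Linear(5 * feat_dim, 21)       # which action?
completeness_head = nn.Linear(5 * feat_dim, 21)   # complete instance or not?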
Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition
TLDR
Results show that the proposed ActionS-ST-VLAD method is able to effectively pool useful deep features spatiotemporally, leading to state-of-the-art performance for video-based action recognition.
Spatio-temporal Multi-level Fusion for Human Action Recognition
TLDR
A spatiotemporal network that integrates spatial and temporal features at multiple levels to model the correlation between spatial and temporal information, obtaining very promising results on the standard UCF-101 dataset.
Action segmentation and understanding in RGB videos with convolutional neural networks
TLDR
This work proposes three techniques for accelerating a modern action recognition pipeline, building on two selected components from the reviewed deep learning literature: a convolutional neural network framework that uses a small number of video frames to obtain robust predictions, and new Graphics Processing Unit (GPU) video-decoding software developed by NVIDIA.
End-to-end Video-level Representation Learning for Action Recognition
TLDR
This paper builds upon two-stream ConvNets and proposes Deep networks with Temporal Pyramid Pooling (DTPP), an end-to-end video-level representation learning approach, to address problems of partial observation training and single temporal scale modeling in action recognition.
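
The temporal pyramid pooling named in this summary can be sketched in a few lines: pooling frame features at several temporal scales and concatenating them yields one fixed-length, video-level descriptor. The level configuration (1, 2, 4) below is an assumption, not necessarily DTPP's.

import torch

def temporal_pyramid_pooling(frame_feats, levels=(1, 2, 4)):
    # frame_feats: (T, D) per-frame features; each pyramid level splits
    # the video into `level` chunks, average-pools each chunk, and all
    # pooled vectors are concatenated into one video-level descriptor.
    pooled = []
    for level in levels:
        for chunk in torch.chunk(frame_feats, level, dim=0):
            pooled.append(chunk.mean(dim=0))
    return torch.cat(pooled)   # (sum(levels) * D,)

video_vec = temporal_pyramid_pooling(torch.randn(16, 512))  # -> (7 * 512,)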
Structured Learning for Action Recognition in Videos
TLDR
A novel architecture consisting of a correlation learning and input synthesis network, long short-term memory (LSTM), and a hierarchical classifier is proposed which utilizes the simultaneous occurrence of general actions such as run and jump to refine the prediction on their correlated actions.
Semantic Image Networks for Human Action Recognition
TLDR
This paper proposes the use of a semantic image, an improved representation for video analysis, principally in combination with Inception networks, and the sequential combination of Inception-ResNet-v2 and a long short-term memory (LSTM) network to leverage temporal variation for improved recognition performance.
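
The "Inception-ResNet-v2 followed by an LSTM" pattern described here is a generic CNN-then-LSTM pipeline; a hedged sketch follows, with the feature dimension (1536, Inception-ResNet-v2's pooled output size) and hidden sizes as illustrative assumptions.

import torch
import torch.nn as nn

class CnnLstmSketch(nn.Module):
    # Per-frame CNN features are fed sequentially to an LSTM, and the
    # final hidden state is classified; this stands in for the paper's
    # exact configuration, which the summary does not specify.
    def __init__(self, feat_dim=1536, hidden=512, num_classes=101):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, frame_feats):         # (batch, T, feat_dim)
        _, (h_n, _) = self.lstm(frame_feats)
        return self.classifier(h_n[-1])     # classify the last hidden state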
Early-stopped learning for action prediction in videos
TLDR
This paper proposes encouraging the learner to learn from the earlier parts of a video and to stop learning from some point on, showing that the method improves on Temporal Segment Networks and outperforms other baseline methods.
Exploring Frame Segmentation Networks for Temporal Action Localization
TLDR
A Frame Segmentation Network (FSN) is proposed that places a temporal CNN on top of the 2D spatial CNNs and can make dense predictions at frame-level for a video clip using both spatial and temporal context information.
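
The "temporal CNN on top of 2D spatial CNNs" arrangement from this summary can be sketched as a 1D convolution over per-frame features that emits a label at every frame; the layer widths below are assumptions.

import torch
import torch.nn as nn

class FrameSegmentationSketch(nn.Module):
    # Per-frame features from a 2D spatial CNN are treated as a 1D
    # sequence; a small temporal CNN then makes dense, frame-level
    # predictions using temporal context (padding keeps the length T).
    def __init__(self, feat_dim=512, num_classes=21):
        super().__init__()
        self.temporal_cnn = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(256, num_classes, kernel_size=3, padding=1),
        )

    def forward(self, frame_feats):                  # (batch, T, feat_dim)
        x = frame_feats.transpose(1, 2)              # -> (batch, feat_dim, T)
        return self.temporal_cnn(x).transpose(1, 2)  # (batch, T, num_classes)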

References

SHOWING 1-10 OF 83 REFERENCES
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.
Long-Term Temporal Convolutions for Action Recognition
TLDR
It is demonstrated that LTC-CNN models with increased temporal extents improve the accuracy of action recognition; the work also studies the impact of different low-level representations, such as raw pixel values and optical-flow vector fields, and shows the importance of high-quality optical flow estimation for learning accurate action models.
Modeling video evolution for action recognition
TLDR
The proposed method captures video-wide temporal information for action recognition by postulating that a function capable of temporally ordering the frames of a video captures well the evolution of appearance within the video.
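
One concrete reading of "a function capable of ordering the frames" is rank pooling: fit a linear scoring function whose outputs increase with frame order, and keep its parameters as the video descriptor. The sketch below substitutes a ridge regression onto the frame index for the paper's ranking machine; that substitution is an assumption made for brevity.

import torch

def rank_pooling_sketch(frame_feats, reg=1.0):
    # frame_feats: (T, D). The time-varying mean smooths per-frame noise;
    # w is fit so that v_t @ w roughly increases with t, and w itself
    # becomes the (D,) video-level representation.
    t, d = frame_feats.shape
    v = torch.cumsum(frame_feats, dim=0) / torch.arange(1, t + 1).unsqueeze(1)
    y = torch.arange(1, t + 1, dtype=frame_feats.dtype)
    w = torch.linalg.solve(v.T @ v + reg * torch.eye(d), v.T @ y)
    return w

video_desc = rank_pooling_sketch(torch.randn(30, 128))   # -> (128,)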
Two-Stream Convolutional Networks for Action Recognition in Videos
TLDR
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
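
The two-stream layout summarized here is straightforward to sketch: two copies of a 2D ConvNet, one taking a single RGB frame and one taking a stack of optical-flow fields, fused late by averaging class scores. ResNet-18 and L = 10 flow frames are stand-in assumptions.

import torch
import torch.nn as nn
import torchvision.models as models

def make_stream(in_channels, num_classes):
    # one stream = a standard 2D ConvNet whose first conv accepts either
    # 3 RGB channels (spatial) or 2*L stacked flow channels (temporal)
    net = models.resnet18(weights=None)
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2,
                          padding=3, bias=False)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

num_classes, flow_stack = 101, 10
spatial = make_stream(3, num_classes)
temporal = make_stream(2 * flow_stack, num_classes)   # x and y flow per frame

rgb = torch.randn(4, 3, 224, 224)
flow = torch.randn(4, 2 * flow_stack, 224, 224)
scores = (spatial(rgb) + temporal(flow)) / 2          # late score fusion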
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
TLDR
A new Two-Stream Inflated 3D ConvNet (I3D) based on 2D ConvNet inflation is introduced, and I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics.
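
The "2D ConvNet inflation" mentioned here has a simple mechanical core: each pretrained 2D kernel is repeated along a new temporal axis and rescaled, so the 3D filter initially reproduces the 2D response on a boring (repeated-frame) video. A small sketch:

import torch

def inflate_2d_kernel(w2d, time_kernel=3):
    # w2d: (out, in, kH, kW) pretrained 2D weights. Repeating T times
    # along a new time axis and dividing by T preserves activations on a
    # video made of identical frames, which is I3D's bootstrapping idea.
    w3d = w2d.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1)
    return w3d / time_kernel   # (out, in, T, kH, kW), ready for a Conv3d

w2d = torch.randn(64, 3, 7, 7)      # e.g. an ImageNet-pretrained first conv
w3d = inflate_2d_kernel(w2d)        # -> (64, 3, 3, 7, 7)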
Convolutional Two-Stream Network Fusion for Video Action Recognition
TLDR
A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed and evaluated on standard benchmarks, where it achieves state-of-the-art results.
Action Recognition with Actons
TLDR
A two-layer structure for action recognition that automatically exploits a mid-level "acton" representation via a new max-margin multi-channel multiple-instance learning framework, which yields state-of-the-art classification performance on the YouTube and HMDB51 datasets.
Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification
TLDR
This work proposes a hybrid deep learning framework for video classification, which is able to model static spatial information, short-term motion, as well as long-term temporal clues in the videos, and achieves very competitive performance on two popular and challenging benchmarks.
Spatiotemporal Residual Networks for Video Action Recognition
TLDR
The novel spatiotemporal ResNet is introduced and evaluated using two widely used action recognition benchmarks where it exceeds the previous state-of-the-art.
Appearance-and-Relation Networks for Video Classification
  • Limin Wang, Wei Li, Wen Li, Luc Van Gool
  • Computer Science
    2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
TLDR
This paper presents a new architecture, termed as Appearance-and-Relation Network (ARTNet), to learn video representation in an end-to-end manner, constructed by stacking multiple generic building blocks, called SMART, whose goal is to simultaneously model appearance and relation from RGB input in a separate and explicit manner.
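
As a loose illustration of modeling "appearance and relation in a separate and explicit manner", the sketch below pairs a spatial-only branch with a spatiotemporal branch and concatenates them; the concrete operators are my assumptions, not the paper's actual SMART block.

import torch
import torch.nn as nn

class SmartBlockSketch(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # appearance branch: spatial-only conv over each frame (1x3x3)
        self.appearance = nn.Conv3d(in_ch, out_ch // 2, (1, 3, 3),
                                    padding=(0, 1, 1))
        # relation branch: spatiotemporal conv across frames (3x3x3)
        self.relation = nn.Conv3d(in_ch, out_ch // 2, (3, 3, 3), padding=1)

    def forward(self, x):   # x: (batch, C, T, H, W)
        return torch.relu(torch.cat([self.appearance(x),
                                     self.relation(x)], dim=1))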