Deep Temporal Linear Encoding Networks

@article{Diba2017DeepTL,
  title={Deep Temporal Linear Encoding Networks},
  author={Ali Diba and Vivek Sharma and Luc Van Gool},
  journal={2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2017},
  pages={1541-1550}
}
  • Ali Diba, Vivek Sharma, Luc Van Gool
  • Published 21 November 2016
  • Computer Science
  • 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
The CNN-encoding of features from entire videos for the representation of human actions has rarely been addressed. [...] It encodes this aggregated information into a robust video feature representation, via end-to-end learning. Advantages of TLEs are: (a) they encode the entire video into a compact feature representation, learning the semantics and a discriminative feature space; (b) they are applicable to all kinds of networks, like 2D and 3D CNNs, for video classification; and (c) they model feature…
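As a rough illustration of the encoding idea summarized above, here is a minimal sketch of a temporal linear encoding step, assuming element-wise-product aggregation over a video's sampled segments followed by a learnable projection trained end-to-end; the function and parameter names are hypothetical, not the authors' exact formulation.

```python
import torch

def temporal_linear_encoding(segment_feats: torch.Tensor,
                             proj: torch.Tensor) -> torch.Tensor:
    """Minimal sketch (assumed form, not the paper's exact operator).

    segment_feats: (K, C) feature vectors from K clips/segments of one video
    proj:          (C, D) learnable encoding matrix, trained end-to-end
    """
    # Aggregate the K segment features into one vector; an element-wise
    # product is one plausible aggregation function.
    aggregated = segment_feats.prod(dim=0)    # (C,)
    # Encode the aggregate into a compact, discriminative video descriptor.
    return aggregated @ proj                  # (D,)
```

Because every step is differentiable, the same encoding can sit on top of either a 2D or a 3D CNN and be trained jointly with it, which matches advantage (b) above.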
Spatio-Temporal Fusion Networks for Action Recognition
A novel spatio-temporal fusion network (STFN) that integrates temporal dynamics of appearance and motion information from entire videos and is applicable to any network for video classification to boost performance.
Temporal–Spatial Mapping for Action Recognition
This work introduces a simple yet effective operation, termed temporal–spatial mapping, for capturing the temporal evolution of the frames by jointly analyzing all the frames of a video, and proposes a temporal attention model within a shallow convolutional neural network to efficiently exploit the temporal–spatial dynamics.
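For intuition only, a minimal sketch of temporal attention pooling over per-frame features; the class name and its internals are my assumptions, not the attention model actually proposed in this paper.

```python
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Hypothetical sketch: score each frame, softmax the scores over time,
    and return the attention-weighted sum as the video-level feature."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)    # per-frame attention score

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, C) feature vectors for the T frames of one video
        weights = torch.softmax(self.score(frame_feats), dim=0)  # (T, 1)
        return (weights * frame_feats).sum(dim=0)                # (C,)
```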
StNet: Local and Global Spatial-Temporal Modeling for Action Recognition
This work explores a novel spatio-temporal network (StNet) architecture for both local and global spatial-temporal modeling in videos; it outperforms several state-of-the-art approaches in action recognition and strikes a satisfying trade-off between recognition accuracy and model complexity.
Temporal Bilinear Encoding Network of Audio-Visual Features at Low Sampling Rates
The proposed Temporal Bilinear Encoding Networks (TBEN) encode long-range temporal information in both audio and visual features using bilinear pooling; they are demonstrated to outperform average pooling along the temporal dimension for videos with low sampling rates, and the label hierarchy is embedded in TBEN to further improve the robustness of the classifier.
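Bilinear pooling over the temporal dimension can be sketched compactly; the averaged-outer-product form below is an assumption about what TBEN computes, and the shapes are illustrative.

```python
import torch

def temporal_bilinear_pool(feats: torch.Tensor) -> torch.Tensor:
    """Sketch: average the outer products of per-timestep features.

    feats: (T, C) audio or visual features sampled at a low rate
    returns a (C * C,) bilinear descriptor
    """
    T = feats.shape[0]
    # Sum of outer products over time, then normalize by sequence length.
    outer = torch.einsum('tc,td->cd', feats, feats) / T   # (C, C)
    return outer.flatten()
```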
Learning Spatio-Temporal Representation With Local and Global Diffusion
A novel framework to boost spatio-temporal representation learning by Local and Global Diffusion (LGD) is presented, which constructs a novel neural network architecture that learns the local and global representations in parallel.
Deep Local Video Feature for Action Recognition
Experimental results on the HMDB51 and UCF101 datasets show that a simple maximum pooling on the sparsely sampled local features leads to significant performance improvement.
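The pooling scheme this summary describes is simple enough to sketch directly; the function name and tensor shapes are illustrative assumptions.

```python
import torch

def video_descriptor(clip_feats: torch.Tensor) -> torch.Tensor:
    """Sketch: clip_feats is (num_clips, C), one feature vector per sparsely
    sampled clip; an element-wise max over clips yields the video-level
    feature passed to the classifier."""
    return clip_feats.max(dim=0).values    # (C,)
```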
End-to-end Video-level Representation Learning for Action Recognition
This paper builds upon two-stream ConvNets and proposes Deep networks with Temporal Pyramid Pooling (DTPP), an end-to-end video-level representation learning approach, to address problems of partial observation training and single temporal scale modeling in action recognition.
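A minimal sketch of temporal pyramid pooling as I read the summary; the pyramid levels and the use of average pooling within each chunk are assumptions, not necessarily DTPP's exact configuration.

```python
import torch

def temporal_pyramid_pooling(frame_feats: torch.Tensor,
                             levels=(1, 2, 4)) -> torch.Tensor:
    """frame_feats: (T, C) per-frame features, with T >= max(levels).
    Returns a fixed-length (sum(levels) * C,) descriptor regardless of T."""
    pooled = []
    for level in levels:
        # Partition the T frames into `level` contiguous chunks and
        # average each chunk, capturing coarse-to-fine temporal scales.
        for chunk in torch.chunk(frame_feats, level, dim=0):
            pooled.append(chunk.mean(dim=0))
    return torch.cat(pooled)
```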
Deep Multimodal Feature Encoding for Video Ordering
This work creates a new multimodal dataset for temporal ordering that consists of approximately 30K scenes (2–6 clips per scene) based on the "Large Scale Movie Description Challenge", and analyzes and evaluates the individual and joint modalities on three challenging tasks.
Hidden Two-Stream Convolutional Networks for Action Recognition
This paper presents a novel CNN architecture that implicitly captures motion information between adjacent frames and directly predicts action classes without explicitly computing optical flow, and significantly outperforms the previous best real-time approaches.
Semi-CNN Architecture for Effective Spatio-Temporal Learning in Action Recognition
This paper introduces a fusion convolutional architecture for efficient learning of spatiotemporal features in video action recognition. Unlike 2D CNNs, 3D CNNs can be applied directly on consecutive…

References

Showing 1–10 of 44 references
Convolutional Two-Stream Network Fusion for Video Action Recognition
A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed, and its performance is evaluated on standard benchmarks, where the architecture achieves state-of-the-art results.
Large-Scale Video Classification with Convolutional Neural Networks
This work studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.
Long-Term Temporal Convolutions for Action Recognition
It is demonstrated that LTC-CNN models with increased temporal extents improve the accuracy of action recognition; the work also studies the impact of different low-level representations, such as raw video pixel values and optical flow vector fields, and shows the importance of high-quality optical flow estimation for learning accurate action models.
Unsupervised Learning of Video Representations using LSTMs
This work uses Long Short-Term Memory (LSTM) networks to learn representations of video sequences and evaluates the representations by fine-tuning them for a supervised learning problem: human action recognition on the UCF-101 and HMDB-51 datasets.
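A minimal sketch of the LSTM-encoder pattern this line refers to; the feature dimensions are placeholders, and taking the final hidden state as the sequence representation is an assumption of one common variant.

```python
import torch
import torch.nn as nn

# Run per-frame CNN features through an LSTM and keep the last hidden state.
encoder = nn.LSTM(input_size=2048, hidden_size=512, batch_first=True)

frame_feats = torch.randn(8, 16, 2048)   # (batch, T, C) dummy CNN features
_, (h_n, _) = encoder(frame_feats)
video_repr = h_n[-1]                     # (batch, 512) sequence representation
```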
Two-Stream Convolutional Networks for Action Recognition in Videos
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
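For reference, a sketch of how the two input streams typically differ at the first layer; the filter counts and the flow-stack length L = 10 follow the commonly cited setup, but treat them as assumptions here.

```python
import torch.nn as nn

# Spatial stream: a single RGB frame (3 channels).
spatial_conv1 = nn.Conv2d(in_channels=3, out_channels=96,
                          kernel_size=7, stride=2)

# Temporal stream: a stack of L = 10 dense optical flow fields,
# each contributing an x and a y displacement channel (2 * L inputs).
temporal_conv1 = nn.Conv2d(in_channels=2 * 10, out_channels=96,
                           kernel_size=7, stride=2)
```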
Human Action Recognition Using Factorized Spatio-Temporal Convolutional Networks
Factorized spatio-temporal convolutional networks (FstCN) are proposed that factorize the original 3D convolution kernel learning into a sequential process of learning 2D spatial kernels in the lower layers, followed by learning 1D temporal kernels in the upper layers.
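The factorization itself can be sketched with standard layers; the channel counts and kernel sizes below are illustrative, not FstCN's actual configuration.

```python
import torch.nn as nn

# A 3D convolution replaced by a 2D spatial convolution (1x3x3) followed by
# a 1D temporal convolution (3x1x1), each written as a Conv3d with degenerate
# kernel dimensions. Input/output shape: (N, 64, T, H, W).
spatial_conv = nn.Conv3d(64, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1))
temporal_conv = nn.Conv3d(64, 64, kernel_size=(3, 1, 1), padding=(1, 0, 0))

def factorized_block(x):
    return temporal_conv(spatial_conv(x))   # spatial kernels first, then temporal
```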
Beyond short snippets: Deep networks for video classification
This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full-length videos.
Action recognition with trajectory-pooled deep-convolutional descriptors
This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features and deep-learned features, and achieves superior performance to the state of the art on these datasets.
Long-term recurrent convolutional networks for visual recognition and description
A novel recurrent convolutional architecture suitable for large-scale visual learning is presented that is end-to-end trainable; such models are shown to have distinct advantages over state-of-the-art models for recognition or generation that are separately defined and/or optimized.
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. [...]