Correlation Net: Spatiotemporal multimodal deep learning for action recognition

Novanto Yudistira and Takio Kurita. Signal Processing: Image Communication.
Abstract: This paper describes a network that captures multimodal correlations over arbitrary timestamps. The proposed scheme operates as a complementary, extended network on top of a multimodal convolutional neural network (CNN). A deep CNN for action recognition requires both spatial and temporal streams, but reducing overfitting and fusing these two streams remain open problems. The existing fusion approach simply averages the two streams. Here we propose a correlation network with a Shannon fusion…
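The averaging fusion that the abstract cites as the existing baseline can be sketched as follows. This is a generic two-stream late-fusion sketch, not the authors' Correlation Net; the logits and class count are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the class axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def average_fusion(spatial_logits, temporal_logits):
    # Baseline two-stream fusion: average the per-class softmax
    # scores of the RGB (spatial) and optical-flow (temporal) streams.
    return 0.5 * (softmax(spatial_logits) + softmax(temporal_logits))

# Illustrative logits for 3 action classes from each stream.
spatial = np.array([2.0, 0.5, -1.0])
temporal = np.array([1.0, 1.5, -0.5])
fused = average_fusion(spatial, temporal)
pred = int(np.argmax(fused))  # class index after fusion
```

The averaged scores still form a probability distribution, which is what makes this baseline easy to compare against learned fusion schemes.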
A resource conscious human action recognition framework using 26-layered deep convolutional neural network
A new 26-layered Convolutional Neural Network (CNN) architecture for accurate complex action recognition is designed, and a feature selection method named Poisson distribution along with Univariate Measures (PDaUM) is proposed.
Improved Soccer Action Spotting using both Audio and Video Streams
This work used the SoccerNet benchmark dataset, which contains annotated events for 500 soccer game videos from the Big Five European leagues, and evaluated several ways to integrate the audio stream into video-only architectures.
Motion Guided Feature-Augmented Network for Action Recognition
Motion information is a crucial factor in identifying human actions in videos. Existing state-of-the-art methods use traditional optical flow features representing the short-term motion.
Human Action Recognition using Machine Learning in Uncontrolled Environment
An efficient technique to classify human actions is presented, with steps including removing redundant frames from videos, extracting Segments of Interest (SoIs), and feature descriptor mining through Geodesic Distance (GD), 3D Cartesian-plane Features (3D-CF), Joints MOCAP (JMOCAP), and n-way Point Trajectory Generation (nPTG).
An Overview of Deep Learning Techniques for Biometric Systems
An overview of systems and applications that apply deep learning to biometric systems is provided, classifying them according to biometric modalities, along with a detailed analysis of several existing approaches that combine biometric systems with deep learning methods.
Deep Learning for Visual Content Analysis
Weakly-Supervised Action Localization and Action Recognition using Global-Local Attention of 3D CNN
Experimental results show that the proposed approach produces informative visual explanations and discriminative attention, and the action recognition via attention gating on each layer produces better classification results than the baseline model.


Gated spatio and temporal convolutional neural network for activity recognition: towards gated multimodal deep learning
This paper investigates how a gating CNN scheme, inspired by the way the associative cortex integrates expert networks, can be formed, and shows that with proper treatment the gating CNN scheme works well, suggesting future approaches to information integration in activity recognition.
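The gating idea, a small network weighting the expert streams per input, can be sketched as a soft mixture of experts. This is a minimal sketch under assumed shapes, not the paper's architecture; the gate logits would normally come from a learned network:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax.
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_fusion(expert_outputs, gate_logits):
    # expert_outputs: (E, C) class scores from E expert streams.
    # gate_logits: (E,) raw gate scores for this input.
    # The gate becomes a convex combination over the experts.
    w = softmax(gate_logits)
    return w @ expert_outputs

experts = np.array([[0.8, 0.2],    # spatial expert's class scores
                    [0.3, 0.7]])   # temporal expert's class scores
# Gate strongly favours the spatial expert for this (hypothetical) input.
fused = gated_fusion(experts, gate_logits=np.array([2.0, 0.0]))
```

Because the gate weights sum to one and each expert row is a distribution, the fused output remains a distribution, unlike unnormalized multiplicative schemes.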
Convolutional Two-Stream Network Fusion for Video Action Recognition
A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed, and its performance on standard benchmarks where this architecture achieves state-of-the-art results is evaluated.
Beyond short snippets: Deep networks for video classification
This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full length videos.
Combining multiple sources of knowledge in deep CNNs for action recognition
This paper presents a spatially varying multiplicative fusion method for combining multiple CNNs trained on different sources that results in robust prediction by amplifying or suppressing the feature activations based on their agreement.
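The agreement-amplifying effect of multiplicative fusion can be illustrated with a minimal sketch: an elementwise product of two streams' activations is large only where both streams respond. The spatially varying weights of the paper are omitted here; activations are illustrative:

```python
import numpy as np

def multiplicative_fusion(feat_a, feat_b, eps=1e-6):
    # Elementwise product: large only where BOTH streams respond,
    # so agreement is amplified and disagreement suppressed.
    fused = feat_a * feat_b
    # Renormalize so the fused map stays at a comparable scale.
    return fused / (np.linalg.norm(fused) + eps)

a = np.array([0.9, 0.8, 0.1, 0.0])   # stream A activations
b = np.array([0.8, 0.1, 0.9, 0.7])   # stream B activations
fused = multiplicative_fusion(a, b)
# Only position 0, where both streams fire, remains dominant.
```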
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.
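TSN's key recipe, sparse segment-based sampling followed by a segmental consensus, can be sketched as follows. The per-snippet scores here are stand-ins for CNN predictions, and averaging is only one of the consensus functions the paper evaluates:

```python
import numpy as np

def sample_snippets(num_frames, num_segments, rng):
    # Divide the video into equal segments and draw one frame index
    # uniformly at random from each segment (sparse sampling).
    bounds = np.linspace(0, num_frames, num_segments + 1).astype(int)
    return [int(rng.integers(lo, hi)) for lo, hi in zip(bounds[:-1], bounds[1:])]

def segmental_consensus(snippet_scores):
    # Average consensus: combine per-snippet class scores into a
    # single video-level prediction.
    return np.mean(snippet_scores, axis=0)

rng = np.random.default_rng(0)
idx = sample_snippets(num_frames=300, num_segments=3, rng=rng)
# Stand-in per-snippet class scores (3 snippets x 4 classes).
scores = np.array([[0.1, 0.7, 0.1, 0.1],
                   [0.2, 0.5, 0.2, 0.1],
                   [0.1, 0.6, 0.2, 0.1]])
video_pred = int(np.argmax(segmental_consensus(scores)))
```

Sampling one snippet per segment keeps the cost of a forward pass low while still covering the whole duration of the video.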
Two-Stream Convolutional Networks for Action Recognition in Videos
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
Action Recognition Based on Efficient Deep Feature Learning in the Spatio-Temporal Domain
A simple yet robust 2-D convolutional neural network is extended to a concatenated 3-D network that learns to extract features from the spatio-temporal domain of raw video data and is used for content-based recognition of videos.
Long-Term Temporal Convolutions for Action Recognition
It is demonstrated that LTC-CNN models with increased temporal extents improve action recognition accuracy; the work also studies the impact of different low-level representations, such as raw pixel values and optical flow vector fields, and shows the importance of high-quality optical flow estimation for learning accurate action models.
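The core operation behind long-term temporal convolutions, convolving along the time axis over an extended clip, can be sketched with a 1-D per-channel temporal convolution over frame features. Shapes and the kernel are illustrative assumptions; the actual LTC networks are 3-D CNNs over clips of 60+ frames:

```python
import numpy as np

def temporal_conv(features, kernel):
    # features: (T, C) per-frame feature vectors; kernel: (K,) weights.
    # Convolve each channel along time ('valid' keeps only positions
    # fully covered by the kernel), mixing information across K frames.
    T, C = features.shape
    K = len(kernel)
    out = np.empty((T - K + 1, C))
    for c in range(C):
        out[:, c] = np.convolve(features[:, c], kernel, mode="valid")
    return out

feats = np.arange(12, dtype=float).reshape(6, 2)   # 6 frames, 2 channels
smoothed = temporal_conv(feats, np.ones(3) / 3.0)  # 3-frame moving average
```

Widening the kernel (or stacking such layers) grows the temporal receptive field, which is the "longer temporal extent" that LTC shows to help.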
Large-Scale Video Classification with Convolutional Neural Networks
This work studies multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.
Action recognition with trajectory-pooled deep-convolutional descriptors
This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features and deep-learned features, and achieves superior performance to the state of the art on these datasets.