Deep Learning for Action and Gesture Recognition in Image Sequences: A Survey

Maryam Asadi-Aghbolaghi, Albert Clapés, Marco Bellantonio, Hugo Jair Escalante, Víctor Ponce-López, Xavier Baró, Isabelle Guyon, Shohreh Kasaei, Sergio Escalera
Published in: Gesture Recognition
Interest in automatic action and gesture recognition has grown considerably in the last few years. […] We introduce a taxonomy that summarizes important aspects of deep learning for approaching both tasks. Details of the proposed architectures, fusion strategies, main datasets, and competitions are reviewed. Also, we summarize and discuss the main works proposed so far, with particular interest in how they treat the temporal dimension of data, their highlighting features, and opportunities and…
Multimodal 2DCNN action recognition from RGB-D data with video summarization
This work extends the 2DCNN to a multimodal version (MM2DCNN) by introducing scene flow fields as the input to an additional stream, and integrates the streams with late fusion for every summarization sequence modality, along with uniform random selection.
Small Deep Learning Models for Hand Gesture Recognition
  • A. A. Q. Mohammed, Jiancheng Lv, Md. Sajjatul Islam
  • Computer Science
    2019 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)
  • 2019
An approach is proposed that leverages recent progress in Convolutional Neural Networks (CNNs) and uses a deep-learning-based strategy to recognize 24 hand gestures from American Sign Language (ASL); the resulting models are feasible to deploy on resource-constrained devices and in embedded visual applications.
Similar Finger Gesture Recognition using Triplet-loss Networks
A framework based on a triplet-loss network which learns to decrease the distance of true positive boundaries while increasing that of false positive ones is proposed, and a temporal representation of the segmented gesture is adopted using a stack of feature maps for gesture classification.
Action Recognition from RGB-D Data: Comparison and Fusion of Spatio-Temporal Handcrafted Features and Deep Strategies
Multimodal fusion of RGB-D data is analyzed for action recognition by using scene flow as early fusion and integrating the results of all modalities in a late fusion fashion, achieving state-of-the-art results.
Dynamic hand gesture recognition based on short-term sampling neural networks
A novel deep learning network for hand gesture recognition is proposed that integrates several well-proven modules to learn both short-term and long-term features from video inputs while avoiding intensive computation.
An Incremental Learning Framework for Skeletal-based Hand Gesture Recognition with Leap Motion
A novel framework consisting of an incremental learning (IL) algorithm without a deep structure is proposed and applied to hand gesture classification, aimed explicitly at Leap Motion (LM) data; it distinctly improves on an LSTM network in robustness and training time.
Learning dictionaries of kinematic primitives for action classification
The method is shown to be tolerant to viewpoint changes and can thus support cross-view action recognition; it may be seen as a backbone of a general approach to action understanding, with potential applications in robotics.
Beyond Joints: Learning Representations From Primitive Geometries for Skeleton-Based Action Recognition and Detection
This work aims to leverage the geometric relations among joints for action recognition by introducing three primitive geometries (joints, edges, and surfaces), and dramatically outperforms the existing state-of-the-art methods for both action recognition and action detection.


Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.
Deep learning based super-resolution for improved action recognition
The experimental results obtained on a down-sampled version of a large subset of the Hollywood2 benchmark database show the importance of the proposed system in increasing the recognition rate of a state-of-the-art action recognition system when handling low-resolution videos.
Two-Stream Convolutional Networks for Action Recognition in Videos
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
A Key Volume Mining Deep Framework for Action Recognition
A key volume mining deep framework to identify key volumes and conduct classification simultaneously and an effective yet simple "unsupervised key volume proposal" method for high quality volume sampling are proposed.
Two-Stream SR-CNNs for Action Recognition in Videos
This paper proposes a new deep architecture by incorporating human/object detection results into the framework, called two-stream semantic region based CNNs (SR-CNNs), which not only shares great modeling capacity with the original two-stream CNNs, but also exhibits the flexibility of leveraging semantic cues for action understanding.
First Person Action Recognition Using Deep Learned Descriptors
This work proposes convolutional neural networks (CNNs) for end-to-end learning and classification of the wearer's actions and shows that the proposed network can generalize and give state-of-the-art performance on various disparate egocentric action datasets.
Deep Learning-Based Fast Hand Gesture Recognition Using Representative Frames
A vision-based hand gesture recognition system for intelligent vehicles is proposed that uses novel tiled image patterns and tiled binary patterns within a semantic segmentation-based deep learning framework, the deconvolutional neural network, and an improved classification accuracy is observed.
3D-based Deep Convolutional Neural Network for action recognition with depth sequences
Towards Good Practices for Very Deep Two-Stream ConvNets
This report presents very deep two-stream ConvNets for action recognition, adapting recent very deep architectures to the video domain, and extends the Caffe toolbox with a multi-GPU implementation that offers high computational efficiency and low memory consumption.
Learning Deep Features for Scene Recognition using Places Database
A new scene-centric database called Places with over 7 million labeled pictures of scenes is introduced with new methods to compare the density and diversity of image datasets and it is shown that Places is as dense as other scene datasets and has more diversity.