Leveraging Semantic Scene Characteristics and Multi-Stream Convolutional Architectures in a Contextual Approach for Video-Based Visual Emotion Recognition in the Wild

@inproceedings{Pikoulis2021LeveragingSS,
  title={Leveraging Semantic Scene Characteristics and Multi-Stream Convolutional Architectures in a Contextual Approach for Video-Based Visual Emotion Recognition in the Wild},
  author={Ioannis Pikoulis and Panagiotis Paraskevas Filntisis and Petros Maragos},
  booktitle={2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021)},
  year={2021},
  pages={1--8}
}
In this work we tackle the task of video-based visual emotion recognition in the wild. Standard methodologies that rely solely on the extraction of bodily and facial features often fall short of accurate emotion prediction in cases where the aforementioned sources of affective information are inaccessible due to head/body orientation, low resolution and poor illumination. We aspire to alleviate this problem by leveraging visual context in the form of scene characteristics and attributes, as… 
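As a rough illustration of the contextual multi-stream idea described above (not the authors' actual architecture), the following PyTorch sketch fuses hypothetical body, face, and scene-attribute features by late concatenation; the stream dimensions and the 26 emotion categories are illustrative assumptions.

```python
# Minimal sketch of late fusion over multiple visual streams.
# All dimensions and names are illustrative, not the paper's model.
import torch
import torch.nn as nn

class MultiStreamFusion(nn.Module):
    def __init__(self, feat_dims=(512, 512, 365), num_emotions=26):
        super().__init__()
        # One small projection head per stream (body, face, scene attributes).
        self.heads = nn.ModuleList([nn.Linear(d, 256) for d in feat_dims])
        self.classifier = nn.Sequential(
            nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256 * len(feat_dims), num_emotions),
        )

    def forward(self, streams):
        # streams: list of (batch, dim) feature tensors, one per stream.
        fused = torch.cat([h(s) for h, s in zip(self.heads, streams)], dim=-1)
        return self.classifier(fused)

model = MultiStreamFusion()
body = torch.randn(4, 512)    # stand-in for a body-stream CNN embedding
face = torch.randn(4, 512)    # stand-in for a face-stream CNN embedding
scene = torch.randn(4, 365)   # stand-in for Places-style scene attribute scores
print(model([body, face, scene]).shape)   # torch.Size([4, 26])
```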

Citations

Leveraging Multi-stream Information Fusion for Trajectory Prediction in Low-illumination Scenarios: A Multi-channel Graph Convolutional Approach

A novel approach for trajectory prediction in low-illumination scenarios that leverages multi-stream information fusion, effectively integrating image, optical flow, and object trajectory information in the prediction module.

Contextually-rich human affect perception using multimodal scene information

This work leverages pretrained vision-language (VLN) models to extract descriptions of foreground context from images and proposes a multimodal context fusion (MCF) module to combine foreground cues with the visual scene and person-based contextual information for emotion prediction.

Medical Face Masks and Emotion Recognition from the Body: Insights from a Deep Learning Perspective

A deep learning model based on the Temporal Segment Network framework is utilized to recognize emotion from the body and overcome the loss of facial cues caused by a medical face mask; experimental results suggest that spatial structure plays the more important role in emotional expression, while temporal structure is complementary.

Context Based Vision Emotion Recognition in the Wild

A multi-head cross attention network (MHCAN) is proposed to distinguish more subtle changes in expression and improve the accuracy of emotion recognition.
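As a generic illustration of the cross-attention mechanism such a network builds on (not the MHCAN implementation), the snippet below lets a hypothetical set of expression tokens attend over context tokens with PyTorch's built-in multi-head attention; all shapes and token meanings are assumptions.

```python
# Multi-head cross attention: queries come from one source, keys/values from another.
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

expr_tokens = torch.randn(2, 16, 256)  # queries: hypothetical expression features
ctx_tokens = torch.randn(2, 49, 256)   # keys/values: hypothetical 7x7 context grid

attended, weights = cross_attn(expr_tokens, ctx_tokens, ctx_tokens)
print(attended.shape)   # torch.Size([2, 16, 256]), context-refined expression tokens
print(weights.shape)    # torch.Size([2, 16, 49]), averaged attention maps
```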

Contextual modulation of affect: Comparing humans and deep neural networks

The study emphasizes the importance of a more holistic, multi-modal training regime with richer human data to build better emotion-understanding systems in the area of affective computing.

An audiovisual and contextual approach for categorical and continuous emotion recognition in-the-wild

This work tackles the task of video-based audio-visual emotion recognition within the framework of the 2nd Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW2), using a standard CNN-RNN cascade as the backbone of the proposed model for sequence-to-sequence (seq2seq) learning.
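A minimal sketch of such a CNN-RNN cascade for frame-level (seq2seq) prediction is given below; the ResNet-18 backbone, bidirectional GRU, and 7 output classes are assumptions for illustration, not the authors' exact configuration.

```python
# A 2D CNN embeds each frame, a GRU models temporal context,
# and a linear head emits one prediction per frame.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CnnRnnCascade(nn.Module):
    def __init__(self, hidden=256, num_outputs=7):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()   # keep the 512-d pooled features
        self.cnn = backbone
        self.rnn = nn.GRU(512, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_outputs)

    def forward(self, clip):
        # clip: (batch, time, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)  # per-frame embeddings
        seq, _ = self.rnn(feats)
        return self.head(seq)   # (batch, time, num_outputs)

model = CnnRnnCascade()
print(model(torch.randn(2, 8, 3, 112, 112)).shape)   # torch.Size([2, 8, 7])
```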

References

Showing 1-10 of 44 references

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
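The core two-stream recipe can be sketched as follows, assuming ResNet-18 backbones and UCF-101's 101 classes purely for illustration (the original work used a different CNN and training pipeline): a spatial network scores an RGB frame, a temporal network scores a stack of optical-flow fields, and the class scores are fused by averaging.

```python
# Two-stream late fusion: RGB frame stream + stacked optical-flow stream.
import torch
import torch.nn as nn
from torchvision.models import resnet18

def make_stream(in_channels, num_classes):
    net = resnet18(weights=None, num_classes=num_classes)
    # Swap the first conv so the temporal stream accepts stacked flow fields.
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
    return net

num_classes = 101                        # e.g. UCF-101; illustrative
spatial = make_stream(3, num_classes)    # single RGB frame
temporal = make_stream(20, num_classes)  # 10 flow frames x (horizontal, vertical)

rgb = torch.randn(2, 3, 224, 224)
flow = torch.randn(2, 20, 224, 224)
scores = (spatial(rgb).softmax(-1) + temporal(flow).softmax(-1)) / 2   # late fusion
print(scores.shape)   # torch.Size([2, 101])
```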

An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition

A novel Attention Enhanced Graph Convolutional LSTM Network (AGC-LSTM) for human action recognition from skeleton data can not only capture discriminative features in spatial configuration and temporal dynamics but also explore the co-occurrence relationship between spatial and temporal domains.

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Deep convolutional networks have achieved great success for visual recognition in still images, but for action recognition in videos their advantage over traditional methods is less evident; Temporal Segment Networks (TSN) address this with a sparse temporal sampling strategy and video-level supervision, enabling efficient and effective learning over the whole video.
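A toy sketch of that recipe follows, assuming a ResNet-18 snippet encoder, three segments, and average consensus (the paper also studies other consensus functions and input modalities).

```python
# TSN-style sparse sampling: score one snippet per segment with a shared CNN,
# then average the snippet scores into a video-level prediction.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TSNLike(nn.Module):
    def __init__(self, num_segments=3, num_classes=101):
        super().__init__()
        self.num_segments = num_segments
        self.cnn = resnet18(weights=None, num_classes=num_classes)  # shared across snippets

    def forward(self, snippets):
        # snippets: (batch, num_segments, 3, H, W), one frame sampled per segment
        b, k = snippets.shape[:2]
        scores = self.cnn(snippets.flatten(0, 1)).view(b, k, -1)
        return scores.mean(dim=1)   # segmental (average) consensus

model = TSNLike()
video = torch.randn(2, 3, 3, 224, 224)   # 2 videos, 3 sparsely sampled snippets each
print(model(video).shape)                # torch.Size([2, 101])
```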

Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition

A novel two-stream adaptive graph convolutional network (2s-AGCN) for skeleton-based action recognition that increases the flexibility of the model for graph construction and brings more generality to adapt to various data samples.
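In heavily simplified form, the adaptive-graph idea can be illustrated by adding a fully learnable adjacency matrix to a fixed skeleton graph, so edges absent from the physical skeleton can still be learned; the joint count and placeholder adjacency below are assumptions, and the actual 2s-AGCN additionally uses a data-dependent attention graph and a bone stream.

```python
# Adaptive graph convolution over skeleton joints (single frame, simplified).
import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    def __init__(self, in_ch, out_ch, adjacency):
        super().__init__()
        self.register_buffer("A_fixed", adjacency)                   # physical skeleton graph
        self.A_learned = nn.Parameter(torch.zeros_like(adjacency))   # data-driven extra edges
        self.proj = nn.Linear(in_ch, out_ch)

    def forward(self, x):
        # x: (batch, num_joints, in_ch) joint features for one frame
        A = self.A_fixed + self.A_learned
        return self.proj(A @ x)   # aggregate over neighbours, then project features

num_joints = 25                          # e.g. NTU RGB+D skeleton; illustrative
A = torch.eye(num_joints)                # placeholder adjacency (self-loops only)
layer = AdaptiveGraphConv(3, 64, A)
joints = torch.randn(8, num_joints, 3)   # xyz coordinates per joint
print(layer(joints).shape)               # torch.Size([8, 25, 64])
```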

Context-Aware Emotion Recognition Networks

Deep networks for context-aware emotion recognition, called CAER-Net, are presented that exploit not only the human facial expression but also context information in a joint and boosting manner; the key idea is to hide human faces in a visual scene and seek other contexts based on an attention mechanism.
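A hedged sketch of that intuition (not the released CAER-Net code): the face region is blanked out of the scene image so the context stream cannot rely on it, and a small spatial-attention module pools the remaining context features; the face box, channel sizes, and module names here are illustrative assumptions.

```python
# Face hiding plus soft spatial attention over context features.
import torch
import torch.nn as nn

def hide_face(image, box):
    # image: (3, H, W); box: (x1, y1, x2, y2) face coordinates, assumed given by a detector
    x1, y1, x2, y2 = box
    masked = image.clone()
    masked[:, y1:y2, x1:x2] = 0.0
    return masked

class ContextAttentionPool(nn.Module):
    """Soft spatial attention pooling over a context feature map."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, fmap):
        # fmap: (batch, C, H, W) features of the face-hidden scene
        attn = self.score(fmap).flatten(2).softmax(dim=-1)   # (batch, 1, H*W)
        feats = fmap.flatten(2)                              # (batch, C, H*W)
        return (feats * attn).sum(dim=-1)                    # (batch, C) attended context

image = torch.rand(3, 224, 224)
context_only = hide_face(image, (80, 60, 150, 140))   # illustrative face box
pool = ContextAttentionPool(channels=256)
context_vec = pool(torch.randn(2, 256, 14, 14))       # stand-in for a context CNN output
print(context_vec.shape)                              # torch.Size([2, 256])
```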

Facial Expression Recognition Based on Deep Evolutional Spatial-Temporal Networks

This paper proposes a part-based hierarchical bidirectional recurrent neural network (PHRNN) to analyze facial expression information in temporal sequences, reducing the error rates of the previous best results on three widely used facial expression databases.

Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition

A novel model of dynamic skeletons called Spatial-Temporal Graph Convolutional Networks (ST-GCN), which moves beyond the limitations of previous methods by automatically learning both the spatial and temporal patterns from data.
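A single, simplified ST-GCN block might look as follows; the adjacency, joint count, and kernel size are placeholders, and the original model additionally uses spatial-configuration partitioning and learnable edge-importance weighting.

```python
# One simplified ST-GCN block: graph convolution over joints, then temporal convolution.
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    def __init__(self, in_ch, out_ch, adjacency, t_kernel=9):
        super().__init__()
        # Symmetrically normalize the adjacency (with self-loops) once, up front.
        A = adjacency + torch.eye(adjacency.size(0))
        d = A.sum(dim=1).rsqrt()
        self.register_buffer("A_norm", d[:, None] * A * d[None, :])
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # per-joint feature transform
        self.temporal = nn.Conv2d(out_ch, out_ch,
                                  kernel_size=(t_kernel, 1),
                                  padding=(t_kernel // 2, 0))   # convolve along time only
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (batch, channels, time, joints)
        x = self.spatial(x)
        x = torch.einsum("nctv,vw->nctw", x, self.A_norm)       # spatial graph aggregation
        return self.relu(self.temporal(x))

V = 18                                            # e.g. OpenPose joints; illustrative
A = torch.zeros(V, V); A[0, 1] = A[1, 0] = 1.0    # toy skeleton with a single bone
block = STGCNBlock(3, 64, A)
print(block(torch.randn(2, 3, 16, V)).shape)      # torch.Size([2, 64, 16, 18])
```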

EmotiCon: Context-Aware Multimodal Emotion Recognition Using Frege’s Principle

This work presents EmotiCon, a learning-based algorithm for context-aware perceived human emotion recognition from videos and images that combines three interpretations of context for emotion recognition based on Frege's Context Principle.

STEP: Spatial Temporal Graph Convolutional Networks for Emotion Perception from Gaits

A novel classifier network called STEP, based on a Spatial Temporal Graph Convolutional Network (ST-GCN) architecture, classifies perceived human emotion from gaits; it learns affective features and achieves a classification accuracy of 88% on E-Gait, 14–30% more accurate than prior methods.

Places: An Image Database for Deep Scene Understanding

The Places Database is described, a repository of 10 million scene photographs, labeled with scene semantic categories and attributes, comprising a quasi-exhaustive list of the types of environments encountered in the world.