EmoNets: Multimodal deep learning approaches for emotion recognition in video

@article{Kahou2015EmoNetsMD,
  title={EmoNets: Multimodal deep learning approaches for emotion recognition in video},
  author={Samira Ebrahimi Kahou and Xavier Bouthillier and Pascal Lamblin and Çağlar G{\"u}lçehre and Vincent Michalski and Kishore Reddy Konda and S{\'e}bastien Jean and Pierre Froumenty and Yann Dauphin and Nicolas Boulanger-Lewandowski and Raul Chandias Ferrari and Mehdi Mirza and David Warde-Farley and Aaron C. Courville and Pascal Vincent and Roland Memisevic and Christopher Joseph Pal and Yoshua Bengio},
  journal={Journal on Multimodal User Interfaces},
  year={2015},
  volume={10},
  pages={99-111}
}
The task of the Emotion Recognition in the Wild (EmotiW) Challenge is to assign one of seven emotions to short video clips extracted from Hollywood-style movies. The videos depict acted-out emotions under realistic conditions with a large degree of variation in attributes such as pose and illumination, making it worthwhile to explore approaches that consider combinations of features from multiple modalities for label assignment. In this paper we present our approach to learning several…
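As context for the multimodal combination the abstract alludes to, below is a minimal illustrative sketch of decision-level (late) fusion: each modality-specific model produces a probability distribution over the seven EmotiW emotion classes, and the distributions are combined with a weighted average before taking the argmax. The modality names, weights, and probability values are assumptions for illustration only, not the paper's exact pipeline.

# Minimal sketch of weighted late fusion over per-modality class probabilities
# for the seven EmotiW emotion classes. Illustrative only: modality names,
# weights, and probabilities are assumed, not taken from the paper.
import numpy as np

EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def late_fusion(modality_probs, weights):
    """Weighted average of per-modality class-probability vectors."""
    fused = np.zeros(len(EMOTIONS))
    total = 0.0
    for name, probs in modality_probs.items():
        w = weights.get(name, 0.0)
        fused += w * np.asarray(probs)
        total += w
    return fused / total if total > 0.0 else fused

# Hypothetical softmax outputs of two modality-specific models for one clip.
modality_probs = {
    "face_cnn": [0.05, 0.02, 0.03, 0.70, 0.05, 0.10, 0.05],
    "audio":    [0.10, 0.05, 0.05, 0.50, 0.10, 0.10, 0.10],
}
weights = {"face_cnn": 0.7, "audio": 0.3}  # e.g. tuned on a validation set

fused = late_fusion(modality_probs, weights)
print("predicted emotion:", EMOTIONS[int(np.argmax(fused))])  # -> happy

In practice the fusion weights would be chosen on held-out validation data, and additional modality streams could be added to the dictionary without changing the fusion code.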
Deep learning-based late fusion of multimodal information for emotion classification of music video
TLDR
Human high-level emotions are classified well by the proposed CNN-based multimodal networks, even though only a small amount of labeled data is available for training.
Deep emotion recognition based on audio-visual correlation
TLDR
Results show that temporal alignment of the data between the two modalities improves recognition performance significantly; canonical correlation analysis and t-distributed stochastic neighbour embedding are used to validate the experiments.
AttendAffectNet–Emotion Prediction of Movie Viewers Using Multimodal Fusion with Self-Attention
TLDR
The proposed AttendAffectNet (AAN) is extensively trained and validated on both the MediaEval 2016 dataset for the Emotional Impact of Movies Task and the extended COGNIMUSE dataset, demonstrating that audio features play a more influential role than those extracted from video and movie subtitles when predicting the emotions of movie viewers on these datasets.
End-to-End Multimodal Emotion Recognition Using Deep Neural Networks
TLDR
This work proposes an emotion recognition system using auditory and visual modalities: a convolutional neural network extracts features from the speech signal, while a 50-layer deep residual network is used for the visual modality.
A Deep Feature based Multi-kernel Learning Approach for Video Emotion Recognition
TLDR
This paper focuses on the sub-challenge of Audio-Video Based Emotion Recognition using the AFEW dataset and extracts LBP-TOP-based video features, openEAR energy/spectral-based audio features, and CNN (convolutional neural network) based deep image features by fine-tuning a pre-trained model with extra emotion images from the web.
Deep Learning-Based Emotion Recognition from Real-Time Videos
TLDR
A novel framework for emotional state detection from facial expressions, targeted at learning environments, based on a convolutional deep neural network that classifies the emotions of people captured through a webcam and is integrated into an affective pedagogical agent system.
Emotion recognition using multimodal deep learning in multiple psychophysiological signals and video
TLDR
An approach is proposed that trains several specialist networks and employs deep learning techniques to fuse the features of individual modalities, solving the problems of feature redundancy and lack of key features caused by multimodal fusion.
Audio-Visual Emotion Recognition in Video Clips
TLDR
This paper presents a multimodal emotion recognition system, which is based on the analysis of audio and visual cues, and defines the current state-of-the-art in all three databases.
Multimodal Deep Models for Predicting Affective Responses Evoked by Movies
The goal of this study is to develop and analyze multimodal models for predicting experienced affective responses of viewers watching movie clips. We develop hybrid multimodal prediction models based…
Fusion of classifier predictions for audio-visual emotion recognition
TLDR
A novel multimodal emotion recognition system based on the analysis of audio and visual cues, which summarises each emotion video into a reduced set of key-frames that are learnt to visually discriminate emotions by means of a convolutional neural network.

References

Showing 1–10 of 48 references
Combining modality specific deep neural networks for emotion recognition in video
In this paper we present the techniques used for the University of Montréal's team submissions to the 2013 Emotion Recognition in the Wild Challenge. The challenge is to classify the emotions…
Multiple kernel learning for emotion recognition in the wild
TLDR
This work proposes a method to automatically detect emotions in unconstrained settings as part of the 2013 Emotion Recognition in the Wild Challenge and achieves competitive results, with an accuracy gain of approximately 10% above the challenge baseline.
Combining Multiple Kernel Methods on Riemannian Manifold for Emotion Recognition in the Wild
TLDR
The method for the Emotion Recognition in the Wild Challenge (EmotiW 2014) is presented, and an optimal fusion of classifiers learned from different kernels and different modalities (video and audio) is conducted at the decision level for further boosting the performance.
Combining Multimodal Features with Hierarchical Classifier Fusion for Emotion Recognition in the Wild
TLDR
This paper investigates a variety of multimodal features from video and audio to evaluate their discriminative ability for human emotion analysis, and proposes a novel hierarchical classifier fusion method for all the extracted features.
Emotion Recognition In The Wild Challenge 2014: Baseline, Data and Protocol
TLDR
The goal of this Grand Challenge is to carry forward the common platform defined during EmotiW 2013, for evaluation of emotion recognition methods in real-world conditions, using the Acted Facial Expressions in the Wild (AFEW) 4.0 database.
Partial least squares regression on grassmannian manifold for emotion recognition
TLDR
For each video clip, all frames are represented as an image set, which can be modeled as a linear subspace embedded in a Grassmannian manifold; an optimal fusion of classifiers learned from both modalities is conducted at the decision level.
Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis
TLDR
This paper presents an extension of the Independent Subspace Analysis algorithm to learn invariant spatio-temporal features from unlabeled video data and finds that this method performs surprisingly well when combined with deep learning techniques such as stacking and convolution to learn hierarchical representations.
Why is facial expression analysis in the wild challenging?
TLDR
It turns out that under close-to-real conditions, especially with co-occurring speech, it is hard even for humans to assign emotion labels to clips based on video alone; the paper discusses the resulting challenges for facial expression analysis in the wild.
Emotion recognition in the wild challenge 2013
TLDR
The Emotion Recognition In The Wild Challenge and Workshop (EmotiW) 2013 Grand Challenge consists of an audio-video based emotion classification challenge which mimics real-world conditions.
Emotion Recognition in the Wild with Feature Fusion and Multiple Kernel Learning
TLDR
A new feature descriptor, Histogram of Oriented Gradients from Three Orthogonal Planes (HOG_TOP), is proposed to represent facial expressions, and the properties of visual and audio features are explored to find an optimal feature fusion.