Corpus ID: 19104745

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

Didan Deng, Yuqian Zhou, Jimin Pi, Bertram E. Shi
The integration of information across multiple modalities and across time is a promising way to enhance the emotion recognition performance of affective systems. We describe here a multimodal neural architecture that integrates visual information over time using an LSTM and combines it with utterance-level audio and text cues to recognize human sentiment from multimodal clips. Our model outperforms the unimodal baseline, achieving a concordance correlation coefficient (CCC) of 0.400 on the…
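The concordance correlation coefficient (CCC) used as the evaluation metric above has a standard closed form, 2·cov(x, y) / (σx² + σy² + (μx − μy)²). A minimal sketch (the function name `ccc` is ours, not from the paper):

```python
import numpy as np

def ccc(x, y):
    """Concordance correlation coefficient (Lin's CCC).

    Measures agreement between predictions x and targets y on a
    continuous scale; 1.0 is perfect agreement, 0 is no agreement.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()            # population variances
    cov = ((x - mx) * (y - my)).mean()   # population covariance
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)
```

Unlike Pearson correlation, CCC also penalizes differences in scale and offset between predictions and targets, which is why it is the standard metric for dimensional (valence/arousal) emotion regression.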

Deep Emotion Recognition in Dynamic Data using Facial, Speech and Textual Cues: A Survey

This survey introduces widely accepted emotion models to ground the definition of emotion, and reviews the state of the art in unimodal emotion recognition, covering facial expression recognition, speech emotion recognition, and textual emotion recognition.

A Deep Multi-task Contextual Attention Framework for Multi-modal Affect Analysis

This article explores a contextual inter-modal attention framework that leverages the associations among neighboring utterances and their multimodal information, and finds that, compared with a single-task learning framework, it yields better performance on the inter-related participating tasks.

A Personalized Affective Memory Neural Model for Improving Emotion Recognition

A neural model based on a conditional adversarial autoencoder is presented to learn to represent and edit general emotion expressions, and Grow-When-Required networks are proposed as personalized affective memories that learn individualized aspects of emotion expression.

MIMAMO Net: Integrating Micro- and Macro-motion for Video Emotion Recognition

This paper proposes to combine micro- and macro-motion features to improve video emotion recognition. The proposed two-stream recurrent network, named MIMAMO (Micro-Macro-Motion) Net, achieves state-of-the-art performance on two video emotion datasets, the OMG emotion dataset and the Aff-Wild dataset.

Exploiting Multi-CNN Features in CNN-RNN Based Dimensional Emotion Recognition on the OMG in-the-Wild Dataset

This paper presents a novel CNN-RNN-based approach that exploits multiple CNN features for dimensional emotion recognition in the wild, using the One-Minute Gradual-Emotion (OMG) dataset, and shows that arousal estimation is greatly improved when low-level features are combined with high-level ones.

Variational Autoencoder with Global- and Medium Timescale Auxiliaries for Emotion Recognition from Speech

This model exceeds state-of-the-art results on speaker identification and emotion regression from audio, with each timescale's hidden representation performing better on certain tasks than the others.

Estimating Multiple Emotion Descriptors by Separating Description and Inference

  • Didan Deng, Bertram E. Shi
  • 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022
A novel architecture for multiple emotion descriptor estimation is proposed that incorporates prior knowledge about the differences between descriptive labels and inferential labels (emotion messages such as discrete emotion expressions, valence, and arousal), and it outperforms all other submitted multi-task learning approaches.

CTFN: Hierarchical Learning for Multimodal Sentiment Analysis Using Coupled-Translation Fusion Network

This work proposes the coupled-translation fusion network (CTFN) to model bi-directional interplay via coupled learning, ensuring robustness to missing modalities, and presents a cyclic consistency constraint to improve translation performance.

The FaceChannel: A Fast & Furious Deep Neural Network for Facial Expression Recognition

This work formalizes the FaceChannel, a light-weight neural network with far fewer parameters than common deep neural networks, and introduces an inhibitory layer that helps shape the learning of facial features in the last layer of the network, improving performance while reducing the number of trainable parameters.

End-to-End Multimodal Emotion Recognition Using Deep Neural Networks

This work proposes an emotion recognition system that uses auditory and visual modalities: a convolutional neural network extracts features from speech, while a 50-layer deep residual network is used for the visual modality.

Multimodal emotion recognition using deep learning architectures

A database of multimodal recordings of actors enacting various emotion expressions is presented; it consists of audio and video sequences of actors displaying three different intensities of 23 different emotions, along with facial feature tracking, skeletal tracking, and the corresponding physiological data.

Multimodal emotion recognition in speech-based interaction using facial expression, body gesture and acoustic analysis

The multimodal approach increased the recognition rate by more than 10% when compared to the most successful unimodal system, and the best pairing is ‘gesture-speech’.

The OMG-Emotion Behavior Dataset

This paper proposes a novel multimodal corpus for emotion expression recognition, which uses gradual annotations with a focus on contextual emotion expressions and provides an experimental protocol and a series of unimodal baseline experiments which can be used to evaluate deep and recurrent neural models in a fair and standard manner.

Towards an intelligent framework for multimodal affective data analysis

Deep Convolutional Neural Network Textual Features and Multiple Kernel Learning for Utterance-level Multimodal Sentiment Analysis

A novel way of extracting features from short texts, based on the activation values of an inner layer of a deep convolutional neural network, is presented, along with a parallelizable decision-level data fusion method that is much faster, though slightly less accurate.

Multimodal Emotion Recognition in Response to Videos

The results over a population of 24 participants demonstrate that user-independent emotion recognition can outperform individual self-reports for arousal assessments and does not underperform for valence assessments.

Tensor Fusion Network for Multimodal Sentiment Analysis

A novel model, termed the Tensor Fusion Network, is introduced, which learns intra-modality and inter-modality dynamics end-to-end for sentiment analysis and outperforms state-of-the-art approaches in both multimodal and unimodal settings.
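In the Tensor Fusion Network, intra- and inter-modality dynamics are captured by an outer product of the modality embeddings, each extended with a constant 1 so that unimodal, bimodal, and trimodal interaction terms all appear in the result. A minimal sketch of that fusion step (function name and embedding sizes are illustrative, not from the paper):

```python
import numpy as np

def tensor_fusion(h_audio, h_visual, h_text):
    """Outer-product fusion of three modality embeddings.

    Appending 1 to each embedding means the resulting tensor
    contains the original unimodal features (terms multiplied by
    the 1s), all pairwise products, and the full trimodal product.
    """
    a = np.concatenate([h_audio, [1.0]])
    v = np.concatenate([h_visual, [1.0]])
    t = np.concatenate([h_text, [1.0]])
    # (|a|+1) x (|v|+1) x (|t|+1) fusion tensor,
    # flattened to feed a downstream classifier head
    return np.einsum('i,j,k->ijk', a, v, t).reshape(-1)
```

The flattened tensor grows multiplicatively with the embedding sizes, which is the usual practical caveat with this fusion scheme.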

Pose-Independent Facial Action Unit Intensity Regression Based on Multi-Task Deep Transfer Learning

A multi-task deep network addressing the AU intensity estimation sub-challenge of FERA 2017 is proposed, which outperforms the baseline results, and achieves a balanced performance among nine pose angles for most AUs.