Multimodal Speech Emotion Recognition Using Audio and Text

Seunghyun Yoon, Seokhyun Byun, Kyomin Jung. 2018 IEEE Spoken Language Technology Workshop (SLT).
Speech emotion recognition is a challenging task, and extensive reliance has been placed on models that use audio features to build well-performing classifiers. […] This architecture analyzes speech data from the signal level to the language level, and thus utilizes the information within the data more comprehensively than models that focus only on audio features. Extensive experiments are conducted to investigate the efficacy and properties of the proposed model. Our proposed model outperforms…


Learning Alignment for Multimodal Emotion Recognition from Speech

This paper proposes to use an attention mechanism to learn the alignment between speech frames and text words, aiming to produce more accurate multimodal feature representations.
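The frame-to-word alignment idea can be sketched with plain dot-product attention: each word queries all speech frames and receives an attention-weighted acoustic summary. This is a minimal numpy illustration under assumed shapes and a dot-product scoring function, not the paper's actual model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def align_speech_to_text(frames, words):
    """For each text word, attend over all speech frames.

    frames: (T, d) speech-frame features
    words:  (N, d) word embeddings (assumed projected to the same dim)
    Returns (N, d): one aligned acoustic vector per word.
    """
    scores = words @ frames.T          # (N, T) word-to-frame relevance
    weights = softmax(scores, axis=1)  # each word's distribution over frames
    return weights @ frames            # (N, d) attention-weighted summaries

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))   # 50 frames, 8-dim features
words = rng.normal(size=(6, 8))     # 6 words
aligned = align_speech_to_text(frames, words)
```

Concatenating each word embedding with its aligned acoustic vector would then yield the word-level multimodal representation the summary describes.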

Speech Emotion Recognition Using Multi-hop Attention Mechanism

A framework to exploit acoustic information in tandem with lexical data, using two bi-directional long short-term memory (BLSTM) networks to obtain hidden representations of the utterance, together with an attention mechanism, referred to as multi-hop attention, that is trained to automatically infer the correlation between the modalities.
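The multi-hop idea can be sketched as attention that alternates between the two modalities, each hop re-querying the other modality with the context produced by the previous hop. This numpy sketch stands in random matrices for the BLSTM hidden states and assumes dot-product scoring; hop count and the initial query are illustrative choices:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, memory):
    """Dot-product attention: summarize `memory` (T, d) w.r.t. `query` (d,)."""
    weights = softmax(memory @ query)   # (T,) attention over time steps
    return weights @ memory             # (d,) context vector

def multi_hop(audio_h, text_h, hops=3):
    """Alternate attention between the modalities' hidden-state sequences.

    audio_h, text_h: (T, d) arrays standing in for BLSTM outputs.
    Starts from the last audio state; each hop queries the other modality.
    """
    context = audio_h[-1]
    for hop in range(hops):
        memory = text_h if hop % 2 == 0 else audio_h
        context = attend(context, memory)
    return context

rng = np.random.default_rng(1)
ctx = multi_hop(rng.normal(size=(40, 16)), rng.normal(size=(12, 16)))
```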

Efficient Speech Emotion Recognition Using Multi-Scale CNN and Attention

The proposed model outperforms previous state-of-the-art methods on IEMOCAP dataset with four emotion categories in both weighted accuracy and unweighted accuracy, with an improvement of 5.0% and 5.2% respectively under the ASR setting.

Fusion of Acoustic and Linguistic Information using Supervised Autoencoder for Improved Emotion Recognition

Comparisons of discriminative characteristics of hand-crafted and data-driven acoustic features in a context of emotional classification in arousal and valence dimensions show that joint modeling of acoustic and linguistic cues could improve classification performance compared to individual modalities.

A Segment Level Approach to Speech Emotion Recognition Using Transfer Learning

This paper proposes a speech emotion recognition system that predicts emotions for multiple segments of a single audio clip unlike the conventional emotion recognition models that predict the emotion of an entire audio clip directly.
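The segment-level scheme amounts to classifying overlapping windows of one clip and then aggregating the per-segment posteriors. A minimal sketch, with a toy stand-in classifier (segment length, hop, and mean-pooling aggregation are assumptions, not the paper's settings):

```python
import numpy as np

def segment_predictions(clip, seg_len, hop, predict):
    """Classify overlapping segments of one audio clip."""
    starts = range(0, len(clip) - seg_len + 1, hop)
    return np.stack([predict(clip[s:s + seg_len]) for s in starts])

def clip_label(seg_probs):
    """Average per-segment posteriors, then take the arg-max emotion."""
    return int(np.argmax(seg_probs.mean(axis=0)))

def toy_predict(segment):
    """Hypothetical classifier: 4 emotion classes from segment statistics."""
    logits = np.array([segment.mean(), segment.std(), 0.0, 1.0])
    e = np.exp(logits - logits.max())
    return e / e.sum()

clip = np.random.default_rng(2).normal(size=1600)
probs = segment_predictions(clip, seg_len=400, hop=200, predict=toy_predict)
label = clip_label(probs)
```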

WISE: Word-Level Interaction-Based Multimodal Fusion for Speech Emotion Recognition

This paper proposes WISE, a novel word-level interaction-based multimodal fusion framework for speech emotion recognition, and devises a hierarchical representation of audio information at the frame, phoneme, and word levels, which largely improves the expressiveness of the resulting audio features.

Robotic Emotion Recognition Using Two-Level Features Fusion in Audio Signals of Speech

  • Chang Li
  • Computer Science
    IEEE Sensors Journal
  • 2022
This paper proposes an emotion recognition system, based on speech signals, using two-level features with position information, Later Feature Fusion with VGGish Overlap (LFFVO), to tackle the present limitations.

End-to-End Speech Emotion Recognition Combined with Acoustic-to-Word ASR Model

This paper proposes speech emotion recognition (SER) combined with an acoustic-to-word automatic speech recognition (ASR) model, achieving 68.63% weighted accuracy and 69.67% unweighted accuracy on the IEMOCAP database, which is state-of-the-art performance.

Automatic speech emotion recognition using recurrent neural networks with local attention

This work studies the use of deep learning to automatically discover emotionally relevant features from speech and proposes a novel strategy for feature pooling over time which uses local attention in order to focus on specific regions of a speech signal that are more emotionally salient.
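The local-attention pooling described here replaces uniform averaging with a learned weighting, so emotionally salient frames dominate the utterance representation. A minimal numpy sketch, with a random vector standing in for the learned attention parameter:

```python
import numpy as np

def local_attention_pool(features, u):
    """Attention-weighted temporal pooling.

    features: (T, d) frame-level features; u: (d,) attention parameter
    (learned in the real model; random here for illustration).
    """
    scores = features @ u              # (T,) per-frame salience
    e = np.exp(scores - scores.max())
    weights = e / e.sum()              # (T,) softmax over frames
    return weights @ features          # (d,) utterance-level vector

rng = np.random.default_rng(3)
feats = rng.normal(size=(30, 10))
utt = local_attention_pool(feats, rng.normal(size=10))
```

Because the weights form a convex combination, the pooled vector always lies within the per-dimension range of the frame features, unlike max pooling.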

Efficient Emotion Recognition from Speech Using Deep Learning on Spectrograms

A new implementation of emotion recognition from the paralinguistic information in speech, based on a deep neural network applied directly to spectrograms, achieves higher recognition accuracy than previously published results while also limiting latency.

Speech emotion recognition using deep neural network and extreme learning machine

The experimental results demonstrate that the proposed approach effectively learns emotional information from low-level features and leads to 20% relative accuracy improvement compared to the state of the art approaches.

Attentive Convolutional Neural Network Based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech

This work conducts extensive experiments using an attentive convolutional neural network with multi-view learning objective function for speech emotion recognition and achieves state-of-the-art results on the improvised speech data of IEMOCAP.

Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture

A novel approach to combining acoustic features and language information for more robust automatic recognition of a speaker's emotion is introduced, applying belief-network-based spotting of emotional key phrases.

Towards Speech Emotion Recognition "in the Wild" Using Aggregated Corpora and Deep Multi-Task Learning

This work proposes to use Multi-Task Learning (MTL) with gender and naturalness as auxiliary tasks in deep neural networks, and finds that the proposed MTL method significantly improves performance.
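The auxiliary-task setup typically reduces to a weighted sum of losses: the main emotion loss plus down-weighted terms for the auxiliary predictions. A minimal sketch (the weights `alpha` and `beta` and the two-class auxiliary heads are illustrative assumptions):

```python
import numpy as np

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -np.log(probs[label] + 1e-12)

def mtl_loss(emotion_probs, emotion_y, gender_probs, gender_y,
             natural_probs, natural_y, alpha=0.1, beta=0.1):
    """Main emotion loss plus down-weighted auxiliary task losses."""
    return (cross_entropy(emotion_probs, emotion_y)
            + alpha * cross_entropy(gender_probs, gender_y)
            + beta * cross_entropy(natural_probs, natural_y))

p_emotion = np.array([0.70, 0.10, 0.10, 0.10])
p_gender = np.array([0.60, 0.40])
p_natural = np.array([0.55, 0.45])
loss = mtl_loss(p_emotion, 0, p_gender, 1, p_natural, 0)
```

Setting `alpha = beta = 0` recovers single-task training, which makes the auxiliary contribution easy to ablate.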

High-level feature representation using recurrent neural network for speech emotion recognition

This paper presents a speech emotion recognition system using a recurrent neural network (RNN) model trained by an efficient learning algorithm. The proposed system takes into account the long-range…

A first look into a Convolutional Neural Network for speech emotion detection

  • D. Bertero, Pascale Fung
  • Computer Science
    2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
A real-time Convolutional Neural Network model for speech emotion detection is presented, trained from raw audio on a small dataset of TED talk speech data manually annotated into three emotion classes: “Angry”, “Happy”, and “Sad”.

Hidden Markov model-based speech emotion recognition

The paper addresses the design of working recognition engines and results achieved with respect to the alluded alternatives and describes a speech corpus consisting of acted and spontaneous emotion samples in German and English language.

Using regional saliency for speech emotion recognition

  • Zakaria Aldeneh, E. Provost
  • Computer Science, Physics
    2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
The results suggest that convolutional neural networks with Mel Filterbanks (MFBs) can be used as a replacement for classifiers that rely on features obtained from applying utterance-level statistics.
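The utterance-level statistics that such CNNs replace are simple functionals pooled over the Mel filterbank frames. This sketch shows the hand-crafted baseline representation (the particular choice of mean/std/min/max functionals is an illustrative assumption):

```python
import numpy as np

def utterance_statistics(mfb):
    """Utterance-level functionals over Mel filterbank frames.

    mfb: (T, n_mels) frame-level filterbank energies.
    Returns a fixed-length vector of per-band statistics, i.e. the
    classic pooled features a CNN over MFBs can learn to replace.
    """
    return np.concatenate([mfb.mean(axis=0), mfb.std(axis=0),
                           mfb.min(axis=0), mfb.max(axis=0)])

mfb = np.abs(np.random.default_rng(4).normal(size=(100, 40)))
stats = utterance_statistics(mfb)  # 4 statistics x 40 Mel bands
```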