Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network

  @inproceedings{trigeorgis2016adieu,
    title={Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network},
    author={George Trigeorgis and Fabien Ringeval and Raymond Brueckner and Erik Marchi and Mihalis A. Nicolaou and Bj{\"o}rn Schuller and Stefanos Zafeiriou},
    booktitle={2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    year={2016}
  }
The automatic recognition of spontaneous emotions from speech is a challenging task. On the one hand, acoustic features need to be robust enough to capture the emotional content across various speaking styles, while on the other, machine learning algorithms need to be insensitive to outliers while still being able to model context. Whereas the latter has been tackled by the use of Long Short-Term Memory (LSTM) networks, the former is still under very active investigation, even though more…


End-to-End Multimodal Emotion Recognition Using Deep Neural Networks
This work proposes an emotion recognition system that combines auditory and visual modalities, using a convolutional neural network to extract features from the speech, while for the visual modality a deep residual network of 50 layers is used.
End-to-End Speech Emotion Recognition Using Deep Neural Networks
This model, trained end-to-end, comprises a Convolutional Neural Network that extracts features from the raw signal, with a 2-layer Long Short-Term Memory network stacked on top of it to model the contextual information in the data.
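The pipeline described above — a convolutional front end over the raw waveform feeding an LSTM that summarizes the utterance — can be sketched in a few lines of numpy. This is an illustrative toy, not the authors' implementation: weights are random and untrained, a single LSTM layer stands in for the paper's two, and all sizes (filter count, window width, hop, hidden size) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_frontend(wave, n_filters=4, width=40, hop=20):
    """Strided 1-D convolution over the raw waveform: each output frame
    holds the responses of n_filters random (untrained) filters to one window."""
    filters = rng.standard_normal((n_filters, width)) / np.sqrt(width)
    frames = [wave[i:i + width] for i in range(0, len(wave) - width + 1, hop)]
    return np.array([filters @ f for f in frames])  # (n_frames, n_filters)

def lstm(xs, hidden=8):
    """Minimal single-layer LSTM cell unrolled over the frame sequence."""
    d = xs.shape[1]
    W = rng.standard_normal((4 * hidden, d + hidden)) * 0.1
    b = np.zeros(4 * hidden)
    h = c = np.zeros(hidden)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for x in xs:
        z = W @ np.concatenate([x, h]) + b
        i, f, o, g = (z[:hidden], z[hidden:2 * hidden],
                      z[2 * hidden:3 * hidden], z[3 * hidden:])
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # gated cell update
        h = sigmoid(o) * np.tanh(c)                   # gated hidden state
    return h  # final hidden state summarizes the utterance

wave = rng.standard_normal(16000)   # 1 s of fake 16 kHz "audio"
feats = conv_frontend(wave)         # frame-level features, shape (799, 4)
summary = lstm(feats)               # fixed-size utterance embedding, shape (8,)
print(summary.shape)                # (8,)
```

In the actual paper this fixed-size summary would feed a classifier or regressor for the emotion targets, and everything would be trained jointly by backpropagation.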
Direct Modelling of Speech Emotion from Raw Speech
This paper proposes the use of parallel convolutional layers to harness multiple temporal resolutions in the feature extraction block, which is jointly trained with the LSTM-based classification network for the emotion recognition task, and suggests that the proposed model can match the performance of a CNN trained with hand-engineered features on both the IEMOCAP and MSP-IMPROV datasets.
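The multi-resolution idea can be illustrated with a toy numpy front end that filters the same raw signal with kernels of several lengths in parallel and stacks the responses as channels. The kernel sizes here are illustrative assumptions, and the filters are random; in the paper they would be learned jointly with the classifier.

```python
import numpy as np

def multiscale_features(signal, kernel_sizes=(8, 32, 128)):
    """Toy multi-resolution front end: filter the raw signal with one
    random kernel per time scale and stack the responses as channels.

    'same' padding keeps every output aligned with the input, so the
    parallel branches can be stacked into a single feature tensor.
    """
    rng = np.random.default_rng(0)
    channels = []
    for k in kernel_sizes:
        kernel = rng.standard_normal(k) / np.sqrt(k)  # random untrained filter
        channels.append(np.convolve(signal, kernel, mode="same"))
    return np.stack(channels)  # shape: (len(kernel_sizes), len(signal))

x = np.random.default_rng(1).standard_normal(16000)  # 1 s of 16 kHz "audio"
feats = multiscale_features(x)
print(feats.shape)  # (3, 16000)
```

Short kernels respond to fine temporal detail while long kernels integrate over coarser structure, which is the intuition behind combining several resolutions before the recurrent classifier.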
Deep Recurrent Neural Networks for Emotion Recognition in Speech
A deep learning framework for emotion recognition in speech is proposed, and different feature representations are compared using a state-of-the-art benchmark database from the domain of affective computing.
End-to-end speech emotion recognition using multi-scale convolution networks
The multi-scale convolutional neural network (MCNN) is proposed to identify features at different time scales and frequencies from raw speech signals, improving emotion recognition performance compared to existing methods.
Adieu recurrence? End-to-end speech emotion recognition using a context stacking dilated convolutional network
This work proposes a novel end-to-end SER architecture that contains no recurrent or fully connected layers; by leveraging the power of dilated causal convolutions, the receptive field of the proposed model grows substantially at a reasonably low computational cost.
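The receptive-field claim can be checked with simple arithmetic. Assuming a WaveNet-style stack in which the dilation doubles at each layer (an assumption for illustration, not a detail taken from this abstract), the receptive field grows exponentially with depth:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked dilated causal convolutions.

    Each layer with kernel size k and dilation d extends the receptive
    field by (k - 1) * d samples; the first input sample contributes 1.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Doubling dilations 1, 2, 4, ..., 512: ten cheap layers already
# cover 1024 raw samples.
dilations = [2 ** i for i in range(10)]
print(receptive_field(2, dilations))  # 1024
```

A plain (undilated) stack would need hundreds of layers to see the same context, which is why dilation keeps the computational cost low.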
Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition
This paper proposes a new method for SER based on Deep Convolution Neural Network (DCNN) and Bidirectional Long Short-Term Memory with Attention (BLSTMwA) model and adopts BLSTM to learn the high-level emotional features for temporal summarization, followed by an attention layer which can focus on emotionally relevant features.
Speech Representation Learning for Emotion Recognition Using End-to-End ASR with Factorized Adaptation
This paper leverages an end-to-end ASR to extract ASR-based representations for speech emotion recognition and devise a factorized domain adaptation approach on the pre-trained ASR model to improve both the speech recognition rate and the emotion recognition accuracy on the target emotion corpus.
Facing Realism in Spontaneous Emotion Recognition from Speech: Feature Enhancement by Autoencoder with LSTM Neural Networks
Results show that the proposed method significantly outperforms a system trained on raw features, for both arousal and valence dimensions, while having almost no degradation when applied to clean speech.


Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks
This paper proposes to learn affect-salient features for SER using convolutional neural networks (CNN), and shows that this approach leads to stable and robust recognition performance in complex scenes and outperforms several well-established SER features.
Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks
This paper investigates a novel approach in which the input to the ANN is the raw speech signal and the output is phoneme class conditional probability estimates, and indicates that CNNs can learn features relevant for phoneme classification automatically from the raw speech signal.
Towards End-To-End Speech Recognition with Recurrent Neural Networks
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the…
Prediction of asynchronous dimensional emotion ratings from audiovisual and physiological data
End-to-end learning for music audio
  • S. Dieleman, B. Schrauwen
  • Computer Science
    2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
Although convolutional neural networks do not outperform a spectrogram-based approach, the networks are able to autonomously discover frequency decompositions from raw audio, as well as phase- and translation-invariant feature representations.
Using representation learning and out-of-domain data for a paralinguistic speech task
This work builds upon a deep learning language identification system, repurposed for general audio sequence classification, which trains local convolutional neural network classifiers that automatically learn representations on smaller windows of the full sequence's spectrum.
Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks
This paper takes advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture, and finds that the CLDNN provides a 4-6% relative improvement in WER over an LSTM, the strongest of the three individual models.
AV+EC 2015: The First Affect Recognition Challenge Bridging Across Audio, Video, and Physiological Data
The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the audio, video, and physiological emotion recognition communities, to compare the relative merits of the three approaches to emotion recognition under well-defined and strictly comparable conditions, and to establish to what extent fusion of the approaches is possible and beneficial.
Architectures for deep neural network based acoustic models defined over windowed speech waveforms
This paper investigates acoustic models for automatic speech recognition (ASR) using deep neural networks (DNNs) whose input is taken directly from windowed speech waveforms (WSW), and shows that using WSW features results in a 3.0% increase in WER relative to MFSC features on the WSJ corpus.
Learning the speech front-end with raw waveform CLDNNs
It is shown that raw waveform features match the performance of log-mel filterbank energies when used with a state-of-the-art CLDNN acoustic model trained on over 2,000 hours of speech.