Investigations on audiovisual emotion recognition in noisy conditions

  title={Investigations on audiovisual emotion recognition in noisy conditions},
  author={M. Neumann and Ngoc Thang Vu},
  journal={2021 IEEE Spoken Language Technology Workshop (SLT)},
  • M. Neumann, Ngoc Thang Vu
  • Published 19 January 2021
  • Computer Science
  • 2021 IEEE Spoken Language Technology Workshop (SLT)
In this paper we explore audiovisual emotion recognition under noisy acoustic conditions with a focus on speech features. We attempt to answer the following research questions: (i) How does speech emotion recognition perform on noisy data? and (ii) To what extent does a multimodal approach improve the accuracy and compensate for potential performance degradation at different noise levels? We present an analytical investigation on two emotion datasets with superimposed noise at different signal… 
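The abstract describes superimposing noise on emotion datasets at several levels. A minimal sketch of how such mixing is commonly done is given below, assuming the level is expressed as a signal-to-noise ratio (SNR) in dB. The function names, the synthetic sine "speech", the Gaussian noise, and the SNR values 20, 10, and 0 dB are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean so that the mixture has the requested SNR in dB."""
    # Repeat or trim the noise so it covers the whole clean signal.
    reps = math.ceil(len(clean) / len(noise))
    noise = (list(noise) * reps)[: len(clean)]
    # Scale the noise so that power(clean) / power(gain * noise) = 10^(snr_db / 10).
    gain = math.sqrt(power(clean) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    return [c + gain * n for c, n in zip(clean, noise)]


def measured_snr_db(clean, mixture):
    """SNR of a mixture, treating (mixture - clean) as the noise component."""
    residual = [m - c for c, m in zip(clean, mixture)]
    return 10.0 * math.log10(power(clean) / power(residual))


# Hypothetical stand-ins for a speech recording and a noise recording.
random.seed(0)
clean = [math.sin(2 * math.pi * 220.0 * i / 16000.0) for i in range(16000)]
noise = [random.gauss(0.0, 1.0) for _ in range(8000)]

# One noisy copy of the signal per (assumed) SNR level.
noisy_by_snr = {snr: mix_at_snr(clean, noise, snr) for snr in (20.0, 10.0, 0.0)}
```

Scaling the noise rather than the speech keeps the clean signal's level constant across conditions, so any difference in recognition accuracy between conditions comes from the added noise alone.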

Would you respect a norm if it sounds foreign? Foreign-accented speech affects decision-making processes

Does listening to a foreign-accented speaker bias native speakers’ behavior? We investigated whether the accent, i.e., a foreign accent versus a native accent, in which a social norm is presented

Subjective Evaluation of Basic Emotions from Audio–Visual Data

The results indicated that the participants’ perception of emotions differed markedly between the audio-alone, video-alone, and audio-video data, which emphasizes the importance of emotion-specific features compared to commonly used features in the development of emotion-aware systems.



Towards Robust Speech Emotion Recognition Using Deep Residual Networks for Speech Enhancement

This work shows how a scalable DL architecture can be trained to enhance audio signals in a large number of unseen environments, and shows how that can benefit common SER pipelines in terms of noise robustness.

Attentive Convolutional Neural Network Based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech

This work conducts extensive experiments using an attentive convolutional neural network with multi-view learning objective function for speech emotion recognition and achieves state-of-the-art results on the improvised speech data of IEMOCAP.

Speech emotion recognition in noisy environment

  • Farah Chenchah, Z. Lachiri
  • Computer Science
  • 2016 2nd International Conference on Advanced Technologies for Signal and Image Processing (ATSIP)
  • 2016
Three speech enhancement algorithms are introduced for improved emotion classification: spectral subtraction, Wiener filtering, and MMSE. They improve the performance of the emotion recognition system under various SNRs.

Facing Realism in Spontaneous Emotion Recognition from Speech: Feature Enhancement by Autoencoder with LSTM Neural Networks

Results show that the proposed method significantly outperforms a system trained on raw features, for both arousal and valence dimensions, while having almost no degradation when applied to clean speech.

A probabilistic fusion strategy for audiovisual emotion recognition of sparse and noisy data

A Semi-Coupled Hidden Markov Model (SC-HMM) based on a state-based bimodal alignment strategy is proposed to align the temporal relation of states of two component HMMs between audio and visual streams.

Emotion Recognition in the Noise Applying Large Acoustic Feature Sets

Generation of functionals is extended by extraction of a large 4k hi-level feature set out of more than 60 partially novel base contours that comprise among others intonation, intensity, formants, HNR, MFCC, and VOC19, and Fast Information-Gain-Ratio filter-selection picks attributes according to noise conditions.


An enhanced Lipschitz embedding was developed to embed the 64-dimensional acoustic features into a six-dimensional space. To avoid the problems introduced by noise reduction, emotion recognition was performed directly on the noisy speech.

Investigating Speech Enhancement and Perceptual Quality for Speech Emotion Recognition

Results show that applying enhancement prior to the SER task can improve SER performance in more degraded scenarios, and that quality measures can be an important asset as an indicator of how well enhancement algorithms perform for SER.

Towards More Reality in the Recognition of Emotional Speech

The major aspects of emotion recognition are addressed in view of potential applications in the field, to benchmark today's emotion recognition systems and bridge the gap between commercial interest and current performances: acted vs. spontaneous speech, realistic emotions, noise and microphone conditions, and speaker independence.