LSTM Based Cross-corpus and Cross-task Acoustic Emotion Recognition

Heysem Kaya, Dmitrii Fedotov, Ali Yesilkanat, Oxana Verkholyak, Yang Zhang, Alexey Karpov
Acoustic emotion recognition is a popular and central research direction in paralinguistic analysis, due to its relation to a wide range of affective states/traits and its manifold applications. Developing highly generalizable models remains a challenge for researchers and engineers because of a multitude of nuisance factors. To ensure generalization, deployed models need to handle spontaneous speech recorded under acoustic conditions that differ from those of the training set. This requires that…


Context Modeling for Cross-Corpus Dimensional Acoustic Emotion Recognition: Challenges and Mixup
This paper analyzes the difficulties of automatic emotion recognition in a time-continuous, dimensional scenario using data from the RECOLA, SEMAINE and CreativeIT databases, and proposes a simple but effective strategy called “mixup” to overcome the gap in feature-target and target-target covariance structures across corpora.
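The generic mixup idea behind that strategy can be sketched in a few lines: blend pairs of feature vectors and their dimensional targets with a Beta-distributed coefficient. This is a minimal illustration of standard mixup, not the paper's exact cross-corpus formulation; the `mixup` helper and its arguments are hypothetical names.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two feature/target pairs with a Beta(alpha, alpha) coefficient.

    Hypothetical helper sketching generic mixup; the cited work applies the
    idea to bridge feature-target covariance gaps across emotion corpora.
    """
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)          # mixing coefficient in (0, 1)
    x = lam * x1 + (1.0 - lam) * x2       # convex combination of features
    y = lam * y1 + (1.0 - lam) * y2       # same combination of targets
    return x, y

# Example: blend two acoustic feature vectors and their dimensional labels.
xa, ya = np.array([1.0, 0.0]), 0.8       # e.g. an arousal target
xb, yb = np.array([0.0, 1.0]), -0.3
xm, ym = mixup(xa, ya, xb, yb)
```

Because the same coefficient mixes features and targets, the blended pair stays on the line segment between the two originals, which is what smooths the covariance structure across corpora.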
Hierarchical Two-level Modelling of Emotional States in Spoken Dialog Systems
A hierarchical context modelling approach based on an RNN-LSTM architecture is proposed, which models acoustic context at the frame level and the partner's emotional context at the dialog level; the state of the art on this corpus is advanced for both dimensions using only the acoustic modality.
Cross-Corpus Speech Emotion Recognition Using Semi-Supervised Transfer Non-Negative Matrix Factorization with Adaptation Regularization
The core idea of SATNMF is to incorporate the label information of the training corpus into NMF and seek a latent low-rank feature space in which the marginal and conditional distribution differences between the two corpora are minimized simultaneously.
Predicting Depression and Emotions in the Cross-roads of Cultures, Para-linguistics, and Non-linguistics
The results show that non-verbal parts of the signal are important for detection of depression, and combining this with linguistic information produces the best results.
Nonnegative Matrix Factorization Based Transfer Subspace Learning for Cross-Corpus Speech Emotion Recognition
  • Hui Luo, Jiqing Han
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2020
A nonnegative matrix factorization based transfer subspace learning method (NMFTSL) finds a shared feature subspace for the source and target corpora in which the discrepancy between the two distributions is minimized as far as possible and corpus-specific components are excluded, so that knowledge from the source corpus can be transferred to the target corpus.
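The shared-subspace idea rests on plain NMF, which the sketch below implements with Lee-Seung multiplicative updates on stacked source and target features. This is a simplified assumption-laden illustration: NMFTSL itself adds distribution-discrepancy and corpus-exclusion terms that are omitted here, and all variable names are hypothetical.

```python
import numpy as np

def nmf(V, rank, iters=200, seed=0):
    """Plain NMF via Lee-Seung multiplicative updates: V ~ W @ H.

    Simplified sketch only; NMFTSL augments this objective with
    distribution-discrepancy and corpus-specific terms (not shown).
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + 1e-3      # nonnegative init
    H = rng.random((rank, m)) + 1e-3
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)   # update keeps H >= 0
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)   # update keeps W >= 0
    return W, H

# Stack source- and target-corpus feature matrices along the sample axis,
# then factorize so both corpora share the same low-rank basis H.
Vs = np.abs(np.random.default_rng(1).random((20, 8)))   # source features
Vt = np.abs(np.random.default_rng(2).random((15, 8)))   # target features
V = np.vstack([Vs, Vt])
W, H = nmf(V, rank=4)
err = np.linalg.norm(V - W @ H)
```

The rows of `W` give each sample's coordinates in the shared subspace; a transfer method would train on the source rows and predict on the target rows.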
Hierarchical classifier design for speech emotion recognition in the mixed-cultural environment
A two-level hierarchical engine has been designed to identify emotion from the speech of different cultural backgrounds, using a discriminative, multiclass SVM classifier trained with the emotional utterances of that particular corpus.
A Successive Difference Feature for Detecting Emotional Valence from Speech
A recent time-domain feature extraction technique for detecting emotional valence, which achieves 75% accuracy on an Indian-demography corpus and 100% accuracy in classifying happy, sad and neutral emotions from EmoDB with a Random Forest classifier.
Speech Emotion Recognition Using Spectrogram Patterns as Features
The experimental evaluations indicate that the spectrogram-based patterns outperform the standard set of acoustic features and it is shown that the results can further be improved with the increasing number of spectrogram partitions.


Autoencoder-based Unsupervised Domain Adaptation for Speech Emotion Recognition
An adaptive denoising autoencoder based unsupervised domain adaptation method, in which prior knowledge learned from a target set regularizes training on a source set to achieve a matched feature-space representation for the target and source sets while ensuring target-domain knowledge transfer.
Efficient and effective strategies for cross-corpus acoustic emotion recognition
On Acoustic Emotion Recognition: Compensating for Covariate Shift
This paper employs three transfer-learning algorithms that apply importance weights (IWs) within a support vector machine classifier to reduce the effects of covariate shift, and shows that the IW methods outperform combined CMN and VTLN and significantly improve on the baseline performance of the Challenge.
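A common recipe for such importance weights (not necessarily the paper's exact three algorithms) is the density-ratio trick: train a small domain classifier to separate training from test features, then weight each training sample by the classifier's odds. The sketch below does this with a tiny numpy logistic regressor; all names are illustrative assumptions.

```python
import numpy as np

def importance_weights(X_train, X_test, iters=500, lr=0.1):
    """Estimate w(x) = p_test(x) / p_train(x) via a domain classifier.

    Hypothetical sketch of the density-ratio trick for covariate shift:
    fit logistic regression on (train=0, test=1), then w(x) = odds(x).
    """
    X = np.vstack([X_train, X_test])
    X = np.hstack([X, np.ones((len(X), 1))])          # add bias column
    y = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]
    theta = np.zeros(X.shape[1])
    for _ in range(iters):                            # gradient descent
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        theta -= lr * X.T @ (p - y) / len(y)
    p_tr = 1.0 / (1.0 + np.exp(-X[: len(X_train)] @ theta))
    return p_tr / (1.0 - p_tr)                        # classifier odds

rng = np.random.default_rng(0)
Xtr = rng.normal(0.0, 1.0, (200, 1))   # training-corpus features
Xte = rng.normal(1.0, 1.0, (200, 1))   # shifted target-corpus features
w = importance_weights(Xtr, Xte)
```

The resulting weights would then be passed to a weighted classifier (e.g. per-sample weights in an SVM), so that training samples resembling the target corpus count more.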
Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional LSTM modeling
A context-sensitive technique for multimodal emotion recognition based on feature-level fusion of acoustic and visual cues is applied, which enables us to classify both prototypical and nonprototypical emotional expressions contained in a large audiovisual database.
Continuous Estimation of Emotions in Speech by Dynamic Cooperative Speaker Models
A novel emotion recognition system, based on ensembles of single-speaker-regression-models selecting those that are most concordant among them, which allows the addition or removal of speakers from the ensemble without the necessity to re-build the entire recognition system.
Contextual Dependencies in Time-Continuous Multidimensional Affect Recognition
A series of experiments conducted with different modalities and emotional labels on the RECOLA corpus shows a strong relationship between the amount of context used in the model and its performance; the pattern remains the same for different pairs of modalities, but its intensity differs.
Speech Emotion Analysis: Exploring the Role of Context
A novel set of features based on cepstrum analysis of pitch and intensity contours is introduced and the effects of different contexts on two different databases are systematically analyzed.
Cross lingual speech emotion recognition using canonical correlation analysis on principal component subspace
This paper proposes an analytical approach to domain adaptation based on Kernel Canonical Correlation Analysis (KCCA) that yields higher classification performance, and compares it with the Shared-Hidden-Layer Autoencoder (SHLA) and kernel-based principal components.
The INTERSPEECH 2018 Computational Paralinguistics Challenge: Atypical & Self-Assessed Affect, Crying & Heart Beats
The Sub-Challenges are described, along with their conditions and baseline feature extraction and classifiers, which include data-learnt (supervised) feature representations by end-to-end learning, the ‘usual’ ComParE and BoAW features, and, for the first time in the challenge series, deep unsupervised representation learning using the AUDEEP toolkit.
Semisupervised Autoencoders for Speech Emotion Recognition
Experimental results demonstrate that the proposed semisupervised autoencoders for speech emotion recognition achieve state-of-the-art performance with a very small amount of labelled data on the challenge task and other tasks, and significantly outperform other alternative methods.