ASR-based Features for Emotion Recognition: A Transfer Learning Approach

@article{Tits2018ASRbasedFF,
  title={ASR-based Features for Emotion Recognition: A Transfer Learning Approach},
  author={No{\'e} Tits and Kevin El Haddad and Thierry Dutoit},
  journal={ArXiv},
  year={2018},
  volume={abs/1805.09197}
}
During the last decade, the applications of signal processing have drastically improved with deep learning. However, areas of affective computing such as emotional speech synthesis or emotion recognition from spoken language remain challenging. In this paper, we investigate the use of a neural Automatic Speech Recognition (ASR) system as a feature extractor for emotion recognition. We show that these features outperform the eGeMAPS feature set in predicting the valence and arousal emotional dimensions… 
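
To make the recipe concrete, here is a minimal sketch of the transfer-learning idea in Python with PyTorch. The paper's exact ASR network is not reproduced; as a stand-in assumption, a pretrained wav2vec2 ASR encoder from torchaudio (a more recent model than anything available in 2018) is frozen and used as a feature extractor, its intermediate activations are mean-pooled over time into utterance-level features, and a small regression head is trained on top to predict valence and arousal. The names asr_features, head, and train_step are hypothetical.

import torch
import torchaudio

# Pretrained ASR encoder, kept frozen (a stand-in for the paper's ASR model).
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
asr = bundle.get_model().eval()

def asr_features(waveform):
    # Extract intermediate ASR representations and mean-pool over time
    # to obtain a fixed-size utterance-level feature vector.
    with torch.no_grad():
        layers, _ = asr.extract_features(waveform)   # list of (1, T, 768) tensors
    return layers[len(layers) // 2].mean(dim=1)      # middle layer -> (1, 768)

# Small regression head mapping pooled ASR features to (valence, arousal).
head = torch.nn.Linear(768, 2)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

def train_step(waveform, target_va):
    # target_va: tensor of shape (1, 2) holding valence/arousal labels.
    pred = head(asr_features(waveform))
    loss = torch.nn.functional.mse_loss(pred, target_va)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Smoke test with a dummy 1-second clip at the bundle's sample rate and a
# fabricated label; real training would iterate over a corpus such as IEMOCAP.
wav = torch.randn(1, bundle.sample_rate)
print(train_step(wav, torch.tensor([[0.3, -0.1]])))

Only the linear head is optimized here; freezing the ASR encoder is what makes this a transfer-learning setup rather than end-to-end training.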

Citations

Transfer Learning for Speech Emotion Recognition
  • Zhijie Han, Huijuan Zhao, Ru-chuan Wang
  • Computer Science
    2019 IEEE 5th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing (HPSC), and IEEE Intl Conference on Intelligent Data and Security (IDS)
  • 2019
TLDR
This paper briefly surveys the concepts, categories, methods, and applications of transfer learning, studies its combination with deep learning and its application to speech emotion recognition, and then points out the key issues that remain to be solved.
A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech - a Deep Learning approach
  • Noé Tits
  • Computer Science
    2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW)
  • 2019
In this project, we aim to build a Text-to-Speech system able to produce speech with a controllable emotional expressiveness. We propose a methodology for solving this problem in three main steps.
Dimensional speech emotion recognition from speech features and word embeddings by using multitask learning
TLDR
The findings suggest that MTL with two parameters is better than the other evaluated methods at representing the interrelation of emotional attributes, for both categorical and dimensional emotion results, from psychological and engineering perspectives.
Speech-Based Emotion Recognition using Neural Networks and Information Visualization
TLDR
A tool is proposed which enables users to take speech samples and identify a range of emotions in the audio through a machine learning model; it is designed around local therapists' need for intuitive representations of speech data that yield insights and informative analyses of their sessions with patients.
Multimodal Emotion Recognition With Transformer-Based Self Supervised Feature Fusion
TLDR
This work introduces a novel Transformers and Attention-based fusion mechanism that can combine multimodal SSL features and achieve state-of-the-art results for the task of multi-modal emotion recognition.
Improving Valence Prediction in Dimensional Speech Emotion Recognition Using Linguistic Information
  • Bagus Tris Atmaja, M. Akagi
  • Computer Science
    2020 23rd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)
  • 2020
TLDR
This paper presents an approach to tackling the low valence-prediction score by utilizing linguistic information: it fuses acoustic features with linguistic features obtained by converting words to vectors.
Exploring Transfer Learning for Low Resource Emotional TTS
TLDR
This paper investigates how to leverage fine-tuning of a pre-trained deep-learning-based TTS model to synthesize speech with a small dataset of another speaker, and adapts this model to emotional TTS by fine-tuning the neutral TTS model with a small emotional dataset.
Mandarin Electrolaryngeal Speech Recognition Based on WaveNet-CTC.
TLDR
This study indicates that EL speech can be recognized effectively by ASR based on WaveNet-CTC, which shows higher generalization performance and better stability than traditional methods.
Visualization and Interpretation of Latent Spaces for Controlling Expressive Speech Synthesis through Audio Analysis
TLDR
This paper analyzes and compares different latent spaces and obtains an interpretation of their influence on expressive speech, enabling the construction of controllable speech synthesis systems with understandable behaviour.
Improving speech emotion recognition via Transformer-based Predictive Coding through transfer learning
...

References

Showing 1-10 of 23 references
Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network
TLDR
This paper proposes a solution to the problem of 'context-aware' emotion-relevant feature extraction by combining Convolutional Neural Networks (CNNs) with LSTM networks, in order to automatically learn the best representation of the speech signal directly from the raw time representation.
Using regional saliency for speech emotion recognition
  • Zakaria Aldeneh, E. Provost
  • Computer Science, Physics
    2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
TLDR
The results suggest that convolutional neural networks with Mel Filterbanks (MFBs) can be used as a replacement for classifiers that rely on features obtained from applying utterance-level statistics.
Towards Speech Emotion Recognition "in the Wild" Using Aggregated Corpora and Deep Multi-Task Learning
TLDR
This work proposes to use Multi-Task Learning (MTL) with gender and naturalness as auxiliary tasks in deep neural networks, and finds that the proposed MTL method significantly improves performance.
Deep Temporal Models using Identity Skip-Connections for Speech Emotion Recognition
TLDR
A novel architecture is proposed which regulates unimpeded feature flows and captures long-term dependencies via gate-based skip-connections and a memory mechanism; it outperforms state-of-the-art methods by 9-15% and achieves an Unweighted Accuracy of 80.5% under an imbalanced class distribution.
IEMOCAP: interactive emotional dyadic motion capture database
TLDR
A new corpus named the “interactive emotional dyadic motion capture database” (IEMOCAP), collected by the Speech Analysis and Interpretation Laboratory at the University of Southern California (USC), which provides detailed information about their facial expressions and hand movements during scripted and spontaneous spoken communication scenarios.
The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing
TLDR
A basic standard acoustic parameter set for various areas of automatic voice analysis, such as paralinguistic or clinical speech analysis, is proposed and intended to provide a common baseline for evaluation of future research and eliminate differences caused by varying parameter sets or even different implementations of the same parameters.
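
Since eGeMAPS is the baseline feature set the abstract compares against, a minimal extraction sketch may be useful; it assumes the opensmile Python package (a wrapper around the openSMILE toolkit) and a placeholder file audio.wav.

import opensmile

# Configure extraction of the extended GeMAPS (eGeMAPS) functionals,
# one fixed-size feature vector per utterance.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("audio.wav")  # pandas DataFrame of functionals
print(features.shape)                       # one row, 88 eGeMAPS features

A regressor trained on these 88 functionals would form the baseline that the ASR-derived features above are reported to outperform.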
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Tensor Fusion Network for Multimodal Sentiment Analysis
TLDR
A novel model, termed Tensor Fusion Network, is introduced, which learns intra-modality and inter-modality dynamics end-to-end in sentiment analysis and outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.
Context-Dependent Sentiment Analysis in User-Generated Videos
TLDR
An LSTM-based model is proposed that enables utterances to capture contextual information from their surroundings in the same video, thus aiding the classification process, showing a 5-10% performance improvement over the state of the art and strong generalizability.
Learning to Generate Reviews and Discovering Sentiment
TLDR
The properties of byte-level recurrent language models are explored and a single unit which performs sentiment analysis is found which achieves state of the art on the binary subset of the Stanford Sentiment Treebank.
...