Speech Emotion Recognition Using Spectrogram & Phoneme Embedding

@inproceedings{Yenigalla2018SpeechER,
  title={Speech Emotion Recognition Using Spectrogram \& Phoneme Embedding},
  author={Promod Yenigalla and Abhay Kumar and Suraj Tripathi and Chirag Singh and Sibsambhu Kar and Jithendra Vepa},
  booktitle={INTERSPEECH},
  year={2018}
}
This paper proposes a speech emotion recognition method based on phoneme sequences and spectrograms. Both phoneme sequences and spectrograms retain emotional content of speech that is lost when speech is converted to text. We performed various experiments with different kinds of deep neural networks using phonemes and spectrograms as inputs. Three of those network architectures are presented here; they achieved better accuracy than state-of-the-art methods on benchmark…
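The two inputs the abstract describes can be illustrated with a minimal sketch: a log-magnitude spectrogram computed from the raw waveform, and an embedding lookup for a phoneme-id sequence. This is not the authors' code; all parameter values (sample rate, window size, vocabulary size, embedding dimension) are illustrative assumptions.

```python
import numpy as np
from scipy.signal import spectrogram

def speech_spectrogram(wave, sr=16000, nperseg=400, noverlap=240):
    """Log-magnitude spectrogram of a 1-D waveform (25 ms window, 10 ms hop)."""
    _, _, sxx = spectrogram(wave, fs=sr, nperseg=nperseg, noverlap=noverlap)
    return np.log(sxx + 1e-10)  # log compression with a small floor for stability

def embed_phonemes(phoneme_ids, vocab_size=50, dim=8, seed=0):
    """Look up a fixed random embedding vector for each phoneme id."""
    rng = np.random.default_rng(seed)
    table = rng.standard_normal((vocab_size, dim))  # stand-in for a learned table
    return table[np.asarray(phoneme_ids)]

# One second of a synthetic 440 Hz tone and a dummy phoneme-id sequence.
sr = 16000
wave = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
spec = speech_spectrogram(wave, sr)   # (freq_bins, time_frames)
emb = embed_phonemes([3, 17, 42])     # (sequence_length, dim)
print(spec.shape, emb.shape)
```

In the paper these two representations feed separate convolutional branches; here the sketch stops at the input features themselves.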

Citations

A Hybrid Technique using CNN+LSTM for Speech Emotion Recognition
  • Hafsa Qazi, B. Kaushik
  • Computer Science
    International Journal of Engineering and Advanced Technology
  • 2020
TLDR
This paper uses spectrograms as inputs to a hybrid deep convolutional LSTM for speech emotion recognition; the proposed model is highly capable, obtaining an accuracy of 94.26%.
3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms
TLDR
An emotion recognition system based on analysis of speech signals that is superior to the state-of-the-art methods reported in the literature is proposed.
Speech Emotion Recognition using Convolution Neural Networks and Deep Stride Convolutional Neural Networks
TLDR
A recently developed network architecture, Deep Stride Convolutional Neural Networks (DSCNN), is modified by using a smaller number of convolutional layers to increase computational speed while still maintaining accuracy.
Focal Loss based Residual Convolutional Neural Network for Speech Emotion Recognition
TLDR
A Residual Convolutional Neural Network based on speech features and trained under Focal Loss to recognize emotion in speech is proposed, preventing the model from being overwhelmed by easily classifiable examples.
Fine-grained Dynamical Speech Emotion Analysis Utilizing Networks Customized for Acoustic Data
  • Yaxiong Ma, Jincai Chen, Ping Lu
  • Computer Science
2020 IEEE International Conference on Advances in Electrical Engineering and Computer Applications (AEECA)
  • 2020
TLDR
A new method for fine-grained dynamical speech emotion analysis using neural networks customized for acoustic data is proposed; an emotion time unit (ETU) is introduced to model the dynamic change of speech emotion and improve recognition accuracy at the utterance level.
Emotion recognition from speech using spectrograms and shallow neural networks
TLDR
A SER (Speech Emotion Recognition) system is proposed that combines the pattern-recognition power of deep learning models with the ability to work on small databases.
DNN-based Emotion Recognition Based on Bottleneck Acoustic Features and Lexical Features
  • Eesung Kim, Jong Won Shin
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
Experimental results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) multimodal dataset showed 75.5% unweighted average recall, outperforming the best previously reported results for multimodal emotion recognition using acoustic and lexical features.
Speech Emotion Recognition Using Scalogram Based Deep Structure
TLDR
This work has proposed an SER method based on a concatenated Convolutional Neural Network and a Recurrent Neural Network that combines the strengths of both networks to learn long-term temporal relationships of the learned features.
Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation
TLDR
Experimental results indicate that the transfer learning and spectrogram augmentation approaches improve the SER performance, and when combined achieve state-of-the-art results.

References

Showing 1-10 of 23 references
Towards real-time Speech Emotion Recognition using deep neural networks
TLDR
A Deep Neural Network (DNN) that recognizes emotions from a one-second frame of raw speech spectrograms is presented and investigated; real-time recognition is achievable due to a deep hierarchical architecture, data augmentation, and sensible regularization.
Efficient Emotion Recognition from Speech Using Deep Learning on Spectrograms
TLDR
A new implementation of emotion recognition from paralinguistic information in speech, based on a deep neural network applied directly to spectrograms, achieves higher recognition accuracy than previously published results while also limiting latency.
An experimental study of speech emotion recognition based on deep convolutional neural networks
  • W. Q. Zheng, J. S. Yu, Y. Zou
  • Computer Science
    2015 International Conference on Affective Computing and Intelligent Interaction (ACII)
  • 2015
TLDR
Preliminary experiments show the proposed emotion recognition system based on DCNNs achieves about 40% classification accuracy and outperforms SVM-based classification using hand-crafted acoustic features.
Speech emotion recognition using deep neural network and extreme learning machine
TLDR
The experimental results demonstrate that the proposed approach effectively learns emotional information from low-level features and leads to a 20% relative accuracy improvement over state-of-the-art approaches.
Emotion recognition in spontaneous speech using GMMs
TLDR
The results indicate that using Gaussian mixture models on the frame level is a feasible technique for emotion classification, and combining the three classifiers significantly improves performance.
Emotion Recognition From Speech With Recurrent Neural Networks
TLDR
The effectiveness of a deep recurrent neural network trained on sequences of acoustic features calculated over small speech intervals is shown; a special probabilistic CTC loss function allows the model to consider long utterances containing both emotional and neutral parts.
High-level feature representation using recurrent neural network for speech emotion recognition
This paper presents a speech emotion recognition system using a recurrent neural network (RNN) model trained by an efficient learning algorithm. The proposed system takes into account the long-range
Multi-level Speech Emotion Recognition Based on HMM and ANN
TLDR
Comparison between isolated HMMs and a hybrid HMM/ANN shows that the introduced approach is more effective; the average recognition rate over five emotion states reaches 81.7%.
Emotion Recognition of Affective Speech Based on Multiple Classifiers Using Acoustic-Prosodic Information and Semantic Labels
TLDR
This work presents an approach to emotion recognition of affective speech based on multiple classifiers using acoustic-prosodic information (AP) and semantic labels (SLs), and shows that recognition accuracy can be further improved to 85.79%.
GMM Supervector Based SVM with Spectral Features for Speech Emotion Recognition
  • Hao Hu, Ming-Xing Xu, Wei Wu
  • Computer Science
    2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07
  • 2007
TLDR
Experimental results on an emotional speech database demonstrate that the GMM supervector based SVM outperforms standard GMM on speech emotion recognition.