Corpus ID: 195316945

Learning Discriminative features using Center Loss and Reconstruction as Regularizer for Speech Emotion Recognition

Suraj Tripathi, Abhiram Ramesh, Abhay Kumar, Chirag Singh, Promod Yenigalla
This paper proposes a Convolutional Neural Network (CNN) for the recognition of emotion in speech, inspired by Multitask Learning (MTL) and trained on speech features under the joint supervision of softmax loss and center loss, a powerful metric learning strategy. Speech features such as spectrograms and Mel-frequency Cepstral Coefficients (MFCCs) help retain emotion-related low-level characteristics in speech. We experimented with several Deep Neural Network (DNN) architectures that… 
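The center loss mentioned in the abstract penalizes the distance between each embedding and the running center of its class, pulling same-class features together while softmax loss keeps classes separable. A minimal NumPy sketch of the idea (function names and the update rate `alpha` are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def center_loss(features, labels, centers):
    """Half the mean squared distance of each embedding to its class center."""
    diffs = features - centers[labels]              # (N, D) per-sample offsets
    return 0.5 * np.mean(np.sum(diffs ** 2, axis=1))

def update_centers(features, labels, centers, alpha=0.5):
    """Move each observed class center toward the batch mean of its features.

    alpha is an assumed learning rate for the center update.
    """
    new_centers = centers.copy()
    for c in np.unique(labels):
        mask = labels == c
        # gradient-style update: center minus mean of assigned features
        delta = np.mean(centers[c] - features[mask], axis=0)
        new_centers[c] = centers[c] - alpha * delta
    return new_centers
```

In the joint-supervision setup, the total objective would be the softmax cross-entropy plus this term scaled by a small weight, with centers updated once per mini-batch.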
2 Citations
A Deep Time-delay Embedded Algorithm for Unsupervised Stress Speech Clustering
Based on the authors' experiments, DTEC outperforms the popular deep clustering algorithm and is able to increase clustering performance in terms of accuracy (ACC) and normalized mutual information (NMI).
C3VQG: category consistent cyclic visual question generation
The proposed model, C3VQG outperforms state-of-the-art VQG methods with weak supervision and introduces a novel category consistent cyclic loss to enable the model to generate consistent predictions with respect to the answer category, reducing redundancies and irregularities.


References

Efficient Emotion Recognition from Speech Using Deep Learning on Spectrograms
A new implementation of emotion recognition from the para-lingual information in speech, based on a deep neural network applied directly to spectrograms, achieves higher recognition accuracy than previously published results while also limiting latency.
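Several of the works above feed spectrograms directly into a network. A minimal NumPy sketch of computing a log-magnitude spectrogram from a raw waveform (the 400-sample frame and 160-sample hop correspond to 25 ms / 10 ms at 16 kHz; these sizes are illustrative assumptions):

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160):
    """Log-magnitude STFT spectrogram: (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # slice the waveform into overlapping windowed frames
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))   # magnitude per frequency bin
    return np.log(mag + 1e-8)                   # small floor avoids log(0)
```

The resulting 2-D time-frequency array is what a CNN of the kind described above would consume as an image-like input.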
Speech emotion recognition using deep neural network and extreme learning machine
The experimental results demonstrate that the proposed approach effectively learns emotional information from low-level features and leads to a 20% relative accuracy improvement over state-of-the-art approaches.
Learning Discriminative Features for Speaker Identification and Verification
A Convolutional Neural Network architecture based on the popular very deep VGG CNNs is proposed, with key modifications to accommodate variable-length spectrogram inputs, reduce model disk-space requirements, and reduce the number of parameters, resulting in significantly shorter training times.
Speech Emotion Recognition Using Spectrogram & Phoneme Embedding
A combined phoneme and spectrogram CNN model proved to be the most accurate at recognizing emotions on IEMOCAP data, achieving more than a 4% increase in overall accuracy and average class accuracy compared to existing state-of-the-art methods.
High-level feature representation using recurrent neural network for speech emotion recognition
This paper presents a speech emotion recognition system using a recurrent neural network (RNN) model trained by an efficient learning algorithm. The proposed system takes into account the long-range… 
Speech emotion recognition with acoustic and lexical features
This paper proposes a new feature representation named emotion vector (eVector), applies it alongside the traditional Bag-of-Words (BoW) feature for emotion recognition, and compares their performance on the USC-IEMOCAP database.
Modeling the Temporal Evolution of Acoustic Parameters for Speech Emotion Recognition
A series of exhaustive experiments is described which demonstrates the feasibility of recognizing human emotional states from low-level descriptors, integrating subsequent feature values via three different methodologies.
Deep learning for robust feature generation in audiovisual emotion recognition
A suite of Deep Belief Network models is proposed and evaluated, and it is demonstrated that these models improve emotion classification performance over baselines that do not employ deep learning, suggesting that the learned high-order non-linear relationships are effective for emotion recognition.
Emotion Recognition From Speech With Recurrent Neural Networks
The effectiveness of a deep recurrent neural network trained on sequences of acoustic features calculated over small speech intervals is shown; a probabilistic CTC loss function allows the model to handle long utterances containing both emotional and neutral parts.
Extracting Speaker-Specific Information with a Regularized Siamese Deep Network
A multi-objective loss function is proposed for learning speaker-specific characteristics, with regularization that normalizes interference from non-speaker-related information and avoids information loss.