Corpus ID: 237572249

Hybrid Data Augmentation and Deep Attention-based Dilated Convolutional-Recurrent Neural Networks for Speech Emotion Recognition

Nhat Truong Pham, Duc Ngoc Minh Dang, Sy Dzung Nguyen
Speech emotion recognition (SER) has been one of the significant tasks in Human-Computer Interaction (HCI) applications. However, it is hard to choose the optimal features and deal with imbalanced labeled data. In this article, we investigate hybrid data augmentation (HDA) methods to generate and balance data based on traditional and generative adversarial network (GAN) methods. To evaluate the effectiveness of HDA methods, a deep learning framework, namely ADCRNN, is designed by integrating…
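The abstract excerpt does not say which traditional augmentation methods HDA uses, so the following is only a minimal sketch of the general idea of balancing labeled data by augmentation. It assumes additive Gaussian noise as a stand-in for a traditional method; the function name `balance_with_noise` and the `noise_std` parameter are hypothetical and not taken from the paper.

```python
import numpy as np


def balance_with_noise(features, labels, noise_std=0.01, seed=0):
    """Oversample minority classes with noisy copies (illustrative assumption,
    not the paper's HDA method) until every class matches the largest one."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)

    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()

    out_x, out_y = [features], [labels]
    for cls, count in zip(classes, counts):
        deficit = int(target - count)
        if deficit == 0:
            continue  # this class is already as large as the biggest one
        pool = features[labels == cls]
        # Draw existing samples of this class at random, with replacement.
        picks = rng.integers(0, len(pool), size=deficit)
        # Perturb each drawn sample with small Gaussian noise.
        noise = rng.normal(0.0, noise_std, size=(deficit,) + pool.shape[1:])
        out_x.append(pool[picks] + noise)
        out_y.append(np.full(deficit, cls, dtype=labels.dtype))

    return np.concatenate(out_x), np.concatenate(out_y)
```

Calling `balance_with_noise(x, y)` on a feature matrix `x` with four samples of one emotion and one of another returns eight samples, four per class. A GAN-based generator would replace the noise step in a real hybrid pipeline.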


A Method upon Deep Learning for Speech Emotion Recognition
Experimental results show that the performance of the proposed method is strongly comparable with the existing state-of-the-art methods on the Emo-DB and ERC2019 datasets, with 88% and 67%, respectively.
Speech Emotion Recognition From 3D Log-Mel Spectrograms With Deep Learning Network
A novel architecture, ADRNN (dilated CNN with residual block and BiLSTM based on the attention mechanism), is applied to speech emotion recognition; it takes advantage of the strengths of diverse networks and overcomes the shortcomings of using each alone, and it is evaluated on the popular IEMOCAP database and the Berlin EMODB corpus.
3-D Convolutional Recurrent Neural Networks With Attention Model for Speech Emotion Recognition
A three-dimensional attention-based convolutional recurrent neural network to learn discriminative features for SER is proposed, where the Mel-spectrogram with deltas and delta-deltas is used as input.
Speech emotion recognition with deep convolutional neural networks
A new architecture is introduced, which extracts mel-frequency cepstral coefficients, chromagram, mel-scale spectrogram, Tonnetz representation, and spectral contrast features from sound files and uses them as inputs to a one-dimensional convolutional neural network for the identification of emotions using samples from the Ryerson Audio-Visual Database of Emotional Speech and Song, Berlin, and EMO-DB datasets.
Exploring Deep Spectrum Representations via Attention-Based Recurrent and Convolutional Neural Networks for Speech Emotion Recognition
Deep spectrum representations extracted from the proposed model are well-suited to the task of SER, achieving a WA of 68.1% and a UA of 67.0% on IEMOCAP, and 45.4% on the FAU-AEC dataset.
Improving Speech Emotion Recognition With Adversarial Data Augmentation Network
  • Lu Yi, M. Mak
  • Medicine, Computer Science
  • IEEE transactions on neural networks and learning systems
  • 2020
By forcing the synthetic latent vectors and the real latent vectors to share a common representation, the gradient vanishing problem can be largely alleviated and the resulting emotion classifiers are competitive with state-of-the-art speech emotion recognition systems.
Multimodal Continuous Emotion Recognition with Data Augmentation Using Recurrent Neural Networks
This paper presents our efforts for the Cross-cultural Emotion Sub-challenge in the Audio/Visual Emotion Challenge (AVEC) 2018, whose goal is to predict the level of three emotional dimensions…
End-to-End Multimodal Emotion Recognition Using Deep Neural Networks
This work proposes an emotion recognition system using auditory and visual modalities: a convolutional neural network extracts features from the speech, while for the visual modality a deep residual network of 50 layers is used.
Clustering-Based Speech Emotion Recognition by Incorporating Learned Features and Deep BiLSTM
A novel framework for SER is introduced using key sequence segment selection based on radial basis function network (RBFN) similarity measurement in clusters to reduce the computational complexity of the overall model, and normalizing the CNN features before their actual processing so that it can easily recognize the spatio-temporal information.
On Enhancing Speech Emotion Recognition using Generative Adversarial Networks
This work investigates the application of GANs to generate synthetic feature vectors used for speech emotion recognition and examines two setups: a vanilla GAN and a conditional GAN that learns the distribution of the higher dimensional feature vectors conditioned on the labels or the emotional class to which they belong.
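The WA and UA figures quoted in the deep spectrum entry above are not defined on this page. As a minimal sketch, assuming the definitions conventional in the SER literature (WA as overall accuracy across all utterances, UA as the unweighted mean of per-class recall), they can be computed as follows; the function names are illustrative.

```python
import numpy as np


def weighted_accuracy(y_true, y_pred):
    """WA: fraction of all samples classified correctly (assumed definition)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


def unweighted_accuracy(y_true, y_pred):
    """UA: mean of per-class recall, each class counted equally (assumed definition)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))
```

On imbalanced data the two diverge: a classifier that always predicts the majority class can score a high WA while its UA stays low, which is why SER papers often report both.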