Data Augmentation Using GANs for Speech Emotion Recognition

@inproceedings{Chatziagapi2019DataAU,
  title={Data Augmentation Using GANs for Speech Emotion Recognition},
  author={Aggelina Chatziagapi and Georgios Paraskevopoulos and Dimitris Sgouropoulos and Georgios Pantazopoulos and Malvina Nikandrou and Theodoros Giannakopoulos and Athanasios Katsamanis and Alexandros Potamianos and Shrikanth S. Narayanan},
  booktitle={INTERSPEECH},
  year={2019}
}
In this work, we address the problem of data imbalance for the task of Speech Emotion Recognition (SER). We investigate conditioned data augmentation using Generative Adversarial Networks (GANs), in order to generate samples for underrepresented emotions. We adapt and improve a conditional GAN architecture to generate synthetic spectrograms for the minority class. For comparison purposes, we implement a series of signal-based data augmentation methods. The proposed GAN-based approach is…
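The conditioning idea in the abstract — a generator that receives an emotion label alongside the noise vector, so it can be asked for samples of the minority class — can be sketched minimally as follows. This is an illustrative toy, not the paper's architecture: the layer sizes, `CondGenerator` class, and `one_hot` helper are assumptions for demonstration only.

```python
import numpy as np

def one_hot(label, n_classes):
    """Encode an emotion-class index as a one-hot vector."""
    v = np.zeros(n_classes)
    v[label] = 1.0
    return v

class CondGenerator:
    """Toy conditional generator: maps (noise, class label) to a fake 'spectrogram'."""
    def __init__(self, noise_dim=16, n_classes=4, out_shape=(8, 8), seed=0):
        rng = np.random.default_rng(seed)
        in_dim = noise_dim + n_classes          # noise and label are concatenated
        out_dim = out_shape[0] * out_shape[1]
        self.W1 = 0.1 * rng.standard_normal((in_dim, 32))
        self.W2 = 0.1 * rng.standard_normal((32, out_dim))
        self.noise_dim = noise_dim
        self.n_classes = n_classes
        self.out_shape = out_shape

    def sample(self, label, rng):
        z = rng.standard_normal(self.noise_dim)                   # latent noise
        x = np.concatenate([z, one_hot(label, self.n_classes)])   # condition on the label
        h = np.tanh(x @ self.W1)
        return np.tanh(h @ self.W2).reshape(self.out_shape)

# Oversample the minority class by drawing extra synthetic spectrograms for it.
gen = CondGenerator()
rng = np.random.default_rng(1)
minority_class = 2  # hypothetical index of the underrepresented emotion
synthetic = [gen.sample(minority_class, rng) for _ in range(5)]
```

In a real setup the generator would be trained adversarially against a discriminator and the synthetic spectrograms mixed into the training set; the point here is only the label-conditioning mechanism that lets the generator target one class.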


GAN-Based Data Generation for Speech Emotion Recognition
TLDR
This work proposes a CNN-based GAN with spectral normalization on both the generator and discriminator, both of which are pre-trained on large unlabeled speech corpora, and shows that this method provides better speech emotion recognition performance than a strong baseline.
StarGAN for Emotional Speech Conversion: Validated by Data Augmentation of End-To-End Emotion Recognition
TLDR
An adversarial network implementation for speech emotion conversion as a data augmentation method, validated by a multi-class speech affect recognition task; it is concluded that its samples are indicative of their target emotion, albeit showing a tendency for confusion in cases where the emotional attributes of valence and arousal are inconsistent.
AAEC: An Adversarial Autoencoder-based Classifier for Audio Emotion Recognition
TLDR
A model is proposed, Adversarial Autoencoder-based Classifier (AAEC), that can not only augment the data within the real data distribution but also reasonably extend the boundary of the current data distribution to a possible space.
Multi-Conditioning and Data Augmentation Using Generative Noise Model for Speech Emotion Recognition in Noisy Conditions
TLDR
This paper proposes multi-conditioning and data augmentation using an utterance-level parametric generative noise model, designed to generate noise types which can span the entire noise space in the mel-filterbank energy domain.
Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation
TLDR
Experimental results indicate that the transfer learning and spectrogram augmentation approaches improve the SER performance, and when combined achieve state-of-the-art results.
Multi-Window Data Augmentation Approach for Speech Emotion Recognition
TLDR
The proposed augmentation method with minimally extracted features combined with a deep learning model improves the performance of speech emotion recognition and achieves state-of-the-art results.
Speech emotion recognition with deep convolutional neural networks
TLDR
A new architecture is introduced, which extracts mel-frequency cepstral coefficients, chromagram, mel-scale spectrogram, Tonnetz representation, and spectral contrast features from sound files and uses them as inputs for a one-dimensional Convolutional Neural Network for the identification of emotions, using samples from the Ryerson Audio-Visual Database of Emotional Speech and Song, Berlin, and EMO-DB datasets.
An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation
TLDR
A novel StarGAN framework along with a two-stage training process that separates emotional features from those independent of emotion by using an autoencoder with two encoders as the generator of the Generative Adversarial Network (GAN) reveals that the proposed model can effectively reduce distortion.
Deep Learning Techniques for Speech Emotion Recognition, from Databases to Models
TLDR
This study reviews deep learning approaches for SER with available datasets, followed by conventional machine learning techniques for speech emotion recognition, and presents a multi-aspect comparison between practical neural network approaches in Speech Emotion Recognition.
Feature Augmenting Networks for Improving Depression Severity Estimation From Speech Signals
TLDR
This approach is the first attempt to apply the Generative Adversarial Network for feature augmentation to improve depression severity estimation from speech, and results show that the combination of the three proposed evaluation criteria can effectively and comprehensively evaluate the quality of the augmented features.

References

Showing 1–10 of 31 references
CNN+LSTM Architecture for Speech Emotion Recognition with Data Augmentation
TLDR
This work designs a neural network for recognizing emotions in speech, using the IEMOCAP dataset, examines the techniques of data augmentation with vocal tract length perturbation, layer-wise optimizer adjustment, and batch normalization of recurrent layers, and obtains highly competitive results.
On Enhancing Speech Emotion Recognition using Generative Adversarial Networks
TLDR
This work investigates the application of GANs to generate synthetic feature vectors used for speech emotion recognition and investigates two setups: a vanilla GAN, and a conditional GAN that learns the distribution of the higher-dimensional feature vectors conditioned on the labels of the emotional classes to which they belong.
Context-Aware Attention Mechanism for Speech Emotion Recognition
TLDR
A new Long Short-Term Memory (LSTM)-based neural network attention model which is able to take into account the temporal information in speech during the computation of the attention vector is introduced.
Using regional saliency for speech emotion recognition
  • Zakaria Aldeneh, E. Provost
  • Computer Science
  • 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
TLDR
The results suggest that convolutional neural networks with Mel Filterbanks (MFBs) can be used as a replacement for classifiers that rely on features obtained from applying utterance-level statistics.
Evaluating deep learning architectures for Speech Emotion Recognition
TLDR
A frame-based formulation to SER is described that relies on minimal speech processing and end-to-end deep learning to model intra-utterance dynamics, and is used to empirically explore feed-forward and recurrent neural network architectures and their variants.
Deep convolutional recurrent neural network with attention mechanism for robust speech emotion recognition
TLDR
A deep convolutional recurrent neural network for speech emotion recognition based on the log-Mel filterbank energies is presented, where the convolutional layers are responsible for discriminative feature learning and a convolutional attention mechanism is proposed to learn the utterance structure relevant to the task.
Attentive Convolutional Neural Network Based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech
TLDR
This work conducts extensive experiments using an attentive convolutional neural network with a multi-view learning objective function for speech emotion recognition and achieves state-of-the-art results on the improvised speech data of IEMOCAP.
BAGAN: Data Augmentation with Balancing GAN
TLDR
This work proposes balancing GAN (BAGAN) as an augmentation tool to restore balance in imbalanced datasets, compares the proposed methodology with state-of-the-art GANs, and demonstrates that BAGAN generates images of superior quality when trained with an imbalanced dataset.
Understanding Data Augmentation for Classification: When to Warp?
TLDR
It is found that while it is possible to perform generic augmentation in feature-space, if plausible transforms for the data are known then augmentation in data-space provides a greater benefit for improving performance and reducing overfitting.
3-D Convolutional Recurrent Neural Networks With Attention Model for Speech Emotion Recognition
TLDR
A three-dimensional attention-based convolutional recurrent neural network to learn discriminative features for SER is proposed, where the Mel-spectrogram with deltas and delta-deltas is used as input.