Overcoming Data Scarcity in Speaker Identification: Dataset Augmentation with Synthetic MFCCs via Character-level RNN

Authors: Jordan J. Bird, Diego Resende Faria, Cristiano Premebida, Anikó Ekárt, Pedro P. S. Ayrosa
Venue: 2020 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC)
Autonomous speaker identification suffers issues of data scarcity due to it being unrealistic to gather hours of speaker audio to form a dataset, which inevitably leads to class imbalance in comparison to the large data availability from non-speakers since large-scale speech datasets are available online. In this study, we explore the possibility of improving speaker recognition by augmenting the dataset with synthetic data produced by training a Character-level Recurrent Neural Network on a…
LSTM and GPT-2 Synthetic Speech Transfer Learning for Speaker Recognition to Overcome Data Scarcity
It is argued that speaker classification can be improved by utilising a small amount of user data but with exposure to synthetically-generated MFCCs which then allow the networks to achieve near maximum classification scores.
Fruit Quality and Defect Image Classification with Conditional GAN Data Augmentation
This work suggests a machine learning pipeline that combines the ideas of fine-tuning, transfer learning, and generative model-based training data augmentation towards improving fruit quality image classification, arguing that Conditional Generative Adversarial Networks have the ability to produce new data to alleviate issues of data scarcity.
Transformer-based Map Matching Model with Limited Ground-Truth Data using Transfer-Learning Approach
  • Zhixiong Jin, Seongjin Choi, H. Yeo
  • Computer Science
  • ArXiv
  • 2021
This paper builds a Transformer-based map-matching model with a transfer learning approach that outperforms existing models and uses the attention weights of the Transformer to plot the map-matching process and find how the model matches the road segments correctly.


VoxCeleb: A Large-Scale Speaker Identification Dataset
This paper proposes a fully automated pipeline based on computer vision techniques to create a large scale text-independent speaker identification dataset collected 'in the wild', and shows that a CNN based architecture obtains the best performance for both identification and verification.
Adversarial Learning and Augmentation for Speaker Recognition
This paper develops a new generative adversarial network (GAN) to artificially generate i-vectors to deal with the issue of unbalanced or insufficient data in speaker recognition based on the…
Learning Discriminative Features for Speaker Identification and Verification
A Convolutional Neural Network Architecture based on the popular Very Deep VGG CNNs, with key modifications to accommodate variable length spectrogram inputs, reduce the model disk space requirements and reduce the number of parameters, resulting in significant reduction in training times is proposed.
Phoneme aware speech recognition through evolutionary optimisation
A preliminary study on Artificial Neural Network (ANN) and Hidden Markov Model (HMM) methods of classification for Human Speech Recognition through Diphthong Vowel sounds in the English Phonetic Alphabet with a specific focus on evolutionary optimisation of bio-inspired classification methods.
Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech
Experiments show that the a-layer can effectively learn to interpolate the acoustic features between speakers, and tackle the problem of speaker interpolation by adding a new output layer (a-layer) on top of the multi-output branches.
Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech
Two different approaches for speech enhancement to train TTS systems are investigated, following conventional speech enhancement methods, and show that the second approach results in larger MCEP distortion but smaller F0 errors.
Speaker verification with short utterances: a review of challenges, trends and opportunities
The authors present an extensive survey of SV with short utterances considering the studies from recent past and include latest research offering various solutions and analyses to address the limited data issue within the scope of SV.
This report describes our contribution to the development of audio scene classification methods for the DCASE 2018 Challenge Task 1A. The proposed systems for this task are based on data augmentation…
Long short-term memory recurrent neural network architectures for large scale acoustic modeling
The first distributed training of LSTM RNNs using asynchronous stochastic gradient descent optimization on a large cluster of machines is introduced and it is shown that a two-layer deep LSTM RNN where each LSTM layer has a linear recurrent projection layer can exceed state-of-the-art speech recognition performance.
An RNN-Based Quantized F0 Model with Multi-Tier Feedback Links for Text-to-Speech Synthesis
A recurrent-neural-network-based F0 model for text-to-speech (TTS) synthesis that generates F0 contours given textual features using a sequence of discrete symbols, which avoids the influence of artificially interpolated F0 curves.