Non-Parallel Voice Conversion for ASR Augmentation
@article{Wang2022NonParallelVC,
  title   = {Non-Parallel Voice Conversion for ASR Augmentation},
  author  = {Gary Wang and Andrew Rosenberg and Bhuvana Ramabhadran and Fadi Biadsy and Yinghui Huang and Jesse Emond and Pedro Moreno Mengibar},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2209.06987}
}
Automatic speech recognition (ASR) needs to be robust to speaker differences. Voice Conversion (VC) modifies speaker characteristics of input speech. This is an attractive feature for ASR data augmentation. In this paper, we demonstrate that voice conversion can be used as a data augmentation technique to improve ASR performance, even on LibriSpeech, which contains 2,456 speakers. For ASR augmentation, it is necessary that the VC model be robust to a wide range of input speech. This motivates…
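The augmentation recipe described in the abstract can be summarized as: convert each training utterance to one or more sampled target speakers with a VC model and keep the transcript unchanged, so the label side of the ASR data is untouched. The sketch below is a minimal illustration under that assumption; the `Utterance` container, `convert` callable, and `augment_ratio` parameter are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
# Minimal sketch of VC-based data augmentation for ASR training.
# The `convert` callable stands in for a trained non-parallel VC model
# (hypothetical interface; the paper's model is not reproduced here).
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence

import numpy as np


@dataclass
class Utterance:
    audio: np.ndarray   # raw waveform samples
    transcript: str     # label; unchanged by voice conversion
    speaker_id: str


def augment_with_vc(
    dataset: Sequence[Utterance],
    convert: Callable[[np.ndarray, str], np.ndarray],  # (audio, target_speaker) -> converted audio
    target_speakers: List[str],
    augment_ratio: float = 1.0,
    seed: int = 0,
) -> List[Utterance]:
    """Return the original utterances plus VC-converted copies.

    Each selected utterance is converted to a randomly chosen target
    speaker; the transcript is kept, so the ASR label is unchanged.
    """
    rng = random.Random(seed)
    augmented: List[Utterance] = list(dataset)
    n_extra = min(int(len(dataset) * augment_ratio), len(dataset))
    for utt in rng.sample(list(dataset), k=n_extra):
        target = rng.choice(target_speakers)
        augmented.append(
            Utterance(
                audio=convert(utt.audio, target),
                transcript=utt.transcript,
                speaker_id=target,
            )
        )
    rng.shuffle(augmented)
    return augmented
```

In use, a trained non-parallel VC model would be passed as `convert` together with a pool of target speaker IDs; the returned mix of original and converted utterances can then be fed to standard ASR training.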
References
Voice Conversion Can Improve ASR in Very Low-Resource Settings
- Computer Science, INTERSPEECH
- 2022
This work combines several recent techniques to design and train a practical VC system in English, and then uses this system to augment data for training speech recognition models in several low-resource languages.
Phonetic posteriorgrams for many-to-one voice conversion without parallel data training
- Computer Science, 2016 IEEE International Conference on Multimedia and Expo (ICME)
- 2016
This paper proposes a novel approach to voice conversion with non-parallel training data. The idea is to bridge between speakers by means of Phonetic PosteriorGrams (PPGs) obtained from a…
Non-Parallel Voice Conversion with Cyclic Variational Autoencoder
- Computer Science, INTERSPEECH
- 2019
The experimental results demonstrate the effectiveness of the proposed CycleVAE-based VC, which yields higher accuracy of converted spectra, generates latent features with higher correlation degree, and significantly improves the quality and conversion accuracy of the converted speech.
The metamorphic algorithm: a speaker mapping approach to data augmentation
- Computer Science, IEEE Trans. Speech Audio Process.
- 1994
Results show that the metamorphic algorithm can substantially reduce the word error rate when only a limited amount of enrolment data is available, and can also be used for tracking spectral evolution over time, thus providing a possible means for robust speaker self-adaptation.
Voice Conversion Based Data Augmentation to Improve Children's Speech Recognition in Limited Data Scenario
- Computer Science, INTERSPEECH
- 2020
A significantly improved recognition rate for children's speech is noted due to VC-based data augmentation, and the need to deal with speaking-rate differences is reported, demonstrating the need for time-scale modification of children's speech test data.
Improving Speech Recognition Using GAN-Based Speech Synthesis and Contrastive Unspoken Text Selection
- Computer Science, INTERSPEECH
- 2020
This work proposes to combine generative adversarial network (GAN) and multi-style training (MTR) to increase acoustic diversity in the synthesized data and presents a contrastive language model-based data selection technique to improve the efficiency of learning from unspoken text.
ACVAE-VC: Non-Parallel Voice Conversion With Auxiliary Classifier Variational Autoencoder
- Computer Science, IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2019
This paper confirmed experimentally that the proposed method outperformed baseline non-parallel VC systems and performed comparably to an open-source parallel VC system trained using a parallel corpus in a speaker identity conversion task.
The NeteaseGames System for Voice Conversion Challenge 2020 with Vector-quantization Variational Autoencoder and WaveNet
- Computer Science, ArXiv
- 2020
VQ-VAE-WaveNet is a non-parallel VAE-based voice conversion method that reconstructs acoustic features while separating linguistic information from speaker identity; it achieved an average naturalness score of 3.95 in automatic naturalness prediction and ranked 6th and 8th, respectively, in ASV-based speaker similarity and spoofing countermeasures.
Enriching Source Style Transfer in Recognition-Synthesis based Non-Parallel Voice Conversion
- Computer Science, Interspeech
- 2021
This paper models prosody in a hybrid manner, effectively combining explicit and implicit methods in a proposed prosody module, and uses a modified self-attention based encoder to extract sentential context from bottleneck features, which also implicitly aggregates the prosodic aspects of the source speech from the layered representations.
Accent and Speaker Disentanglement in Many-to-many Voice Conversion
- Computer Science, 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)
- 2021
This work presents a joint voice and accent conversion approach that can convert an arbitrary source speaker's voice to a target speaker with a non-native accent, and proposes adversarial training to better disentangle speaker and accent information in the encoder-decoder based conversion model.