Non-Parallel Voice Conversion for ASR Augmentation

@article{Wang2022NonParallelVC,
  title={Non-Parallel Voice Conversion for ASR Augmentation},
  author={Gary Wang and Andrew Rosenberg and Bhuvana Ramabhadran and Fadi Biadsy and Yinghui Huang and Jesse Emond and Pedro Moreno Mengibar},
  journal={ArXiv},
  year={2022},
  volume={abs/2209.06987}
}
Automatic speech recognition (ASR) needs to be robust to speaker differences. Voice conversion (VC) modifies the speaker characteristics of input speech, which makes it an attractive technique for ASR data augmentation. In this paper, we demonstrate that voice conversion can be used as a data augmentation technique to improve ASR performance, even on LibriSpeech, which already contains 2,456 speakers. For ASR augmentation, the VC model must be robust to a wide range of input speech. This motivates…
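The augmentation recipe the abstract describes is straightforward to sketch: each training utterance can be copied through a non-parallel VC model with a resampled target speaker, while its transcript is reused verbatim, since VC changes speaker identity but (ideally) preserves linguistic content. Below is a minimal Python sketch of that pipeline; the `Utterance` container, the `convert_voice` callable, and `target_speakers` are illustrative assumptions, not code or names from the paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class Utterance:
    audio: np.ndarray   # waveform samples (illustrative representation)
    transcript: str     # label; unchanged by voice conversion
    speaker_id: str


def augment_with_vc(
    utterances: List[Utterance],
    convert_voice: Callable[[np.ndarray, str], np.ndarray],  # any VC model
    target_speakers: List[str],
    augment_prob: float = 0.5,
    seed: int = 0,
) -> List[Utterance]:
    """Return the original data plus voice-converted copies.

    Each selected utterance is converted toward a randomly sampled target
    speaker; the transcript is carried over unchanged, so the converted
    copy is a valid ASR training example with new speaker characteristics.
    """
    rng = random.Random(seed)
    augmented = list(utterances)
    for utt in utterances:
        if rng.random() < augment_prob:
            target = rng.choice(target_speakers)
            converted = convert_voice(utt.audio, target)
            augmented.append(
                Utterance(audio=converted,
                          transcript=utt.transcript,
                          speaker_id=target)
            )
    return augmented
```

In this framing, any non-parallel VC model that is robust to diverse input speech can play the role of `convert_voice`, and the converted copies are simply pooled with the originals for ASR training; the probability and target-speaker sampling policy here are placeholder choices, not the paper's reported configuration.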
