Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion

@article{Zhao2020VoiceCC,
  title={Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion},
  author={Yi Zhao and Wen-Chin Huang and Xiaohai Tian and Junichi Yamagishi and Rohan Kumar Das and Tomi H. Kinnunen and Zhenhua Ling and Tomoki Toda},
  journal={ArXiv},
  year={2020},
  volume={abs/2008.12527}
}
The voice conversion challenge is a biennial scientific event held to compare and understand different voice conversion (VC) systems built on a common dataset. In 2020, we organized the third edition of the challenge and constructed and distributed a new database for two tasks, intra-lingual semi-parallel and cross-lingual VC. After a two-month challenge period, we received 33 submissions, including 3 baselines built on the database. From the results of crowd-sourced listening tests, we…
Transfer Learning from Monolingual ASR to Transcription-free Cross-lingual Voice Conversion
TLDR
This paper focuses on knowledge transfer from monolingual ASR to cross-lingual VC in order to address the content mismatch problem, and proposes a speaker-dependent conversion model that significantly reduces the MOS drop between intra- and cross-lingual conversion.
Submission from SRCB for Voice Conversion Challenge 2020
TLDR
This work focuses on building a voice conversion system that achieves consistent improvements in accent and intelligibility evaluations; it extracts general phonation from the source speakers' speech in different languages and improves sound quality by optimizing the speech synthesis module and adding a noise-suppression post-processing module to the vocoder.
Cross-Lingual Voice Conversion with Disentangled Universal Linguistic Representations
TLDR
An any-to-many voice conversion system based on disentangled universal linguistic representations (ULRs) extracted from a mixed-lingual phoneme recognition system; two methods are proposed to remove speaker information from the ULRs.
StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion
TLDR
Subjective and objective evaluation experiments on a non-parallel many-to-many voice conversion task revealed that the StarGANv2-VC model produces natural-sounding voices, close to the sound quality of state-of-the-art text-to-speech based voice conversion methods, without the need for text labels.
Building Bilingual and Code-Switched Voice Conversion with Limited Training Data Using Embedding Consistency Loss
TLDR
A parallel non-autoregressive network is described that achieves bilingual and code-switched voice conversion for multiple speakers when only monolingual corpora are available for each language.
Predictions of Subjective Ratings and Spoofing Assessments of Voice Conversion Challenge 2020 Submissions
TLDR
Five types of objective assessments using automatic speaker verification (ASV), neural speaker embeddings, spoofing countermeasures, predicted mean opinion scores (MOS), and automatic speech recognition (ASR) are examined to provide complementary performance analysis that may be more beneficial than the time-consuming listening tests.
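As a rough illustration of one of these objective checks, the snippet below scores speaker similarity as the cosine similarity between neural speaker embeddings of converted and target speech. The embeddings here are random placeholders standing in for a real speaker encoder's output; this is a sketch, not the paper's evaluation code.

    import torch
    import torch.nn.functional as F

    # Placeholder embeddings; in practice these come from a speaker
    # encoder (e.g. an ASV model) applied to converted and target audio.
    emb_converted = torch.randn(256)
    emb_target = torch.randn(256)

    # Higher cosine similarity suggests the converted voice is closer
    # to the target speaker's identity.
    score = F.cosine_similarity(emb_converted, emb_target, dim=0)
    print(f"speaker similarity score: {score.item():.3f}")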
StarGAN-ZSVC: Towards Zero-Shot Voice Conversion in Low-Resource Contexts
TLDR
StarGAN-ZSVC extends the StarGAN approach, showing that real-time zero-shot voice conversion is possible even for a model trained on very little data, and comparing it against other voice conversion techniques in a low-resource setting using a small 9-minute training set.
The NeteaseGames System for Voice Conversion Challenge 2020 with Vector-quantization Variational Autoencoder and WaveNet
TLDR
VQ-VAE-WaveNet is a non-parallel VAE-based voice conversion system that reconstructs acoustic features while separating linguistic information from speaker identity; it achieves an average naturalness score of 3.95 in automatic naturalness prediction and ranked 6th and 8th in ASV-based speaker similarity and spoofing countermeasures, respectively.
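For intuition, here is a minimal sketch of the vector-quantization step that makes this separation possible: each continuous encoder frame is snapped to its nearest codebook entry, so the discrete codes carry linguistic content while speaker identity is supplied separately to the decoder. Function names and tensor shapes are illustrative, not the submission's actual code.

    import torch

    def vector_quantize(z_e, codebook):
        # z_e: (T, D) continuous encoder outputs for one utterance.
        # codebook: (K, D) learned discrete units.
        dists = torch.cdist(z_e, codebook)      # (T, K) pairwise distances
        indices = dists.argmin(dim=1)           # nearest code per frame
        z_q = codebook[indices]                 # (T, D) quantized frames
        # Straight-through estimator: gradients flow back to the encoder.
        return z_e + (z_q - z_e).detach(), indices

    # The WaveNet decoder conditions on the quantized codes plus a target
    # speaker embedding, so speaker identity never enters the codes.
    z_q, ids = vector_quantize(torch.randn(100, 64), torch.randn(512, 64))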
GlowVC: Mel-spectrogram space disentangling model for language-independent text-free voice conversion
TLDR
GlowVC models greatly outperform the AutoVC baseline in terms of intelligibility, while achieving just as high speaker similarity in intra-lingual VC and only slightly lower similarity in the cross-lingual setting.
Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation
TLDR
Subjective test results showed that a FastSpeech 2-based emotional TTS system with a novel data augmentation method that combines pitch-shifting and VC techniques improved naturalness and emotional similarity compared with conventional methods.
...

References

SHOWING 1-10 OF 51 REFERENCES
The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods
TLDR
A brief summary of the state-of-the-art techniques for VC is presented, followed by a detailed explanation of the challenge tasks and the results that were obtained.
Predictions of Subjective Ratings and Spoofing Assessments of Voice Conversion Challenge 2020 Submissions
TLDR
Five types of objective assessments using automatic speaker verification (ASV), neural speaker embeddings, spoofing countermeasures, predicted mean opinion scores (MOS), and automatic speech recognition (ASR) are examined to provide complementary performance analysis that may be more beneficial than the time-consuming listening tests.
Cross-lingual Voice Conversion with Bilingual Phonetic Posteriorgram and Average Modeling
This paper presents a cross-lingual voice conversion approach using bilingual Phonetic PosteriorGram (PPG) and average modeling. The proposed approach makes use of bilingual PPGs to represent…
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
TLDR
This paper proposes a novel one-shot VC approach that performs conversion given only one example utterance each from the source and target speakers; neither speaker needs to be seen during training.
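The core trick, sketched below under my own naming, is that instance normalization strips per-channel statistics (which carry speaker-like information) from the content encoder's features, and adaptive instance normalization re-injects the target speaker's statistics. This is a toy illustration, not the paper's code.

    import torch

    def instance_norm(x, eps=1e-5):
        # x: (C, T) features of one utterance. Per-channel mean/std carry
        # global, speaker-like information; normalizing them out leaves
        # mostly linguistic content.
        mu = x.mean(dim=1, keepdim=True)
        sigma = x.std(dim=1, keepdim=True)
        return (x - mu) / (sigma + eps)

    def adaptive_instance_norm(content, tgt_mu, tgt_sigma):
        # Re-inject the target speaker's channel statistics.
        return tgt_sigma * instance_norm(content) + tgt_mu

    # One-shot conversion: normalize the source features, then apply
    # statistics taken from a single target-speaker utterance.
    src, tgt = torch.randn(256, 120), torch.randn(256, 80)
    converted = adaptive_instance_norm(
        src, tgt.mean(dim=1, keepdim=True), tgt.std(dim=1, keepdim=True))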
Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data
TLDR
Cotatron is a transcription-guided speech encoder for speaker-independent linguistic representations based on a multispeaker TTS architecture; systems built on it outperform the previous method in terms of both naturalness and speaker similarity.
Phonetic posteriorgrams for many-to-one voice conversion without parallel data training
This paper proposes a novel approach to voice conversion with non-parallel training data. The idea is to bridge between speakers by means of Phonetic PosteriorGrams (PPGs) obtained from a…
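Schematically, the PPG bridge works as follows: a speaker-independent ASR acoustic model turns any source speaker's frames into per-frame phoneme posteriors, and a conversion model trained only on the target speaker maps those posteriors to the target's acoustic features. The module names and sizes below are hypothetical stand-ins, not the paper's models.

    import torch
    import torch.nn as nn

    N_PHONES, N_MELS = 40, 80

    class PPGExtractor(nn.Module):
        # Stand-in for a speaker-independent ASR acoustic model.
        def __init__(self, n_feats=13):
            super().__init__()
            self.net = nn.Linear(n_feats, N_PHONES)
        def forward(self, feats):               # (T, n_feats)
            # A PPG is the per-frame posterior over phonetic classes.
            return self.net(feats).softmax(dim=-1)

    class TargetDecoder(nn.Module):
        # Stand-in for the target-speaker conversion model.
        def __init__(self):
            super().__init__()
            self.net = nn.Linear(N_PHONES, N_MELS)
        def forward(self, ppg):                 # (T, N_PHONES)
            return self.net(ppg)                # (T, N_MELS)

    # Any source speaker -> speaker-independent PPG -> target acoustics.
    mel = TargetDecoder()(PPGExtractor()(torch.randn(200, 13)))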
StarGAN-VC: non-parallel many-to-many Voice Conversion Using Star Generative Adversarial Networks
TLDR
Subjective evaluation experiments on a non-parallel many-to-many speaker identity conversion task revealed that the proposed method obtained higher sound quality and speaker similarity than a state-of-the-art method based on variational autoencoding GANs.
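A rough simplification of the training signal behind this approach: an adversarial term pushes converted features toward the target speaker's domain, a cycle term preserves linguistic content, and an identity term stabilizes training. The generator and discriminator below are toy stubs so the sketch runs; the real models are CNNs over acoustic features, and the loss weights are assumptions.

    import torch

    def stargan_vc_loss(G, D, x_src, c_src, c_tgt, lam_cyc=10.0, lam_id=5.0):
        x_fake = G(x_src, c_tgt)                          # convert to target
        adv = -torch.log(D(x_fake, c_tgt) + 1e-8).mean()  # fool discriminator
        cyc = (G(x_fake, c_src) - x_src).abs().mean()     # cycle consistency
        idm = (G(x_src, c_src) - x_src).abs().mean()      # identity mapping
        return adv + lam_cyc * cyc + lam_id * idm

    # Toy stand-ins with the right signatures.
    G = lambda x, c: x + 0.0 * c.sum()
    D = lambda x, c: torch.sigmoid(x.mean())
    x = torch.randn(36, 128)                              # (features, frames)
    loss = stargan_vc_loss(G, D, x, torch.eye(4)[0], torch.eye(4)[1])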
Voice Conversion from Unaligned Corpora Using Variational Autoencoding Wasserstein Generative Adversarial Networks
TLDR
This paper proposes a non-parallel VC framework with a variational autoencoding Wasserstein generative adversarial network (VAW-GAN) that explicitly considers a VC objective when building the speech model.
Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining
TLDR
Experimental results show that a simple yet effective pretraining technique that transfers knowledge from learned TTS models, which benefit from large-scale, easily accessible TTS corpora, can facilitate data-efficient training and outperform an RNN-based seq2seq VC model in terms of intelligibility, naturalness, and similarity.
Average Modeling Approach to Voice Conversion with Non-Parallel Data
TLDR
The proposed approach makes use of a multi-speaker average model that maps speaker-independent linguistic features to speaker-dependent acoustic features and doesn't require parallel data for either average-model training or adaptation.
...