Predictions of Subjective Ratings and Spoofing Assessments of Voice Conversion Challenge 2020 Submissions

@article{Das2020PredictionsOS,
  title={Predictions of Subjective Ratings and Spoofing Assessments of Voice Conversion Challenge 2020 Submissions},
  author={Rohan Kumar Das and Tomi H. Kinnunen and Wen-Chin Huang and Zhenhua Ling and Junichi Yamagishi and Yi Zhao and Xiaohai Tian and Tomoki Toda},
  journal={ArXiv},
  year={2020},
  volume={abs/2009.03554}
}
The Voice Conversion Challenge 2020 is the third edition of this flagship event, which promotes intra-lingual semi-parallel and cross-lingual voice conversion (VC). While the primary evaluation of the challenge submissions was carried out through crowd-sourced listening tests, we also performed an objective assessment of the submitted systems. The aim of the objective assessment is to provide complementary performance analysis that may be more practical than time-consuming listening tests. In this study…
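The objective assessment in this paper combines automatic naturalness prediction, ASV-based speaker similarity, and spoofing countermeasures. As a minimal sketch of the speaker-similarity side, the snippet below scores a converted utterance against a target-speaker reference by cosine similarity between speaker embeddings; the embedding model itself is assumed (any trained speaker encoder, such as the x-vector networks referenced below, would do) and is not shown here.

    # Minimal sketch: ASV-style speaker similarity between a converted
    # utterance and a target-speaker reference. The speaker embeddings are
    # assumed to come from an already-trained encoder (e.g. an x-vector
    # network); that model is not part of this sketch.
    import numpy as np

    def speaker_similarity(emb_converted: np.ndarray, emb_target: np.ndarray) -> float:
        # Cosine score: higher means the conversion sounds more like the target;
        # an ASV system would accept when the score exceeds a tuned threshold.
        num = float(np.dot(emb_converted, emb_target))
        return num / (np.linalg.norm(emb_converted) * np.linalg.norm(emb_target))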
Citations

Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion
  • Yi Zhao, Wen-Chin Huang, T. Toda
  • Computer Science
    Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020
  • 2020
TLDR
From the results of crowd-sourced listening tests, it is observed that VC methods have progressed rapidly thanks to advanced deep learning methods, although for the cross-lingual task the overall naturalness and similarity scores were lower than those for the intra-lingual conversion task.
On Prosody Modeling for ASR+TTS Based Voice Conversion
TLDR
This work proposes to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP), and evaluates both methods on the VCC2020 benchmark while considering different linguistic representations.
Generalization Ability of MOS Prediction Networks
TLDR
It is found that wav2vec2 models fine-tuned for MOS prediction have good generalization capability to out-of-domain data, even in the most challenging case of utterance-level predictions in the zero-shot setting, and that fine-tuning on in-domain data can improve predictions.
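To make the setup concrete, here is a minimal sketch of the kind of model this paper evaluates: a pretrained wav2vec2 encoder with a mean-pooling regression head, fine-tuned end-to-end on MOS labels. The checkpoint name and the single linear head are illustrative assumptions, not the authors' exact configuration.

    # Sketch of a wav2vec2-based MOS predictor: mean-pool the encoder's frame
    # features and regress one utterance-level score. Checkpoint and head are
    # illustrative assumptions, not the paper's exact configuration.
    import torch
    import torch.nn as nn
    from transformers import Wav2Vec2Model

    class Wav2Vec2MOS(nn.Module):
        def __init__(self, checkpoint: str = "facebook/wav2vec2-base"):
            super().__init__()
            self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
            self.head = nn.Linear(self.encoder.config.hidden_size, 1)

        def forward(self, waveform: torch.Tensor) -> torch.Tensor:
            # waveform: (batch, samples) of 16 kHz audio
            feats = self.encoder(waveform).last_hidden_state   # (batch, frames, dim)
            return self.head(feats.mean(dim=1)).squeeze(-1)    # (batch,) MOS estimates

    model = Wav2Vec2MOS()
    mos = model(torch.randn(1, 16000))                      # one second of dummy audio
    loss = nn.functional.l1_loss(mos, torch.tensor([3.5]))  # regress toward a human label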
The NeteaseGames System for Voice Conversion Challenge 2020 with Vector-quantization Variational Autoencoder and WaveNet
TLDR
VQ-VAE-WaveNet is a non-parallel, VAE-based voice conversion system that reconstructs the acoustic features while separating the linguistic information from the speaker identity; it achieves an average score of 3.95 in automatic naturalness prediction and ranked 6th and 8th in ASV-based speaker similarity and spoofing countermeasures, respectively.
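The heart of such a system is the vector-quantization bottleneck, which snaps each encoder frame to its nearest codebook entry so that fine speaker detail is discarded while linguistic content survives. Below is a minimal sketch of that bottleneck; codebook size and dimensions are illustrative, not the NeteaseGames configuration.

    # Minimal vector-quantization bottleneck: snap each encoder frame to its
    # nearest codebook entry, with a straight-through estimator so gradients
    # reach the encoder. Codebook size and dimensions are illustrative.
    import torch
    import torch.nn as nn

    class VectorQuantizer(nn.Module):
        def __init__(self, num_codes: int = 512, dim: int = 64):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)

        def forward(self, z: torch.Tensor) -> torch.Tensor:
            # z: (batch, frames, dim) continuous encoder output
            dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))  # (batch, frames, num_codes)
            codes = dists.argmin(dim=-1)        # discrete "content" indices
            q = self.codebook(codes)            # quantized frames
            # Straight-through: forward uses q, backward copies gradients to z.
            # (Training would also add codebook and commitment losses, omitted here.)
            return z + (q - z).detach()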
The IQIYI System for Voice Conversion Challenge 2020
TLDR
The evaluation results show that this end-to-end, PPG-based voice conversion system achieves good conversion quality; its best result was in the similarity evaluation of Task 2, where it ranked second in the ASV-based objective evaluation and fifth in the subjective evaluation.
CASIA Voice Conversion System for the Voice Conversion Challenge 2020
TLDR
This paper presents the CASIA (Chinese Academy of Sciences, Institute of Automation) voice conversion system for the Voice Conversion Challenge 2020, built by combining initialization on multi-speaker data with adaptation on the limited data of the target speaker.
S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations
TLDR
A series of in-depth analyses shows that S3R-based VC is comparable with the VCC2020 top systems in the A2O setting in terms of similarity, and achieves state-of-the-art performance in S3R-based A2A VC.
StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion
TLDR
Subjective and objective evaluation experiments on a non-parallel many-to-many voice conversion task revealed that the StarGANv2-VC model produces natural-sounding voices, close to the sound quality of state-of-the-art text-to-speech based voice conversion methods, without the need for text labels.
IQDUBBING: Prosody modeling based on discrete self-supervised speech representation for expressive voice conversion
TLDR
IQDubbing tackles expressive voice conversion by leveraging recent advances in discrete self-supervised speech representation (DSSR) to model prosody, and proposes two kinds of prosody filters to sample prosody from the prosody vector.
Building Bilingual and Code-Switched Voice Conversion with Limited Training Data Using Embedding Consistency Loss
TLDR
A parallel non-autoregressive network is described that achieves bilingual and code-switched voice conversion for multiple speakers when only monolingual corpora are available for each language.

References

Showing 1-10 of 34 references
Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion
  • Yi Zhao, Wen-Chin Huang, T. Toda
  • Computer Science
    Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020
  • 2020
TLDR
From the results of crowd-sourced listening tests, it is observed that VC methods have progressed rapidly thanks to advanced deep learning methods, although for the cross-lingual task the overall naturalness and similarity scores were lower than those for the intra-lingual conversion task.
MOSNet: Deep Learning based Objective Assessment for Voice Conversion
TLDR
Results confirm that the proposed deep learning-based assessment models could be used as a computational evaluator to measure the MOS of VC systems to reduce the need for expensive human rating.
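For orientation, a MOSNet-style assessor can be sketched as a CNN over the magnitude spectrogram followed by a BLSTM, with per-frame scores averaged into an utterance-level MOS; the layer sizes below are illustrative rather than the published architecture.

    # Sketch of a MOSNet-style assessor: CNN over the magnitude spectrogram,
    # BLSTM over time, per-frame scores averaged to an utterance-level MOS.
    # Layer sizes are illustrative, not the published architecture.
    import torch
    import torch.nn as nn

    class MOSNetLike(nn.Module):
        def __init__(self, n_bins: int = 257):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            )
            self.blstm = nn.LSTM(16 * n_bins, 128, batch_first=True, bidirectional=True)
            self.frame_score = nn.Linear(2 * 128, 1)

        def forward(self, spec: torch.Tensor) -> torch.Tensor:
            # spec: (batch, frames, n_bins) magnitude spectrogram
            x = self.cnn(spec.unsqueeze(1))       # (batch, 16, frames, n_bins)
            x = x.permute(0, 2, 1, 3).flatten(2)  # (batch, frames, 16 * n_bins)
            x, _ = self.blstm(x)
            frames = self.frame_score(x).squeeze(-1)  # per-frame quality scores
            return frames.mean(dim=1)                 # utterance-level MOS estimate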
The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods
TLDR
A brief summary of the state-of-the-art techniques for VC is presented, followed by a detailed explanation of the challenge tasks and the results that were obtained.
Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis
TLDR
A key finding is the quality achieved for certain speakers seems consistent, regardless of the TTS or VC system, and the method provides an automatic way to identify such speakers.
The Voice Conversion Challenge 2016
TLDR
The design of the challenge, its result, and a future plan to share views about unsolved problems and challenges faced by the current VC techniques are summarized.
AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech
TLDR
It is demonstrated that the AutoMOS model can model human raters' mean opinion scores (MOS) of synthesized speech using a deep recurrent neural network whose inputs consist solely of a raw waveform.
Quality-Net: An End-to-End Non-intrusive Speech Quality Assessment Model based on BLSTM
TLDR
This study proposes a novel end-to-end, non-intrusive speech quality evaluation model, termed Quality-Net, based on bidirectional long short-term memory, which has potential to be used in a wide variety of applications of speech signal processing.
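Quality-Net's distinctive ingredient is frame-level supervision: the loss penalizes not only the utterance-level prediction (the mean of the frame scores) but also each frame score against the same human label, which stabilizes BLSTM training. A minimal sketch of that objective, assuming a model that emits per-frame scores:

    # Quality-Net-style objective: penalize the utterance-level prediction
    # (mean of frame scores) and, with a smaller weight, every frame score,
    # both against the single human label for the utterance. The weight is
    # an illustrative hyperparameter.
    import torch

    def quality_net_loss(frame_scores: torch.Tensor, label: torch.Tensor,
                         frame_weight: float = 0.5) -> torch.Tensor:
        # frame_scores: (batch, frames); label: (batch,)
        utt_loss = ((frame_scores.mean(dim=1) - label) ** 2).mean()
        frm_loss = ((frame_scores - label.unsqueeze(1)) ** 2).mean()
        return utt_loss + frame_weight * frm_loss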
Non-intrusive Speech Quality Assessment Using Neural Networks
TLDR
This work presents an investigation of the applicability of neural networks for non-intrusive audio quality assessment, and proposes three neural network-based approaches for mean opinion score (MOS) estimation.
Perceptual Objective Listening Quality Assessment (POLQA), The Third Generation ITU-T Standard for End-to-End Speech Quality Measurement Part I-Temporal Alignment
TLDR
The authors present the Perceptual Objective Listening Quality Assessment (POLQA), the third-generation speech quality measurement algorithm, which provides a new measurement standard for predicting Mean Opinion Scores that outperforms the older PESQ standard.
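POLQA is an intrusive measure, so the degraded signal must first be time-aligned with the clean reference before quality is scored. As a toy illustration of the basic idea (POLQA's actual alignment is multi-stage and handles time-varying delay), a single global delay can be estimated by cross-correlation:

    # Toy global-delay estimate between reference and degraded signals via
    # cross-correlation. POLQA's real temporal alignment is multi-stage and
    # tracks time-varying delay; this only illustrates the starting idea.
    import numpy as np
    from scipy.signal import correlate

    def estimate_delay(reference: np.ndarray, degraded: np.ndarray) -> int:
        corr = correlate(degraded, reference, mode="full")
        return int(np.argmax(corr)) - (len(reference) - 1)  # lag in samples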
Deep Neural Network Embeddings for Text-Independent Speaker Verification
TLDR
It is found that the embeddings outperform i-vectors for short speech segments and are competitive on long duration test conditions, which are the best results reported for speaker-discriminative neural networks when trained and tested on publicly available corpora.
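Schematically, an x-vector-style extractor is a stack of TDNN layers (dilated 1-D convolutions over frame features) followed by statistics pooling (mean and standard deviation over time) and a bottleneck layer that yields the fixed-size embedding. The sketch below follows that shape with illustrative sizes, not the published recipe.

    # Sketch of an x-vector-style extractor: TDNN layers (dilated 1-D convs)
    # over frame features, statistics pooling (mean + std over time), then a
    # linear bottleneck that yields the fixed-size speaker embedding.
    # Sizes are illustrative, not the published recipe.
    import torch
    import torch.nn as nn

    class XVectorLike(nn.Module):
        def __init__(self, feat_dim: int = 30, emb_dim: int = 512):
            super().__init__()
            self.tdnn = nn.Sequential(
                nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
                nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
                nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            )
            self.embedding = nn.Linear(2 * 512, emb_dim)  # mean + std pooled

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            # feats: (batch, frames, feat_dim), e.g. MFCC sequences
            x = self.tdnn(feats.transpose(1, 2))                 # (batch, 512, frames')
            stats = torch.cat([x.mean(dim=2), x.std(dim=2)], 1)  # statistics pooling
            return self.embedding(stats)                         # speaker embedding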