The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS

@article{Huang2020TheSB,
  title={The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS},
  author={Wen-Chin Huang and Tomoki Hayashi and Shinji Watanabe and Tomoki Toda},
  journal={ArXiv},
  year={2020},
  volume={abs/2010.02434}
}
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020. We consider a naive approach for voice conversion (VC), which is to first transcribe the input speech with an automatic speech recognition (ASR) model, followed by using the transcriptions to generate the voice of the target with a text-to-speech (TTS) model. We revisit this method under a seq2seq framework by utilizing ESPnet, an open-source end-to-end…
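To make the cascade concrete, below is a minimal sketch of how an ASR-to-TTS pipeline of this kind could be wired up with ESPnet2's inference interfaces. The model tags are placeholders and the exact call signatures may vary across ESPnet releases, so treat this as an illustration of the idea rather than the actual challenge baseline recipe.

```python
# Minimal sketch of a cascaded ASR -> TTS voice conversion pipeline (assumption:
# ESPnet2-style inference APIs; the model tags below are hypothetical placeholders).
import soundfile as sf
from espnet2.bin.asr_inference import Speech2Text
from espnet2.bin.tts_inference import Text2Speech

# 1. ASR: transcribe the source speaker's utterance into text.
speech2text = Speech2Text.from_pretrained("your/asr-model-tag")  # placeholder tag
speech, rate = sf.read("source_utterance.wav")
text, *_ = speech2text(speech)[0]  # keep the 1-best hypothesis

# 2. TTS: synthesize the transcription with a model built for the target speaker.
text2speech = Text2Speech.from_pretrained("your/target-tts-tag")  # placeholder tag
converted = text2speech(text)["wav"]  # waveform tensor (vocoder assumed bundled)

# 3. Save the converted utterance at the TTS model's sampling rate.
sf.write("converted.wav", converted.cpu().numpy(), text2speech.fs)
```

Note that the speaker identity of the output comes entirely from the TTS model, so the quality of the conversion depends on how that model was trained or adapted to the target speaker.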

Citations

The NU Voice Conversion System for the Voice Conversion Challenge 2020: On the Effectiveness of Sequence-to-sequence Models and Autoregressive Neural Vocoders
TLDR
Compared with the baseline systems, it is confirmed that seq2seq modeling can improve the conversion similarity and that the use of AR vocoders can improve the naturalness of the converted speech.
On Prosody Modeling for ASR+TTS Based Voice Conversion
TLDR
This work proposes to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP), and evaluates both methods on the VCC2020 benchmark and consider different linguistic representations.
Transfer Learning from Monolingual ASR to Transcription-free Cross-lingual Voice Conversion
TLDR
This paper focuses on knowledge transfer from monolingual ASR to cross-lingual VC, in order to address the content mismatch problem, and proposes a speaker-dependent conversion model that significantly reduces the MOS drop between intra- and cross-lingual conversion.
The NeteaseGames System for Voice Conversion Challenge 2020 with Vector-quantization Variational Autoencoder and WaveNet
TLDR
VQ-VAE-WaveNet is a nonparallel VAE-based voice conversion system that reconstructs the acoustic features while separating the linguistic information from the speaker identity; it achieves an average score of 3.95 in naturalness in automatic naturalness prediction and ranked 6th and 8th, respectively, in ASV-based speaker similarity and spoofing countermeasures.
ESPnet-SLU: Advancing Spoken Language Understanding through ESPnet
TLDR
This work enhances the toolkit to provide implementations for various SLU benchmarks that enable researchers to seamlessly mix-and-match different ASR and NLU models, and provides pretrained models with intensively tuned hyper-parameters that can match or even outperform the current state-of-the-art performances.
S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations
TLDR
S3R is comparable with the VCC2020 top systems in the A2O setting in terms of similarity and achieves state-of-the-art performance in S3R-based A2A VC, according to a series of in-depth analyses.
Latent linguistic embedding for cross-lingual text-to-speech and voice conversion
TLDR
This work shows that a well-trained English latent linguistic embedding can be seamlessly used for cross-lingual TTS without having to perform any extra steps, and the subjective evaluations of perceived naturalness seemed to vary between target speakers, which is one aspect for future improvement.
Baseline System of Voice Conversion Challenge 2020 with Cyclic Variational Autoencoder and Parallel WaveGAN
TLDR
The baseline system of the Voice Conversion Challenge (VCC) 2020 with a cyclic variational autoencoder and Parallel WaveGAN, i.e., CycleVAEPWG, is presented, showing a roughly average score for naturalness and an above-average score for speaker similarity.
The UFRJ Entry for the Voice Conversion Challenge 2020
TLDR
The system submitted to Task 1 of the 2020 edition of the Voice Conversion Challenge (VCC), based on CycleGAN to convert mel-spectrograms and MelGAN to synthesize the converted speech, suggests that the use of neural vocoders to represent converted speech is a problem that demands specific training strategies and the use of adaptation techniques.
The 2020 ESPnet Update: New Features, Broadened Applications, Performance Improvements, and Future Plans
TLDR
The recent development of ESPnet is described, an end-to-end speech processing toolkit that includes text-to-speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE) with support for beamforming, speech separation, denoising, and dereverberation.

References

SHOWING 1-10 OF 37 REFERENCES
ATTS2S-VC: Sequence-to-sequence Voice Conversion with Attention and Context Preservation Mechanisms
TLDR
The proposed VC framework can be trained in only one day, using only one GPU of an NVIDIA Tesla K80, while the quality of the synthesized speech is higher than that of speech converted by Gaussian mixture model-based VC and is comparable to that of speech generated by recurrent neural network-based text-to-speech synthesis.
Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining
TLDR
Experimental results show that a simple yet effective pretraining technique to transfer knowledge from learned TTS models, which benefit from large-scale, easily accessible TTS corpora, can facilitate data-efficient training and outperform an RNN-based seq2seq VC model in terms of intelligibility, naturalness, and similarity.
Espnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit
TLDR
The experimental results show that the ESPnet-TTS models can achieve state-of-the-art performance comparable to the other latest toolkits, resulting in a mean opinion score (MOS) of 4.25 on the LJSpeech dataset.
Non-Parallel Sequence-to-Sequence Voice Conversion With Disentangled Linguistic and Speaker Representations
TLDR
In this method, disentangled linguistic and speaker representations are extracted from acoustic features, and voice conversion is achieved by preserving the linguistic representations of source utterances while replacing the speaker representations with the target ones in sequence-to-sequence (seq2seq) voice conversion.
Semi-Supervised Speaker Adaptation for End-to-End Speech Synthesis with Pretrained Models
TLDR
The proposed method can greatly simplify a speaker adaptation pipeline by consistently employing end-to-end ASR/TTS ecosystems and achieved comparable performance to a paired data adaptation method in terms of subjective speaker similarity and objective cepstral distance measures.
Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion
  • Yi Zhao, Wen-Chin Huang, T. Toda
  • Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 2020
TLDR
From the results of crowd-sourced listening tests, it is observed that VC methods have progressed rapidly thanks to advanced deep learning methods, and that the overall naturalness and similarity scores for the cross-lingual conversion task were lower than those for the intra-lingual conversion task.
Sequence-to-Sequence Acoustic Modeling for Voice Conversion
TLDR
Experimental results show that the proposed neural network named sequence-to-sequence ConvErsion NeTwork (SCENT) obtained better objective and subjective performance than the baseline methods using Gaussian mixture models and deep neural networks as acoustic models.
Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition
  • Linhao Dong, Shuang Xu, Bo Xu
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018
TLDR
The Speech-Transformer is presented: a no-recurrence sequence-to-sequence model that relies entirely on attention mechanisms to learn positional dependencies and can be trained faster and more efficiently, together with a 2D-Attention mechanism that jointly attends to the time and frequency axes of the 2-dimensional speech inputs, providing more expressive representations for the Speech-Transformer.
Phonetic posteriorgrams for many-to-one voice conversion without parallel data training
This paper proposes a novel approach to voice conversion with non-parallel training data. The idea is to bridge between speakers by means of Phonetic PosteriorGrams (PPGs) obtained from a speaker-independent automatic speech recognition (SI-ASR) system.
Building a mixed-lingual neural TTS system with only monolingual data
TLDR
This work looks at the problem that arises in the encoder-decoder framework when only monolingual data from a target speaker is available, from two aspects: speaker consistency within an utterance and naturalness.