S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations

@inproceedings{Huang2022S3PRLVCOV,
  title={S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations},
  author={Wen-Chin Huang and Shu-wen Yang and Tomoki Hayashi and Hung-yi Lee and Shinji Watanabe and Tomoki Toda},
  booktitle={ICASSP},
  year={2022}
}
This paper introduces S3PRL-VC, an open-source voice conversion (VC) framework based on the S3PRL toolkit. In the context of recognition-synthesis VC, self-supervised speech representation (S3R) is valuable in its potential to replace the expensive supervised representation adopted by state-of-the-art VC systems. Moreover, we claim that VC is a good probing task for S3R analysis. In this work, we provide a series of in-depth analyses by benchmarking on the two tasks in VCC2020, namely intra… 

A Comparative Study of Self-supervised Speech Representation Based Voice Conversion
TLDR
A large-scale comparative study of self-supervised speech representation (S3R)-based voice conversion (VC), which demonstrates the competitiveness of S3R-based VC and also sheds light on possible directions for improvement.
SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities
TLDR
This paper introduces SUPERB-SG, a new benchmark focusing on evaluating the semantic and generative capabilities of pre-trained models by increasing task diversity and difficulty over SUPERB, and uses a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain and quality across different types of tasks.
PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition
TLDR
This work proposes Prune-Adjust-Re-Prune (PARP), which discovers and finetunes subnetworks for much better ASR performance, while only requiring a single downstream finetuning run, and demonstrates the computational advantage and performance gain of PARP over baseline pruning methods.
Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices
TLDR
Several voice conversion models using self-supervised speech representations, including Wav2Vec2.0, HuBERT, and UniSpeech, are trained to be used as a method for anonymizing voices for discriminating between healthy and pathological speech.

References

S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations
TLDR
S2VC is proposed, which uses self-supervised features as both source and target features for the VC model; the supervised phoneme posteriorgram (PPG), which is believed to be speaker-independent and is widely used in VC to extract content information, is chosen as a strong baseline for the SSL features.
On Prosody Modeling for ASR+TTS Based Voice Conversion
TLDR
This work proposes to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP), evaluates both methods on the VCC2020 benchmark, and considers different linguistic representations.
The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS
TLDR
This paper revisits a naive approach for voice conversion by utilizing ESPnet, an open-source end-to-end speech processing toolkit, and the many well-configured pretrained models provided by the community, demonstrating the promising ability of seq2seq models to convert speaker identity.
Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion
  • Yi Zhao, Wen-Chin Huang, T. Toda
  • Computer Science
    Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020
  • 2020
TLDR
From the results of crowd-sourced listening tests, it is observed that VC methods have progressed rapidly thanks to advanced deep learning methods, and that for the cross-lingual conversion task the overall naturalness and similarity scores were lower than those for the intra-lingual conversion task.
Submission from SRCB for Voice Conversion Challenge 2020
TLDR
This work focuses on building a voice conversion system that achieves consistent improvements in accent and intelligibility evaluations; it extracts general phonation from the source speakers' speech in different languages, and improves the sound quality by optimizing the speech synthesis module and adding a noise suppression post-processing module to the vocoder.
Exploring wav2vec 2.0 on speaker verification and language identification
TLDR
This work uses preliminary experiments to indicate that wav2vec 2.0 can capture information about the speaker and language, and uses a single model to handle both tasks in a unified way through multi-task learning.
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
TLDR
To generate disentangled representation, low-bitrate representations are extracted for speech content, prosodic information, and speaker identity to synthesize speech in a controllable manner using self-supervised discrete representations.
FragmentVC: Any-To-Any Voice Conversion by End-To-End Extracting and Fusing Fine-Grained Voice Fragments with Attention
TLDR
Objective evaluation based on speaker verification and subjective evaluation with MOS both showed that this approach outperformed SOTA approaches, such as AdaIN-VC and AutoVC, and is accomplished end-to-end.
CASIA Voice Conversion System for the Voice Conversion Challenge 2020
TLDR
This paper presents the CASIA (Chinese Academy of Sciences, Institute of Automation) voice conversion system for the Voice Conversion Challenge 2020, and builds the system by combining initialization using multi-speaker data with adaptation using limited data of the target speaker.
Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer
TLDR
An ASR-TTS method for voice conversion, which uses the iFLYTEK ASR engine to transcribe the source speech into text and a Transformer TTS model with a WaveNet vocoder to synthesize the converted speech from the decoded text.