Corpus ID: 246431153

The HCCL-DKU system for fake audio generation task of the 2022 ICASSP ADD Challenge

Ziyi Chen, Hua Hua, Yuxiang Zhang, Ming Li, Pengyuan Zhang
The voice conversion task is to modify the speaker identity of continuous speech while preserving the linguistic content. Naturalness and similarity are the two main metrics for evaluating conversion quality, which has improved significantly in recent years. This paper presents the HCCL-DKU entry for the fake audio generation task of the 2022 ICASSP ADD Challenge. We propose a novel voice conversion model based on phonetic posteriorgrams (PPGs) that adopts a fully end-to-end structure. Experimental…



AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines

A large-scale, high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker text-to-speech systems is presented, along with a robust synthesis model that is able to achieve zero-shot voice cloning.

VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture

To further improve audio quality, the U-Net architecture is used within an auto-encoder-based VC system and the VQ-based method, which quantizes the latent vectors, can serve the purpose.

The Effect of Silence and Dual-Band Fusion in Anti-Spoofing System

A dual-band fusion anti-spoofing algorithm is proposed, which requires only two sub-systems but outperforms all but one primary system submitted to the logical access condition of the ASVspoof 2019 challenge.

An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning

This article provides a comprehensive overview of the state-of-the-art of voice conversion techniques and their performance evaluation methods from the statistical approaches to deep learning, and discusses their promise and limitations.

WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

WenetSpeech is currently the largest open-source Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition, and a novel end-to-end label error detection approach is proposed.

FastSpeech: Fast, Robust and Controllable Text to Speech

A novel Transformer-based feed-forward network, called FastSpeech, is proposed to generate mel-spectrograms in parallel for TTS; it speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.

STC Antispoofing Systems for the ASVspoof2021 Challenge

The applied augmentation techniques significantly increased the quality and robustness of the proposed spoofing detection systems in all tracks, and this paper mainly focuses on several approaches that were used to improve the generalization ability of the systems.

WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit

An open source speech recognition toolkit called WeNet is proposed, in which a new two-pass approach named U2 is implemented to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model.

Zero-Shot Voice Style Transfer with Only Autoencoder Loss

A new style transfer scheme that involves only an autoencoder with a carefully designed bottleneck is proposed, which achieves state-of-the-art results in many-to-many voice conversion with non-parallel data and is the first to perform zero-shot voice conversion.

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps