• Corpus ID: 237592994

Noisy-to-Noisy Voice Conversion Framework with Denoising Model

@article{Xie2021NoisyToNoisyVC,
  title={Noisy-to-Noisy Voice Conversion Framework with Denoising Model},
  author={Chao Xie and Yi-Chiao Wu and Patrick Lumban Tobing and Wen-Chin Huang and Tomoki Toda},
  journal={2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)},
  year={2021}
}

  • Chao Xie, Yi-Chiao Wu, Patrick Lumban Tobing, Wen-Chin Huang, Tomoki Toda
  • Published 22 September 2021
  • Computer Science
In a conventional voice conversion (VC) framework, a VC model is often trained on a clean dataset of speech carefully recorded and selected to minimize background interference. However, collecting such a high-quality dataset is expensive and time-consuming, and leveraging crowd-sourced speech data in training is more economical. Moreover, in some real-world VC scenarios, such as VC in video and VC-based data augmentation for speech recognition systems, the background sounds… 
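The framework's flow, as commonly described for this line of work, can be sketched as a three-step pipeline: denoise the noisy input, convert the denoised speech, and add the separated noise back to the converted waveform. The following is a toy illustration only; `denoise` and `convert` are hypothetical stand-ins, not the paper's actual models:

```python
# Toy sketch of a noisy-to-noisy VC pipeline (placeholder models).
# Signals are plain lists of floats standing in for waveform samples.

def denoise(noisy):
    # Hypothetical denoising model: a crude stand-in that attenuates
    # every sample, mimicking suppression of background interference.
    return [0.8 * x for x in noisy]

def convert(clean):
    # Hypothetical VC model: a stand-in that rescales samples,
    # mimicking a mapping from the source to the target speaker.
    return [2.0 * x for x in clean]

def noisy_to_noisy_vc(noisy):
    clean = denoise(noisy)                           # 1) suppress the noise
    noise = [n - c for n, c in zip(noisy, clean)]    # 2) keep the separated noise
    converted = convert(clean)                       # 3) convert the denoised speech
    return [v + e for v, e in zip(converted, noise)] # 4) restore the background

out = noisy_to_noisy_vc([1.0, -2.0, 0.5])
```

The key property the sketch preserves is that the background component bypasses the VC model entirely, so conversion quality does not depend on modeling the noise.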


Direct Noisy Speech Modeling for Noisy-To-Noisy Voice Conversion

An improved VC module is proposed that directly models the noisy speech waveform while controlling the background sounds; it achieves an acceptable naturalness score and similarity comparable to the framework's upper bound.

An Evaluation of Three-Stage Voice Conversion Framework for Noisy and Reverberant Conditions

The experimental results show that the proposed method alleviates the adverse effects caused by both noise and reverberation and significantly outperforms the baseline trained directly on the noisy-reverberant speech data, although degradation introduced by the denoising and dereverberation still causes noticeable adverse effects on VC performance.

Speech Enhancement-assisted StarGAN Voice Conversion in Noisy Environments

The results of VC experiments conducted on a Mandarin dataset show that when combined with SE, the proposed EStarGAN VC model is robust to unseen noises and can improve the sound quality of speech signals converted from noise-corrupted source utterances.

Preserving background sound in noise-robust voice conversion via multi-task learning

Experimental results demonstrate that the proposed end-to-end framework via multi-task learning outperforms the baseline systems while achieving comparable quality and speaker similarity to the VC models trained with clean data.



Noise-Robust Voice Conversion Using High-Quefrency Boosting via Sub-Band Cepstrum Conversion and Fusion

The experimental results showed that the proposed method significantly improved the naturalness and similarity of the converted voice compared to the baselines, even with the noisy inputs of source speakers.

Exemplar-based voice conversion in noisy environment

A voice conversion technique for noisy environments in which parallel exemplars are introduced to encode the source speech signal and synthesize the target speech signal; its effectiveness is confirmed by comparison with a conventional Gaussian Mixture Model (GMM)-based method.

Noise-robust voice conversion based on spectral mapping on sparse space

A framework to train the basis matrices of source and target exemplars so that they have a common weight matrix is proposed, which allows the VC to be performed with lower computation times than with the exemplar-based method.

Non-Parallel Voice Conversion with Autoregressive Conversion Model and Duration Adjustment

The released official subjective results show that the proposed N10 system obtains the best performance in converted-speech naturalness and comparable performance to the best system in speaker similarity, indicating that this method can achieve state-of-the-art cross-lingual voice conversion performance.

LibriMix: An Open-Source Dataset for Generalizable Speech Separation

The experiments show that the generalization error is smaller for models trained with LibriMix than with WHAM!, in both clean and noisy conditions. A third test set, based on VCTK for speech and WHAM! for noise, is also introduced.

Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation

  • Yi Luo, N. Mesgarani
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.

ICASSP 2021 Deep Noise Suppression Challenge

A DNS challenge special session at INTERSPEECH 2020 was organized, where open-sourced training and test datasets were released and a subjective evaluation framework was used to evaluate and select the final winners.

Voice Conversion Based Data Augmentation to Improve Children's Speech Recognition in Limited Data Scenario

A significantly improved recognition rate for children's speech is noted due to VC-based data augmentation, and time-scale modification of children's speech test data is reported to be needed to deal with speaking-rate differences.

The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods

A brief summary of the state-of-the-art techniques for VC is presented, followed by a detailed explanation of the challenge tasks and the results that were obtained.

A short-time objective intelligibility measure for time-frequency weighted noisy speech

An objective intelligibility measure is presented which shows high correlation (rho=0.95) with the intelligibility of both noisy and TF-weighted noisy speech, and shows significantly better performance than three other, more sophisticated, objective measures.
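Measures in this family compare short-time representations of the clean reference and the degraded signal via correlation. A heavily simplified time-domain sketch (frame-wise Pearson correlation averaged over frames; this is not the actual STOI algorithm, which operates on one-third-octave band envelopes) might look like:

```python
import math

def frame_correlation(x, y, frame_len=4):
    """Average per-frame Pearson correlation between two equal-length
    signals -- a toy stand-in for a short-time intelligibility measure."""
    scores = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        fx = x[start:start + frame_len]
        fy = y[start:start + frame_len]
        mx = sum(fx) / frame_len
        my = sum(fy) / frame_len
        num = sum((a - mx) * (b - my) for a, b in zip(fx, fy))
        den = math.sqrt(sum((a - mx) ** 2 for a in fx) *
                        sum((b - my) ** 2 for b in fy))
        if den > 0:  # skip silent (constant) frames
            scores.append(num / den)
    return sum(scores) / len(scores) if scores else 0.0

clean = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
score = frame_correlation(clean, clean)  # identical signals score 1.0
```

Because correlation is computed per frame and then averaged, a brief severely degraded segment lowers the score even when the rest of the signal matches well, which is the intuition behind short-time (rather than global) comparison.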