Speech Enhancement-assisted Voice Conversion in Noisy Environments

  • Yun-Ju Chan, Chiang-Jen Peng, Syu-Siang Wang, Hsin-Min Wang, Yu Tsao, Taishih Chi
  • Published 19 October 2021
  • Computer Science
  • 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Numerous voice conversion (VC) techniques have been proposed for converting voices between speakers. Although the converted speech is of good quality when VC is applied in a clean environment, the quality degrades drastically when the system runs in noisy conditions. To address this issue, we propose a novel speech enhancement (SE)-assisted VC system that uses SE techniques for signal pre-processing, where the VC and SE components are optimized in…

Noisy-to-Noisy Voice Conversion Framework with Denoising Model

This paper proposes a noisy-to-noisy (N2N) VC framework composed of a denoising module and a VC module that can convert the speaker's identity while preserving the background sounds.

Noise-Robust Voice Conversion Using High-Quefrency Boosting via Sub-Band Cepstrum Conversion and Fusion

The experimental results showed that the proposed method significantly improved the naturalness and similarity of the converted voice compared to the baselines, even with the noisy inputs of source speakers.

A Deep Denoising Autoencoder Approach to Improving the Intelligibility of Vocoded Speech in Cochlear Implant Simulation

DDAE-based NR could potentially be integrated into a CI processor to provide more benefits to CI users under noisy conditions, and was confirmed to yield higher intelligibility scores than conventional NR approaches.

Automatic Speaker Recognition System in Adverse Conditions — Implication of Noise and Reverberation on System Performance

Speaker recognition has been developed and evolved over the past few decades into a supposedly mature technique. Existing methods typically utilize robust features extracted from clean speech. In…

A Regression Approach to Speech Enhancement Based on Deep Neural Networks

The proposed DNN approach can effectively suppress highly nonstationary noise, which is generally difficult to handle, and is effective on noisy speech data recorded in real-world scenarios, without generating the annoying musical artifacts commonly observed in conventional enhancement methods.

Speech enhancement based on deep denoising autoencoder

Experimental results show that increasing the depth of the DAE consistently improves performance when a large training data set is given, and that, compared with a minimum mean square error based speech enhancement algorithm, the proposed DAE provides superior performance on the three objective evaluations.

Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory

Experimental results indicate that the performance of VC can be dramatically improved by the proposed method in view of both speech quality and conversion accuracy for speaker individuality.

Many-to-one voice conversion using exemplar-based sparse representation

This paper proposes a many-to-one VC method in an exemplar-based framework that does not need training data from the source speaker. Its effectiveness has been confirmed by comparison with a conventional one-to-one NMF-based method and a one-to-one GMM-based method.

Speech enhancement using Long Short-Term Memory based recurrent Neural Networks for noise robust Speaker Verification

Simulation experiments show that a male-speaker, text-independent DRNN-based SE front-end, without specific a priori knowledge of the noise type, outperforms a text-, noise-type-, and speaker-dependent NMF-based front-end as well as an STSA-MMSE-based front-end in terms of equal error rate over a large range of noise types and signal-to-noise ratios on the RSR2015 speech corpus.

Spectral Mapping Using Artificial Neural Networks for Voice Conversion

A voice conversion approach that uses an ANN model to capture speaker-specific characteristics of a target speaker is proposed, and it is demonstrated that this approach can perform monolingual as well as cross-lingual voice conversion for an arbitrary source speaker.