• Corpus ID: 239024889

Speech Enhancement-assisted Stargan Voice Conversion in Noisy Environments

@article{chan2021speech,
  title={Speech Enhancement-assisted Stargan Voice Conversion in Noisy Environments},
  author={Yun-Ju Chan and Chiang-Jen Peng and Syu-Siang Wang and Hsin-Min Wang and Yu Tsao and Taishih Chi},
  journal={ArXiv},
  year={2021}
}
  • Yun-Ju Chan, Chiang-Jen Peng, Syu-Siang Wang, Hsin-Min Wang, Yu Tsao, Taishih Chi
  • Published 19 October 2021
  • Computer Science, Engineering
  • ArXiv
Numerous voice conversion (VC) techniques have been proposed for converting voices among different speakers. Although converted speech of decent quality can be obtained when VC is applied in a clean environment, quality drops sharply when the system runs under noisy conditions. To address this issue, we propose a novel enhancement-based StarGAN (E-StarGAN) VC system, which leverages a speech enhancement (SE) technique for signal pre-processing. SE systems are… 
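The cascaded design described in the abstract (an SE front-end followed by StarGAN-based VC) can be sketched as below. This is a minimal illustration, not the paper's actual models: the spectral-subtraction enhancer and the identity "converter" are stand-ins for the trained SE and StarGAN modules.

```python
import numpy as np

def enhance(noisy, noise_ref, alpha=1.0):
    """Toy spectral-subtraction SE front-end (stand-in for a trained SE model)."""
    spec = np.fft.rfft(noisy)
    noise_mag = np.abs(np.fft.rfft(noise_ref))
    # Subtract the noise magnitude per frequency bin, keep the noisy phase.
    mag = np.maximum(np.abs(spec) - alpha * noise_mag, 0.0)
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=len(noisy))

def stargan_vc(speech, target_speaker):
    """Placeholder for the StarGAN VC module; identity mapping here."""
    return speech

def e_stargan_vc(noisy, noise_ref, target_speaker):
    """SE pre-processing followed by VC, mirroring the cascaded pipeline."""
    return stargan_vc(enhance(noisy, noise_ref), target_speaker)

# Demo on a synthetic noisy tone.
rng = np.random.default_rng(0)
t = np.arange(1024) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)
noise = 0.3 * rng.standard_normal(t.shape)
converted = e_stargan_vc(clean + noise, noise, target_speaker="spk1")
```

Because the VC placeholder is an identity map, the demo output is simply the enhanced signal; in the actual system this stage would change speaker identity as well.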

Figures and Tables from this paper


Noisy-to-Noisy Voice Conversion Framework with Denoising Model
To explore VC with the flexibility of handling background sounds, a noisy-to-noisy VC framework composed of a denoising module and a VC module is proposed; it converts the speaker’s identity while preserving the background sounds.
Noise-Robust Voice Conversion Using High-Quefrency Boosting via Sub-Band Cepstrum Conversion and Fusion
The experimental results showed that the proposed method significantly improved the naturalness and similarity of the converted voice compared to the baselines, even with the noisy inputs of source speakers.
Exemplar-based voice conversion in noisy environment
A voice conversion technique for noisy environments in which parallel exemplars are introduced to encode the source speech signal and synthesize the target speech signal; its effectiveness is confirmed by comparison with a conventional Gaussian Mixture Model (GMM)-based method.
VoiceID Loss: Speech Enhancement for Speaker Verification
The proposed VoiceID loss is a novel loss function for training a speech enhancement model to improve the robustness of speaker verification; it consistently improves the speaker verification system under both clean and noisy conditions.
A Regression Approach to Speech Enhancement Based on Deep Neural Networks
The proposed DNN approach can well suppress highly nonstationary noise, which is tough to handle in general, and is effective in dealing with noisy speech data recorded in real-world scenarios without generating the annoying musical artifacts commonly observed in conventional enhancement methods.
A Deep Denoising Autoencoder Approach to Improving the Intelligibility of Vocoded Speech in Cochlear Implant Simulation
DDAE-based noise reduction (NR) was confirmed to yield higher intelligibility scores than conventional NR approaches and could potentially be integrated into a cochlear implant (CI) processor to provide more benefits to CI users under noisy conditions.
Speech enhancement based on deep denoising autoencoder
Experimental results show that adding depth to the DAE consistently increases performance when a large training data set is given; compared with a minimum mean square error based speech enhancement algorithm, the proposed denoising DAE provided superior performance on the three objective evaluations.
SEGAN: Speech Enhancement Generative Adversarial Network
This work proposes the use of generative adversarial networks for speech enhancement, operating at the waveform level and training the model end-to-end; 28 speakers and 40 different noise conditions are incorporated into the same model, such that model parameters are shared across them.
Continuous probabilistic transform for voice conversion
A new methodology is designed for representing the relationship between two sets of spectral envelopes; the proposed transform greatly improves the quality and naturalness of the converted speech signals compared with previously proposed conversion methods.
Automatic Speaker Recognition System in Adverse Conditions — Implication of Noise and Reverberation on System Performance
Speaker recognition has been developed and evolved over the past few decades into a supposedly mature technique. Existing methods typically utilize robust features extracted from clean speech. …
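Several of the entries above build on deep denoising autoencoders for speech enhancement. As a deliberately tiny illustration of that idea, the NumPy sketch below trains a one-hidden-layer denoising autoencoder to map noisy feature frames back to clean ones; the synthetic data, layer sizes, and training schedule are illustrative assumptions, not any paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature frames standing in for clean speech spectra, plus noisy copies.
clean = rng.standard_normal((256, 16))
noisy = clean + 0.3 * rng.standard_normal(clean.shape)

# One-hidden-layer denoising autoencoder: noisy frame in, clean frame out.
d, h, lr = clean.shape[1], 32, 0.1
W1 = 0.1 * rng.standard_normal((d, h)); b1 = np.zeros(h)
W2 = 0.1 * rng.standard_normal((h, d)); b2 = np.zeros(d)

def forward(x):
    z = np.tanh(x @ W1 + b1)   # encoder
    return z, z @ W2 + b2      # linear decoder

for _ in range(3000):          # plain gradient descent on the MSE loss
    z, out = forward(noisy)
    err = (out - clean) / len(noisy)
    gW2, gb2 = z.T @ err, err.sum(0)
    dz = (err @ W2.T) * (1.0 - z ** 2)   # backprop through tanh
    gW1, gb1 = noisy.T @ dz, dz.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

mse_noisy = np.mean((noisy - clean) ** 2)        # leave-input-alone baseline
mse_denoised = np.mean((forward(noisy)[1] - clean) ** 2)
```

A deeper stack of such layers (the "adding depth" point in the DAE entry above) follows the same pattern with more encoder/decoder pairs; real systems train on log-magnitude spectra rather than Gaussian toy frames.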