Synthetic-to-Natural Speech Waveform Conversion Using Cycle-Consistent Adversarial Networks

@article{Tanaka2018SynthetictoNaturalSW,
  title={Synthetic-to-Natural Speech Waveform Conversion Using Cycle-Consistent Adversarial Networks},
  author={Kou Tanaka and Takuhiro Kaneko and Nobukatsu Hojo and H. Kameoka},
  journal={2018 IEEE Spoken Language Technology Workshop (SLT)},
  year={2018},
  pages={632--639}
}
We propose a learning-based filter that allows us to directly modify a synthetic speech waveform into a natural speech waveform. Speech-processing systems using a vocoder framework, such as statistical parametric speech synthesis and voice conversion, are convenient, especially when the amount of training data is limited, because interpretable acoustic features, such as the fundamental frequency (F0) and mel-cepstrum, can be represented and processed over a compact space. However, a well-known problem that…
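The filter described above is trained with cycle-consistent adversarial networks, whose core idea is a cycle-consistency loss: mapping a synthetic waveform to the natural domain and back should reproduce the input. A minimal sketch of that loss follows; the generators `G` and `F` are stand-in placeholder functions (in the paper they are neural networks), and the toy waveforms are made-up values for illustration only.

```python
# Toy placeholder "generators": the paper uses neural networks mapping
# synthetic speech waveforms to natural ones (G) and back (F). Here they
# are simple invertible scalar maps, purely for illustration.
def G(x):  # synthetic -> natural (hypothetical placeholder)
    return [2.0 * v for v in x]

def F(y):  # natural -> synthetic (hypothetical placeholder)
    return [0.5 * v for v in y]

def l1(a, b):
    """Mean absolute error between two equal-length waveforms."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def cycle_consistency_loss(x, y):
    """CycleGAN-style cycle loss: ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1."""
    return l1(F(G(x)), x) + l1(G(F(y)), y)

x = [0.1, -0.3, 0.5]   # toy "synthetic" waveform samples
y = [0.2, 0.0, -0.4]   # toy "natural" waveform samples
print(cycle_consistency_loss(x, y))  # 0.0 here, since F exactly inverts G
```

In the full model this term is combined with adversarial losses on each domain; the cycle term is what removes the need for time-aligned parallel synthetic/natural pairs.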

Waveform Generation for Text-to-speech Synthesis Using Pitch-synchronous Multi-scale Generative Adversarial Networks

Listening test results show that while direct waveform generation with GAN is still far behind WaveNet, a GAN-based glottal excitation model can achieve quality and voice similarity on par with a WaveNet vocoder.

A Parallel-Data-Free Speech Enhancement Method Using Multi-Objective Learning Cycle-Consistent Generative Adversarial Network

  • Yang Xiang, C. Bao
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2020
A novel parallel-data-free speech enhancement method in which a cycle-consistent generative adversarial network (CycleGAN) and multi-objective learning are employed, which effectively improves speech quality and intelligibility when the networks are trained on non-parallel data.

High Fidelity Speech Synthesis with Adversarial Networks

GAN-TTS is capable of generating high-fidelity speech with naturalness comparable to the state-of-the-art models, and unlike autoregressive models, it is highly parallelisable thanks to an efficient feed-forward generator.

A Cyclical Post-filtering Approach to Mismatch Refinement of Neural Vocoder for Text-to-speech Systems

A cyclic voice conversion (VC) model is adopted to generate temporally matched pseudo-VC data for training and acoustically matched enhanced data for testing the neural vocoders for basic TTS systems, and both objective and subjective experimental results confirm the effectiveness of the proposed framework.

Nonparallel Voice Conversion With Augmented Classifier Star Generative Adversarial Networks

Three formulations of StarGAN are described, including a newly introduced novel StarGAN variant called “Augmented classifier StarGAN (A-StarGAN)”, and they are compared in a nonparallel VC task and compared with several baseline methods.

WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation

The results show that the proposed method alleviates the aliasing well, is useful for both speech waveforms generated by analysis-and-synthesis and statistical parametric speech synthesis, and achieves a mean opinion score comparable to those of natural speech and speech synthesized by WaveNet and WaveGlow while processing speech samples at a rate of more than 150 kHz on an NVIDIA Tesla P100.

Voice spoofing detection with raw waveform based on Dual Path Res2net

The proposed DP-Res2Net significantly improves the model’s generalizability to unseen spoofing attacks, and the results demonstrate that it outperforms state-of-the-art audio spoof detection models.

A Cyclical Approach to Synthetic and Natural Speech Mismatch Refinement of Neural Post-filter for Low-cost Text-to-speech System

This paper explores a general framework to develop a neural post-filter (NPF) for low-cost TTS systems using neural vocoders and proposes a cyclical approach to tackle the acoustic and temporal mismatches of developing an NPF.

References

Showing 1-10 of 30 references

Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks

The proposed method can generate more natural spectral parameters and $F_0$ than the conventional minimum generation error training algorithm regardless of its hyperparameter settings, and a Wasserstein GAN minimizing the Earth Mover's distance is found to work best in terms of improving synthetic speech quality.

Sequence-to-Sequence Voice Conversion with Similarity Metric Learned Using Generative Adversarial Networks

The proposed SVC framework uses a similarity metric implicitly derived from a generative adversarial network, enabling the measurement of the distance in the high-level abstract space to mitigate the oversmoothing problem caused in the low-level data space.

Generative adversarial network-based postfilter for statistical parametric speech synthesis

Objective evaluation of experimental results shows that the GAN-based postfilter can compensate for detailed spectral structures including modulation spectrum, and subjective evaluation shows that its generated speech is comparable to natural speech.

SEGAN: Speech Enhancement Generative Adversarial Network

This work proposes the use of generative adversarial networks for speech enhancement, operating at the waveform level and training the model end-to-end; it incorporates 28 speakers and 40 different noise conditions into the same model, so that model parameters are shared across them.

Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks

This work uses a cycle-consistent adversarial network (CycleGAN) with gated convolutional neural networks (CNNs) and an identity-mapping loss to learn a mapping from source to target speech without relying on parallel data.
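The gated CNN layers mentioned above pass each convolution output through a gated linear unit (GLU): an elementwise product of a linear path and a sigmoid gate, which lets the network learn which activations to pass or suppress. A minimal sketch of the gating mechanism itself (not the full convolution), using made-up toy values:

```python
import math

def sigmoid(v):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-v))

def glu(linear_out, gate_out):
    """Gated linear unit: elementwise product of a linear path and a
    sigmoid-activated gate path, as used inside gated CNN layers."""
    return [a * sigmoid(b) for a, b in zip(linear_out, gate_out)]

# A strongly positive gate passes the value almost unchanged;
# a strongly negative gate suppresses it toward zero.
print(glu([1.0, 1.0], [10.0, -10.0]))
```

In a gated CNN, `linear_out` and `gate_out` would be the outputs of two parallel convolutions over the same input; the identity-mapping loss mentioned in the entry is a separate regularizer encouraging the generator to leave in-domain inputs unchanged.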

A postfilter to modify the modulation spectrum in HMM-based speech synthesis

The Modulation Spectrum (MS) of speech parameter trajectory is introduced as a new feature to effectively capture the over-smoothing effect, and a postfilter is proposed based on the MS.

Vae-Space: Deep Generative Model of Voice Fundamental Frequency Contours

The generative model proposed is able to accurately decompose an $F_{0}$ contour into the sum of phrase and accent components of the Fujisaki model, a mathematical model describing the control mechanism of vocal fold vibration, without an iterative algorithm.

Statistical parametric speech synthesis using deep neural networks

This paper examines an alternative scheme based on a deep neural network (DNN), in which the relationship between input texts and their acoustic realizations is modeled by a DNN; experimental results show that the DNN-based systems outperformed HMM-based systems with similar numbers of parameters.

Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory

Experimental results indicate that the performance of VC can be dramatically improved by the proposed method in terms of both speech quality and conversion accuracy of speaker individuality.

Context Encoders: Feature Learning by Inpainting

It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.