FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis

@inproceedings{Bak2021FastPitchFormantSB,
  title={FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis},
  author={Taejun Bak and Jaesung Bae and Hanbin Bae and Young-Ik Kim and Hoon-Young Cho},
  booktitle={Interspeech},
  year={2021}
}
Methods for modeling and controlling prosody with acoustic features have been proposed for neural text-to-speech (TTS) models. Prosodic speech can be generated by conditioning on acoustic features. However, synthesized speech with a large pitch-shift scale suffers from audio quality degradation and deformation of speaker characteristics. To address this problem, we propose a feed-forward Transformer based TTS model that is designed based on the source-filter theory. This model, called FastPitchFormant, …
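The abstract's key idea can be illustrated with a classical source-filter sketch (this is not the FastPitchFormant model itself, just the underlying theory it builds on): an impulse-train "source" sets the pitch, while an all-pole resonator "filter" sets the formants. Shifting pitch changes only the source, so the formant structure, and hence the speaker's timbre, is preserved. All function names and formant values below are illustrative assumptions.

```python
# Minimal source-filter synthesis sketch (illustrative, not FastPitchFormant).
import numpy as np
from scipy.signal import lfilter

SR = 16000  # sample rate in Hz

def impulse_train(f0, duration, sr=SR):
    """Glottal-like excitation: one impulse per pitch period of f0 Hz."""
    n = int(duration * sr)
    src = np.zeros(n)
    period = int(sr / f0)
    src[::period] = 1.0
    return src

def formant_filter(freqs, bandwidths, sr=SR):
    """All-pole filter: one second-order resonance per (freq, bandwidth) pair."""
    a = np.array([1.0])
    for f, bw in zip(freqs, bandwidths):
        r = np.exp(-np.pi * bw / sr)        # pole radius (< 1, so stable)
        theta = 2 * np.pi * f / sr          # pole angle
        a = np.convolve(a, [1.0, -2 * r * np.cos(theta), r * r])
    return a

# Vowel-like formants (roughly /a/); values are rough illustrative guesses.
a = formant_filter([700, 1200, 2600], [80, 90, 120])

low  = lfilter([1.0], a, impulse_train(110, 0.5))  # low-pitched voice
high = lfilter([1.0], a, impulse_train(220, 0.5))  # pitch doubled, same filter
```

Because `low` and `high` pass through the same filter `a`, their spectral envelopes (formants) match while only the harmonic spacing differs; modeling the two factors with separate networks, as the abstract describes, aims to keep this property under large pitch shifts.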

Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch

Two algorithms to improve the robustness and pitch controllability of FastPitch are proposed, including a novel timbre-preserving pitch-shifting algorithm for natural pitch augmentation and a training algorithm that uses pitch-augmented speech datasets with different pitch ranges for the same sentence.

Adversarial Multi-Task Learning for Disentangling Timbre and Pitch in Singing Voice Synthesis

This study proposes a singing voice synthesis model with multi-task learning that uses both approaches (acoustic features for a parametric vocoder and mel-spectrograms for a neural vocoder) to improve the quality of singing voices in a multi-singer model.

Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech

Experimental results verify that the proposed HiMuV-TTS model can generate more diverse and natural speech as compared to TTS models with single-scale variational autoencoders, and can represent different prosody information in each scale.

Controllable Accented Text-to-Speech Synthesis

A neural TTS architecture is proposed that allows the accent and its intensity to be controlled during inference, and it attains superior performance to the baseline models in terms of accent rendering and intensity control.

PromptTTS: Controllable Text-to-Speech with Text Descriptions

PromptTTS is a text-to-speech (TTS) system that takes a prompt with both style and content descriptions as input to synthesize the corresponding speech; experiments show that it can generate speech with precise style control and high speech quality.

A Linguistic-based Transfer Learning Approach for Low-resource Bahnar Text-to-Speech

This work proposes a transfer learning approach that integrates Vietnamese pronunciation into the Bahnar TTS synthesizer, and shows significant improvement in the performance of the TTS model for a low-resource language.

SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech

This paper introduces a speaker regularization loss that improves speech naturalness during cross-lingual synthesis, as well as domain adversarial training, which is also applied in other multilingual TTS models.

References

Showing 1-10 of 28 references

FastSpeech: Fast, Robust and Controllable Text to Speech

A novel feed-forward network based on Transformer is proposed to generate mel-spectrograms in parallel for TTS; called FastSpeech, it speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.

Neural Source-filter-based Waveform Model for Statistical Parametric Speech Synthesis

This study proposes a non-AR neural source-filter waveform model that can be trained directly using spectrum-based training criteria and stochastic gradient descent; the quality of its synthetic speech is close to that of speech generated by the AR WaveNet.

Tacotron: Towards End-to-End Speech Synthesis

Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.

Robust and Fine-grained Prosody Control of End-to-end Speech Synthesis

  • Younggun Lee, Taesu Kim · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) · 2019
The proposed methods introduce temporal structures in the embedding networks, enabling fine-grained control of the speaking style of synthesized speech, and apply temporal normalization of prosody embeddings, which shows better robustness against speaker perturbations during prosody transfer tasks.

FastPitch: Parallel Text-to-speech with Pitch Prediction

It is found that uniformly increasing or decreasing the pitch with FastPitch generates speech that resembles voluntary modulation of the voice, with quality comparable to state-of-the-art speech synthesis.

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

FastSpeech 2 is proposed, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by training the model directly with the ground-truth target instead of the simplified output from a teacher, and by introducing more variation information of speech as conditional inputs.

Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps …

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

An extension to the Tacotron speech synthesis architecture is presented that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody; this results in synthesized audio that matches the prosody of the reference signal with fine time detail.

Parallel Tacotron: Non-Autoregressive and Controllable TTS

  • Isaac Elias, H. Zen, Yonghui Wu · ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) · 2021
A non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder, called Parallel Tacotron, which is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware.

Generating Diverse and Natural Text-to-Speech Samples Using a Quantized Fine-Grained VAE and Autoregressive Prosody Prior

  • Guangzhi Sun, Yu Zhang, Yonghui Wu · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) · 2020
Experimental results show that the proposed sequential prior in a discrete latent space significantly improves naturalness in random sample generation, and that random sampling can be used as data augmentation to improve ASR performance.