FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis
@inproceedings{Bak2021FastPitchFormantSB,
  title     = {FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis},
  author    = {Taejun Bak and Jaesung Bae and Hanbin Bae and Young-Ik Kim and Hoon-Young Cho},
  booktitle = {Interspeech},
  year      = {2021}
}
Methods for modeling and controlling prosody with acoustic features have been proposed for neural text-to-speech (TTS) models, and prosodic speech can be generated by conditioning on acoustic features. However, speech synthesized with a large pitch-shift scale suffers from degraded audio quality and deformed speaker characteristics. To address this problem, we propose a feed-forward Transformer based TTS model designed according to the source-filter theory. This model, called…
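The source-filter idea the abstract refers to can be sketched numerically: one branch (the source) is driven by the pitch contour and another (the filter) by text-derived hidden states, and their outputs are combined into a single mel-spectrogram. The function names, shapes, and the additive combination below are illustrative assumptions, not the paper's actual architecture (the real branches would be Transformer stacks).

```python
import numpy as np

def excitation_generator(f0, n_mels=80):
    """Toy 'source' branch: expand a frame-level F0 contour (T,) into a
    mel-sized representation (T, n_mels). Placeholder for a learned network."""
    f0 = np.asarray(f0, dtype=np.float32)
    return np.outer(f0, np.ones(n_mels, dtype=np.float32))

def formant_generator(hidden, n_mels=80):
    """Toy 'filter' branch: project text-derived hidden states (T, H) to
    (T, n_mels) with a fixed random linear map. Placeholder for a network."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal((hidden.shape[-1], n_mels)).astype(np.float32)
    return hidden @ w

def synthesize(f0, hidden):
    """Combine the source and filter branches into one mel-spectrogram."""
    return excitation_generator(f0) + formant_generator(hidden)

T, H = 50, 16
mel = synthesize(np.linspace(100.0, 200.0, T), np.ones((T, H), dtype=np.float32))
```

Because pitch flows only through the source branch in this decomposition, shifting F0 leaves the filter (timbre) branch untouched, which is the intuition behind the paper's claim of better quality under large pitch shifts.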
8 Citations
Controllable speech synthesis by learning discrete phoneme-level prosodic representations
- Computer Science · Speech Commun.
- 2023
Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch
- Computer Science · INTERSPEECH
- 2022
Two algorithms to improve the robustness and pitch controllability of FastPitch are proposed: a novel timbre-preserving pitch-shifting algorithm for natural pitch augmentation, and a training algorithm that uses pitch-augmented speech datasets with different pitch ranges for the same sentence.
Adversarial Multi-Task Learning for Disentangling Timbre and Pitch in Singing Voice Synthesis
- Computer Science · INTERSPEECH
- 2022
This study proposes a singing voice synthesis model with multi-task learning that uses both acoustic features (for a parametric vocoder) and mel-spectrograms (for a neural vocoder) to improve the quality of singing voices in a multi-singer model.
Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech
- Computer Science · INTERSPEECH
- 2022
Experimental results verify that the proposed HiMuV-TTS model can generate more diverse and natural speech as compared to TTS models with single-scale variational autoencoders, and can represent different prosody information in each scale.
Controllable Accented Text-to-Speech Synthesis
- Computer Science · ArXiv
- 2022
A neural TTS architecture is proposed that allows the accent and its intensity to be controlled during inference, and it attains superior performance to the baseline models in terms of accent rendering and intensity control.
PromptTTS: Controllable Text-to-Speech with Text Descriptions
- Computer Science · ArXiv
- 2022
A text-to-speech (TTS) system that takes a prompt with both style and content descriptions as input to synthesize the corresponding speech, and experiments show that PromptTTS can generate speech with precise style control and high speech quality.
A Linguistic-based Transfer Learning Approach for Low-resource Bahnar Text-to-Speech
- Computer Science · 2022 9th NAFOSTED Conference on Information and Computer Science (NICS)
- 2022
This work proposes the transfer learning approach to integrate the Vietnamese pronunciation into the Bahnar TTS synthesizer, and shows significant improvement in the performance of the TTS model for a low-resource language.
SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech
- Linguistics, Computer Science · INTERSPEECH
- 2022
This paper introduces a speaker regularization loss that improves speech naturalness during cross-lingual synthesis, as well as domain adversarial training, which is also applied in other multilingual TTS models.
References
Showing 1-10 of 28 references
FastSpeech: Fast, Robust and Controllable Text to Speech
- Computer Science · NeurIPS
- 2019
A novel feed-forward network based on the Transformer, called FastSpeech, is proposed to generate mel-spectrograms in parallel for TTS, speeding up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.
Neural Source-filter-based Waveform Model for Statistical Parametric Speech Synthesis
- Computer Science · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
This study proposes a non-autoregressive neural source-filter waveform model that can be trained directly using spectrum-based training criteria and stochastic gradient descent; the quality of its synthetic speech is close to that of speech generated by the autoregressive WaveNet.
Tacotron: Towards End-to-End Speech Synthesis
- Computer Science · INTERSPEECH
- 2017
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Robust and Fine-grained Prosody Control of End-to-end Speech Synthesis
- Computer Science · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
The proposed methods introduce temporal structures in the embedding networks, enabling fine-grained control of the speaking style of the synthesized speech, and introduce temporal normalization of prosody embeddings, which improves robustness against speaker perturbations in prosody transfer tasks.
FastPitch: Parallel Text-to-speech with Pitch Prediction
- Computer Science · ICASSP
- 2021
It is found that uniformly increasing or decreasing the pitch with FastPitch generates speech that resembles the voluntary modulation of voice, making it comparable to state-of-the-art speech.
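The uniform pitch shifting described above amounts to scaling the predicted F0 contour before it conditions the decoder. A minimal sketch of such a shift, expressed in semitones and leaving unvoiced frames (F0 = 0) untouched; the function name and interface are illustrative, not FastPitch's actual API:

```python
import numpy as np

def shift_pitch(f0, semitones):
    """Uniformly shift a predicted F0 contour (Hz) by `semitones`.
    A shift of 12 semitones doubles the frequency (one octave);
    frames with f0 == 0 are treated as unvoiced and left at zero."""
    f0 = np.asarray(f0, dtype=np.float32)
    factor = 2.0 ** (semitones / 12.0)
    return np.where(f0 > 0, f0 * factor, 0.0)

f0 = np.array([0.0, 110.0, 220.0])   # unvoiced frame, A2, A3
shifted = shift_pitch(f0, 12)        # one octave up: [0, 220, 440]
```

The model then consumes the shifted contour in place of the predicted one, which is what makes the modulation sound "voluntary" rather than artifact-laden.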
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
- Computer Science · ICLR
- 2021
FastSpeech 2 is proposed, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by directly training the model with ground-truth target instead of the simplified output from teacher, and introducing more variation information of speech as conditional inputs.
Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions
- Computer Science · 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps…
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
- Physics · ICML
- 2018
An extension to the Tacotron speech synthesis architecture is presented that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody, resulting in synthesized audio that matches the prosody of the reference signal in fine time detail.
Parallel Tacotron: Non-Autoregressive and Controllable TTS
- Computer Science · ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
A non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder, called Parallel Tacotron, which is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware.
Generating Diverse and Natural Text-to-Speech Samples Using a Quantized Fine-Grained VAE and Autoregressive Prosody Prior
- Computer Science · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
Experimental results show that the proposed sequential prior in a discrete latent space significantly improves naturalness in random sample generation, and that random sampling can be used as data augmentation to improve ASR performance.