Corpus ID: 233204694

Flavored Tacotron: Conditional Learning for Prosodic-linguistic Features

Mahsa Elyasi and Gaurav Bharaj
Neural sequence-to-sequence text-to-speech synthesis (TTS), such as Tacotron-2, transforms text into high-quality speech. However, generating speech with natural prosody remains a challenge. Yasuda et al. [1] show that, unlike natural speech, Tacotron-2's encoder doesn't fully represent prosodic features (e.g. syllable stress in English) from characters, resulting in flat fundamental frequency variations. In this work, we propose a novel, carefully designed strategy for conditioning…


Can Prosody Transfer Embeddings be Used for Prosody Assessment?

This work uses an intonation data set and a voice conversion corpus to explore how neural prosody embeddings group for utterances of different intonations, content, and speaker identity, and finds that neural prosody embeddings can achieve a geometrical separability index as high as 0.956 for highly contrastive intonations, and 0.706 for different sentence types.

Liaison and Pronunciation Learning in End-to-End Text-to-Speech in French

Sequence-to-sequence (S2S) TTS models like Tacotron take grapheme-only inputs when trained fully end-to-end. Grapheme inputs map to phone sounds depending on context, which traditionally is handled…

Investigation of Enhanced Tacotron Text-to-speech Synthesis Systems with Self-attention for Pitch Accent Language

The results reveal that although the proposed systems still do not match the quality of a top-line pipeline system for Japanese, they show important stepping stones towards end-to-end Japanese speech synthesis.

Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis

The results of objective evaluation of synthesized speech show that using the prosodic labels significantly improves the output in terms of faithfulness of f0 and energy contours, in comparison with state-of-the-art implementations.

Tacotron: Towards End-to-End Speech Synthesis

Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.

Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis

Impacts of input linguistic feature representation on Japanese end-to-end speech synthesis

Experimental results indicate improved naturalness of the synthesized speech when using high or low accents, that accent-phrase information can help predict pause insertion, and that an end-to-end text-to-speech model may be able to change the pronunciation of devoiced vowels and particles.

Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps…

Investigating accuracy of pitch-accent annotations in neural network-based speech synthesis and denoising effects

An utterance-level Turing test showed that listeners had difficulty differentiating synthetic speech from natural speech; it further indicated that adding noise to the linguistic features in the training set can partially reduce the effect of the mismatch, regularize the model, and help the system perform better when linguistic features of the test set are noisy.

Tone Learning in Low-Resource Bilingual TTS

This work trains with monolingual English and Mandarin speakers and synthesizes every speaker in both languages; applying these techniques to a recent strong multilingual baseline achieves higher ratings in intelligibility and target accent, but slightly lower ratings in cross-lingual speaker similarity.

Merlin: An Open Source Neural Network Speech Synthesis System

The Merlin speech synthesis toolkit for neural network-based speech synthesis takes linguistic features as input, and employs neural networks to predict acoustic features, which are then passed to a vocoder to produce the speech waveform.

Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram

The proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech waveforms 28.68 times faster than real time on a single GPU, comparable to the best distillation-based Parallel WaveNet system.