Corpus ID: 233204694

Flavored Tacotron: Conditional Learning for Prosodic-linguistic Features

Mahsa Elyasi, Gaurav Bharaj
Neural sequence-to-sequence text-to-speech (TTS) systems, such as Tacotron-2, transform text into high-quality speech. However, generating speech with natural prosody remains a challenge. Yasuda et al. [1] show that, unlike natural speech, Tacotron-2's encoder does not fully represent prosodic features (e.g., syllable stress in English) from characters, resulting in flat fundamental-frequency variations. In this work, we propose a novel, carefully designed strategy for conditioning… 
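One common way to condition a Tacotron-style encoder on prosodic-linguistic features is to concatenate a learned per-character stress embedding onto the character embeddings before the encoder. The sketch below illustrates that general idea only; the module names, dimensions, and the specific conditioning scheme are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- illustrative, not taken from the paper.
vocab_size, char_dim = 40, 256   # character embedding table
n_stress, stress_dim = 3, 16     # e.g. none / primary / secondary stress

char_table = rng.normal(size=(vocab_size, char_dim))
stress_table = rng.normal(size=(n_stress, stress_dim))

def condition_on_stress(char_ids, stress_ids):
    """Concatenate per-character stress embeddings onto character
    embeddings, giving the encoder an explicit prosodic input."""
    chars = char_table[char_ids]       # (T, char_dim)
    stress = stress_table[stress_ids]  # (T, stress_dim)
    return np.concatenate([chars, stress], axis=-1)  # (T, char_dim + stress_dim)

x = condition_on_stress(np.array([1, 5, 7]), np.array([0, 1, 0]))
print(x.shape)  # (3, 272)
```

In a real model the two tables would be trainable embedding layers and the stress labels would come from a lexicon or a prosody predictor.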


Can Prosody Transfer Embeddings be Used for Prosody Assessment?

This work uses an intonation data set and a voice conversion corpus to explore how neural prosody embeddings group utterances of different intonations, content, and speaker identity, and finds that the embeddings can achieve a geometric separability index as high as 0.956 for highly contrastive intonations and 0.706 for different sentence types.
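The snippet does not state which separability index the paper computes; one standard choice is Thornton's geometric separability index, the fraction of points whose nearest neighbour shares their label. A minimal sketch under that assumption:

```python
import numpy as np

def separability_index(embeddings, labels):
    """Thornton-style geometric separability: the fraction of points
    whose nearest neighbour (excluding itself) has the same label.
    1.0 means classes are perfectly separated locally."""
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)
    # Pairwise squared Euclidean distances.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)  # exclude self-matches
    nn = d2.argmin(axis=1)
    return float((y[nn] == y).mean())

# Two well-separated clusters of prosody embeddings -> index 1.0
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 1, 1])
print(separability_index(X, y))  # 1.0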

Liaison and Pronunciation Learning in End-to-End Text-to-Speech in French

Sequence-to-sequence (S2S) TTS models like Tacotron take grapheme-only inputs when trained fully end-to-end. Graphemes map to phone sounds depending on context, which traditionally is handled…

Investigation of Enhanced Tacotron Text-to-speech Synthesis Systems with Self-attention for Pitch Accent Language

The results reveal that although the proposed systems still do not match the quality of a top-line pipeline system for Japanese, they show important stepping stones towards end-to-end Japanese speech synthesis.

Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis

The results of objective evaluation of synthesized speech show that using the prosodic labels significantly improves the output in terms of faithfulness of f0 and energy contours, in comparison with state-of-the-art implementations.

Tacotron: Towards End-to-End Speech Synthesis

Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters, achieving a 3.82 subjective 5-scale mean opinion score on US English and outperforming a production parametric system in terms of naturalness.

Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis

Analysis of Pronunciation Learning in End-to-End Speech Synthesis

It is found that LTS errors for words with ambiguous or unpredictable pronunciations are mirrored as mispronunciations by an E2E model, which suggests that limited and unbalanced lexical coverage in E2E training data may pose significant confounding factors that complicate learning accurate pronunciation in a purely E2E system.

Impacts of input linguistic feature representation on Japanese end-to-end speech synthesis

Experimental results indicate improved naturalness of speech synthesized with high/low accent labels, that accent-phrase information helps predict pause insertion, and that an end-to-end text-to-speech model may be able to change the pronunciation of devoiced vowels and particles.

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps…

Cross-lingual Multi-speaker Text-to-speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers

It is found that normalizing speaker-embedding x-vectors by L2-norm normalization or whitening substantially improves output quality in many cases, and that WaveNet performance appears language-independent: a WaveNet trained on Cantonese speech can generate Mandarin and English speech well.
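Both normalizations mentioned in the snippet are standard operations on embedding vectors; a minimal numpy sketch (the paper's exact whitening procedure is not specified here, so ZCA whitening is assumed for illustration):

```python
import numpy as np

def l2_normalize(x):
    """Scale each embedding to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def whiten(X, eps=1e-8):
    """ZCA-whiten a batch of embeddings: zero mean, identity covariance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8)) * np.arange(1, 9)  # anisotropic x-vectors

print(np.allclose(np.linalg.norm(l2_normalize(X), axis=1), 1.0))  # True
print(np.allclose(np.cov(whiten(X), rowvar=False), np.eye(8), atol=1e-5))  # True
```

Either transform removes gross scale differences between speaker embeddings, which is one plausible reason it stabilizes the synthesizer's conditioning input.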

Multi-Lingual Multi-Speaker Text-to-Speech Synthesis for Voice Cloning with Online Speaker Enrollment

This work investigates a novel approach for generating high-quality speech in multiple languages for speakers enrolled in their native language, introducing tone/stress embeddings that extend the language embedding to represent tone and stress information.

Investigating accuracy of pitch-accent annotations in neural network-based speech synthesis and denoising effects

An utterance-level Turing test showed that listeners had difficulty differentiating synthetic speech from natural speech; the study further indicated that adding noise to the linguistic features in the training set can partially reduce the effect of annotation mismatch, regularize the model, and help the system perform better when the linguistic features of the test set are noisy.
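The snippet does not describe the exact noise-injection scheme; one simple form of "adding noise to linguistic features" is randomly flipping a fraction of pitch-accent labels before encoding them, sketched below under that assumption (the function name and flip rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_onehot(labels, n_classes, flip_prob=0.1):
    """Simulate annotation noise: randomly flip a fraction of
    pitch-accent labels before one-hot encoding, so the model does
    not over-trust possibly wrong annotations."""
    labels = np.asarray(labels).copy()
    flip = rng.random(labels.shape) < flip_prob
    labels[flip] = rng.integers(0, n_classes, flip.sum())
    return np.eye(n_classes)[labels]

accents = [0, 1, 1, 2, 0, 1]
X = noisy_onehot(accents, n_classes=3, flip_prob=0.2)
print(X.shape)  # (6, 3)
```

Training on such corrupted labels acts as a regularizer, analogous to label smoothing, which matches the snippet's claim that noise helps when test-time annotations are themselves imperfect.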