Corpus ID: 102354807

Multi-reference Tacotron by Intercross Training for Style Disentangling, Transfer and Control in Speech Synthesis

@article{Bian2019MultireferenceTB,
  title={Multi-reference Tacotron by Intercross Training for Style Disentangling, Transfer and Control in Speech Synthesis},
  author={Yanyao Bian and Changbin Chen and Yongguo Kang and Zhenglin Pan},
  journal={ArXiv},
  year={2019},
  volume={abs/1904.02373}
}
Speech style control and transfer techniques aim to enrich the diversity and expressiveness of synthesized speech. [...] To address this issue, we introduce a novel multi-reference structure to Tacotron and propose an intercross training approach, which together ensure that each sub-encoder of the multi-reference encoder independently disentangles and controls a specific style. Experimental results show that our model is able to control and transfer desired speech styles individually.
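As a rough illustration of the key method, the sketch below shows a multi-reference encoder in PyTorch in which each sub-encoder maps one reference mel spectrogram to an embedding for a single style class; the class, module, and parameter names, dimensions, and input shapes are assumptions made for illustration, not the authors' implementation.

# Minimal sketch (not the authors' code): a multi-reference encoder whose
# sub-encoders each produce an embedding for one style class (e.g. emotion,
# speaker, prosody). The concatenated embeddings condition the Tacotron
# decoder alongside the text-encoder outputs.
import torch
import torch.nn as nn

class SubEncoder(nn.Module):
    """Encodes a reference mel spectrogram into a single style embedding."""
    def __init__(self, n_mels=80, hidden=128, emb_dim=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, ref_mel):                # ref_mel: (batch, frames, n_mels)
        _, h = self.rnn(ref_mel)               # h: (1, batch, hidden)
        return torch.tanh(self.proj(h[-1]))    # (batch, emb_dim)

class MultiReferenceEncoder(nn.Module):
    """One sub-encoder per style class; each is meant to capture one style factor."""
    def __init__(self, n_style_classes=3, emb_dim=64):
        super().__init__()
        self.sub_encoders = nn.ModuleList(
            [SubEncoder(emb_dim=emb_dim) for _ in range(n_style_classes)])

    def forward(self, ref_mels):               # list of reference mels, one per class
        embs = [enc(mel) for enc, mel in zip(self.sub_encoders, ref_mels)]
        return torch.cat(embs, dim=-1)         # (batch, n_style_classes * emb_dim)

In such a structure, swapping the reference fed to a single sub-encoder would change only the corresponding style dimension, which is the behavior intercross training is designed to enforce.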
Citations

Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS
Proposes a new approach to style transfer for both seen and unseen styles with disjoint, multi-style datasets, using an inverse autoregressive flow (IAF) structure to improve the variational inference.
Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency
This paper proposes an adversarial cycle consistency training scheme with paired and unpaired triplets to ensure the use of information from all style dimensions, and uses this method to transfer emotion from a dataset containing four emotions to a dataset with only a single emotion.
Cycle consistent network for end-to-end style transfer TTS training
The final evaluation demonstrates that the proposed approach significantly outperforms the Global Style Token (GST) and VAE based systems for all six style transfer categories, in metrics of naturalness, speech quality, similarity of speaker identity, and similarity of speaking style.
Introducing Prosodic Speaker Identity for a Better Expressive Speech Synthesis Control
Results show that the prosodic identity of the speaker is captured by the model, which therefore allows the user to control synthesis more precisely.
Controllable Context-aware Conversational Speech Synthesis
This work uses explicit labels to represent two typical spontaneous behaviors, filled pauses and prolongation, in the acoustic model; it develops a neural network based predictor to predict the occurrences of the two behaviors from text, and an algorithm based on the predictor to control their occurrence frequency.
Controllable Emotion Transfer For End-to-End Speech Synthesis
The synthetic speech of the proposed method is more accurate and expressive, with fewer emotion category confusions, and the control of emotion strength is more salient to listeners.
An Effective Style Token Weight Control Technique for End-to-End Emotional Speech Synthesis
This letter proposes an effective way of generating emotion embedding vectors by utilizing trained GSTs, and confirms that the proposed controlled-weight-based method is superior to conventional emotion label-based methods in terms of perceptual quality and emotion classification accuracy.
MASS: Multi-task Anthropomorphic Speech Synthesis Framework
A multi-task anthropomorphic speech synthesis framework (MASS) is proposed that can synthesize speech from text with a specified emotion and speaker identity, and that solves the problem of feature loss during voice conversion.
Emotional Speech Synthesis with Rich and Granularized Control
An inter-to-intra emotional distance ratio algorithm is introduced for the embedding vectors, which minimizes the distance to the target emotion category while maximizing its distance to the other emotion categories.
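The distance-ratio idea summarized above can be written compactly. The numpy sketch below is a hypothetical illustration that assumes class centroids as the reference points; the function name and exact formulation are not taken from the paper.

# Hypothetical sketch: an emotion embedding is scored higher when it is close
# to the centroid of its target emotion (intra-class distance) and far from
# the centroids of the other emotions (inter-class distance).
import numpy as np

def inter_to_intra_ratio(emb, centroids, target):
    """emb: (dim,) vector; centroids: {emotion: (dim,) vector}; target: emotion key."""
    intra = np.linalg.norm(emb - centroids[target])
    inter = np.mean([np.linalg.norm(emb - c)
                     for name, c in centroids.items() if name != target])
    return inter / (intra + 1e-8)   # larger ratio = better-separated embedding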
Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach
A new model, named Mix-StAGE, is proposed that trains a single model for multiple speakers while learning unique style embeddings for each speaker's gestures in an end-to-end manner, and allows for style preservation when learning simultaneously from multiple speakers.

References

Showing 1-10 of 15 references
Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis
The Variational Autoencoder (VAE) is introduced into an end-to-end speech synthesis model to learn the latent representation of speaking styles in an unsupervised manner; the learned representation shows good properties such as disentangling, scaling, and combination.
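The VAE-based latent style representation mentioned above can be sketched as follows; this is a minimal PyTorch illustration with assumed names and dimensions, not the paper's model.

# Minimal sketch: a reference mel spectrogram is encoded into the mean and
# log-variance of a latent style distribution, and a style vector is sampled
# with the reparameterization trick so the model remains differentiable.
import torch
import torch.nn as nn

class VAEStyleEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=128, latent_dim=16):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)

    def forward(self, ref_mel):                      # (batch, frames, n_mels)
        _, h = self.rnn(ref_mel)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterize
        # The KL term regularizes the latent space, which is what makes
        # scaling, interpolation, and combination of style vectors reasonable.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl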
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
An extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody; it results in synthesized audio that matches the prosody of the reference signal with fine time detail.
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
"Global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder
Experiments show that the VAE helps VoiceLoop generate higher quality speech and control the expressions in its synthesized speech by incorporating global characteristics into the speech generation process.
Close to Human Quality TTS with Transformer
This paper introduces and adapts the multi-head attention mechanism to replace the RNN structures, as well as the original attention mechanism in Tacotron 2, and achieves state-of-the-art performance and close to human quality.
Hierarchical Generative Modeling for Controllable Speech Synthesis
TLDR
A high-quality controllable TTS model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions is proposed. Expand
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
Neural Speech Synthesis with Transformer Network
This paper introduces and adapts the multi-head attention mechanism to replace the RNN structures, as well as the original attention mechanism in Tacotron 2, and achieves state-of-the-art performance and close to human quality.
Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
Deep Voice 3 is presented, a fully-convolutional attention-based neural text-to-speech (TTS) system that matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster.
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
It is shown that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.