Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
- Jonathan Shen, Ruoming Pang, Yonghui Wu
- IEEE International Conference on Acoustics…
- 16 December 2017
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps…
Tacotron: Towards End-to-End Speech Synthesis
- Yuxuan Wang, R. Skerry-Ryan, R. Saurous
- Interspeech
- 29 March 2017
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a subjective mean opinion score of 3.82 on a 5-point scale for US English, outperforming a production parametric system in naturalness.
On Training Targets for Supervised Speech Separation
- Yuxuan Wang, A. Narayanan, Deliang Wang
- IEEE/ACM Transactions on Audio Speech and…
- 1 December 2014
Results across various test conditions reveal that the two ratio-mask targets, the IRM and the FFT-MASK, outperform the other targets on objective intelligibility and quality metrics, and that masking-based targets are, in general, significantly better than spectral-envelope-based targets.
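As a minimal illustration of the ratio-mask idea described above (a sketch of the standard IRM definition, not this paper's training pipeline), the ideal ratio mask can be computed per time-frequency bin from clean-speech and noise magnitudes, assuming additive mixing:

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """Ideal ratio mask: speech energy over total energy per T-F bin.

    beta=0.5 gives the common square-root form. `speech_mag` and
    `noise_mag` are magnitude spectrograms of the same shape; the
    mixture is assumed to be speech + noise (additive, uncorrelated).
    """
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return (s2 / (s2 + n2 + 1e-12)) ** beta

# Toy example: a strong-speech, weak-noise scene yields a mask near 1.
rng = np.random.default_rng(0)
speech = np.abs(rng.normal(size=(257, 100)))
noise = 0.1 * np.abs(rng.normal(size=(257, 100)))
mask = ideal_ratio_mask(speech, noise)
```

In a supervised system the network is trained to predict this mask from mixture features; multiplying the predicted mask with the mixture magnitude then yields the enhanced speech estimate.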
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
- Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, Mark D. Plumbley
- IEEE/ACM Transactions on Audio Speech and…
- 21 December 2019
This paper proposes pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset, and investigates the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks.
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
- Yuxuan Wang, Daisy Stanton, R. Saurous
- International Conference on Machine Learning
- 23 March 2018
This work introduces "global style tokens" (GSTs), a bank of embeddings jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system; the tokens learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
- R. Skerry-Ryan, Eric Battenberg, R. Saurous
- International Conference on Machine Learning
- 24 March 2018
An extension to the Tacotron speech synthesis architecture learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody; conditioning on this embedding yields synthesized audio that matches the prosody of the reference signal with fine time detail.
Complex Ratio Masking for Monaural Speech Separation
- D. Williamson, Yuxuan Wang, Deliang Wang
- IEEE/ACM Transactions on Audio Speech and…
- 1 March 2016
The proposed approach improves over other methods on several objective metrics, including the perceptual evaluation of speech quality (PESQ), and in a listening test subjects preferred the proposed approach at a rate of at least 69%.
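To make the complex-ratio-mask idea concrete, here is a sketch of the oracle definition (the paper trains a network to estimate the mask's real and imaginary parts; this only shows what the ideal mask does). Because the mask is complex, it corrects phase as well as magnitude, so applying it to the mixture STFT recovers the clean STFT:

```python
import numpy as np

def complex_ratio_mask(clean_stft, mix_stft, eps=1e-12):
    """Oracle complex ideal ratio mask: element-wise S / Y.

    Computed as S * conj(Y) / |Y|^2 to avoid dividing by a complex
    number directly; eps guards against empty bins.
    """
    return clean_stft * np.conj(mix_stft) / (np.abs(mix_stft) ** 2 + eps)

# Toy STFTs: masking the mixture reproduces the clean signal exactly.
rng = np.random.default_rng(1)
clean = rng.normal(size=(5, 4)) + 1j * rng.normal(size=(5, 4))
noise = rng.normal(size=(5, 4)) + 1j * rng.normal(size=(5, 4))
mix = clean + noise
cirm = complex_ratio_mask(clean, mix)
enhanced = cirm * mix
```

A magnitude-only mask applied to the same mixture would keep the noisy phase, which is precisely the limitation complex ratio masking removes.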
Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model
- Yuxuan Wang, R. Skerry-Ryan, R. Saurous
- arXiv
- 29 March 2017
This paper presents Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters, along with several key techniques that make the sequence-to-sequence framework perform well on this challenging task.
Trainable frontend for robust and far-field keyword spotting
- Yuxuan Wang, Pascal Getreuer, Thad Hughes, R. Lyon, R. Saurous
- IEEE International Conference on Acoustics…
- 19 July 2016
This work introduces a novel frontend called per-channel energy normalization (PCEN), which replaces the static compression widely used in speech recognition with dynamic compression based on automatic gain control.
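A minimal numpy sketch of the PCEN transform may help: each filterbank channel is smoothed over time with a first-order IIR filter, the frame energy is divided by the smoothed energy (automatic gain control), and a root compression replaces the usual static log. The parameter values below are common illustrative defaults, not the paper's; the paper's point is that these parameters can be learned jointly with the network.

```python
import numpy as np

def pcen(energy, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization (illustrative sketch).

    energy: (channels, frames) non-negative filterbank energies.
    s: IIR smoothing coefficient; alpha: AGC strength;
    delta, r: offset and root for dynamic-range compression.
    """
    m = np.zeros_like(energy)
    m[:, 0] = energy[:, 0]
    for t in range(1, energy.shape[1]):
        # First-order IIR smoother tracking each channel's energy.
        m[:, t] = (1.0 - s) * m[:, t - 1] + s * energy[:, t]
    agc = energy / (eps + m) ** alpha      # automatic gain control
    return (agc + delta) ** r - delta ** r  # root compression, zero at agc=0

rng = np.random.default_rng(2)
e = np.abs(rng.normal(size=(40, 200))) ** 2  # fake mel-filterbank energies
out = pcen(e)
```

Because the AGC divides each channel by its own recent history, loudness offsets cancel out, which is what makes the frontend robust to far-field gain variation.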
Hierarchical Generative Modeling for Controllable Speech Synthesis
- Wei-Ning Hsu, Yu Zhang, Ruoming Pang
- International Conference on Learning…
- 16 October 2018
This work proposes a high-quality controllable TTS model that can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions.