A Comparison between Letters and Phones as Input to Sequence-to-Sequence Models for Speech Synthesis

@inproceedings{Fong2019ACB,
  title={A Comparison between Letters and Phones as Input to Sequence-to-Sequence Models for Speech Synthesis},
  author={Jason Fong and Jason Taylor and Korin Richmond and Simon King},
  year={2019}
}
  • Jason Fong, Jason Taylor, Korin Richmond, Simon King
  • Published 2019
Neural sequence-to-sequence (S2S) models for text-to-speech synthesis (TTS) may take letter or phone input sequences. Since for many languages phones have a more direct relationship to the acoustic signal, they lead to improved quality. But generating phone transcriptions from text requires an expensive dictionary and an error-prone grapheme-to-phoneme (G2P) model, and the relative improvement over using letters has yet to be quantified. In approaching this question, we presume that letter-input…
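To make the letter/phone distinction concrete, here is a minimal Python sketch of the two kinds of input sequence a S2S TTS front end might produce. It is an illustration only, not the authors' pipeline: the three-word lexicon is hand-written in CMUdict-style ARPAbet, the `_` word-boundary token is an assumption, and the letter fallback for out-of-vocabulary (OOV) words is a hypothetical stand-in for the G2P model that a real phone-input system would need.

```python
from typing import Callable, Dict, List

# Hypothetical mini-lexicon, hand-written in CMUdict-style ARPAbet
# (stress digits on vowels). A real system would load a full dictionary.
TOY_LEXICON: Dict[str, List[str]] = {
    "speech": ["S", "P", "IY1", "CH"],
    "letters": ["L", "EH1", "T", "ER0", "Z"],
    "phones": ["F", "OW1", "N", "Z"],
}

WORD_BOUNDARY = "_"  # assumed boundary token between words


def _tokenize(text: str, spell: Callable[[str], List[str]]) -> List[str]:
    """Lowercase, split on whitespace, spell each word, insert boundaries."""
    tokens: List[str] = []
    for i, word in enumerate(text.lower().split()):
        if i > 0:
            tokens.append(WORD_BOUNDARY)
        tokens.extend(spell(word))
    return tokens


def to_letters(text: str) -> List[str]:
    """Letter input: each word becomes its sequence of characters."""
    return _tokenize(text, list)


def to_phones(text: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Phone input: dictionary lookup per word.

    OOV words fall back to their letters here; this fallback is an
    assumption of the sketch, standing in for a trained G2P model.
    """
    return _tokenize(text, lambda word: lexicon.get(word, list(word)))


if __name__ == "__main__":
    sentence = "Speech tts"
    print(to_letters(sentence))  # ['s', 'p', 'e', 'e', 'c', 'h', '_', 't', 't', 's']
    print(to_phones(sentence, TOY_LEXICON))  # ['S', 'P', 'IY1', 'CH', '_', 't', 't', 's']
```

The letter front end needs no external resources, whereas the phone front end is only as good as its lexicon coverage and its OOV handling. That trade-off is the cost the abstract weighs against the quality benefit of phone input.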

References

Publications referenced by this paper (10 of 23 shown).

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

  • INTERSPEECH 2019
  • 2019

Analysis of Pronunciation Learning in End-to-End Speech Synthesis

Jason Taylor, Korin Richmond
  • INTERSPEECH 2019
  • 2019

Cloud Text-to-Speech

Google
  • Available online: https://cloud.google.com/text-to-speech/
  • 2019

Investigating the Robustness of Sequence-to-Sequence Text-to-Speech Models to Imperfectly-Transcribed Training Data

Jason Fong, Pilar Oplustil Gallegos, Zack Hodari, Simon King
  • INTERSPEECH 2019
  • 2019

The Carnegie Mellon pronouncing dictionary

CMU
  • Available online: https://github.com/cmusphinx/cmudict
  • 2019

Voices in Amazon Polly

Amazon
  • Available online: https://docs.aws.amazon.com/polly/latest/dg/voicelist.html
  • 2019

Combilex speech technology lexicon

K. Richmond
  • Available online: http://homepages.inf.ed.ac.uk/korin/sitenew/Research/Combilex
  • 2018

Representation Mixing for TTS Synthesis

  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018

Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention

  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017