Acoustic-dependent Phonemic Transcription for Text-to-speech Synthesis

@inproceedings{Vythelingum2018AcousticdependentPT,
  title={Acoustic-dependent Phonemic Transcription for Text-to-speech Synthesis},
  author={Kevin Vythelingum and Y. Est{\`e}ve and Olivier Rosec},
  booktitle={INTERSPEECH},
  year={2018}
}
The purpose of text-to-speech synthesis (TTS) is to produce a speech signal from an input text. This implies annotating speech recordings with word and phonemic transcriptions. The overall quality of TTS depends heavily on the accuracy of the phonemic transcriptions. However, these are generally produced automatically by grapheme-to-phoneme conversion systems, which do not account for speaker variability. In this work, we explore ways to obtain signal-dependent phonemic transcriptions. We investigate…
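The limitation the abstract points to can be made concrete with a minimal sketch of a dictionary-based grapheme-to-phoneme converter (the phone set and lexicon below are hypothetical, not from the paper): it returns one canonical pronunciation per word, so its output never varies with the speaker or the acoustic signal.

```python
# Hypothetical toy lexicon mapping words to ARPAbet-style phone sequences.
LEXICON = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
}

def g2p(text):
    """Convert text to a phoneme sequence by plain lexicon lookup.

    Note: the result is fixed per word -- the signal-dependent
    transcription explored in the paper would instead let the
    acoustics disambiguate between pronunciation variants.
    """
    phones = []
    for word in text.lower().split():
        if word not in LEXICON:
            raise KeyError(f"out-of-vocabulary word: {word}")
        phones.extend(LEXICON[word])
    return phones

print(g2p("the cat sat"))  # ['DH', 'AH', 'K', 'AE', 'T', 'S', 'AE', 'T']
```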
2 Citations

Figures and Tables from this paper

Bridging Mixture Density Networks with Meta-Learning for Automatic Speaker Identification
TLDR
The authors consider this work a stepping stone toward more sophisticated meta-learning frameworks for accelerating voice recognition, and discuss a strategy for improving accuracy by incorporating household-based acoustic profiles with MDNML.

References

SHOWING 1-10 OF 26 REFERENCES
Error detection of grapheme-to-phoneme conversion in text-to-speech synthesis using speech signal and lexical context
TLDR
A method to automatically detect grapheme-to-phoneme conversion errors by comparing contrastive phonemization hypotheses is proposed; with it, the time spent on manual phoneme checking can be drastically reduced without significantly degrading phonemic transcription quality.
Deep Voice: Real-time Neural Text-to-Speech
TLDR
Deep Voice lays the groundwork for truly end-to-end neural speech synthesis and shows that inference with the system can be performed faster than real time and describes optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.
Testing the consistency assumption: Pronunciation variant forced alignment in read and spontaneous speech synthesis
TLDR
Evidence is presented that in the alignment of both standard read prompts and spontaneous speech the assumed phoneme sequence is often wrong, and that this is likely to have a negative impact on acoustic models.
Grapheme-to-phoneme conversion using Long Short-Term Memory recurrent neural networks
TLDR
This work proposes a G2P model based on a Long Short-Term Memory (LSTM) recurrent neural network (RNN) that has the flexibility of taking the full context of graphemes into consideration and transforms the problem from a series of grapheme-to-phoneme conversions into a single word-to-pronunciation conversion.
Joint-sequence models for grapheme-to-phoneme conversion
Sequence-to-sequence neural net models for grapheme-to-phoneme conversion
TLDR
The simple side-conditioned generation approach is able to rival the state-of-the-art with bi-directional long short-term memory (LSTM) neural networks that use the same alignment information that is used in conventional approaches.
Unit selection in a concatenative speech synthesis system using a large speech database
  • Andrew J. Hunt, A. Black
  • Computer Science
    1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings
  • 1996
TLDR
It is proposed that the units in a synthesis database can be considered as a state transition network in which the state occupancy cost is the distance between a database unit and a target, and the transition cost is an estimate of the quality of concatenation of two consecutive units.
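The state-transition-network view in the blurb above lends itself to a Viterbi-style dynamic program: each candidate unit carries a target (state-occupancy) cost, each edge a concatenation (transition) cost, and the selected sequence is the cheapest path. Here is a hedged, self-contained sketch of that idea; the candidate units and cost functions in the usage example are invented for illustration, not taken from Hunt and Black.

```python
def unit_selection(candidates, target_cost, concat_cost):
    """Pick one unit per target position minimizing total cost.

    candidates: list over target positions; each entry is a list of unit ids.
    target_cost(i, u): distance between database unit u and target i.
    concat_cost(p, u): quality cost of concatenating unit p before unit u.
    Returns (best unit sequence, total cost).
    """
    # best[i][u] = (cheapest cost ending in unit u at position i, backpointer)
    best = [{u: (target_cost(0, u), None) for u in candidates[0]}]
    for i in range(1, len(candidates)):
        layer = {}
        for u in candidates[i]:
            # cheapest predecessor, accounting for the transition cost
            prev, cost = min(
                ((p, c + concat_cost(p, u)) for p, (c, _) in best[i - 1].items()),
                key=lambda pc: pc[1],
            )
            layer[u] = (cost + target_cost(i, u), prev)
        best.append(layer)
    # backtrack from the cheapest final state
    u, (total, _) = min(best[-1].items(), key=lambda kv: kv[1][0])
    path = [u]
    for i in range(len(candidates) - 1, 0, -1):
        u = best[i][u][1]
        path.append(u)
    return list(reversed(path)), total

# Toy usage with made-up costs: "a2" then "b1" join cheaply.
cands = [["a1", "a2"], ["b1", "b2"]]
tcost = lambda i, u: {"a1": 1, "a2": 0, "b1": 0, "b2": 1}[u]
ccost = lambda p, u: 0 if (p, u) == ("a2", "b1") else 2
print(unit_selection(cands, tcost, ccost))  # (['a2', 'b1'], 0)
```

The quadratic inner loop over predecessor units mirrors the Viterbi search used in practice, where pruning keeps the candidate lists manageable.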
Pronunciation of proper names with a joint n-gram model for bi-directional grapheme-to-phoneme conversion
TLDR
The joint n-gram model for bi-directional grapheme-to-phoneme conversion is applied to the more specific task of converting between name pronunciations and spellings, and valuable information is derived about the potential of sub-lexical recognition of novel proper names.
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
TLDR
It is shown that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.
Grapheme to phoneme conversion using an SMT system
TLDR
Grapheme to phoneme conversion based on Moses is compared to two other methods: G2P, and a dictionary look-up method supplemented by a rule-based tool for phonetic transcriptions of words unavailable in the dictionary.
...