The Speech Labeling and Modeling Toolkit (SLMTK) Version 1.0

@article{Chiang2022TheSL,
  title={The Speech Labeling and Modeling Toolkit (SLMTK) Version 1.0},
  author={Chen-Yu Chiang and Wu-Hao Li and Yen-Ting Lin and Jia-Jyu Su and Wei-Cheng Chen and Cheng-Che Kao and Shu-Lei Lin and Pin-Han Lin and Shao-Wei Hong and Guan-Ting Liou and Wen-Yang Chang and Jen-Chieh Chiang and Yen-Ting Lin and Yih-Ru Wang and Sin-Horng Chen},
  journal={2022 25th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)},
  year={2022},
  pages={1--5},
  url={https://api.semanticscholar.org/CorpusID:255188027}
}
The Speech Labeling and Modeling Toolkit version 1.0, which facilitates automatic labeling of text and speech for constructing text-to-speech (TTS) systems and speech analysis, has been applied to constructing personalized TTS systems for augmentative and alternative communication.

VoiceBank-2023: A Multi-Speaker Mandarin Speech Corpus for Constructing Personalized TTS Systems for the Speech Impaired

The corpus design, corpus recording, data purging and correction for the corpus, and evaluations of the developed personalized TTS systems, for the VoiceBanking project are reported.

A Preliminary Study on Analysing Mandarin Tone Values of Romance L2 Mandarin Learners

The 5-level tone value labeling system helps characterize pitch contours of syllables for L2 Mandarin learners and teachers to facilitate tone acquisition and tone error analysis, respectively.

Tone Value Representation for Computer-Assisted Pronunciation Training

Experimental results show that, by looking at the representation proposed in this study, subjects can identify the tone class and visually evaluate the quality of the tones of syllables pronounced by the speakers, and that the approach offers more comprehensive feedback to learners.

Hierarchical prosody modeling of English speech and its application to TTS

A hierarchical prosody modeling approach for English speech is proposed, an extended version of the HPM approach proposed previously for Mandarin speech that designs a syllable-based, statistical prosodic model and employs a prosody labeling and modeling algorithm to estimate the model parameters and label the prosodic tags of all training utterances simultaneously from a prosodic-unlabeled speech corpus.

Hierarchical prosody modeling for Mandarin spontaneous speech.

An application of the HPM to assist in Mandarin spontaneous-speech recognition is discussed, with significant relative error rate reductions for base-syllable, character, tone, and word recognition.

A Prosodic Mandarin Text-to-Speech System Based on Tacotron

Under subjective evaluation in terms of the prosody, results show that the synthesis system performs better by adding the prosodic system as the front-end system for Tacotron.

Implementing Prosodic Phrasing in Chinese End-to-end Speech Synthesis

This paper proposes a solution for an end-to-end Chinese TTS system based on Tacotron 2 and a WaveNet vocoder, and adds extra contextual information to improve the performance of prosodic phrasing.

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

FastSpeech 2 is proposed, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by directly training the model with ground-truth target instead of the simplified output from teacher, and introducing more variation information of speech as conditional inputs.

Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi

The Montreal Forced Aligner is an update to the Prosodylab-Aligner, and maintains its key functionality of trainability on new data, as well as incorporating improved architecture (triphone acoustic models and speaker adaptation), and other features.

A novel method for Mandarin speech synthesis by inserting prosodic structure prediction into Tacotron2

This paper proposes a novel synthesis method that adds a Mandarin-to-PinYin module and a prosodic structure prediction model to Tacotron2 to help it synthesize more natural and human-like Mandarin speech.

An Exploration of Local Speaking Rate Variations in Mandarin Read Speech

Prosody generated with local speaking rate variations is shown to be more vivid than prosody generated with a constant speaking rate, and can be used in the prosody generation of Mandarin TTS.

Speaker Adaptation of SR-HPM for Speaking Rate-Controlled Mandarin TTS

Both objective and subjective evaluations show that the proposed method not only performs better than the maximum likelihood-based method in the observed SR range of the target speaker's data, but also is much better in the unseen SR ranges.

Latent Prosody Model of Continuous Mandarin Speech

A latent prosody model (LPM) aiming to jointly model the effects of tone and prosody state on F0 is proposed, with the main purposes of improving tone recognition accuracy and automatic prosody-state labeling.