Markpong Jongtaveesataporn

Learn More
Large speech and text corpora are crucial to the development of a state-of-the-art speech recognition system. This paper reports on the construction and evaluation of the first Thai broadcast news speech and text corpora. Specifications and conventions used in the transcription process are described in the paper. The speech corpus contains about 17 hours of(More)
Traditional language models rely on lexical units that are de ned as entities separated from each other by word boundary markers. Since there are no such boundaries in Thai, alternative de nitions of lexical units have to be pursued. The problem is to nd the optimal set of lexical units that constitutes the vocabulary of the language model and yields the(More)
One of the difficulties of Thai language modeling is the process of text corpus preparation. Because there is no explicit word boundary marker in written Thai text, word segmentation must be performed prior to training a language model. This paper presents two approaches to language model construction for Thai LVCSR based on pseudo-morpheme merging. The(More)
The amount of available Thai broadcast news transcribed text for training a language model is still very limited, comparing to other major languages. Since the construction of a broadcast news corpus is very costly and time-consuming, newspaper text is often used to increase the size of training text data. This paper proposes a language model topic and(More)
  • 1