Lexical units for Thai LVCSR

Abstract

Traditional language models rely on lexical units that are defined as entities separated from each other by word boundary markers. Since there are no such boundaries in Thai, alternative definitions of lexical units have to be pursued. The problem is to find the optimal set of lexical units that constitutes the vocabulary of the language model and yields the best final result. The word is a traditional lexical unit recognized by Thai people and is used by most natural language processing systems, including automatic speech recognition systems. This paper discusses problems with using words as lexical units and investigates other lexical units for a Thai large vocabulary continuous speech recognition (LVCSR) system. The pseudo-morpheme is introduced and shown to be unsuitable for direct use as a lexical unit. A technique that uses pseudo-morphemes to improve a system based on the traditional word model is then introduced, and it yields some improvement. Next, a new lexical unit for Thai, the compound pseudo-morpheme, and an algorithm to build compound pseudo-morphemes are presented. The experimental results show that the system using compound pseudo-morphemes outperforms the other systems. Thus, the compound pseudo-morpheme is the most suitable lexical unit for a Thai LVCSR system.

DOI: 10.1016/j.specom.2008.11.006

17 Figures and Tables

Cite this paper

@article{Jongtaveesataporn2009LexicalUF,
  title={Lexical units for Thai LVCSR},
  author={Markpong Jongtaveesataporn and Issara Thienlikit and Chai Wutiwiwatchai and Sadaoki Furui},
  journal={Speech Communication},
  year={2009},
  volume={51},
  pages={379--389}
}