Batuer Aisha

Learn More
Tokenization is very important for Uyghur language processing. Tokenization of Uyghur, an agglutinative language, is quite different from other languages such as Chinese and English. In this paper we propose a two-steps statistical tokenization method for Uyghur. Two related factors, the feature template scheme and the manually tokenized corpora, are also(More)
  • Batuer Aisha
  • 2010
In this paper, we present a letter tagging approach(LTA) to Uyghur tokenization. Experiments show that the problem with label bias (rich and complex suffixes) problem to be resolved using LTA combined with CRFs, so it is more effective than previous work, the accuracy of word tokenization reaches 93.3%. In future our tokenization research will be very(More)
  • 1