Chinese Text Classification without Automatic Word Segmentation

  • Wei Liu, Ben Allison, David Guthrie, Louise Guthrie
  • Published 1 August 2007
  • Computer Science
Due to the lack of word boundaries in Asian systems of writing, machine processing of these languages often involves segmenting text into word units. This paper tests the assumption that this segmentation is a necessary step for authorship attribution and topic classification tasks in Chinese, and demonstrates that it is not. We show extensive results for both tasks, considering both single words and short phrases as features, and examining the effect of document length on classification… 
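As a rough illustration of what classifying without segmentation could look like, the sketch below builds features directly from overlapping character n-grams, so no word segmenter is needed. This is an assumption made for illustration: the abstract does not say which feature units the paper uses in place of segmented words, and the function names `char_ngrams` and `ngram_features` and the sample sentence are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def ngram_features(text, max_n=2):
    """Count character n-grams for n = 1..max_n (an assumed feature set,
    not necessarily the one used in the paper)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(char_ngrams(text, n))
    return counts


# Hypothetical sample sentence; no segmentation step is applied.
features = ngram_features("我喜欢机器学习", max_n=2)
```

The resulting counts could be passed to any standard bag-of-features classifier. Some of the bigrams will straddle word boundaries, which is the price of skipping segmentation.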

