Segmenting Chinese in Unicode

T. Emerson
The automatic segmentation of Chinese text is an ongoing problem in information retrieval (IR) and computational linguistics: “words” in written Chinese are not delimited by spaces, so tokenizing (the first phase of many IR tasks) is considerably more difficult than for Western languages. This paper presents an overview of the segmentation problem, details previous research into its solution, and introduces Basis Technology’s Chinese Morphological Analyzer (CMA), a new, general-purpose hybrid…
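To make the tokenization difficulty concrete, the following minimal sketch contrasts whitespace splitting with forward maximum matching, a classical dictionary-based baseline. It is not the CMA's method, which the abstract describes only as a hybrid approach. The function name `forward_max_match`, the `max_len` parameter, and the toy lexicon are assumptions made purely for illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation using a word list.

    At each position, take the longest lexicon entry that matches;
    fall back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon; a real system would use a much larger dictionary.
TOY_LEXICON = {"我们", "喜欢", "中文", "信息", "检索", "信息检索"}

sentence = "我们喜欢中文信息检索"

# Whitespace splitting finds no boundaries: the sentence stays one token.
whitespace_tokens = sentence.split()

# Dictionary matching recovers word-like units.
matched_tokens = forward_max_match(sentence, TOY_LEXICON)

print(whitespace_tokens)
print(matched_tokens)
```

Greedy matching of this kind fails on ambiguous strings and on words missing from the lexicon, which is one reason segmentation remains an open problem.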
This paper has 18 citations.