Chinese Text Classification without Automatic Word Segmentation

  • Wei Liu, Ben Allison, David Guthrie, Louise Guthrie
  • Published 1 August 2007
  • Computer Science
  • Sixth International Conference on Advanced Language Processing and Web Information Technology (ALPIT 2007)
Due to the lack of word boundaries in Asian systems of writing, machine processing of these languages often involves segmenting text into word units. This paper tests the assumption that this segmentation is a necessary step for authorship attribution and topic classification tasks in Chinese, and demonstrates that it is not. We show extensive results for both tasks, considering both single words and short phrases as features, and examining the effect of document length on classification… 
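The segmentation-free approach the abstract describes can be sketched in a few lines (an illustrative reconstruction, not the authors' exact feature set): each character, or short run of characters, becomes a feature directly, so no word segmenter is needed.

```python
from collections import Counter

def char_ngrams(text, n):
    """Overlapping character n-grams, ignoring whitespace.

    Treating characters (n=1) or short character sequences (n=2)
    as features sidesteps word segmentation entirely.
    """
    chars = [c for c in text if not c.isspace()]
    return [''.join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

# Bag-of-features vector for a document: unigrams plus bigrams
doc = "中文文本分类"
features = Counter(char_ngrams(doc, 1) + char_ngrams(doc, 2))
```

Any standard classifier (naive Bayes, SVM, etc.) can then consume these counts exactly as it would word counts.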

Citations

Is Word Segmentation Necessary for Deep Learning of Chinese Representations?
It is shown that word-based models are more vulnerable to data sparsity and to out-of-vocabulary (OOV) words, and thus more prone to overfitting; these findings may encourage researchers in the community to rethink the necessity of word segmentation in deep-learning-based Chinese natural language processing.
A Radical-Aware Attention-Based Model for Chinese Text Classification
A novel Radical-aware Attention-based Four-Granularity (RAFG) model is proposed to take full advantage of Chinese characters, words, character-level radicals, and word-level radicals simultaneously, with an attention mechanism that enhances the effect of radicals and thus models the radical-sharing property when integrating the granularities.
Quantitative evidence for a hypothesis regarding the attribution of early Buddhist translations
Using a variable-length n-gram feature extraction algorithm, principal component analysis, and average-linkage clustering, it is shown that 24 sutras, attributed by the tradition to different translators, were in fact translated by the same translator or group of translators.
Stylometric Analysis of Chinese Buddhist texts: Do different Chinese translations of the 'Gandhavyūha' reflect stylistic features that are typical for their age?
A method is developed to determine whether the use of grammatical particles in Chinese Buddhist scriptures is characteristic of the period of their translation, providing a tool for historical Chinese linguistics and Buddhist studies and an important basis for further research into Buddhist Hybrid Chinese translation idioms and the better attribution and dating of Chinese Buddhist texts.
Surveying Stylometry Techniques and Applications
An extensive performance analysis is performed on a corpus of 1,000 authors to investigate authorship attribution, verification, and clustering using 14 algorithms from the literature.

References

Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data
A new algorithm is presented for segmenting Chinese texts without using any lexicon or hand-crafted linguistic resource: the mutual information and the difference of t-scores between characters are derived automatically from raw Chinese corpora.
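The character-association idea behind that entry can be illustrated with a small sketch (pointwise mutual information only; the paper's t-score component is omitted, and the thresholding is an assumption): adjacent characters with high PMI tend to bind into a word, while low PMI suggests a boundary.

```python
import math
from collections import Counter

def pointwise_mi(corpus):
    """Estimate PMI between adjacent characters from raw text.

    High PMI(a, b) suggests the pair 'ab' forms a word; low PMI
    suggests a likely segmentation point between a and b.
    """
    chars = [c for c in corpus if not c.isspace()]
    uni = Counter(chars)
    bi = Counter(zip(chars, chars[1:]))
    total_u = sum(uni.values())
    total_b = sum(bi.values())
    mi = {}
    for (a, b), n in bi.items():
        p_ab = n / total_b
        p_a = uni[a] / total_u
        p_b = uni[b] / total_u
        mi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return mi
```

Segmentation then reduces to cutting the text wherever the PMI of an adjacent pair falls below a threshold learned or chosen on the corpus.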
Do We Need Chinese Word Segmentation for Statistical Machine Translation?
To avoid the translation system's dependence on an external dictionary, a system is developed that learns a domain-specific dictionary from the parallel training corpus and produces results comparable to those obtained with a predefined dictionary.
The Second International Chinese Word Segmentation Bakeoff
The second international Chinese word segmentation bakeoff was held in the summer of 2005, and it was found that the technology had improved over the intervening two years, though the out-of-vocabulary problem is still of paramount importance.
A Maximum Entropy Approach to Chinese Word Segmentation
This work evaluated the Chinese word segmenter in the open track, on all four corpora, namely Academia Sinica, City University of Hong Kong, Microsoft Research, and Peking University, and achieved the highest F measure for AS, CITYU, and PKU.
Chinese Word Segmentation based on Maximum Matching and Word Binding Force
A Chinese word segmentation algorithm based on forward maximum matching and word binding force is proposed in this paper. This algorithm plays a key role in post-processing the output of a character…
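Forward maximum matching itself is simple to sketch (a minimal illustration with a toy lexicon; the word-binding-force re-ranking described in the paper is not reproduced): at each position, take the longest dictionary entry that matches, falling back to a single character.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation against a lexicon.

    At each position the longest matching lexicon entry wins;
    an unmatched position is emitted as a single character.
    """
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            if n == 1 or cand in lexicon:
                words.append(cand)
                i += n
                break
    return words
```

The well-known weakness of the greedy strategy is overly long matches ("北京大学生" segments as "北京大学" + "生"), which is precisely what statistical re-ranking criteria such as word binding force are meant to mitigate.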
Language independent authorship attribution using character level language models
We present a method for computer-assisted authorship attribution based on character-level n-gram language models. Our approach is based on simple information-theoretic principles, and achieves…
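The information-theoretic idea can be sketched as follows (add-one smoothing and the vocabulary size are illustrative assumptions, not the paper's exact scheme): train a character n-gram model per candidate author, then attribute a disputed text to the author whose model assigns it the lowest cross-entropy.

```python
import math
from collections import Counter

def train_char_lm(text, n=3):
    """Character n-gram and (n-1)-gram context counts."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    ctx = Counter(text[i:i + n - 1] for i in range(len(text) - n + 2))
    return grams, ctx

def cross_entropy(text, model, n=3, vocab=5000):
    """Average negative log2-probability of text under the model,
    with add-one smoothing over an assumed vocabulary size."""
    grams, ctx = model
    total, count = 0.0, 0
    for i in range(len(text) - n + 1):
        g, c = text[i:i + n], text[i:i + n - 1]
        p = (grams[g] + 1) / (ctx[c] + vocab)
        total -= math.log2(p)
        count += 1
    return total / max(count, 1)
```

Because the unit is the character rather than the word, the same pipeline applies unchanged to Chinese, English, or any other script, which is what makes the method language independent.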
Chinese Word Segmentation and Information Retrieval
Results of experiments with Chinese word segmentation and information retrieval indicate that accurate segmentation measurably improves retrieval performance.
A comparison of event models for naive bayes text classification
It is found that the multi-variate Bernoulli model performs well with small vocabulary sizes, but that the multinomial model usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
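The two event models differ in what a document contributes to the class likelihood (term counts versus binary presence over the whole vocabulary), which a small sketch can make concrete; class priors and smoothing are omitted here for brevity.

```python
import math
from collections import Counter

def multinomial_ll(doc_tokens, word_probs):
    """Multinomial event model: log-likelihood weights each
    word by its *count* in the document."""
    return sum(n * math.log(word_probs[w])
               for w, n in Counter(doc_tokens).items())

def bernoulli_ll(doc_tokens, word_probs, vocab):
    """Multi-variate Bernoulli event model: every vocabulary word
    contributes, present ones via p(w), absent ones via 1 - p(w)."""
    present = set(doc_tokens)
    return sum(math.log(word_probs[w]) if w in present
               else math.log(1 - word_probs[w])
               for w in vocab)
```

Note that the Bernoulli likelihood scans the entire vocabulary regardless of document length, which is one reason it degrades as the vocabulary grows.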
HHMM-based Chinese Lexical Analyzer ICTCLAS
This document presents the results from the Institute of Computing Technology, Chinese Academy of Sciences in the ACL SIGHAN-sponsored First International Chinese Word Segmentation Bakeoff. The authors introduce the unified HHMM-based…
An Evaluation of Statistical Approaches to Text Categorization
Analysis and empirical evidence suggest that the evaluation results on some versions of Reuters were significantly affected by the inclusion of a large portion of unlabelled documents, making those results difficult to interpret and leading to considerable confusion in the literature.