Which is More Suitable for Chinese Word Segmentation, the Generative Model or the Discriminative One?

Abstract

Because the traditional word-based n-gram model, a generative approach, cannot handle out-of-vocabulary (OOV) words in the test set, the character-based discriminative approach has been widely adopted in recent years. However, the discriminative model, though more robust to OOV words, fails to deliver satisfactory performance on in-vocabulary (IV) words that have been observed before. Our analysis of the word-based approach finds that its excellent performance on IV words stems from its ability to model the dependency between adjacent characters within a word, a cue that humans are believed to rely on when segmenting text. To incorporate this intra-word character dependency, this paper proposes a character-based approach with a generative model. Experiments on the second SIGHAN Bakeoff show that the proposed model not only achieves a good balance between IV and OOV words, but also outperforms the above-mentioned well-known approaches under similar conditions.
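To make the IV/OOV distinction concrete, the following minimal sketch partitions test-set words by whether they appear in the training vocabulary. It is an illustration only, not the model proposed in the paper; the function name `split_iv_oov` and the toy word lists are assumptions introduced here.

```python
def split_iv_oov(train_words, test_words):
    """Partition test-set word tokens into IV and OOV lists.

    A word is in-vocabulary (IV) if it occurs in the training data,
    and out-of-vocabulary (OOV) otherwise. Hypothetical helper for
    illustration; not taken from the paper.
    """
    vocab = set(train_words)
    iv = [w for w in test_words if w in vocab]
    oov = [w for w in test_words if w not in vocab]
    return iv, oov


# Toy data (assumed): gold-standard words from a training and a test set.
train = ["我们", "喜欢", "北京"]
test = ["我们", "喜欢", "上海"]

iv, oov = split_iv_oov(train, test)
oov_rate = len(oov) / len(test)  # fraction of test tokens unseen in training
```

Under this definition, a purely word-based model has no entry for any word in `oov`, whereas a character-based model can still assemble such words from characters it has seen.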

Cite this paper

@inproceedings{Wang2012WhichIM, title={Which is More Suitable for Chinese Word Segmentation, the Generative Model or the Discriminative One?}, author={Kun Wang and Chengqing Zong and Keh-Yih Su}, year={2012} }