The CIPS-SIGHAN CLP 2012 Chinese Word Segmentation on MicroBlog Corpora Bakeoff was held in the autumn of 2012. The bake-off task focuses on the performance of Chinese word segmentation algorithms on MicroBlog corpora. Seventeen groups submitted 20 results; the best system has P, R and F values all near 95%, and the…
This paper presents a maximum entropy (ME)-based model for Chinese noun phrase metaphor recognition. Metaphor recognition is treated as a classification task between metaphorical and literal meaning. Our experiments show that the metaphor recognizer based on the ME method is significantly better than the example-based methods within the same…
This paper summarizes several aspects of the SIGHAN 2014 Chinese Word Segmentation bake-off, including the dataset and evaluation results. In addition, we analyze segmentation errors by instance and make a suggestion for improving segmentation systems. 1 Goal of the Chinese word segmentation bake-off. Chinese Word Segmentation is the preliminary step for Chinese…
NP identification is a challenging subtask of NLP. The existing literature mainly focuses on base noun phrases and maximal-length noun phrases, and treats them as a sequence labeling problem. In this paper, unlike existing approaches, we concentrate on a special subcategory of Chinese NP, the classifier noun phrase (CNP), and present a new approach which uses…
In contemporary Chinese, there is a subclass of verbs called Dummy Verbs. After briefly introducing the lexical meanings of two typical dummy verbs, 'Jiayi' and 'Jinxing', this paper discusses the grammatical attributes of 'Jiayi' and 'Jinxing' in detail and further explores their functions as markers of syntactic constituents and semantic roles.
The increase in three-character words is attracting more and more attention from researchers. In the present paper, the ratio of three-character words unrecorded in the Grammatical Knowledge-base of Contemporary Chinese (henceforth, three-character unknown words) is obtained by an analysis of the tagged corpus of People's Daily of 1998. The results show that the…