Data Set Used
This paper presents the construction of a Chinese word sense-tagged corpus. The resulting lexical resource has three main components: 1) a corpus annotated with word senses; 2) a lexicon containing sense distinctions and descriptions in a feature-based formalism; 3) links between the sense entries in the lexicon and CCD synsets. A dynamic model…
This paper proposes a new method for Chinese language corpus processing. Unlike previous research, our approach has the following characteristics: it blends segmentation with tagging, and it integrates a rule-based approach with a statistics-based one in grammatical disambiguation. The principal ideas presented in the paper are incorporated in the development of a…
This paper is a comparative study of representation units in Chinese text categorization. Several kinds of units were investigated, including byte 3-grams, Chinese characters, Chinese words, and Chinese words with part-of-speech tags. Empirical evidence shows that when the size of training data is large enough, representations of higher-level or with…
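As an illustration of the two lowest-level units named above, the following minimal sketch extracts byte 3-grams and character units from a string. The function names `byte_ngrams` and `char_units` are hypothetical, and the default UTF-8 encoding is an assumption; the abstract does not say which encoding the study used. Word-level units would additionally require a segmenter and are not shown.

```python
def byte_ngrams(text, n=3, encoding="utf-8"):
    """Return all overlapping byte n-grams of the encoded text.

    The encoding is an assumption: the same text yields different
    byte n-grams under UTF-8 and GB2312.
    """
    data = text.encode(encoding)
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def char_units(text):
    """Return each non-whitespace character as one representation unit."""
    return [ch for ch in text if not ch.isspace()]


if __name__ == "__main__":
    sample = "中文 分类"
    print(byte_ngrams(sample))
    print(char_units(sample))
```

Byte n-grams need no linguistic preprocessing but may straddle character boundaries, whereas character units always align with the writing system.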
This paper presents a Unicode-based Chinese word segmentor. It can handle Chinese text in Simplified, Traditional, or mixed mode. The system uses a divide-and-conquer strategy to handle the recognition of personal names, numbers, time and numerical values, etc., in the pre-processing stage. The segmentor further uses tagging information to work on…
k is the most important parameter in a text categorization system based on the k-Nearest Neighbor algorithm (kNN). In the classification process, the k documents in the training set that are nearest to the test document are determined first. Then, the prediction can be made according to the category distribution among these k nearest neighbors. Generally speaking, the class…
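The two steps described above (find the k nearest training documents, then predict from their category distribution) can be sketched as follows. This is a minimal illustration, not the system from the paper: the bag-of-words representation, cosine similarity, and unweighted majority vote are all assumptions, and `knn_classify` is a hypothetical name.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(test_tokens, training, k):
    """Predict a category for test_tokens.

    training is a list of (tokens, label) pairs.
    Step 1: rank training documents by similarity to the test document.
    Step 2: take a majority vote among the k most similar ones.
    """
    test_vec = Counter(test_tokens)
    scored = [(cosine(test_vec, Counter(tokens)), label)
              for tokens, label in training]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    training = [
        (["stock", "market", "rise"], "finance"),
        (["stock", "price", "fall"], "finance"),
        (["team", "win", "match"], "sports"),
        (["match", "score", "goal"], "sports"),
    ]
    print(knn_classify(["stock", "market", "fall"], training, k=3))
```

The choice of k matters because the vote is taken only over `scored[:k]`: a small k makes the prediction sensitive to individual neighbors, while a large k lets frequent categories dominate.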
Automatic evaluation of output quality for machine translation systems is a difficult task. The Institute of Computational Linguistics of Peking University has developed an automatic evaluation system called MTE. This paper introduces the basic principles of MTE, its implementation techniques, and practical experience with it.
This paper introduces the attributes of emotional evaluation in the Grammatical Knowledge-base of Contemporary Chinese. Lexical emotion tagging is studied by means of both qualitative and quantitative approaches. Based on the statistical results from the People's Daily tagged corpus, lexical emotional trends are described and formulated in our…