In this paper we begin to investigate how to automatically determine the subjectivity orientation of questions posted by real users in community question answering (CQA) portals. Subjective questions seek answers containing private states, such as personal opinion and experience. In contrast, objective questions request objective, verifiable...
An increasingly popular method for finding information online is via Community Question Answering (CQA) portals such as Yahoo! Answers, Naver, and Baidu Knows. Searching the CQA archives, and ranking, filtering, and evaluating the submitted answers, requires intelligent processing of the questions and answers posed by the users. One important task is...
Error-Correcting Output Coding (ECOC) is a general framework for multiclass text classification with a set of binary classifiers. It can not only help a binary classifier solve multi-class classification problems, but also boost the performance of a multi-class classifier. When building each individual binary classifier in ECOC, multiple classes are...
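The decoding side of ECOC can be illustrated with a minimal sketch. It assumes four hypothetical class names and an illustrative 7-bit code matrix (neither comes from the paper), and it takes the outputs of the binary classifiers as given rather than training them. Each column of the matrix defines one binary problem, and a document is assigned to the class whose codeword is closest in Hamming distance to the predicted bit string.

```python
# Illustrative code matrix: one 7-bit codeword per class.
# Class names and codewords are assumptions made for this sketch.
CODE_MATRIX = {
    "sports":   (1, 1, 1, 1, 1, 1, 1),
    "finance":  (0, 0, 0, 0, 1, 1, 1),
    "politics": (0, 0, 1, 1, 0, 0, 1),
    "health":   (0, 1, 0, 1, 0, 1, 0),
}


def positive_classes(column):
    """Classes that binary classifier `column` treats as the positive side."""
    return sorted(c for c, word in CODE_MATRIX.items() if word[column] == 1)


def hamming(a, b):
    """Number of positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(bits):
    """Return the class whose codeword is nearest to the predicted bits."""
    return min(CODE_MATRIX, key=lambda c: hamming(CODE_MATRIX[c], bits))


# Suppose the seven binary classifiers output the "politics" codeword
# with the first bit flipped by a single classifier error.
noisy = (1, 0, 1, 1, 0, 0, 1)
print(decode(noisy))
print(positive_classes(2))
```

Every pair of codewords in this matrix differs in four positions, so any single wrong binary decision still decodes to the intended class. That redundancy is the "error-correcting" part of the framework.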
Temporal information is useful in many NLP applications, such as information extraction, question answering and summarization. In this paper, we present a temporal parser for extracting and normalizing temporal expressions from Chinese texts. An integrated temporal framework is proposed, which includes basic temporal concepts and the classification of...
NLM's Unified Medical Language System (UMLS) is a very large ontology of biomedical and health data. In order to be used effectively for knowledge processing, it needs to be customized to a specific domain. In this paper, we present techniques to automatically discover domain-specific concepts, discover relationships between these concepts, build a context...
This paper is a comparative study on representing units in Chinese text categorization. Several kinds of representing units, including byte 3-gram, Chinese character, Chinese word, and Chinese word with part of speech tag, were investigated. Empirical evidence shows that when the size of training data is large enough, representations of higher-level or with...
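Two of the representing units named above, byte 3-grams and single Chinese characters, can be extracted with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and UTF-8 is an assumed encoding (the paper's own encoding is not stated here, and a different encoding yields different byte n-grams). Word-level and part-of-speech units need a segmenter and tagger and are left out.

```python
def byte_ngrams(text, n=3, encoding="utf-8"):
    """Overlapping byte n-grams of the encoded text (encoding is an assumption)."""
    data = text.encode(encoding)
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def char_units(text):
    """One unit per character, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]


sample = "中文 分类"
print(char_units(sample))
print(len(byte_ngrams("中文")))
```

Byte n-grams need no linguistic knowledge but can straddle character boundaries, whereas character units always align with the script.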
This paper presents a Unicode-based Chinese word segmentor. It can handle Chinese text in Simplified, Traditional, or mixed mode. The system uses a divide-and-conquer strategy to handle the recognition of personal names, numbers, time and numerical values, etc., in the pre-processing stage. The segmentor further uses tagging information to work on...
k is the most important parameter in a text categorization system based on the k-Nearest Neighbor (kNN) algorithm. In the classification process, the k documents in the training set nearest to the test document are determined first. Then, the prediction can be made according to the category distribution among these k nearest neighbors. Generally speaking, the class...
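The two steps described above (find the k nearest training documents, then predict from their category distribution) can be sketched as follows. This is a minimal illustration, not the paper's system: it assumes bag-of-words term counts, cosine similarity, and a simple majority vote, and the toy training set is invented for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(train, test_doc, k):
    """Label test_doc by majority vote among its k most similar training docs."""
    query = Counter(test_doc.split())
    ranked = sorted(
        train,
        key=lambda item: cosine(query, Counter(item[0].split())),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set, invented for illustration.
TRAIN = [
    ("stock market shares rise", "finance"),
    ("bank interest rate cut", "finance"),
    ("team wins football match", "sports"),
    ("player scores goal match", "sports"),
    ("market rate shares fall", "finance"),
]

print(knn_classify(TRAIN, "football player match", k=3))
print(knn_classify(TRAIN, "football player match", k=5))
```

On this toy data the prediction changes with k: with k=3 the two sports documents outvote one unrelated neighbor, while with k=5 every training document votes and the larger finance class wins. This is one way to see why the choice of k matters.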