Chinese base phrases chunking based on latent semi-CRF model
Abstract—Many researcher homepages are available on the Web. To make researcher information usable by a search engine, an effective approach is to build a semantic profile for each academic researcher by identifying and annotating the information on the homepage. In this paper, we label Chinese researcher information with a Conditional Random Fields (CRF) model, which has achieved good performance on named entity identification. We propose a hybrid annotation method that combines CRFs with semantic rules and jointly considers features such as the suffix, prefix, and semantic features of named entities. Comparison experiments show that this method can correctly extract the actual content of Chinese researcher homepages and simultaneously assign a suitable category label to each part of that content.
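As an illustration of the kind of features and rules the abstract mentions, the following is a minimal sketch, not the authors' implementation. The cue lexicon `SEMANTIC_CUES`, the label names, the function names `token_features` and `apply_rules`, and the e-mail rule are all hypothetical and chosen only for this example. The sketch builds prefix, suffix, and semantic-cue features for one token, in the dictionary form that CRF toolkits commonly accept, and then applies a single rule on top of labels that a CRF would have predicted.

```python
from typing import Dict, List

# Hypothetical semantic cue lexicon (an assumption, not from the paper).
SEMANTIC_CUES = {"大学": "ORG", "教授": "TITLE", "邮箱": "EMAIL_CUE"}


def token_features(tokens: List[str], i: int) -> Dict[str, str]:
    """Build prefix, suffix, semantic, and context features for token i."""
    tok = tokens[i]
    # First cue contained in the token, or "NONE" if no cue matches.
    semantic = next(
        (label for cue, label in SEMANTIC_CUES.items() if cue in tok), "NONE"
    )
    return {
        "token": tok,
        "prefix1": tok[:1],    # first character
        "suffix1": tok[-1:],   # last character
        "suffix2": tok[-2:],   # last two characters
        "semantic": semantic,
        "prev": tokens[i - 1] if i > 0 else "<BOS>",
        "next": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }


def apply_rules(tokens: List[str], labels: List[str]) -> List[str]:
    """Override predicted labels where a simple rule is decisive.

    The only rule here (a token containing '@' is an e-mail address)
    is a stand-in for the semantic rules of the hybrid method.
    """
    fixed = list(labels)
    for i, tok in enumerate(tokens):
        if "@" in tok:
            fixed[i] = "EMAIL"
    return fixed


if __name__ == "__main__":
    tokens = ["清华大学", "教授", "zhang@example.cn"]
    print(token_features(tokens, 0))
    # Labels as a CRF might have predicted them, with the e-mail missed.
    print(apply_rules(tokens, ["ORG", "TITLE", "O"]))
```

In a full pipeline, the feature dictionaries for every token on a page would be passed to a CRF trainer, and the rule step would run on the decoded label sequence. Which rules take precedence over the CRF output is a design choice that the abstract does not specify.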