We present two measures for comparing corpora based on information theory statistics such as gain ratio as well as simple term-class frequency counts. We tested the predictions made by these measures about corpus difficulty in two domains, news and molecular biology, using the result of two well-used paradigms for NE, decision trees and HMMs, and found that(More)
We report the results of a study into the use of a linear interpolating hidden Markov model (HMM) for the task of extracting technical terminology from MEDLINE abstracts and texts in the molecular-biology domain. This is the first stage in a system that will extract event information for automatically updating biology databases. We(More)
The tagging of Named Entities, the names of particular things or classes, is regarded as an important component technology for many NLP applications. The first Named Entity set had 7 types: organization, location, person, date, time, money and percent expressions. Later, in the IREX project, artifact was added, and ACE added two, GPE and facility, to pursue(More)
We present an outline of the genome information acquisition (GENIA) project for automatically extracting biochemical information from journal papers and abstracts. GENIA will be available over the Internet and is designed to aid in information extraction, retrieval and visualisation and to help reduce information overload on researchers. The vast(More)
UK PubMed Central (UKPMC) is a full-text article database that extends the functionality of the original PubMed Central (PMC) repository. The UKPMC project was launched as the first 'mirror' site to PMC, which in analogy to the International Nucleotide Sequence Database Collaboration, aims to provide international preservation of the open and free-access(More)
We have developed a sentence extraction system that estimates the significance of sentences by integrating four scoring functions that use as evidence sentence location, sentence length, TF/IDF values of words, and similarity to the title. Similarity to a given query is also added to the system in the summarization task for information retrieval. Parameters(More)
Corpus annotation is now a key topic for all areas of natural language processing (NLP) and information extraction (IE) which employ supervised learning. With the explosion of results in molecular-biology there is an increased need for IE to extract knowledge to support database building and to search intelligently for information in online journal(More)
We have introduced information extraction techniques such as named entity tagging and pattern discovery to a summarization system based on sentence extraction, and evaluated the performance in the Document Understanding Conference 2001 (DUC-2001). We participated in the Single Document Summarization task in DUC-2001 and achieved one of the best(More)