Learn More
\~e report the results of a study into the use of a linear interpolating hidden Marker model (HMM) for the task of extra.('ting lxw]mi(:al |;er-minology fl:om MEDLINE al)stra('ts and texl;s in the molecular-bioh)gy domain. Tiffs is the first stage isl a. system that will exl;ra('l; evenl; information for automatically ut)da.ting 1)ioh)gy databases. We(More)
Detection of abusive language in user generated online content has become an issue of increasing importance in recent years. Most current commercial methods make use of blacklists and regular expressions, however these measures fall short when contending with more subtle, less ham-fisted examples of hate speech. In this work, we develop a machine learning(More)
We present two measures for comparing corpora based on infbrmation theory statistics such as gain ratio as well as simple term-class ~equency counts. We tested the predictions made by these measures about corpus difficulty in two domains-news and molecular biology using the result of two well-used paradigms for NE, decision trees and HMMs and found that(More)
The rapid growth of collections in online academic databases has meant that there is increasing diculty for experts who want to access information in a timely and ecient way. We seek here to explore the application of information extraction methods to the identication and classication of terms in biological abstracts from MEDLINE. We explore the use of a(More)
We present an outline of the genome information acquisition (GENIA) project for automatically extracting biochemical information from journal papers and abstracts. GENIA will be available over the Internet and is designed to aid in information extraction, retrieval and vi-sualisation and to help reduce information overload on researchers. The vast(More)
Huge quantities of on-line medical texts such as Medline are available, and we w ould hope to extract useful information from these resources, as much as possible, hopefully in an automatic way, with the aid of computer technologies. Especially, recent advances in Natural Language Processing (NLP) techniques raise new challenges and opportunities for(More)
The tagging of Named Entities, the names of particular things or classes, is regarded as an important component technology for many NLP applications. The first Named Entity set had 7 types, organization, location, person, date, time, money and percent expressions. Later, in the IREX project artifact was added and ACE added two, GPE and facility, to pursue(More)
We have developed a sentence extraction system that estimates the significance of sentences by integrating four scoring functions that use as evidence sentence location, sentence length, TF/IDF values of words, and similarity to the title. Similarity to a given query is also added to the system in the summarization task for information retrieval. Parameters(More)