Rapid similarity searches of nucleic acid and protein data banks.
  • W. Wilbur, D. Lipman
  • Biology, Computer Science
    Proceedings of the National Academy of Sciences…
  • 1 February 1983
An algorithm for the global comparison of sequences based on matching k-tuples of sequence elements for a fixed k results in substantial reduction in the time required to search a data bank when compared with prior techniques of similarity analysis, with minimal loss in sensitivity.
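The k-tuple matching idea summarized above can be illustrated with a minimal sketch. This is not the published algorithm: it only indexes the k-tuples of a query and counts how many exact k-tuple hits fall on each diagonal (target position minus query position) of a target sequence, and it leaves out any scoring or windowing the paper may use. The function names `ktuple_index` and `best_diagonal` and the default `k=2` are assumptions made for illustration.

```python
from collections import defaultdict


def ktuple_index(seq, k):
    """Map each k-tuple in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def best_diagonal(query, target, k=2):
    """Return (offset, hits) for the diagonal with the most k-tuple matches.

    A diagonal is identified by offset = target_position - query_position.
    Returns (None, 0) when the sequences share no k-tuple.
    """
    index = ktuple_index(query, k)
    diagonals = defaultdict(int)
    for j in range(len(target) - k + 1):
        # Look up the target k-tuple in the query index (no hit: empty tuple).
        for i in index.get(target[j:j + k], ()):
            diagonals[j - i] += 1
    if not diagonals:
        return None, 0
    offset = max(diagonals, key=diagonals.get)
    return offset, diagonals[offset]


if __name__ == "__main__":
    print(best_diagonal("ACGT", "TTACGTTT", k=2))
```

Because the query is indexed once and each target is scanned in a single pass, the cost per target grows with the number of k-tuple hits rather than with the full product of the two sequence lengths, which is the kind of saving the summary refers to.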
Overview of BioCreative II gene mention recognition
It is demonstrated that, by combining the results from all submissions, an F score of 0.9066 is feasible, and furthermore that the best result makes use of the lowest scoring submissions.
SplicePort—An interactive splice-site analysis tool
An interactive feature browsing and visualization tool that lets users make splice-site predictions for submitted sequences and browse the rich catalog of features underlying those predictions. It has been found capable of providing high classification accuracy on human splice sites.
BioC: a minimalist approach to interoperability for biomedical text processing
A simple extensible mark-up language (XML) format for sharing text documents and annotations is proposed. It can represent a large number of different annotations, including sentences, tokens, parts of speech, named entities such as genes or diseases, and relationships between named entities.
GENETAG: a tagged corpus for gene/protein named entity recognition
Annotating GENETAG required intricate manual judgments, which hindered tagging consistency. The data were pre-segmented into words to provide indices supporting comparison of system responses to the "gold standard"; however, character-based indices would have been more robust than word-based indices.
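The remark about character-based versus word-based indices can be made concrete with a small sketch. It is not taken from the GENETAG tooling: the two regular-expression tokenizers, the example sentence, and the helper `token_indices` are all hypothetical. It shows that a mention's character span stays the same however the text is segmented, while its word indices change with the tokenizer.

```python
import re

# Two hypothetical tokenizers, expressed as regular expressions.
WHITESPACE = r"\S+"        # split on whitespace only
FINE = r"\w+|[^\w\s]"      # split punctuation into separate tokens


def token_indices(text, pattern, start, end):
    """Return indices of the tokens that lie fully inside [start, end)."""
    return [
        n
        for n, m in enumerate(re.finditer(pattern, text))
        if m.start() >= start and m.end() <= end
    ]


text = "The IL-2 gene"
start, end = 4, 8  # character span of the mention "IL-2"

if __name__ == "__main__":
    print(text[start:end])
    print(token_indices(text, WHITESPACE, start, end))
    print(token_indices(text, FINE, start, end))
```

Under the whitespace tokenizer the mention is a single word, while the finer tokenizer splits it into three; the character span (4, 8) identifies the same text in both cases.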
PubMed related articles: a probabilistic topic-based model for content similarity
A probabilistic topic-based model for content similarity, called pmra, that underlies the related article search feature in PubMed is described, along with a novel technique for estimating parameters that does not require human relevance judgments.
MedPost: a part-of-speech tagger for bioMedical text
A part-of-speech tagger that achieves over 97% accuracy on MEDLINE citations is presented, together with a corpus of 5700 manually tagged sentences.
New directions in biomedical text annotation: definitions, guidelines and corpus construction
The results of an inquiry into properties of scientific text that are general enough to transcend the confines of a narrow subject area, while still supporting practical mining of text for factual information, are reported.
Tagging gene and protein names in biomedical text
This work proposes to approach the detection of gene and protein names in scientific abstracts as part-of-speech tagging, the most basic form of linguistic corpus annotation, and demonstrates that this method can be applied to large sets of MEDLINE abstracts, without the need for special conditions or human experts to predetermine relevant subsets.