LINNAEUS: A species name identification system for biomedical literature

Abstract

The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles. In this paper we describe an open-source species name recognition and normalization software system, LINNAEUS, and evaluate its performance relative to several automatically generated biomedical corpora, as well as a novel corpus of full-text documents manually annotated for species mentions. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions. When compared against our manually annotated corpus, LINNAEUS achieves 94% recall and 97% precision at the mention level, and 98% recall and 90% precision at the document level. Our system successfully solves the problem of disambiguating uncertain species mentions, with 97% of all mentions in PubMed Central full-text documents resolved to unambiguous NCBI taxonomy identifiers. LINNAEUS is an open-source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can therefore be integrated into a range of bioinformatics and text-mining applications. The software and manually annotated corpus can be downloaded freely at http://linnaeus.sourceforge.net/.
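The dictionary-based matching mentioned in the abstract can be illustrated with a minimal sketch. This is not the LINNAEUS implementation: it assumes a character trie (a simple form of deterministic finite-state automaton) with longest-match lookup at word boundaries, and it omits the ambiguity-resolution heuristics entirely. The names `TrieNode`, `build_trie`, and `find_mentions`, as well as the tiny dictionary, are hypothetical and chosen only for illustration; the text is assumed to be plain ASCII so that lowercasing does not shift character offsets.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One automaton state; taxon_id is set if a dictionary term ends here."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.taxon_id: Optional[int] = None


def build_trie(dictionary: Dict[str, int]) -> TrieNode:
    """Compile a {term: taxonomy id} dictionary into a character trie."""
    root = TrieNode()
    for term, taxon_id in dictionary.items():
        node = root
        for ch in term.lower():
            node = node.children.setdefault(ch, TrieNode())
        node.taxon_id = taxon_id
    return root


def find_mentions(text: str, root: TrieNode) -> List[Tuple[int, int, int]]:
    """Return (start, end, taxon_id) for the longest dictionary match
    beginning at each word boundary; matching is case-insensitive."""
    lowered = text.lower()
    n = len(lowered)
    mentions: List[Tuple[int, int, int]] = []
    i = 0
    while i < n:
        # Only start a match at the beginning of a word.
        if i > 0 and lowered[i - 1].isalnum():
            i += 1
            continue
        node, j, best = root, i, None
        while j < n and lowered[j] in node.children:
            node = node.children[lowered[j]]
            j += 1
            # Accept only if the term also ends at a word boundary.
            if node.taxon_id is not None and (j == n or not lowered[j].isalnum()):
                best = (i, j, node.taxon_id)
        if best is not None:
            mentions.append(best)
            i = best[1]  # resume scanning after the matched span
        else:
            i += 1
    return mentions


# Hypothetical toy dictionary mapping names to NCBI taxonomy identifiers.
SPECIES = {
    "human": 9606,
    "homo sapiens": 9606,
    "mouse": 10090,
    "mus musculus": 10090,
}

trie = build_trie(SPECIES)
hits = find_mentions("Human and Mus musculus genes; humane treatment.", trie)
```

Here `hits` contains the spans for "Human" and "Mus musculus", while "humane" is skipped because the match does not end at a word boundary. A real system would also need to handle terms that map to several candidate species, which this sketch does not attempt.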

DOI: 10.1186/1471-2105-11-85

Semantic Scholar estimates that this publication has 311 citations based on the available data.

Cite this paper

@inproceedings{Gerner2009LINNAEUSAS, title={LINNAEUS: A species name identification system for biomedical literature}, author={Martin Gerner and Goran Nenadic and Casey M. Bergman}, booktitle={BMC Bioinformatics}, year={2009} }