Wnt pathway curation using automated natural language processing: combining statistical methods with partial and full parse for knowledge extraction

  author={Carlos Santos and Daniela Eggle and David J. States},
  volume={21 8},
MOTIVATION Wnt signaling is a very active area of research with highly relevant publications appearing at a rate of more than one per day. Building and maintaining databases describing signal transduction networks is a time-consuming and demanding task that requires careful literature analysis and extensive domain-specific knowledge. For instance, more than 50 factors involved in Wnt signal transduction have been identified as of late 2003. In this work we describe a natural language processing… 

Automatic pathway building in biological association networks

The automatically curated MedScan data is adequate for automatic generation of good-quality signaling networks, and the algorithm for the reconstruction of signaling pathways is described and validated by comparison with manually curated pathways and tissue-specific gene expression profiles.

A text-mining system for extracting metabolic reactions from full-text articles

It is concluded that automated metabolic pathway construction is more tractable than has often been assumed, and that relatively simple text-mining approaches can prove surprisingly effective.

Knowledge derivation and data mining strategies for probabilistic functional integrated networks

One of the most powerful integration techniques, probabilistic functional integrated networks (PFINs), is extended to incorporate a concept of biological relevance, which significantly reduces the hypothesis space for experimental validation of genes hypothesised to be involved in the oxidative stress response.

Machine Learning Techniques for Establishing the Provenance of Biological Interactions in MEDLINE papers

  • Computer Science
  • 2005
It is shown that a number of machine learning algorithms can be used to directly establish sentence-level support for given entity-entity interactions in biological databases, with a focus on finding support for particular interaction entries in database assertions about protein-protein interactions.

BSQA: integrated text mining using entity relation semantics extracted from biological literature of insects

The BeeSpace question/answering (BSQA) system, which performs integrated text mining for insect biology and covers diverse aspects from molecular interactions of genes to insect behavior, is presented.

The fully automated construction of metabolic pathways using text mining and knowledge-based constraints

The development of the Literature Metabolic Pathway Extraction Tool (LiMPET), a text-mining tool designed for the automated extraction of metabolic pathways from article abstracts and full-text open-access articles, is described.

Reconstruction of Protein-Protein Interaction Pathways by Mining Subject-Verb-Objects Intermediates

This study constructs Muscorian using MontyLingua, a generic text processor, following a previously proposed two-layered generalization-specialization paradigm in which text is generically processed into a suitable intermediate format before domain-specific data extraction techniques are applied at the specialization layer.

Machine Learning Techniques for Establishing the Provenance of Biological Interactions

A number of machine learning algorithms can be used to directly establish sentence-level support for given entity-entity interactions in biological databases, with a specific focus on finding support for specific interaction entries in database assertions about protein-protein interactions.

Automated extraction of precise protein expression patterns in lymphoma by text mining abstracts of immunohistochemical studies

This work proposes establishing a database linking quantitative protein expression levels with specific tumor classifications through NLP, and takes advantage of typical forms of representing experimental findings in terms of percentages of protein expression manifest by the tumor population under study.

New challenges for text mining: mapping between text and manually curated pathways

New resources are constructed to link the text with a model pathway, and their detailed analysis is presented, addressing the untapped resource, ‘bio-inference,’ as well as the differences between text and pathway representation.



Extracting human protein interactions from MEDLINE using a full-sentence parser

MedScan, a completely automated natural language processing-based information extraction system, is presented and used to extract 2976 interactions between human proteins from MEDLINE abstracts dated after 1988; the results suggest that MEDLINE is a unique source of diverse protein function information, which can be extracted in a completely automated way with reasonably high precision.

Extraction of protein interaction information from unstructured text using a context-free grammar

This work describes a system for extracting protein, gene, and small molecule (PGSM) interactions from unstructured text using a lexical analyzer and context-free grammar, and demonstrates that efficient parsers can be constructed for extracting these relationships from natural language with high rates of recall and precision.

Mining literature for protein-protein interactions

It is shown that the frequencies of words in Medline abstracts can be used to determine whether or not a given paper discusses protein-protein interactions, and the relevant information can be captured for the Database of Interacting Proteins.

Detecting Gene Relations from MEDLINE Abstracts

The relative computational simplicity of the proposed method makes it possible to process and analyze large volumes of data in a short time and significantly contributes to and enhances a user's ability to discover such embedded information.

Automatic Extraction of Biological Information from Scientific Text: Protein-Protein Interactions

The basic design of a system for automatic detection of protein-protein interactions extracted from scientific abstracts is described and the feasibility of developing a fully automated system able to describe networks of protein interactions with sufficient accuracy is demonstrated.

Kinase pathway database: an integrated protein-kinase and NLP-based protein-interaction resource.

The Kinase Pathway Database, an integrated database involving major completely sequenced eukaryotes, is developed, which contains the classification of protein kinases and their functional conservation, ortholog tables among species, protein-protein, protein-gene, and protein-compound interaction data, domain information, and structural information.

Using text analysis to identify functionally coherent gene groups.

A method, neighbor divergence, for assessing whether the genes within a group share a common biological function based on their associated scientific literature is presented and achieves 79% sensitivity at 100% specificity, comparing favorably to other tested methods.

Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts.

  • B. Stapley, G. Benoît
  • Computer Science
  • Pacific Symposium on Biocomputing
  • 2000
A prototype system for retrieving and visualizing information from literature and genomic databases using gene names is presented; it is a tool for efficiently exploring the biomedical information landscape and may act as an inference network.

Automatic Annotation for Biological Sequences by Extraction of Keywords from MEDLINE Abstracts: Development of a Prototype System

A prototype for the automatic annotation of functional characteristics in protein families, able to extract biological information directly from scientific literature in the form of MEDLINE abstracts, is developed.

TEXTQUEST: Document Clustering of MEDLINE Abstracts For Concept Discovery In Molecular Biology

An algorithm for large-scale document clustering of biological text, obtained from Medline abstracts, based on statistical treatment of terms, stemming, the idea of a 'go-list', unsupervised machine learning and graph layout optimization is presented.