A Decision Tree System for Finding Genes in DNA

Steven L. Salzberg, Arthur L. Delcher, Kenneth H. Fasman, John Henderson
Journal of Computational Biology, volume 5, issue 4

MORGAN is an integrated system for finding genes in vertebrate DNA sequences. MORGAN uses a variety of techniques to accomplish this task, the most distinctive of which is a decision tree classifier. The decision tree system is combined with new methods for identifying start codons, donor sites, and acceptor sites, and these are brought together in a frame-sensitive dynamic programming algorithm that finds the optimal segmentation of a DNA sequence into coding and noncoding regions (exons and… 
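
To make the idea of an optimal segmentation concrete, here is a minimal sketch of a two-state dynamic program. It is not MORGAN's algorithm: it ignores reading frame and splice-site signals, and the per-base `scores`, the `switch_penalty` parameter, and the function name `segment` are all assumptions introduced for illustration. It only shows how dynamic programming can pick the best-scoring labeling of a sequence into coding and noncoding runs.

```python
from typing import List, Tuple


def segment(scores: List[float], switch_penalty: float) -> List[Tuple[int, int, str]]:
    """Label each position coding/noncoding so the total score is maximal.

    Hypothetical scoring scheme: scores[i] > 0 favors "coding" at position i,
    a "noncoding" label contributes -scores[i], and every change of label
    costs switch_penalty. Returns half-open segments (start, end, label).
    """
    if not scores:
        return []
    labels = ("noncoding", "coding")
    # best[s] = best score of any labeling of the prefix that ends in state s
    best = [-scores[0], scores[0]]
    back = []  # back[i][s] = state at position i that leads to s at i + 1
    for x in scores[1:]:
        emit = (-x, x)
        new_best, choice = [0.0, 0.0], [0, 0]
        for s in (0, 1):
            stay = best[s]
            switch = best[1 - s] - switch_penalty
            if stay >= switch:
                new_best[s], choice[s] = stay + emit[s], s
            else:
                new_best[s], choice[s] = switch + emit[s], 1 - s
        best = new_best
        back.append(choice)
    # Trace back from the better final state to recover the labeling.
    state = 0 if best[0] >= best[1] else 1
    path = [state]
    for choice in reversed(back):
        state = choice[state]
        path.append(state)
    path.reverse()
    # Merge runs of identical labels into segments.
    segments = []
    start = 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[start]:
            segments.append((start, i, labels[path[start]]))
            start = i
    return segments


example = segment([-1, -1, -1, 2, 2, 2, 2, -1, -1], switch_penalty=1.5)
```

In this toy input the four positive scores in the middle outweigh the cost of two label changes, so the middle run is labeled coding. A frame-sensitive version would need additional states that track codon position.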

Homology-based gene prediction using neural nets.

GIN is able to recognize multiple genes within genomic DNA, as demonstrated by the identification of a globin gene (gamma-globin-1(G)) that has not been annotated as a coding region in the widely used test set of Burset and Guigo.

A neural network based multi-classifier system for gene identification in DNA sequences

It is proved that the same data set, when presented to neural networks in different forms, can give slightly varying results, and also that integrating the opinions of several classifiers on the same input data within a multi-classifier system can give better results than the individual neural networks.

DNA splice site detection: a comparison of specific and general methods

This work compares large margin classifiers (SVM and CMLS) and boosted decision trees with the three most common models used for splice site detection and finds that the newer methods compare favorably in all cases and can yield significant improvement in some cases.

Recognition of Translation Initiation Sites of Eukaryotic Genes Based on an EM Algorithm

The important characteristics of shorter flanking fragments around translation initiation sites (TISs) are extracted, and an expectation-maximization (EM) algorithm based on incomplete data is used to recognize TISs of eukaryotic genes. It is shown that the identification variables are effectively extracted and that the EM algorithm is a powerful tool for predicting TISs.

Recognizing shorter coding regions of human genes based on the statistics of stop codons.

A new algorithm for the recognition of shorter coding sequences of human genes is developed and it is found that the average accuracy achieved is as high as 92.1% for sequences with length of 192 base pairs, which is confirmed by sixfold cross-validation tests.
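
As a rough illustration of a stop-codon statistic, the sketch below counts stop codons in each of the three forward reading frames of a sequence. It is not the algorithm of the paper summarized above; the function name `stop_codon_counts` and the choice of a simple per-frame count are assumptions. The intuition is that a true coding frame contains no in-frame stop codon, while the other frames usually contain several.

```python
from typing import List

# The three stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_codon_counts(seq: str) -> List[int]:
    """Return the number of stop codons in forward frames 0, 1 and 2."""
    seq = seq.upper()
    counts = [0, 0, 0]
    for frame in range(3):
        # Step through complete codons only; a trailing partial codon is skipped.
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                counts[frame] += 1
    return counts


counts = stop_codon_counts("ATGTAAGGGTGA")
```

For this toy sequence, frame 0 reads ATG TAA GGG TGA and so holds two stop codons, while frames 1 and 2 hold none.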

MultiNNProm: A Multi-Classifier System for Finding Genes

A novel neural network based multi-classifier system, MultiNNProm, is presented for the identification of promoter regions in E. coli DNA sequences, and it is shown that combining several neural classifiers gives the system better accuracy than the individual networks.

Classifier Assessment and Feature Selection for Recognizing Short Coding Sequences of Human Genes

Various linear and kernel-based classification algorithms are assessed, and the best combination of Z-curve features is selected, to further improve the recognition of short exons in eukaryotes. By making good use of the interpretability of the PLS and Z-curve methods, a set of 93 Z-curve features is found to be the best combination.

An optimized approach for annotation of large eukaryotic genomic sequences using genetic algorithm

The proposed genetic-based method, named Gene Prediction with Genetic Algorithm (GPGA), reduces the computational complexity of the gene-finding problem by searching only one exon at a time instead of all exons along with its introns.

Interpolated Markov models for eukaryotic gene finding.

A new system, GlimmerM, was developed to find genes in the malaria parasite Plasmodium falciparum; laboratory tests on a small selection of predicted genes confirmed all the predictions.

Single Species Gene Finding

This chapter covers five of the most commonly used mathematical models that serve as the main algorithms in single species gene finding: hidden Markov models, generalized hidden Markov models, interpolated Markov models, neural networks, and decision trees.



Identification of protein coding regions in genomic DNA.

GeneParser is a computer program that identifies and determines the fine structure of protein genes in genomic DNA sequences. It can rapidly generate ranked suboptimal solutions, each of which is the optimum solution containing a given intron-exon junction.

Locating Protein Coding Regions in Human DNA Using a Decision Tree Algorithm

S. Salzberg. J. Comput. Biol., 1995.

The conclusion is that decision trees are a highly effective tool for identifying protein coding regions in DNA sequences ranging from 54 to 162 base pairs in length.

Prediction of gene structure.

Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks.

Dynamic programming (DP) is applied to the problem of precisely identifying internal exons and introns in genomic DNA sequences. The program GeneParser employs the DP algorithm to enforce the constraints that introns and exons must be adjacent and non-overlapping, and finds the highest scoring combination of introns and exons subject to these constraints.

Finding Genes in DNA with a Hidden Markov Model

A new Hidden Markov Model (HMM) system for segmenting uncharacterized genomic DNA sequences into exons, introns, and intergenic regions, called VEIL (Viterbi Exon-Intron Locator), obtains an overall accuracy on test data of 92% of total bases correctly labelled.

A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA

A Generalized Hidden Markov Model (GHMM) provides the framework for describing the grammar of a legal parse of a DNA sequence and provides simple solutions for integrating cardinality constraints, reading frame constraints, "indels", and homology searching.

Assessment of protein coding measures.

This paper reviews and synthesizes the underlying coding measures from published algorithms and concludes that a very simple and obvious measure--counting oligomers--is more effective than any of the more sophisticated measures.
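
Counting oligomers is simple enough to sketch in a few lines. The helper below is an illustration only and is not taken from the reviewed paper: the name `count_oligomers`, the default length `k=6`, and the `step` parameter are assumptions. It tallies every length-k substring of a sequence. With `step=3` it counts only in-frame oligomers starting at position 0.

```python
from collections import Counter


def count_oligomers(seq: str, k: int = 6, step: int = 1) -> Counter:
    """Count occurrences of each length-k substring (oligomer) of seq.

    step=1 counts all overlapping oligomers; step=3 samples one reading frame.
    Sequences shorter than k yield an empty Counter.
    """
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(0, len(seq) - k + 1, step))


all_trimers = count_oligomers("ATGATG", k=3)
in_frame_trimers = count_oligomers("ATGATG", k=3, step=3)
```

A coding measure built on these counts would typically compare oligomer frequencies observed in known coding and noncoding training sequences; that comparison is left out here.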

Gene recognition via spliced sequence alignment.

A spliced alignment algorithm and software tool that explores all possible exon assemblies in polynomial time and finds the multiexon structure with the best fit to a related protein.

Automated Gene Identification in Large-Scale Genomic Sequences

A computer program which can automatically parse the recognized exons into gene models that are most consistent with the available Expressed Sequence Tags (ESTs) and a set of biological heuristics, derived empirically.

Recognition of Genes in Human DNA Sequences

A new approach to computer-assisted gene recognition in higher eukaryote DNA is suggested. It allows one to use not only linear functions for scoring structures, but all functions satisfying natural