Distinctive sequence features in protein coding, genic non-coding, and intergenic human DNA.

Roderic Guigó and James W. Fickett. Journal of Molecular Biology, volume 253, issue 1.
We have studied the behavior of a number of sequence statistics, mostly indicative of protein coding function, in a large set of human clone sequences randomly selected in the course of genome mapping (randomly selected clone sequences), and compared this with the behavior in known sequences containing genes (which we term genic sequences). As expected, given the higher coding density of the genic sequences, the sequence statistics studied behave in a substantially different manner in the… 


Detection of Protein Coding Sequences Using a Mixture Model for Local Protein Amino Acid Sequence

The mixture model derived scores for a collection of human exons were not correlated with the GENSCAN scores, suggesting that the addition of the protein pattern recognition module to current gene recognition programs may improve their performance.

Recognizing shorter coding regions of human genes based on the statistics of stop codons.

A new algorithm for the recognition of shorter coding sequences of human genes is developed and it is found that the average accuracy achieved is as high as 92.1% for sequences with length of 192 base pairs, which is confirmed by sixfold cross-validation tests.
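The summary above does not spell out the algorithm, so the following is only a minimal sketch of the kind of stop-codon statistic such a method could start from: the number of stop codons in each of the three forward reading frames of a sequence. The function name `stop_codon_counts` and the example sequence are illustrative assumptions, not taken from the cited paper.

```python
# Hypothetical helper: counts stop codons per forward reading frame.
# A frame with no in-frame stops is a candidate coding frame.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def stop_codon_counts(seq):
    """Return [n0, n1, n2]: stop-codon counts in reading frames 0, 1 and 2."""
    seq = seq.upper()
    counts = []
    for frame in range(3):
        # Walk the sequence in non-overlapping triplets starting at `frame`.
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        counts.append(sum(codon in STOP_CODONS for codon in codons))
    return counts

example_counts = stop_codon_counts("ATGAAATAGCC")
print(example_counts)
```

A real classifier would combine such counts with further statistics and a decision rule; this sketch stops at the raw per-frame counts.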

A relationship between GC content and coding-sequence length

The analysis of DNA sequences from several genome databases stratified according to GC content reveals that coding-sequence length is a function of GC content: the longest coding sequences (exons in vertebrates and genes in prokaryotes) are GC-rich, while the shortest ones are GC-poor.

Topological Pressure and Coding Sequence Density Estimation in the Human Genome

Topological pressure is a flexible tool and is expected to be useful for investigating many other features of DNA sequences, such as interspecies comparison of codon usage bias. A first result in this direction is given by investigating CDS density in the mouse genome and comparing the results with those for the human genome.


A quality of coding-sequence prediction at the nucleotide level comparable to that of specialized gene-finding programs is obtained.

An assessment of gene prediction accuracy in large DNA sequences.

Though gene prediction will improve with every new protein that is discovered and through improvements in the current set of tools, there is a long way to go before the authors can decipher the precise exonic structure of every gene in the human genome using purely computational methodology.

Predicting Untranslated Regions and Code Sections in DNA using Hidden Markov Models

The goal is to find and develop a way to determine a likelihood value, based on which the joining sections of these three regions can be identified in any DNA sequence, using a hidden Markov model.



Estimation of protein coding density in a corpus of DNA sequence data.

This work presents a method to estimate the protein coding density in a corpus of DNA sequence data, in which a 'coding statistic' is calculated for a large number of windows of the sequence under study, and the distribution of the statistic is decomposed into two normal distributions.
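The decomposition step described above can be sketched as fitting a two-component normal mixture to the window scores. The code below uses plain expectation-maximization (EM), which is an assumption: the summary does not say how the cited work performs the decomposition. The synthetic scores, the function names, and the choice to treat the higher-mean component as "coding" are also illustrative assumptions.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fit_two_normals(values, iterations=200):
    """Fit a two-component 1-D normal mixture by EM.

    Returns [(weight, mean, sd), (weight, mean, sd)] for the two components.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    # Initialise the means from the lower and upper halves of the data.
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    overall_mean = sum(xs) / n
    sd = math.sqrt(sum((x - overall_mean) ** 2 for x in xs) / n) or 1.0
    sigma = [sd, sd]
    w = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of component 1 for each window score.
        resp = []
        for x in xs:
            p0 = w[0] * normal_pdf(x, mu[0], sigma[0])
            p1 = w[1] * normal_pdf(x, mu[1], sigma[1])
            total = p0 + p1
            resp.append(p1 / total if total > 0 else 0.5)
        # M-step: re-estimate weight, mean and sd of each component.
        for k in (0, 1):
            r = [ri if k == 1 else 1.0 - ri for ri in resp]
            nk = sum(r)
            if nk < 1e-9:
                continue
            w[k] = nk / n
            mu[k] = sum(ri * x for ri, x in zip(r, xs)) / nk
            var = sum(ri * (x - mu[k]) ** 2 for ri, x in zip(r, xs)) / nk
            sigma[k] = max(math.sqrt(var), 1e-6)
    return [(w[0], mu[0], sigma[0]), (w[1], mu[1], sigma[1])]

# Synthetic window scores (assumption): 70% "non-coding" around 0, 30% "coding" around 5.
random.seed(0)
scores = [random.gauss(0.0, 1.0) for _ in range(700)] + \
         [random.gauss(5.0, 1.0) for _ in range(300)]
components = fit_two_normals(scores)
# Treat the component with the higher mean as coding; its weight estimates coding density.
coding_weight, coding_mean, coding_sd = max(components, key=lambda c: c[1])
print(round(coding_weight, 2))
```

Under these assumptions, the mixing weight of the higher-mean component plays the role of the estimated coding density of the corpus.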

Identification of protein coding regions in genomic DNA.

A computer program, GeneParser, is developed that identifies and determines the fine structure of protein genes in genomic DNA sequences and can rapidly generate ranked suboptimal solutions, each of which is the optimum solution containing a given intron-exon junction.

Assessment of protein coding measures.

This paper reviews and synthesizes the underlying coding measures from published algorithms and concludes that a very simple and obvious measure, counting oligomers, is more effective than any of the more sophisticated measures.
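As a rough illustration of what "counting oligomers" means, the sketch below tallies overlapping k-mers in a DNA sequence. The function name `count_oligomers`, the default `k=6`, and the example sequence are assumptions chosen for illustration; the summary does not specify the oligomer length or how the counts are turned into a coding score.

```python
from collections import Counter

def count_oligomers(seq, k=6):
    """Count overlapping oligomers (k-mers) of length k in a DNA sequence."""
    seq = seq.upper()
    # Slide a window of width k across the sequence one base at a time.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

trimer_counts = count_oligomers("ATGGCGATGGCG", k=3)
print(trimer_counts.most_common(2))
```

In practice such counts would be compared between known coding and non-coding training sequences, for instance as a ratio of frequencies; that comparison is left out here.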

A sequence-tagged site map of human chromosome 11.

We report the construction of 370 sequence-tagged sites (STSs) that are detectable by PCR amplification under sets of standardized conditions and that have been regionally mapped to human chromosome

Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks.

Dynamic programming (DP) is applied to the problem of precisely identifying internal exons and introns in genomic DNA sequences. The program GeneParser employs the DP algorithm to enforce the constraints that introns and exons must be adjacent and non-overlapping, and finds the highest-scoring combination of introns and exons subject to these constraints.

Prediction of gene structure.

Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach.

  • E. Uberbacher, R. Mural
  • Biology, Computer Science
    Proceedings of the National Academy of Sciences of the United States of America
  • 1991
This work describes a reliable computational approach for locating protein-coding portions of genes in anonymous DNA sequence using a set of sensor algorithms and a neural network to localize the coding regions.

A survey of expressed genes in Caenorhabditis elegans

The result is the identification of about 1,200 of the estimated 15,000 genes of C. elegans, providing a more accurate estimate of the total number of genes in the organism than has hitherto been available.