Comparative n-gram analysis of whole-genome protein sequences

@article{Ganapathiraju2002ComparativeNA,
  title={Comparative n-gram analysis of whole-genome protein sequences},
  author={Madhavi K. Ganapathiraju and Deborah K. Weisser and Roni Rosenfeld and Jaime G. Carbonell and Ramana G. Reddy and Judith Klein-Seetharaman},
  journal={IEEE Personal Communications},
  year={2002}
}
A current barrier for successful rational drug design is the lack of understanding of the structure space provided by the proteins in a cell that is determined by their sequence space. The protein sequences capable of folding to functional three-dimensional shapes of the proteins are clearly different for different organisms, since sequences obtained from human proteins often fail to form correct three-dimensional structures in bacterial organisms. In analogy to the question "What kind of… 

N-gram analysis of 970 microbial organisms reveals presence of biological language models
TLDR
Whole proteome sequences of microbial organisms have been shown to contain particular n-gram sequences in abundance in one organism but occurring very rarely in other organisms, thereby serving as proteome signatures, and perplexity, a statistical measure of similarity of n-gram composition, can be used to predict evolutionary distance within a genus in the phylogenetic tree.
Evolutionary insights from suffix array-based genome sequence analysis
TLDR
An improvement to the design of construction of suffix arrays is reported, indicating enhancement in versatility and scalability, enabled by this approach, and the usefulness of identifying repeats in whole proteomes efficiently.
n-Gram characterization of genomic islands in bacterial genomes
Identifying the missing proteins in human proteome by biological language model
TLDR
The analysis shows that 102 proteins may be native gene-coding proteins and that some of the missing proteins are membrane or natively disordered proteins, which are hard to detect by experimental methods.
ProtVec: A Continuous Distributed Representation of Biological Sequences
TLDR
By only providing sequence data for various proteins into this model, information about protein structure can be determined with high accuracy, and this so-called embedding model needs to be trained only once and can be used to ascertain a diverse set of information regarding the proteins of interest.
Recruitment of rare 3-grams at functional sites: Is this a mechanism for increasing enzyme specificity?
TLDR
The results suggest that recruitment of rare 3-grams may be an efficient mechanism for increasing specificity at functional sites, and rareness/scarcity emerges as a feature that may assist in identifying key sites for protein function.
Comparison of phosphorylation patterns across eukaryotes by discriminative N-gram analysis
TLDR
The discriminative n-grams were able to classify organisms in their corresponding kingdom/phylum; they show different patterns among species of different kingdom/phylum, and these regions can contribute to evolutionary divergence as they are in disordered regions that can evolve rapidly.
Yule Value Tables from Protein Datasets
TLDR
In transmembrane helices, associations were more negative than in any other dataset studied, suggesting that evolution of these helices requires suppression of occurrence of specific amino acid combinations within local range.
Could n-gram analysis contribute to genomic island determination?

References

SHOWING 1-10 OF 29 REFERENCES
betawrap: Successful prediction of parallel β-helices from primary sequence reveals an association with many microbial pathogens
TLDR
A computational approach is presented that predicts the right-handed parallel β-helix supersecondary structural motif in primary amino acid sequences by using β-strand interactions learned from non-β-helix structures to generate interstrand pairwise correlations from a processive sequence wrap.
Statistical Properties of Open Reading Frames in Complete Genome Sequences
Lack of biological significance in the 'linguistic features' of noncoding DNA--a quantitative analysis.
TLDR
The presented results show quantitatively that the 'linguistic' tests failed to reveal any new biological information in (noncoding or coding) DNA.
Over- and under-representation of short oligonucleotides in DNA sequences.
Strand-symmetric relative abundance functionals for di-, tri-, and tetranucleotides are introduced and applied to sequences encompassing a broad phylogenetic range to discern tendencies and anomalies
Quantile distributions of amino acid usage in protein classes.
TLDR
A comparative study of the compositional properties of various protein sets from both cellular and viral organisms is presented and a quantitative criterion to assess amino acid compositional extremes relative to a reference protein set is proposed and applied.
Linguistic features of noncoding DNA sequences.
We extend the Zipf approach to analyzing linguistic texts to the statistical study of DNA base pair sequences and find that the noncoding regions are more similar to natural languages than the coding
Initial sequencing and analysis of the human genome
TLDR
The results of an international collaboration to produce and make freely available a draft sequence of the human genome are reported and an initial analysis is presented, describing some of the insights that can be gleaned from the sequence.
Is DNA a language?
TLDR
Analysis of many DNA sequences suggests that no linguistics connections to DNA exist and that even though it has structure DNA is not a language.
Noncoding DNA, Zipf's law, and language.