Statistical Properties of Short Subsequences in Microbial Genomes and Their Link to Pathogen Identification and Evolution


Numerous sequencing projects have produced partial and complete microbial genomes. The volume of data far exceeds what can be analyzed by hand and therefore requires computational methods. A significant amount of work has focused on how statistical characteristics vary along microbial genomic sequences, e.g. codon bias, G+C content, and the frequencies of short subsequences (n-mers). These studies led to two observations: (1) regions whose statistical properties (e.g. codon bias) differ from the rest of the genomic sequence correlate with evolutionarily significant regions, e.g. regions of horizontal gene transfer; and (2) because no two microbial genomes look statistically identical, statistical properties can be used to distinguish between genomic sequences. Recently, we conducted an extensive analysis of the presence/absence of n-mers in many microbial genomes as well as several viral and eukaryotic genomes. This analysis revealed that, in all genomes considered, the presence of n-mers can be treated as a nearly random and independent process in the range of n for which the condition M << 4^n holds, where M is the genome length. We therefore hypothesize that relatively small sets of randomly picked n-mers can be used to differentiate between microorganisms. We also analyzed the frequency of appearance of all 8- to 12-mers present in each of the 200+ publicly available microbial genomes. For nearly all of the genomes under consideration, we observed that some n-mers are present much more frequently than expected: from 50 to over a thousand copies. Upon closer inspection of these sequences, we found several cases in which an overrepresented n-mer is biased toward either coding or non-coding regions.
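As an illustration of the n-mer analysis described above, the following minimal Python sketch counts overlapping n-mers and lists those at or above a copy-number cutoff. It is not the authors' implementation. The function names, the `min_copies` cutoff, and the null model (a uniform random sequence with independent positions) are assumptions introduced here for illustration only.

```python
from collections import Counter

ALPHABET = frozenset("ACGT")


def count_nmers(seq, n):
    """Count every overlapping n-mer that consists only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - n + 1):
        nmer = seq[i:i + n]
        if ALPHABET.issuperset(nmer):
            counts[nmer] += 1
    return counts


def expected_count(genome_length, n):
    """Expected copies of one given n-mer under the assumed null model:
    a uniform random sequence with 4**n equally likely n-mers."""
    return (genome_length - n + 1) / 4 ** n


def expected_present_fraction(genome_length, n):
    """Fraction of all 4**n n-mers expected to occur at least once,
    treating positions as independent (an approximation).
    When genome_length << 4**n this is close to genome_length / 4**n."""
    positions = genome_length - n + 1
    return 1.0 - (1.0 - 1.0 / 4 ** n) ** positions


def overrepresented(seq, n, min_copies):
    """Return the n-mers that occur at least min_copies times (hypothetical cutoff)."""
    return {k: c for k, c in count_nmers(seq, n).items() if c >= min_copies}
```

For a real genome, one might compare each observed count against `expected_count(len(genome), n)` and inspect the n-mers returned by `overrepresented(genome, n, 50)`, the lower end of the copy numbers reported above.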
Although the evolutionary reason for the conservation of such sequences remains unclear, in some cases it is plausible that a clear bias toward non-coding regions reflects a role in the DNA uptake/recombination process, membership in insertion sequences, or function as transcription factor recognition sites. Our analysis of the frequency of appearance of 6-mers in each microbial genome revealed regions that display unusual statistical properties with respect to their own genome. After inspecting the genes contained within these regions, we believe that such regions are likely to have been acquired through horizontal gene transfer.
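The 6-mer comparison of a region against its own genome can be sketched as a sliding-window scan. This is a minimal sketch and not the method used in the paper. The L1 distance between frequency profiles, the window and step sizes, and the z-score cutoff are all assumptions chosen here for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def nmer_frequencies(seq, n=6):
    """Relative frequency of each overlapping n-mer in seq."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()} if total else {}


def l1_distance(p, q):
    """Sum of absolute frequency differences (an assumed dissimilarity measure)."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def atypical_windows(genome, n=6, window=5000, step=2500, z_cutoff=3.0):
    """Return (start, distance) for windows whose n-mer profile deviates
    from the whole-genome profile by at least z_cutoff standard deviations.
    All parameter defaults are hypothetical."""
    genome = genome.upper()
    background = nmer_frequencies(genome, n)
    scores = []
    for start in range(0, len(genome) - window + 1, step):
        profile = nmer_frequencies(genome[start:start + window], n)
        scores.append((start, l1_distance(profile, background)))
    if len(scores) < 2:
        return []
    values = [d for _, d in scores]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [(s, d) for s, d in scores if (d - mu) / sigma >= z_cutoff]
```

Windows returned by `atypical_windows` are only candidates. As in the text above, the genes inside each flagged region would still need to be inspected before suggesting horizontal gene transfer.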

Cite this paper

@inproceedings{Zhang2006StatisticalPO, title={Statistical Properties of Short Subsequences in Microbial Genomes and Their Link to Pathogen Identification and Evolution}, author={Meizhuo Zhang and Catherine Putonti and Sergei Chumakov and Adhish Gupta and George E. Fox and Dan Graur and Yuriy Fofanov}, year={2006} }