Comprehensive autoregressive modeling for classification of genomic sequences

Abstract

In this paper, we propose the novel use of an autoregressive (AR) model to produce a multi-dimensional feature for distinguishing between protein coding and non-coding regions of genomic sequences at the nucleotide level. In contrast to previous research, in which AR models were used to estimate a single frequency, here the AR model parameters characterizing the entire short-term spectrum of the sequence are employed as a feature, in conjunction with Gaussian mixture model-based classification. The optimized AR-based features are then combined with other signal-processing-based time-domain and frequency-domain features to improve detection accuracy for the coding/non-coding region classification problem. The system described herein is shown to produce identification accuracies of more than 78.9% and 81.6% for protein coding and non-coding nucleotides, respectively, when evaluated on the GENSCAN test set.
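As a rough illustration of what a short-term AR feature can look like, the sketch below maps a DNA string to a binary indicator sequence for one base, slides a window over it, and solves the Yule-Walker equations with the Levinson-Durbin recursion to get one coefficient vector per window. The indicator mapping, the window length, the step, the model order, and all function names are assumptions chosen for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
def binary_indicator(seq, base):
    """Map a DNA string to a 0/1 indicator sequence for one base (assumed mapping)."""
    return [1.0 if ch == base else 0.0 for ch in seq.upper()]


def autocorrelation(x, max_lag):
    """Biased autocorrelation estimates r[0..max_lag] of a mean-removed window."""
    n = len(x)
    mean = sum(x) / n
    xc = [v - mean for v in x]
    return [sum(xc[i] * xc[i + k] for i in range(n - k)) / n
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for an AR model of the given order.

    Returns (a, err) with a[0] = 1 and the convention
    x[n] + a[1]*x[n-1] + ... + a[order]*x[n-order] = e[n],
    where err is the final prediction-error power.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate (e.g. constant) window: keep remaining coefficients at zero
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        refl = -acc / err  # reflection coefficient for stage m
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + refl * prev[m - k]
        a[m] = refl
        err *= 1.0 - refl * refl
    return a, err


def ar_features(seq, base="G", window=60, step=3, order=4):
    """One AR coefficient vector per sliding window (all defaults are illustrative)."""
    x = binary_indicator(seq, base)
    feats = []
    for start in range(0, len(x) - window + 1, step):
        r = autocorrelation(x[start:start + window], order)
        a, _ = levinson_durbin(r, order)
        feats.append(a[1:])  # drop the leading 1; keep a[1..order] as the feature
    return feats
```

Each returned vector could then serve as one multi-dimensional observation for a classifier such as the Gaussian mixture models mentioned above, with one model trained per class.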

Cite this paper

@article{Akhtar2007ComprehensiveAM, title={Comprehensive autoregressive modeling for classification of genomic sequences}, author={Munaim Akhtar and Eliathamby Ambikairajah and Joanne Epps}, journal={2007 6th International Conference on Information, Communications & Signal Processing}, year={2007}, pages={1-5} }