Signal Processing in Sequence Analysis: Advances in Eukaryotic Gene Prediction
In this paper, we propose the novel use of an autoregressive (AR) model to produce a multi-dimensional feature for distinguishing between genomic protein coding and non-coding regions at the nucleotide level. In contrast to previous research, in which AR models were used to estimate a single frequency, here the AR model parameters characterizing the entire short-term sequence spectrum are employed as a feature in conjunction with Gaussian mixture model-based classification. The optimized AR-based features are then combined with other signal processing-based time-domain and frequency-domain features to improve detection accuracy for the coding/non-coding region classification problem. When evaluated on the GENSCAN test set, the system described herein is shown to produce identification accuracies of more than 78.9% and 81.6% for protein coding and non-coding nucleotides, respectively.
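As a rough illustration of the AR-based feature described above, the following sketch estimates AR coefficients over a sliding window of a numerically mapped DNA sequence, giving one multi-dimensional feature vector per window position. It is a minimal sketch under stated assumptions, not the authors' implementation: the nucleotide-to-number mapping `BASE_VALUES`, the Yule-Walker estimator, the model order, the window length, and the function names `ar_coefficients` and `ar_features` are all hypothetical choices made for illustration, since the abstract does not specify them.

```python
import numpy as np

# Hypothetical numeric mapping of nucleotides (an assumption; the abstract
# does not say how the sequence is converted to a signal).
BASE_VALUES = {"A": 1.5, "C": 0.5, "G": -0.5, "T": -1.5}


def ar_coefficients(x, order):
    """Yule-Walker estimate of the AR(order) coefficients of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Biased autocorrelation estimates r[0], ..., r[order].
    r = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(order + 1)])
    # Toeplitz autocorrelation matrix of the Yule-Walker equations R a = r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    # Least squares keeps the solve well defined for degenerate windows
    # (for example, a window made of a single repeated base).
    return np.linalg.lstsq(R, r[1:], rcond=None)[0]


def ar_features(seq, order=4, window=60, step=1):
    """Return one AR coefficient vector per sliding-window position."""
    # Unknown symbols (such as N) are mapped to 0.0, another assumption.
    values = np.array([BASE_VALUES.get(base, 0.0) for base in seq.upper()])
    starts = range(0, len(values) - window + 1, step)
    return np.vstack([ar_coefficients(values[s : s + window], order) for s in starts])
```

Each row of the returned array could then serve as the feature vector for the nucleotides covered by that window, for example as input to a Gaussian mixture model classifier with one mixture trained per class.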