Using a VOM model for reconstructing potential coding regions in EST sequences

Authors: Armin Shmilovici and Irad Ben-Gal
Journal: Computational Statistics
This paper presents a method for annotating coding and noncoding DNA regions by using variable order Markov (VOM) models. A main advantage in using VOM models is that their order may vary for different sequences, depending on the sequences’ statistics. As a result, VOM models are more flexible with respect to model parameterization and can be trained on relatively short sequences and on low-quality datasets, such as expressed sequence tags (ESTs). The paper presents a modified VOM model for… 
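The idea that a VOM model's order "may vary for different sequences" can be illustrated with a minimal sketch. This is not the modified VOM model the paper presents; it is a toy back-off scheme under stated assumptions: the class name `SimpleVOM`, the parameters `max_order` and `min_count`, and the add-one smoothing are all hypothetical choices made for illustration. For each position, the model conditions on the longest preceding context that was observed often enough in training, falling back to shorter contexts otherwise.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class SimpleVOM:
    """Toy variable order Markov model over DNA (illustrative only)."""

    def __init__(self, max_order=3, min_count=2):
        self.max_order = max_order  # longest context length considered
        self.min_count = min_count  # observations needed to trust a context
        # counts[context][symbol] = number of times symbol followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        for seq in sequences:
            for i, sym in enumerate(seq):
                # Record the symbol under every context of length 0..max_order.
                for k in range(min(i, self.max_order) + 1):
                    self.counts[seq[i - k:i]][sym] += 1
        return self

    def context_for(self, history):
        """Return the longest suffix of history seen at least min_count times."""
        for k in range(min(len(history), self.max_order), 0, -1):
            ctx = history[-k:]
            if ctx in self.counts and sum(self.counts[ctx].values()) >= self.min_count:
                return ctx
        return ""  # fall back to the order-0 (context-free) distribution

    def prob(self, history, sym):
        """Add-one smoothed probability of sym given the selected context."""
        table = self.counts.get(self.context_for(history), {})
        total = sum(table.values())
        return (table.get(sym, 0) + 1) / (total + len(ALPHABET))

    def log_likelihood(self, seq):
        return sum(math.log(self.prob(seq[:i], seq[i])) for i in range(len(seq)))


model = SimpleVOM(max_order=3, min_count=2).fit(["ATGATGATGATG"])
print(model.context_for("CAT"))  # uses a shorter context than the full history
print(model.log_likelihood("ATGATG") > model.log_likelihood("CCCCCC"))
```

In a coding/noncoding setting, one such model could be trained per class and a window assigned to whichever model gives the higher log-likelihood; that decision rule is likewise an assumption of this sketch rather than a description of the paper's procedure.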

Gene-finding with the VOM model

Experiments with the proposed gene-finder (GF) on three prokaryotic genomes indicate its potential advantage in the detection of short genes.

Single Species Gene Finding

This chapter covers five of the mathematical models most commonly used as main algorithms in single species gene finding: hidden Markov models, generalized hidden Markov models, interpolated Markov models, neural networks, and decision trees.

MicroRNA Prediction Using a Fixed-Order Markov Model Based on the Secondary Structure Pattern

A new-generation miRNA prediction algorithm, FOMmiR, is provided, which recognizes mature miRNAs directly from the hairpin sequences and offers a new understanding of biological recognition based on the location of the strongest signal it detects.

Classical and Quantum Algorithms for Constructing Text from Dictionary Problem

The classical algorithm is optimal up to a log factor, and the quantum algorithm shows a speed-up compared with any classical algorithm when the strings in the dictionary have non-constant length.

Representing higher-order dependencies in networks

The higher-order network (HON) representation is proposed, offering accuracy, scalability, and direct compatibility with the existing suite of network analysis methods; it is illustrated how HON can be applied to a broad variety of tasks, such as random walking, clustering, and ranking.

Distributions of pattern statistics in sparse Markov models

  • D. Martin
  • Computer Science, Mathematics
    Annals of the Institute of Statistical Mathematics
  • 2019
A method for efficient computation of pattern distributions through Markov chains with minimal state spaces is extended to the sparse Markov framework, which gives a better handling of the trade-off between bias associated with having too few model parameters and variance from having too many.

High-Order Entropy-Based Population Diversity Measures in the Traveling Salesman Problem

  • Y. Nagata
  • Computer Science
    Evolutionary Computation
  • 2020
Three types of population diversity measures that address high-order dependencies between the variables are proposed, in order to investigate the effectiveness of considering such dependencies.

A boosting method with asymmetric mislabeling probabilities which depend on covariates

A new boosting method for a kind of noisy data is developed, where the probability of mislabeling depends on the label of a case. The mechanism of the model is based on a simple idea and gives

Representing Big Data as Networks: New Methods and Insights

This dissertation proposes the higher-order network, which is a critical piece for representing higher-order interaction data; it introduces a scalable algorithm for building the network and visualization tools for interactive exploration, and presents broad applications of the higher-order network in the real world.

Measuring the Efficiency of the Intraday Forex Market with a Universal Data Compression Algorithm

A universal Variable Order Markov (VOM) model is presented and used to test the weak form of the Efficient Market Hypothesis; the Forex market turns out to be efficient, at least most of the time.



ESTScan: A Program for Detecting, Evaluating, and Reconstructing Potential Coding Regions in EST Sequences

It is shown that ESTScan can detect and extract coding regions from low-quality sequences with high selectivity and sensitivity, and is able to accurately correct frameshift errors.

Modeling sequencing errors by combining Hidden Markov models

This research improves the detection of translation start and stop sites by integrating a more complex mRNA model with codon usage bias based error correction into one hidden Markov model (HMM), thus generalizing this error correction approach to more complex HMMs.

A VOM based gene-finder that specializes in short genes

The proposed VOM gene-finder outperforms traditional gene-finders that are based on fifth-order Markov models for short newly sequenced bacterial genomes.

Interpolated markov chains for eukaryotic promoter recognition

A new content-based approach for the detection of promoter regions of eukaryotic protein encoding genes based on three interpolated Markov chains of different order which are trained on coding, non-coding and promoter sequences is described.

ExonHunter: a comprehensive approach to gene finding

ExonHunter is a new and comprehensive gene finding system that outperforms existing systems. It features several new ideas and approaches, including a new method for modeling the length distribution of intergenic regions in hidden Markov models.

DIANA-EST: a statistical analysis

The goal of this work is the development of a new program called DNA Intelligent Analysis for ESTs (DIANA-EST) based on a combination of Artificial Neural Networks and statistics for the characterization of the coding regions within ESTs and the reconstruction of the encoded protein.

Finding borders between coding and noncoding DNA regions by an entropic segmentation method.

It is found that this method is highly accurate in finding borders between coding and noncoding regions and requires no "prior training" on known data sets.

Variations on probabilistic suffix trees: statistical modeling and prediction of protein families

Exhaustive evaluations show that the PST model detects many more related sequences than pairwise methods such as Gapped-BLAST, and is almost as sensitive as a hidden Markov model that is trained from a multiple alignment of the input sequences, while being much faster.

Assessment of protein coding measures.

This paper reviews and synthesizes the underlying coding measures from published algorithms and concludes that a very simple and obvious measure--counting oligomers--is more effective than any of the more sophisticated measures.

EasyGene – a prokaryotic gene finder that ranks ORFs by statistical significance

A new automated gene-finding method, EasyGene, is presented, which estimates the statistical significance of a predicted gene based on a hidden Markov model (HMM) that is automatically estimated for a new genome.