Using a VOM model for reconstructing potential coding regions in EST sequences

Authors: Armin Shmilovici and Irad Ben-Gal
Journal: Computational Statistics
This paper presents a method for annotating coding and noncoding DNA regions by using variable order Markov (VOM) models. A main advantage in using VOM models is that their order may vary for different sequences, depending on the sequences’ statistics. As a result, VOM models are more flexible with respect to model parameterization and can be trained on relatively short sequences and on low-quality datasets, such as expressed sequence tags (ESTs). The paper presents a modified VOM model for…
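The idea that the model order varies with the sequence statistics can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a simple count-based back-off, where a context is used only if it was seen often enough in training, and shorter contexts are used otherwise. The class name `VOMSketch`, the parameters `max_order` and `min_count`, and the add-one smoothing are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


class VOMSketch:
    """Toy variable order Markov model over DNA (illustrative, not the paper's algorithm)."""

    def __init__(self, max_order=3, min_count=2, alphabet="ACGT"):
        self.max_order = max_order    # hypothetical cap on context length
        self.min_count = min_count    # hypothetical support threshold for a context
        self.alphabet = alphabet
        # counts[context][symbol]: how often `symbol` followed `context` in training
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        for seq in sequences:
            for i, sym in enumerate(seq):
                # Record the symbol under every preceding context of length 0..max_order.
                for k in range(min(i, self.max_order) + 1):
                    self.counts[seq[i - k:i]][sym] += 1
        return self

    def context_for(self, history):
        """Return the longest suffix of `history` seen at least `min_count` times."""
        for k in range(min(len(history), self.max_order), 0, -1):
            ctx = history[-k:]
            table = self.counts.get(ctx)
            if table is not None and sum(table.values()) >= self.min_count:
                return ctx
        return ""  # fall back to the order-0 (context-free) distribution

    def prob(self, sym, history):
        """P(sym | history) with add-one smoothing over the alphabet."""
        table = self.counts.get(self.context_for(history), {})
        total = sum(table.values())
        return (table.get(sym, 0) + 1) / (total + len(self.alphabet))

    def log_likelihood(self, seq):
        """Sum of log P(symbol | preceding symbols) over the whole sequence."""
        return sum(math.log(self.prob(s, seq[:i])) for i, s in enumerate(seq))


model = VOMSketch(max_order=2).fit(["ACGACGACGACG"])
print(model.context_for("AC"))  # a well-supported order-2 context is kept
print(model.context_for("TT"))  # an unseen context backs off to order 0
print(model.prob("G", "AC"))
```

Under these assumptions, a coding-versus-noncoding decision could be sketched by training one such model per class and comparing `log_likelihood` scores for a candidate region. Because rarely seen contexts fall back to shorter ones, the number of parameters that must be estimated reliably shrinks with the training data, which is the property the abstract points to for short or low-quality sequences.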
Gene-finding with the VOM model
Experiments with the proposed gene-finder (GF) on three prokaryotic genomes indicate its potential advantage in the detection of short genes.
Single Species Gene Finding
This chapter covers five of the most commonly used mathematical models that serve as the main algorithms in single species gene finding: hidden Markov models, generalized hidden Markov models, interpolated Markov models, neural networks, and decision trees.
MicroRNA Prediction Using a Fixed-Order Markov Model Based on the Secondary Structure Pattern
A new generation of miRNA prediction algorithm is provided, which successfully realizes a full-function recognition of the mature miRNAs directly from the hairpin sequences and presents a new understanding of the biological recognition based on the strongest signal’s location detected by FOMmiR.
Classical and Quantum Algorithms for Constructing Text from Dictionary Problem
The classical algorithm is optimal up to a log factor, and the quantum algorithm shows a speed-up compared to any classical algorithm when the strings in the dictionary have non-constant length.
Representing higher-order dependencies in networks
The higher-order network (HON) representation is proposed, with advantages including accuracy, scalability, and direct compatibility with the existing suite of network analysis methods, and it is illustrated how HON can be applied to a broad variety of tasks, such as random walking, clustering, and ranking.
Distributions of pattern statistics in sparse Markov models
Markov models provide a good approximation to probabilities associated with many categorical time series, and thus they are applied extensively. However, a major drawback associated with them is that…
High-Order Entropy-Based Population Diversity Measures in the Traveling Salesman Problem
  • Y. Nagata
  • Computer Science, Medicine
  • Evolutionary Computation
  • 2020
Three types of population diversity measures that address high-order dependencies between the variables are proposed, in order to investigate the effectiveness of considering such dependencies.
A boosting method with asymmetric mislabeling probabilities which depend on covariates
  • K. Hayashi
  • Mathematics, Computer Science
  • Comput. Stat.
  • 2012
A new boosting method for a kind of noisy data is developed, where the probability of mislabeling depends on the label of a case. The mechanism of the model is based on a simple idea and gives…
Representing Big Data as Networks: New Methods and Insights
  • Jian Xu
  • Computer Science, Physics
  • ArXiv
  • 2017
This dissertation proposes the higher-order network, which is a critical piece for representing higher-order interaction data; it introduces a scalable algorithm for building the network and visualization tools for interactive exploration, and presents broad applications of the higher-order network in the real world.
Measuring the Efficiency of the Intraday Forex Market with a Universal Data Compression Algorithm
Universal compression algorithms can detect recurring patterns in any type of temporal data, including financial data, for the purpose of compression. The universal algorithms actually find a model of…


ESTScan: A Program for Detecting, Evaluating, and Reconstructing Potential Coding Regions in EST Sequences
It is shown that ESTScan can detect and extract coding regions from low-quality sequences with high selectivity and sensitivity, and is able to accurately correct frameshift errors.
Modeling sequencing errors by combining Hidden Markov models
This research improves the detection of translation start and stop sites by integrating a more complex mRNA model with codon usage bias based error correction into one hidden Markov model (HMM), thus generalizing this error correction approach to more complex HMMs.
A VOM based gene-finder that specializes in short genes
The proposed VOM gene-finder outperforms traditional gene-finders that are based on fifth-order Markov models for short newly sequenced bacterial genomes.
Interpolated Markov chains for eukaryotic promoter recognition
A new content-based approach for the detection of promoter regions of eukaryotic protein encoding genes is described, based on three interpolated Markov chains of different order which are trained on coding, non-coding and promoter sequences.
ExonHunter: a comprehensive approach to gene finding
ExonHunter is a new and comprehensive gene finding system that outperforms existing systems, features several new ideas and approaches, and gives a new method for modeling the length distribution of intergenic regions in hidden Markov models.
DIANA-EST: a statistical analysis
The goal of this work is the development of a new program called DNA Intelligent Analysis for ESTs (DIANA-EST) based on a combination of Artificial Neural Networks and statistics for the characterization of the coding regions within ESTs and the reconstruction of the encoded protein.
Finding borders between coding and noncoding DNA regions by an entropic segmentation method.
It is found that this method is highly accurate in finding borders between coding and noncoding regions and requires no "prior training" on known data sets.
Variations on probabilistic suffix trees: statistical modeling and prediction of protein families
Exhaustive evaluations show that the PST model detects many more related sequences than pairwise methods such as Gapped-BLAST, and is almost as sensitive as a hidden Markov model that is trained from a multiple alignment of the input sequences, while being much faster.
Assessment of protein coding measures.
This paper reviews and synthesizes the underlying coding measures from published algorithms and concludes that a very simple and obvious measure--counting oligomers--is more effective than any of the more sophisticated measures.
EasyGene – a prokaryotic gene finder that ranks ORFs by statistical significance
A new automated gene-finding method, EasyGene, is presented, which estimates the statistical significance of a predicted gene based on a hidden Markov model (HMM) that is automatically estimated for a new genome.