Faster exact Markovian probability functions for motif occurrences: a DFA-only approach

@article{Ribeca2008FasterEM,
  title={Faster exact Markovian probability functions for motif occurrences: a DFA-only approach},
  author={Paolo Ribeca and Emanuele Raineri},
  journal={Bioinformatics},
  year={2008},
  volume={24},
  number={24},
  pages={2839--2848}
}
BACKGROUND: The computation of the statistical properties of motif occurrences has an obviously relevant application: patterns that are significantly over- or under-represented in genomes or proteins are interesting candidates for biological roles. However, the problem is computationally hard; as a result, virtually all the existing motif finders use fast but approximate scoring functions, in spite of the fact that they have been shown to produce systematically incorrect results. A few…

Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data
TLDR
Three innovative algorithms that use optimal Markov chain embedding based on deterministic finite automata are introduced; they prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity.
Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models
TLDR
A novel algorithm SufPref is presented that computes an exact P-value for Hidden Markov models (HMM) based on recursive equations on text sets related to pattern occurrences and inductively traverses a specific data structure, an overlap graph.
Faster exact distributions of pattern statistics through sequential elimination of states
TLDR
This work develops a method to obtain a small set of states during the state generation process without forming a DFA, and shows that a huge reduction in the size of the AMC can be attained.
Computation of exact probabilities associated with overlapping pattern occurrences
  • D. Martin
  • Computer Science
    WIREs Computational Statistics
  • 2019
TLDR
An overview is given of the main methods used to compute distributions of statistics of overlapping pattern occurrences, specifically generating functions, correlation functions, the Goulden‐Jackson cluster method, recursive equations, and Markov chain embedding.
Sparse approaches for the exact distribution of patterns in long state sequences generated by a Markov source
The Power of Detecting Enriched Patterns: An HMM Approach
TLDR
The issue of how the known presence of random instances of a known motif affects the power of detecting patterns, such as patterns within the motif, is addressed.
Regmex, Motif analysis in ranked lists of sequences
TLDR
A motif analysis tool, Regmex (REGular expression Motif EXplorer), which offers several methods to identify overrepresented motifs in a ranked list of sequences and is well suited for a range of biological sequence analysis problems related to motif discovery.
Moments of the Count of a Regular Expression in a Heterogeneous Random Sequence
  • G. Nuel
  • Mathematics, Computer Science
    Methodology and Computing in Applied Probability
  • 2019
TLDR
This work focuses on the distribution of the random count N of a regular expression in a multi-state random sequence generated by a heterogeneous Markov source and provides explicit recursions for computing the mgf/pgf of N under the evidence constraint.
Regmex: a statistical tool for exploring motifs in ranked sequence lists from genomics experiments
TLDR
Regmex (REGular expression Motif EXplorer), which offers several methods to identify overrepresented motifs in ranked lists of sequences, is presented and compared with an existing motif discovery tool and shows increased sensitivity.
Significance Score of Motifs in Biological Sequences
TLDR
In statistical terms, this is equivalent to computing the p-value of observation n with respect to a relevant reference model if X1:l = X1 .

References

Statistical tests to compare motif count exceptionalities
TLDR
Two statistical tests are developed and analyzed, an exact binomial one and an asymptotic likelihood ratio test, to decide whether the exceptionality of a given motif is equivalent or significantly different in two sequences of interest.
Computing exact P-values for DNA motifs
TLDR
The problem is shown to be NP-hard, and MotifRank, software based on dynamic programming, is presented to calculate exact P-values of motifs, which are defined on a general and more precise model.
Proteome analysis based on motif statistics
TLDR
It is demonstrated that statistical over- or under-representation of motifs in complete proteomes may be an indicator of whether, in that organism, the authors are looking at chance occurrences of the motif or whether the occurrences are sufficiently numerous to suggest a systematic, and thus functionally important occurrence.
Effective p-value computations using Finite Markov Chain Imbedding (FMCI): application to local score and to pattern statistics
  • G. Nuel
  • Computer Science
    Algorithms for Molecular Biology
  • 2006
TLDR
This work proposes general recursive algorithms that compute, in a numerically stable manner, the exact cumulative distribution function (CDF) or complementary CDF (CCDF).
Pattern Markov chains: optimal Markov chain embedding through deterministic finite automata
  • G. Nuel
  • Mathematics, Computer Science
  • 2008
TLDR
The theory of language and automata is used to provide space-optimal Markov chain embedding using the new notion of pattern Markov chains (PMCs), and explicit constructive algorithms are given to build the PMC associated to any given pattern problem.
Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymer
The algorithm described in this paper discovers one or more motifs in a collection of DNA or protein sequences by using the technique of expectation maximization to fit a two-component finite mixture
DNA, Words and Models, Statistics of Exceptional Words
Hidden Markov models in computational biology. Applications to protein modeling.
TLDR
The results suggest the presence of an EF-hand calcium binding motif in a highly conserved and evolutionarily preserved putative intracellular region of 155 residues in the alpha-1 subunit of L-type calcium channels, which play an important role in excitation-contraction coupling.
Numerical Solutions for Patterns Statistics on Markov Chains
  • G. Nuel
  • Computer Science, Mathematics
    Statistical applications in genetics and molecular biology
  • 2006
TLDR
The methods available to compute pattern statistics on text generated by a Markov source are reviewed in terms of computational time and reliability, using the most complete pattern statistics benchmark available at the present time.
Identification of DNA Motifs Implicated in Maintenance of Bacterial Core Genomes by Predictive Modeling
TLDR
This work examined the distribution of the “crossover hotspot instigator,” or Chi, in Escherichia coli, and found that its exceptional distribution is restricted to the core genome common to three strains, and formulated a set of criteria that were incorporated in a statistical model to search core genomes for motifs potentially involved in genome stability in other species.