# Faster exact Markovian probability functions for motif occurrences: a DFA-only approach

@article{Ribeca2008FasterEM, title={Faster exact Markovian probability functions for motif occurrences: a DFA-only approach}, author={Paolo Ribeca and Emanuele Raineri}, journal={Bioinformatics}, year={2008}, volume={24}, number={24}, pages={2839--2848} }

BACKGROUND
The computation of the statistical properties of motif occurrences has an obviously relevant application: patterns that are significantly over- or under-represented in genomes or proteins are interesting candidates for biological roles. However, the problem is computationally hard; as a result, virtually all the existing motif finders use fast but approximate scoring functions, in spite of the fact that they have been shown to produce systematically incorrect results. A few…
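To make the idea of an exact, DFA-based computation concrete, here is a minimal sketch. It is not the authors' algorithm. It assumes a single-word motif, counts overlapping occurrences, and uses a first-order Markov model. The function names `build_dfa` and `count_distribution` and the parameters `init` and `trans` are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict

def build_dfa(pattern, alphabet):
    """KMP-style DFA: state q means the longest pattern prefix that is
    a suffix of the text read so far has length q."""
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for c in alphabet:
            s = pattern[:state] + c
            k = min(m, len(s))
            # Shrink k until pattern[:k] is a suffix of s (k == 0 always matches).
            while k > 0 and s[-k:] != pattern[:k]:
                k -= 1
            delta[state][c] = k
    return delta

def count_distribution(pattern, n, init, trans):
    """Exact distribution of the number of (overlapping) occurrences of
    `pattern` in a random text of length n >= 1.
    init[c]     : probability that the first letter is c (assumed input)
    trans[a][b] : probability of letter b following letter a (assumed input)"""
    alphabet = sorted(init)
    delta = build_dfa(pattern, alphabet)
    m = len(pattern)
    # weights[(dfa_state, last_letter, count)] = probability mass
    weights = defaultdict(float)
    for c in alphabet:
        q = delta[0][c]
        weights[(q, c, int(q == m))] += init[c]
    for _ in range(n - 1):
        nxt = defaultdict(float)
        for (q, a, k), p in weights.items():
            for b in alphabet:
                q2 = delta[q][b]
                # Entering the accepting state m means one more occurrence.
                nxt[(q2, b, k + int(q2 == m))] += p * trans[a][b]
        weights = nxt
    dist = defaultdict(float)
    for (_, _, k), p in weights.items():
        dist[k] += p
    return dict(dist)

# Usage: a uniform model over the DNA alphabet (illustrative values only).
uniform = {c: 0.25 for c in "ACGT"}
trans = {a: dict(uniform) for a in "ACGT"}
dist = count_distribution("AA", 3, uniform, trans)
```

The sketch tracks the occurrence count explicitly, so its cost grows with the text length, the number of DFA states, the alphabet size, and the maximum count. It is meant to show the embedding of a Markov source into a pattern automaton, not to be efficient.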

## 29 Citations

Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data

- Computer Science
- Algorithms for Molecular Biology
- 2009

Three innovative algorithms that use optimal Markov chain embedding built on deterministic finite automata are introduced. They prove effective and can handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter are highly complex.

Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models

- Computer Science
- Algorithms for Molecular Biology
- 2014

A novel algorithm, SufPref, is presented that computes an exact P-value for Hidden Markov models (HMMs). It is based on recursive equations on text sets related to pattern occurrences, and it inductively traverses a specific data structure, an overlap graph.

Faster exact distributions of pattern statistics through sequential elimination of states

- Computer Science
- 2017

This work develops a method to obtain a small set of states during the state generation process without forming a DFA, and shows that a huge reduction in the size of the AMC can be attained.

Computation of exact probabilities associated with overlapping pattern occurrences

- Computer Science
- WIREs Computational Statistics
- 2019

An overview is given of the main methods used to compute distributions of statistics of overlapping pattern occurrences: generating functions, correlation functions, the Goulden‐Jackson cluster method, recursive equations, and Markov chain embedding.

Sparse approaches for the exact distribution of patterns in long state sequences generated by a Markov source

- Mathematics, Computer Science
- Theor. Comput. Sci.
- 2013

The Power of Detecting Enriched Patterns: An HMM Approach

- Biology
- J. Comput. Biol.
- 2010

The issue of how the known presence of random instances of a known motif affects the power of detecting patterns, such as patterns within the motif, is addressed.

Regmex, Motif analysis in ranked lists of sequences

- Biology
- bioRxiv
- 2016

A motif analysis tool, Regmex (REGular expression Motif EXplorer), which offers several methods to identify overrepresented motifs in a ranked list of sequences and is well suited for a range of biological sequence analysis problems related to motif discovery.

Moments of the Count of a Regular Expression in a Heterogeneous Random Sequence

- Mathematics, Computer Science
- Methodology and Computing in Applied Probability
- 2019

This work focuses on the distribution of the random count N of a regular expression in a multi-state random sequence generated by a heterogeneous Markov source, and provides explicit recursions for computing the mgf/pgf of N under the evidence constraint.

Regmex: a statistical tool for exploring motifs in ranked sequence lists from genomics experiments

- Biology
- Algorithms for Molecular Biology
- 2018

Regmex (REGular expression Motif EXplorer), which offers several methods to identify overrepresented motifs in ranked lists of sequences, is presented and compared with an existing motif discovery tool and shows increased sensitivity.

Significance Score of Motifs in Biological Sequences

- Biology
- 2011

In statistical terms, this is equivalent to computing the p-value of observation n with respect to a relevant reference model if X1:l = X1.

## References

SHOWING 1-10 OF 20 REFERENCES

Statistical tests to compare motif count exceptionalities

- Biology
- BMC Bioinformatics
- 2006

Two statistical tests are developed and analyzed, an exact binomial one and an asymptotic likelihood ratio test, to decide whether the exceptionality of a given motif is equivalent or significantly different in two sequences of interest.

Computing exact P-values for DNA motifs

- Computer Science
- Bioinform.
- 2007

The problem is shown to be NP-hard, and MotifRank, software based on dynamic programming, is presented to calculate exact P-values of motifs, which are defined on a general and more precise model.

Proteome analysis based on motif statistics

- Biology
- ECCB
- 2002

It is demonstrated that statistical over- or under-representation of motifs in complete proteomes may be an indicator of whether, in that organism, the authors are looking at chance occurrences of the motif or whether the occurrences are sufficiently numerous to suggest a systematic, and thus functionally important occurrence.

Effective p-value computations using Finite Markov Chain Imbedding (FMCI): application to local score and to pattern statistics

- Computer Science
- Algorithms for Molecular Biology
- 2006

This work gives general recursive algorithms that compute, in a numerically stable manner, the exact cumulative distribution function (CDF) or complementary CDF (CCDF).

Pattern Markov chains: optimal Markov chain embedding through deterministic finite automata

- Mathematics, Computer Science
- 2008

The theory of language and automata is used to provide space-optimal Markov chain embedding using the new notion of pattern Markov chains (PMCs), and explicit constructive algorithms are given to build the PMC associated to any given pattern problem.

Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymer

- Computer Science
- ISMB
- 1994

The algorithm described in this paper discovers one or more motifs in a collection of DNA or protein sequences by using the technique of expectation maximization to fit a two-component finite mixture…

DNA, Words and Models, Statistics of Exceptional Words

- Economics
- Technometrics
- 2007

The book covers some aspects of financial time series, associated analysis challenges, and S+ software and has a surprisingly wide and reasonably deep coverage of a lot of practical and theoretical time series concepts.

Hidden Markov models in computational biology. Applications to protein modeling.

- Biology, Computer Science
- Journal of molecular biology
- 1994

The results suggest the presence of an EF-hand calcium binding motif in a highly conserved and evolutionary preserved putative intracellular region of 155 residues in the alpha-1 subunit of L-type calcium channels which play an important role in excitation-contraction coupling.

Numerical Solutions for Patterns Statistics on Markov Chains

- Computer Science, Mathematics
- Statistical applications in genetics and molecular biology
- 2006

This work reviews the methods available for computing pattern statistics on text generated by a Markov source, comparing their computational time and reliability on the most complete pattern statistics benchmark available at the present time.

Identification of DNA Motifs Implicated in Maintenance of Bacterial Core Genomes by Predictive Modeling

- Biology
- PLoS genetics
- 2007

This work examined the distribution of the “crossover hotspot instigator,” or Chi, in Escherichia coli, and found that its exceptional distribution is restricted to the core genome common to three strains, and formulated a set of criteria that were incorporated in a statistical model to search core genomes for motifs potentially involved in genome stability in other species.