Sampling rare events: statistics of local sequence alignments.

A method is presented for calculating probability distributions in regions where events are very unlikely (e.g., p ≈ 10^{-40}). The basic idea is to map the underlying model onto a physical system. The system is simulated at a low temperature, so that configurations with originally low probabilities are preferentially generated. Since the distribution of such a physical system is known, the original unbiased distribution can be recovered. As an application, local alignment of protein…
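The reweighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy DNA alphabet, the +1/-1 ungapped scoring, the single temperature, and the names `local_score`, `sample_biased`, and `reweight` are all assumptions made for illustration. Sequence pairs are sampled by Metropolis moves with weight exp(S/T), so high scores S are visited more often than under uniform random sequences, and the score histogram is then multiplied by exp(-S/T) to undo the bias. Normalizing over the visited scores is only reasonable when the run also visits the scores that carry most of the probability mass; a fuller treatment would likely combine runs at several temperatures.

```python
import math
import random
from collections import Counter

ALPHABET = "ACGT"  # toy alphabet (assumption; the source treats proteins)


def local_score(a, b, match=1, mismatch=-1):
    """Ungapped local alignment score: best-scoring run along any diagonal."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            # Extend the diagonal run, or restart at zero if it went negative.
            cur[j] = max(0, prev[j - 1] + (match if x == y else mismatch))
            best = max(best, cur[j])
        prev = cur
    return best


def sample_biased(length, temperature, sweeps, rng):
    """Metropolis sampling of sequence pairs with weight exp(S / T)."""
    a = [rng.choice(ALPHABET) for _ in range(length)]
    b = [rng.choice(ALPHABET) for _ in range(length)]
    s = local_score(a, b)
    hist = Counter()
    for _ in range(sweeps * 2 * length):
        # Symmetric proposal: redraw one letter of one sequence uniformly.
        seq = a if rng.random() < 0.5 else b
        pos = rng.randrange(length)
        old = seq[pos]
        seq[pos] = rng.choice(ALPHABET)
        s_new = local_score(a, b)
        if s_new >= s or rng.random() < math.exp((s_new - s) / temperature):
            s = s_new  # accept the move
        else:
            seq[pos] = old  # reject and restore the old letter
        hist[s] += 1
    return hist


def reweight(hist, temperature):
    """Undo the exp(S / T) bias and normalize over the visited scores."""
    w = {s: n * math.exp(-s / temperature) for s, n in hist.items()}
    z = sum(w.values())
    return {s: v / z for s, v in sorted(w.items())}
```

A possible usage, with arbitrarily chosen parameters:

```python
rng = random.Random(0)
hist = sample_biased(length=8, temperature=1.0, sweeps=2000, rng=rng)
for score, p in reweight(hist, temperature=1.0).items():
    print(score, p)
```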

Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail
The results show that, in the rare-event tail, the statistics of gapped and ungapped local alignments deviate significantly from the Gumbel distribution, which is usually used when evaluating p-values in databases.
Sequence Alignment Statistics
This chapter gives some simple, useful techniques for approximating the p-values of various types of optimal alignment scores. It starts with general techniques: if, e.g., a dynamic programming
Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling
An efficient and general method to compute the score distribution to any desired accuracy, combining Markov chain Monte Carlo simulations with importance sampling and generalized ensembles, and extended to a model of transmembrane proteins.
Large-Deviation Properties of Sequence Alignment of Correlated Sequences
The large deviation method that was used in previous studies is applied to local and global alignment of iid drawn sequences and it is shown that again a correction to the Gumbel distribution is necessary to study the dependence of the parameters on the correlation strength.
Significance of Gapped Sequence Alignments
L. Newberg. Biology, Computer Science. J. Comput. Biol., 2008.
This work draws random samples directly from a well-chosen importance-sampling probability distribution to approximate alignment score significance, and shows that the extreme-value significance statistic for the local alignment model examined does not follow a Gumbel distribution.
New finite-size correction for local alignment score distributions
An improved finite-size correction is presented that considers the distribution of sequence lengths rather than only their means. It improves sensitivity and avoids substituting an ad hoc length for short sequences, a substitution that can underestimate the significance of a match.
Mathematical models, algorithms, and statistics of sequence alignment
This work presents the basic theory of sequence alignment from computational, biological, and statistical perspectives, and analyzes results of computer simulations that effectively illustrate one possible application of this theory.
Score statistics of global sequence alignment from the energy distribution of a modified directed polymer and directed percolation problem.
This work investigates the score statistics of global sequence alignment, taking into account in particular the compositional bias of the sequences compared. It also illustrates that score statistics can be characterized for modest system sizes (sequence lengths) via proper reparametrization of alignment scores.
Estimating statistical significance of local protein profile-profile alignments
It is shown that improvements in statistical accuracy and sensitivity and high-quality alignment rate result from statistically characterizing alignments by establishing the dependence of statistical parameters on various measures associated with both individual and pairwise profile characteristics.
Minimum-free-energy distribution of RNA secondary structures: Entropic and thermodynamic properties of rare events.
Generalized ensemble Markov-chain Monte Carlo methods are used to explore the rare-event tail of the MFE distribution down to probabilities such as 10^{-70} and to study the relationship between the sequence entropy and structural properties for sequence ensembles with fixed MFEs.


