Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology

@article{Sjlander1996DirichletMA,
  title={Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology},
  author={Kimmen Sj{\"o}lander and Kevin Karplus and Michael Brown and Richard Hughey and Anders Krogh and I. Saira Mian and David Haussler},
  journal={Computer applications in the biosciences : CABIOS},
  year={1996},
  volume={12},
  number={4},
  pages={327--345}
}
We present a method for condensing the information in multiple alignments of proteins into a mixture of Dirichlet densities over amino acid distributions. Dirichlet mixture densities are designed to be combined with observed amino acid frequencies to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model or other statistical model. These estimates give a statistical model greater generalization capacity, so that remotely related family members can… 
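The combination of a Dirichlet mixture prior with observed counts described above can be sketched as a posterior mean estimate: each component's posterior weight is proportional to its mixture coefficient times the ratio of Beta functions B(alpha_j + n) / B(alpha_j), and the estimate is the weighted average of each component's pseudocount-corrected frequencies. The sketch below is illustrative only. The function names, the toy 3-letter alphabet, and all parameter values are assumptions made for the example and are not taken from the paper.

```python
from math import exp, lgamma, log


def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))


def posterior_mean(counts, weights, alphas):
    """Posterior mean letter probabilities under a Dirichlet mixture prior.

    counts  -- observed counts n_i for one alignment column
    weights -- mixture coefficients q_j (assumed to sum to 1)
    alphas  -- one Dirichlet parameter vector alpha_j per component
    """
    # log(q_j * B(alpha_j + n) / B(alpha_j)); the multinomial coefficient
    # is the same for every component, so it is omitted.
    log_post = [
        log(q) + log_beta([a + n for a, n in zip(alpha, counts)]) - log_beta(alpha)
        for q, alpha in zip(weights, alphas)
    ]
    # Normalize in log space to avoid underflow.
    top = max(log_post)
    unnorm = [exp(lp - top) for lp in log_post]
    z = sum(unnorm)
    resp = [u / z for u in unnorm]

    total = sum(counts)
    return [
        sum(r * (counts[i] + alpha[i]) / (total + sum(alpha))
            for r, alpha in zip(resp, alphas))
        for i in range(len(counts))
    ]


if __name__ == "__main__":
    # Hypothetical two-component prior over a toy 3-letter alphabet.
    toy_weights = [0.5, 0.5]
    toy_alphas = [[1.0, 1.0, 2.0], [2.0, 1.0, 1.0]]
    print(posterior_mean([3, 0, 1], toy_weights, toy_alphas))
```

With no observations the estimate reduces to the mixture of the component means, and as counts grow it approaches the observed frequencies, which is the behavior that lets a model generalize from few sequences.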

Dirichlet Mixtures, the Dirichlet Process, and the Structure of Protein Space

TLDR
The Dirichlet process is used to model probability distributions that are mixtures of an unknown number of components; the resulting mixtures consist of over 500 components and provide a novel perspective on the structure of proteins.

On the Inference of Dirichlet Mixture Priors for Protein Sequence Comparison

TLDR
This article addresses two questions relevant to such inference: how many components a Dirichlet mixture should comprise, and how a maximum-likelihood mixture may be derived from a given data set. It applies the Minimum Description Length principle to the first question.

Using Substitution Matrices to Estimate Probability Distributions for Biological Sequences

TLDR
This paper presents a biologically motivated method for estimating probability distributions over discrete alphabets from observations, using a mixture model of common ancestors with a simple Bayesian interpretation. Its advantage over Dirichlet mixtures is that it is both effective and simple to compute for large alphabets.

Using mixtures of common ancestors for estimating the probabilities of discrete events in biological sequences

TLDR
This paper presents a biologically motivated method for estimating probability distributions over discrete alphabets from observations, using a mixture model of common ancestors with a simple Bayesian interpretation. Its advantage over Dirichlet mixtures is that it is both effective and simple to compute for large alphabets.

Efficient functional clustering of protein sequences using the Dirichlet process

TLDR
A novel probabilistic framework that models subfamilies within a known protein family, uses Dirichlet mixture densities to estimate amino acid preferences within subfamily clusters, and places a Dirichlet process prior on the overall set of clusters.

Compositional Adjustment of Dirichlet Mixture Priors

TLDR
This work describes an implementation of the Lagrange-Newton method that can compositionally adjust, to good precision, a 20-component Dirichlet mixture prior for proteins in under half a second on a standard workstation.

Protein homology detection using sparse profile hidden Markov models

TLDR
This work hypothesizes that the knowledge of a small set of key residues and the distances between each neighboring pair allows one to classify a given protein into an appropriate group, and proposes a class of models termed sparse profile hidden Markov models and a training algorithm for obtaining these models from data.

The Complexity of the Dirichlet Model for Multiple Alignment Data

TLDR
This work derives, in the limit of large n and c, a closed-form expression for the complexity of the Dirichlet model applied to multiple-alignment data, which has been applied fruitfully to the study of protein multiple sequence alignments.

Context-Specific Independence Mixture Modelling for Protein Families

TLDR
A clustering procedure is presented that uses the context-specific independence mixture framework with a Dirichlet mixture prior for simultaneous inference of subgroups and prediction of specificity-determining residues, based on multiple sequence alignments of protein families.
...

References

Using substitution probabilities to improve position-specific scoring matrices

TLDR
This work introduces a simple method for computing pseudo-counts that combines the diversity observed in each alignment position with amino acid substitution probabilities and was a substantial improvement over the traditional average score method used for constructing profiles.

REGULARIZERS FOR ESTIMATING DISTRIBUTIONS OF AMINO ACIDS FROM SMALL SAMPLES

TLDR
A new method is presented for setting the parameters of the regularizers to minimize the encoding cost (also called the entropy) of the training data, for all possible samples from the training data.

Protein modeling using hidden Markov models: analysis of globins

TLDR
A variant of the expectation maximization algorithm known as the Viterbi algorithm is used to obtain the statistical model from the unaligned sequences, and a multiple alignment of the 400 sequences and 225 other globin sequences was obtained that agrees almost perfectly with a structural alignment by D Bashford et al. (1987).

Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks.

TLDR
An approach to analyzing protein sequence databases that, starting from a single uncharacterized sequence or group of related sequences, generates blocks of conserved segments that are used to detect novel conserved motifs of potential biological importance is described.

Stochastic models for heterogeneous DNA sequences.

Hidden Markov models for sequence analysis: extension and analysis of the basic method

TLDR
The mathematical extensions and heuristics that move the method from the theoretical to the practical are reviewed and the effectiveness of model regularization, dynamic model modification and optimization strategies are experimentally analyzed.

The Value of Prior Knowledge in Discovering Motifs with MEME

TLDR
This paper describes several extensions to MEME which increase its ability to find motifs in a totally unsupervised fashion, but which also allow it to benefit when prior knowledge is available.

Hidden Markov models in computational biology. Applications to protein modeling.

TLDR
The results suggest the presence of an EF-hand calcium-binding motif in a highly conserved and evolutionarily preserved putative intracellular region of 155 residues in the alpha-1 subunit of L-type calcium channels, which play an important role in excitation-contraction coupling.

Multiple Alignment Using Hidden Markov Models

TLDR
Examination of the specific cases in which ClustalW outperforms simulated annealing, and vice versa, provides insight into the strengths and weaknesses of current hidden Markov model approaches.
...