# Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology

@article{Sjlander1996DirichletMA, title={Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology}, author={Kimmen Sj{\"o}lander and Kevin Karplus and Michael Brown and Richard Hughey and Anders Krogh and I. Saira Mian and David Haussler}, journal={Computer applications in the biosciences : CABIOS}, year={1996}, volume={12}, number={4}, pages={327--345} }

We present a method for condensing the information in multiple alignments of proteins into a mixture of Dirichlet densities over amino acid distributions. Dirichlet mixture densities are designed to be combined with observed amino acid frequencies to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model or other statistical model. These estimates give a statistical model greater generalization capacity, so that remotely related family members can…
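
As a rough illustration of how a Dirichlet mixture can be combined with observed counts, the sketch below computes the standard posterior-mean estimate for a Dirichlet mixture prior: each component's posterior weight is proportional to `q_j * B(n + alpha_j) / B(alpha_j)`, and the estimate for letter `i` is the weighted average of `(n_i + alpha_ji) / (|n| + |alpha_j|)`. The truncated abstract does not spell out this formula, so treat it as an assumption. The function names, the three-letter alphabet, and the two-component parameters are hypothetical toy values, not the published mixtures.

```python
import math


def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def component_posteriors(counts, weights, alphas):
    """P(component j | counts), proportional to q_j * B(counts + alpha_j) / B(alpha_j)."""
    log_terms = []
    for q, alpha in zip(weights, alphas):
        updated = [n + a for n, a in zip(counts, alpha)]
        log_terms.append(math.log(q) + log_beta(updated) - log_beta(alpha))
    shift = max(log_terms)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(t - shift) for t in log_terms]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def estimate_probabilities(counts, weights, alphas):
    """Posterior-mean letter probabilities under a Dirichlet mixture prior."""
    posteriors = component_posteriors(counts, weights, alphas)
    n_total = sum(counts)
    estimate = [0.0] * len(counts)
    for w, alpha in zip(posteriors, alphas):
        denom = n_total + sum(alpha)
        for i, (n, a) in enumerate(zip(counts, alpha)):
            estimate[i] += w * (n + a) / denom
    return estimate


# Hypothetical toy prior over a 3-letter alphabet (assumption, not from the paper).
toy_weights = [0.5, 0.5]
toy_alphas = [[5.0, 1.0, 1.0], [1.0, 1.0, 5.0]]

# One alignment column in which only the first letter was observed, three times.
toy_counts = [3, 0, 0]
toy_estimate = estimate_probabilities(toy_counts, toy_weights, toy_alphas)
print(toy_estimate)
```

In this toy setting, letters that were never observed still receive nonzero probability, which is the kind of generalization from small samples that the abstract describes.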

## 391 Citations

### Dirichlet Mixtures, the Dirichlet Process, and the Structure of Protein Space

- Mathematics
- J. Comput. Biol.
- 2013

The Dirichlet process is used to model probability distributions that are mixtures of an unknown number of components; the resulting mixtures consist of over 500 components and provide a novel perspective on the structure of proteins.

### On the Inference of Dirichlet Mixture Priors for Protein Sequence Comparison

- Mathematics, Computer Science
- J. Comput. Biol.
- 2011

This article addresses two questions relevant to such inference: of how many components should a Dirichlet mixture consist, and how may a maximum-likelihood mixture be derived from a given data set? It applies the Minimum Description Length principle to the first question.

### Using Substitution Matrices to Estimate Probability Distributions for Biological Sequences

- Computer Science
- J. Comput. Biol.
- 2002

This paper presents a biologically motivated method for estimating probability distributions over discrete alphabets from observations, using a mixture model of common ancestors with a simple Bayesian interpretation. It has the advantage over Dirichlet mixtures that it is both effective and simple to compute for large alphabets.

### Using mixtures of common ancestors for estimating the probabilities of discrete events in biological sequences

- Computer Science
- ISMB
- 2001

This paper presents a biologically motivated method for estimating probability distributions over discrete alphabets from observations, using a mixture model of common ancestors with a simple Bayesian interpretation. It has the advantage over Dirichlet mixtures that it is both effective and simple to compute for large alphabets.

### Efficient functional clustering of protein sequences using the Dirichlet process

- Biology
- Bioinform.
- 2008

A novel probabilistic framework that models subfamilies within a known protein family, uses Dirichlet mixture densities to estimate amino acid preferences within subfamily clusters, and places a Dirichlet process prior on the overall set of clusters.

### Compositional Adjustment of Dirichlet Mixture Priors

- Computer Science
- J. Comput. Biol.
- 2010

This work implements the Lagrange-Newton method and can compositionally adjust, to good precision, a 20-component Dirichlet mixture prior for proteins in under half a second on a standard workstation.

### Protein homology detection using sparse profile hidden Markov models

- Computer Science, Biology
- 2005

This work hypothesizes that the knowledge of a small set of key residues and the distances between each neighboring pair allows one to classify a given protein into an appropriate group, and proposes a class of models termed sparse profile hidden Markov models and a training algorithm for obtaining these models from data.

### The Complexity of the Dirichlet Model for Multiple Alignment Data

- Mathematics
- J. Comput. Biol.
- 2011

This work derives, in the limit of large n and c, a closed-form expression for the complexity of the Dirichlet model applied to multiple-alignment data, which has been applied fruitfully to the study of protein multiple sequence alignments.

### Context-Specific Independence Mixture Modelling for Protein Families

- Biology
- PKDD
- 2007

A clustering procedure is presented that uses the context-specific independence mixture framework with a Dirichlet mixture prior for simultaneous inference of subgroups and prediction of specificity-determining residues, based on multiple sequence alignments of protein families.

### A maximum likelihood approximation method for Dirichlet's parameter estimation

- Mathematics
- Comput. Stat. Data Anal.
- 2008

## References

### Using substitution probabilities to improve position-specific scoring matrices

- Computer Science
- Comput. Appl. Biosci.
- 1996

This work introduces a simple method for computing pseudo-counts that combines the diversity observed in each alignment position with amino acid substitution probabilities and was a substantial improvement over the traditional average score method used for constructing profiles.

### REGULARIZERS FOR ESTIMATING DISTRIBUTIONS OF AMINO ACIDS FROM SMALL SAMPLES

- Computer Science
- 1995

A new method is presented for setting the parameters of the regularizers to minimize the encoding cost (also called the entropy) of the training data, for all possible samples from the training data.

### Amino acid substitution matrices from an information theoretic perspective

- Biology
- Journal of Molecular Biology
- 1991

### Protein modeling using hidden Markov models: analysis of globins

- Biology
- [1993] Proceedings of the Twenty-sixth Hawaii International Conference on System Sciences
- 1993

A variant of the expectation maximization algorithm known as the Viterbi algorithm is used to obtain the statistical model from the unaligned sequences, and a multiple alignment of the 400 sequences and 225 other globin sequences was obtained that agrees almost perfectly with a structural alignment by D Bashford et al. (1987).

### Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks.

- Biology
- Proceedings of the National Academy of Sciences of the United States of America
- 1994

An approach to analyzing protein sequence databases that, starting from a single uncharacterized sequence or group of related sequences, generates blocks of conserved segments that are used to detect novel conserved motifs of potential biological importance is described.

### Stochastic models for heterogeneous DNA sequences.

- Mathematics
- Bulletin of mathematical biology
- 1989

### Hidden Markov models for sequence analysis: extension and analysis of the basic method

- Computer Science
- Comput. Appl. Biosci.
- 1996

The mathematical extensions and heuristics that move the method from the theoretical to the practical are reviewed, and the effectiveness of model regularization, dynamic model modification and optimization strategies is experimentally analyzed.

### The Value of Prior Knowledge in Discovering Motifs with MEME

- Biology
- ISMB
- 1995

This paper describes several extensions to MEME which increase its ability to find motifs in a totally unsupervised fashion, but which also allow it to benefit when prior knowledge is available.

### Hidden Markov models in computational biology. Applications to protein modeling.

- Biology, Computer Science
- Journal of molecular biology
- 1994

The results suggest the presence of an EF-hand calcium binding motif in a highly conserved and evolutionarily preserved putative intracellular region of 155 residues in the alpha-1 subunit of L-type calcium channels, which play an important role in excitation-contraction coupling.

### Multiple Alignment Using Hidden Markov Models

- Computer Science
- ISMB
- 1995

Examination of the specific cases in which ClustalW outperforms simulated annealing, and vice versa, provides insight into the strengths and weaknesses of current hidden Markov model approaches.