# The Complexity of the Dirichlet Model for Multiple Alignment Data

@article{Yu2011TheCO, title={The Complexity of the Dirichlet Model for Multiple Alignment Data}, author={Yi-Kuo Yu and Stephen F. Altschul}, journal={Journal of computational biology : a journal of computational molecular cell biology}, year={2011}, volume={18}, number={8}, pages={925-939} }

A model is a set of possible theories for describing a set of data. When the data are used to select a maximum-likelihood theory, an important question is how many effectively independent theories the model contains; the log of this number is called the model's complexity. The Dirichlet model is the set of all Dirichlet distributions, which are probability densities over the space of multinomials. A Dirichlet distribution may be used to describe multiple-alignment data, consisting of n columns…
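As an illustration of how a single Dirichlet distribution can assign a likelihood to alignment columns, the sketch below uses the standard Dirichlet-multinomial closed form, in which the multinomial is integrated out. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names, the four-letter alphabet, the count vectors, and the two candidate parameter sets are all hypothetical, and the code does not compute the model complexity discussed in the abstract.

```python
import math


def column_log_likelihood(counts, alpha):
    """Log-probability of one column's ordered letters under a Dirichlet.

    counts[i] is how often letter i occurs in the column and alpha[i] > 0
    are the Dirichlet parameters. Integrating out the multinomial gives
    Gamma(A) / Gamma(A + n) * prod_i Gamma(alpha_i + c_i) / Gamma(alpha_i),
    where A = sum(alpha) and n = sum(counts).
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("Dirichlet parameters must be positive")
    a_total = sum(alpha)
    n = sum(counts)
    log_p = math.lgamma(a_total) - math.lgamma(a_total + n)
    for c, a in zip(counts, alpha):
        log_p += math.lgamma(a + c) - math.lgamma(a)
    return log_p


def alignment_log_likelihood(columns, alpha):
    """Sum of column log-likelihoods, treating columns as independent."""
    return sum(column_log_likelihood(c, alpha) for c in columns)


if __name__ == "__main__":
    # Hypothetical toy data: three columns over a 4-letter alphabet.
    columns = [[5, 0, 0, 0], [0, 4, 1, 0], [2, 1, 1, 1]]
    # Two hypothetical candidate "theories" (Dirichlet parameter vectors).
    candidates = {
        "uniform": [1.0, 1.0, 1.0, 1.0],
        "sparse": [0.2, 0.2, 0.2, 0.2],
    }
    for name, alpha in candidates.items():
        print(name, alignment_log_likelihood(columns, alpha))
```

Choosing the parameter vector with the largest log-likelihood is a crude stand-in for maximum-likelihood selection over the full Dirichlet model; a real fit would optimize over all positive parameter vectors rather than a fixed list of candidates.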

## 4 Citations

### On the Inference of Dirichlet Mixture Priors for Protein Sequence Comparison

- Mathematics, Computer Science
- J. Comput. Biol.
- 2011

This article addresses two questions relevant to the inference of Dirichlet mixture priors: how many components a Dirichlet mixture should consist of, and how a maximum-likelihood mixture may be derived from a given data set. It applies the Minimum Description Length principle to the first question.

### Dirichlet Mixtures, the Dirichlet Process, and the Structure of Protein Space

- Mathematics
- J. Comput. Biol.
- 2013

The Dirichlet process is used to model probability distributions that are mixtures of an unknown number of components; the resulting mixtures consist of over 500 components and provide a novel perspective on the structure of proteins.

### Computational Methods for Inferring Transcription Factor Binding Sites

- Biology
- 2012

A novel method for PWM training, which uses known motifs to sample additional putative binding sites from a proximal promoter area, is introduced, implemented, and tested in this thesis in a large-scale application.

### Domain Analysis and Visualization of NLRP10

- Computer Science
- 2013

In this study, computational tools such as algorithms, web servers, and databases were used to investigate the domain of the NLRP10 protein, and the findings may provide computational insights into the structure and functions of NLRP10.

## References

### On the Inference of Dirichlet Mixture Priors for Protein Sequence Comparison

- Mathematics, Computer Science
- J. Comput. Biol.
- 2011

This article addresses two questions relevant to the inference of Dirichlet mixture priors: how many components a Dirichlet mixture should consist of, and how a maximum-likelihood mixture may be derived from a given data set. It applies the Minimum Description Length principle to the first question.

### Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology

- Mathematics
- Comput. Appl. Biosci.
- 1996

This paper corrects the previously published formula for estimating expected amino acid probabilities, and contains complete derivations of the Dirichlet mixture formulas, methods for optimizing the mixtures to match particular databases, and suggestions for efficient implementation.

### The Construction and Use of Log-Odds Substitution Scores for Multiple Sequence Alignment

- Biology
- PLoS Comput. Biol.
- 2010

This work uses Bayesian methods to construct “BILD” (“Bayesian Integral Log-odds”) substitution scores from prior distributions describing columns of related letters, describes how to calculate BILD scores efficiently, and illustrates their uses in Gibbs sampling optimization procedures, gapped alignment, and the construction of hidden Markov model profiles.

### Minimum Description Length Principle

- Computer Science
- Encyclopedia of Machine Learning
- 2010

### PSI-BLAST pseudocounts and the minimum description length principle

- Biology
- Nucleic Acids Research
- 2009

This article argues that the minimum description length principle can motivate the choice of the pseudocount parameter and implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs.

### Using Dirichlet Mixture Priors to Derive Hidden Markov Models for Protein Families

- Computer Science
- ISMB
- 1993

A Bayesian method for estimating the amino acid distributions in the states of a hidden Markov model (HMM) for a protein family or the columns of a multiple alignment of that family is introduced, which can improve the quality of HMMs produced from small training sets.
