The Complexity of the Dirichlet Model for Multiple Alignment Data

@article{Yu2011TheCO,
  title={The Complexity of the Dirichlet Model for Multiple Alignment Data},
  author={Yi-Kuo Yu and Stephen F. Altschul},
  journal={Journal of Computational Biology},
  year={2011},
  volume={18},
  number={8},
  pages={925--939}
}
  • Yi-Kuo Yu, S. Altschul
  • Published 29 July 2011
  • Mathematics
  • Journal of Computational Biology
A model is a set of possible theories for describing a set of data. When the data are used to select a maximum-likelihood theory, an important question is how many effectively independent theories the model contains; the log of this number is called the model's complexity. The Dirichlet model is the set of all Dirichlet distributions, which are probability densities over the space of multinomials. A Dirichlet distribution may be used to describe multiple-alignment data, consisting of n columns… 
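The abstract describes a Dirichlet distribution as a probability density over the space of multinomials, used to model alignment columns. A minimal sketch of that setup, assuming a hypothetical 4-letter alphabet and illustrative prior parameters (none of these values come from the paper): draw a multinomial from a Dirichlet, then simulate a column from it.

```python
import random

def sample_dirichlet(alpha):
    """Draw a multinomial (a probability vector on the simplex) from a
    Dirichlet distribution with parameters alpha, via normalized Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Illustrative symmetric prior over a 4-letter (DNA-like) alphabet.
alpha = [0.5, 0.5, 0.5, 0.5]
theta = sample_dirichlet(alpha)   # a point on the simplex: non-negative, sums to 1

# Simulate one alignment column of 10 letters from the sampled multinomial.
column = random.choices("ACGT", weights=theta, k=10)
```

The paper concerns proteins (a 20-letter alphabet); the 4-letter alphabet here only keeps the sketch short.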

Citations

On the Inference of Dirichlet Mixture Priors for Protein Sequence Comparison

TLDR
This article addresses two questions relevant to such inference: of how many components should a Dirichlet mixture consist, and how may a maximum-likelihood mixture be derived from a given data set? It applies the Minimum Description Length principle to the first question.

Dirichlet Mixtures, the Dirichlet Process, and the Structure of Protein Space

TLDR
The Dirichlet process is used to model probability distributions that are mixtures of an unknown number of components; the mixtures inferred consist of over 500 components and provide a novel perspective on the structure of protein space.

Computational Methods for Inferring Transcription Factor Binding Sites

TLDR
A novel method for PWM training, which uses known motifs to sample additional putative binding sites from a proximal promoter area, was introduced, implemented, and tested in this thesis in a large-scale application.

Domain Analysis and Visualization of NLRP10

TLDR
In this study, computational tools such as algorithms, web servers, and databases were used to investigate the domains of the NLRP10 protein, and the findings may provide computational insights into the structure and functions of NLRP10.

References

SHOWING 1-8 OF 8 REFERENCES

On the Inference of Dirichlet Mixture Priors for Protein Sequence Comparison

TLDR
This article addresses two questions relevant to such inference: of how many components should a Dirichlet mixture consist, and how may a maximum-likelihood mixture be derived from a given data set? It applies the Minimum Description Length principle to the first question.

Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology

TLDR
This paper corrects the previously published formula for estimating expected amino acid probabilities, and contains complete derivations of the Dirichlet mixture formulas, methods for optimizing the mixtures to match particular databases, and suggestions for efficient implementation.
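The expected amino acid probabilities mentioned above are, in the single-component case, the posterior mean of a multinomial under a Dirichlet prior. A hedged sketch of that standard formula (the counts, prior values, and function name are illustrative; the paper's actual formula additionally weights multiple mixture components by their posterior probabilities):

```python
def expected_probabilities(counts, alpha):
    """Posterior mean multinomial under a single Dirichlet prior:
    p_i = (n_i + alpha_i) / (N + sum(alpha))."""
    total = sum(counts) + sum(alpha)
    return [(n + a) / total for n, a in zip(counts, alpha)]

# Illustrative observed counts for one column over a 4-letter alphabet.
counts = [7, 1, 1, 1]
alpha = [0.5, 0.5, 0.5, 0.5]
probs = expected_probabilities(counts, alpha)
# The prior pulls the estimate away from raw frequencies toward alpha.
```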

The Construction and Use of Log-Odds Substitution Scores for Multiple Sequence Alignment

TLDR
This work uses Bayesian methods to construct “BILD” (“Bayesian Integral Log-odds”) substitution scores from prior distributions describing columns of related letters, and describes how to calculate BILD scores efficiently, and illustrates their uses in Gibbs sampling optimization procedures, gapped alignment, and the construction of hidden Markov model profiles.

Minimum Description Length Principle

  • J. Rissanen
  • Computer Science
  • Encyclopedia of Machine Learning
  • 2010

Choosing a Point from the Surface of a Sphere
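The reference above concerns sampling a uniformly random point on the surface of a sphere. A common approach to this problem (a sketch, not necessarily the specific method of that paper) normalizes a vector of independent standard normal draws, relying on the rotation invariance of the Gaussian:

```python
import math
import random

def random_point_on_sphere(dim=3):
    """Return a uniformly distributed point on the unit sphere in R^dim,
    by normalizing a vector of independent standard normal draws."""
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

p = random_point_on_sphere()  # unit-length 3-vector, up to rounding
```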

PSI-BLAST pseudocounts and the minimum description length principle

TLDR
This article argues that the minimum description length principle can motivate the choice of this parameter, and shows that the principle implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs.

Using Dirichlet Mixture Priors to Derive Hidden Markov Models for Protein Families

TLDR
A Bayesian method for estimating the amino acid distributions in the states of a hidden Markov model (HMM) for a protein family or the columns of a multiple alignment of that family is introduced, which can improve the quality of HMMs produced from small training sets.

E-mail: altschul@ncbi.nlm.nih.gov