Bayesian Computation and Model Selection in Population Genetics


Until recently, the use of Bayesian inference in population genetics was limited to a few cases because for many realistic population genetic models the likelihood function cannot be calculated analytically. The situation changed with the advent of likelihood-free inference algorithms, often subsumed under the term Approximate Bayesian Computation (ABC). A key innovation was the use of a post-sampling regression adjustment, which allows larger tolerance values and thereby brings computation time down to realistic orders of magnitude [1]. Here we propose a reformulation of the regression adjustment in terms of a General Linear Model (GLM). This integrates the method into the sound theoretical framework of Bayesian statistics and makes its tools available, including model selection via Bayes factors. We then apply the proposed methodology to the question of population subdivision among western chimpanzees, Pan troglodytes verus.

∗These two authors contributed equally to this work.
†Ecole d’ingénieurs de Fribourg, Bd. de Pérolles 80, 1705 Fribourg, Switzerland.
‡University of Berne, Computational and Molecular Population Genetics Laboratory, 3012 Berne, Switzerland.
arXiv:0901.2231v1 [stat.ME] 15 Jan 2009

Introduction

With the advent of ever more powerful computers and the refinement of algorithms like MCMC or Gibbs sampling, Bayesian statistics has become an important tool for scientific inference during the past two decades. Until recently many scientists shunned Bayesian methods, mainly because of the philosophical problems related to the choice of prior distributions, but the development of hierarchical and empirical Bayes turned them into an alternative even for hard-core frequentists (see e.g. [22] for a discussion of these issues).

Consider a model $M$ creating data $D$ (DNA sequence data, for example) determined by parameters $\theta$ from some (bounded) parameter space $\Pi \subset \mathbb{R}$ whose joint prior density we denote by $\pi(\theta)$.
The quantity of interest is the posterior distribution of the parameters, which can be calculated by Reverend Bayes’ golden rule $\pi(\theta \mid D) = c \cdot f_M(D \mid \theta)\,\pi(\theta)$, where $f_M(D \mid \theta)$ is the likelihood of the data and $c$ is a normalizing constant. Direct use of this formula, however, is often thwarted by the fact that the likelihood function cannot be calculated analytically for many realistic population genetic models. In these cases one has to resort to stochastic simulation.

Tavaré et al. [24] propose a rejection sampling method for simulating a posterior random sample in which the full data $D$ is replaced by a summary statistic $s$ (such as the number of segregating sites in their setting). Even if the statistic is not sufficient for $D$ (that is, it does not capture the full information contained in the data), rejection sampling allows for the simulation of approximate posterior distributions of the parameters in question (the scaled mutation rate in their model). This approach was extended to multiple-parameter models with multivariate summary statistics $s = (s_1, \ldots, s_n)$ by Weiss and von Haeseler [27]. In their setting a candidate vector $\theta$ of parameters is simulated from a prior distribution and is accepted if its corresponding vector of summary statistics is sufficiently close to the observed summary statistics $s_{\mathrm{obs}}$ with respect to some metric in the space of $s$, i.e. if $\mathrm{dist}(s, s_{\mathrm{obs}}) < \varepsilon$ for a fixed tolerance $\varepsilon$. If we suppose that the likelihood $f_M(s \mid \theta)$ of the full model is continuous and non-zero around $s_{\mathrm{obs}}$, then the likelihood of the truncated model $M_\varepsilon(s_{\mathrm{obs}})$ obtained by this accept-reject process is given by $f_\varepsilon(s \mid \theta) = \mathrm{Ind}(s \in B_\varepsilon(s_{\mathrm{obs}})) \cdot f_M(s \mid \theta) \cdot ( \int$
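
The accept-reject scheme described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a single summary statistic (the number of segregating sites, as in the setting of Tavaré et al.), a standard neutral coalescent with infinite-sites mutation, and a uniform prior on the scaled mutation rate. All function and parameter names (`simulate_segregating_sites`, `abc_rejection`, `prior_max`, and so on) are hypothetical and chosen only for illustration.

```python
import random


def simulate_segregating_sites(theta, sample_size, rng):
    """Simulate the number of segregating sites for `sample_size` sequences.

    Assumption: standard neutral coalescent with infinite-sites mutation,
    where theta is the scaled mutation rate.
    """
    total_length = 0.0
    for k in range(2, sample_size + 1):
        # The time spent with k lineages is exponential with rate k(k-1)/2;
        # it contributes k branches of that length to the tree.
        total_length += k * rng.expovariate(k * (k - 1) / 2)
    if theta <= 0:
        return 0
    # Mutations fall on the tree as a Poisson process with rate theta/2
    # per unit of branch length; count the arrivals before total_length.
    count = 0
    position = rng.expovariate(theta / 2)
    while position < total_length:
        count += 1
        position += rng.expovariate(theta / 2)
    return count


def abc_rejection(s_obs, sample_size, prior_max, tolerance, n_accept, rng):
    """Collect n_accept draws of theta whose simulated statistic is near s_obs."""
    accepted = []
    while len(accepted) < n_accept:
        theta = rng.uniform(0.0, prior_max)       # candidate from the (assumed) prior
        s = simulate_segregating_sites(theta, sample_size, rng)
        if abs(s - s_obs) < tolerance:            # dist(s, s_obs) < tolerance
            accepted.append(theta)
    return accepted


rng = random.Random(1)
sample = abc_rejection(s_obs=15, sample_size=20, prior_max=20.0,
                       tolerance=2, n_accept=200, rng=rng)
posterior_mean = sum(sample) / len(sample)
print(f"approximate posterior mean of theta: {posterior_mean:.2f}")
```

The accepted values form a sample from the approximate posterior under the truncated model. A smaller tolerance gives a closer approximation at the cost of a lower acceptance rate, which is the trade-off that the regression adjustment is meant to ease.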

Cite this paper

@inproceedings{Excoffier2009BayesianCA, title={Bayesian Computation and Model Selection in Population Genetics}, author={Christoph Leuenberger and Daniel Wegmann and Laurent Excoffier}, year={2009} }