Posterior predictive checks to quantify lack-of-fit in admixture models of latent population structure

@article{Mimno2015PosteriorPC,
  title={Posterior predictive checks to quantify lack-of-fit in admixture models of latent population structure},
  author={David Mimno and David M. Blei and Barbara E. Engelhardt},
  journal={Proceedings of the National Academy of Sciences},
  year={2015},
  volume={112},
  pages={E3441--E3450}
}
Significance: Bayesian models, including admixture models, are a powerful framework for articulating complex assumptions about large-scale genetic data; such models are widely used to explore data or to study population-level statistics of interest. However, we assume that a Bayesian model does not oversimplify the complexities in the data, to the point of invalidating our analyses. Here, we develop and study procedures for quantitatively evaluating admixture models of genetic data. Using four…

Population Predictive Checks
TLDR
A new method for Bayesian model checking, the population predictive check (Pop-PC), which is built on posterior predictive checks (PPC), a seminal method that checks a model by assessing the posterior predictive distribution on the observed data.
Efficient analysis of large datasets and sex bias with ADMIXTURE
TLDR
Improvements to the ADMIXTURE software are described, allowing users to extract more information from large genomic datasets, and increased power to detect sex-biased admixture in African-American individuals from the 1000 Genomes project is demonstrated.
Bayesian statistics and modelling
Depaoli, King, Yau. Nature Reviews Methods Primers, 2021. (Computer Science)
TLDR
The importance of prior and posterior predictive checking, selecting a proper technique for sampling from a posterior distribution, variational inference and variable selection are discussed, and the impact of Bayesian analysis on artificial intelligence is outlined, a major goal in the next decade.
Evaluating Bayesian Models with Posterior Dispersion Indices
TLDR
This work proposes to evaluate a model through posterior dispersion, and shows how a PDI identifies patterns of model mismatch in three real data examples: voting preferences, supermarket shopping, and population genetics.
Bayesian statistics and modelling
TLDR
This Primer on Bayesian statistics summarizes the most important aspects of determining prior distributions, likelihood functions and posterior distributions, in addition to discussing different applications of the method across disciplines.
Bayesian diagnostic analysis for quantitative trait loci mapping
TLDR
An overview of a few methods for residual and diagnostic analysis in the context of Bayesian regression modeling is presented, and the methods are adapted to QTL mapping to check the adequacy of the fitted model.
Combining Multiple Hypothesis Testing with Machine Learning Increases the Statistical Power of Genome-wide Association Studies
TLDR
Applying COMBI to data from a WTCCC study and measuring performance as replication by independent GWAS published within the 2008–2015 period, it is shown that the method outperforms ordinary raw p-value thresholding as well as other state-of-the-art methods.
Population genetic history and polygenic risk biases in 1000 Genomes populations
TLDR
It is shown that the transferability of GWAS results depends on the ancestral diversity of the study cohort as well as on phenotype polygenicity, causal allele frequency divergence, and heritability, highlighting the need to include more diverse samples in medical genomics studies to enable broadly applicable disease risk information.
Model Criticism for Bayesian Causal Inference
TLDR
This work develops model criticism for Bayesian causal inference, building on the idea of posterior predictive checks to assess model fit, and shows how to check any additional modeling assumption under the assumption of unconfoundedness.

References

Analysis of Population Structure: A Unifying Framework and Novel Methods Based on Sparse Factor Analysis
TLDR
It is found that SFA produces similar results to admixture-based models when the samples are descended from a few well-differentiated ancestral populations and can recapitulate the results of PCA when the population structure is more “continuous,” as in isolation-by-distance models.
Fast model-based estimation of ancestry in unrelated individuals.
TLDR
The results show that ADMIXTURE's computational speed opens up the possibility of using a much larger set of markers in model-based ancestry estimation and that its estimates are suitable for use in correcting for population stratification in association studies.
On the inference of ancestries in admixed populations.
TLDR
This work presents an augmented form of Markov models that can be used to predict historical recombination events and can model background linkage disequilibrium (LD) more accurately, and studies some of the computational issues that arise in using such Markovian models on realistic data sets.
Control of confounding of genetic associations in stratified populations.
TLDR
These methods can deal with both confounding and selection bias in genetic-association studies, making family-based designs unnecessary, and are demonstrated by using data from three admixed populations in which there is extreme confounding of trait-genotype associations.
Robust Demographic Inference from Genomic and SNP Data
TLDR
A flexible and robust simulation-based framework to infer demographic parameters from the site frequency spectrum (SFS) computed on large genomic datasets and shows that it allows one to study evolutionary models of arbitrary complexity, which cannot be tackled by other current likelihood-based methods.
Variance component model to account for sample structure in genome-wide association studies
TLDR
A variance component approach implemented in publicly available software, EMMA eXpedited (EMMAX), that reduces the computational time for analyzing large GWAS data sets from years to hours is reported.
Posterior Predictive Assessment of Model Fitness via Realized Discrepancies
This paper considers Bayesian counterparts of the classical tests for goodness of fit and their use in judging the fit of a single Bayesian model to the observed data. We focus on posterior
Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies.
TLDR
Extensions to the method of Pritchard et al. for inferring population structure from multilocus genotype data are described and methods that allow for linkage between loci are developed, which allows identification of subtle population subdivisions that were not detectable using the existing method.
Inference of Population Structure using Dense Haplotype Data
TLDR
A novel inference framework that aims to efficiently capture information on population structure provided by patterns of haplotype similarity and an efficient model-based approach to identify discrete populations using this matrix, which offers advantages over PCA and over existing clustering algorithms in terms of speed, number of separable populations, and sensitivity to subtle population structure.
Inference of Population Splits and Mixtures from Genome-Wide Allele Frequency Data
TLDR
A statistical model for inferring the patterns of population splits and mixtures in multiple populations is presented, and it is shown that a simple bifurcating tree does not fully describe the data; in contrast, many migration events are inferred.