Novel probabilistic models of spatial genetic ancestry with applications to stratification correction in genome‐wide association studies

Anand Bhaskar, Adel Javanmard, Thomas A. Courtade, David Tse
Motivation: Genetic variation in human populations is influenced by geographic ancestry due to spatial locality in historical mating and migration patterns. Spatial population structure in genetic datasets has traditionally been analyzed using either model‐free algorithms, such as principal components analysis (PCA) and multidimensional scaling, or explicit spatial probabilistic models of allele frequency evolution. We develop a general probabilistic model and an associated inference…


Fast Inference of Individual Admixture Coefficients Using Geographic Data
This study introduces new algorithms that use geographic information to estimate ancestry proportions and ancestral genotype frequencies from population genetic data and combines matrix factorization methods and spatial statistics to provide estimates of ancestry matrices based on least-squares approximation.
Human ancestry identification under resource constraints -- what can one chromosome tell us about human biogeographical ancestry?
The results demonstrate that a single chromosome, Chromosome 1, if carefully analyzed, could hold enough information for accurate prediction of human biogeographical ancestry, with applications in studies of genetic diseases, forensics, and soft biometrics.
Predicting Geographic Location from Genetic Variation with Deep Neural Networks
A deep learning method called Locator is described that predicts the location of origin of a genetic sample by comparing it to a set of samples of known geographic origin, and does so faster and more accurately than existing approaches.
Population Stratification in Genetic Association Studies
This work describes methods for detecting population stratification (PS) and approaches for accounting for PS when calculating association statistics, so that measures of association are not confounded.
Pearson Chi-squared Conditional Randomization Test
Conditional independence (CI) testing arises naturally in many scientific problems and application domains. The goal of this problem is to investigate the conditional independence between a response
Association of NOD2 and IFNG single nucleotide polymorphisms with leprosy in the Amazon ethnic admixed population
It is confirmed that NOD2 and IFNG are major players in immunity against M. leprae in the Amazon ethnic admixed population.
Inference of Biogeographical Ancestry Under Resource Constraints


A model-based approach for analysis of spatial structure in genetic data
This work applies the spatial ancestry analysis method to a European and a worldwide population genetic variation data set and identifies SNPs showing large gradients in allele frequency, and suggests these as candidate regions under selection.
Fast spatial ancestry via flexible allele frequency surfaces
The proposed model divides the region of interest into pixels and operates SNP by SNP and gives better localization for both unmixed and admixed individuals than existing methods despite using just a small fraction of the available SNPs.
Probabilistic models of genetic variation in structured populations applied to global human studies
A new ‘logistic factor analysis’ framework is introduced that seeks to directly model the logit transformation of probabilities underlying observed genotypes in terms of latent variables that capture population structure.
A Note on the Relations Between Spatio-Genetic Models
This note explains the implicit spatio-genetic model that underlies PCA, and provides insights into some of the recently proposed spatial models, and shows that two of these models can be formulated as modifications of PCA, each removing one of PCA's limitations in the context of genetic analysis.
Variance component model to account for sample structure in genome-wide association studies
A variance component approach implemented in publicly available software, EMMA eXpedited (EMMAX), that reduces the computational time for analyzing large GWAS data sets from years to hours is reported.
A Spatial Framework for Understanding Population Structure and Admixture
This work uses genome-wide polymorphism data to build “geogenetic maps,” which, when applied to stationary populations, produces a map of the geographic positions of the populations, but with distances distorted to reflect historical rates of gene flow.
Genes mirror geography within Europe
Despite low average levels of genetic differentiation among Europeans, there is a close correspondence between genetic and geographic distances; indeed, a geographical map of Europe arises naturally as an efficient two-dimensional summary of genetic variation in Europeans.
Sensitive Detection of Chromosomal Segments of Distinct Ancestry in Admixed Populations
HAPMIX will be of particular utility for mapping disease genes in recently admixed populations, as its accurate estimates of local ancestry permit admixture and case-control association signals to be combined, enabling more powerful tests of association than with either signal alone.
PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations
A novel algorithm is presented that is effectively used for the analysis of admixed populations without having to trace the origin of individuals, and can be easily applied on large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations.
A Genealogical Interpretation of Principal Components Analysis
For SNP data the projection of samples onto the principal components can be obtained directly from considering the average coalescent times between pairs of haploid genomes, which provides a framework for interpreting PCA projections in terms of underlying processes, including migration, geographical isolation, and admixture.