Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated

  • Eran Elhaik
  • Published 29 August 2022
  • Biology
  • Scientific Reports
Principal Component Analysis (PCA) is a multivariate analysis that reduces the complexity of datasets while preserving data covariance. The outcome can be visualized on colorful scatterplots, ideally with only a minimal loss of information. PCA applications, implemented in well-cited packages like EIGENSOFT and PLINK, are extensively used as the foremost analyses in population genetics and related fields (e.g., animal and plant or medical genetics). PCA outcomes are used to shape study design… 
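
As a rough illustration of the dimensionality reduction described in the abstract, the following minimal sketch computes PCA through a singular value decomposition of a centered data matrix. The toy genotype-like data, the `pca` helper, and all variable names are assumptions made for this example; it does not reproduce the normalization or options of EIGENSOFT or PLINK.

```python
import numpy as np

def pca(X, n_components=2):
    """Project rows of X onto the top principal components.

    X: (n_samples, n_features) array, e.g. toy genotype counts (0/1/2).
    Returns (scores, explained_variance_ratio).
    """
    Xc = X - X.mean(axis=0)  # center each feature (column)
    # SVD of the centered matrix; rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # sample coordinates
    var = S**2 / (X.shape[0] - 1)  # variance captured along each axis
    return scores, var[:n_components] / var.sum()

rng = np.random.default_rng(0)
# Hypothetical toy data: two groups of 20 samples and 50 SNP-like features,
# with allele frequencies that differ slightly between the groups.
freq_a = rng.uniform(0.1, 0.9, size=50)
freq_b = np.clip(freq_a + rng.normal(0, 0.2, size=50), 0.05, 0.95)
X = np.vstack([
    rng.binomial(2, freq_a, size=(20, 50)),
    rng.binomial(2, freq_b, size=(20, 50)),
]).astype(float)

scores, ratio = pca(X, n_components=2)
print(scores.shape)  # one 2-D coordinate per sample
print(ratio)         # fraction of total variance kept by PC1 and PC2
```

The two columns of `scores` are the coordinates that would be drawn on a scatterplot, and `ratio` indicates how much of the total variance those two axes retain, which is one way to gauge the information lost by the projection.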

Decision Tree Ensembles Utilizing Multivariate Splits Are Effective at Investigating Beta-Diversity in Medically Relevant 16S Amplicon Sequencing Data

It is shown that learned representations can be used to create informative ordinations and that the taxa identified by these approaches have been associated with immune-mediated inflammatory diseases and colorectal cancer.

coVariance Neural Networks

This work theoretically establishes the stability of VNNs to perturbations in the covariance matrix, implying an advantage over standard PCA-based data analysis approaches, which are prone to instability when principal components are associated with close eigenvalues.

Asymmetrical lineage introgression and recombination in populations of Aspergillus flavus: Implications for biological control

It is reported that the two distinct A. flavus evolutionary lineages IB and IC differ significantly in their frequency distributions across states, and there is evidence of increased unidirectional gene flow from lineage IB into IC, inferred to be due to the applied Afla-Guard biocontrol strain.

Reply to the Letter to the Editor

  • M. Rai, E. Tycksen, J. Keener
  • Medicine
    Journal of orthopaedic research : official publication of the Orthopaedic Research Society
  • 2022
To the Editor, The authors have read the insightful comments on their article “RNA‐Seq analysis reveals sex‐dependent transcriptomic profiles of human subacromial bursa stratified by tear etiology” and decided to carefully respond to their concerns.

Synonymous Codon Variant Analysis for Autophagic Genes Dysregulated in Neurodegeneration

Synonymous variant usage in various transcripts of autophagic genes whose dysregulation is reported to cause neurodegeneration is studied to help understand the evolutionary forces acting on these genes and the possible augmentation of a gene that shows unusual behavior.

Efficient representations of binarized health deficit data: the frailty index and beyond.

This work investigates efficient representations of binarized health deficit data using the 2001-2002 National Health and Nutrition Examination Survey and demonstrates how PCA extends the FI, providing additional health information, and allows us to explore system dimensionality and complexity.

Ancestry: How researchers use it and what they mean by it

Ancestry is in practice a highly ambiguous concept, far from an objective counterpart to race or ethnicity, and it does not represent a “safe haven” for researchers seeking to avoid evoking race or ethnicity in their work.



Interpreting principal component analyses of spatial population genetic variation

It is found that gradients and waves observed in Cavalli-Sforza et al.'s maps resemble sinusoidal mathematical artifacts that arise generally when PCA is applied to spatial data, implying that the patterns do not necessarily reflect specific migration events.

Quantification of Population Structure Using Correlated SNPs by Shrinkage Principal Components

This work demonstrated that LD patterns in genome-wide association datasets can distort the techniques for stratification control, showing ‘subpopulations’ reflecting localized LD phenomena rather than plausible population structure.

A Genealogical Interpretation of Principal Components Analysis

For SNP data the projection of samples onto the principal components can be obtained directly from considering the average coalescent times between pairs of haploid genomes, which provides a framework for interpreting PCA projections in terms of underlying processes, including migration, geographical isolation, and admixture.

Clustering by genetic ancestry using genome-wide SNP data

A novel algorithm to cluster individuals into groups with similar ancestral backgrounds based on the principal components computed by PCA is developed and it is shown that matching cases and controls using the clusters assigned by the algorithm substantially reduces population stratification bias.

Be careful with your principal components

  • M. Björklund
  • Computer Science
    Evolution; international journal of organic evolution
  • 2019
A number of simple test statistics appropriate for testing PCs are reviewed, and a real‐world example is used to illustrate how this can be done using randomization tests.

Population Structure and Eigenanalysis

An approach to studying population structure (principal components analysis) is discussed that was first applied to genetic data by Cavalli-Sforza and colleagues, and results from modern statistics are used to develop formal significance tests for population differentiation.

A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots

An approach is implemented to assess the goodness of fit of the model using the ancestry “palettes” estimated by CHROMOPAINTER and apply it to both simulated data and real case studies, allowing a richer and more robust analysis of recent demographic history.

Across-cohort QC analyses of GWAS summary statistics from complex traits

This study proposes four metrics and visualization tools for GWAMA, using summary statistics from cohort-level GWASs, along with methods to examine the concordance between demographic information and summary statistics and methods to investigate sample overlap.

Factor analysis of ancient population genomic samples

A factor analysis (FA) method in which individual scores are corrected for the effect of allele frequency drift over time is presented, to improve descriptive analyses of ancient DNA samples without requiring inclusion of outgroup or present-day samples.