A Phylogeny-Regularized Sparse Regression Model for Predictive Modeling of Microbial Community Data

Jian Xiao, Li Chen, Yue Yu, Xianyang Zhang, Jun Chen

Frontiers in Microbiology

Fueled by technological advancement, there has been a surge of human microbiome studies surveying the microbial communities associated with the human body and their links with health and disease. As a complement to the human genome, the human microbiome holds great potential for precision medicine. Efficient predictive models based on microbiome data could potentially be used in various clinical applications such as disease diagnosis, patient stratification, and drug response prediction. One…

A novel deep learning method for predictive modeling of microbiome data

This work proposes MDeep (microbiome-based deep learning method), a novel deep learning method for predicting both continuous and binary outcomes, and demonstrates that MDeep outperforms competing methods in both regression and binary classification.

RFtest: A Robust and Flexible Community-Level Test for Microbiome Data Powerfully Detects Phylogenetically Clustered Signals

This study proposes “Random Forest Test” (RFtest), a global (community-level) test based on random forest for high-dimensional and phylogenetically structured microbiome data. RFtest is a permutation test that uses the generalization error of the random forest as the test statistic.

PopPhy-CNN: A Phylogenetic Tree Embedded Architecture for Convolutional Neural Networks to Predict Host Phenotype From Metagenomic Data

PopPhy-CNN is a practical deep learning framework for predicting host phenotype that also facilitates the retrieval of predictive microbial taxa. Its competitiveness with other available methods is shown on nine metagenomic datasets of moderate size for binary classification.

Microbiome compositional analysis with logistic-tree normal models

This work introduces a generative model, called the “logistic-tree normal” (LTN) model, that marries two popular classes of models—namely, log-ratio normal (LN) and Dirichlet-tree (DT)—and inherits the key benefits of each.

A Review and Tutorial of Machine Learning Methods for Microbiome Host Trait Prediction

The most commonly used machine learning methods are explored, and their prediction accuracy as applied to microbiome host trait prediction is evaluated.

Sparse least trimmed squares regression with compositional covariates for high-dimensional data

The numerical performance of the proposed method is evaluated via simulation studies, and its usefulness is illustrated by an application to a microbiome study with the aim to predict caffeine intake based on the human gut microbiome composition.

Feature selection and causal analysis for microbiome studies in the presence of confounding using standardization

Standardization enables more accurate identification of individual microbiome features with an effect on the outcome of interest compared to other variable selection and estimation procedures when there is confounding by a categorical variable.

Principal Amalgamation Analysis for Microbiome Data

This work proposes Principal Amalgamation Analysis (PAA), a novel amalgamation-based and taxonomy-guided dimension reduction paradigm for microbiome data that aims to aggregate the compositions into a smaller number of principal compositions, guided by the available taxonomic structure.

Image and graph convolution networks improve microbiome-based machine learning accuracy

Two novel methods are suggested that use bacterial taxonomy to combine information from different bacteria and improve data representation for machine learning. Both algorithms are shown to improve the performance of static 16S rRNA gene sequence-based machine learning compared to the best state-of-the-art methods.



Predictive Modeling of Microbiome Data Using a Phylogeny-Regularized Generalized Linear Mixed Model

This work develops “glmmTree”, a prediction method based on a generalized linear mixed model framework for capturing clustered and dense microbiome signals. It outperformed existing methods in the dense and clustered signal scenarios.

Phylogeny-Based Kernels with Application to Microbiome Association Studies

This work provides a three-parameter phylogeny-based kernel that allows modeling a wide range of nonlinear relationships. The kernel has a nice biological interpretation, and by tuning the parameter one can gain insights about how the microbiome interacts with the environment.

False discovery rate control incorporating phylogenetic tree increases detection power in microbiome‐wide multiple testing

A new FDR control procedure that incorporates prior structure information is proposed and applied to microbiome data. It achieves power similar to that of traditional procedures that do not take the tree structure into account.

Phylogeny-based classification of microbial communities

This work presents a novel supervised classification method for microbial community samples, where each sample is represented as a set of OTU frequencies. The method uses the natural structure in microbial community data, encoded by a phylogenetic tree, to exploit environment-specific compositional patterns that may contain features at multiple granularity levels.

Structure-constrained sparse canonical correlation analysis with an application to microbiome data analysis

This work developed a method for structure-constrained sparse canonical correlation analysis (ssCCA) in a high-dimensional setting that takes into account the phylogenetic relationships among bacteria, which provide important prior knowledge on how bacterial taxa are related evolutionarily.

Supervised classification of human microbiota

This review demonstrates that several existing supervised classifiers can be applied effectively to microbiota classification, both for selecting subsets of taxa that are highly discriminative of the type of community, and for building models that can accurately classify unlabeled data.

Phylogenetic approaches to microbial community classification

The classification of oral microbiota remains a challenging problem; the best accuracy on the plaque dataset was approximately 81%.


This paper uses kernel-based methods to show how to incorporate a variety of extrinsic information, such as phylogeny, into penalized regression models that estimate taxon-specific associations with a phenotype or clinical outcome. It also shows how this regression framework can be used to address the compositional nature of multivariate predictors composed of relative abundances.

Machine Learning Meta-analysis of Large Metagenomic Datasets: Tools and Biological Insights

A computational framework for prediction tasks using quantitative microbiome profiles, including species-level relative abundances and presence of strain-specific markers, is developed, which can be considered a first step toward defining general microbial dysbiosis.

Microbiomes in light of traits: A phylogenetic perspective

Key aspects of microbial traits are reviewed, and a synthesis of these studies reveals that, despite the promiscuity of horizontal gene transfer (HGT), microbial traits appear to be phylogenetically conserved, that is, not distributed randomly across the tree of life.