Corpus ID: 88524311

Multi-sample Estimation of Bacterial Composition Matrix in Metagenomics Data

Yuanpei Cao, Anru R. Zhang, Hongzhe Li
arXiv: Methodology
Metagenomics sequencing is routinely applied to quantify bacterial abundances in microbiome studies, where the bacterial composition is estimated based on the sequencing read counts. Due to limited sequencing depth and DNA dropouts, many rare bacterial taxa might not be captured in the final sequencing reads, which results in many zero counts. Naive composition estimation using count normalization leads to many zero proportions, which tend to result in inaccurate estimates of bacterial… 

Clustering microbiome data using mixtures of logistic normal multinomial models
This paper develops a novel mixture of logistic normal multinomial models for clustering microbiome data and utilizes an efficient framework for parameter estimation using variational Gaussian approximations (VGA).
High-dimensional Log-Error-in-Variable Regression with Applications to Microbial Compositional Data Analysis
In microbiome and genomic study, the regression of compositional data has been a crucial tool for identifying microbial taxa or genes that are associated with clinical phenotypes. To account for the
Estimating diversity in networked ecological communities.
This article leverages models from the compositional data literature that explicitly account for co-occurrence networks and uses them to estimate diversity, and finds that the greatest gains of the method are in strongly networked communities with many taxa.
An Optimal Statistical and Computational Framework for Generalized Tensor Estimation
This paper describes a flexible framework for generalized low-rank tensor estimation problems that includes many important instances arising from applications in computational imaging, genomics, and network analysis, and proves the superiority of the proposed framework via extensive experiments.
Doubly-Stochastic Normalization of the Gaussian Kernel is Robust to Heteroskedastic Noise
It is proved that in a suitable high-dimensional setting where heteroskedastic noise does not concentrate too much in any particular direction in space, the resulting (doubly-stochastic) noisy affinity matrix converges to its clean counterpart with rate m^{-1/2}, where m is the ambient dimension.
Heteroskedastic PCA: Algorithm, optimality, and applications
This paper proposes an algorithm called HeteroPCA, which involves iteratively imputing the diagonal entries to remove the bias due to heteroskedasticity and is computationally efficient and provably optimal under the generalized spiked covariance model.
Freeness over the diagonal and outliers detection in deformed random matrices with a variance profile
We study the eigenvalue distribution of a GUE matrix with a variance profile that is perturbed by an additive random matrix that may possess spikes. Our approach is guided by Voiculescu's notion of
Composition Estimation via Shrinkage
In this note, we explore a simple approach to composition estimation, using penalized likelihood density estimation on a nominal discrete domain. Practical issues such as smoothing parameter


Robust estimation of microbial diversity in theory and in practice
It is argued that one cannot reliably estimate the absolute and relative number of microbial species present in a community without making unsupported assumptions about species abundance distributions, and the use of Shannon and Simpson diversity, rather than species richness, is recommended in efforts to quantify and compare microbial diversity.
Regression Analysis for Microbiome Compositional Data
One important problem in microbiome analysis is to identify the bacterial taxa that are associated with a response, where the microbiome data are summarized as the composition of the bacterial taxa
A global network of coexisting microbes from environmental and whole-genome sequence data.
A global meta-analysis of previously sampled microbial lineages in the environment is presented, hypothesizing that groupings of lineages are often ancient, and that they may have significantly impacted on genome evolution.
Variable selection in regression with compositional covariates
An l1 regularization method for the linear log-contrast model that respects the unique features of compositional data is proposed and its usefulness is illustrated by an application to a microbiome study relating human body mass index to gut microbiome composition.
Large Covariance Estimation for Compositional Data Via Composition-Adjusted Thresholding
The problem of covariance estimation for high-dimensional compositional data is addressed and a composition-adjusted thresholding (COAT) method under the assumption that the basis covariance matrix is sparse is introduced, which is scalable for large covariance matrices.
A framework for human microbiome research
Resources from a population of 242 healthy adults sampled at 15 or 18 body sites up to three times are presented, which have generated 5,177 microbial taxonomic profiles from 16S ribosomal RNA genes and over 3.5 terabases of metagenomic sequence so far.
Microbial Co-occurrence Relationships in the Human Microbiome
Applying an ensemble method based on multiple similarity measures, in combination with generalized boosted linear models (GBLMs), to taxonomic marker (16S rRNA gene) profiles of this cohort resulted in a global network of 3,005 significant co-occurrence and co-exclusion relationships between 197 clades occurring throughout the human microbiome.
A core gut microbiome in obese and lean twins
The faecal microbial communities of adult female monozygotic and dizygotic twin pairs concordant for leanness or obesity, and their mothers are characterized to address how host genotype, environmental exposure and host adiposity influence the gut microbiome.
A comparison of taxon co-occurrence patterns for macro- and microorganisms.
It is shown that assemblages of microorganisms demonstrate nonrandom patterns of co-occurrence that are broadly similar to those found in assemblages of macroorganisms, suggesting that some taxon co-occurrence patterns may be general characteristics of communities of organisms from all domains of life.