Multiple comparative metagenomics using multiset k-mer counting
@article{Benoit2016MultipleCM, title={Multiple comparative metagenomics using multiset k-mer counting}, author={Ga{\"e}tan Benoit and Pierre Peterlongo and Mahendra Mariadassou and Erwan Drezen and Sophie Schbath and Dominique Lavenier and Claire Lemaitre}, journal={PeerJ Comput. Sci.}, year={2016}, volume={2}, pages={e94} }
Background
Large-scale metagenomic projects aim to extract biodiversity knowledge between different environmental conditions. Current methods for comparing microbial communities face important limitations. Those based on taxonomic or functional assignment rely on a small subset of the sequences that can be associated with known organisms. On the other hand, de novo methods, which compare the whole sets of sequences, either do not scale up on ambitious metagenomic projects or do not provide…
75 Citations
Libra: scalable k-mer–based tool for massive all-vs-all metagenome comparisons
- Computer Science
- GigaScience
- 2019
A tool called Libra is developed that performs an all-vs-all comparison of metagenomes for precise clustering based on their k-mer content in a Hadoop architecture that can scale to any size dataset to enable global-scale analyses and link microbial signatures to biological processes.
SimkaMin: fast and resource frugal de novo comparative metagenomics
- Biology
- Bioinform.
- 2020
SimkaMin is presented, a quick comparative metagenomics tool with low disk and memory footprints, thanks to an efficient data subsampling scheme used to estimate Bray-Curtis and Jaccard dissimilarities.
Identifying Group-Specific Sequences for Microbial Communities Using Long k-mer Sequence Signatures
- Biology
- Front. Microbiol.
- 2018
A long k-mer (k ≥ 30 bps)-based computational pipeline to detect group-specific sequences at strain resolution free from reference sequences, sequence alignments, and metagenome-wide de novo assembly is developed with parallel computing.
A New Alignment-Free Whole Metagenome Comparison Tool and Its Application on Gut Microbiomes of Wild Giant Pandas
- Engineering
- Frontiers in Microbiology
- 2020
A new alignment-free tool, KmerFreqCalc, is developed for comparing whole metagenomic data; it first calculates the frequencies of both forward and reverse-complementary k-mer sequences, like Mash, and then computes the cosine distance between samples from their k-mer frequency vectors, like Libra.
Recentrifuge: Robust comparative analysis and contamination removal for metagenomics
- Biology
- PLoS Comput. Biol.
- 2019
Recentrifuge implements a robust method for the removal of negative-control and crossover taxa from the rest of samples and provides shared and exclusive taxa per sample, thus enabling robust contamination removal and comparative analysis in environmental and clinical metagenomics.
Comparison of microbiome samples: methods and computational challenges
- Biology
- Briefings Bioinform.
- 2021
Current solutions for three key challenges in the comparison of metagenomic next-generation sequencing data sets are presented, considering both reference-based methods relying on a database of reference genomes and reference-free methods working directly on all sequencing reads from the samples.
MAGNETO: An Automated Workflow for Genome-Resolved Metagenomics
- Biology
- bioRxiv
- 2022
MAGNETO is presented, an automated workflow dedicated to MAG reconstruction, which includes a fully-automated coassembly step informed by optimal clustering of metagenomic distances, and implements complementary genome binning strategies, for improving MAG recovery.
Software for Systematics and Evolution
- Biology
- 2020
APPLES, a distance-based method for phylogenetic placement, is introduced and it is shown that using dense references improves accuracy substantially so that APPLES on dense trees is more accurate than ML on sparser trees, where it can run.
Recentrifuge: robust comparative analysis and contamination removal for metagenomic data
- Biology
- bioRxiv
- 2018
With Recentrifuge, researchers can analyze results from Centrifuge and LMAT classifiers using interactive hierarchical pie charts with special emphasis on the confidence level of the classifications, thus enabling robust comparative analysis of multiple samples in any metagenomic study.
Fast Approximation of Frequent k-Mers and Applications to Metagenomics.
- Computer Science
- Journal of computational biology: a journal of computational molecular cell biology
- 2019
This study develops, analyzes, and tests a sampling-based approach, called Sampling Algorithm for K-mErs approxIMAtion (SAKEIMA), to approximate the frequent k-mers and their frequencies in a high-throughput sequencing data set while providing rigorous guarantees on the quality of the approximation.
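Several of the citing tools above compare samples through their k-mer content, for example via the Bray-Curtis and Jaccard dissimilarities that SimkaMin estimates. The following is a minimal Python sketch of that general idea, not the implementation of Simka, SimkaMin, or any other tool listed here. All function names are hypothetical, and the sketch assumes exact in-memory counting of canonical k-mers (each k-mer merged with its reverse complement), which would not scale to real metagenomes.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers over all reads (hypothetical helper, exact counting)."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def jaccard_dissimilarity(a, b):
    """Presence/absence dissimilarity: 1 - |shared k-mers| / |all k-mers|."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    shared = a.keys() & b.keys()
    return 1.0 - len(shared) / len(union)


def bray_curtis_dissimilarity(a, b):
    """Abundance-based dissimilarity: 1 - 2 * sum(min counts) / (total counts)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    return 1.0 - 2.0 * shared / total


if __name__ == "__main__":
    sample_a = count_kmers(["ACGTAC", "AAAA"], k=3)
    sample_b = count_kmers(["ACGTTT"], k=3)
    print(jaccard_dissimilarity(sample_a, sample_b))
    print(bray_curtis_dissimilarity(sample_a, sample_b))
```

Jaccard here ignores abundances and only asks which k-mers are present, whereas Bray-Curtis weights shared k-mers by their counts, so two samples with the same k-mer set but different coverage have a Jaccard dissimilarity of 0 and a non-zero Bray-Curtis dissimilarity.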
References
Showing 1-10 of 56 references
MetaFast: fast reference-free graph-based comparison of shotgun metagenomic data
- Biology
- Bioinform.
- 2016
MetaFast is developed, an approach that represents a shotgun metagenome from an arbitrary environment as a modified de Bruijn graph consisting of simplified components; it is computationally efficient and especially promising for the analysis of metagenomes from novel environmental niches.
Assessment of k-mer spectrum applicability for metagenomic dissimilarity analysis
- Biology
- BMC Bioinformatics
- 2015
The k-mer spectrum-based measure was found to behave similarly to one based on mapping to a reference gene catalog, but different from one using a genome catalog, which turned out to be associated with a significant presence of viral reads in a number of metagenomes.
Compareads: comparing huge metagenomic experiments
- Computer Science, Biology
- BMC Bioinformatics
- 2012
Using a new data structure based on Bloom filters, Compareads is a practical solution for de novo comparison of huge metagenomic samples; it retrieves biological information while scaling to huge datasets.
A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-Tuples
- Biology
- RECOMB
- 2010
The novel approach AbundanceBin, an application of the Lander-Waterman model to metagenomics, achieved accurate, unsupervised clustering of metagenomic sequences into different bins, such that the reads classified in a bin belong to species of identical or very similar abundances in the sample.
Spaced seeds improve k-mer-based metagenomic classification
- Biology
- Bioinform.
- 2015
It is shown that, within this general framework for metagenomic classification, spaced seeds significantly improve classification accuracy compared with traditional contiguous k-mers.
Exploration and retrieval of whole-metagenome sequencing samples
- Biology
- Bioinform.
- 2014
A content-based exploration and retrieval method for whole-metagenome sequencing samples using a distributed string mining framework to efficiently extract all informative sequence k-mers from a pool of metagenomic samples and use them to measure the dissimilarity between two samples.
Exploration and retrieval of whole-metagenome sequencing
- Computer Science
- 2014
A content-based exploration and retrieval method for whole-metagenome sequencing samples using a distributed string mining framework to efficiently extract all informative sequence k-mers from a pool of metagenomic samples and use them to measure the dissimilarity between two samples.
Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes
- Biology
- Nature Biotechnology
- 2014
This work presents a method, based on binning co-abundant genes across a series of metagenomic samples, that enables comprehensive discovery of new microbial organisms, viruses and co-inherited genetic entities and aids assembly of microbial genomes without the need for reference sequences.
A Guide to Enterotypes across the Human Body: Meta-Analysis of Microbial Community Structures in Human Microbiome Datasets
- Biology
- PLoS Comput. Biol.
- 2013
This work tested how the following factors influenced the detection of enterotypes: clustering methodology, distance metrics, OTU-picking approaches, sequencing depth, data type (whole genome shotgun (WGS) vs. 16S rRNA gene sequence data), and 16S rRNA region.
The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families
- Biology
- PLoS biology
- 2007
This work used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling sequences to add a great deal of diversity to known protein families and shed light on their evolution.