Multiple comparative metagenomics using multiset k-mer counting

@article{Benoit2016MultipleCM,
  title={Multiple comparative metagenomics using multiset k-mer counting},
  author={Ga{\"e}tan Benoit and Pierre Peterlongo and Mahendra Mariadassou and Erwan Drezen and Sophie Schbath and Dominique Lavenier and Claire Lemaitre},
  journal={PeerJ Comput. Sci.},
  year={2016},
  volume={2},
  pages={e94}
}
Background: Large-scale metagenomic projects aim to extract biodiversity knowledge between different environmental conditions. Current methods for comparing microbial communities face important limitations. Those based on taxonomic or functional assignment rely on the small subset of sequences that can be associated with known organisms. De novo methods, which compare the whole sets of sequences, either do not scale up to ambitious metagenomic projects or do not provide…

Libra: scalable k-mer–based tool for massive all-vs-all metagenome comparisons
TLDR
A tool called Libra is developed that performs an all-vs-all comparison of metagenomes for precise clustering based on their k-mer content in a Hadoop architecture that can scale to any size dataset to enable global-scale analyses and link microbial signatures to biological processes.
SimkaMin: fast and resource frugal de novo comparative metagenomics
TLDR
SimkaMin is presented, a quick comparative metagenomics tool with low disk and memory footprints, thanks to an efficient data subsampling scheme used to estimate Bray-Curtis and Jaccard dissimilarities.
Identifying Group-Specific Sequences for Microbial Communities Using Long k-mer Sequence Signatures
TLDR
A long k-mer (k ≥ 30 bps)-based computational pipeline to detect group-specific sequences at strain resolution free from reference sequences, sequence alignments, and metagenome-wide de novo assembly is developed with parallel computing.
A New Alignment-Free Whole Metagenome Comparison Tool and Its Application on Gut Microbiomes of Wild Giant Pandas
TLDR
A new alignment-free tool, KmerFreqCalc, is developed for the comparison of the whole metagenomic data, which first calculated the frequencies of both forward and reverse complementary sequences of k-mers like Mash and then computed the cosine distance between the samples based on k-mer frequency vectors like Libra.
Recentrifuge: Robust comparative analysis and contamination removal for metagenomics
TLDR
Recentrifuge implements a robust method for the removal of negative-control and crossover taxa from the rest of samples and provides shared and exclusive taxa per sample, thus enabling robust contamination removal and comparative analysis in environmental and clinical metagenomics.
Comparison of microbiome samples: methods and computational challenges
TLDR
Current solutions for three key challenges in the comparison of metagenomic next-generation sequencing data sets are presented, considering both reference-based methods relying on a database of reference genomes and reference-free methods working directly on all sequencing reads from the samples.
MAGNETO: An Automated Workflow for Genome-Resolved Metagenomics
TLDR
MAGNETO is presented, an automated workflow dedicated to MAG reconstruction, which includes a fully-automated coassembly step informed by optimal clustering of metagenomic distances, and implements complementary genome binning strategies, for improving MAG recovery.
Software for Systematics and Evolution
TLDR
APPLES, a distance-based method for phylogenetic placement, is introduced and it is shown that using dense references improves accuracy substantially so that APPLES on dense trees is more accurate than ML on sparser trees, where it can run.
Recentrifuge: robust comparative analysis and contamination removal for metagenomic data
TLDR
With Recentrifuge, researchers can analyze results from the Centrifuge and LMAT classifiers using interactive hierarchical pie charts with special emphasis on the confidence level of the classifications, thus enabling robust comparative analysis of multiple samples in any metagenomic study.
Fast Approximation of Frequent k-Mers and Applications to Metagenomics.
TLDR
This study develops, analyzes, and tests a sampling-based approach, called Sampling Algorithm for K-mErs approxIMAtion (SAKEIMA), to approximate the frequent k-mers and their frequencies in a high-throughput sequencing data set while providing rigorous guarantees on the quality of the approximation.

References

MetaFast: fast reference-free graph-based comparison of shotgun metagenomic data
TLDR
MetaFast is developed, an approach that represents a shotgun metagenome from an arbitrary environment as a modified de Bruijn graph consisting of simplified components; it is computationally efficient and especially promising for the analysis of metagenomes from novel environmental niches.
Assessment of k-mer spectrum applicability for metagenomic dissimilarity analysis
TLDR
The k-mer spectrum-based measure was found to behave similarly to one based on mapping to a reference gene catalog, but different from one using a genome catalog, which turned out to be associated with a significant presence of viral reads in a number of metagenomes.
Compareads: comparing huge metagenomic experiments
TLDR
Using a new data structure based on Bloom filters, Compareads is a practical solution for de novo comparison of huge metagenomic samples and makes it possible to retrieve biological information while scaling to huge datasets.
A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-Tuples
TLDR
The novel approach AbundanceBin, an application of the Lander-Waterman model to metagenomics, achieved accurate, unsupervised, clustering of metagenomic sequences into different bins, such that the reads classified in a bin belong to species of identical or very similar abundances in the sample.
Spaced seeds improve k-mer-based metagenomic classification
TLDR
It is shown that, within this general framework for metagenomic classification, spaced seeds provide a significant improvement in classification accuracy over traditional contiguous k-mers.
Exploration and retrieval of whole-metagenome sequencing samples
TLDR
A content-based exploration and retrieval method for whole-metagenome sequencing samples using a distributed string mining framework to efficiently extract all informative sequence k-mers from a pool of metagenomic samples and use them to measure the dissimilarity between two samples.
Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes
TLDR
This work presents a method, based on binning co-abundant genes across a series of metagenomic samples, that enables comprehensive discovery of new microbial organisms, viruses and co-inherited genetic entities and aids assembly of microbial genomes without the need for reference sequences.
A Guide to Enterotypes across the Human Body: Meta-Analysis of Microbial Community Structures in Human Microbiome Datasets
TLDR
This work tested how the following factors influenced the detection of enterotypes: clustering methodology, distance metrics, OTU-picking approaches, sequencing depth, data type (whole genome shotgun (WGS) vs. 16S rRNA gene sequence data), and 16S rRNA region.
The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families
TLDR
This work used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling sequences to add a great deal of diversity to known protein families and shed light on their evolution.