Analyzing genome coverage profiles with applications to quality control in metagenomics

Authors: Martin Michael Serenus Lindner, Maximilian Kollock, Franziska Zickmann, Bernhard Y. Renard
Volume 29, Issue 10

MOTIVATION: Genome coverage, the number of sequencing reads mapped to a position in a genome, is an insightful indicator of irregularities within sequencing experiments. While the average genome coverage is frequently used within algorithms in computational genomics, the complete information available in coverage profiles (i.e. histograms over all coverages) is currently not exploited to its full extent. Thus, biases such as fragmented or erroneous reference genomes often remain unaccounted for…
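
The two quantities defined above, per-position genome coverage and the coverage profile (a histogram over all coverages), can be sketched in a few lines of Python. This is a minimal illustration only, not the paper's implementation: the function names and the representation of mapped reads as half-open `(start, end)` intervals are assumptions made for the example.

```python
from collections import Counter


def per_position_coverage(genome_length, reads):
    """Return the number of reads covering each genome position.

    `reads` is a list of half-open (start, end) intervals; this
    representation is an assumption of the sketch.
    """
    # Difference array: +1 where a read starts, -1 just past its end.
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        start = max(0, start)
        end = min(genome_length, end)  # clip reads that overhang the genome
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    # A running sum of the difference array gives the depth at each position.
    coverage = []
    depth = 0
    for pos in range(genome_length):
        depth += diff[pos]
        coverage.append(depth)
    return coverage


def coverage_profile(coverage):
    """Histogram mapping each coverage value to the number of positions with it."""
    return dict(Counter(coverage))


coverage = per_position_coverage(10, [(0, 4), (2, 6), (8, 12)])
profile = coverage_profile(coverage)
mean_coverage = sum(coverage) / len(coverage)
```

The mean collapses the profile to a single number, whereas the histogram still shows, for instance, how many positions received no reads at all.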

Sequana coverage: detection and characterization of genomic variations using running median and mixture models

A stand-alone application is provided that reports genomic regions of interest (ROIs) that are significantly over- or underrepresented in high-throughput sequencing data; a z-score statistic is assigned to each base position and used to separate the central distribution from the ROIs.
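
The idea of normalizing coverage by a running median and then scoring each position can be sketched as follows. This is a simplified illustration, not the Sequana implementation: in place of a mixture-model fit it uses a robust z-score based on the median and the median absolute deviation (MAD), and the function names, window size and threshold are all assumptions.

```python
from statistics import median


def running_median(values, window):
    """Median over a centered window, truncated at the sequence edges."""
    half = window // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def robust_zscores(coverage, window=5):
    """Divide coverage by its running median, then z-score with median/MAD."""
    trend = running_median(coverage, window)
    # Positions whose local trend is zero get a neutral ratio (assumption).
    ratios = [c / t if t > 0 else 1.0 for c, t in zip(coverage, trend)]
    center = median(ratios)
    mad = median([abs(r - center) for r in ratios])
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    if scale == 0:
        # No spread to measure against; this sketch then reports no outliers.
        return [0.0] * len(ratios)
    return [(r - center) / scale for r in ratios]


def regions_of_interest(zscores, threshold=4.0):
    """Positions whose absolute z-score exceeds the (assumed) threshold."""
    return [i for i, z in enumerate(zscores) if abs(z) > threshold]


coverage = [10, 11, 9, 10, 12, 10, 9, 11, 10, 50,
            10, 9, 11, 10, 12, 10, 9, 11, 10, 10]
flagged = regions_of_interest(robust_zscores(coverage, window=5))
```

Because the spike at position 9 is narrower than half the window, the running median is not pulled toward it, so the spike stands out after normalization.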

MetaFlow: Metagenomic profiling based on whole-genome coverage analysis with min-cost flows

MetaFlow is the first method based on coverage analysis across entire genomes that also scales to HTS samples, and is more precise and sensitive than popular tools such as MetaPhlAn, mOTU, GSMer and BLAST, and its abundance estimations at species level are two to four times better in terms of ℓ1-norm.
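
For readers unfamiliar with the ℓ1-norm as an error measure for abundance estimates, a minimal sketch is given below. The dictionary representation of abundance profiles and the function name are assumptions for illustration; the exact evaluation protocol of the MetaFlow study is not specified here.

```python
def l1_error(true_abundance, estimated_abundance):
    """Sum of absolute abundance differences over all species in either profile.

    Species missing from one profile are treated as having abundance 0.
    """
    species = set(true_abundance) | set(estimated_abundance)
    return sum(
        abs(true_abundance.get(s, 0.0) - estimated_abundance.get(s, 0.0))
        for s in species
    )


truth = {"species_A": 0.5, "species_B": 0.5}
estimate = {"species_A": 0.75, "species_C": 0.25}
error = l1_error(truth, estimate)
```

A missed species and a falsely reported one both contribute their full abundance to the error, so the measure penalizes false negatives and false positives alike.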

Reading the Underlying Information From Massive Metagenomic Sequencing Data

The aim of the paper is to provide readers a whole picture of metagenomic data processing and analysis, and a reference and perspective to start with for computational scientists who are interested in this exciting field.

Measuring Genome Sizes Using Read-Depth, k-mers, and Flow Cytometry: Methodological Comparisons in Beetles (Coleoptera)

This project compares estimation methods using next-generation sequencing to measurements from flow cytometry, the gold standard for genome size measures, using ground beetles and other members of the beetle suborder Adephaga as the test system, and presents a new protocol for using read-depth of single-copy genes to estimate genome size.
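
The arithmetic behind read-depth based genome size estimation can be sketched as below, using the common relation that genome size is approximately the total number of sequenced bases divided by the average depth at single-copy loci. This is an assumed simplification for illustration, not the protocol from the study, and the function and parameter names are hypothetical.

```python
def estimate_genome_size(total_sequenced_bases, single_copy_depths):
    """Estimate genome size as total bases / mean depth over single-copy genes."""
    if not single_copy_depths:
        raise ValueError("need at least one single-copy gene depth")
    mean_depth = sum(single_copy_depths) / len(single_copy_depths)
    if mean_depth <= 0:
        raise ValueError("mean depth must be positive")
    return total_sequenced_bases / mean_depth


# 30 Gb of sequence with single-copy genes covered about 60-fold on average.
size = estimate_genome_size(30_000_000_000, [58, 60, 62])
```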

Computational methods and graphical models for integrative proteogenomics

New proteogenomic approaches are presented for integrating next-generation sequencing and mass spectrometry data, in the form of DNA and RNA-Seq reads and tandem mass spectra, together with IPred, a computational approach that explicitly combines the results of ab initio gene finders and evidence-based methods.

Computational methods for the identification and quantification of microbial organisms in metagenomes

The Genome Abundance Similarity Correction (GASiC) algorithm is introduced, a method for differentiating between and quantifying highly similar microbial organisms in a metagenomic sample, and a taxonomic profiling tool called MicrobeGPS is developed.

In vitro and in silico parameters for precise cgMLST typing of Listeria monocytogenes

This work highlights that bioinformatics workflows dedicated to cgMLST allele calling are largely robust when paired-end reads are of high quality and when the sequencing depth is ≥40X.

Introduction to Population Genomics Methods.

This chapter introduces population genomics methods to beginners, following a learning-by-doing strategy, to help readers analyze sequencing data by themselves.

Chapter 13 Introduction to population genomics methods

The objective of this chapter is to introduce population genomics methods to beginners, following a learning-by-doing strategy, to help readers analyze sequencing data by themselves.

Metagenomic Profiling of Known and Unknown Microbes with MicrobeGPS

MicrobeGPS is the first method that identifies microbiota in a sample and estimates their genomic distances to known reference genomes, which can enable reference-based taxonomic profiling of complex and less characterized microbial communities.



Accurate Genome Relative Abundance Estimation Based on Shotgun Metagenomic Reads

A unified probabilistic framework (GRAMMy) is presented that explicitly models read assignment ambiguities, genome size biases and read distributions along the genomes using mixture model theory; it is demonstrated to give estimates that are accurate and robust across both simulated and real read benchmark datasets.

Estimating DNA coverage and abundance in metagenomes using a gamma approximation

A gamma distribution is employed to model a metagenome as a population of DNA fragments (bins), each of which may be covered by one or more reads, and the number of bins that were not sequenced and that could potentially be revealed by additional sequencing is estimated.

ReadDepth: A Parallel R Package for Detecting Copy Number Alterations from Short Sequencing Reads

The readDepth package for R is presented, which can detect copy number alterations by measuring the depth of coverage obtained by massively parallel sequencing of the genome, and demonstrates a method for inferring copy number using reads generated by whole-genome bisulfite sequencing, thus enabling integrative study of epigenomic and copy number alteration.

Metagenomic abundance estimation and diagnostic testing on species level

Genome Abundance Similarity Correction (GASiC) is developed, a method to estimate true genome abundances via read alignment by considering reference genome similarities in a non-negative LASSO approach, and its superior performance over existing methods is shown on simulated benchmark data as well as on real data.

Qualimap: evaluating next-generation sequencing alignment data

Qualimap is a Java application that supports user-friendly quality control of mapping data by considering sequence features and their genomic properties; it takes sequence alignment data and provides graphical and statistical analyses for the evaluation of the data.

Classification of metagenomic sequences: methods and challenges

The premise, methodologies, advantages, limitations and challenges of various methods available for binning of metagenomic datasets obtained using the shotgun sequencing approach are discussed.

Confidence-based Somatic Mutation Evaluation and Prioritization

An algorithm to assign a single statistic, a false discovery rate (FDR), to each somatic mutation identified by NGS, which accurately discriminates true mutations from erroneous calls and enables statistical comparisons of lab and computation methodologies, including ROC curves and AUC metrics.

MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads

MetaVelvet succeeded in generating higher N50 scores and smaller chimeric scaffolds than any of the compared single-genome assemblers, produced scaffolds as high in quality as those from separate Velvet assemblies of isolated species sequence reads, and reconstructed even relatively low-coverage genome sequences as scaffolds.
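
The N50 score mentioned above is a standard assembly contiguity statistic: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly length. A minimal sketch follows; the function name is chosen for illustration and is not taken from MetaVelvet.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


score = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Here the total length is 54, and the three longest scaffolds (10, 9 and 8) sum to 27, exactly half, so the N50 is 8.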

A human gut microbial gene catalogue established by metagenomic sequencing

The Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million non-redundant microbial genes, derived from 576.7 gigabases of sequence, from faecal samples of 124 European individuals are described, indicating that the entire cohort harbours between 1,000 and 1,150 prevalent bacterial species and each individual at least 160 such species.

Mason – A Read Simulator for Second Generation Sequencing Data

A read simulator software for Illumina, 454 and Sanger reads that has been written with performance in mind and can sample reads from large genomes.