Scaling metagenome sequence assembly with probabilistic de Bruijn graphs

@article{Pell2012ScalingMS,
  title={Scaling metagenome sequence assembly with probabilistic de Bruijn graphs},
  author={Jason Pell and Arend Hintze and Rosangela Canino-Koning and Adina C. Howe and James M. Tiedje and C. Titus Brown},
  journal={Proceedings of the National Academy of Sciences},
  year={2012},
  volume={109},
  pages={13272--13277}
}
Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us… 
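The idea in the abstract, storing k-mers in a Bloom filter and treating the filter as an implicit de Bruijn graph, can be illustrated with a minimal sketch. This is not the authors' implementation: the class and function names, the BLAKE2-based hashing, and the filter size are all assumptions chosen for illustration, and reverse complements are ignored for brevity. Edges are not stored; a neighbor exists whenever one of the eight possible one-base extensions of a k-mer tests positive in the filter, so false positives can add spurious nodes but real k-mers are never lost.

```python
import hashlib
from collections import deque


class BloomFilter:
    """Bit array queried through several hashes: false positives possible, false negatives not."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # One salted BLAKE2b hash per hash function (an illustrative choice).
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(read, k):
    """All length-k substrings of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def neighbors(bloom, kmer):
    """K-mers overlapping `kmer` by k-1 bases that the filter reports as present."""
    for base in "ACGT":
        for candidate in (kmer[1:] + base, base + kmer[:-1]):
            if candidate != kmer and candidate in bloom:
                yield candidate


def component(bloom, start):
    """Breadth-first traversal of the implicit graph from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors(bloom, node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Loading the k-mers of each read with `add` and then calling `component` on any k-mer returns the connected set it belongs to, which is the kind of connectivity question the abstract describes. Memory is fixed by `num_bits` regardless of how many k-mers are inserted, at the cost of a false positive rate that grows as the filter fills.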
A parallel connectivity algorithm for de Bruijn graphs in metagenomic applications
TLDR
This paper presents the first parallel solution for decomposing the metagenomic assembly problem without compromising the post-assembly quality, and proposes a novel distributed memory algorithm to identify the connected subgraphs in the de Bruijn graph.
Parallel and Memory-Efficient Preprocessing for Metagenome Assembly
TLDR
METAPREP is a new end-to-end parallel implementation of a similar preprocessing step for de novo assembly of large metagenomic datasets and has efficient implementations of several computational subroutines that occur in other genomic data analysis problems, comparable to the state-of-the-art.
HaVec: An Efficient de Bruijn Graph Construction Algorithm for Genome Assembly
TLDR
This paper presents an approach called HaVec that attempts to achieve a balance between memory consumption and running time, and uses a hash table along with an auxiliary vector data structure to store the de Bruijn graph, thereby improving both the total memory usage and the running time.
deGSM: memory scalable construction of large scale de Bruijn Graph
TLDR
The experimental results demonstrate that the proposed lightweight parallel de Bruijn graph construction approach, deGSM, is able to handle very large genome sequence(s), e.g., the contigs and scaffolds recorded in the GenBank database and the Picea abies HTS dataset.
Assembly improvements by read mapping and phasing
TLDR
A novel approach combining the sequence alignment and the assembly fields: a read mapping algorithm working on a De Bruijn graph instead of sequences named BGREAT (de Bruijn Graph REad mApping Tool) to study state-of-the-art low memory and efficient algorithms.
Faucet: streaming de novo assembly graph construction
TLDR
Faucet used orders of magnitude less time and disk space than the specialized metagenome assemblers MetaSPAdes and Megahit, while also improving on their memory use; this broadly matched performance of other assemblers optimizing resource efficiency - namely, Minia and LightAssembler.
Assembling large, complex environmental metagenomes
TLDR
Two pre-assembly filtering approaches, digital normalization and partitioning, are applied to make large metagenome assemblies more tractable, and it is demonstrated that these methods result in assemblies nearly identical to assemblies from unprocessed data.
Accelerating large scale de novo metagenome assembly using GPUs
TLDR
This paper presents the first-of-its-kind GPU-accelerated implementation of the local assembly approach that is an integral part of a widely used large-scale metagenome assembler, MetaHipMer; it outperforms the CPU version by about 7x and boosts the performance of MetaHipMer by 42% when running on 64 Summit nodes.

References

Succinct data structures for assembling large genomes
TLDR
This article uses entropy compressed or succinct data structures to create a practical representation of the de Bruijn assembly graph, which requires at least a factor of 10 less storage than the kinds of structures used by deployed methods.
GAGE: A critical evaluation of genome assemblies and assembly algorithms.
TLDR
Evaluating several of the leading de novo assembly algorithms on four different short-read data sets generated by Illumina sequencers concludes that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome.
MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads
TLDR
MetaVelvet generated higher N50 scores and smaller chimeric scaffolds than any of the compared single-genome assemblers, produced scaffolds of a quality comparable to separate Velvet assemblies of sequence reads from isolated species, and reconstructed even relatively low-coverage genome sequences as scaffolds.
Velvet: algorithms for de novo short read assembly using de Bruijn graphs.
TLDR
Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies and is in close agreement with simulated results without read-pair information.
Meta-IDBA: a de Novo assembler for metagenomic data
TLDR
Comparison of the performance of Meta-IDBA and existing assemblers, such as Velvet and ABySS, on different metagenomic datasets shows that Meta-IDBA can reconstruct longer contigs with similar accuracy.
ABySS: a parallel assembler for short read sequence data.
TLDR
ABySS (Assembly By Short Sequences), a parallelized sequence assembler, was developed and assembled 3.5 billion paired-end reads from the genome of an African male publicly released by Illumina, Inc, representing 68% of the reference human genome.
High-quality draft assemblies of mammalian genomes from massively parallel sequence data
TLDR
An algorithm for genome assembly, ALLPATHS-LG, is developed and applied to massively parallel DNA sequence data from the human and mouse genomes, generated on the Illumina platform; the resulting assemblies have good accuracy, short-range contiguity, long-range connectivity, and coverage of the genome.
A Primer on Metagenomics
TLDR
A concise yet comprehensive introduction to the current computational requirements presented by metagenomics, and a few representative studies illustrating different facets of recent scientific discoveries made using metagenomics are provided.
De novo assembly and genotyping of variants using colored de Bruijn graphs
TLDR
An efficient software implementation, Cortex, the first de novo assembler capable of assembling multiple eukaryotic genomes simultaneously is provided, and how population information from ten chimpanzees enables accurate variant calls without a reference sequence is shown.
Efficient counting of k-mers in DNA sequences using a bloom filter
TLDR
A new method is presented that identifies all the k-mers that occur more than once in a DNA sequence data set using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements.