Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.

@article{Koren2017CanuSA,
  title={Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.},
  author={Sergey Koren and Brian P. Walenz and Konstantin Berlin and Jason Rafe Miller and Nicholas H. Bergman and Adam M. Phillippy},
  journal={Genome Research},
  year={2017},
  volume={27},
  number={5},
  pages={722--736}
}
Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. [...] Key result: Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2.

Fast-SG: an alignment-free algorithm for hybrid assembly
TLDR
This paper proposes a new method, called FAST-SG, which uses a new ultra-fast alignment-free algorithm specifically designed for constructing a scaffolding graph using light-weight data structures, which allows the reuse of efficient algorithms designed for short read data and permits the definition of novel modular hybrid assembly pipelines.
Fast and accurate reference-guided scaffolding of draft genomes
TLDR
RaGOO is presented, an open-source reference-guided contig ordering and orienting tool that leverages the speed and sensitivity of Minimap2 to accurately achieve chromosome-scale assemblies in just minutes; it outperforms error-prone reference-free methods and enables rapid pan-genome analysis.
Benchmarking of long-read assemblers for prokaryote whole genome sequencing.
TLDR
Of the assemblers tested, Flye, Miniasm/Minipolish and Raven performed best overall; however, no single tool performed well on all metrics, highlighting the need for continued development of long-read assembly algorithms.
Errors in long-read assemblies can critically affect protein prediction
TLDR
The prevalence of indel errors in the recently published Jain et al. MinION and Illumina assembly of the human genome is investigated, and comparisons to previously published long-read assemblies from PacBio data and short-read Illumina assemblies of the same cell lines are included.
RefKA: A fast and efficient long-read genome assembly approach for large and complex genomes
TLDR
RefKA is developed, a reference-based approach for long read genome assembly that relies on breaking up a closely related reference genome into bins, aligning k-mers unique to each bin with PacBio reads, and then assembling each bin in parallel followed by a final bin-stitching step.
HASLR: Fast Hybrid Assembly of Long Reads
TLDR
HASLR is a hybrid assembler which uses both second and third generation sequencing reads to efficiently generate accurate genome assemblies and is not only the fastest assembler but also the one with the lowest number of misassemblies on all the samples compared to other tested assemblers.
HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads
TLDR
This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of 9 complete human centromeric regions, a significant advance towards the complete assembly of human genomes.
BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper
TLDR
A probabilistic model is presented which demonstrates the soundness of using short, fixed length k-mers to detect overlaps, avoiding expensive pairwise alignment of all reads against all others and introduces a notion of reliable k-mer based on this model.
Distributed de novo assembler for large-scale long-read datasets
TLDR
This paper presents a distributed long-read assembler that can assemble large-scale noisy sequence datasets on thousands of cores, resulting in orders of magnitude faster assembly times.

References

SHOWING 1-10 OF 102 REFERENCES
Assembling large genomes with single-molecule sequencing and locality-sensitive hashing
TLDR
The MinHash Alignment Process (MHAP) is introduced for overlapping noisy, long reads using probabilistic, locality-sensitive hashing and can produce de novo near-complete eukaryotic assemblies that are 99.99% accurate when compared with available reference genomes.
Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences
TLDR
A new mapper, minimap, and a de novo assembler, miniasm, are presented for efficiently mapping and assembling SMRT and ONT reads without an error correction stage.
Hybrid error correction and de novo assembly of single-molecule sequencing reads
TLDR
This work introduces a correction algorithm and assembly strategy that uses short, high-fidelity sequences to correct the error in single-molecule sequences, leading to substantially better assemblies than current sequencing strategies.
Single-molecule sequencing and conformational capture enable de novo mammalian reference genomes
TLDR
This assembly represents a >250-fold improvement in contiguity compared to the previously published C. hircus assembly, and better resolves repetitive structures longer than 1 kb, supporting the most complete repeat family and immune gene complex representation ever produced for a ruminant species.
GAGE: A critical evaluation of genome assemblies and assembly algorithms.
TLDR
An evaluation of several of the leading de novo assembly algorithms on four different short-read data sets generated by Illumina sequencers concludes that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome.
Contiguous and accurate de novo assembly of metazoan genomes with modest long read coverage
TLDR
The results motivate a set of preliminary best practices for obtaining accurate and contiguous assemblies, a “missing manual” that guides key decisions in building high quality de novo genome assemblies, from DNA isolation to polishing the assembly.
Aggressive assembly of pyrosequencing reads with mates
TLDR
The revised pipeline called CABOG (Celera Assembler with the Best Overlap Graph) is robust to homopolymer run length uncertainty, high read coverage and heterogeneous read lengths, and in tests on four genomes, it generated the longest contigs among all assemblers tested.
Error correction and assembly complexity of single molecule sequencing reads
TLDR
A new data-driven model using support vector regression that can accurately predict assembly performance is developed and applied to several prokaryotic and eukaryotic genomes, and can achieve near-perfect assemblies of small genomes and substantially improved assemblies of larger ones.
Reducing assembly complexity of microbial genomes with single-molecule sequencing
TLDR
Automated assembly of long, single-molecule sequencing data reduces the cost of microbial finishing to $1,000 for most genomes, and future advances in this technology are expected to drive the cost lower.
Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data
We present a hierarchical genome-assembly process (HGAP) for high-quality de novo microbial genome assemblies using only a single, long-insert shotgun DNA library in conjunction with Single Molecule,