Optimal assembly for high throughput shotgun sequencing

@article{Bresler2013OptimalAF,
  title={Optimal assembly for high throughput shotgun sequencing},
  author={Guy Bresler and Ma'ayan Bresler and David Tse},
  journal={BMC Bioinformatics},
  year={2013},
  volume={14},
  pages={S18}
}
We present a framework for the design of optimal assembly algorithms for shotgun sequencing under the criterion of complete reconstruction. We derive a lower bound on the read length and the coverage depth required for reconstruction in terms of the repeat statistics of the genome. Building on earlier works, we design a de Bruijn graph-based assembly algorithm that comes very close to the lower bound for the repeat statistics of a wide range of sequenced genomes, including the GAGE datasets…
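The abstract refers to a de Bruijn graph built from reads and to repeat statistics of the genome. As a rough illustration only, and not the authors' algorithm, the sketch below builds a de Bruijn graph from error-free reads and lists the nodes with more than one successor, which is where repeats of length at least k-1 show up. The function names `de_bruijn_graph` and `branching_nodes` and the toy reads are hypothetical, and the sketch assumes single-stranded, noiseless reads.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read.

    Hypothetical helper for illustration; assumes error-free,
    single-stranded reads over a plain string alphabet.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k over the read; each k-mer is an edge
        # from its (k-1)-length prefix to its (k-1)-length suffix.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Return the nodes with more than one successor, in sorted order.

    Under the assumptions above, such a node is a (k-1)-mer that occurs
    at least twice in the underlying sequence with different continuations.
    """
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Toy reads drawn from the sequence ATGCATGT, in which ATG occurs twice.
reads = ["ATGCA", "GCATG", "CATGT"]
graph = de_bruijn_graph(reads, k=3)
repeats = branching_nodes(graph)
```

With k=3 the node `TG` has two successors (`GC` and `GT`), reflecting the repeated `ATG` in the toy sequence. Choosing k larger than the repeat would remove the branch, which is the intuition behind bounds stated in terms of repeat lengths.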
Near-optimal assembly for shotgun sequencing with noisy reads
TLDR
This work shows that even when there is noise in the reads, one can successfully reconstruct with information requirements close to the noiseless fundamental limit, and a new assembly algorithm is designed based on a probabilistic model of the genome.
A probabilistic analysis of shotgun sequencing for metagenomics
TLDR
This work analyzes the identifiability of collections of M genomes of length N in an asymptotic regime in which N tends to infinity and M may grow with N and provides a threshold in terms of M and N so that if the read length exceeds the threshold, then a simple greedy algorithm successfully reconstructs the full collection of genomes with probability tending to one.
Overlap-based genome assembly from variable-length reads
TLDR
This work introduces a new assembly algorithm with two desirable features in the context of long-read sequencing: it is an overlap-based method, thus being more resilient to read errors than de Bruijn graph approaches; and it achieves the information-theoretic bounds even in the variable-length read setting.
End-to-End Optimization of High-Throughput DNA Sequencing
TLDR
This article models and optimizes the end-to-end flow cell synthesis and target genome sequencing process, linking and partially controlling the statistics of the physical processes to the success of the final computational step.
Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation
TLDR
Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences, is presented, demonstrating that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either PacBio or Oxford Nanopore technologies.
Statistical Methods for Genome Assembly
TLDR
A de Bruijn graph-based assembly algorithm that comes very close to the lower bound for the repeat statistics of a wide range of sequenced genomes, based on a set of necessary and sufficient conditions on the DNA sequence and the reads for reconstruction.
OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees
TLDR
OPERA-LG is a scalable, exact algorithm for the scaffold assembly of large, repeat-rich genomes, out-performing state-of-the-art programs for scaffold correctness and contiguity.
Theoretical Bounds on Mate-Pair Information for Accurate Genome Assembly
TLDR
This paper provides an alternate perspective on the genome assembly problem by showing that genome assembly is easy when provided with sufficient mate-pair information, and quantifies the number of mate-pair libraries necessary and sufficient for accurate genome assembly, in terms of the length of the longest repetitive region within a genome.

References

Optimal Assembly for High Throughput Shotgun Sequencing
TLDR
A de Bruijn graph-based assembly algorithm is designed that comes very close to the lower bound for the repeat statistics of a wide range of sequenced genomes, including the GAGE datasets.
Parametric Complexity of Sequence Assembly: Theory and Applications to Next Generation Sequencing
TLDR
This work suggests at least two ways in which existing assemblers can be extended in a rigorous fashion, in addition to delineating directions for future theoretical investigations.
Combinatorial algorithms for DNA sequence assembly
TLDR
A four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice and can accommodate high sequencing error rates.
Maximum Likelihood Genome Assembly
TLDR
It is demonstrated how the technique of bidirected network flow can be used to explicitly model the double-stranded nature of DNA for genome assembly and a maximum likelihood framework for assembling the genome that is the most likely source of the reads is proposed.
GAGE: A critical evaluation of genome assemblies and assembly algorithms.
TLDR
Evaluating several of the leading de novo assembly algorithms on four different short-read data sets generated by Illumina sequencers concludes that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome.
A New Algorithm for DNA Sequence Assembly
TLDR
This paper proposes a new computer algorithm for DNA sequence assembly that combines in a novel way the techniques of both shotgun and SBH methods, and promises to be very fast and practical for DNA sequence assembly.
TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects
TLDR
A fast initial comparison of fragments based on oligonucleotide content is used to eliminate the need for a more sensitive comparison between most fragment pairs, thus greatly reducing computer search time.
Assemblathon 1: a competitive assessment of de novo short read assembly methods.
TLDR
The Assemblathon 1 competition is described, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies, and it is established that it is possible to assemble the genome to a high level of coverage and accuracy.
Assembling millions of short DNA sequences using SSAKE
TLDR
SSAKE is a tool for aggressively assembling millions of short nucleotide sequences by progressively searching through a prefix tree for the longest possible overlap between any two sequences to help leverage the information from short sequence reads by stringently assembling them into contiguous sequences that can be used to characterize novel sequencing targets.
The fragment assembly string graph
TLDR
The result demonstrates that the decomposition of reads into k-mers employed in the de Bruijn graph approach described earlier is not essential, and exposes its close connection to the unitig approach the authors developed at Celera.