Safe and Complete Contig Assembly Via Omnitigs

@article{Tomescu2016SafeAC,
  title={Safe and Complete Contig Assembly Via Omnitigs},
  author={Alexandru I. Tomescu and Paul Medvedev},
  journal={ArXiv},
  year={2016},
  volume={abs/1601.02932}
}
Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs -- a set of strings that are promised to appear in any genome that could have generated the reads. From the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question was never solved: given a genome graph $G$ (e.g. a de Bruijn, or a string graph), what are all the strings that can be… 

Genome Assembly, from Practice to Theory: Safe, Complete and Linear-Time

TLDR
A surprising $O(m)$-time algorithm to identify all maximal omnitigs of a graph with $n$ nodes and $m$ arcs, notwithstanding the existence of families of graphs with $\Theta(mn)$ total maximal omnitig size.

A safe and complete algorithm for metagenomic assembly

TLDR
A safe and complete algorithm for finding all safe walks of G, which constitutes the first tight theoretical upper bound on what can be safely assembled from metagenomic reads using this problem formulation.

Assembling Omnitigs using Hidden-Order de Bruijn Graphs

TLDR
A succinct representation of the LCP array's Cartesian tree, which takes only 2 extra bits per edge and still lets us support interesting navigation operations efficiently and extract a set of safe strings more informative than the unitigs, while using a reasonable amount of memory.

Uncovering hidden assembly artifacts: when unitigs are not safe and bidirected graphs are not helpful

TLDR
This paper is the first to theoretically predict the existence of these assembler artifacts and confirm and measure the extent of their occurrence in practice, and it is proved that, contrary to popular belief, unitigs are not always safe.

From omnitigs to macrotigs: a linear-time algorithm for safe walks - common to all closed arc-coverings of a directed graph

TLDR
An O(m)-time algorithm to identify all maximal omnitigs, thanks to the discovery of a family of walks (macrotigs) with the property that all the non-trivial omnitigs are univocal extensions of subwalks of a macrotig.

An Optimal O(nm) Algorithm for Enumerating All Walks Common to All Closed Edge-covering Walks of a Graph

TLDR
New insights about the structure of omnitigs are proved and several open questions about them are solved to achieve an O(nm)-time algorithm for outputting all the maximal omnitigs of a graph (with n nodes and m edges).

deGSM: memory scalable construction of large scale de Bruijn Graph

TLDR
The experimental results demonstrate that the proposed lightweight parallel de Bruijn graph construction approach, deGSM, is able to handle very large genome sequence(s), e.g., the contigs and scaffolds recorded in the GenBank database and the Picea abies HTS dataset.

Safely Filling Gaps with Partial Solutions Common to All Solutions

TLDR
This work gives an efficient safe algorithm for reliable gap filling: filling gaps with those sub-paths common to all gap filling solutions, following the framework of (Tomescu and Medvedev, RECOMB 2016).

deBWT: parallel construction of Burrows–Wheeler Transform for large collection of genomes with de Bruijn-branch encoding

TLDR
The benchmarking suggests that deBWT is efficient and scalable for constructing the BWT of large datasets by parallel computing, and well-suited to indexing many genomes, such as a collection of individual human genomes, with multiple-core servers or clusters.

Safety in s-t Paths, Trails and Walks

TLDR
It is proved that there exists a compact representation, computable in linear time, that allows outputting all maximal safe walks in time linear in their length, and that the same complexity results hold for the analogous generalisations of s-t articulation points (nodes appearing in all s-t paths).

References

Scaffolding pre-assembled contigs using SSPACE

TLDR
A new tool, called SSPACE, which is a stand-alone scaffolder of pre-assembled contigs using paired-read data with a short runtime, multiple library input of paired-end and/or mate pair datasets and possible contig extension with unmapped sequence reads.

BESST - Efficient scaffolding of large fragmented assemblies

TLDR
A comprehensive comparison of BESST against the most popular stand-alone scaffolders on a large variety of datasets concludes that information sources other than the quantity of links, as is commonly used, can provide useful information about genome structure when scaffolding.

Efficient construction of an assembly string graph using the FM-index

TLDR
The algorithms presented here pave the way for overlap-based assembly approaches to be developed that scale to whole vertebrate genome de novo assembly.

The fragment assembly string graph

TLDR
The result demonstrates that the decomposition of reads into kmers employed in the de Bruijn graph approach described earlier is not essential, and exposes its close connection to the unitig approach the authors developed at Celera.

GAGE: A critical evaluation of genome assemblies and assembly algorithms.

TLDR
Evaluating several of the leading de novo assembly algorithms on four different short-read data sets generated by Illumina sequencers concludes that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome.

Maximum Likelihood Genome Assembly

TLDR
It is demonstrated how the technique of bidirected network flow can be used to explicitly model the double-stranded nature of DNA for genome assembly and a maximum likelihood framework for assembling the genome that is the most likely source of the reads is proposed.

Paired de Bruijn Graphs: A Novel Approach for Incorporating Mate Pair Information into Genome Assemblers

TLDR
The paired de Bruijn graph is introduced, a generalization of the de Bruijn graph that incorporates mate pair information into the graph structure itself instead of analyzing mate pairs at a post-processing step to effectively improve the contig sizes in assembly.

Velvet: algorithms for de novo short read assembly using de Bruijn graphs.

TLDR
Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies and is in close agreement with simulated results without read-pair information.

Assembly complexity of prokaryotic genomes using short reads

TLDR
The analysis gives an upper-bound on the performance of genome assemblers for de novo reconstruction of genomes across a wide range of read lengths and demonstrates that the majority of genes in prokaryotic genomes can be reconstructed uniquely using very short reads even if the genomes themselves cannot.

Gap Filling as Exact Path Length Problem

TLDR
This work derives a simpler dynamic programming solution than previously known, pseudo-polynomial in the maximum value of the input range, implements various practical optimizations to it, and compares the exact gap filling solution experimentally to popular gap filling tools.