Safe and Complete Contig Assembly Via Omnitigs

Alexandru I. Tomescu and Paul Medvedev
Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs -- a set of strings that are promised to appear in any genome that could have generated the reads. Since the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question has never been answered: given a genome graph $G$ (e.g. a de Bruijn graph or a string graph), what are all the strings that can be…
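To make the notion of contigs on a genome graph concrete, here is a minimal sketch that builds a de Bruijn graph from reads and extracts its unitigs (maximal non-branching paths), the classical kind of contig. It is an illustration only: it does not compute the omnitigs of this paper, and it assumes error-free reads and ignores reverse complements. The function names `de_bruijn_arcs` and `unitigs` are hypothetical.

```python
from collections import defaultdict


def de_bruijn_arcs(reads, k):
    """Return the set of distinct k-mers (the arcs of the graph) in the reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def unitigs(reads, k):
    """Spell the maximal non-branching paths of the de Bruijn graph.

    Nodes are (k-1)-mers and arcs are k-mers. Assumption: error-free reads,
    single strand only (no reverse complements).
    """
    arcs = de_bruijn_arcs(reads, k)
    out_arcs = defaultdict(list)
    in_deg = defaultdict(int)
    for kmer in sorted(arcs):
        out_arcs[kmer[:-1]].append(kmer)
        in_deg[kmer[1:]] += 1
    nodes = set(out_arcs) | set(in_deg)

    def is_internal(v):
        # A node can lie strictly inside a unitig only if it has
        # exactly one incoming and one outgoing arc.
        return in_deg.get(v, 0) == 1 and len(out_arcs.get(v, ())) == 1

    result, used = [], set()
    # Start a path at every arc leaving a branching, source, or sink node.
    for v in sorted(nodes):
        if is_internal(v):
            continue
        for kmer in out_arcs.get(v, ()):
            s, w = kmer, kmer[1:]
            used.add(kmer)
            while is_internal(w):
                nxt = out_arcs[w][0]
                used.add(nxt)
                s += nxt[-1]
                w = nxt[1:]
            result.append(s)
    # Arcs not reached above lie on isolated cycles of internal nodes.
    for kmer in sorted(arcs):
        if kmer in used:
            continue
        s, w = kmer, kmer[1:]
        used.add(kmer)
        while out_arcs[w][0] not in used:
            nxt = out_arcs[w][0]
            used.add(nxt)
            s += nxt[-1]
            w = nxt[1:]
        result.append(s)
    return sorted(result)


print(unitigs(["AAGCT", "AGCTT", "AGCTC"], 3))
```

In this toy input the reads agree up to `AAGCT` and then diverge, so the sketch reports one long unitig followed by two single-arc unitigs for the two branches.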

Genome Assembly, from Practice to Theory: Safe, Complete and Linear-Time

A surprising $O(m)$-time algorithm to identify all maximal omnitigs of a graph with $n$ nodes and $m$ arcs, notwithstanding the existence of families of graphs with $\Theta(mn)$ total maximal omnitig size.

A safe and complete algorithm for metagenomic assembly

A safe and complete algorithm finding all safe walks of G, which constitutes the first theoretical tight upper bound on what can be safely assembled from metagenomic reads using this problem formulation.

Assembling Omnitigs using Hidden-Order de Bruijn Graphs

A succinct representation of that array's Cartesian tree, which takes only 2 extra bits per edge and still lets us support interesting navigation operations efficiently and extract a set of safe strings more informative than the unitigs, while using a reasonable amount of memory.

Uncovering hidden assembly artifacts: when unitigs are not safe and bidirected graphs are not helpful

This paper is the first to theoretically predict the existence of these assembler artifacts and to confirm and measure the extent of their occurrence in practice; it also proves that, contrary to popular belief, unitigs are not always safe.

From omnitigs to macrotigs: a linear-time algorithm for safe walks - common to all closed arc-coverings of a directed graph

An O(m)-time algorithm to identify all maximal omnitigs, thanks to the discovery of a family of walks (macrotigs) with the property that all the non-trivial omnitigs are univocal extensions of subwalks of a macrotig.

Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era

The most recent advances in constructing, representing, and navigating assembly graphs are discussed, with a focus on very large datasets, and some computational techniques to compactly store graphs while keeping all functionalities intact are explored.

An Optimal O(nm) Algorithm for Enumerating All Walks Common to All Closed Edge-covering Walks of a Graph

New insights about the structure of omnitigs are proved and several open questions about them are solved to achieve an O(nm)-time algorithm for outputting all the maximal omnitigs of a graph (with n nodes and m edges).

deGSM: memory scalable construction of large scale de Bruijn Graph

The experimental results demonstrate that the proposed lightweight parallel de Bruijn graph construction approach, deGSM, is able to handle very large genome sequence(s), e.g., the contigs and scaffolds recorded in the GenBank database and the Picea abies HTS dataset.

Safely Filling Gaps with Partial Solutions Common to All Solutions

This work gives an efficient safe algorithm for reliable gap filling: filling gaps with those sub-paths common to all gap filling solutions, following the framework of (Tomescu and Medvedev, RECOMB 2016).
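The idea of keeping only what is common to all solutions can be illustrated with a brute-force sketch on a tiny graph. This is not the efficient algorithm the summary refers to. It enumerates every simple path between two nodes and intersects their arc sets, which is exponential in general, and it ignores the length constraints that gap filling normally imposes. The names `all_paths` and `safe_arcs` and the example graph are hypothetical.

```python
def all_paths(graph, s, t, path=None):
    """Yield every simple path from s to t in a directed graph (dict of lists)."""
    path = [s] if path is None else path
    if s == t:
        yield list(path)
        return
    for w in graph.get(s, ()):
        if w not in path:  # keep the path simple
            path.append(w)
            yield from all_paths(graph, w, t, path)
            path.pop()


def safe_arcs(graph, s, t):
    """Return the arcs used by every s-t path (empty set if no path exists)."""
    common = None
    for p in all_paths(graph, s, t):
        arcs = set(zip(p, p[1:]))
        common = arcs if common is None else common & arcs
    return common or set()


# Hypothetical gap between flanking nodes "s" and "t" with one bubble.
gap_graph = {"s": ["a"], "a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["t"]}
print(sorted(safe_arcs(gap_graph, "s", "t")))
```

Here the two candidate fillings differ only inside the bubble between `a` and `d`, so only the arcs entering and leaving the bubble are reported as common to all solutions.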

deBWT: parallel construction of Burrows–Wheeler Transform for large collection of genomes with de Bruijn-branch encoding

The benchmarking suggests that deBWT is efficient and scalable for constructing the BWT of large datasets by parallel computing, and is well-suited to indexing many genomes, such as a collection of individual human genomes, on multiple-core servers or clusters.
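For readers unfamiliar with the transform itself, the following is a minimal textbook sketch of the Burrows–Wheeler Transform built by sorting all rotations. It is not deBWT's method: it runs on a single short string, is neither parallel nor memory-efficient, and the function name `bwt` and the `$` sentinel are assumptions made for illustration.

```python
def bwt(text, sentinel="$"):
    """Naive BWT: sort all rotations of text + sentinel, take the last column.

    Assumption: the sentinel does not occur in text and sorts before every
    other character (true for "$" against letters in ASCII order).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)


print(bwt("banana"))  # annb$aa
```

Sorting full rotations costs quadratic space, which is why practical tools for large genome collections rely on far more elaborate constructions.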



Scaffolding pre-assembled contigs using SSPACE

A new tool, called SSPACE, which is a stand-alone scaffolder of pre-assembled contigs using paired-read data with a short runtime, multiple library input of paired-end and/or mate pair datasets and possible contig extension with unmapped sequence reads.

BESST - Efficient scaffolding of large fragmented assemblies

A comprehensive comparison of BESST against the most popular stand-alone scaffolders on a large variety of datasets concludes that information sources other than the quantity of links, as is commonly used, can provide useful information about genome structure when scaffolding.

Efficient de novo assembly of large genomes using compressed data structures.

A new assembler based on the overlap-based string graph model of assembly, SGA (String Graph Assembler), which provides the first practical assembler for a mammalian-sized genome on a low-end computing cluster and is simply parallelizable.

Efficient construction of an assembly string graph using the FM-index

The algorithms presented here pave the way for overlap-based assembly approaches to be developed that scale to whole vertebrate genome de novo assembly.

The fragment assembly string graph

The result demonstrates that the decomposition of reads into kmers employed in the de Bruijn graph approach described earlier is not essential, and exposes its close connection to the unitig approach the authors developed at Celera.

GAGE: A critical evaluation of genome assemblies and assembly algorithms.

Evaluating several of the leading de novo assembly algorithms on four different short-read data sets generated by Illumina sequencers concludes that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome.

Maximum Likelihood Genome Assembly

It is demonstrated how the technique of bidirected network flow can be used to explicitly model the double-stranded nature of DNA for genome assembly and a maximum likelihood framework for assembling the genome that is the most likely source of the reads is proposed.

Paired de Bruijn Graphs: A Novel Approach for Incorporating Mate Pair Information into Genome Assemblers

The paired de Bruijn graph is introduced, a generalization of the de Bruijn graph that incorporates mate pair information into the graph structure itself instead of analyzing mate pairs at a post-processing step to effectively improve the contig sizes in assembly.

Velvet: algorithms for de novo short read assembly using de Bruijn graphs.

Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies and is in close agreement with simulated results without read-pair information.

Assembly complexity of prokaryotic genomes using short reads

The analysis gives an upper bound on the performance of genome assemblers for de novo reconstruction of genomes across a wide range of read lengths and demonstrates that the majority of genes in prokaryotic genomes can be reconstructed uniquely using very short reads even if the genomes themselves cannot.