On the representation of de Bruijn graphs

@article{Chikhi2014OnTR,
  title={On the representation of de Bruijn graphs},
  author={Rayan Chikhi and Antoine Limasset and Shaun D. Jackman and Jared T. Simpson and Paul Medvedev},
  journal={Journal of computational biology : a journal of computational molecular cell biology},
  year={2014},
  volume={22},
  number={5},
  pages={336--352}
}
  • Rayan Chikhi, Antoine Limasset, Shaun D. Jackman, Jared T. Simpson, Paul Medvedev
  • Published 21 January 2014
  • Computer Science
  • Journal of computational biology : a journal of computational molecular cell biology
The de Bruijn graph plays an important role in bioinformatics, especially in the context of de novo assembly. However, the representation of the de Bruijn graph in memory is a computational bottleneck for many assemblers. Recent papers proposed a navigational data structure approach in order to improve memory usage. We prove several theoretical space lower bounds to show the limitations of these types of approaches. We further design and implement a general data structure (dbgfm) and… 
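As a rough illustration of what navigating a de Bruijn graph involves, the sketch below stores the k-mers of a few reads in a plain Python set and answers neighbour queries by testing the four possible extensions of a node. It is not the dbgfm structure from the paper, and a hash set is far from space-efficient; the function names `build_kmer_set`, `successors`, and `predecessors` are hypothetical and chosen only for this example.

```python
from typing import Iterable, List, Set


def build_kmer_set(reads: Iterable[str], k: int) -> Set[str]:
    """Collect every length-k substring of the reads; these are the graph nodes."""
    kmers: Set[str] = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def successors(kmers: Set[str], node: str) -> List[str]:
    """Out-neighbours: drop the first base, append each base, keep k-mers that exist."""
    return [node[1:] + b for b in "ACGT" if node[1:] + b in kmers]


def predecessors(kmers: Set[str], node: str) -> List[str]:
    """In-neighbours: drop the last base, prepend each base, keep k-mers that exist."""
    return [b + node[:-1] for b in "ACGT" if b + node[:-1] in kmers]


# Toy input (illustrative only): two overlapping reads, k = 3.
kmers = build_kmer_set(["ACGTAC", "CGTACG"], k=3)
print(sorted(kmers))
print(successors(kmers, "ACG"))
print(predecessors(kmers, "ACG"))
```

Edges are implicit here: they are recovered from membership queries alone, which is why the memory cost of the k-mer set dominates and why more compact representations are of interest.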

Variable-Order de Bruijn Graphs

TLDR
Experiments show that augmenting the succinct de Bruijn graph representation of Bowe et al. to support new operations that change the order on the fly only modestly increases space usage, construction time, and navigation time compared to a single-order graph.

deGSM: memory scalable construction of large scale de Bruijn Graph

TLDR
The experimental results demonstrate that the proposed lightweight parallel de Bruijn graph construction approach, deGSM, is able to handle very large genome sequence collections, e.g., the contigs and scaffolds recorded in the GenBank database and the Picea abies HTS dataset.

Practical dynamic de Bruijn graphs

TLDR
A practical implementation of the de Bruijn graph data structure, supporting exact membership queries and fully dynamic edge operations, as well as limited support for dynamic node operations.

Compacting de Bruijn graphs from sequencing data quickly and in low memory

TLDR
An algorithm and tool, bcalm 2, is presented for the compaction of de Bruijn graphs; it is a parallel algorithm that distributes the input based on a minimizer hashing technique, allowing for a good balance of memory usage throughout its execution.

From Indexing Data Structures to de Bruijn Graphs

TLDR
Here, the relationship between suffix trees/arrays and dBGs is formalised, and linear time algorithms for constructing the full or contracted dBGs are exhibited.

Efficient de Bruijn graph construction using suffix trees

TLDR
A novel algorithm has been developed to directly generate the CdBG from the suffix tree of the input and experimental results confirm the theoretical bounds and show that for some set of parameters the approach is competitive with the state of the art.

Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections

TLDR
Cuttlefish introduces a novel modeling scheme of the de Bruijn graph vertices as finite-state automata, and constrains the state-space for the automata to enable tracking of their transitioning states with very low memory usage.

Read Mapping on de Bruijn graph

TLDR
This work formally defines the problem of mapping reads on references, provides a first theoretical and practical study in this direction for the case where the graph is a de Bruijn graph, and shows that the problem is NP-complete.

Succinct De Bruijn Graph Construction for Massive Populations Through Space-Efficient Merging

TLDR
This paper creates VariMerge, a means to merge succinct representations of the de Bruijn graph by partitioning the data into smaller subsets, building the colored de Bruijn graph for each subset using an FM-index-based representation, and merging these representations iteratively.

TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes

TLDR
TwoPaCo, a simple and scalable low-memory algorithm for the direct construction of the compacted de Bruijn graph from a set of complete genomes, is presented; it is shown to construct the graph for 100 simulated human genomes in less than a day and for eight real primates in under 2 h on a typical shared-memory machine.
...

References


Space-efficient and exact de Bruijn graph representation based on a Bloom filter

TLDR
A new encoding of the de Bruijn graph is proposed that occupies an order of magnitude less space than current representations; with it, a complete de novo assembly of human genome short reads was performed using 5.7 GB of memory in 23 hours.

From Indexing Data Structures to de Bruijn Graphs

TLDR
Here, the relationship between suffix trees/arrays and dBGs is formalised, and linear time algorithms for constructing the full or contracted dBGs are exhibited.

Using cascading Bloom filters to improve the memory usage for de Brujin graphs

TLDR
This work shows how to reduce the memory required by the data structure of Chikhi and Rizk (WABI’12), which represents de Bruijn graphs using Bloom filters; the resulting method constitutes the most efficient practical representation of de Bruijn graphs.

Scaling metagenome sequence assembly with probabilistic de Bruijn graphs

TLDR
A memory-efficient graph representation based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly, is introduced, which reduces the overall memory requirements for de novo assembly of metagenomes.

Efficient construction of an assembly string graph using the FM-index

TLDR
The algorithms presented here pave the way for overlap-based assembly approaches to be developed that scale to whole vertebrate genome de novo assembly.

Compact representation of k-mer de Bruijn graphs for genome read assembly

TLDR
The kFM-index, presented here as a modification of the FM-index, could replace more memory-demanding data structures for storing the de Bruijn k-mer graph representation of sequence reads.

Memory Efficient Minimum Substring Partitioning

TLDR
This work investigates the memory issue of constructing the de Bruijn graph, a core task in leading assembly algorithms, and proposes a disk-based partition method, called Minimum Substring Partitioning (MSP), to complete the task using less than 10 gigabytes of memory, without runtime slowdown.

Velvet: algorithms for de novo short read assembly using de Bruijn graphs.

TLDR
Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies; its results are in close agreement with simulated results without read-pair information.

Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly

  • Heng Li
  • Computer Science
    Bioinform.
  • 2012
TLDR
A de novo assembler, fermi, is developed that assembles Illumina short reads into unitigs while preserving most of the information in the input reads, suggesting that variant calling with de novo assembly can be a beneficial complement to the standard variant calling pipeline for whole-genome resequencing.

De novo assembly and genotyping of variants using colored de Bruijn graphs

TLDR
An efficient software implementation, Cortex, is provided; it is the first de novo assembler capable of assembling multiple eukaryotic genomes simultaneously. The work also shows how population information from ten chimpanzees enables accurate variant calls without a reference sequence.
...