Toward a statistically explicit understanding of de novo sequence assembly

@article{Howison2013TowardAS,
  title={Toward a statistically explicit understanding of de novo sequence assembly},
  author={Mark Howison and Felipe Zapata and Casey W. Dunn},
  journal={Bioinformatics},
  year={2013},
  volume={29},
  number={23},
  pages={2959--2963}
}
MOTIVATION: Draft de novo genome assemblies are now available for many organisms. These assemblies are point estimates of the true genome sequences. Each is a specific hypothesis, drawn from among many alternative hypotheses, of the sequence of a genome. Assembly uncertainty, the inability to distinguish between multiple alternative assembly hypotheses, can be due to real variation between copies of the genome in the sample, errors and ambiguities in the sequenced data and assumptions and…
Bayesian Genome Assembly and Assessment by Markov Chain Monte Carlo Sampling
TLDR
The posterior distribution of assembly hypotheses generated by GABI is summarized as a majority-rule consensus assembly and compared to external assemblies of the same test data, which are annotated by assigning posterior probabilities to the features they share with GABI's assembly graph.
Extensive Error in the Number of Genes Inferred from Draft Genome Assemblies
TLDR
The magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families, and the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, are investigated.
Automated ensemble assembly and validation of microbial genomes
TLDR
Although computationally limited to small or mid-sized genomes, this approach is the most effective and reproducible means for generating high-quality assemblies and enables users to select an assembly best tailored to their specific needs.
Assembly and Data Quality
TLDR
Methods to assemble sequence reads into larger pieces are described; different strategies are used for genome, transcriptome and metagenome assemblies, all of which benefit greatly from the inclusion of long reads.
GMcloser: closing gaps in assemblies accurately with a likelihood-based selection of contig or long-read alignments
TLDR
GMcloser is described, a tool that accurately closes gaps with a preassembled contig set or a long read set (i.e., error-corrected PacBio reads) by using likelihood-based classifiers calculated from the alignment statistics between scaffolds, contigs and paired-end reads to correctly assign contigs or long reads to gap regions of scaffolding, thereby achieving accurate and efficient gap closure.
Evaluation of de novo transcriptome assemblies from RNA-Seq data
TLDR
A model-based score, RSEM-EVAL, for evaluating assemblies when the ground truth is unknown is developed and shown to correctly reflect assembly accuracy, as measured by REF-EVAL, a refined set of ground-truth-based scores that were developed.
Phylogenomics from Whole Genome Sequences Using aTRAM
TLDR
The use of the automated Target Restricted Assembly Method (aTRAM) to assemble 1107 single‐copy ortholog genes from whole genome sequencing of sucking lice and out‐groups is demonstrated, showing that this approach is successful at developing phylogenomic data sets from raw genome sequencing reads.
ILP-based maximum likelihood genome scaffolding
TLDR
Equipped with NSDP, SILP2 is able to scaffold large mammalian genomes, resulting in the longest and most accurate scaffolds, and the ILP formulation for the maximum likelihood model is shown to be flexible enough to handle metagenomic samples.
Evaluation of de novo transcriptome assemblies from RNA-Seq data
TLDR
This work developed a model-based score, RSEM-EVAL, for evaluating assemblies when the ground truth is unknown, and assembled the transcriptome of the regenerating axolotl limb; this assembly compares favorably to a previous assembly.

References

GAGE: A critical evaluation of genome assemblies and assembly algorithms.
TLDR
Evaluating several of the leading de novo assembly algorithms on four different short-read data sets generated by Illumina sequencers concludes that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome.
Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species
TLDR
The high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.
Efficient de novo assembly of large genomes using compressed data structures.
TLDR
A new assembler based on the overlap-based string graph model of assembly, SGA (String Graph Assembler), is presented; it provides the first practical assembler for a mammalian-sized genome on a low-end computing cluster and is simply parallelizable.
ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies
TLDR
The ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process.
Assembly reconciliation
TLDR
The reconciled assemblies of six Drosophila species, produced with the Assembly Reconciliation technique in collaboration with Agencourt Bioscience and The J. Craig Venter Institute, are now the official (CAF1) assemblies used for analysis.
Assemblathon 1: a competitive assessment of de novo short read assembly methods.
TLDR
The Assemblathon 1 competition is described, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies, and it is established that it is possible to assemble the genome to a high level of coverage and accuracy.
Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data
TLDR
This work reviews the efficiency of a panel of assemblers, specifically designed to handle data from GS FLX 454 platform, on three bacterial data sets with different characteristics in terms of reads coverage and repeats content, and investigates their strengths and weaknesses in the reconstruction of the reference genomes.
An improved maximum likelihood formulation for accurate genome assembly
  • Aditya Varma, A. Ranade, S. Aluru
  • Engineering
    2011 IEEE 1st International Conference on Computational Advances in Bio and Medical Sciences (ICCABS)
  • 2011
TLDR
Improvements to the recently proposed maximum likelihood method for genome assembly are presented, and results indicate that the method can generate accurate estimates of repeat counts and produces fewer and much longer contigs.
Maximum Likelihood Genome Assembly
TLDR
It is demonstrated how the technique of bidirected network flow can be used to explicitly model the double-stranded nature of DNA for genome assembly and a maximum likelihood framework for assembling the genome that is the most likely source of the reads is proposed.
REAPR: a universal tool for genome assembly evaluation
TLDR
This work validated REAPR on complete genomes or de novo assemblies from bacteria, malaria and Caenorhabditis elegans, and demonstrated that 86% and 82% of the human and mouse reference genomes are error-free, respectively.