Exploring genome characteristics and sequence quality without a reference

  • Jared T. Simpson
  • Published 30 July 2013
  • Biology
  • Bioinformatics
  • Pages 1228-1235
Motivation: The de novo assembly of large, complex genomes is a significant challenge with currently available DNA sequencing technology. While many de novo assembly software packages are available, comparatively little attention has been paid to assisting the user with the assembly. Results: This article addresses the practical aspects of de novo assembly by introducing new ways to perform quality assessment on a collection of sequence reads. The software implementation calculates per-base… 

BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs
This work proposes a measure for quantitative assessment of genome assembly and annotation completeness based on evolutionarily informed expectations of gene content, implemented in open-source software, with sets of Benchmarking Universal Single-Copy Orthologs, named BUSCO.
KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies
The K-mer Analysis Toolkit (KAT) is presented: a multi-purpose software toolkit for reference-free quality control (QC) of WGS reads and de novo genome assemblies, primarily via their k-mer frequencies and GC composition.
GenomeScope: fast reference‐free genome profiling from short reads
Summary: GenomeScope is an open‐source web tool to rapidly estimate the overall characteristics of a genome, including genome size, heterozygosity rate and repeat content from unprocessed short
Parallel algorithms and software tools for high-throughput sequencing data
This thesis presents a collection of novel algorithms and software tools for the analysis of high-throughput sequencing data using efficient data structures and utilizes the latest advances in parallel and distributed computing.
Determining the quality and complexity of next-generation sequencing data without a reference genome
We describe an open-source kPAL package that facilitates an alignment-free assessment of the quality and comparability of sequencing datasets by analyzing k-mer frequencies. We show that kPAL can
The present and future of de novo whole-genome assembly
This review provides guidelines to determine the optimal approach for a given input data type, computational budget or genome and categorizes de novo assemblers on the basis of the type of de Bruijn graphs.
Estimating Assembly Base Errors Using K-mer Abundance Difference (KAD) Between Short Reads and Genome Assembled Sequences
A novel approach is developed, referred to as K-mer Abundance Difference (KAD), to compare the inferred copy number of each k-mer indicated by short reads and the observed copy number in the assembly, which can be used to identify base errors and estimate the overall error rate.
Quality control of next-generation sequencing data without a reference
It is shown that by generating a rapid, non-optimized draft assembly of raw reads, it is possible to obtain reliable and informative QC metrics, thus removing the need for a high quality reference.
Factorial estimating assembly base errors using k-mer abundance difference (KAD) between short reads and genome assembled sequences
A novel approach is developed, referred to as k-mer abundance difference (KAD), to compare the inferred copy number of each k-mer indicated by short reads and the observed copy number in the assembly, which can be used to identify base errors and estimate the overall error rate.
To my family and friends
AssemblyRAST is introduced, a general compute orchestration framework and accompanying domain-specific language that facilitates rapid workflow design for genome assembly, analysis, and method discovery; a method for reference-independent assembly evaluation and error identification through supervised learning is also devised.
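
Several of the tools listed above (KAT, GenomeScope, kPAL) work from k-mer frequencies in the raw reads. As a rough illustration of that shared starting point, the sketch below counts canonical k-mers and builds an abundance histogram. It is not taken from any of these packages: the function names `canonical`, `count_kmers`, and `abundance_histogram` are hypothetical, and real tools use far more memory-efficient counters.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers over an iterable of read strings.

    K-mers containing anything other than A, C, G, T (for example N) are skipped.
    """
    counts = Counter()
    valid = set("ACGT")
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= valid:
                counts[canonical(kmer)] += 1
    return counts


def abundance_histogram(counts):
    """Map each occurrence count to the number of distinct k-mers seen that many times."""
    return Counter(counts.values())


if __name__ == "__main__":
    toy_reads = ["ACGTA", "ACGTA"]  # illustrative input, not real data
    kmer_counts = count_kmers(toy_reads, k=3)
    print(dict(kmer_counts))
    print(dict(abundance_histogram(kmer_counts)))
```

In a histogram of this kind, sequencing errors typically show up as a large number of k-mers seen only once or twice, while genuine genomic k-mers cluster around the sequencing depth.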


Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species
The high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.
Reference-Free Validation of Short Read Data
This work proposes analytical methods for identifying biases in a collection of short reads, without recourse to a reference, and shows that, surprisingly, strong biases appear to be present.
Efficient de novo assembly of large genomes using compressed data structures.
A new assembler based on the overlap-based string graph model of assembly, SGA (String Graph Assembler), which provides the first practical assembler for a mammalian-sized genome on a low-end computing cluster and is simply parallelizable.
ABySS: a parallel assembler for short read sequence data.
ABySS (Assembly By Short Sequences), a parallelized sequence assembler, was developed and assembled 3.5 billion paired-end reads from the genome of an African male publicly released by Illumina, Inc, representing 68% of the reference human genome.
De novo assembly and genotyping of variants using colored de Bruijn graphs
Cortex, an efficient software implementation and the first de novo assembler capable of assembling multiple eukaryotic genomes simultaneously, is provided, and population information from ten chimpanzees is shown to enable accurate variant calls without a reference sequence.
Substantial biases in ultra-short read data sets from high-throughput DNA sequencing
The results show different types of biases and ways to detect them, which have implications for the use and interpretation of Solexa data, for de novo sequencing, re-sequencing, the identification of single nucleotide polymorphisms and DNA methylation sites, as well as for transcriptome analysis.
Informed and automated k-mer size selection for genome assembly
A fast and accurate sampling method is developed that constructs approximate abundance histograms with several orders of magnitude performance improvement over traditional methods and a fast heuristic is presented that uses the generated abundance histogram for putative k values to estimate the best possible value of k.
Estimation of sequencing error rates in short reads
A fast, scalable and accurate approach to estimating error rates in short reads, which has the added advantage of not requiring a reference genome and is more accurate than alternatives that count the differences between the sample of interest and a reference genome.
Characterizing and measuring bias in sequence data
The assays presented in this paper provide a comprehensive view of sequencing bias, which can be used to drive laboratory improvements and to monitor production processes, and indicate that combining data from two technologies can reduce coverage bias.
Scaling metagenome sequence assembly with probabilistic de Bruijn graphs
A memory-efficient graph representation based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly, is introduced, which reduces the overall memory requirements for de novo assembly of metagenomes.
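
The last entry describes storing a de Bruijn graph inexactly in a Bloom filter. A minimal sketch of that idea follows, assuming a generic Bloom filter with double hashing derived from SHA-256; the class `KmerBloomFilter`, the helper `successors`, and the sizing in the usage example are hypothetical and do not reproduce the cited implementation.

```python
import hashlib


class KmerBloomFilter:
    """Bloom filter over k-mers: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(kmer.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, kmer):
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(kmer)
        )


def successors(bloom, kmer):
    """Return the k-mers reachable by one base extension that the filter reports.

    The graph edges are implicit: a neighbour exists if the filter says its
    k-mer was inserted, so false positives appear as spurious edges.
    """
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in bloom]


if __name__ == "__main__":
    sequence, k = "ACGTAC", 3  # illustrative input
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    # About 4 bits per k-mer, echoing the figure quoted above (assumed sizing).
    bloom = KmerBloomFilter(num_bits=4 * len(kmers), num_hashes=2)
    for kmer in kmers:
        bloom.add(kmer)
    print(successors(bloom, "ACG"))
```

With so few bits per k-mer, the false-positive rate is high, which is the "albeit inexactly" trade-off the summary mentions: memory is saved at the cost of some spurious graph edges.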