Corpus ID: 220961617

Haplotype-resolved de novo assembly with phased assembly graphs

@article{Cheng2020HaplotyperesolvedDN,
  title={Haplotype-resolved de novo assembly with phased assembly graphs},
  author={Haoyu Cheng and Gregory T Concepcion and Xiaowen Feng and Haowen Zhang and Heng Li},
  journal={arXiv: Genomics},
  year={2020}
}
Haplotype-resolved de novo assembly is the ultimate solution to the study of sequence variations in a genome. However, existing algorithms either collapse heterozygous alleles into one consensus copy or fail to cleanly separate the haplotypes to produce high-quality phased assemblies. Here we describe hifiasm, a new de novo assembler that takes advantage of long high-fidelity sequence reads to faithfully represent the haplotype information in a phased assembly graph. Unlike other graph-based…

Extended haplotype-phasing of long-read de novo genome assemblies using Hi-C
TLDR
FALCON-Phase is a phasing tool that uses ultra-long-range Hi-C chromatin interaction data to extend phase blocks of partially-phased diploid assemblies to chromosome or scaffold scale and is applicable to any draft assembly that contains long primary contigs and phased associate contigs.
De novo assembly of 64 haplotype-resolved human genomes of diverse ancestry and integrated analysis of structural variation
TLDR
This resource enables reliable graph-based genotyping from short reads of up to 50,340 SVs, resulting in the identification of 1,525 expression quantitative trait loci (SV-eQTLs) as well as SV candidates for adaptive selection within the human population.
GraphUnzip: unzipping assembly graphs with long reads and Hi-C
TLDR
GraphUnzip is a fast, memory-efficient and accurate tool that unzips assembly graphs into their constituent haplotypes using long reads and/or Hi-C data and yields high-quality gap-less supercontigs.
Assembling Long Accurate Reads Using de Bruijn Graphs
TLDR
An efficient jumboDBG algorithm is developed and a new concept of a multiplex de Bruijn graph with varying k-mer sizes is used to produce contiguous assemblies of complex repetitive regions in genomes, including automated assemblies of various highly repetitive human centromeres.
Overcoming uncollapsed haplotypes in long-read assemblies of non-model organisms
TLDR
Testing different assembly strategies on the genome of the rotifer Adineta vaga revealed several approaches able to generate haploid assemblies with genome sizes, coverage distributions, and completeness close to expectations.
Minimizer-space de Bruijn graphs
TLDR
The concept of minimizer-space sequencing data analysis, where the minimizers rather than DNA nucleotides are the atomic tokens of the alphabet, is introduced; these advances are anticipated to be essential to sequence analysis, given the rise of long-read sequencing in genomics, metagenomics and pangenomics.
Probably Correct: Rescuing Repeats with Short and Long Reads
TLDR
Proposed and implemented solutions to the repeat resolution and multi-mapping read problems are reviewed, and the downstream consequences of reference choice, repeat masking, and proper representation of sex chromosomes are considered.
Chromosome-scale and haplotype-resolved genome assembly of a tetraploid potato cultivar
TLDR
A 3.1 Gb haplotype-resolved, chromosome-scale assembly of the autotetraploid potato cultivar Otava is presented, and almost 50% of the tetraploid genome is found to be identical-by-descent with at least one of the other haplotypes.
AlignGraph2: similar genome-assisted reassembly pipeline for PacBio long reads
TLDR
AlignGraph2 is a similar-genome-assisted reassembly pipeline for PacBio long reads that accepts either error-prone or HiFi long reads and contains four novel algorithms: a similarity-aware alignment algorithm and an alignment filtration algorithm for aligning the long reads and preassembled contigs to the similar genome, and a reassembly algorithm and a weight-adjusted consensus algorithm for extending and refining the preassembled contigs.
Time- and memory-efficient genome assembly with Raven
TLDR
Methods for improving de novo genome assembly from erroneous long reads are incorporated into a tool called Raven, which is one of the fastest options while having the lowest memory consumption on the majority of benchmarked datasets.

References

De novo assembly of haplotype-resolved genomes with trio binning
TLDR
This work used trio binning to recover both haplotypes of a diploid human genome and identified complex structural variants missed by alternative approaches; applied to an F1 cattle cross, it produced haplotype assemblies surpassing the quality of current cattle reference genomes.
A fully phased accurate assembly of an individual human genome
TLDR
This work leverages the chromosome-wide phasing and scaffolding capabilities of single-cell strand sequencing and combines them with high-fidelity (HiFi) long sequencing reads in a novel reference-free workflow for diploid de novo genome assembly.
Chromosome-scale, haplotype-resolved assembly of human genomes
TLDR
A method named diploid assembly (DipAsm) that uses long, accurate reads and long-range conformation data for single individuals to generate a chromosome-scale phased assembly within 1 day, outperforming other approaches in terms of both contiguity and phasing completeness.
Assembly of long, error-prone reads using repeat graphs
TLDR
Flye improves the speed and accuracy of genome assembly by using repeat graphs to resolve repeat regions, and nearly doubles the contiguity of the human genome assembly compared with existing assemblers.
Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
TLDR
Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences, is presented, demonstrating that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences or Oxford Nanopore technologies.
Phased diploid genome assembly with single-molecule real-time sequencing
TLDR
The open-source FALCON and FALCON-Unzip algorithms are introduced to assemble long-read sequencing data into highly accurate, contiguous, and correctly phased diploid genomes.
Identifying and removing haplotypic duplication in primary genome assemblies
TLDR
This work presents a novel tool “purge_dups” that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps and shows that it can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly.
HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads
TLDR
This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of 9 complete human centromeric regions, a significant advance towards the complete assembly of human genomes.
Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data
We present a hierarchical genome-assembly process (HGAP) for high-quality de novo microbial genome assemblies using only a single, long-insert shotgun DNA library in conjunction with Single Molecule,…
Assembling large genomes with single-molecule sequencing and locality-sensitive hashing
TLDR
The MinHash Alignment Process (MHAP) is introduced for overlapping noisy, long reads using probabilistic, locality-sensitive hashing and can produce de novo near-complete eukaryotic assemblies that are 99.99% accurate when compared with available reference genomes.