PEP_scaffolder: using (homologous) proteins to scaffold genomes

Authors: Bai-Han Zhu, Ying-Nan Song, Wei Xue, Gui-Cai Xu, Jun Xiao, Ming-Yuan Sun, Xiaowen Sun, Jiongtang Li
Pages: 3193-3195
Abstract Motivation: Recovering gene structures is an important goal of genome assembly. In low-quality assemblies, and even in some high-quality assemblies, certain gene regions remain incomplete; novel scaffolding approaches are therefore required to complete them. Results: We developed an efficient and fast genome scaffolding method called PEP_scaffolder, which uses proteins to scaffold genomes. The pipeline aims to recover protein-coding gene structures. We tested the method on…

P_RNA_scaffolder: a fast and accurate genome scaffolder using paired-end RNA-sequencing reads
P_RNA_scaffolder improves the contiguity of genome assemblies, benefits gene prediction, and exhibited higher speed and efficiency than existing state-of-the-art scaffolders.
A comprehensive review of scaffolding methods in genome assembly
This review focuses on the difficulties in scaffolding, the differences in characteristics among various kinds of reads, how current scaffolding methods address these difficulties, and future research opportunities.
Phylogenetic approaches to identifying fragments of the same gene, with application to the wheat genome
This study introduces two novel phylogenetic tests to infer non-overlapping or partially overlapping genes that are in fact parts of the same gene: one approach collapses branches with low bootstrap support, and the other computes a likelihood ratio test.
Improving draft genome contiguity with reference-derived in silico mate-pair libraries
Cross-Species Scaffolding is a new pipeline that imports long-range distance information directly into the de novo assembly process by constructing mate-pair libraries in silico; genome assembly metrics and gene prediction improve dramatically with this pipeline.
Rapid genome shrinkage in a self-fertile nematode reveals sperm competition proteins
Comparisons of chromosome-scale assemblies of the outcrossing nematode Caenorhabditis nigoni to its self-fertile sibling species, C. briggsae, reveal impacts of sexual mode on genome content that can be used to identify sperm competition factors.
Investigating the genomic basis of discrete phenotypes using a Pool‐Seq‐only approach: New insights into the genetics underlying colour variation in diverse taxa
Using Pool‐Seq data for both genome assembly and SNP frequency estimation, followed by scanning for FST outliers to identify divergent genomic regions, new regions of high divergence and new annotations are discovered that together suggest novel parallels between birds and butterflies in the origins of their colour pattern variation.
A new species in the major malaria vector complex sheds light on reticulated species evolution
Complexes of closely related species provide key insights into the rapid and independent evolution of adaptive traits. Here, we described and studied Anopheles fontenillei sp.n., a new species in the
A new species in the Anopheles gambiae complex reveals new evolutionary relationships between vector and non-vector species
Anopheles fontenillei improves the understanding of the relationships among species within the gambiae complex and provides insight into the evolution of vectorial capacity traits, which is relevant for the successful control of malaria in Africa.
Genomic insights into neonicotinoid sensitivity in the solitary bee Osmia bicornis
The genome of the red mason bee, Osmia bicornis, is sequenced and reveals conserved detoxification pathways in model solitary and eusocial bees despite key differences in the evolution of specific pesticide-metabolising enzymes in the two species groups.


Scaffolding low quality genomes using orthologous protein sequences
A pipeline, SWiPS (Scaffolding With Protein Sequences), that uses orthologous proteins to improve low quality genome assemblies by using them as guides to scaffold existing contigs, while simultaneously allowing the gene structure to be predicted by homology.
L_RNA_scaffolder: scaffolding genomes with transcripts
The simplicity and high throughput of RNA-seq data make this approach suitable for genome scaffolding, and L_RNA_scaffolder outperformed most results from existing scaffolders that employ mate-pair libraries.
Scaffolding a Caenorhabditis nematode genome with RNA-seq.
Efficient sequencing of animal and plant genomes by next-generation technology should allow many neglected organisms of biological and medical importance to be better understood. As a test case, we
Comparative genomics approach to detecting split-coding regions in a low-coverage genome: lessons from the chimaera Callorhinchus milii (Holocephali, Chondrichthyes)
This study is the first formulation of a method to link unassembled genomic segments based on proteomes of relatively distantly related species as references, and provided insights into practical solutions for efficient annotation of only partially sequenced (low-coverage) genomes.
BLAT--the BLAST-like alignment tool.
This paper describes how BLAT was optimized; BLAT is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences.
GAGE: A critical evaluation of genome assemblies and assembly algorithms.
An evaluation of several leading de novo assembly algorithms on four different short-read data sets generated by Illumina sequencers concludes that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome.
UniProt: a hub for protein information
An annotation score for all entries in UniProt is introduced to represent the relative amount of knowledge available about each protein, helping to identify which proteins are the best characterized and most informative for comparative analysis.
The UCSC Genome Browser database: 2016 update
The UCSC Genome Browser has greatly expanded the data sets available on the most recent human assembly, hg38/GRCh38, to include updated gene prediction sets from GENCODE, more phenotype- and disease-associated variants from ClinVar and ClinGen, more genomic regulatory data, and a new multiple genome alignment.
AUGUSTUS: ab initio prediction of alternative transcripts
To the authors' knowledge, this is the first ab initio gene finder that can predict multiple transcripts and offers a motif searching facility, where user-defined regular expressions can be searched against putative proteins encoded by the predicted genes.