Using native and syntenically mapped cDNA alignments to improve de novo gene finding

@article{Stanke2008UsingNA,
  title={Using native and syntenically mapped cDNA alignments to improve de novo gene finding},
  author={Mario Stanke and Mark E. Diekhans and Robert Baertsch and David Haussler},
  journal={Bioinformatics},
  year={2008},
  volume={24},
  number={5},
  pages={637--644}
}
MOTIVATION: Computational annotation of protein coding genes in genomic DNA is a widely used and essential tool for analyzing newly sequenced genomes. However, current methods suffer from inaccuracy and do poorly with certain types of genes. Including additional sources of evidence of the existence and structure of genes can improve the quality of gene predictions. For many eukaryotic genomes, expressed sequence tags (ESTs) are available as evidence for genes. Related genomes that have been…

High-throughput sequencing data and the impact of plant gene annotation quality
TLDR
The impact of annotation quality on evolutionary analyses, genome-wide association studies, and the identification of orthologous genes in plants is highlighted and it is predicted that incorporating accurate information from manual curation into databases will dramatically improve the performance of automated gene predictors.
Concerted action of the new Genomic Peptide Finder and AUGUSTUS allows for automated proteogenomic annotation of the Chlamydomonas reinhardtii genome
TLDR
An improved version of the Genomic Peptide Finder (GPF), which aligns de novo predicted amino acid sequences to the genomic DNA sequence of an organism while correcting for peptide sequencing errors and accounting for the possibility of splicing, is described.
RNA-Seq improves annotation of protein-coding genes in the cucumber genome
TLDR
It is concluded that RNA-Seq greatly improves the accuracy of prediction of protein-coding genes in the reassembled cucumber genome, and it is suggested that it is feasible to use RNA-Seq reads to annotate newly sequenced or less-studied genomes.
A pipeline for automated annotation of yeast genome sequences by a conserved-synteny approach
TLDR
The Yeast Genome Annotation Pipeline (YGAP) is an automated system designed specifically for new yeast genome sequences lacking transcriptome data, and outperformed another popular annotation program (AUGUSTUS).
A novel hybrid gene prediction method employing protein multiple sequence alignments
TLDR
This work extended the gene prediction software AUGUSTUS by a method that employs block profiles generated from multiple sequence alignments as a protein signature to improve the accuracy of the prediction.
Multi-Genome Annotation with AUGUSTUS.
TLDR
In this chapter the reader is walked through a small example from eight vertebrate species, including the construction of an alignment of the input genomes and how to integrate RNA-Seq evidence from multiple species for gene finding.
Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi
TLDR
An extension of the gene prediction program GeMoMa that utilizes amino acid sequence conservation, intron position conservation and optionally RNA-seq data for homology-based gene prediction and might be of great utility for annotating newly sequenced genomes but also for finding homologs of a specific gene or gene family.
SnowyOwl: accurate prediction of fungal genes by using RNA-Seq and homology information to select among ab initio models
TLDR
SnowyOwl is a new gene prediction pipeline that uses RNA-Seq data to train and provide hints for the generation of Hidden Markov Model (HMM)-based gene predictions and to evaluate the resulting models.
BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database
TLDR
In comparison with BRAKER1 supported by a large volume of transcript data, BRAKER2 could produce a better gene prediction accuracy if the evolutionary distances to the reference species in the protein database were rather small.
Maximizing prediction of orphan genes in assembled genomes
TLDR
A Findable, Accessible, Interoperable and Reusable (FAIR) approach, called BIND, that mitigates the under-prediction of orphan genes and increases the number and accuracy of orphan gene predictions.

References

SHOWING 1-10 OF 37 REFERENCES
Gene identification in novel eukaryotic genomes by self-training algorithm
TLDR
A self-training algorithm that had been assumed feasible only for prokaryotic genomes has now been developed for ab initio eukaryotic gene identification, and tests have shown that the new method performs comparably to or better than conventional methods where the supervised model training precedes the gene prediction step.
Targeted discovery of novel human exons by comparative genomics.
TLDR
A genome-wide effort to identify human genes not yet in the gene catalogs, carried out as part of the Mammalian Gene Collection project, to produce gene predictions by algorithms that rely on comparative sequence data but do not require direct cDNA evidence, then to test predicted novel genes by RT-PCR.
Gene and alternative splicing annotation with AIR.
TLDR
The method is highly selective, eliminating the unlikely candidates while retaining 98% of the high-quality mRNA evidence in well-formed transcripts, and produces annotation that is measurably more accurate than some evidence-based gene sets.
Gene structure conservation aids similarity based gene prediction.
TLDR
An algorithm implemented in a computer program called Projector which combines comparative and similarity approaches and makes explicit use of the conservation of the exon-intron structure between two related genes in addition to the similarity of their encoded amino acid sequences is presented.
Using Multiple Alignments to Improve Gene Prediction
TLDR
N-SCAN can model the phylogenetic relationships between the aligned genome sequences, context-dependent substitution rates, and insertions and deletions, and its accuracy exceeds that of all previously published whole-genome de novo gene predictors.
Iterative gene prediction and pseudogene removal improves genome annotation.
TLDR
PPFINDER (for Processed Pseudogene finder), a program that integrates several methods of processed pseudogene finding in mammalian gene annotations, is created and it is shown that gene prediction improves substantially when gene prediction and pseudogene masking are interleaved.
Evidence Combination in Hidden Markov Models for Gene Prediction
TLDR
A new method for combining partial probabilistic statements is presented and it is proved that it is an extension of existing methods for combining complete probability statements and a method for improving the sensitivity of existing tools for this task by careful modeling of sequence properties is presented.
AceView: a comprehensive cDNA-supported gene and transcripts annotation
TLDR
The driving principles of AceView are described, along with how, by performing hand-supervised automatic annotation, it solves the combinatorial splicing problem and summarizes all of GenBank, dbEST and RefSeq into a genome-wide non-redundant but comprehensive cDNA-supported transcriptome.
GENCODE: producing a reference annotation for ENCODE
TLDR
The comprehensiveness of the GENCODE annotation was assessed by attempting to validate all the predicted exon boundaries outside the GENCODE annotation, which showed only 40% of GENCODE exons are contained within the two sets, which is a reflection of the high number of alternative splice forms with unique exons annotated.
Integrating alternative splicing detection into gene prediction
TLDR
This automatic combination of experimental data analysis and ab initio gene finding offers an ideal integration of alternatively spliced gene prediction inside a single annotation pipeline.