Steady progress and recent breakthroughs in the accuracy of automated genome annotation

  • Michael R. Brent
  • Published 2008
  • Biology
  • Nature Reviews Genetics
The sequencing of large, complex genomes has become routine, but understanding how sequences relate to biological function is less straightforward. Although much attention is focused on how to annotate genomic features such as developmental enhancers and non-coding RNAs, there is still no higher eukaryote for which we know the correct exon–intron structure of at least one ORF for each gene. Despite this uncomfortable truth, genome annotation has made remarkable progress since the first drafts… 
Computational Gene Prediction in Eukaryotic Genomes
Because of the large amount of genomic data, in silico methods are needed for genome annotation, and genome sequences are annotated mostly with computational gene prediction programs.
Using comparative genome analysis to identify problems in annotated microbial genomes.
This work discusses and demonstrates how the methods of comparative genome analysis can refine annotations by locating missing orthologues, and shows that second-generation annotation systems, which combine multiple gene-calling programs with similarity-based methods, perform much better than the first annotation tools.
Developing a bioinformatics framework for proteogenomics
It is critically important to incorporate proteomics data into genome annotation pipelines to provide experimental protein-coding evidence, and this thesis addresses the existing gap between the use of genomic and proteomic sources for accurate genome annotation by applying a proteogenomics approach with a customised methodology.
Finding genes in genome sequence.
The state of the art in automated gene finding is described, along with the biological basis, computational approaches, and corresponding programs available for the automated identification of protein-coding genes.
Comparative Genome Annotation.
Methods for comparative structural genome annotation include classical approaches such as the alignment of protein sequences or protein profiles against the genome and comparative gene prediction methods that exploit a genome alignment to annotate a target genome.
Approaches to Fungal Genome Annotation
The application of the latest technologies and tools for eukaryotic genome annotation, with a focus on the annotation of fungal nuclear and mitochondrial genomes, is highlighted to improve the quality of predicted gene sets.
The functional repertoires of metazoan genomes
Metazoan genomes are being sequenced at an increasingly rapid rate, and it is in lineage-specific and functional sequence that the physiological differences between species are expected to be most concentrated.
Similar Ratios of Introns to Intergenic Sequence across Animal Genomes
It is shown that, regardless of genome size, the ratio of introns to intergenic sequence is comparable across essentially all animals, and that when large-genome invertebrates are considered, the fraction of the genome that is genes appears to be strongly predictable by genome size.
A post-assembly genome-improvement toolkit (PAGIT) to obtain annotated genomes from contigs
This protocol describes software (PAGIT) that is used to improve the quality of draft genomes and offers flexible functionality to close gaps in scaffolds, correct base errors in the consensus sequence, and exploit reference genomes in order to improve scaffolding and generate annotations.


Genome annotation past, present, and future: how to define an ORF at each locus.
The state of gene prediction roughly 10 years ago is reviewed, the progress that has been made since is summarized, it is argued that the primary ORF identification methods so far are inadequate, and a path toward completing the Catalog of Protein Coding Genes, Version 1.0 is recommended.
CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes
This study reports a computational method, CEGMA (Core Eukaryotic Genes Mapping Approach), for building a highly reliable set of gene annotations in the absence of experimental data, and defines a set of conserved protein families that occur in a wide range of eukaryotes and presents a mapping procedure that accurately identifies their exon-intron structures in a novel genomic sequence.
Large-scale analysis of pseudogenes in the human genome.
Initial sequencing and comparative analysis of the mouse genome
The results of an international collaboration to produce a high-quality draft sequence of the mouse genome are reported and an initial comparative analysis of the mouse and human genomes is presented, describing some of the insights that can be gleaned from the two sequences.
Gene finding in the chicken genome
De novo comparative gene prediction followed by experimental verification is effective at enhancing the annotation of the newly sequenced genomes provided by standard homology-based methods.
Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies.
The algorithm of the Program to Assemble Spliced Alignments (PASA) tool is described, as well as the results of automated updates to Arabidopsis gene annotations.
Targeted discovery of novel human exons by comparative genomics.
A genome-wide effort to identify human genes not yet in the gene catalogs was carried out as part of the Mammalian Gene Collection project; the approach was to produce gene predictions by algorithms that rely on comparative sequence data but do not require direct cDNA evidence, then to test predicted novel genes by RT-PCR.
Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project
Functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project are reported, providing convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts.
Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map.
It is shown that TWINSCAN improves gene prediction in human using intermediate products from various stages of the sequencing and analysis of the mouse genome, from low-redundancy, whole-genome shotgun reads to the draft assembly and the synteny map.
What is a gene, post-ENCODE? History and updated definition.
This definition side-steps the complexities of regulation and transcription by removing the former altogether from the definition and arguing that final, functional gene products (rather than intermediate transcripts) should be used to group together entities associated with a single gene.