GENCODE reference annotation for the human and mouse genomes

@article{Frankish2019GENCODERA,
  title={GENCODE reference annotation for the human and mouse genomes},
  author={Adam Frankish and Mark E. Diekhans and Anne-Maud Ferreira and Rory Johnson and Irwin Jungreis and Jane E. Loveland and Jonathan M. Mudge and Cristina Sisu and James C. Wright and Joel Armstrong and If H. A. Barnes and Andrew E. Berry and Alexandra Bignell and Silvia Carbonell Sala and Jacqueline Chrast and Fiona Cunningham and Tom{\'a}s Di Domenico and Sarah M. Donaldson and Ian T. Fiddes and Carlos Garc{\'i}a-Gir{\'o}n and Jose Manuel Gonzalez and Tiago Grego and Matthew Hardy and Thibaut Hourlier and Toby Hunt and Osagie G. Izuogu and Julien Lagarde and Fergal J. Martin and Laura Mart{\'i}nez and Shamika Mohanan and Paul Muir and F{\'a}bio C. P. Navarro and Anne Parker and Baikang Pei and Fernando Pozo and Magali Ruffier and Bianca M. Schmitt and Eloise Stapleton and Marie-Marthe Suner and Irina Sycheva and Barbara Uszczynska-Ratajczak and Jinrui Xu and Andrew D. Yates and Daniel R. Zerbino and Yan Zhang and Bronwen L. Aken and Jyoti S. Choudhary and Mark B. Gerstein and Roderic Guig{\'o} and Tim J. P. Hubbard and Manolis Kellis and Benedict J. Paten and Alexandre Reymond and Michael L. Tress and Paul Flicek},
  journal={Nucleic Acids Research},
  year={2019},
  volume={47},
  pages={D766--D773}
}
Abstract: The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high-quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference-quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene…

Lost in translation: the pitfalls of Ensembl Gene annotations between human genome assemblies and their impact on diagnostics
TLDR
This study highlights the issue of genes with discrepant annotations: genes recognized as protein coding in the new assembly but not in the old one, which are ignored by all genomic resources that still rely on the archived and outdated gene annotations.
Ensembl 2021
TLDR
Recent Ensembl developments are presented, including two new website portals, which are designed to provide core tools and services for genomes as soon as possible and have been deployed to support large biodiversity sequencing projects.
Ensembl 2020
TLDR
This work presents 94 newly annotated and re-annotated genomes, bringing the total number of genomes offered by Ensembl to 227, which represents the single largest expansion of the resource since its inception.
The UCSC Genome Browser database: 2021 update
TLDR
The UCSC Genome Browser database has provided high-quality genomics data visualization and genome annotations to the research community for more than two decades, and new features released this past year include a Hi-C heatmap display, a phased family trio display for VCF files, and various track visualization improvements.
Impact of gene annotation choice on the quantification of RNA-seq data
TLDR
This study found that the use of the conservative RefSeq gene annotation yields better RNA-seq quantification results than the more comprehensive Ensembl annotation, and that the recent expansion of the RefSeq database resulted in a reduction in the accuracy of RNA-seq quantification.
Impact of gene annotation choice on the quantification of RNA-seq data
TLDR
It is shown that the use of RefSeq gene annotation models led to better quantification accuracy, based on the correlation with ground truths including expression data from >800 real-time PCR validated genes, known titration ratios of gene expression and microarray expression data.
Mouse genomic and cellular annotations
TLDR
Due to the wide-ranging number and diversity of annotations describing the mouse genome, this review focuses on gene, repeat and regulatory element annotation, as well as two relatively new technologies, 3D genome architecture and single-cell sequencing, outlining their utility in genetic research and their current challenges.
Using multiple reference genomes to identify and resolve annotation inconsistencies
TLDR
A high-throughput method based on pairwise comparisons of annotations is presented that detects potential split-gene misannotations and quantifies support for whether the genes should be merged into a single gene model; its utility is demonstrated using gene annotations of three reference genomes from maize.

References

GENCODE: the reference human genome annotation for The ENCODE Project.
TLDR
This work has examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites, and over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas.
Improving GENCODE reference gene annotation using a high-stringency proteogenomics workflow
TLDR
It is shown that a conservative approach using stringent filtering is required to generate valid identifications in pseudogenes, and a stringent workflow for the interpretation of proteogenomic data is reported that could be used by the annotation community to interpret novel proteogenomics evidence.
The Ensembl gene annotation system
The Ensembl gene annotation system has been used to annotate over 70 different vertebrate species across a wide range of genome projects. Furthermore, it generates the automatic alignment-based
GENCODE: producing a reference annotation for ENCODE
TLDR
The comprehensiveness of the GENCODE annotation was assessed by attempting to validate all the predicted exon boundaries outside the GENCODE annotation, which showed only 40% of GENCODE exons are contained within the two sets, which is a reflection of the high number of alternative splice forms with unique exons annotated.
Comparative Annotation Toolkit (CAT) - Simultaneous Clade and Personal Genome Annotation
TLDR
This work describes the fully open source Comparative Annotation Toolkit (CAT), which provides a flexible way to simultaneously annotate entire clades and identify orthology relationships, and demonstrates the resulting discovery of novel genes, isoforms, and structural variants in genomes as well studied as rat and the great apes.
Combining RT-PCR-seq and RNA-seq to catalog all genic elements encoded in the human genome.
TLDR
This work demonstrates that the cataloging of all of the genic elements encoded in the human genome will necessitate a coordinated effort between unbiased and targeted approaches, like RNA-seq and RT-PCR-seq.
High-throughput annotation of full-length long noncoding RNAs with Capture Long-Read Sequencing
TLDR
An experimental reannotation of the GENCODE intergenic lncRNA populations in matched human and mouse tissues resulted in novel transcript models for 3,574 and 561 gene loci, respectively, which enabled us to definitively characterize the genomic features of lncRNAs, including promoter and gene structure, and protein-coding potential.
The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression.
TLDR
The most complete human lncRNA annotation to date is presented, produced by the GENCODE consortium within the framework of the ENCODE project and comprising 9277 manually annotated genes producing 14,880 transcripts, and expression correlation analysis indicates that lncRNAs show particularly striking positive correlation with the expression of antisense coding genes.
Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation
TLDR
The ongoing work, growth and stability of the CCDS dataset is outlined and expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community are presented.
Ensembl 2018
TLDR
The latest developments of the Ensembl project are presented, with a focus on managing an increasing number of assemblies, supporting efforts in genome interpretation and improving the browser.