Current status and new features of the Consensus Coding Sequence database

Authors: Catherine M. Farrell, Nuala A. O'Leary, Rachel A. Harte, Jane E. Loveland, Laurens G. Wilming, Craig Wallin, Mark E. Diekhans, Daniel Barrell, Stephen M. J. Searle, Bronwen L. Aken, Susan M. Hiatt, Adam Frankish, Marie-Marthe Suner, Bhanu Rajput, Charles A. Steward, Garth R. Brown, Ruth Bennett, Michael R. Murphy, Wendy Wu, Mike P. Kay, Jennifer Hart, Jeena Rajan, Janet Weber, Catherine Snow, Lillian D. Riddick, Toby Hunt, David Webb, Mark Thomas, Pamela Tamez, Sanjida H. Rangwala, Kelly M. McGarvey, Shashikant Pujar, Andrei Shkeda, Jonathan M. Mudge, Jose Manuel Gonzalez, James G. R. Gilbert, Stephen J. Trevanion, Robert Baertsch, Jennifer L. Harrow, Tim J. P. Hubbard, James Ostell, David Haussler, Kim D. Pruitt
Journal: Nucleic Acids Research
Pages: D865-D872
The Consensus Coding Sequence (CCDS) project is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies by the National Center for Biotechnology Information (NCBI) and Ensembl genome annotation pipelines. Identical annotations that pass quality assurance tests are tracked with a stable identifier (CCDS ID). Members of the collaboration, who are from NCBI, the…
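The core idea in the abstract, keeping only coding regions that two independent pipelines annotate identically, can be illustrated with a minimal sketch. This is not the CCDS project's actual procedure: the data model (chromosome, strand, exon intervals), the function name `consensus_cds`, and all identifiers and coordinates are hypothetical, and the quality assurance tests and stable CCDS ID assignment described above are not modelled.

```python
from typing import Dict, List, Tuple

# Assumed data model: a CDS is (chromosome, strand, tuple of (start, end) exon intervals).
Cds = Tuple[str, str, Tuple[Tuple[int, int], ...]]


def consensus_cds(set_a: Dict[str, Cds], set_b: Dict[str, Cds]) -> List[Tuple[str, str, Cds]]:
    """Return (id_a, id_b, cds) for every coding region annotated identically in both sets."""
    # Index the second annotation set by its exact structure for constant-time lookup.
    by_structure: Dict[Cds, List[str]] = {}
    for id_b, cds in set_b.items():
        by_structure.setdefault(cds, []).append(id_b)

    matches: List[Tuple[str, str, Cds]] = []
    for id_a, cds in sorted(set_a.items()):
        # Only an exact match on chromosome, strand and every exon boundary counts.
        for id_b in by_structure.get(cds, []):
            matches.append((id_a, id_b, cds))
    return matches


# Hypothetical toy annotations; identifiers and coordinates are invented.
pipeline_a = {
    "A1": ("chr1", "+", ((100, 200), (300, 450))),
    "A2": ("chr2", "-", ((50, 120),)),
}
pipeline_b = {
    "B1": ("chr1", "+", ((100, 200), (300, 450))),
    "B2": ("chr2", "-", ((50, 125),)),  # differs from A2 by one exon boundary
}

for id_a, id_b, cds in consensus_cds(pipeline_a, pipeline_b):
    print(id_a, id_b, cds)
```

In this toy input only the `A1`/`B1` pair is reported; `A2` and `B2` differ at a single exon end and so would not qualify as a consensus annotation under this strict-identity assumption.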


Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation
The ongoing work, growth and stability of the CCDS dataset are outlined, and expert curation scenarios are presented, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community.
Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation
The approach to utilizing available RNA-Seq and other data types in the authors' manual curation process for vertebrate, plant, and other species is summarized, and a new direction for prokaryotic genomes and protein name management is described.
A unified gene catalog for the laboratory mouse reference genome
A semi-automated process by which mouse genome feature predictions and curated annotations from Ensembl, NCBI and Vertebrate Genome Annotation database are reconciled with the genome features in the Mouse Genome Informatics database into a comprehensive and non-redundant catalog is reported.
Mouse genome annotation by the RefSeq project
Key features and advantages of RefSeq genome annotation products are highlighted and an overview of NCBI processes to generate these data is presented.
Ensembl 2015
The Regulatory Build has been revamped to identify regulatory regions of interest and to efficiently highlight their activity across disparate epigenetic data sets, and the WiggleTools package enables users to summarize large collections of data sets and view them as single tracks in Ensembl.
Gene: a gene-centered information resource at NCBI
The National Center for Biotechnology Information's (NCBI) Gene database integrates gene-specific information from multiple data sources and represents the integration of curation and automated processing from RefSeq, collaborating model organism databases, consortia such as Gene Ontology, and other databases within NCBI.
Genomic Database Searching.
This chapter provides a broad overview of the major genomic databases and browsers, and describes various approaches and the latest resources for searching them.
Creating reference gene annotation for the mouse C57BL6/J genome assembly
The progress of the GENCODE mouse annotation project is described, which combines manual annotation from the HAVANA group with Ensembl computational annotation, alongside experimental and in silico validation pipelines from other members of the consortium.
Database resources of the National Center for Biotechnology Information
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the
The 2014 Nucleic Acids Research Database Issue and an updated NAR online Molecular Biology Database Collection
This issue includes descriptions of 58 new molecular biology databases and recent updates to 123 databases previously featured in NAR or other journals, and a collection of articles on bacterial taxonomy and metagenomics, which includes updates on the List of Prokaryotic Names with Standing in Nomenclature.


The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes.
The CCDS database centralizes the function of identifying well-supported, identically-annotated, protein-coding regions and indicates that the entries in the CCDS set are highly likely to represent real proteins, more so than annotations from contributing groups not included in CCDS.
NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy
Recent growth, the status of curating the human RefSeq data set, more extensive feature annotation and current policy for eukaryotic genome annotation via the NCBI annotation pipeline are reported on.
GenBank® is a comprehensive database that contains publicly available nucleotide sequences for over 340 000 formally described species and integrates these records with a variety of other data including taxonomy nodes, genomes, protein structures, and biomedical journal literature in PubMed.
GENCODE: the reference human genome annotation for The ENCODE Project.
This work has examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites, and over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas.
Locus Reference Genomic sequences: an improved basis for describing human DNA variants
It is hoped that widespread adoption of LRGs - which will be created and maintained by the NCBI and the European Bioinformatics Institute - along with consistent use of the Human Genome Variation Society (HGVS)-approved variant nomenclature will reduce errors in the reporting of variants in the literature and improve communication about variants affecting human health.
Tracking and coordinating an international curation effort for the CCDS Project
The relevant background and reasoning behind the curation standards that are developed for CCDS database treatment of transcripts that are nonsense-mediated decay (NMD) candidates, for transcripts containing upstream open reading frames, for identifying the most likely translation start codons and for the annotation of readthrough transcripts are presented.
The vertebrate genome annotation (Vega) database
The Vertebrate Genome Annotation (Vega) database was first made public in 2004 and now contains comprehensive annotation on 20 of the 24 human chromosomes, four whole mouse chromosomes and around 40% of the zebrafish Danio rerio genome.
The International Nucleotide Sequence Database Collaboration
The INSDC is introduced, data growth patterns are outlined and the challenges of increased growth are commented on, with a clear mark on INSDC strategy.
The Universal Protein Resource (UniProt) in 2010
The primary mission of UniProt is to support biological research by maintaining a stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with
Ensembl 2013
The Ensembl project provides genome information for sequenced chordate genomes with a particular focus on human, mouse, zebrafish and rat, as well as variation data resources for 17 species and regulation annotations based on ENCODE and other data sets.