CDD: a Conserved Domain Database for the functional annotation of proteins

Aron Marchler-Bauer, Shennan Lu, John B. Anderson, Farideh Chitsaz, Myra K. Derbyshire, Carol DeWeese-Scott, Jessica H. Fong, Lewis Y. Geer, Renata C. Geer, Noreen R. Gonzales, Marc Gwadz, David I. Hurwitz, John D. Jackson, Zhaoxi Ke, Christopher J. Lanczycki, Fu Lu, Gabriele H. Marchler, Mikhail Mullokandov, Marina V. Omelchenko, Cynthia L. Robertson, James S. Song, Narmada Thanki, Roxanne A. Yamashita, Dachuan Zhang, Naigong Zhang, Chanjuan Zheng, Stephen H. Bryant
Nucleic Acids Research, pages D225-D229
NCBI’s Conserved Domain Database (CDD) is a resource for the annotation of protein sequences with the location of conserved domain footprints, and functional sites inferred from these footprints. CDD includes manually curated domain models that make use of protein 3D structure to refine domain models and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent… 

CDD: conserved domains and protein three-dimensional structure
To date, the majority of protein 3D structures are represented by models tracked by CDD, and CDD curators are characterizing novel families that emerge from protein structure determination efforts.
Annotation of functional sites with the Conserved Domain Database
It is observed that CDD-based site annotation complements existing site annotation in many cases, which may, in part, originate from CDD's curation practice of collecting sites conserved across diverse taxa and supported by evidence from multiple 3D structures.
Automated hierarchical classification of protein domain subfamilies based on functionally-divergent residue signatures
This approach automates the rapid creation of protein domain hierarchies and thus will eliminate one of the most time-consuming aspects of conserved domain database curation; it also facilitates protein domain annotation by identifying the pattern residues that most distinguish each protein domain subgroup from other related subgroups.
Improving the consistency of domain annotation within the Conserved Domain Database
An automated algorithm is reported that ‘rescues’ valuable borderline-scoring domain hits that are well-supported by domain architecture (DA, the sequential order of conserved domains in a protein query), including tandem repeats of domain hits reported at a more conservative threshold.
Comprehensive Analysis of Non Redundant Protein Database
It is shown that BoaG can efficiently query this large dataset to determine the average length of protein sequences and to identify the most common taxonomic assignments and functional annotations, and that the nonredundant (NR) database has a considerable amount of annotation redundancy at the 95% similarity level.
Protein Family Databases
Protein family data has a number of applications, notably for the functional classification of new protein sequences, and many of these databases have also been amalgamated into integrated protein family resources, which vary in their level of manual curation.
Searching ECOD for Homologous Domains by Sequence and Structure
This unit demonstrates how to access ECOD via the Web and how to search the database by sequence or structure and details the distributable data files available for large‐scale bioinformatics users.
Automatic Identification of Highly Conserved Family Regions and Relationships in Genome Wide Datasets Including Remote Protein Sequences
A completely unsupervised and automated method to identify the shared sequence segments observed in a diverse collection of protein sequences including those present in a smaller fraction of the sequences in the collection, using a combination of sequence alignment, residue conservation scoring and graph-theoretical approaches.
Protein function prediction using domain families
The CAFA results put the domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons.
ECOD: An Evolutionary Classification of Protein Domains
A hierarchical evolutionary classification of all proteins with experimentally determined spatial structures, presented as an interactive and updatable online database that catalogs the largest number of evolutionary links among structural domain classifications.


CDD: specific functional annotation with the Conserved Domain Database
NCBI's Conserved Domain Database is a collection of multiple sequence alignments and derived database search models, which represent protein domains conserved in molecular evolution, and provides annotation of domain footprints and conserved functional sites on protein sequences.
Protein subfamily assignment using the Conserved Domain Database
This work proposes a method for assigning NCBI-curated domains from the Conserved Domain Database (CDD) that takes into account the organization of the domains into hierarchies of homologous domain models, and finds that simple heuristics based on sorting scores and domain-specific thresholds are effective at reducing classification error.
CDD: a database of conserved domain alignments with links to domain three-dimensional structure
The Conserved Domain Database (CDD) is a compilation of multiple sequence alignments representing protein domains conserved in molecular evolution. It has been populated with alignment data from the
CD-Search: protein domain annotations on the fly
We describe the Conserved Domain Search service (CD-Search), a web-based tool for the detection of structural and functional domains in protein sequences. CD-Search uses BLAST® heuristics to
TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes
The TIGRFAMs and Genome Properties systems are described, which are a collection of protein family definitions built to aid in high-throughput annotation of specific protein functions and a generator of phylogenetic profiles, through which new protein family functions may be discovered.
Pfam: the protein families database
Pfam, available via servers in the UK and the USA, is a widely used database of protein families, containing 14 831 manually curated entries in
The Pfam protein families database
The definition and use of family-specific, manually curated gathering thresholds are explained and some of the features of domains of unknown function (also known as DUFs) are discussed, which constitute a rapidly growing class of families within Pfam.
The COG database: an updated version includes eukaryotes
A major update of the previously developed system for delineation of Clusters of Orthologous Groups of proteins (COGs) from the sequenced genomes of prokaryotes and unicellular eukaryotes is described and is expected to be a useful platform for functional annotation of newly sequenced genomes, including those of complex eukaryotes, and genome-wide evolutionary studies.
SMART 5: domains in the context of genomes and networks
The new ‘Genomic’ mode in SMART makes it easy to analyze domain architectures in completely sequenced genomes, and the network context is now displayed in the results page for more than 350 000 proteins, enabling easy analyses of domain interactions.
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.