The Pfam protein families database: towards a more sustainable future
@article{Finn2016ThePP,
  title={The Pfam protein families database: towards a more sustainable future},
  author={Robert D. Finn and Penny C. Coggill and Ruth Y. Eberhardt and Sean R. Eddy and Jaina Mistry and Alex L. Mitchell and Simon C. Potter and Marco Punta and Matloob Qureshi and Amaia Sangrador-Vegas and Gustavo A. Salazar and John G. Tate and Alex Bateman},
  journal={Nucleic Acids Research},
  year={2016},
  volume={44},
  pages={D279--D285}
}
In the last two years the Pfam database (http://pfam.xfam.org) has undergone a substantial reorganisation to reduce the effort involved in making a release, thereby permitting more frequent releases. Arguably the most significant of these changes is that Pfam is now primarily based on the UniProtKB reference proteomes, with the counts of matched sequences and species reported on the website restricted to this smaller set. Building families on reference proteomes sequences brings greater…
4,402 Citations
The Pfam protein families database in 2019
- Computer Science
- Nucleic Acids Res.
- 2019
A significant comparison to the structural classification database that led to the creation of 825 new families based on their set of uncharacterized families (EUFs) was carried out and Pfam entries were connected to the Sequence Ontology (SO) through mapping of the Pfam type definitions to SO terms.
Pfam: The protein families database in 2021
- Biology
- Nucleic Acids Res.
- 2021
The Pfam database is a widely used resource for classifying protein sequences into families and domains. This release reintroduces Pfam-B, which provides an automatically generated supplement to Pfam and contains 136 730 novel clusters of sequences that are not yet matched by a Pfam family.
DPCfam: a new method for unsupervised protein family classification
- Computer Science
- bioRxiv
- 2020
DPCfam is introduced, a new unsupervised procedure that uses sequence alignments and Density Peak Clustering to automatically classify homologous protein regions and shows potential both for assisting manual annotation efforts and for stand-alone classification of sparsely annotated protein datasets such as those from environmental metagenomics studies.
The MEROPS database of proteolytic enzymes, their substrates and inhibitors in 2017 and a comparison with peptidases in the PANTHER database
- Biology
- Nucleic Acids Res.
- 2018
Insight is given into the origins and evolution of peptidase families, including an expansion in the number of proteasome components in Asgard archaeotes and as organisms increase in complexity.
Genome properties in 2019: a new companion database to InterPro for the inference of complete functional attributes
- Biology, Computer Science
- Nucleic Acids Res.
- 2019
To increase the scope of coverage, GPs have migrated to function as a companion resource utilizing InterPro entries, adding ∼700 new GPs, increasing the coverage of eukaryotic systems, as well as increasing general coverage through automatic generation of GPs from related resources.
A sequence family database built on ECOD structural domains
- Computer Science
- Bioinform.
- 2018
This work created multiple sequence alignments and profiles from ECOD domains with the help of structural information in alignment building and boundary delineation and validated the alignment quality by scoring structure superposition to demonstrate that they are comparable to curated seed alignments in Pfam.
The Bologna Annotation Resource (BAR 3.0): improving protein functional annotation
- Computer Science
- Nucleic Acids Res.
- 2017
BAR 3.0 offers an improved search interface, allowing queries by UniProtKB-accession, Fasta sequence, GO-term, PFAM-domain, organism, PDB and ligand/s, and largely outperforms the previous version and scores among state-of-the-art methods.
PFASUM: a substitution matrix from Pfam structural alignments
- Biology
- BMC Bioinformatics
- 2017
This study shows that PFASUM matrices can lead to significantly better homology search results than conventional substitution matrices, and that they improve MSA quality in many cases.
Improving pairwise comparison of protein sequences with domain co-occurrence
- Biology
- bioRxiv
- 2017
A method to take domain co-occurrence into account in a typical BLAST analysis and to construct new domain families on the basis of these results, which identified 2473 new domains for which no model of the Pfam database could be linked.
SIFTS: updated Structure Integration with Function, Taxonomy and Sequences resource allows 40-fold increase in coverage of structure-based annotations for proteins
- Biology
- Nucleic Acids Res.
- 2019
The recently released implementation of SIFTS includes support for multiple cross-references for proteins in the PDB, allowing mappings to UniProtKB isoforms and UniRef90 cluster members, and makes structure data in the PDB readily available to over 1.8 million UniProtKB accessions.
References
Pfam: the protein families database
- Biology
- Nucleic Acids Res.
- 2008
Pfam, available via servers in the UK (http://pfam.sanger.ac.uk/) and the USA (http://pfam.janelia.org/), is a widely used database of protein families, containing 14 831 manually curated entries in…
UniProt: a hub for protein information
- Biology
- Nucleic Acids Res.
- 2015
An annotation score for all entries in UniProt is introduced to represent the relative amount of knowledge known about each protein to help identify which proteins are the best characterized and most informative for comparative analysis.
Representative Proteomes: A Stable, Scalable and Unbiased Proteome Set for Sequence Analysis and Functional Annotation
- Biology
- PLoS ONE
- 2011
A set of Representative Proteomes is presented, each selected from a Representative Proteome Group containing similar proteomes calculated based on co-membership in UniRef50 clusters; a CMT of 55% (RP55) is found to most closely follow standard taxonomic classifications.
SIFTS: Structure Integration with Function, Taxonomy and Sequences resource
- Computer Science, Biology
- Nucleic Acids Res.
- 2013
The two teams have developed a semi-automated process for maintaining up-to-date cross-reference information to UniProt entries, for all protein chains in the PDB entries present in the UniProt database.
The InterPro protein families database: the classification resource after 15 years
- Computer Science
- Nucleic Acids Res.
- 2015
The new domain architecture search tool is described and the process of mapping of Gene Ontology terms to InterPro is outlined, and the challenges faced by the resource given the explosive growth in sequence data in recent years are discussed.
SCOOP: a simple method for identification of novel protein superfamily relationships
- Computer Science
- Bioinform.
- 2007
A simpler approach than profile-profile comparison is presented that has a comparable performance to state-of-the-art tools such as COMPASS, HHsearch and PRC, and is shown to find known relationships between families in the Pfam database as well as detect novel distant relationships between families.
AntiFam: a tool to help identify spurious ORFs in protein annotation
- Biology
- Database J. Biol. Databases Curation
- 2012
This resource currently contains 23 families that identified 1310 spurious proteins in UniProtKB and a further 4119 spurious proteins on the opposite strand or in a collection of metagenomic sequences.
MobiDB 2.0: an improved database of intrinsically disordered and mobile proteins
- Computer Science, Biology
- Nucleic Acids Res.
- 2015
A new version of MobiDB is provided, a centralized source aimed at providing the most complete picture on different flavors of disorder in protein structures covering all UniProt sequences, and features a consensus annotation and classification for long disordered regions.
Pfam: clans, web tools and services
- Computer Science
- Nucleic Acids Res.
- 2006
Improvements to the range of Pfam web tools and the first set of Pfam web services that allow programmatic access to the database and associated tools are presented.