The Pfam protein families database: towards a more sustainable future

@article{Finn2016ThePP,
  title={The Pfam protein families database: towards a more sustainable future},
  author={Robert D. Finn and Penny C. Coggill and Ruth Y. Eberhardt and Sean R. Eddy and Jaina Mistry and Alex L. Mitchell and Simon C. Potter and Marco Punta and Matloob Qureshi and Amaia Sangrador-Vegas and Gustavo A. Salazar and John G. Tate and Alex Bateman},
  journal={Nucleic Acids Research},
  year={2016},
  volume={44},
  pages={D279--D285}
}
In the last two years the Pfam database (http://pfam.xfam.org) has undergone a substantial reorganisation to reduce the effort involved in making a release, thereby permitting more frequent releases. Arguably the most significant of these changes is that Pfam is now primarily based on the UniProtKB reference proteomes, with the counts of matched sequences and species reported on the website restricted to this smaller set. Building families on reference proteome sequences brings greater…

The Pfam protein families database in 2019
TLDR
A significant comparison to the ECOD structural classification database was carried out, leading to the creation of 825 new families based on its set of uncharacterized families (EUFs), and Pfam entries were connected to the Sequence Ontology (SO) through mapping of the Pfam type definitions to SO terms.
Pfam: The protein families database in 2021
TLDR
The Pfam database is a widely used resource for classifying protein sequences into families and domains; this release reintroduces Pfam-B, an automatically generated supplement to Pfam that contains 136 730 novel clusters of sequences not yet matched by a Pfam family.
DPCfam: a new method for unsupervised protein family classification
TLDR
DPCfam is introduced, a new unsupervised procedure that uses sequence alignments and Density Peak Clustering to automatically classify homologous protein regions and shows potential both for assisting manual annotation efforts and for stand-alone classification of sparsely annotated protein datasets such as those from environmental metagenomics studies.
The MEROPS database of proteolytic enzymes, their substrates and inhibitors in 2017 and a comparison with peptidases in the PANTHER database
TLDR
Insight is given into the origins and evolution of peptidase families, including an expansion in the number of proteasome components in Asgard archaeotes and as organisms increase in complexity.
Genome properties in 2019: a new companion database to InterPro for the inference of complete functional attributes
TLDR
To increase the scope of coverage, GPs have migrated to function as a companion resource utilizing InterPro entries, adding ∼700 new GPs, increasing the coverage of eukaryotic systems, as well as increasing general coverage through automatic generation of GPs from related resources.
A sequence family database built on ECOD structural domains
TLDR
This work created multiple sequence alignments and profiles from ECOD domains with the help of structural information in alignment building and boundary delineation and validated the alignment quality by scoring structure superposition to demonstrate that they are comparable to curated seed alignments in Pfam.
The Bologna Annotation Resource (BAR 3.0): improving protein functional annotation
TLDR
BAR 3.0 offers an improved search interface, allowing queries by UniProtKB-accession, Fasta sequence, GO-term, PFAM-domain, organism, PDB and ligand/s, and largely outperforms the previous version and scores among state-of-the-art methods.
PFASUM: a substitution matrix from Pfam structural alignments
TLDR
This study shows that PFASUM matrices can yield significantly better homology search results than conventional substitution matrices, and that in many cases they also improve MSA quality.
Improving pairwise comparison of protein sequences with domain co-occurrence
TLDR
A method to take domain co-occurrence into account in a typical BLAST analysis and to construct new domain families on the basis of these results, which identified 2473 new domains for which no model of the Pfam database could be linked.
SIFTS: updated Structure Integration with Function, Taxonomy and Sequences resource allows 40-fold increase in coverage of structure-based annotations for proteins
TLDR
The recently released implementation of SIFTS includes support for multiple cross-references for proteins in the PDB, allowing mappings to UniProtKB isoforms and UniRef90 cluster members, and makes structure data in the PDB readily available to over 1.8 million UniProtKB accessions.

References

Pfam: the protein families database
Pfam, available via servers in the UK (http://pfam.sanger.ac.uk/) and the USA (http://pfam.janelia.org/), is a widely used database of protein families, containing 14 831 manually curated entries in
UniProt: a hub for protein information
TLDR
An annotation score for all entries in UniProt is introduced to represent the relative amount of knowledge known about each protein to help identify which proteins are the best characterized and most informative for comparative analysis.
Representative Proteomes: A Stable, Scalable and Unbiased Proteome Set for Sequence Analysis and Functional Annotation
TLDR
A set of Representative Proteomes is presented, each selected from a Representative Proteome Group containing similar proteomes calculated based on co-membership in UniRef50 clusters; a CMT of 55% (RP55) is found to most closely follow standard taxonomic classifications.
SIFTS: Structure Integration with Function, Taxonomy and Sequences resource
TLDR
The two teams have developed a semi-automated process for maintaining up-to-date cross-reference information to UniProt entries, for all protein chains in the PDB entries present in the UniProt database.
The InterPro protein families database: the classification resource after 15 years
TLDR
The new domain architecture search tool is described and the process of mapping of Gene Ontology terms to InterPro is outlined, and the challenges faced by the resource given the explosive growth in sequence data in recent years are discussed.
SCOOP: a simple method for identification of novel protein superfamily relationships
TLDR
A simpler approach than profile-profile comparison is presented that has a comparable performance to state-of-the-art tools such as COMPASS, HHsearch and PRC, and is shown to find known relationships between families in the Pfam database as well as detect novel distant relationships between families.
AntiFam: a tool to help identify spurious ORFs in protein annotation
TLDR
This resource currently contains 23 families that identified 1310 spurious proteins in UniProtKB and a further 4119 spurious proteins on the opposite strand or in a collection of metagenomic sequences.
MobiDB 2.0: an improved database of intrinsically disordered and mobile proteins
TLDR
A new version of MobiDB is provided, a centralized source aimed at providing the most complete picture on different flavors of disorder in protein structures covering all UniProt sequences, and features a consensus annotation and classification for long disordered regions.
Pfam: clans, web tools and services
TLDR
Improvements to the range of Pfam web tools and the first set of Pfam web services that allow programmatic access to the database and associated tools are presented.