Learn More
MOTIVATION In 2001 and 2002, we published two papers (Bioinformatics, 17, 282-283, Bioinformatics, 18, 77-82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to only protein(More)
We present a fast and flexible program for clustering large protein databases at different sequence identity levels. It takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560,000 sequences on a high-end PC. The output database, including only the representative sequences, can be used(More)
Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling(More)
MOTIVATION Protein structures are flexible and undergo structural rearrangements as part of their function, and yet most existing protein structure comparison methods treat them as rigid bodies, which may lead to incorrect alignment. RESULTS We have developed the Flexible structure AlignmenT by Chaining AFPs (Aligned Fragment Pairs) with Twists (FATCAT),(More)
Tyrosine phosphorylation is catalyzed by protein tyrosine kinases, which are represented by 90 genes in the human genome. Here, we present the set of 107 genes in the human genome that encode members of the four protein tyrosine phosphatase (PTP) families. The four families of PTPases, their substrates, structure, function, regulation, and the role of these(More)
MOTIVATION Existing comparisons of protein structures are not able to describe structural divergence and flexibility in the structures being compared because they focus on identifying a common invariant core and ignore parts of the structures outside this core. Understanding the structural divergence and flexibility is critical for studying the evolution of(More)
The FFAS03 server provides a web interface to the third generation of the profile-profile alignment and fold-recognition algorithm of fold and function assignment system (FFAS) [L. Rychlewski, L. Jaroszewski, W. Li and A. Godzik (2000), Protein Sci., 9, 232-241]. Profile-profile algorithms use information present in sequences of homologous proteins to(More)
MOTIVATION Sequence clustering replaces groups of similar sequences in a database with single representatives. Clustering large protein databases like the NCBI Non-Redundant database (NR) using even the best currently available clustering algorithms is very time-consuming and only practical at relatively high sequence identity thresholds. Our previous(More)
Sequence databases are rapidly growing, thereby increasing the coverage of protein sequence space, but this coverage is uneven because most sequencing efforts have concentrated on a small number of organisms. The resulting granularity of sequence space creates many problems for profile-based sequence comparison programs. In this paper, we suggest several(More)