MOTIVATION: In 2001 and 2002, we published two papers (Bioinformatics, 17, 282-283; Bioinformatics, 18, 77-82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to only protein…
We present a fast and flexible program for clustering large protein databases at different sequence identity levels. It takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560,000 sequences on a high-end PC. The output database, including only the representative sequences, can be used…
MOTIVATION: Protein structures are flexible and undergo structural rearrangements as part of their function, and yet most existing protein structure comparison methods treat them as rigid bodies, which may lead to incorrect alignment. RESULTS: We have developed the Flexible structure AlignmenT by Chaining AFPs (Aligned Fragment Pairs) with Twists (FATCAT),…
Tyrosine phosphorylation is catalyzed by protein tyrosine kinases, which are represented by 90 genes in the human genome. Here, we present the set of 107 genes in the human genome that encode members of the four protein tyrosine phosphatase (PTP) families. The four families of PTPases, their substrates, structure, function, regulation, and the role of these…
Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling…
MOTIVATION: Existing comparisons of protein structures are not able to describe structural divergence and flexibility in the structures being compared because they focus on identifying a common invariant core and ignore parts of the structures outside this core. Understanding the structural divergence and flexibility is critical for studying the evolution of…
AhpF, the flavin-containing component of the Salmonella typhimurium alkyl hydroperoxide reductase system, catalyzes the NADH-dependent reduction of an active-site disulfide bond in the other component, AhpC, which in turn reduces hydroperoxide substrates. The amino acid sequence of the C-terminus of AhpF is 35% identical to that of thioredoxin reductase…
  • Y Okazaki, M Furuno, T Kasukawa, J Adachi, H Bono, S Kondo +131 others
  • 2002
Only a small proportion of the mouse genome is transcribed into mature messenger RNA transcripts. There is an international collaborative effort to identify all full-length mRNA transcripts from the mouse, and to ensure that each is represented in a physical collection of clones. Here we report the manual annotation of 60,770 full-length mouse complementary…
Protein structure comparison, an important problem in structural biology, has two main applications: (i) comparing two protein structures in order to identify the similarities and differences between them, and (ii) searching for structures similar to a query structure. Many web-based resources for both applications are available, but all are based on rigid…