Incremental generation of summarized clustering hierarchy for protein family analysis
This paper presents a method for clustering a large and mixed set of uncharacterized sequences provided by genome projects. As the measure of the clustering, we use a fast approximation of sequence similarity (FASTA score). However, in the case to detect similarity between two sequences that are much diverged in evolutionary process, FASTA sometimes underestimates the similarity compared to the rigorous Smith-Waterman algorithm. Also the distance derived from the similarity score may not be metric since the triangle inequality may not hold when the sequences have multi-domain structure. To cope with these problems, we introduce a new graph structure called p-quasi complete graph for describing a cluster of sequences with a con dence measure. We prove that a restricted version of the p-quasi complete graph problem (given a positive integer k, whether a graph contains a 0.5-quasi complete subgraph of which size k or not) is NP-complete. Thus we present the outline of an approximation algorithm for clustering a set of sequences into subsets corresponding to p-quasi complete graphs. The e ectiveness of our method is demonstrated by the result of clustering Escherichia coli protein sequences by our method.