Cluster-preserving embedding of proteins by


Similarity searching in protein sequence databases is a standard technique for biologists dealing with a newly sequenced protein. Exhaustive search in such databases is prohibitive because the databases are large and pairwise comparisons are slow. Heuristic techniques such as FASTA and BLAST are useful because they are fast and accurate, but exhaustive search has been shown to be more accurate, so there are times when one would like to perform an exhaustive search. We propose an efficient method, called SparseMap, for preprocessing a database of proteins to support efficient similarity searches using expensive but sensitive distance functions, such as those based on Smith-Waterman similarity. Our method is based on a low-dimensional Euclidean embedding approach. We compare our method with other embedding approaches and show that it is faster and produces embeddings that preserve more biological information about the proteins, such as pairwise distances and biological clusters.
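To make the idea of embedding-based search concrete, here is a minimal sketch in Python. It is not SparseMap, whose construction this abstract does not describe. It assumes a generic reference-set embedding, in which each coordinate is the distance from a sequence to the nearest member of a small random subset of the database. Edit distance stands in for an expensive Smith-Waterman-based distance, and all names (`toy_distance`, `embed`, `build_index`, `nearest`) and parameter values are hypothetical.

```python
import math
import random


def toy_distance(a, b):
    """Levenshtein edit distance, a cheap stand-in for an expensive distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def embed(seq, reference_sets, dist=toy_distance):
    """Coordinate k is the distance from seq to the closest member of set k."""
    return [min(dist(seq, r) for r in ref) for ref in reference_sets]


def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def build_index(database, n_sets=4, set_size=2, seed=0):
    """Preprocess once: choose reference sets (assumed random) and embed everything."""
    rng = random.Random(seed)
    reference_sets = [rng.sample(database, set_size) for _ in range(n_sets)]
    vectors = [embed(s, reference_sets) for s in database]
    return reference_sets, vectors


def nearest(query, database, reference_sets, vectors, k=1):
    """Rank database entries by cheap Euclidean distance in the embedded space."""
    q = embed(query, reference_sets)
    ranked = sorted(range(len(database)), key=lambda i: euclidean(q, vectors[i]))
    return [database[i] for i in ranked[:k]]


# Toy sequences, purely illustrative.
database = ["MKVLA", "MKVLG", "MRVLA", "GGSTP", "GGSSP", "AGSTP"]
reference_sets, vectors = build_index(database)
candidates = nearest("MKVLS", database, reference_sets, vectors, k=3)
```

In this sketch a query costs one expensive distance computation per reference-set member rather than one per database entry, and the candidates it returns could then be re-ranked with the expensive distance. How well such an embedding preserves pairwise distances and clusters depends on how the reference sets are chosen, which is the kind of property the paper evaluates.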


Cite this paper

@inproceedings{Hristescu1999ClusterP, title={Cluster-preserving embedding of proteins by}, author={Gabriela Hristescu and Martin Farach-Colton}, year={1999} }