We suggest that the curse of dimensionality affecting similarity-based search in large datasets is a manifestation of the phenomenon of concentration of measure on high-dimensional structures. We prove that, under certain geometric assumptions on the query domain Ω and the dataset X, if Ω satisfies the so-called concentration property, then for most …
Exchangeable random variables form an important and well-studied generalization of i.i.d. variables; however, simple examples show that no nontrivial concept or function classes are PAC learnable under general exchangeable data inputs X1, X2, …. Inspired by the work of Berti and Rigo on a Glivenko–Cantelli theorem for exchangeable inputs, we propose a …
We suggest a variation of the Hellerstein–Koutsoupias–Papadimitriou indexability model for datasets equipped with a similarity measure, with the aim of better understanding the structure of indexing schemes for similarity-based search and the geometry of similarity workloads. This in particular provides a unified approach to a great variety of schemes used …
We propose a family of very efficient hierarchical indexing schemes for ungapped, score matrix-based similarity search in large datasets of short (4-12 amino acid) protein fragments. This type of similarity search is important both as a building block for more complex algorithms and for possible use in direct biological investigations, and …