Caetano Traina

In this paper we present the Slim-tree, a dynamic tree for organizing metric datasets in pages of fixed size. The Slim-tree uses the "fat-factor", which provides a simple way to quantify the degree of overlap between the nodes in a metric tree. It is well known that the degree of overlap directly affects the query performance of index structures. There are…
Abstract: Many recent database applications must deal with similarity queries. For such applications, it is important to measure the similarity between two objects using the distance between them. Focusing on this problem, this paper proposes the Slim-tree, a new dynamic tree for organizing metric data sets in pages of fixed size. The Slim-tree uses the…
Dimensionality curse and dimensionality reduction are two issues that have retained high interest for data mining, machine learning, multimedia indexing, and clustering. We present a fast, scalable algorithm to quickly select the most important attributes (dimensions) for a given set of n-dimensional vectors. In contrast to older methods, our method has the…
In this paper we describe a general framework for evaluation and optimization of methods for diversifying query results. In these methods, an initial ranking candidate set produced by a query is used to construct a result set, where elements are ranked with respect to relevance and diversity features, i.e., the retrieved elements should be as relevant as…
Similarity search operations require executing expensive algorithms, and although broadly useful in many new applications, they rely on specific structures not yet supported by commercial DBMSs. In this paper we discuss the new Omni-technique, which allows building a variety of dynamic Metric Access Methods based on a number of selected objects from the…
We discovered a surprising law governing the spatial join selectivity across two sets of points. An example of such a spatial join is "find the libraries that are within 10 miles of schools". Our law dictates that the number of such qualifying pairs follows a power law, whose exponent we call the "pair-count exponent" (PC). We…
Metric Access Methods (MAMs) are employed to accelerate the processing of similarity queries, such as range and k-nearest neighbor queries. Current methods improve query performance by minimizing the number of disk accesses and by keeping a constant height for the structures stored on disk (height-balanced trees). The Slim-tree and the M-tree are the most…
Given a very large dataset of moderate-to-high dimensionality, how could one cluster its points? For datasets that don't fit even on a single disk, parallelism is a first-class option. In this paper we explore MapReduce for clustering this kind of data. The main questions are (a) how to minimize the I/O cost, taking into account the already existing data…