Nearest neighbor (NN) search in high-dimensional space is an important problem in many applications. Ideally, a practical solution (i) should be implementable in a relational database, and (ii) its query cost should grow <i>sub-linearly</i> with the dataset size, regardless of the data and query distributions. Despite the bulk of NN literature, no solution…
Being popular on YouTube is becoming a fundamental way of promoting one's self, services or products. In this paper, we conduct an in-depth study of fundamental properties of video popularity on YouTube. We collect and study arguably the largest dataset of YouTube videos, roughly 37 million, accounting for 25% of all YouTube videos. We analyze…
This paper studies the <i>nearest keyword</i> (<i>NK</i>) problem on XML documents. In general, the dataset is a tree where each node is associated with one or more keywords. Given a node q and a keyword w, an NK query returns the node that is nearest to q among all the nodes associated with w. NK search is not only useful as a stand-alone operator but also…
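To make the NK query definition above concrete, here is a minimal brute-force sketch. It is not the paper's method: it assumes distance means the number of edges on the tree path between two nodes (the visible text does not define distance), and the names `nearest_keyword`, `adj`, and `keywords` are hypothetical.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Optional, Set


def nearest_keyword(
    adj: Dict[Hashable, Iterable[Hashable]],
    keywords: Dict[Hashable, Set[str]],
    q: Hashable,
    w: str,
) -> Optional[Hashable]:
    """Return a node nearest to q that carries keyword w, or None.

    Assumption: distance is the edge count in the (undirected) tree,
    so a breadth-first search from q visits nodes in distance order.
    """
    seen = {q}
    queue = deque([q])
    while queue:
        node = queue.popleft()
        if w in keywords.get(node, ()):
            return node  # first hit in BFS order is a nearest match
        for neighbor in adj.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return None  # no node is associated with w


# Hypothetical tree: edges 1-2, 1-3, 2-4, 2-5, 3-6 (stored in both directions).
adj = {1: [2, 3], 2: [1, 4, 5], 3: [1, 6], 4: [2], 5: [2], 6: [3]}
keywords = {3: {"xml"}, 4: {"db"}, 6: {"db", "xml"}}
```

This scan costs time linear in the tree size per query, which is the baseline an index for NK search would aim to beat.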
A hidden database refers to a dataset that an organization makes accessible on the web by allowing users to issue queries through a search interface. In other words, data are not acquired from such a source by following static hyperlinks. Instead, data are obtained by querying the interface and reading the dynamically generated result page. This, with…
Quantiles are a crucial type of order statistics in databases. Extensive research has focused on maintaining a space-efficient structure for approximate quantile computation as the underlying dataset is updated. The existing solutions, however, are designed to support only the current, most up-to-date snapshot of the dataset. Queries on the past versions…
Nearest Neighbor (NN) search in high-dimensional space is an important problem in many applications. From the database perspective, a good solution needs to have two properties: (i) it can be easily incorporated in a relational database, and (ii) its query cost should increase <i>sublinearly</i> with the dataset size, regardless of the data and query…
Given two vertices <i>s</i>, <i>t</i> in a graph, let <i>P</i> be the shortest path (SP) from <i>s</i> to <i>t</i>, and <i>P*</i> a subset of the vertices in <i>P</i>. <i>P*</i> is a <i>k</i>-skip shortest path from <i>s</i> to <i>t</i> if it includes at least one vertex out of every <i>k</i> consecutive vertices in <i>P</i>. In general, <i>P*</i> succinctly describes <i>P</i>…
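The k-skip condition above can be checked directly on a path given as a vertex sequence. The sketch below only illustrates the definition and is not the paper's algorithm; `is_k_skip` and `every_kth` are hypothetical helper names, and the shortest path is assumed to be already known.

```python
from typing import Hashable, List, Sequence


def is_k_skip(path: Sequence[Hashable], subset: Sequence[Hashable], k: int) -> bool:
    """Check that subset is drawn from path and hits every window of
    k consecutive path vertices at least once."""
    chosen = set(subset)
    if not chosen <= set(path):
        return False  # P* must be a subset of the vertices of P
    for start in range(len(path) - k + 1):
        window = path[start:start + k]
        if not any(v in chosen for v in window):
            return False  # found k consecutive vertices with none kept
    return True


def every_kth(path: Sequence[Hashable], k: int) -> List[Hashable]:
    """One simple valid choice: keep every k-th vertex of the path.

    Any k consecutive positions contain exactly one index congruent to
    k - 1 modulo k, so each window contains a kept vertex.
    """
    return list(path[k - 1::k])


path = ["a", "b", "c", "d", "e", "f", "g"]  # hypothetical shortest path
```

For example, with `k = 3` the call `every_kth(path, 3)` keeps `c` and `f`, which is enough to cover every run of three consecutive vertices, whereas keeping only the endpoints `a` and `g` is not.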
We consider the <i>skyline problem</i> (a.k.a. the <i>maxima problem</i>), which has been extensively studied in the database community. The input is a set <i>P</i> of <i>d</i>-dimensional points. A point <i>dominates</i> another if the former has a lower coordinate than the latter on every dimension. The goal is to find the <i>skyline</i>, which is the set…
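The dominance rule above translates into a short quadratic-time sketch. Because the abstract is cut off before it finishes defining the skyline, the code assumes the usual meaning: the points of <i>P</i> that no other point dominates. It is a naive baseline for illustration, not the algorithm the paper proposes, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if a has a lower coordinate than b on every dimension."""
    return all(x < y for x, y in zip(a, b))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Assumed definition: keep each point that no point in the input dominates.

    Compares all pairs, so it takes O(n^2 * d) time for n points in d dimensions.
    A point never dominates itself because the comparison is strict.
    """
    return [p for p in points if not any(dominates(q, p) for q in points)]


points = [(1, 4), (2, 2), (3, 3), (4, 1)]  # hypothetical 2-dimensional input
```

Here `(3, 3)` is dominated by `(2, 2)`, so `skyline(points)` keeps the other three points.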
We consider the <i>orthogonal range aggregation</i> problem. The dataset <i>S</i> consists of <i>N</i> axis-parallel rectangles in R<sup>2</sup>, each of which is associated with an integer <i>weight</i>. Given an axis-parallel rectangle <i>Q</i> and an aggregate function <i>F</i>, a query reports the aggregated result of the weights of the rectangles in…
Let D be a given set of (string) documents of total length n. The top-k document retrieval problem is to index D such that, when a query arrives with a pattern P of length p and a parameter k, the index returns the k documents that are most relevant to P. We present the first non-trivial external memory index supporting top-k document retrieval queries in…