Nearest neighbor (NN) search in high-dimensional space is an important problem in many applications. Ideally, a practical solution (i) should be implementable in a relational database, and (ii) its query cost should grow <i>sub-linearly</i> with the dataset size, regardless of the data and query distributions. Despite the bulk of NN literature, no solution ...
Being popular on YouTube is becoming a fundamental way of promoting one's self, services or products. In this paper, we conduct an in-depth study of fundamental properties of video popularity on YouTube. We collect and study arguably the largest dataset of YouTube videos, roughly 37 million, accounting for 25% of all YouTube videos. We analyze ...
This paper studies the <i>nearest keyword</i> (<i>NK</i>) problem on XML documents. In general, the dataset is a tree where each node is associated with one or more keywords. Given a node q and a keyword w, an NK query returns the node that is nearest to q among all the nodes associated with w. NK search is not only useful as a stand-alone operator but also ...
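As an illustration of the NK query defined above, here is a minimal baseline sketch, not the method of the paper. It assumes that distance is the number of tree edges between two nodes, that q itself may be returned if it carries w, and that ties are broken arbitrarily; the names `nearest_keyword`, `adj` and `keywords` are hypothetical.

```python
from collections import deque
from typing import Dict, Hashable, List, Optional, Set


def nearest_keyword(
    adj: Dict[Hashable, List[Hashable]],
    keywords: Dict[Hashable, Set[str]],
    q: Hashable,
    w: str,
) -> Optional[Hashable]:
    """Breadth-first search from q over an undirected tree.

    adj      -- adjacency lists (each edge listed in both directions)
    keywords -- keyword set attached to each node
    Returns the first node reached that carries w, or None if no node does.
    Assumption: distance is measured in number of edges.
    """
    seen = {q}
    queue = deque([q])
    while queue:
        node = queue.popleft()
        if w in keywords.get(node, set()):
            return node  # BFS visits nodes in non-decreasing distance from q
        for neighbor in adj.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return None


# A small hypothetical document tree: root has children a and b;
# a has children c and d; b has child e.
adj = {
    "root": ["a", "b"],
    "a": ["root", "c", "d"],
    "b": ["root", "e"],
    "c": ["a"],
    "d": ["a"],
    "e": ["b"],
}
keywords = {"c": {"xml"}, "d": {"db"}, "e": {"xml", "db"}}
```

This scan costs time linear in the tree size per query, so it serves only as a correctness reference for the query semantics.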
The miR-183/-96/-182 cluster is a conserved polycistronic microRNA (miRNA) cluster which is highly expressed in most breast cancers. Although there are some sporadic reports which demonstrate the importance of each miRNA in this cluster in breast cancer, the biological roles of this cluster as a whole and its regulation mechanisms in breast cancer are still ...
Nearest Neighbor (NN) search in high-dimensional space is an important problem in many applications. From the database perspective, a good solution needs to have two properties: (i) it can be easily incorporated in a relational database, and (ii) its query cost should increase <i>sublinearly</i> with the dataset size, regardless of the data and query ...
A hidden database refers to a dataset that an organization makes accessible on the web by allowing users to issue queries through a search interface. In other words, data acquisition from such a source is not by following static hyperlinks. Instead, data are obtained by querying the interface and reading the dynamically generated result page. This, with ...
Quantiles are a crucial type of order statistics in databases. Extensive research has focused on maintaining a space-efficient structure for approximate quantile computation as the underlying dataset is updated. The existing solutions, however, are designed to support only the current, most up-to-date snapshot of the dataset. Queries on the past versions ...
Given two vertices <i>s</i>, <i>t</i> in a graph, let <i>P</i> be the shortest path (SP) from <i>s</i> to <i>t</i>, and <i>P*</i> a subset of the vertices in <i>P</i>. <i>P*</i> is a <i>k</i>-skip shortest path from <i>s</i> to <i>t</i> if it includes at least one vertex out of every <i>k</i> consecutive vertices in <i>P</i>. In general, <i>P*</i> succinctly describes <i>P</i> ...
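The k-skip condition above can be checked directly from its definition. The following is a minimal sketch of such a check, assuming the shortest path is given as an ordered vertex list; the function name `is_k_skip` is hypothetical and the code does not compute shortest paths itself.

```python
from typing import Hashable, Sequence, Set


def is_k_skip(path: Sequence[Hashable], subset: Set[Hashable], k: int) -> bool:
    """Return True if `subset` contains at least one vertex out of every
    k consecutive vertices of `path`, and only vertices of `path`."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not set(subset) <= set(path):
        return False  # P* must be a subset of the vertices of P
    # Slide a window of k consecutive vertices along the path.
    # A path with fewer than k vertices has no such window, so the
    # condition holds vacuously.
    for start in range(len(path) - k + 1):
        window = path[start:start + k]
        if not any(v in subset for v in window):
            return False
    return True


# Hypothetical shortest path from s to t.
path = ["s", "a", "b", "c", "d", "t"]
```

With this path, `{"a", "c", "t"}` passes for k = 2, while `{"s", "t"}` fails for k = 3 because the window a, b, c contains neither vertex.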
We consider the <i>skyline problem</i> (a.k.a. the <i>maxima problem</i>), which has been extensively studied in the database community. The input is a set <i>P</i> of <i>d</i>-dimensional points. A point <i>dominates</i> another if the former has a lower coordinate than the latter on every dimension. The goal is to find the <i>skyline</i>, which is the set ...
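To make the dominance relation concrete, here is a minimal quadratic-time sketch. The abstract is cut off before it finishes defining the skyline, so the code assumes the standard definition: the points of <i>P</i> that no other point dominates. Dominance is implemented exactly as stated above (strictly lower on every dimension); the names `dominates` and `skyline` are hypothetical, and this brute-force scan is not the algorithm of the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """p dominates q if p is strictly lower than q on every dimension."""
    return all(a < b for a, b in zip(p, q))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Assumed standard definition: keep the points dominated by no other.

    Compares every pair of points, so it runs in O(d * n^2) time.
    A point never dominates itself because the comparison is strict.
    """
    return [p for p in points if not any(dominates(q, p) for q in points)]


points = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]
```

Here (3, 3) and (4, 4) are both dominated by (2, 2), so `skyline(points)` keeps (1, 5), (2, 2) and (5, 1).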
We consider the <i>orthogonal range aggregation</i> problem. The dataset <i>S</i> consists of <i>N</i> axis-parallel rectangles in R<sup>2</sup>, each of which is associated with an integer <i>weight</i>. Given an axis-parallel rectangle <i>Q</i> and an aggregate function <i>F</i>, a query reports the aggregated result of the weights of the rectangles in ...