With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify <i>near-duplicate</i> records efficiently. In this article, we focus on efficient algorithms to find all pairs of records whose similarity is no less than a given threshold. Several existing algorithms rely on ...
There has been considerable interest in similarity join in the research community recently. Similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. We focus on efficient algorithms for similarity join with edit distance constraints. Existing approaches are mainly ...
With the increasing amount of text data stored in relational databases, there is a demand for RDBMSs to support keyword queries over text data. As a search result is often assembled from multiple relational tables, traditional IR-style ranking and query evaluation methods cannot be applied directly. In this paper, we study the <i>effectiveness</i> and the ...
Uncertain data is inherent in a few important applications such as environmental surveillance and mobile object tracking. Top-<i>k</i> queries (also known as ranking queries) are often natural and useful in analyzing uncertain data in those applications. In this paper, we study the problem of answering probabilistic threshold top-<i>k</i> queries on ...
Skyline computation has many applications, including multi-criteria decision making. In this paper, we study the problem of selecting k skyline points so that the number of points dominated by at least one of these k skyline points is maximized. We first present an efficient exact algorithm, based on dynamic programming, for 2D space. Then, we show ...
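To make the objective above concrete, here is a minimal brute-force sketch in Python. It is not the dynamic programming algorithm the abstract mentions: it simply tries every size-k subset of the skyline and counts the points each subset dominates. The function names and the convention that smaller values are better on every dimension are assumptions made for illustration.

```python
from itertools import combinations


def dominates(p, q):
    """Assumed convention: smaller is better. p dominates q if p is no worse
    on every dimension and strictly better on at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Points that no other point dominates (naive quadratic scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def top_k_representatives(points, k):
    """Exhaustively pick k skyline points that dominate the most points.

    Exponential in k; intended only to illustrate the objective, not to be
    efficient.
    """
    sky = skyline(points)
    k = min(k, len(sky))
    best, best_cover = (), -1
    for subset in combinations(sky, k):
        # Count points dominated by at least one chosen skyline point.
        cover = sum(1 for q in points if any(dominates(s, q) for s in subset))
        if cover > best_cover:
            best, best_cover = subset, cover
    return list(best), best_cover


if __name__ == "__main__":
    pts = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4), (2, 6), (6, 2)]
    print(top_k_representatives(pts, 1))
```

With k = 1 on the sample data, the point (2, 2) is selected because it dominates four of the other points, more than either of the remaining skyline points.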
It is widely recognized that the integration of database and information retrieval techniques will provide users with a wide range of high quality services. In this paper, we study processing an l-keyword query, p<sub>1</sub>, p<sub>2</sub>, ..., p<sub>l</sub>, against a relational database which can be modeled as a weighted graph, G(V, E). Here V is a set of ...
Given a query string Q, an edit similarity search finds all strings in a database whose edit distance to Q is no more than a given threshold t. Most existing methods for answering edit similarity queries rely on a signature scheme to generate candidates given the query string. We observe that the number of signatures generated by existing methods is far ...
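As a point of reference for the query definition above, the following Python sketch answers an edit similarity search by scanning the whole database. It does not use the signature scheme the abstract refers to; it only combines a simple length filter with the standard Levenshtein dynamic program. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity_search(query: str, database: list, t: int) -> list:
    """Return every string whose edit distance to `query` is at most t."""
    results = []
    for s in database:
        # Length filter: the edit distance is at least the length difference.
        if abs(len(s) - len(query)) > t:
            continue
        if edit_distance(query, s) <= t:
            results.append(s)
    return results


if __name__ == "__main__":
    db = ["smith", "smyth", "smithe", "jones", "sm"]
    print(edit_similarity_search("smith", db, 1))
```

A linear scan like this computes one distance per surviving string, which is the cost that candidate-generation schemes aim to avoid on large databases.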
Skyline has been proposed as an important operator for multi-criteria decision making, data mining and visualization, and user-preference queries. In this paper, we consider the problem of efficiently computing a Skycube, which consists of skylines of all possible non-empty subsets of a given set of dimensions. While existing skyline computation algorithms ...
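The Skycube definition above can be illustrated with a naive Python sketch that computes one skyline per non-empty subset of dimensions, with no sharing of work between subspaces. It is a baseline for understanding the definition, not the method the abstract proposes. The names and the smaller-is-better convention are assumptions.

```python
from itertools import combinations


def dominates(p, q, dims):
    """Assumed convention: smaller is better. p dominates q on `dims` if it is
    no worse on all of them and strictly better on at least one."""
    return (all(p[d] <= q[d] for d in dims)
            and any(p[d] < q[d] for d in dims))


def skyline(points, dims):
    """Skyline of `points` restricted to the dimensions in `dims`."""
    return [p for p in points
            if not any(dominates(q, p, dims) for q in points)]


def skycube(points, num_dims):
    """Map each non-empty subset of dimensions to its skyline.

    There are 2**num_dims - 1 subspaces, each computed independently here.
    """
    cube = {}
    for size in range(1, num_dims + 1):
        for dims in combinations(range(num_dims), size):
            cube[dims] = skyline(points, dims)
    return cube


if __name__ == "__main__":
    pts = [(1, 4), (2, 2), (4, 1), (3, 3)]
    for dims, sky in skycube(pts, 2).items():
        print(dims, sky)
```

Because the number of subspaces grows exponentially with the number of dimensions, recomputing each skyline from scratch quickly becomes expensive.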
Given an integer <i>k</i>, a <i>representative skyline</i> contains the <i>k</i> skyline points that best describe the tradeoffs among different dimensions offered by the full skyline. Although this topic has been previously studied, the existing solution may sometimes produce <i>k</i> points that appear in an arbitrarily tiny cluster, and therefore, fail to be ...