Data analysis applications typically aggregate data across many dimensions looking for anomalies or unusual patterns. The SQL aggregate functions and the GROUP BY operator produce zero-dimensional or one-dimensional aggregates. Applications need the N-dimensional generalization of these operators. This paper defines that operator, called the data cube or …
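As a rough illustration of what an N-dimensional generalization of GROUP BY could look like, the sketch below sums a measure over every subset of the grouping dimensions. The `data_cube` function, the `ALL` placeholder, and the sample `sales` rows are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # hypothetical marker for a dimension that has been aggregated away


def data_cube(rows, dims, measure):
    """Sum `measure` over every subset of `dims` (2**len(dims) group-bys)."""
    cube = defaultdict(int)
    for k in range(len(dims) + 1):
        for kept in combinations(dims, k):
            for row in rows:
                # Keep the value of retained dimensions, collapse the rest to ALL.
                key = tuple(row[d] if d in kept else ALL for d in dims)
                cube[key] += row[measure]
    return dict(cube)


# Made-up sample data, for illustration only.
sales = [
    {"model": "Chevy", "year": 1994, "units": 10},
    {"model": "Chevy", "year": 1995, "units": 20},
    {"model": "Ford", "year": 1994, "units": 5},
]
cube = data_cube(sales, ["model", "year"], "units")
```

Here the key `("ALL", "ALL")` holds the zero-dimensional grand total, keys with one `ALL` hold the one-dimensional GROUP BY results, and the remaining keys hold the full two-dimensional breakdown.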
Data cleaning based on similarities involves identification of "close" tuples, where closeness is evaluated using a variety of similarity functions chosen to suit the domain and application. Current approaches for efficiently implementing such similarity joins are tightly tied to the chosen similarity function. In this paper, we propose a new primitive …
Internet search engines have popularized the keyword-based search paradigm. While traditional database management systems offer powerful query languages, they do not allow keyword-based search. In this paper, we discuss DBXplorer, a system that enables keyword-based search in relational databases. DBXplorer has been implemented using a commercial relational …
Attributes of a relation are not typically independent. Multidimensional histograms can be an effective tool for accurate multiattribute query selectivity estimation. In this paper, we introduce STHoles, a "workload-aware" histogram that allows bucket nesting to capture data regions with reasonably uniform tuple density. STHoles …
Automatically selecting an appropriate set of materialized views and indexes for SQL databases is a non-trivial task. A judicious choice must be cost-driven and influenced by the workload experienced by the system. Although there has been work in materialized view selection in the context of multidimensional (OLAP) databases, no past work has looked at the …
To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product …
In many applications, users specify target values for certain attributes, without requiring exact matches to these values in return. Instead, the result of such queries is typically a ranking of the "top k" tuples that best match the given attribute values. In this paper, we study the advantages and limitations of processing a top-k query by translating it …
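To make the query semantics concrete, here is a minimal sketch of a top-k query answered by a full scan. It is a naive baseline, not the translation the paper studies. The Euclidean scoring function, the `top_k` helper, and the sample `homes` rows are assumptions chosen for illustration.

```python
import heapq
import math


def top_k(rows, target, k):
    """Return the k rows closest to `target` under Euclidean distance."""
    def distance(row):
        return math.sqrt(
            sum((row[attr] - value) ** 2 for attr, value in target.items())
        )
    # nsmallest scans every row once and keeps the k best matches, closest first.
    return heapq.nsmallest(k, rows, key=distance)


# Made-up sample data, for illustration only.
homes = [
    {"id": 1, "price": 300, "bedrooms": 3},
    {"id": 2, "price": 310, "bedrooms": 4},
    {"id": 3, "price": 500, "bedrooms": 4},
]
best = top_k(homes, {"price": 305, "bedrooms": 4}, k=2)
```

No row matches the target exactly, yet the query still returns the two nearest rows in ranked order.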
In this paper, we introduce self-tuning histograms. Although similar in structure to traditional histograms, these histograms infer data distributions not by examining the data or a sample thereof, but by using feedback from the query execution engine about the actual selectivity of range selection operators to progressively refine the histogram. Since the …
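The feedback loop described above can be sketched as follows. This is an illustrative one-dimensional toy, assuming equi-width buckets, a uniform starting distribution, and a damped proportional update. The `FeedbackHistogram` class, its `damping` parameter, and the update rule are assumptions for this example and may differ from the paper's actual refinement procedure.

```python
class FeedbackHistogram:
    """Equi-width histogram refined from observed range-query results."""

    def __init__(self, lo, hi, n_buckets, total_rows):
        self.lo = lo
        self.width = (hi - lo) / n_buckets
        # Start from a uniform guess; the underlying data is never scanned.
        self.freq = [total_rows / n_buckets] * n_buckets

    def _overlap(self, i, a, b):
        """Fraction of bucket i covered by the range [a, b)."""
        left = self.lo + i * self.width
        right = left + self.width
        return max(0.0, min(b, right) - max(a, left)) / self.width

    def estimate(self, a, b):
        """Estimated number of rows with a value in [a, b)."""
        return sum(f * self._overlap(i, a, b) for i, f in enumerate(self.freq))

    def refine(self, a, b, actual, damping=0.5):
        """Move overlapping buckets toward the observed row count."""
        est = self.estimate(a, b)
        if est <= 0:
            return  # nothing to apportion the error against
        error = damping * (actual - est)
        for i, f in enumerate(self.freq):
            share = f * self._overlap(i, a, b)
            if share > 0:
                # Each bucket absorbs error in proportion to its contribution.
                self.freq[i] = max(0.0, f + error * share / est)


hist = FeedbackHistogram(lo=0, hi=100, n_buckets=4, total_rows=1000)
before = hist.estimate(0, 25)
hist.refine(0, 25, actual=450)  # execution feedback: 450 rows actually matched
after = hist.estimate(0, 25)
```

Each call to `refine` closes part of the gap between the estimate and the observed count, so repeated feedback on the same range moves the estimate progressively closer to the actual selectivity.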
Incorporating the skyline operator inside the relational engine requires solving the cardinality estimation and the cost estimation problem, hitherto unaddressed. We propose robust techniques to estimate the cardinality and the computational cost of Skyline, and through an empirical comparison, show that our technique is substantially more effective than …