Learn More
Data analysis applications typically aggregate data across manydimensions looking for anomalies or unusual patterns. The SQL aggregatefunctions and the GROUP BY operator produce zero-dimensional orone-dimensional aggregates. Applications need the N-dimensionalgeneralization of these operators. This paper defines that operator, calledthe data cube or simply(More)
Data cleaning based on similarities involves identification of "close" tuples, where closeness is evaluated using a variety of similarity functions chosen to suit the domain and application. Current approaches for efficiently implementing such similarity joins are tightly tied to the chosen similarity function. In this paper, we propose a new primitive(More)
Internet search engines have popularized the keyword-based search paradigm. While traditional database management systems offer powerful query languages, they do not allow keyword-based search. In this paper, we discuss DBXplorer, a system that enables keyword-based search in relational databases. DBXplorer has been implemented using a commercial relational(More)
Automatically selecting an appropriate set of materialized views and indexes for SQL databases is a non-trivial task. A judicious choice must be cost-driven and influenced by the workload experienced by the system. Although there has been work in materialized view selection in the context of multidimensional (OLAP) databases, no past work has looked at the(More)
To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in <i>reference tables</i>. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product(More)
Attributes of a relation are not typically independent. Multidimensional histograms can be an effective tool for accurate multiattribute query selectivity estimation. In this paper, we introduce <i>STHoles</i>, a &#8220;workload-aware&#8221; histogram that allows bucket nesting to capture data regions with reasonably uniform tuple density. <i>STHoles</i>(More)
In many applications, users specify target values for certain attributes, without requiring exact matches to these values in return. Instead, the result to such queries is typically a rank of the "top <i>k</i>" tuples that best match the given attribute values. In this paper, we study the advantages and limitations of processing a top-<i>k</i> query by(More)
In this paper, we introduce self-tuning histograms. Although similar in structure to traditional histograms, these histograms infer data distributions not by examining the data or a sample thereof, but by using feedback from the query execution engine about the actual selectivity of range selection operators to progressively refine the histogram. Since the(More)
In decision support applications, the ability to provide fast approximate answers to aggregation queries is desirable. One commonly-used technique for approximate query answering is sampling. For many aggregation queries, appropriately constructed biased (non-uniform) samples can provide more accurate approximations than a uniform sample. The optimal type(More)