Ad hoc querying is difficult on very large datasets, since it is usually not possible to have the entire dataset on disk. While compression can be used to decrease the size of the dataset, compressed data is notoriously difficult to index or access. In this paper we consider a very large dataset comprising multiple distinct time sequences. Each point in the…
Inherent in the operation of many decision support and continuous referral systems is the notion of the “influence” of a data point on the database. This notion arises in examples such as finding the set of customers affected by the opening of a new store outlet location, notifying the subset of subscribers to a digital library who will find a…
Aggregation along hierarchies is a critical summary technique in a large variety of on-line applications, including decision support and network management (e.g., IP clustering, denial-of-service attack monitoring). Despite the amount of recent study that has been dedicated to online aggregation on sets (e.g., quantiles, hot items), surprisingly little…
We examine the problem of finding similar tumor shapes. Starting from a natural similarity function (the so-called 'max morphological distance'), we show how to lower-bound it and how to search for nearest neighbors in large collections of tumor-like shapes. Specifically, we use state-of-the-art concepts from morphology, namely the 'pattern spectrum' of a shape,…
Association Rule Mining algorithms operate on a data matrix (e.g., customers × products) to derive association rules [2, 23]. We propose a new paradigm, namely, Ratio Rules, which are quantifiable in that we can measure the "goodness" of a set of discovered rules. We propose to use the "guessing error" as a measure of the "goodness", that is, the…
Skewed distributions appear very often in practice. Unfortunately, the traditional Zipf distribution often fails to model them well. In this paper, we propose a new probability distribution, the Discrete Gaussian Exponential (DGX), to achieve excellent fits in a wide variety of settings; our new distribution includes the Zipf distribution as a special case.
In a variety of applications ranging from optimizing queries on alphanumeric attributes to providing approximate counts of documents containing several query terms, there is an increasing need to quickly and reliably estimate the number of strings (tuples, documents, etc.) matching a Boolean query. Boolean queries in this context consist of substring…
Association Rule Mining algorithms operate on a data matrix (e.g., customers × products) to derive rules [2, 22]. We propose a single-pass algorithm for mining linear rules in such a matrix based on Principal Component Analysis. PCA detects correlated columns of the matrix, which correspond to, e.g., products that sell together. The first contribution of this…
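The idea of running PCA over a customers-by-products matrix in one pass can be illustrated with a minimal sketch. This is not the paper's algorithm, whose details are cut off above; it assumes one plausible approach, namely accumulating column sums and outer products row by row and then eigen-decomposing the resulting covariance matrix. The function name `single_pass_pca`, its parameters, and the toy spending data are hypothetical.

```python
import numpy as np

def single_pass_pca(rows, n_cols, k=1):
    """Sketch: PCA from one scan over the rows (an assumed approach,
    not necessarily the method described in the abstract)."""
    n = 0
    col_sum = np.zeros(n_cols)
    outer_sum = np.zeros((n_cols, n_cols))
    for row in rows:                      # each row is read exactly once
        x = np.asarray(row, dtype=float)
        n += 1
        col_sum += x                      # running column sums
        outer_sum += np.outer(x, x)       # running sum of x x^T
    mean = col_sum / n
    # Sample covariance recovered from the accumulated sums (needs n >= 2).
    cov = (outer_sum - n * np.outer(mean, mean)) / (n - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)    # symmetric eigen-decomposition
    order = np.argsort(eigvals)[::-1][:k]     # indices of the top-k eigenvalues
    return mean, eigvals[order], eigvecs[:, order]

# Hypothetical toy data: spending on two products in a steady 2:1 ratio.
data = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0], [8.0, 4.0]]
mean, variances, components = single_pass_pca(data, n_cols=2)
direction = components[:, 0]
ratio = direction[0] / direction[1]       # sign of the eigenvector cancels out
```

On this toy input the leading component points along the direction in which the two columns vary together, so the ratio of its entries reflects the ratio in which the two products are bought. Accumulating sums this way keeps memory proportional to the square of the number of columns rather than the number of rows, though it can lose precision when column means are large relative to their variances.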
Data items archived in data warehouses or those that arrive online as streams typically have attributes which take values from multiple hierarchies (e.g., time and geographic location; source and destination IP addresses). Providing an aggregate view of such data is important to summarize, visualize, and analyze. We develop the aggregate view based on…
Many algorithms have been proposed to approximate holistic aggregates, such as quantiles and heavy hitters, over data streams. However, little work has been done to explore what techniques are required to incorporate these algorithms in a data stream query processor, and to make them useful in practice. In this paper, we study the performance implications of…