# Beyond Independence: Probabilistic Models for Query Approximation on Binary Transaction Data

```bibtex
@article{Pavlov2003BeyondIP,
  title   = {Beyond Independence: Probabilistic Models for Query Approximation on Binary Transaction Data},
  author  = {Dmitry Pavlov and Heikki Mannila and Padhraic Smyth},
  journal = {IEEE Trans. Knowl. Data Eng.},
  year    = {2003},
  volume  = {15},
  pages   = {1409-1421}
}
```

We investigate the problem of generating fast approximate answers to queries posed to large sparse binary data sets. We focus in particular on probabilistic model-based approaches to this problem and develop a number of techniques that are significantly more accurate than a baseline independence model. In particular, we introduce two techniques for building probabilistic models from frequent itemsets: the itemset maximum entropy model and the itemset inclusion-exclusion model. In the maximum…
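The baseline independence model that the paper improves upon can be sketched in a few lines: estimate the selectivity of a conjunctive query on 0/1 transaction data as the product of the per-item margins. The data and item indices below are hypothetical, made up for illustration; this is not the authors' code.

```python
# Sketch of the baseline independence model for query approximation
# on sparse binary transaction data (illustrative, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical transaction matrix: rows = transactions, columns = items,
# each item present independently with probability 0.2.
data = (rng.random((10_000, 5)) < 0.2).astype(np.int8)

def independence_estimate(data, items):
    """Estimate P(all given items = 1) as the product of per-item margins."""
    margins = data[:, items].mean(axis=0)   # per-item frequencies
    return float(np.prod(margins))          # independence assumption

def true_selectivity(data, items):
    """Exact selectivity, computed by a full scan (what we want to avoid)."""
    return float(data[:, items].all(axis=1).mean())

items = [0, 1, 2]
print(independence_estimate(data, items), true_selectivity(data, items))
```

On this toy data the items really are independent, so the estimate is close to the truth; the paper's point is that on real transaction data the items are correlated, which is where the itemset-based models pay off.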

## 101 Citations

### Probabilistic query models for transaction data

- Computer Science
- KDD '01
- 2001

It is shown that frequent itemsets are useful for reducing the original data to a compressed representation; a method is introduced to store them using an ADTree data structure, and several new schemes for query answering based on the compressed representation are proposed that avoid direct scans of the data at query time.

### An approximate MRF model for querying large sparse binary data

- Computer Science

A probabilistic estimation procedure is developed to estimate the selectivity of ad hoc queries from the model summary, and the viability of the approach is demonstrated against extant strategies in terms of overall accuracy and, especially, efficiency.

### Approximate Query Answering by Model Averaging

- Computer Science
- SDM
- 2003

It is demonstrated, on real-world and simulated data sets, that model averaging can reduce the prediction error of any single model by factors of up to 50%, providing a practical framework for approximate query answering with massive data sets.

### Safe projections of binary data sets

- Computer Science
- Acta Informatica
- 2006

A heuristic algorithm is given for finding almost-safe sets under a size restriction; it is shown empirically that these sets outperform the trivial projection, and a connection between safe sets and Markov random fields is established.

### Advances in Mining Binary Data: Itemsets as Summaries

- Computer Science
- 2008

This thesis shows how to use itemsets for answering queries, that is, finding the number of transactions satisfying a given formula, and proposes normalised correlation dimension, a new variant of correlation dimension, a known concept that works well with real-valued data.

### Summarizing itemset patterns using probabilistic models

- Computer Science
- KDD '06
- 2006

A novel probabilistic approach to summarizing frequent itemset patterns is proposed that can effectively summarize a large number of itemsets and typically outperforms extant approaches significantly.

### Exploiting non-redundant local patterns and probabilistic models for analyzing structured and semi-structured data

- Computer Science
- 2008

This work proposes a probabilistic framework under which the selectivity of a twig query can be estimated from the information of its subtrees and investigates learning approximate global MRFs on large transactional data and proposes a divide-and-conquer style modeling approach.

### Model-based probabilistic frequent itemset mining

- Computer Science
- Knowledge and Information Systems
- 2012

This paper proposes novel methods to capture the itemset mining process as a probability distribution function, taking two models into account, the Poisson distribution and the normal distribution, and gives an intuition as to which model-based approach fits best for different types of data sets.

### Graphical Models for Uncertain Data

- Computer Science
- 2008

A unified framework based on the concepts from graphical models is presented that can model not only tuple-level and attribute-level uncertainties, but can also handle arbitrary correlations that may be present among the data; this framework can also naturally capture shared correlations where the same uncertainties and correlations occur repeatedly in the data.

## References

Showing 1–10 of 69 references

### Probabilistic query models for transaction data

- Computer Science
- KDD '01
- 2001

It is shown that frequent itemsets are useful for reducing the original data to a compressed representation; a method is introduced to store them using an ADTree data structure, and several new schemes for query answering based on the compressed representation are proposed that avoid direct scans of the data at query time.

### Probabilistic Models for Query Approximation with Large Sparse Binary Data Sets

- Computer Science
- UAI
- 2000

A Markov random field (MRF) approach based on frequent sets and maximum entropy is studied, and it is found that the MRF model provides substantially more accurate probability estimates than the other methods but is more expensive from a computational and memory viewpoint.

### Approximate Query Answering by Model Averaging

- Computer Science
- SDM
- 2003

It is demonstrated, on real-world and simulated data sets, that model averaging can reduce the prediction error of any single model by factors of up to 50%, providing a practical framework for approximate query answering with massive data sets.

### Selectivity estimation using probabilistic models

- Computer Science
- SIGMOD '01
- 2001

The approach produces more accurate estimates than standard approaches to selectivity estimation, using comparable space and time for both single-table multi-attribute queries and a general class of select-join queries.

### Independence is good: dependency-based histogram synopses for high-dimensional data

- Computer Science
- SIGMOD '01
- 2001

An important aspect of the general, model-based methodology is that it can be used to enhance the performance of other synopsis techniques that are based on data-space partitioning by providing an effective tool to deal with the “dimensionality curse”.

### Summary structures for frequency queries on large transaction sets

- Computer Science
- Proceedings DCC 2000, Data Compression Conference
- 2000

A binary trie-based summary structure for representing transaction sets is proposed that can support frequency queries several orders of magnitude faster than raw transaction data; it also has better memory (compression) characteristics than related schemes.
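The idea of a trie summary can be illustrated with a toy sketch, not the paper's actual structure: each transaction is a 0/1 vector, shared prefixes are stored once, and each node keeps the count of transactions passing through it, so frequency queries become trie traversals instead of data scans. The transactions below are made up.

```python
# Toy binary-trie summary for transaction frequency queries
# (illustrative sketch, not the structure from the DCC 2000 paper).
class TrieNode:
    __slots__ = ("count", "children")
    def __init__(self):
        self.count = 0
        self.children = {}  # bit (0 or 1) -> TrieNode

def build_trie(transactions):
    """Insert each 0/1 transaction, counting traversals at every node."""
    root = TrieNode()
    for t in transactions:
        node = root
        node.count += 1
        for bit in t:
            node = node.children.setdefault(bit, TrieNode())
            node.count += 1
    return root

def frequency(root, query):
    """Count transactions matching a pattern; None is a wildcard bit."""
    nodes = [root]
    for q in query:
        nodes = [child
                 for n in nodes
                 for bit, child in n.children.items()
                 if q is None or bit == q]
    return sum(n.count for n in nodes)

txns = [(1, 0, 1), (1, 0, 1), (1, 1, 0), (0, 0, 1)]
root = build_trie(txns)
print(frequency(root, (1, None, 1)))  # transactions with item0=1 and item2=1 -> 2
```

Because counts are stored at every node, a query never touches the raw transactions; it only expands the trie paths compatible with the pattern.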

### Multiple Uses of Frequent Sets and Condensed Representations (Extended Abstract)

- Computer Science, Mathematics
- KDD
- 1996

This paper shows how frequent sets can be used as a condensed representation for answering various types of queries, defines a general notion of condensed representations, and shows that frequent sets, samples, and the data cube can be viewed as instantiations of this concept.
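A small worked sketch of the condensed-representation idea, using made-up itemset frequencies: once the frequencies of the relevant itemsets are stored, a boolean query such as P(A=1 or B=1) is answered by inclusion-exclusion without touching the data. This mirrors the itemset inclusion-exclusion model named in the abstract, but the numbers here are purely illustrative.

```python
# Answering a disjunctive query from stored itemset frequencies via
# inclusion-exclusion (hypothetical frequencies, for illustration only).
freq = {
    frozenset("A"):  0.30,   # fraction of transactions containing A
    frozenset("B"):  0.25,   # fraction containing B
    frozenset("AB"): 0.10,   # fraction containing both A and B
}

def p_union(x, y):
    """P(x=1 or y=1) = P(x) + P(y) - P(x and y)."""
    return freq[frozenset(x)] + freq[frozenset(y)] - freq[frozenset(x + y)]

print(p_union("A", "B"))  # 0.30 + 0.25 - 0.10 = 0.45
```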

### Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets

- Computer Science
- J. Artif. Intell. Res.
- 1998

A very sparse data structure, the ADtree, is provided to minimize memory use and it is empirically demonstrated that tractably-sized data structures can be produced for large real-world datasets by using a sparse tree structure that never allocates memory for counts of zero.

### Prediction with local patterns using cross-entropy

- Computer Science
- KDD '99
- 1999

It is shown that the cross-entropy approach can be used for query selectivity estimation on 0/1 data sets, and it is concluded that viewing local patterns as constraints on a high-order probability model is a useful and principled framework for prediction based on large sets of mined patterns.
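The "local patterns as constraints on a probability model" view can be sketched with iterative proportional fitting: start from the uniform (maximum-entropy) distribution over a small joint and repeatedly rescale it to match the given itemset frequencies. The three constraint values below are hypothetical; this is a minimal illustration of the technique, not the paper's implementation.

```python
# Fitting a maximum-entropy joint over 3 binary items to itemset-frequency
# constraints via iterative proportional fitting (illustrative sketch;
# constraint values are made up but mutually consistent).
import itertools
import numpy as np

states = list(itertools.product([0, 1], repeat=3))
p = np.full(len(states), 1 / len(states))   # start from uniform = max entropy

# Hypothetical mined frequencies: fr({0}) = 0.4, fr({1}) = 0.3, fr({0,1}) = 0.2
constraints = [((0,), 0.4), ((1,), 0.3), ((0, 1), 0.2)]

for _ in range(200):                        # IPF sweeps until convergence
    for items, target in constraints:
        mask = np.array([all(s[i] == 1 for i in items) for s in states])
        cur = p[mask].sum()
        p[mask] *= target / cur             # rescale matching states
        p[~mask] *= (1 - target) / (1 - cur)  # renormalise the rest

# The fitted model now answers queries the constraints never stated directly:
query = sum(pi for pi, s in zip(p, states) if s[0] == 1 and s[1] == 1 and s[2] == 1)
print(query)                                # estimated P(item0 = item1 = item2 = 1)
```

Each IPF step is the cross-entropy projection onto one constraint; cycling through consistent constraints converges to the maximum-entropy distribution satisfying all of them, which is exactly the role the itemset maximum entropy model plays in the main paper.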

### Wavelet-based histograms for selectivity estimation

- Computer Science
- SIGMOD '98
- 1998

This paper presents a technique based upon a multiresolution wavelet decomposition for building histograms on the underlying data distributions, with applications to databases, statistics, and simulation.