Beyond Independence: Probabilistic Models for Query Approximation on Binary Transaction Data

@article{Pavlov2003BeyondIP,
  title={Beyond Independence: Probabilistic Models for Query Approximation on Binary Transaction Data},
  author={Dmitry Pavlov and Heikki Mannila and Padhraic Smyth},
  journal={IEEE Trans. Knowl. Data Eng.},
  year={2003},
  volume={15},
  pages={1409-1421}
}
We investigate the problem of generating fast approximate answers to queries posed to large sparse binary data sets. We focus in particular on probabilistic model-based approaches to this problem and develop a number of techniques that are significantly more accurate than a baseline independence model. In particular, we introduce two techniques for building probabilistic models from frequent itemsets: the itemset maximum entropy model and the itemset inclusion-exclusion model. In the maximum… 
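
As an illustration of the setting (not code from the paper), the following minimal Python sketch contrasts the baseline independence model's estimate of a conjunctive query's selectivity with the true frequency on toy binary transaction data; the data, item indices, and query are made up for the example.

import numpy as np

rng = np.random.default_rng(0)
n_rows, n_items = 10_000, 20
data = (rng.random((n_rows, n_items)) < 0.1).astype(np.uint8)  # sparse 0/1 matrix

def independence_estimate(data, query_items):
    """P(all queried items are 1), assuming the items are independent."""
    marginals = data[:, query_items].mean(axis=0)
    return float(np.prod(marginals))

def true_selectivity(data, query_items):
    """Exact fraction of transactions containing every queried item."""
    return float(data[:, query_items].all(axis=1).mean())

query = [0, 3, 7]
print("independence:", independence_estimate(data, query))
print("true        :", true_selectivity(data, query))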

Citations

Probabilistic query models for transaction data

TLDR
It is shown that frequent itemsets are useful for reducing the original data to a compressed representation; a method for storing them in an ADTree data structure is introduced, and several new schemes are proposed for query answering from the compressed representation that avoid direct scans of the data at query time.

An approximate MRF model for querying large sparse binary data

TLDR
A probabilistic estimation procedure is developed to estimate the selectivity of ad hoc queries from the model summary, and the viability of the approach is demonstrated against extant strategies in terms of overall accuracy and especially efficiency.

Approximate Query Answering by Model Averaging

TLDR
It is demonstrated on real-world and simulated data sets that model averaging can reduce the prediction error of any single model by factors of up to 50%, providing a practical framework for approximate query answering with massive data sets.
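
A minimal sketch of the model-averaging idea (the weights and estimates below are placeholders, not values from the paper): several models each produce a selectivity estimate, and the estimates are combined by a weighted average.

def average_estimate(estimates, weights=None):
    """estimates: list of per-model selectivity estimates in [0, 1]."""
    if weights is None:
        weights = [1.0 / len(estimates)] * len(estimates)
    return sum(w * e for w, e in zip(weights, estimates))

# e.g. combining an independence-model estimate with an itemset-based estimate
print(average_estimate([0.0012, 0.0019]))              # uniform weights
print(average_estimate([0.0012, 0.0019], [0.3, 0.7]))  # validation-tuned weights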

Safe projections of binary data sets

TLDR
A heuristic algorithm is given for finding almost-safe sets under a size restriction; it is shown empirically that these sets outperform the trivial projection, and a connection between safe sets and Markov random fields is established.

Advances in Mining Binary Data: Itemsets as Summaries

TLDR
This thesis shows how to use itemsets for answering queries, that is, finding the number of transactions satisfying a given formula, and proposes normalised correlation dimension, a new concept adapted from correlation dimension, a known notion that works well with real-valued data.

Summarizing itemset patterns using probabilistic models

TLDR
A novel probabilistic approach to summarizing frequent itemset patterns is proposed; it can effectively summarize a large number of itemsets and typically outperforms extant approaches by a significant margin.

Exploiting non-redundant local patterns and probabilistic models for analyzing structured and semi-structured data

TLDR
This work proposes a probabilistic framework under which the selectivity of a twig query can be estimated from information about its subtrees, investigates learning approximate global MRFs on large transactional data, and proposes a divide-and-conquer style modeling approach.

Model-based probabilistic frequent itemset mining

TLDR
This paper proposes novel methods to capture the itemset mining process as a probability distribution function, taking two models into account, the Poisson distribution and the normal distribution, and gives an intuition about which model-based approach fits best for different types of data sets.
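
A hedged sketch of the Poisson variant of this idea (assuming independent existential probabilities across transactions; the numbers are illustrative): approximate the support distribution of an itemset in an uncertain database by a Poisson model and compute the probability that the itemset is frequent, i.e. P(support >= minsup).

import math

def frequentness_probability_poisson(per_transaction_probs, minsup):
    """per_transaction_probs: P(itemset appears) in each transaction."""
    lam = sum(per_transaction_probs)   # Poisson rate = expected support
    # P(support >= minsup) = 1 - P(support <= minsup - 1) under the Poisson model
    cdf = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(minsup))
    return 1.0 - cdf

probs = [0.6, 0.1, 0.8, 0.4, 0.05, 0.7, 0.3]   # toy existential probabilities
print(frequentness_probability_poisson(probs, minsup=3))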

GRAPHICAL MODELS FOR UNCERTAIN DATA

TLDR
A unified framework based on concepts from graphical models is presented that can model not only tuple-level and attribute-level uncertainties, but can also handle arbitrary correlations that may be present among the data; this framework can also naturally capture shared correlations where the same uncertainties and correlations occur repeatedly in the data.

...

References

Showing 1-10 of 69 references

Probabilistic query models for transaction data

TLDR
It is shown that frequent itemsets are useful for reducing the original data to a compressed representation; a method for storing them in an ADTree data structure is introduced, and several new schemes are proposed for query answering from the compressed representation that avoid direct scans of the data at query time.

Probabilistic Models for Query Approximation with Large Sparse Binary Data Sets

TLDR
A Markov random field (MRF) approach based on frequent sets and maximum entropy is studied, and it is found that the MRF model provides substantially more accurate probability estimates than the other methods but is more expensive from a computational and memory viewpoint.
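
A minimal sketch of the maximum-entropy / frequent-set idea (not the paper's code; the itemsets and frequencies below are invented): build a distribution over only the variables touched by a query via iterative proportional fitting against itemset-frequency constraints, then read off the query probability.

import itertools
import numpy as np

def maxent_query_estimate(n_vars, constraints, n_iters=200):
    """constraints: list of (itemset, frequency), where itemset is a tuple of
    variable indices and frequency = P(all those variables equal 1).
    Returns the estimate of P(all n_vars equal 1)."""
    states = np.array(list(itertools.product([0, 1], repeat=n_vars)))
    p = np.full(len(states), 1.0 / len(states))          # uniform = maximum entropy start
    for _ in range(n_iters):
        for itemset, freq in constraints:
            mask = states[:, list(itemset)].all(axis=1)  # states satisfying the itemset
            current = p[mask].sum()
            if 0 < current < 1:
                p[mask] *= freq / current                # scale to match the constraint
                p[~mask] *= (1 - freq) / (1 - current)
    return float(p[states.all(axis=1)].sum())

# single-item marginals plus one pairwise itemset frequency as constraints
constraints = [((0,), 0.30), ((1,), 0.25), ((2,), 0.20), ((0, 1), 0.12)]
print(maxent_query_estimate(3, constraints))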

Approximate Query Answering by Model Averaging

TLDR
It is demonstrated on real-world and simulated data sets that model averaging can reduce the prediction error of any single model by factors of up to 50%, providing a practical framework for approximate query answering with massive data sets.

Selectivity estimation using probabilistic models

TLDR
The approach produces more accurate estimates than standard approaches to selectivity estimation, using comparable space and time for both single-table multi-attribute queries and a general class of select-join queries.

Independence is good: dependency-based histogram synopses for high-dimensional data

TLDR
An important aspect of the general, model-based methodology is that it can be used to enhance the performance of other synopsis techniques that are based on data-space partitioning by providing an effective tool to deal with the “dimensionality curse”.

Summary structures for frequency queries on large transaction sets

TLDR
A binary trie-based summary structure for representing transaction sets is proposed that can support frequency queries several orders of magnitude faster than raw transaction data and has better memory (compression) characteristics than related schemes.

Multiple Uses of Frequent Sets and Condensed Representations (Extended Abstract)

TLDR
This paper shows how frequent sets can be used as a condensed representation for answering various types of queries, defines a general notion of condensed representations, and shows that frequent sets, samples, and the data cube can be viewed as instantiations of this concept.

Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets

TLDR
A very sparse data structure, the ADtree, is provided to minimize memory use and it is empirically demonstrated that tractably-sized data structures can be produced for large real-world datasets by using a sparse tree structure that never allocates memory for counts of zero.
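
A much-simplified stand-in for the ADtree idea (a real ADtree also uses most-common-value pruning, which this sketch omits; the class name and data are invented): cache conjunction counts as they are requested and never materialise a count of zero.

class CountCache:
    def __init__(self, rows):
        self.rows = rows            # list of sets of item ids (transactions)
        self.cache = {}             # frozenset(itemset) -> non-zero count

    def count(self, itemset):
        key = frozenset(itemset)
        if key in self.cache:
            return self.cache[key]
        c = sum(1 for row in self.rows if key <= row)
        if c > 0:                   # zero counts are never stored
            self.cache[key] = c
        return c

db = CountCache([{1, 2, 5}, {1, 2}, {2, 5}, {1, 5}, {9}])
print(db.count({1, 2}), db.count({2, 5}), db.count({3, 4}))  # 2 2 0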

Prediction with local patterns using cross-entropy

TLDR
It is shown that the cross-entropy approach can be used for query selectivity estimation on 0/1 data sets, and it is concluded that viewing local patterns as constraints on a high-order probability model is a useful and principled framework for prediction based on large sets of mined patterns.

Wavelet-based histograms for selectivity estimation

TLDR
This paper presents a technique based upon a multiresolution wavelet decomposition for building histograms on the underlying data distributions, with applications to databases, statistics, and simulation.
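
An illustrative sketch of the general technique (not the paper's exact construction; the histogram values and threshold are made up): take an unnormalised Haar wavelet decomposition of a 1-D frequency histogram and keep only the large coefficients as a compact synopsis for selectivity estimation.

import numpy as np

def haar_decompose(hist):
    """hist length must be a power of two; returns the Haar coefficient vector."""
    coeffs = []
    a = np.asarray(hist, dtype=float)
    while len(a) > 1:
        averages = (a[0::2] + a[1::2]) / 2.0
        details = (a[0::2] - a[1::2]) / 2.0
        coeffs.append(details)
        a = averages
    coeffs.append(a)                        # overall average
    return np.concatenate(coeffs[::-1])

hist = [2, 2, 0, 2, 3, 5, 4, 4]
c = haar_decompose(hist)
kept = np.where(np.abs(c) >= 1.0, c, 0.0)   # drop small coefficients to compress
print(c)
print(kept)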
...