# Assessing data mining results via swap randomization

@article{Gionis2007AssessingDM, title={Assessing data mining results via swap randomization}, author={A. Gionis and Heikki Mannila and Taneli Mielik{\"a}inen and Panayiotis Tsaparas}, journal={ACM Trans. Knowl. Discov. Data}, year={2007}, volume={1}, pages={14} }

The problem of assessing the significance of data mining results on high-dimensional 0--1 datasets has been studied extensively in the literature. For problems such as mining frequent sets and finding correlations, significance testing can be done by standard statistical tests such as chi-square, or other methods. However, the results of such tests depend only on the specific attributes and not on the dataset as a whole. Moreover, the tests are difficult to apply to sets of patterns or other…

## 258 Citations

Assessing Data Mining Results on Matrices with Randomization

- Mathematics, Computer Science
- 2010 IEEE International Conference on Data Mining
- 2010

This paper proposes a new approach for randomizing matrices containing features measured in different scales and provides an easily usable implementation that does not need problematic manual tuning as theoretically justified parameter values are given.

Randomization of real-valued matrices for assessing the significance of data mining results

- Computer Science
- SDM
- 2008

Three alternative algorithms based on local transformations and Metropolis sampling are described, and it is shown that they are efficient, usable in practice, and solve the defined problem.

Assessing Data Mining Results on Matrices with Randomization

- 2010

Randomization is a general technique for evaluating the significance of data analysis results. In randomization-based significance testing, a result is considered to be interesting if it is unlikely…

Assessing the Significance of Data Mining Results on Graphs with Feature Vectors

- Computer Science
- 2012 IEEE 12th International Conference on Data Mining
- 2012

This work proposes a novel null model that preserves correlation information between both sources and uses adaptive Metropolis sampling that interweaves attribute randomization and graph randomization steps.

Randomization methods for assessing data analysis results on real-valued matrices

- Computer Science
- 2009

Methods based on local transformations and Metropolis sampling are described, and it is shown that the methods are efficient and usable in practice in significance testing of data mining results on real-valued matrices.

Randomization methods for assessing data analysis results on real-valued matrices

- Computer Science
- Stat. Anal. Data Min.
- 2009

Methods based on local transformations and Metropolis sampling are described, and it is shown that they work efficiently and are usable in significance testing of data mining results on real-valued matrices.

Tell me something I don't know: randomization strategies for iterative data mining

- Computer Science, Mathematics
- KDD
- 2009

The problem of randomizing data so that previously discovered patterns or models are taken into account is considered, and the results indicate that in many cases the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.

A statistical significance testing approach to mining the most informative set of patterns

- Computer Science
- Data Mining and Knowledge Discovery
- 2012

The novel problem of finding the smallest set of patterns that explains most about the data in terms of a global p value is studied and it is found that a greedy algorithm gives good results on real data and that it can formulate and solve many known data-mining tasks.

Maximum Entropy Modelling for Assessing Results on Real-Valued Data

- Mathematics, Computer Science
- 2011 IEEE 11th International Conference on Data Mining
- 2011

This paper proposes an approach for assessing results on real-valued rectangular databases using the Maximum Entropy principle to fit a background model to the data while respecting its marginal distributions, and employs an MDL based histogram estimator to find these distributions.

The smallest set of constraints that explains the data: a randomization approach

- Mathematics
- 2010

Randomization methods can be used to assess statistical significance of data mining results. A randomization method typically consists of a sampler which draws data sets from a null distribution, and…

## References

SHOWING 1-10 OF 46 REFERENCES

Assessing data mining results via swap randomization

- Computer Science
- KDD '06
- 2006

The approach consists of producing random datasets that have the same row and column margins as the given dataset, computing the results of interest on the randomized instances, and comparing them against the results on the actual data.
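
The three steps in this summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`swap_randomize`, `empirical_p_value`, `cooccurrence`), the number of attempted swaps, the number of samples, and the choice of test statistic are all assumptions made for illustration. The only properties taken from the summary are that the randomized datasets keep the row and column margins of the original and that results on them are compared with the result on the real data.

```python
import random


def swap_randomize(matrix, num_attempts, rng):
    """Return a copy of a 0-1 matrix after attempted margin-preserving swaps."""
    m = [row[:] for row in matrix]
    ones = [(r, c) for r, row in enumerate(m) for c, v in enumerate(row) if v == 1]
    if len(ones) < 2:
        return m
    for _ in range(num_attempts):
        i, j = rng.randrange(len(ones)), rng.randrange(len(ones))
        (r1, c1), (r2, c2) = ones[i], ones[j]
        # A swap is valid only if the two 1s sit in different rows and columns
        # and the two opposite corners of that rectangle are currently 0.
        if r1 != r2 and c1 != c2 and m[r1][c2] == 0 and m[r2][c1] == 0:
            m[r1][c1] = m[r2][c2] = 0
            m[r1][c2] = m[r2][c1] = 1
            ones[i], ones[j] = (r1, c2), (r2, c1)
    return m


def cooccurrence(matrix, a=0, b=1):
    """Hypothetical result of interest: rows where columns a and b are both 1."""
    return sum(row[a] & row[b] for row in matrix)


def empirical_p_value(matrix, statistic, num_samples=200, num_attempts=None, seed=0):
    """Fraction of randomized datasets whose statistic is at least the observed one."""
    rng = random.Random(seed)
    if num_attempts is None:
        # Assumption: a small multiple of the number of 1s, chosen for illustration.
        num_attempts = 10 * sum(map(sum, matrix))
    observed = statistic(matrix)
    at_least = sum(
        statistic(swap_randomize(matrix, num_attempts, rng)) >= observed
        for _ in range(num_samples)
    )
    # Add-one correction so the estimate is never exactly zero.
    return (at_least + 1) / (num_samples + 1)


if __name__ == "__main__":
    data = [
        [1, 1, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 1],
        [1, 1, 1, 0],
        [0, 0, 0, 1],
    ]
    print("observed co-occurrence:", cooccurrence(data))
    print("empirical p-value:", empirical_p_value(data, cooccurrence))
```

Each randomized sample here starts again from the original matrix, which keeps the sketch simple; how many swaps are enough for the chain to mix is left open and would need to be checked for real data.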

Selecting the right interestingness measure for association patterns

- Computer Science, Mathematics
- KDD
- 2002

An overview of various measures proposed in the statistics, machine learning and data mining literature is presented, and it is shown that each measure has different properties which make it useful for some application domains, but not for others.

On Inverse Frequent Set Mining

- Computer Science
- 2003

This paper analyzes the computational complexity of the problem of finding a binary data set compatible with a given collection of frequent sets and shows that in many cases the problem is computationally very difficult.

Beyond market baskets: generalizing association rules to correlations

- Computer Science
- SIGMOD '97
- 1997

This work develops the notion of mining rules that identify correlations (generalizing associations), and proposes measuring significance of associations via the chi-squared test for correlation from classical statistics, enabling the mining problem to reduce to the search for a border between correlated and uncorrelated itemsets in the lattice.

Discovering significant patterns

- Mathematics, Computer Science
- Machine Learning
- 2008

This paper proposes techniques to overcome the extreme risk of type-1 error by applying well-established statistical practices, which allow the user to enforce a strict upper limit on the risk of experimentwise error.

Discovering Significant Patterns

- Computer Science
- Machine Learning
- 2007

This paper proposes techniques to overcome the extreme risk of type-1 error by applying well-established statistical practices, which allow the user to enforce a strict upper limit on the risk of experimentwise error.

Empirical bayes screening for multi-item associations

- Computer Science, Mathematics
- KDD '01
- 2001

This paper considers the framework of the so-called "market basket problem", in which a database of transactions is mined for the occurrence of unusually frequent item sets, and defines a 95% Bayesian lower confidence limit for the "interestingness" measure of every item set.

Pruning and summarizing the discovered associations

- Computer Science
- KDD '99
- 1999

The technique first prunes the discovered associations to remove the insignificant ones, and then finds a special subset of the unpruned associations, called the direction setting rules, to form a summary of the discovered association rules.

Discovering Predictive Association Rules

- Mathematics, Computer Science
- KDD
- 1998

Empirical evaluation shows that on typical datasets the fraction of rules that may be false discoveries is very small, and a novel approach is presented for estimating the number of "false discoveries" at any cutoff level.

Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs

- Computer Science, Mathematics
- KDD
- 2004

An upper bound of Pearson's correlation coefficient for binary variables is identified, and it is shown that the computational savings from pruning are independent of, or improve with, the number of items in data sets with common Zipf or linear rank-support distributions.