Assessing data mining results via swap randomization

@article{Gionis2007AssessingDM,
  title={Assessing data mining results via swap randomization},
  author={Aristides Gionis and Heikki Mannila and Taneli Mielik{\"a}inen and Panayiotis Tsaparas},
  journal={ACM Transactions on Knowledge Discovery from Data},
  year={2007},
  volume={1},
  pages={14}
}
The problem of assessing the significance of data mining results on high-dimensional 0--1 datasets has been studied extensively in the literature. For problems such as mining frequent sets and finding correlations, significance testing can be done by standard statistical tests such as chi-square, or other methods. However, the results of such tests depend only on the specific attributes and not on the dataset as a whole. Moreover, the tests are difficult to apply to sets of patterns or other… 
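The swap randomization of the title can be made concrete. Below is a minimal Python sketch (the function name and interface are mine, not from the paper): it randomizes a 0--1 matrix by repeatedly applying "checkerboard" swaps, each of which provably preserves every row and column sum.

```python
import random

def swap_randomize(matrix, num_swaps):
    """Randomize a 0-1 matrix while preserving row and column sums.

    Repeatedly picks two 1-cells (i1, j1) and (i2, j2); whenever the
    opposite corners (i1, j2) and (i2, j1) are both 0, it swaps the
    2x2 "checkerboard", which leaves every row and column margin
    unchanged. Illustrative sketch, not the paper's exact algorithm.
    """
    m = [row[:] for row in matrix]  # work on a copy, keep the input intact
    ones = [(i, j) for i, row in enumerate(m)
            for j, v in enumerate(row) if v == 1]
    for _ in range(num_swaps):
        (i1, j1), (i2, j2) = random.sample(ones, 2)
        if i1 != i2 and j1 != j2 and m[i1][j2] == 0 and m[i2][j1] == 0:
            m[i1][j1], m[i2][j2] = 0, 0
            m[i1][j2], m[i2][j1] = 1, 1
            ones.remove((i1, j1)); ones.remove((i2, j2))
            ones.append((i1, j2)); ones.append((i2, j1))
    return m
```

Not every attempted swap succeeds; rejected attempts are simply skipped, which is why the number of attempts is typically chosen much larger than the number of 1s in the matrix.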
Assessing Data Mining Results on Matrices with Randomization
Markus Ojala. 2010 IEEE International Conference on Data Mining, 2010.
This paper proposes a new approach for randomizing matrices containing features measured in different scales and provides an easily usable implementation that does not need problematic manual tuning, as theoretically justified parameter values are given.
Randomization of real-valued matrices for assessing the significance of data mining results
Three alternative algorithms based on local transformations and Metropolis sampling are described, and it is shown that they are efficient, usable in practice, and solve the defined problem.
Assessing Data Mining Results on Matrices with Randomization
Randomization is a general technique for evaluating the significance of data analysis results. In randomization-based significance testing, a result is considered to be interesting if it is unlikely…
Assessing the Significance of Data Mining Results on Graphs with Feature Vectors
This work proposes a novel null model that preserves correlation information between both sources, exploiting adaptive Metropolis sampling and interweaving attribute randomization and graph randomization steps.
Randomization methods for assessing data analysis results on real-valued matrices
Methods based on local transformations and Metropolis sampling are described, and it is shown that the methods are efficient and usable in practice for significance testing of data mining results on real-valued matrices.
Tell me something I don't know: randomization strategies for iterative data mining
The problem of randomizing data so that previously discovered patterns or models are taken into account is studied, and the results indicate that in many cases the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.
A statistical significance testing approach to mining the most informative set of patterns
The novel problem of finding the smallest set of patterns that explains most about the data in terms of a global p-value is studied; a greedy algorithm is found to give good results on real data, and many known data-mining tasks can be formulated and solved within the framework.
Maximum Entropy Modelling for Assessing Results on Real-Valued Data
This paper proposes an approach for assessing results on real-valued rectangular databases using the Maximum Entropy principle to fit a background model to the data while respecting its marginal distributions, employing an MDL-based histogram estimator to find these distributions.
The smallest set of constraints that explains the data: a randomization approach
Randomization methods can be used to assess the statistical significance of data mining results. A randomization method typically consists of a sampler which draws data sets from a null distribution, and…

References

Showing 1-10 of 46 references
Assessing data mining results via swap randomization
The approach consists of producing random datasets that have the same row and column margins with the given dataset, computing the results of interest on the randomized instances, and comparing them against the results on the actual data.
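The comparison step described above amounts to computing an empirical p-value: the fraction of randomized datasets on which the statistic is at least as extreme as on the real data. A hedged sketch (the names and the `randomize` callback are illustrative, not from the paper):

```python
import random

def empirical_p_value(statistic, data, randomize, num_samples=1000, seed=None):
    """One-sided empirical p-value of statistic(data) against a null
    distribution sampled by `randomize`.

    `randomize` is a hypothetical callable taking (data, rng) and
    returning one randomized copy of the data; it stands in for any
    margin-preserving sampler such as swap randomization.
    """
    rng = random.Random(seed)
    observed = statistic(data)
    at_least = sum(statistic(randomize(data, rng)) >= observed
                   for _ in range(num_samples))
    # Add-one correction avoids reporting p = 0 from a finite sample.
    return (at_least + 1) / (num_samples + 1)
```

A small result with p close to 0 is unlikely under the null model and is therefore deemed significant; values near 1 mean the randomized data reproduces the result just as well.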
Selecting the right interestingness measure for association patterns
An overview of various measures proposed in the statistics, machine learning, and data mining literature is presented, and it is shown that each measure has properties that make it useful for some application domains but not for others.
On Inverse Frequent Set Mining
This paper analyzes the computational complexity of the problem of finding a binary data set compatible with a given collection of frequent sets and shows that in many cases the problem is computationally very difficult.
Beyond market baskets: generalizing association rules to correlations
This work develops the notion of mining rules that identify correlations (generalizing associations), and proposes measuring significance of associations via the chi-squared test for correlation from classical statistics, enabling the mining problem to reduce to the search for a border between correlated and uncorrelated itemsets in the lattice.
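The chi-squared test for correlation mentioned above can be computed directly from the 2x2 contingency table of two binary attributes. A small sketch (the function name is mine; it assumes all marginal counts are nonzero):

```python
def chi_square_2x2(n11, n10, n01, n00):
    """Pearson chi-squared statistic for a 2x2 contingency table of two
    binary attributes, where nab counts rows with A=a and B=b.

    Large values suggest the attributes are correlated; compare against
    the chi-squared distribution with 1 degree of freedom (e.g. the
    critical value 3.84 at the 5% level).
    """
    n = n11 + n10 + n01 + n00
    row1, row0 = n11 + n10, n01 + n00  # margins of attribute A
    col1, col0 = n11 + n01, n10 + n00  # margins of attribute B
    chi2 = 0.0
    for observed, expected in (
        (n11, row1 * col1 / n),
        (n10, row1 * col0 / n),
        (n01, row0 * col1 / n),
        (n00, row0 * col0 / n),
    ):
        chi2 += (observed - expected) ** 2 / expected
    return chi2
```

As the abstract of the main paper notes, such a test depends only on the two attributes' counts, not on the structure of the dataset as a whole, which is exactly the gap randomization-based testing addresses.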
Discovering significant patterns
This paper proposes techniques to overcome the extreme risk of type-1 error by applying well-established statistical practices, which allow the user to enforce a strict upper limit on the risk of experimentwise error.
Empirical bayes screening for multi-item associations
This paper considers the framework of the so-called "market basket problem", in which a database of transactions is mined for the occurrence of unusually frequent item sets, and defines a 95% Bayesian lower confidence limit for the "interestingness" measure of every item set.
Pruning and summarizing the discovered associations
The technique first prunes the discovered associations to remove insignificant ones, and then finds a special subset of the unpruned associations to form a summary of the discovered association rules, called the direction-setting rules.
Discovering Predictive Association Rules
A novel approach is presented for estimating the number of "false discoveries" at any cutoff level, and empirical evaluation shows that on typical datasets the fraction of rules that may be false discoveries is very small.
Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs
An upper bound of Pearson's correlation coefficient for binary variables is identified, and it is shown that the computational savings from pruning remain stable or improve as the number of items increases in datasets with common Zipf or linear rank-support distributions.
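As a sketch of the kind of support-based bound described above: assuming the bound is obtained by setting the joint support to the smaller of the two item supports (the case where the rarer item always co-occurs with the more frequent one), the phi coefficient is bounded as follows. The function name is mine and this is an illustration, not necessarily the paper's exact formulation:

```python
import math

def phi_upper_bound(supp_a, supp_b):
    """Support-based upper bound on the phi (Pearson) correlation of two
    binary variables with supports supp_a and supp_b in (0, 1).

    Derived by maximizing phi = (supp_ab - supp_a*supp_b) / sqrt(...)
    at supp_ab = min(supp_a, supp_b); the bound depends only on the
    individual supports, so item pairs can be pruned without scanning
    the data for their joint support.
    """
    lo, hi = min(supp_a, supp_b), max(supp_a, supp_b)
    return math.sqrt((lo * (1 - hi)) / (hi * (1 - lo)))
```

Because the bound needs only the two marginal supports, any pair whose bound falls below the correlation threshold can be discarded before its co-occurrence count is ever computed.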