A Rademacher Complexity Based Method for Controlling Power and Confidence Level in Adaptive Statistical Analysis

@article{Stefani2019ARC,
  title={A Rademacher Complexity Based Method for Controlling Power and Confidence Level in Adaptive Statistical Analysis},
  author={Lorenzo De Stefani and Eli Upfal},
  journal={2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
  year={2019},
  pages={71-80}
}
While standard statistical inference techniques and machine learning generalization bounds assume that tests are run on data selected independently of the hypotheses, practical data analysis and machine learning are usually iterative and adaptive processes, in which the same holdout data is often used to test a sequence of hypotheses (or models), each of which may depend on the outcome of previous tests on the same data. In this work, we present RADABOUND, a rigorous, efficient and practical…
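To make the quantity at stake concrete: methods in this line of work control the error of adaptively chosen queries through the empirical Rademacher complexity of the set of queries evaluated on the holdout. The following is a minimal illustrative sketch, not the authors' RADABOUND implementation; the function name and interface are my own, and it assumes each query has already been evaluated on the m holdout points.

  import numpy as np

  def empirical_rademacher(query_values, n_trials=1000, rng=None):
      # query_values: (k, m) array; row j is query j evaluated on the
      # m holdout points.  Monte Carlo estimate of
      #   E_sigma[ max_j (1/m) * sum_i sigma_i * q_j(x_i) ],
      # the empirical Rademacher average of the query set.
      rng = np.random.default_rng() if rng is None else rng
      m = query_values.shape[1]
      total = 0.0
      for _ in range(n_trials):
          sigma = rng.choice([-1.0, 1.0], size=m)  # random signs
          total += np.max(query_values @ sigma) / m
      return total / n_trials

As more adaptively chosen queries are added to query_values, this quantity grows, and with it the width of any uniform confidence interval over the queries; tracking it is what lets a holdout guard decide when its answers are still trustworthy.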

Citations

MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining

MCRapper uses upper bounds to the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself, while TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power than existing methods offering the same guarantees; both outperform the state of the art for their respective tasks.

Bavarian: Betweenness Centrality Approximation with Variance-Aware Rademacher Averages

Bavarian, a collection of sampling-based algorithms for approximating the Betweenness Centrality of all vertices in a graph, is presented, and it is proved that, for all estimators, the sample size sufficient to achieve a desired approximation guarantee depends on the vertex-diameter of the graph, an easy-to-bound characteristic quantity.
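The classic path-sampling estimator that this family of algorithms builds on is easy to sketch: repeatedly draw a node pair, pick one shortest path between them uniformly at random, and credit the path's interior vertices. Below is a minimal sketch of that basic scheme, assuming networkx; it is not Bavarian's variance-aware estimator, and the function name is mine.

  import random
  import networkx as nx

  def sample_bc(G, r, seed=0):
      # Estimate normalized betweenness centrality from r sampled
      # node pairs; each interior vertex of a uniformly chosen
      # shortest path receives credit 1/r.
      rng = random.Random(seed)
      nodes = list(G.nodes())
      bc = {v: 0.0 for v in nodes}
      for _ in range(r):
          u, v = rng.sample(nodes, 2)
          try:
              paths = list(nx.all_shortest_paths(G, u, v))
          except nx.NetworkXNoPath:
              continue  # disconnected pair contributes nothing
          for w in rng.choice(paths)[1:-1]:
              bc[w] += 1.0 / r
      return bc

The Rademacher-average machinery enters when deciding how large r must be for all estimates to be simultaneously accurate.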

SILVAN: Estimating Betweenness Centralities with Progressive Sampling and Non-uniform Rademacher Bounds

SILVAN relies on a novel estimation scheme providing non-uniform bounds on the deviation of the estimates of the betweenness centrality of all the nodes from their true values, and a refined characterisation of the number of samples required to obtain a high-quality approximation.

Sharper convergence bounds of Monte Carlo Rademacher Averages through Self-Bounding functions

This work derives sharper probabilistic concentration bounds for the Monte Carlo Empirical Rademacher Averages (MCERA), which are proved through recent results on the concentration of self-bounding functions, and derives novel variance-aware bounds to the supremum deviations.
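For reference, the Monte Carlo Empirical Rademacher Average that these bounds concern is, in the standard formulation from this line of work (n random sign vectors sigma_j over an m-point sample):

  \hat{R}^n_m(\mathcal{F}, x) \;=\; \frac{1}{n} \sum_{j=1}^{n} \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_{j,i}\, f(x_i)

This is the quantity the code sketch after the abstract above estimates with many independent trials; the contribution here is sharper concentration of the estimate around its expectation.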

References


Preserving Statistical Validity in Adaptive Data Analysis

It is shown that, surprisingly, there is a way to accurately estimate a number of expectations exponential in n even when the functions are chosen adaptively; this gives an exponential improvement over standard empirical estimators, which are limited to a linear number of estimates.

Generalization in Adaptive Data Analysis and Holdout Reuse

A simple and practical method is presented for reusing a holdout set to validate the accuracy of hypotheses produced by a learning algorithm operating on a training set, and it is shown that a simple approach based on description length can also be used to give guarantees of statistical validity in adaptive settings.
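The holdout-reuse mechanism described in this line of work, Thresholdout, is short enough to sketch. This is a simplified illustration, with the noise scales and budget tracking of the full mechanism omitted; the function name and default parameters are mine.

  import numpy as np

  def thresholdout(train_vals, holdout_vals, threshold=0.04,
                   sigma=0.01, rng=None):
      # Answer a statistical query (given as its evaluations on the
      # training and holdout sets) with the training mean, unless it
      # disagrees with the holdout mean by more than a noisy
      # threshold -- in which case return a noisy holdout answer.
      rng = np.random.default_rng() if rng is None else rng
      a, b = float(np.mean(train_vals)), float(np.mean(holdout_vals))
      if abs(a - b) > threshold + rng.laplace(scale=sigma):
          return b + rng.laplace(scale=sigma)  # noisy holdout answer
      return a  # training estimate is consistent with the holdout

Because the holdout influences the output only through noisy threshold comparisons, it can be reused across many adaptively chosen queries without being overfit.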

Controlling the false discovery rate: a practical and powerful approach to multiple testing

The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented.

The reusable holdout: Preserving validity in adaptive data analysis

A new approach to addressing the challenges of adaptivity, based on insights from privacy-preserving data analysis, is demonstrated, showing how to safely reuse a holdout data set many times to validate the results of adaptively chosen analyses.

Preventing False Discovery in Interactive Data Analysis Is Hard

We show that, under a standard hardness assumption, there is no computationally efficient algorithm that, given n samples from an unknown distribution, can give valid answers to n^{3+o(1)} adaptively chosen statistical queries.

Interactive fingerprinting codes and the hardness of preventing false discovery

It is shown that, under a standard hardness assumption, there is no computationally efficient algorithm that, given n samples from an unknown distribution, can give valid answers to O(n^2) adaptively chosen statistical queries.

Rademacher penalties and structural risk minimization

This work suggests a penalty function to be used in various problems of structural risk minimization, based on the sup-norm of the so-called Rademacher process indexed by the underlying class of functions (sets), and obtains oracle inequalities for the theoretical risk of estimators obtained by structural minimization of the empirical risk with Rademacher penalties.
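Schematically, and suppressing constants and lower-order confidence terms, the penalty and the resulting model-selection rule look as follows (my paraphrase of the construction, not the paper's exact statement):

  \hat{R}_n(\mathcal{F}) \;=\; \mathbb{E}_{\sigma}\!\left[\, \sup_{f \in \mathcal{F}} \frac{1}{n} \left| \sum_{i=1}^{n} \sigma_i\, f(X_i) \right| \,\right],
  \qquad
  \hat{k} \;=\; \arg\min_{k} \left\{ \widehat{\mathrm{err}}_n(\hat{f}_k) \,+\, C\, \hat{R}_n(\mathcal{F}_k) \right\}

where \hat{f}_k is the empirical risk minimizer in the k-th class \mathcal{F}_k and the \sigma_i are independent Rademacher signs. The appeal is that the penalty is data-dependent and estimable from the sample itself.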

Rademacher and Gaussian Complexities: Risk Bounds and Structural Results

This work investigates the use of certain data-dependent estimates of the complexity of a function class, called Rademacher and Gaussian complexities, and proves general risk bounds in terms of these complexities in a decision-theoretic setting.
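One standard form of such a risk bound, for a class \mathcal{F} of [0,1]-valued functions and n i.i.d. samples: with probability at least 1 - \delta, simultaneously for all f \in \mathcal{F},

  \mathbb{E}[f(X)] \;\le\; \frac{1}{n} \sum_{i=1}^{n} f(X_i) \,+\, 2\,\hat{R}_n(\mathcal{F}) \,+\, 3\sqrt{\frac{\ln(2/\delta)}{2n}}

Constants vary across statements; this is a commonly cited form, given here for orientation rather than as this paper's exact theorem.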

Sequential selection procedures and false discovery rate control

This work proposes two new testing procedures, proves that they control the false discovery rate in the ordered testing setting, and shows how the methods can be applied to model selection using recent results on p-values in sequential model selection settings.
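One of the two procedures proposed here, the ForwardStop rule, is simple enough to state fully: given p-values p_1, ..., p_m for hypotheses in a fixed order, reject H_1, ..., H_k for the largest k at which the running average of -log(1 - p_i) stays at or below the target level alpha. A minimal sketch (function name mine):

  import numpy as np

  def forward_stop(pvals, alpha):
      # ForwardStop for ordered hypotheses: reject the first k for the
      # largest k with (1/k) * sum_{i<=k} -log(1 - p_i) <= alpha.
      # Returns the number of rejections (0 if none qualify).
      z = -np.log1p(-np.asarray(pvals, dtype=float))  # -log(1 - p_i)
      avg = np.cumsum(z) / np.arange(1, len(z) + 1)
      ks = np.nonzero(avg <= alpha)[0]
      return int(ks[-1] + 1) if ks.size else 0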