• Corpus ID: 2540278

Generalization for Adaptively-chosen Estimators via Stable Median

Vitaly Feldman, Thomas Steinke
Datasets are often reused to perform multiple statistical analyses in an adaptive way, in which each analysis may depend on the outcomes of previous analyses on the same dataset. Standard statistical guarantees do not account for these dependencies and little is known about how to provably avoid overfitting and false discovery in the adaptive setting. We consider a natural formalization of this problem in which the goal is to design an algorithm that, given a limited number of i.i.d. samples…
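The overfitting risk described in the abstract can be illustrated with a toy simulation. This is a minimal sketch, not the paper's algorithm: the function name `adaptive_overfit_demo`, the sample sizes, and the sign-vote classifier are all assumptions chosen for illustration. The features and labels are independent coin flips, so no classifier can truly beat 50% accuracy. The second "analysis" is nevertheless built from the answers to the first round of queries on the same data.

```python
import random


def adaptive_overfit_demo(n=200, d=500, n_fresh=2000, seed=0):
    """Toy illustration of adaptive dataset reuse (hypothetical setup).

    Features and labels are independent fair coins in {-1, +1}, so the
    true accuracy of any classifier is exactly 0.5.
    """
    rng = random.Random(seed)

    def draw(m):
        X = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(m)]
        y = [rng.choice((-1, 1)) for _ in range(m)]
        return X, y

    X, y = draw(n)

    # Round 1: d non-adaptive queries, the empirical correlation of each
    # feature with the label on the dataset.
    corr = [sum(X[i][j] * y[i] for i in range(n)) / n for j in range(d)]

    # Round 2: an adaptive query that depends on the round-1 answers:
    # a vote over features, each weighted by the sign of its correlation.
    signs = [1 if c >= 0 else -1 for c in corr]

    def predict(row):
        return 1 if sum(s * v for s, v in zip(signs, row)) >= 0 else -1

    def accuracy(X_eval, y_eval):
        hits = sum(predict(r) == t for r, t in zip(X_eval, y_eval))
        return hits / len(y_eval)

    reused_acc = accuracy(X, y)            # evaluated on the reused dataset
    fresh_acc = accuracy(*draw(n_fresh))   # evaluated on fresh i.i.d. samples
    return reused_acc, fresh_acc


if __name__ == "__main__":
    reused, fresh = adaptive_overfit_demo()
    print(f"accuracy on reused data: {reused:.2f}")
    print(f"accuracy on fresh data:  {fresh:.2f}")
```

Under these assumptions, the accuracy measured on the reused dataset should come out well above 0.5 while the accuracy on fresh samples stays near 0.5. The gap between the two is the kind of false discovery that standard, non-adaptive guarantees do not rule out.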

Generalization in the Face of Adaptivity: A Bayesian Perspective

This paper shows explicitly that the harms of adaptivity come from the covariance between the behavior of future queries and a Bayes factor-based measure of how much information about the data sample was encoded in the responses given to past queries, and uses this intuition to introduce a new stability notion.

Guaranteed Validity for Empirical Approaches to Adaptive Data Analysis

A framework for providing valid, instance-specific confidence intervals for point estimates that can be generated by heuristics; it gives guarantees that are orders of magnitude better than the best worst-case bounds.

A new analysis of differential privacy’s generalization guarantees (invited paper)

We give a new proof of the "transfer theorem" underlying adaptive data analysis: that any mechanism for answering adaptively chosen statistical queries that is differentially private and

The Everlasting Database: Statistical Validity at a Fair Price

This work proposes a mechanism for answering an arbitrarily long sequence of potentially adaptive statistical queries, by charging a price for each query and using the proceeds to collect additional samples, and guarantees statistical validity without any assumptions on how the queries are generated.

The Limits of Post-Selection Generalization

A tight lower bound is shown on the error of any algorithm that satisfies post hoc generalization and answers adaptively chosen statistical queries, establishing a strong barrier to progress in post-selection data analysis.

Privacy-preserving Prediction

A simple baseline approach based on training several models on disjoint subsets of data and using standard private aggregation techniques to predict has nearly optimal sample complexity for PAC learning of any class of Boolean functions and introduces a substantial overhead for the aggregation step.

Mitigating Bias in Adaptive Data Gathering via Differential Privacy

This paper shows that there exist differentially private bandit algorithms with near optimal regret bounds, and applies existing theorems in the simple stochastic case, and gives a new analysis for linear contextual bandits.

On the Robustness of CountSketch to Adaptive Inputs

A robust estimator is proposed (for a slightly modified sketch) that allows for a quadratic number of queries in the sketch size, which is an improvement factor of √k (for k heavy hitters) over prior "blackbox" approaches.

Learning with User-Level Privacy

User-level DP protects a user’s entire contribution, providing more stringent but more realistic protection against information leaks. The paper shows that for high-dimensional mean estimation, empirical risk minimization with smooth losses, stochastic convex optimization, and learning hypothesis classes with finite metric entropy, the privacy cost decreases as O(1/√m) as users provide more samples.

The structure of optimal private tests for simple hypotheses

Hypothesis testing plays a central role in statistical inference, and is used in many settings where privacy concerns are paramount. This work answers a basic question about privately testing simple



Algorithmic stability for adaptive data analysis

The first upper bounds on the number of samples required to answer more general families of queries, including arbitrary low-sensitivity queries and an important class of optimization queries (alternatively, risk minimization queries), are proved.

Generalization in Adaptive Data Analysis and Holdout Reuse

A simple and practical method for reusing a holdout set to validate the accuracy of hypotheses produced by a learning algorithm operating on a training set and it is shown that a simple approach based on description length can also be used to give guarantees of statistical validity in adaptive settings.

Preserving Statistical Validity in Adaptive Data Analysis

It is shown that, surprisingly, there is a way to estimate an exponential in n number of expectations accurately even if the functions are chosen adaptively, and this gives an exponential improvement over standard empirical estimators that are limited to a linear number of estimates.

Privacy-preserving statistical estimation with optimal convergence rates

It is shown that for a large class of statistical estimators T and input distributions P, there is a differentially private estimator A_T with the same asymptotic distribution as T, which implies that A_T(X) is essentially as good as the original statistic T(X) for statistical inference, for sufficiently large samples.

Preventing False Discovery in Interactive Data Analysis Is Hard

We show that, under a standard hardness assumption, there is no computationally efficient algorithm that given n samples from an unknown distribution can give valid answers to n^{3+o(1)} adaptively

Efficient noise-tolerant learning from statistical queries

This paper formalizes a new but related model of learning from statistical queries, and demonstrates the generality of the statistical query model, showing that practically every class learnable in Valiant’s model and its variants can also be learned in the new model (and thus can be learned in the presence of noise).

Typicality-Based Stability and Privacy

It is shown that if a typically stable interaction with a dataset yields a query from that class, then this query when evaluated on the same dataset will have small generalization error with high probability (i.e., it will not overfit to the dataset).

Interactive fingerprinting codes and the hardness of preventing false discovery

It is shown that, under a standard hardness assumption, there is no computationally efficient algorithm that, given n samples from an unknown distribution, can give valid answers to O(n^2) adaptively chosen statistical queries.

Max-Information, Differential Privacy, and Post-selection Hypothesis Testing

A principled study of how the generalization properties of approximate differential privacy can be used to perform adaptive hypothesis testing, while giving statistically valid p-value corrections, by observing that the guarantees of algorithms with bounded approximate max-information are sufficient to correct the p-values of adaptively chosen hypotheses.

A Multiplicative Weights Mechanism for Privacy-Preserving Data Analysis

A new differentially private multiplicative weights mechanism is presented for answering a large number of interactive counting (or linear) queries that arrive online and may be adaptively chosen. It is also shown that when the input database is drawn from a smooth distribution (one that does not place too much weight on any single data item), accuracy remains as above and the running time becomes poly-logarithmic in the data universe size.