How Much Does Your Data Exploration Overfit? Controlling Bias via Information Usage

@article{Russo2020HowMD,
  title={How Much Does Your Data Exploration Overfit? Controlling Bias via Information Usage},
  author={Daniel Russo and J. Zou},
  journal={IEEE Transactions on Information Theory},
  year={2020},
  volume={66},
  pages={302--323}
}
  • Daniel Russo, J. Zou
  • Published 2020
  • Computer Science, Mathematics
  • IEEE Transactions on Information Theory
  • Modern data is messy and high-dimensional, and it is often not clear a priori what are the right questions to ask. [...] Our general framework also naturally motivates randomization techniques that provably reduce exploration bias while preserving the utility of the data analysis. We discuss the connections between our approach and related ideas from differential privacy and blinded data analysis, and supplement our results with illustrative simulations.
    Multiaccuracy: Black-Box Post-Processing for Fairness in Classification
    • 43
    Where is the Information in a Deep Neural Network?
    • 21
    Chaining Mutual Information and Tightening Generalization Bounds
    • 28
    • Highly Influenced
    Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates
    • 13
    The Role of the Information Bottleneck in Representation Learning
    • 13
    On the Robustness of Information-Theoretic Privacy Measures and Mechanisms
    • 8
    Deep Exploration via Randomized Value Functions
    • 103
    Statistical Mechanics and Information Theory in Approximate Robust Inference
    • 1

    References

    Publications referenced by this paper.
    SHOWING 1-10 OF 43 REFERENCES
    The reusable holdout: Preserving validity in adaptive data analysis
    • 211
    • Highly Influential
    Controlling the false discovery rate: a practical and powerful approach to multiple testing
    • 60,318
    Preserving Statistical Validity in Adaptive Data Analysis
    • 230
    • Highly Influential
    THE CONTROL OF THE FALSE DISCOVERY RATE IN MULTIPLE TESTING UNDER DEPENDENCY
    • 7,092
    False-Positive Psychology
    • 3,307
    Generalization in Adaptive Data Analysis and Holdout Reuse
    • 136
    Least angle regression
    • 7,795
    • Highly Influential
    Statistical learning and selective inference
    • 159
    Independent filtering increases detection power for high-throughput experiments
    • 509