Corpus ID: 236087942

Learning to Limit Data Collection via Scaling Laws: Data Minimization Compliance in Practice

Divya Shanmugam, Samira Shabanian, Fernando Díaz, Michèle Finck, Asia J. Biega
Data minimization is a legal obligation defined in the European Union’s General Data Protection Regulation (GDPR) as the responsibility to process an adequate, relevant, and limited amount of personal data in relation to a processing purpose. However, unlike fairness or transparency, the principle has not seen wide adoption for machine learning systems due to a lack of computational interpretation. In this paper, we build on literature in machine learning and law to propose the first learning…

Reviving Purpose Limitation and Data Minimisation in Personalisation, Profiling and Decision-Making Systems
This paper examines, through an interdisciplinary law and computer science lens, whether data minimisation and purpose limitation can be meaningfully implemented in data-driven algorithmic systems, including personalisation, profiling and decision-making systems.
Operationalizing the Legal Principle of Data Minimization for Personalization
The performance decrease incurred by data minimization might not be substantial, but it might disparately impact different users, a finding that has implications for the viability of different formal minimization definitions.
Slice Tuner: A Selective Data Collection Framework for Accurate and Fair Machine Learning Models
Slice Tuner is a practical tool that suggests concrete action items based on model analysis, iteratively and efficiently updating the learning curves as more data is collected; it significantly outperforms baselines in terms of model accuracy and fairness.
Auditing Algorithms: On Lessons Learned and the Risks of Data Minimization
This paper presents the algorithmic audit of REM!X, a personalized well-being recommendation app developed by Telefónica Innovación Alpha, and provides insights into how general data ethics principles such as data minimization, fairness, non-discrimination and transparency can be operationalized via algorithmic auditing.
The Label Complexity of Active Learning from Observational Data
This work incorporates a more efficient counterfactual risk minimizer into the active learning algorithm and proves that the resulting algorithm is statistically consistent as well as more label-efficient than prior work.
Exploring recommendations under user-controlled data filtering
This paper explores how recommendation performance may be affected by time-sensitive user data filtering, that is, users choosing to share only the most recent "N days" of data, and suggests a potential win-win solution for services and end users.
Icebreaker: Element-wise Active Information Acquisition with Bayesian Deep Latent Gaussian Model
This paper proposes Icebreaker, a principled framework for approaching the ice-start problem, which uses a full Bayesian Deep Latent Gaussian Model (BELGAM) with a novel inference method that combines recent advances in amortized inference and stochastic gradient MCMC to enable fast and accurate posterior inference.
“Data Strikes”: Evaluating the Effectiveness of a New Form of Collective Action Against Technology Companies
The results suggest that data strikes can be effective and that users have more power in their relationship with technology companies than they do with other companies, but also highlight important trade-offs and challenges that must be considered by potential organizers.
Predicting sample size required for classification performance
A simple and effective sample size prediction algorithm that performs weighted fitting of learning curves outperformed an unweighted algorithm described in previous literature and can help researchers determine annotation sample size for supervised machine learning.
Speeding Up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves
This paper mimics the early termination of bad runs using a probabilistic model that extrapolates the performance from the first part of a learning curve, enabling state-of-the-art hyperparameter optimization methods for DNNs to find DNN settings that yield better performance than those chosen by human experts.
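
Several of the entries above rely on fitting or extrapolating learning curves to decide how much data is needed. The following is a minimal sketch of that general idea, not the method of any paper listed here. It assumes the error follows a simple power law, error(n) = b * n^(-c), fits it by weighted least squares in log-log space, and weights larger training sizes more heavily. The function names, the weighting scheme, and the synthetic data are all illustrative assumptions.

```python
import math

def fit_power_law(sizes, errors, weights=None):
    """Weighted least-squares fit of log(error) = log(b) - c * log(n).

    Returns (b, c) for the assumed model error(n) = b * n ** (-c).
    """
    if weights is None:
        weights = [1.0] * len(sizes)
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    w_sum = sum(weights)
    x_mean = sum(w * x for w, x in zip(weights, xs)) / w_sum
    y_mean = sum(w * y for w, y in zip(weights, ys)) / w_sum
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - x_mean) * (y - y_mean)
              for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope

def samples_needed(b, c, target_error):
    """Smallest n at which the fitted curve reaches target_error."""
    return math.ceil((b / target_error) ** (1.0 / c))

# Synthetic, noise-free observations generated from error = 2 * n ** -0.5.
sizes = [100, 200, 400, 800]
errors = [2.0 * n ** -0.5 for n in sizes]

# Assumption: later points on the curve are trusted more, so weight by size.
b, c = fit_power_law(sizes, errors, weights=sizes)
n_target = samples_needed(b, c, target_error=0.05)
```

Under these assumptions, `n_target` estimates the point beyond which further collection is not needed to reach the chosen error level. Real learning curves are noisy and may not follow a pure power law, so an estimate like this should be treated as a rough guide.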