Corpus ID: 6249086

Private Exploration Primitives for Data Cleaning

Chang Ge, Ihab F. Ilyas, Xi He, Ashwin Machanavajjhala
Data cleaning, the process of detecting and repairing inaccurate or corrupt records in the data, is inherently human-driven. State-of-the-art systems assume cleaning experts can access the data (or a sample of it) to tune the cleaning process. However, in many cases, privacy constraints disallow unfettered access to the data. To address this challenge, we observe and provide empirical evidence that data cleaning can be achieved without access to the sensitive data, but with access to a…

APEx: Accuracy-Aware Differentially Private Data Exploration

This work presents APEx, a novel system that lets data analysts pose adaptively chosen sequences of queries together with required accuracy bounds. APEx returns query answers that meet those bounds and proves to the data owner that the entire data exploration process is differentially private.
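APEx's accuracy-first interface can be illustrated with the standard Laplace tail bound: to guarantee a noisy answer is within α of the truth with probability at least 1 − β, it suffices to run the Laplace mechanism with ε = Δ · ln(1/β) / α, where Δ is the query's sensitivity. A minimal sketch of that translation (the function name is mine for illustration, not APEx's actual API):

```python
import math

def laplace_epsilon_for_accuracy(alpha, beta, sensitivity=1.0):
    """Smallest epsilon such that Laplace noise of scale sensitivity/epsilon
    stays within +/- alpha with probability at least 1 - beta.
    Uses the Laplace tail bound: P(|Lap(b)| > alpha) = exp(-alpha / b)."""
    return sensitivity * math.log(1.0 / beta) / alpha
```

For example, requiring a count query (sensitivity 1) to be within 100 of the truth with 95% probability needs ε = ln(20)/100 ≈ 0.03; tightening the bound to 50 doubles the privacy cost.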

Private True Data Mining: Differential Privacy Featuring Errors to Manage Internet-of-Things Data

With the proposed TDP-based solution, the amount of noise added by differential privacy techniques can be reduced by approximately 20%, and it is proved that the privacy protection level does not decrease as long as the measurement error is not overestimated.

A Review of Data Cleaning Methods for Web Information System

This paper reviews the state-of-the-art methods for data cleaning in WIS, covering sub-elements such as data and user interaction, data quality rules, models, crowdsourcing, and privacy preservation.

Technical Perspective: Toward Building Entity Matching Management Systems

Magellan is a new kind of entity matching (EM) system that provides how-to guides telling users what to do in each EM scenario, step by step, along with tools to help users execute those steps.

Differentially Private k-Nearest Neighbor Missing Data Imputation

Using techniques that employ smooth sensitivity, this work develops a method for k-nearest neighbor missing data imputation with differential privacy. This requires bounding the number of data…
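The basic pipeline — impute a missing value from its k nearest neighbors, then perturb the result — can be sketched as below. This toy version adds Laplace noise calibrated to the global sensitivity of a clipped k-neighbor mean, (hi − lo)/k; the paper's contribution is the much tighter smooth-sensitivity analysis, which this sketch does not implement:

```python
import math
import random

def dp_knn_impute(rows, target_idx, missing_col, k, epsilon, lo, hi):
    """Impute rows[target_idx][missing_col] as the mean of its k nearest
    neighbors (Euclidean distance on the other columns), clipped to
    [lo, hi], plus Laplace noise of scale (hi - lo) / (k * epsilon).
    A toy sketch using global sensitivity, not smooth sensitivity."""
    target = rows[target_idx]

    def dist(r):
        return math.dist([v for j, v in enumerate(r) if j != missing_col],
                         [v for j, v in enumerate(target) if j != missing_col])

    neighbors = sorted((r for i, r in enumerate(rows) if i != target_idx),
                       key=dist)[:k]
    mean = sum(min(max(r[missing_col], lo), hi) for r in neighbors) / k
    scale = (hi - lo) / (k * epsilon)
    # Difference of two Exp(1) draws is Laplace(0, 1)-distributed.
    noise = scale * (random.expovariate(1.0) - random.expovariate(1.0))
    return mean + noise
```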

PrivateClean: Data Cleaning and Differential Privacy

PrivateClean explores the link between data cleaning and differential privacy in a framework that includes a technique for creating private datasets of numerical and discrete-valued attributes, a formalism for privacy-preserving data cleaning, and techniques for answering sum, count, and avg queries after cleaning.
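PrivateClean's local perturbation of discrete-valued attributes is in the spirit of generalized randomized response, where downstream counts must be bias-corrected for the injected randomness. A hedged sketch of that primitive with the standard unbiased count estimator (an illustration of the idea, not PrivateClean's exact estimators):

```python
import math
import random

def grr_perturb(value, domain, epsilon):
    """Generalized randomized response: keep the true value with
    probability p = e^eps / (e^eps + |domain| - 1), otherwise report
    a uniformly random other domain value."""
    d = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + d - 1)
    if random.random() < p:
        return value
    return random.choice([v for v in domain if v != value])

def grr_count(reports, target, domain, epsilon):
    """Unbiased estimate of how many true values equal `target`,
    correcting for the perturbation probabilities p and q."""
    d, n = len(domain), len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + d - 1)
    q = (1 - p) / (d - 1)
    observed = sum(1 for r in reports if r == target)
    return (observed - n * q) / (p - q)
```

The correction follows because the expected observed count of a value v is n_v · p + (n − n_v) · q; solving for n_v gives the estimator above.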

A sample-and-clean framework for fast and accurate query processing on dirty data

The Sample-and-Clean framework is introduced, which applies data cleaning to a relatively small subset of the data and uses the results of the cleaning process to lessen the impact of dirty data on aggregate query answers, deriving confidence intervals as a function of sample size.
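The clean-a-sample idea can be sketched as below: clean only a random sample, then report a mean estimate with a normal-approximation confidence interval that shrinks as O(1/√n). This is a simplified sketch of the sampling-and-estimation step only, not the framework's full estimators:

```python
import math
import random

def sample_and_clean_mean(dirty, clean_fn, sample_size, z=1.96):
    """Estimate the mean of the fully cleaned data by cleaning only a
    random sample. Returns (estimate, half-width of an approximate 95%
    confidence interval); the half-width shrinks as 1/sqrt(sample_size)."""
    sample = random.sample(dirty, sample_size)
    cleaned = [clean_fn(v) for v in sample]
    n = len(cleaned)
    mean = sum(cleaned) / n
    var = sum((v - mean) ** 2 for v in cleaned) / (n - 1)
    return mean, z * math.sqrt(var / n)
```

A larger sample buys a tighter interval, which is exactly the cleaning-effort vs. accuracy trade-off the framework exposes.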

Practical Differential Privacy for SQL Queries Using Elastic Sensitivity

FLEX is a practical end-to-end system that enforces differential privacy for SQL queries using elastic sensitivity, a novel method for approximating the local sensitivity of queries with general equijoins. The authors prove that elastic sensitivity is an upper bound on local sensitivity and can therefore be used to enforce differential privacy with any local-sensitivity-based mechanism.

Optimizing linear counting queries under differential privacy

This work proposes the matrix mechanism, a new algorithm for answering a workload of predicate counting queries, and shows that computing the optimal query strategy for a given workload can be formulated as a rank-constrained semidefinite program.

A Data- and Workload-Aware Query Answering Algorithm for Range Queries Under Differential Privacy

This paper describes a new algorithm for answering a given set of range queries under ε-differential privacy that often achieves substantially lower error than competing methods, and that can achieve the benefits of data-dependence on both "easy" and "hard" databases.

Calibrating Noise to Sensitivity in Private Data Analysis

The study is extended to general functions f, proving that privacy can be preserved by calibrating the standard deviation of the noise according to the sensitivity of the function f, which is the amount that any single argument to f can change its output.
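The calibration described above is the classic Laplace mechanism: for a function f with sensitivity Δf, releasing f(D) plus Laplace noise of scale Δf/ε satisfies ε-differential privacy. A minimal sketch, sampling the Laplace distribution as the difference of two exponentials:

```python
import random

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Release f(D) + Lap(sensitivity / epsilon): noise calibrated to the
    most any single record can change f's output."""
    scale = sensitivity / epsilon
    # The difference of two independent Exp(1) draws is Laplace(0, 1).
    noise = scale * (random.expovariate(1.0) - random.expovariate(1.0))
    return true_answer + noise
```

For a count query the sensitivity is 1 (one record changes the count by at most one), so `laplace_mechanism(count, 1.0, epsilon)` suffices.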

The matrix mechanism: optimizing linear counting queries under differential privacy

The matrix mechanism, an algorithm for answering a workload of linear counting queries that adapts the noise distribution to properties of the provided queries, is described and it is shown that this problem can be formulated as a rank-constrained semidefinite program.
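The mechanism's answering step can be sketched in a few lines: add Laplace noise, calibrated to the strategy matrix's largest L1 column norm, to the strategy answers A·x, then reconstruct the workload answers as W·A⁺·(A·x + noise). The sketch below uses the identity strategy for a prefix-sum workload; the paper's actual contribution is optimizing the choice of A, which this toy version does not attempt:

```python
import numpy as np

def matrix_mechanism(workload, strategy, x, epsilon, rng):
    """Answer a workload W of linear counting queries over histogram x by
    perturbing the strategy queries A x with Laplace noise, then mapping
    back through the pseudoinverse: W A+ (A x + noise)."""
    A = np.asarray(strategy, dtype=float)
    W = np.asarray(workload, dtype=float)
    sensitivity = np.abs(A).sum(axis=0).max()  # largest L1 column norm
    noisy = A @ x + rng.laplace(scale=sensitivity / epsilon, size=A.shape[0])
    return W @ np.linalg.pinv(A) @ noisy
```

With a well-chosen strategy (e.g. hierarchical queries), each workload answer combines only a few noisy measurements, which is where the error savings come from.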

Privacy: Theory meets Practice on the Map

In this paper, we propose the first formal privacy analysis of synthetic data generation, a data anonymization technique becoming popular in the statistics community. The…

ActiveClean: Interactive Data Cleaning For Statistical Modeling

This work proposes ActiveClean, which supports progressive and iterative cleaning in statistical modeling problems while preserving convergence guarantees, and returns more accurate models than uniform sampling and active learning.

Calibrating Data to Sensitivity in Private Data Analysis

The data analysis platform wPINQ is detailed, which generalizes Privacy Integrated Queries (PINQ) to weighted datasets and shows how to integrate probabilistic inference techniques to synthesize datasets respecting more complicated (and less easily interpreted) measurements.