• Corpus ID: 219179388

Identification Risk Evaluation of Continuous Synthesized Variables

Ryan Hornby and Jingchen Hu
arXiv: Methodology
We propose a general approach to evaluating identification risk of continuous synthesized variables in partially synthetic data. We introduce the use of a radius $r$ in the construction of identification risk probability of each target record, and illustrate with working examples for one or more continuous synthesized variables. We demonstrate our methods with applications to a data sample from the Consumer Expenditure Surveys (CE), and discuss the impacts on risk and data utility of 1) the… 
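As a rough illustration of the radius idea for a single continuous synthesized variable, the sketch below treats every synthetic record whose value lies within an absolute distance r of a target record's true value as a candidate match. The target's risk is one over the number of candidates if its own synthetic record is among them, and zero otherwise. This is a minimal sketch under stated assumptions, not the paper's exact definition: the function name `match_risks`, the absolute (rather than relative) radius, and the index-aligned pairing of original and synthetic records are all hypothetical choices made for the example.

```python
from typing import List


def match_risks(original: List[float], synthetic: List[float], r: float) -> List[float]:
    """Per-record identification risk for one continuous synthesized variable.

    Assumptions (hypothetical, for illustration only):
    - original[i] and synthetic[i] refer to the same underlying record;
    - the intruder knows the true value original[i] of the target record;
    - a synthetic record is a candidate match if it lies within absolute
      radius r of that true value.
    """
    risks = []
    for i, true_value in enumerate(original):
        # Indices of synthetic records that fall inside the radius.
        candidates = [j for j, s in enumerate(synthetic) if abs(s - true_value) <= r]
        if i in candidates:
            # The true record is among the candidates: a uniform guess
            # over the candidates picks it with probability 1/len(candidates).
            risks.append(1.0 / len(candidates))
        else:
            # The true record falls outside the radius, so it cannot be matched.
            risks.append(0.0)
    return risks


original = [10.0, 20.0, 30.0]
synthetic = [11.0, 19.0, 45.0]
small_radius = match_risks(original, synthetic, r=2.0)
large_radius = match_risks(original, synthetic, r=10.0)
```

In this toy data a larger radius pulls more synthetic records into each candidate set, which lowers the per-record risk for the records that remain matchable.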

Bayesian Estimation of Attribute and Identification Disclosure Risks in Synthetic Data
This paper focuses on the detailed re-construction of some Bayesian methods proposed for estimating disclosure risks in synthetic data, to give the readers a comprehensive view of the Bayesian estimation procedures, and enable synthetic data researchers and producers to use these procedures to evaluate disclosure risks.
General and specific utility measures for synthetic data
A previous general measure of data utility, the propensity score mean-squared-error (pMSE), is adapted to the specific case of synthetic data, and its distribution is derived for the case when the correct synthesis model is used to create the synthetic data.
Estimating Risks of Identification Disclosure in Partially Synthetic Data
This paper describes how to evaluate identification disclosure risks in partially synthetic data, accounting for released information from the multiple datasets, the model used to generate synthetic values, and the approach used to select values to synthesize.
Practical Data Synthesis for Large Samples
New variance estimates are introduced for use with large samples of completely synthesised data; they do not require the data to be generated from the posterior predictive distribution derived from the observed data, and they can be used with a single synthetic data set.
Bayesian Data Synthesis and Disclosure Risk Quantification: An Application to the Consumer Expenditure Surveys
This work develops innovative disclosure risk measures to quantify the inherent risks in the original CE data and how those risks are ameliorated by two non-parametric Bayesian models used as data synthesizers for the county identifier of each data record.
Likelihood Based Finite Sample Inference for Singly Imputed Synthetic Data Under the Multivariate Normal and Multiple Linear Regression Models
In this paper we develop likelihood-based finite sample inference based on singly imputed partially synthetic data, when the original data follow either a multivariate normal or a multiple linear
Using CART to generate partially synthetic public use microdata
This article presents and evaluates the use of classification and regression trees (CART) to generate partially synthetic data, and studies via simulation potential applications of CART for generating synthetic data for sensitive variables.
Global Measures of Data Utility for Microdata Masked for Disclosure Limitation
When releasing microdata to the public, data disseminators typically alter the original data to protect the confidentiality of database subjects' identities and sensitive attributes. However, such
Bayesian Pseudo Posterior Mechanism under Differential Privacy
A Bayesian pseudo posterior mechanism is proposed to generate record-level synthetic datasets with a differential privacy (DP) guarantee from any proposed synthesizer model, and it is shown that utility is better preserved for this mechanism than for the exponential mechanism (EM) estimated on the same non-private synthesizer.
Statistical Disclosure Limitation in the Presence of Edit Rules
A simulation study based on data from the Colombian Annual Manufacturing Survey suggests that variants of microaggregation and partially synthetic data offer the most attractive risk-utility profiles among the SDL strategies.