• Corpus ID: 231662446

Fidelity and Privacy of Synthetic Medical Data

Ofer Mendelevitch, Michael D. Lesh
The digitization of medical records ushered in a new era of big data to clinical science, and with it the possibility that data could be shared, to multiply insights beyond what investigators could abstract from paper records. The need to share individual-level medical data to accelerate innovation in precision medicine continues to grow, and has never been more urgent, as scientists grapple with the COVID-19 pandemic. However, enthusiasm for the use of big data has been tempered by a fully… 
Private sampling: a noiseless approach for generating differentially private synthetic data
The first noise-free method to construct differentially private synthetic data is proposed, using the Boolean cube as a benchmark data model, and explicit bounds on the accuracy and privacy of the constructed synthetic data are derived.
Private measures, random walks, and synthetic data
A polynomial-time algorithm is developed that creates a private measure from a data set using metric privacy, a powerful generalization of differential privacy, and an asymptotically sharp min-max result is proved for private measures and synthetic data for general compact metric spaces.
Adversarial Attacks Against Deep Generative Models on Data: A Survey
This comprehensive and specialized survey on the security and privacy preservation of GANs and VAEs focuses on the inner connection between attacks and model architectures and, more specifically, on five components of deep generative models.
Synthetic Data - what, why and how?
approaches for empirically evaluating synthetic data, both in terms of its privacy, and its utility and fidelity.
GAN-Based Approaches for Generating Structured Data in the Medical Domain
An evaluation framework is developed and implemented where binary classifiers are trained on extended datasets containing both real and synthetic data, and results show improved accuracy for classifiers trained with generated data from more advanced GAN models, even when limited amounts of original data are available.


Generating Electronic Health Records with Multiple Data Types and Constraints
This paper introduces a method to simulate EHRs composed of multiple data types by refining the GAN model, accounting for feature constraints, and incorporating key utility measures for such generation tasks, without sacrificing privacy.
Synthetic Data for Social Good
Important use cases for synthetic data that challenge the state of the art in privacy-preserving data generation are discussed, and DataSynthesizer is described, a dataset generation tool that takes a sensitive dataset as input and generates a structurally and statistically similar synthetic dataset, with strong privacy guarantees, as output.
k-Anonymity: A Model for Protecting Privacy
  • L. Sweeney
  • Computer Science
    Int. J. Uncertain. Fuzziness Knowl. Based Syst.
  • 2002
The solution provided in this paper includes a formal protection model named k-anonymity and a set of accompanying policies for deployment and examines re-identification attacks that can be realized on releases that adhere to k-anonymity unless accompanying policies are respected.
Generation and evaluation of synthetic patient data
This paper evaluates three classes of synthetic data generation approaches: probabilistic models, classification-based imputation models, and generative adversarial neural networks. It discusses the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.
Synthetic Data - A Privacy Mirage
It is found that, across the board, synthetic data provides little privacy gain even under a black-box adversary with access to only a single synthetic dataset, which points to the need to reconsider whether synthetic data is an appropriate strategy for privacy-preserving data publishing.
Data Sharing: An Ethical and Scientific Imperative.
The data sharing process has generated controversy [1-5] about which data should be shared, with whom, and how quickly, but there is limited information to help guide the discussion.
Approach and Method for Generating Realistic Synthetic Electronic Healthcare Records for Secondary Use
The GRiSER method is presented for use in PADARSER to allow the RS-EHR to be synthesized for statistically significant localised synthetic patients with statistically prevalent medical conditions based upon information found from publicly available data sources.
A Data Utility-Driven Benchmark for De-identification Methods
The proposed solution systematically compares de-identification methods while considering their nature, context and de-identified data set goal in order to provide a combination of methods that satisfies privacy requirements while minimizing losses of data utility.
Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record
Synthea, an open-source software package that simulates the lifespans of synthetic patients, modeling the 10 most frequent reasons for primary care encounters and the 10 chronic conditions with the highest morbidity in the United States, is developed.
Predicting Social Security numbers from public data
Using only publicly available information, a correlation between individuals' SSNs and their birth data is observed and it is found that for younger cohorts the correlation allows statistical inference of private SSNs.