• Corpus ID: 227745419

Synthetic Data: Opening the data floodgates to enable faster, more directed development of machine learning methods

James Jordon, A. Wilson, Mihaela van der Schaar
Many ground-breaking advancements in machine learning can be attributed to the availability of a large volume of rich data. Unfortunately, many large-scale datasets, such as healthcare data, are highly sensitive and are not widely available to the machine learning community. Generating synthetic data with privacy guarantees provides one solution, allowing meaningful research to be carried out "at scale" - by allowing the entirety of the machine learning community to potentially accelerate…
Generating Synthetic Mixed-type Longitudinal Electronic Health Records for Artificial Intelligent Applications
EHR-M-GAN is a generative adversarial network (GAN) that synthesizes mixed-type time-series EHR data and may be useful for developing AI algorithms in resource-limited settings, lowering the barrier to data acquisition while preserving patient privacy.
A review of Generative Adversarial Networks for Electronic Health Records: applications, evaluation measures and data sources
This work reviews the major developments in applications of GANs for EHRs, provides an overview of the proposed methodologies, and combines perspectives from healthcare applications and machine learning techniques in terms of source datasets and the fidelity and privacy evaluation of the generated synthetic datasets.
Interpretable machine learning for high-dimensional trajectories of aging health
The dynamic joint interpretable network (DJIN) model is scalable to large longitudinal data sets, is predictive of individual high-dimensional health trajectories and survival from baseline health states, and infers an interpretable network of directed interactions between the health variables.
Synthetic Data - what, why and how?
approaches for empirically evaluating synthetic data, both in terms of its privacy, and its utility and fidelity.


Anonymization Through Data Synthesis Using Generative Adversarial Networks (ADS-GAN)
A novel framework for generating synthetic data that closely approximates the joint distribution of variables in an original EHR dataset is proposed, providing a readily accessible, legally and ethically appropriate solution to support more open data sharing, enabling the development of AI solutions.
PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees
This paper investigates a method for ensuring (differential) privacy of the generator in the Generative Adversarial Nets (GAN) framework by modifying the Private Aggregation of Teacher Ensembles (PATE) framework and applying it to GANs.
Scalable Private Learning with PATE
This work shows how PATE can scale to learning tasks with large numbers of output classes and uncurated, imbalanced training data with errors, introduces new noisy aggregation mechanisms for teacher ensembles that are more selective and add less noise, and proves their tighter differential-privacy guarantees.
Generating Multi-label Discrete Electronic Health Records using Generative Adversarial Networks
It is shown that medGAN generates synthetic EHR datasets that achieve comparable performance to real data on many experiments including distribution statistics, predictive modeling tasks and medical expert review.
Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data
Private Aggregation of Teacher Ensembles (PATE) is demonstrated: it combines, in a black-box fashion, multiple models trained with disjoint datasets, such as records from different subsets of users, and achieves state-of-the-art privacy/utility trade-offs on MNIST and SVHN.
Privacy-Preserving Generative Deep Neural Networks Support Clinical Data Sharing
Deep neural networks that generate synthetic participants facilitate secondary analyses and reproducible investigation of clinical datasets by enhancing data sharing while preserving participant privacy.
Deep Learning with Differential Privacy
This work develops new algorithmic techniques for learning and a refined analysis of privacy costs within the framework of differential privacy, and demonstrates that deep neural networks can be trained with non-convex objectives, under a modest privacy budget, and at a manageable cost in software complexity, training efficiency, and model quality.
Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs
This work proposes a Recurrent GAN (RGAN) and Recurrent Conditional GAN (RCGAN) to produce realistic real-valued multi-dimensional time series, with an emphasis on their application to medical data.
Differentially Private Generative Adversarial Network
This paper proposes a differentially private GAN (DPGAN) model and demonstrates that it can generate high-quality data points at a reasonable privacy level by adding carefully designed noise to gradients during the learning procedure.
Hide-and-Seek Privacy Challenge
The AmsterdamUMCdb dataset is presented, which aims to advance generative techniques for dense and high-dimensional temporal data streams that are clinically meaningful in terms of fidelity and predictivity, as well as capable of minimizing membership privacy risks in terms of the concrete notion of patient re-identification.