Copula-based synthetic data augmentation for machine-learning emulators

David Meyer, Thomas Nagler, Robin J. Hogan. Geoscientific Model Development.
Abstract. Can we improve machine-learning (ML) emulators with synthetic data? If data are scarce or expensive to source and a physical model is available, statistically generated data may be useful for augmenting training sets cheaply. Here we explore the use of copula-based models for generating synthetically augmented datasets in weather and climate by testing the method on a toy physical model of downwelling longwave radiation and corresponding neural network emulator. Results show that for… 
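
The augmentation idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation and not Synthia's API: it assumes a Gaussian copula with empirical marginals, and the function names `fit_gaussian_copula` and `sample_gaussian_copula` and the toy data are hypothetical, introduced only for this example.

```python
import numpy as np
from scipy import stats


def fit_gaussian_copula(data):
    """Estimate the copula correlation matrix from data of shape (n, d)."""
    n = data.shape[0]
    # Pseudo-observations: per-column ranks scaled into (0, 1).
    u = stats.rankdata(data, axis=0) / (n + 1)
    # Map to standard normal scores and estimate their correlation.
    z = stats.norm.ppf(u)
    return np.corrcoef(z, rowvar=False)


def sample_gaussian_copula(data, corr, n_samples, rng):
    """Draw synthetic rows with the fitted dependence and empirical marginals."""
    d = data.shape[1]
    z = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u = stats.norm.cdf(z)
    # Invert each empirical marginal with the sample quantile function.
    return np.column_stack(
        [np.quantile(data[:, j], u[:, j]) for j in range(d)]
    )


rng = np.random.default_rng(0)
# Hypothetical toy data (an assumption): two dependent, non-Gaussian variables.
x = rng.gamma(shape=2.0, size=500)
y = x ** 2 + rng.normal(scale=0.5, size=500)
real = np.column_stack([x, y])

corr = fit_gaussian_copula(real)
synthetic = sample_gaussian_copula(real, corr, n_samples=1000, rng=rng)
# Augmented inputs: original rows followed by synthetic rows.
augmented = np.vstack([real, synthetic])
```

In an emulator setting, synthetic inputs like these would still need labels, for example by running them through the physical model, before being added to the training set.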

Synthia: multidimensional synthetic data generation in Python
In computational sciences such as weather and climate, data often consist of large, labelled multidimensional datasets with complex dependencies.
KGML-ag: A Modeling Framework of Knowledge-Guided Machine Learning to Simulate Agroecosystems: A Case Study of Estimating N2O Emission using Data from Mesocosm Experiments
Abstract. Agricultural nitrous oxide (N2O) emission accounts for a non-trivial fraction of the global greenhouse gas (GHG) budget. To date, estimating N2O fluxes from cropland remains a challenging
Machine Learning Emulation of 3D Cloud Radiative Effects
The radiation scheme ecRad, used for operational predictions at the European Centre for Medium-Range Weather Forecasts, is corrected for 3D cloud radiative effects using computationally cheap neural networks; the emulator increases the overall accuracy for both longwave and shortwave radiation with a negligible impact on the model's runtime performance.
Machine Learning Emulation of Urban Land Surface Processes
This paper presents a probabilistic procedure called “supervised learning” to estimate the intensity and direction of convection in urban land surface processes using a variety of algorithms.


Neural networks for post-processing ensemble weather forecasts
A flexible alternative based on neural networks that can incorporate nonlinear relationships between arbitrary predictor variables and forecast distribution parameters that are automatically learned in a data-driven way rather than requiring prespecified link functions is proposed.
Variational autoencoder based synthetic data generation for imbalanced learning
This paper proposes a variational autoencoder (VAE) based synthetic data generation method for imbalanced learning that can produce new samples which are similar to those in the original dataset, but not exactly the same.
Deep learning and process understanding for data-driven Earth system science
It is argued that contextual cues should be used as part of deep learning to gain further process understanding of Earth system science problems, improving the predictive ability of seasonal forecasting and modelling of long-range spatial connections across multiple timescales.
Accelerating Radiation Computations for Dynamical Models With Targeted Machine Learning and Code Optimization
Using neural networks to replace only one part of traditional radiation code, where the optical properties of the atmosphere are computed, is investigated, finding that this approach can be several times faster, while still being accurate in various situations, such as simulating future climate.
Copulas as High-Dimensional Generative Models: Vine Copula Autoencoders
The proposed approach can transform any already trained AE into a flexible generative model at a low computational cost, an advantage over existing generative models such as adversarial networks and variational AEs which can be difficult to train and can impose strong assumptions on the latent space.
The Synthetic Data Vault
The Synthetic Data Vault is presented, a system that builds generative models of relational databases and is able to sample from the model and create synthetic data, hence the name SDV.
Could Machine Learning Break the Convection Parameterization Deadlock?
A novel approach to convective parameterization based on machine learning is presented, using an aquaplanet with prescribed sea surface temperatures as a proof of concept. It shows that neural networks trained on a high-resolution model in which moist convection is resolved can be an appealing technique for better representing moist convection in coarse-resolution climate models.
A survey on Image Data Augmentation for Deep Learning
This survey presents existing methods for Data Augmentation, promising developments, and meta-level decisions for implementing Data Augmentation, a data-space solution to the problem of limited data.
Using Machine Learning to Parameterize Moist Convection: Potential for Modeling of Climate, Climate Change, and Extreme Events
P. O'Gorman, J. Dwyer. Physics, Environmental Science. Journal of Advances in Modeling Earth Systems, 2018.
The parameterization of moist convection contributes to uncertainty in climate modeling and numerical weather prediction. Machine learning (ML) can be used to learn new parameterizations directly