Corpus ID: 236087674

Analysis of training and seed bias in small molecules generated with a conditional graph-based variational autoencoder - Insights for practical AI-driven molecule generation

Authors: Seung-gu Kang, Joseph A. Morrone, Jeffrey K. Weber, Wendy D. Cornell
The application of deep learning to generative molecule design has shown early promise for accelerating lead series development. However, questions remain concerning how factors like training, dataset, and seed bias impact the technology’s utility to medicinal and computational chemists. In this work, we analyze the impact of seed and training bias on the output of an activity-conditioned graph-based variational autoencoder (VAE). Leveraging a massive, labeled dataset corresponding to the…
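The abstract does not reproduce the model's training objective, but an activity-conditioned VAE of this kind is typically trained by maximizing a conditional evidence lower bound (a standard CVAE formulation, not taken from the paper), where x is the molecular graph, c the activity condition, and z the latent code:

```latex
\mathcal{L}(x, c) =
  \mathbb{E}_{q_\phi(z \mid x, c)}\left[\log p_\theta(x \mid z, c)\right]
  - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x, c) \,\|\, p(z \mid c)\right)
```

Seed bias enters through the molecules used to anchor generation, while training bias enters through the data distribution that shapes q and p.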



PaccMannRL: De novo generation of hit-like anticancer molecules from transcriptomic data via reinforcement learning
A hybrid variational autoencoder is constructed that tailors molecules to target-specific transcriptomic profiles, using an anticancer drug sensitivity prediction model (PaccMann) as the reward function; the generated molecules frequently exhibit the highest structural similarity to compounds with known efficacy against the targeted cancer types.
Constrained Graph Variational Autoencoders for Molecule Design
A variational autoencoder model in which both the encoder and decoder are graph-structured is proposed, and it is shown that appropriate shaping of the latent space allows the design of molecules that are (locally) optimal in desired properties.
Multi-objective de novo drug design with conditional graph generative model
A new de novo molecular design framework is proposed based on a type of sequential graph generator that does not use atom-level recurrent units; the approach is much better tuned for molecule generation and has been scaled up to cover significantly larger molecules in the ChEMBL database.
Constrained Generation of Semantically Valid Graphs via Regularizing Variational Autoencoders
A regularization framework for variational autoencoders is proposed that focuses on the matrix representation of graphs and formulates penalty terms that regularize the output distribution of the decoder to encourage the satisfaction of validity constraints.
Junction Tree Variational Autoencoder for Molecular Graph Generation
The junction tree variational autoencoder generates molecular graphs in two phases: first generating a tree-structured scaffold over chemical substructures, then combining the substructures into a molecule with a graph message passing network, which allows molecules to be expanded incrementally while maintaining chemical validity at every step.
Low Data Drug Discovery with One-Shot Learning
This work demonstrates how one-shot learning can be used to significantly lower the amounts of data required to make meaningful predictions in drug discovery applications and introduces a new architecture, the iterative refinement long short-term memory, that significantly improves the learning of meaningful distance metrics over small molecules.
GuacaMol: Benchmarking Models for De Novo Molecular Design
This work proposes an evaluation framework, GuacaMol, based on a suite of standardized benchmarks, to standardize the assessment of both classical and neural models for de novo molecular design, and describes a variety of single- and multi-objective optimization tasks.
Convolutional Embedding of Attributed Molecular Graphs for Physical Property Prediction.
A convolutional neural network is employed for the embedding task of learning an expressive molecular representation by treating molecules as undirected graphs with attributed nodes and edges, and preserves molecule-level spatial information that significantly enhances model performance.
GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders
This work proposes to sidestep hurdles associated with the linearization of discrete structures by having the decoder output a probabilistic fully-connected graph of a predefined maximum size all at once; the model is formulated as a variational autoencoder.
Randomized SMILES strings improve the quality of molecular generative models
An extensive benchmark on models trained with subsets of GDB-13 of different sizes, with different SMILES variants (canonical, randomized, and DeepSMILES), with two recurrent cell types (LSTM and GRU), and with different hyperparameter combinations shows that models using LSTM cells trained on 1 million randomized SMILES generalize to larger chemical spaces than the other approaches and represent the target chemical space more accurately.
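The randomized-SMILES idea benchmarked above can be illustrated without a cheminformatics toolkit for a toy acyclic, single-bonded molecule. The helper below is a hypothetical sketch (not code from any of these papers): it performs a depth-first traversal from a random root atom with a random neighbor order, yielding a different but equivalent SMILES string for the same molecule on each call.

```python
import random

def randomized_smiles(atoms, bonds, rng):
    """Emit one random SMILES string for a simple acyclic molecule.

    atoms: list of atom symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs (single bonds only, no rings)
    rng:   a random.Random instance controlling root and branch order
    """
    adj = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)

    def dfs(node, parent):
        # Visit neighbors (except the one we came from) in random order.
        order = rng.sample(adj[node], len(adj[node]))
        branches = [dfs(nbr, node) for nbr in order if nbr != parent]
        # SMILES convention: all but the last branch go in parentheses.
        return (atoms[node]
                + "".join(f"({b})" for b in branches[:-1])
                + (branches[-1] if branches else ""))

    return dfs(rng.randrange(len(atoms)), None)

rng = random.Random(0)
# Ethanol as an atom/bond list: C-C-O
variants = {randomized_smiles(["C", "C", "O"], [(0, 1), (1, 2)], rng)
            for _ in range(50)}
print(sorted(variants))  # subset of: C(C)O, C(O)C, CCO, OCC
```

In practice this augmentation is done with a full toolkit such as RDKit (e.g. non-canonical atom orderings when writing SMILES), which also handles rings, bond orders, charges, and stereochemistry that this sketch deliberately omits.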