Group SELFIES: A Robust Fragment-Based Molecular String Representation

Austin H. Cheng, Andy Cai, Santiago Miret, Gustavo Malkomes, Mariano Phielipp, and Alán Aspuru-Guzik
We introduce Group SELFIES, a molecular string representation that uses group tokens to represent functional groups or entire substructures while maintaining chemical robustness guarantees. Molecular string representations such as SMILES and SELFIES serve as the basis for molecular generation and optimization in chemical language models, deep generative models, and evolutionary methods. While SMILES and SELFIES are built from atom-level tokens, Group SELFIES builds on top of the chemical…
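
The core idea of group tokens can be illustrated with a minimal sketch. The vocabulary below is hypothetical (invented token names with illustrative, not chemically exact, expansions; this is not the actual Group SELFIES grammar): each group token stands for a whole substructure and expands into atom-level tokens.

```python
# Hypothetical sketch of group-token expansion. The token sequences below are
# illustrative placeholders, not chemically exact SELFIES fragments.

GROUP_VOCAB = {
    "[:benzene]": ["[C]", "[=C]", "[C]", "[=C]", "[C]", "[=C]", "[Ring1]", "[=Branch1]"],
    "[:carboxyl]": ["[C]", "[=Branch1]", "[C]", "[=O]", "[O]"],
}

def expand_groups(tokens):
    """Replace each group token with its atom-level token sequence;
    non-group tokens pass through unchanged."""
    out = []
    for tok in tokens:
        out.extend(GROUP_VOCAB.get(tok, [tok]))
    return out

# A string mixing group tokens with ordinary atom-level tokens:
expanded = expand_groups(["[:benzene]", "[C]", "[:carboxyl]"])
```

Because a single token covers an entire functional group, generated strings stay short and edits operate at the fragment level rather than atom by atom.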

Recent advances in the Self-Referencing Embedded Strings (SELFIES) library

This work generalizes SELFIES to support a wider range of molecules and semantic constraints, streamlines its underlying grammar, and implements the updated representation in subsequent versions of \selfieslib, which has since made major advances in design, efficiency, and supported features.



Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation

SELFIES (SELF-referencIng Embedded Strings) is a string-based representation of molecules that is 100% robust and allows for explanation and interpretation of the internal workings of generative models.
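
The robustness guarantee stems from derivation rules that can never request more bonds than an atom's remaining valence allows. A greatly simplified pure-Python sketch of that valence-capping idea (the real SELFIES grammar also handles rings, branches, and charges, and this is not the `selfies` library API):

```python
# Minimal sketch of the valence-capping idea behind SELFIES robustness.
# Tokens are (atom, requested_bond_order) pairs; orders that would exceed
# the remaining valence are clamped, so every token stream decodes validly.

MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}

def decode(tokens):
    """Greedily build a chain of (atom, bond_order) pairs, clamping each
    bond order so the previous atom's free valence is never exceeded."""
    chain = []
    remaining = 0  # free valence on the previously placed atom
    for atom, order in tokens:
        if chain:
            order = min(order, remaining, MAX_VALENCE[atom])
            if order == 0:   # previous atom is saturated: skip this token
                continue
        else:
            order = 0        # the first atom has no incoming bond
        chain.append((atom, order))
        remaining = MAX_VALENCE[atom] - order
    return chain
```

Under this scheme an arbitrary token stream such as `[("C", 0), ("O", 3), ("F", 2)]` still decodes: the triple bond to oxygen is clamped to a double bond, and the fluorine token is dropped once no valence remains.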

FastFlows: Flow-Based Models for Molecular Graph Generation

This work proposes a framework combining normalizing-flow models, SELF-Referencing Embedded Strings, and multi-objective optimization to efficiently generate small molecules and enable fast identification of drug-like, synthesizable candidates.

Junction Tree Variational Autoencoder for Molecular Graph Generation

The junction tree variational autoencoder generates molecular graphs in two phases: it first generates a tree-structured scaffold over chemical substructures, then combines them into a molecule with a graph message passing network, which allows molecules to be expanded incrementally while maintaining chemical validity at every step.

Beyond generative models: superfast traversal, optimization, novelty, exploration and discovery (STONED) algorithm for molecules using SELFIES

STONED is a simple and efficient algorithm for interpolation and exploration in chemical space that is comparable to deep generative models; it bypasses the need for large datasets and long training times by applying string modifications directly in the SELFIES representation.
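
A STONED-style mutation step can be sketched in a few lines. The alphabet below is a hypothetical toy subset; the actual algorithm mutates full SELFIES strings and decodes the mutants with the `selfies` library, relying on its robustness so that every mutant maps to a valid molecule.

```python
import random

# Toy token alphabet for illustration only.
ALPHABET = ["[C]", "[N]", "[O]", "[F]", "[=C]", "[Branch1]"]

def mutate(tokens, rng):
    """Apply one random edit (replace, insert, or delete) to a token list.
    With a robust representation, any mutant still decodes to a valid molecule."""
    tokens = list(tokens)  # do not modify the caller's list
    op = rng.choice(["replace", "insert", "delete"]) if len(tokens) > 1 else "insert"
    i = rng.randrange(len(tokens))
    if op == "replace":
        tokens[i] = rng.choice(ALPHABET)
    elif op == "insert":
        tokens.insert(i, rng.choice(ALPHABET))
    else:
        del tokens[i]
    return tokens
```

Repeated mutation plus a scoring function is enough to explore chemical space without any training, which is the key efficiency argument of STONED.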

SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning

SPE could be a promising tokenization method for SMILES-based deep learning models. An open-source Python package, SmilesPE, was developed to implement this algorithm and is now freely available at
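
SPE adapts byte-pair-encoding-style merging to SMILES: the most frequent adjacent token pair in a corpus is repeatedly merged into a single vocabulary entry, so common substructures become single tokens. A minimal sketch of one merge step (pure stdlib, starting from character-level tokens for simplicity; this is not the SmilesPE API):

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent token pairs across a tokenized corpus and return the
    most frequent one (or None for an empty corpus)."""
    pairs = Counter()
    for tokens in corpus:
        pairs.update(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Merge every occurrence of `pair` into a single concatenated token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Character-level start; real SPE begins from atom-level SMILES tokens.
corpus = [list("CCO"), list("CCN"), list("CCCl")]
best = most_frequent_pair(corpus)
```

Iterating these two steps until a target vocabulary size is reached yields a data-driven substructure vocabulary, exactly the loop SPE runs over a large SMILES corpus.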

ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction

This work makes one of the first attempts to systematically evaluate transformers on molecular property prediction tasks via the ChemBERTa model, and suggests that transformers offer a promising avenue of future work for molecular representation learning and property prediction.

Data-Efficient Graph Grammar Learning for Molecular Generation

This work proposes a data-efficient generative model that can be learned from datasets orders of magnitude smaller than common benchmarks; it achieves remarkable performance on a challenging polymer generation task with only 117 training samples and is competitive against existing methods that use 81k data points.

Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models

A benchmarking platform called Molecular Sets (MOSES) is introduced to standardize the training and comparison of molecular generative models, and its results are suggested as reference points for further advances in generative chemistry research.
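
MOSES reports distribution-level metrics such as validity, uniqueness, and novelty for a set of generated molecules. A simplified sketch of these three (the real benchmark canonicalizes molecules with RDKit before comparing them; here `is_valid` is a caller-supplied placeholder check):

```python
def validity(generated, is_valid):
    """Fraction of generated strings accepted by a validity checker."""
    return sum(is_valid(s) for s in generated) / len(generated)

def uniqueness(generated):
    """Fraction of generated strings that are distinct."""
    return len(set(generated)) / len(generated)

def novelty(generated, training_set):
    """Fraction of distinct generated strings absent from the training set."""
    distinct = set(generated)
    return len(distinct - set(training_set)) / len(distinct)
```

Reporting all three together matters: a model can score perfectly on validity by emitting one memorized molecule repeatedly, which uniqueness and novelty immediately expose.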

Multi-Objective Molecule Generation using Interpretable Substructures

This work proposes to offset the complexity of the generative modeling of molecules by composing molecules from a vocabulary of substructures that are likely responsible for each property of interest, called molecular rationales.

Language models can learn complex molecular distributions

This work introduces several challenging generative modeling tasks by compiling larger, more complex distributions of molecules and evaluates the ability of language models on each task, demonstrating that language models are powerful generative models, capable of adeptly learning complex molecular distributions.