Group SELFIES: A Robust Fragment-Based Molecular String Representation
@article{Cheng2022GroupSA,
  title   = {Group SELFIES: A Robust Fragment-Based Molecular String Representation},
  author  = {Austin H. Cheng and Andy Cai and Santiago Miret and Gustavo Malkomes and Mariano Phielipp and Al{\'a}n Aspuru-Guzik},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2211.13322}
}
We introduce Group SELFIES, a molecular string representation that leverages group tokens to represent functional groups or entire substructures while maintaining chemical robustness guarantees. Molecular string representations, such as SMILES and SELFIES, serve as the basis for molecular generation and optimization in chemical language models, deep generative models, and evolutionary methods. While SMILES and SELFIES leverage atomic representations, Group SELFIES builds on top of the chemical…
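The robustness guarantee described above can be illustrated with a toy sketch. The decoder below is a hypothetical, heavily simplified stand-in (it is not the actual SELFIES or Group SELFIES grammar, and `MAX_VALENCE` covers only a few elements): every requested bond order is capped by the remaining valence of both endpoint atoms, so any token string decodes to a chemically consistent structure rather than raising an error.

```python
# Toy sketch (NOT the real SELFIES/Group SELFIES grammar): a decoder that
# caps each requested bond order by the remaining valence of both atoms,
# so every token string maps to some valid structure.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}

def decode(tokens):
    """Decode (symbol, requested_bond_order) pairs into a linear chain,
    returning (symbol, actual_bond_order_to_previous_atom) pairs."""
    atoms, bonds, free = [], [], []
    for symbol, requested_order in tokens:
        if not atoms:
            atoms.append(symbol)
            bonds.append(0)
            free.append(MAX_VALENCE[symbol])
            continue
        # Cap the bond order by what both endpoints can still accept.
        order = min(requested_order, free[-1], MAX_VALENCE[symbol])
        if order == 0:  # previous atom is saturated: stop the chain
            break
        atoms.append(symbol)
        bonds.append(order)
        free[-1] -= order
        free.append(MAX_VALENCE[symbol] - order)
    return list(zip(atoms, bonds))

# An "invalid" request (triple bond into oxygen) is silently repaired:
print(decode([("C", 0), ("O", 3), ("C", 1)]))  # → [('C', 0), ('O', 2)]
```

The point of this style of decoder is that the mapping from strings to molecules is total: generative models can emit arbitrary token sequences without ever producing an unparseable output.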
One Citation
Recent advances in the Self-Referencing Embedded Strings (SELFIES) library
- Computer Science
- 2023
This work generalized SELFIES to support a wider range of molecules and semantic constraints, streamlined its underlying grammar, and implemented the updated representation in subsequent versions of the selfies library, which has made major advances in design, efficiency, and supported features.
References
Showing 1–10 of 54 references
Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation
- Biology · Mach. Learn. Sci. Technol.
- 2020
This paper presents SELFIES (SELF-referencIng Embedded Strings), a string-based representation of molecules that is 100% robust and allows for explanation and interpretation of the internal workings of generative models.
FastFlows: Flow-Based Models for Molecular Graph Generation
- Computer Science · ArXiv
- 2022
This work proposes a framework using normalizing-flow based models, SELF-Referencing Embedded Strings, and multi-objective optimization that efficiently generates small molecules and enables fast generation and identification of druglike, synthesizable molecules.
Junction Tree Variational Autoencoder for Molecular Graph Generation
- Computer Science · ICML
- 2018
The junction tree variational autoencoder generates molecular graphs in two phases: it first generates a tree-structured scaffold over chemical substructures, then combines them into a molecule with a graph message passing network, allowing molecules to be expanded incrementally while maintaining chemical validity at every step.
Beyond generative models: superfast traversal, optimization, novelty, exploration and discovery (STONED) algorithm for molecules using SELFIES
- Computer Science · Chemical Science
- 2021
STONED is proposed – a simple and efficient algorithm to perform interpolation and exploration in the chemical space, comparable to deep generative models, bypassing the need for large amounts of data and training times by using string modifications in the SELFIES molecular representation.
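Because every SELFIES-like string decodes to a valid molecule, STONED-style exploration reduces to cheap random edits on token lists. The sketch below is an assumed simplification (the tiny `ALPHABET` and `mutate` helper are illustrative, not the real SELFIES token set or the STONED implementation), showing how neighborhood sampling works without any learned model.

```python
import random

# Toy sketch of STONED-style exploration. ALPHABET is a made-up miniature
# token set, not the real SELFIES alphabet.
ALPHABET = ["[C]", "[N]", "[O]", "[F]", "[=C]", "[Branch1]"]

def mutate(tokens, rng):
    """Apply one random replace / insert / delete edit to a token list."""
    tokens = list(tokens)
    op = rng.choice(["replace", "insert", "delete"])
    i = rng.randrange(len(tokens))
    if op == "replace":
        tokens[i] = rng.choice(ALPHABET)
    elif op == "insert":
        tokens.insert(i, rng.choice(ALPHABET))
    elif len(tokens) > 1:  # never delete the last remaining token
        del tokens[i]
    return tokens

rng = random.Random(0)  # seeded for reproducibility
seed = ["[C]", "[C]", "[O]"]
neighbors = [mutate(seed, rng) for _ in range(3)]
print(len(neighbors))  # → 3
```

In the actual algorithm each mutated string would be decoded to a molecule and scored, with no training data required.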
SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning
- Computer Science · J. Chem. Inf. Model.
- 2021
SPE could be a promising tokenization method for SMILES-based deep learning models and an open-source Python package SmilesPE was developed to implement this algorithm and is now freely available at https://github.com/XinhaoLi74/SmilesPE.
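SMILES Pair Encoding adapts byte-pair-encoding-style merging to SMILES tokens. The sketch below is a generic pair-merge step in plain Python, not the SmilesPE package's own API: the most frequent adjacent token pair across a corpus is fused into a single substructure token, and repeating this builds a data-driven vocabulary.

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent token pairs across a tokenized corpus and
    return the most common one (or None if the corpus is empty)."""
    counts = Counter()
    for toks in corpus:
        counts.update(zip(toks, toks[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(toks, pair):
    """Replace every occurrence of `pair` with a single fused token."""
    out, i = [], 0
    while i < len(toks):
        if i + 1 < len(toks) and (toks[i], toks[i + 1]) == pair:
            out.append(toks[i] + toks[i + 1])
            i += 2
        else:
            out.append(toks[i])
            i += 1
    return out

# Character-level SMILES tokens; the frequent 'CC' fragment fuses first.
corpus = [list("CCO"), list("CCN"), list("CC=O")]
pair = most_frequent_pair(corpus)           # → ('C', 'C')
corpus = [merge_pair(t, pair) for t in corpus]
print(corpus[0])                            # → ['CC', 'O']
```

Iterating this loop until a target vocabulary size is reached yields frequency-derived substructure tokens, which is the core idea behind SPE's tokenizer.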
ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction
- Computer Science · ArXiv
- 2020
This work makes one of the first attempts to systematically evaluate transformers on molecular property prediction tasks via the ChemBERTa model, and suggests that transformers offer a promising avenue of future work for molecular representation learning and property prediction.
Data-Efficient Graph Grammar Learning for Molecular Generation
- Computer Science · ICLR
- 2022
A data-efficient generative model that can be learned from datasets with orders of magnitude smaller sizes than common benchmarks is proposed that achieves remarkable performance in a challenging polymer generation task with only 117 training samples and is competitive against existing methods using 81k data points.
Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models
- Computer Science · Frontiers in Pharmacology
- 2020
A benchmarking platform called Molecular Sets (MOSES) is introduced to standardize the training and comparison of molecular generative models, with its results suggested as reference points for further advancements in generative chemistry research.
Multi-Objective Molecule Generation using Interpretable Substructures
- Computer Science · ICML
- 2020
This work proposes to offset the complexity of the generative modeling of molecules by composing molecules from a vocabulary of substructures that are likely responsible for each property of interest, called molecular rationales.
Language models can learn complex molecular distributions
- Computer Science · Nature Communications
- 2022
This work introduces several challenging generative modeling tasks by compiling larger, more complex distributions of molecules and evaluates the ability of language models on each task, demonstrating that language models are powerful generative models, capable of adeptly learning complex molecular distributions.