SELFIES and the future of molecular string representations

Authors: Mario Krenn, Qianxiang Ai, Senja Barthel, Nessa Carson, Angelo Frei, Nathan C Frey, Pascal Friederich, Théophile Gaudin, Alberto Gayle, Kevin Maik Jablonka, R. Lameiro, Dominik Lemm, Alston Lo, Seyed Mohamad Moosavi, José Manuel Nápoles-Duarte, AkshatKumar Nigam, Robert Pollice, Kohulan Rajan, Ulrich Schatzschneider, Philippe Schwaller, Marta Skreta, Berend Smit, Felix Strieth-Kalthoff, Chong Sun, G. Tom, Guido Falk von Rudorff, Andrew Wang, Andrew D. White, Adamo Young, Rose Yu, and Alán Aspuru-Guzik
In the context of AI and ML in chemistry, SMILES has several shortcomings. Most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and enabled numerous new applications in chemistry. In this manuscript, we look to the future and discuss molecular string…
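
To make the claim that most symbol combinations are invalid more concrete, the sketch below enumerates every 4-character string over a tiny alphabet and counts how many pass two basic syntax checks: balanced, non-empty branch parentheses and paired ring-closure digits. The function name `passes_basic_checks`, the alphabet, and the string length are illustrative assumptions. The checks are a toy subset of SMILES syntax, not a real parser, and the example does not demonstrate the SELFIES guarantee itself.

```python
from itertools import product


def passes_basic_checks(s: str) -> bool:
    """Toy check of two SMILES syntax rules (hypothetical helper).

    Assumption: only branch parentheses and single-digit ring closures
    are checked. Valence, aromaticity, bracket atoms, and bond symbols
    are ignored, so passing this check does not imply a valid molecule.
    """
    # A string must begin with an atom symbol.
    if not s or not s[0].isalpha():
        return False
    depth = 0            # current branch nesting depth
    open_rings = set()   # ring-closure digits that are still unpaired
    for prev, ch in zip(" " + s, s):
        if ch == "(":
            depth += 1
        elif ch == ")":
            # Reject a close without an open, and empty branches "()".
            if depth == 0 or prev == "(":
                return False
            depth -= 1
        elif ch.isdigit():
            # The first occurrence opens a ring, the second closes it.
            open_rings ^= {ch}
    return depth == 0 and not open_rings


# Enumerate all length-4 strings over a small illustrative alphabet.
alphabet = "C()1"
candidates = ["".join(p) for p in product(alphabet, repeat=4)]
accepted = [s for s in candidates if passes_basic_checks(s)]
print(f"{len(accepted)} of {len(candidates)} strings pass the toy checks")
```

Even under these two rules alone, only a small fraction of the enumerated strings survives, which is the failure mode that a robust representation is designed to avoid.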

Graph neural networks for materials science and chemistry

This review article provides an overview of the basic principles of GNNs, widely used datasets and state-of-the-art architectures, followed by a discussion of a wide range of recent applications of GNNs in chemistry and materials science, and concluding with a roadmap for the further development and application of GNNs.

xtal2png: A Python package for representing crystal structure as PNG files

The ability to feed these images directly into image-based pipelines allows you, as a materials informatics practitioner, to get streamlined results for new state-of-the-art image-based machine learning models applied to crystal structures.

Tartarus: A Benchmarking Platform for Realistic And Practical Inverse Molecular Design

The efficient exploration of chemical space to design molecules with intended properties enables the accelerated discovery of drugs, materials, and catalysts, and is one of the most important…

Advancing data-driven chemistry by beating benchmarks

H. Stein. Trends in Chemistry, 2022.

Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation

This work presents SELFIES (SELF-referencIng Embedded Strings), a string-based representation of molecules that is 100% robust and allows for explanation and interpretation of the internal workings of generative models.

A review of molecular representation in the age of machine learning

Questions for consideration are presented in future work which are believed to make chemical VAEs even more accessible, including string, connection table, feature‐based, and computer‐learned representations.

BigSMILES: A Structurally-Based Line Notation for Describing Macromolecules

A new representation system that is capable of handling the stochastic nature of polymers and based on the popular “simplified molecular-input line-entry system” (SMILES) is proposed, and it aims to provide representations that can be used as indexing identifiers for entries in polymer databases.

STOUT: SMILES to IUPAC names using neural machine translation

This work presents STOUT, a deep-learning neural machine translation approach to generate the IUPAC name for a given molecule from its SMILES string as well as the reverse translation, i.e. predicting the SMILES string from the IUPAC name.

Machine Learning Force Fields

An overview of applications of ML-FFs and the chemical insights that can be obtained from them is given, along with a step-by-step guide for constructing and testing them from scratch.

DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures

A SMILES-like syntax called DeepSMILES is described that addresses two of the main reasons for invalid syntax when using a probabilistic model to generate SMILES strings and can be interconverted to/from SMILES with string processing without any loss of information.

Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction

This work shows that a multihead attention Molecular Transformer model outperforms all algorithms in the literature, achieving a top-1 accuracy above 90% on a common benchmark data set and is able to handle inputs without a reactant–reagent split and including stereochemistry, which makes the method universally applicable.

Importance of Engineered and Learned Molecular Representations in Predicting Organic Reactivity, Selectivity, and Chemical Properties.

The application and suitability of different representations, from expert-guided "engineered" descriptors to automatically "learned" features, in different prediction tasks relevant to organic and organometallic chemistry, are highlighted, where differing amounts of training data are available.

Methods of Writing Constitutional Formulas

This article presents the development of various kinds of chemical formulas and discusses their meaning in the historical context; special attention is paid to line notations developed for computers (WLN, SMILES, InChI, etc.).

“Just as the Structural Formula Does”: Names, Diagrams, and the Structure of Organic Chemistry at the 1892 Geneva Nomenclature Congress

The relationship between diagram and name established at the Geneva Congress became the foundation not only of subsequent systems of chemical nomenclature but of methods of organising information that have supported the modern chemical sciences.