Multilingual Molecular Representation Learning via Contrastive Pre-training

Zhihui Guo, Pramod Kumar Sharma, Andy Martinez, Liang Du and Robin Abraham. Annual Meeting of the Association for Computational Linguistics.
Molecular representation learning plays an essential role in cheminformatics. Recently, language model-based approaches have gained popularity as an alternative to traditional expert-designed features for encoding molecules. However, these approaches only utilize a single molecular language for representation learning. Motivated by the fact that a given molecule can be described using different languages such as the Simplified Molecular-Input Line-Entry System (SMILES), the International Union of Pure and…
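
The core idea of contrastive pre-training over multiple molecular "languages" can be sketched with an InfoNCE-style objective: embeddings of the same molecule in two notations (e.g. SMILES and IUPAC) are pulled together, while embeddings of different molecules are pushed apart. The function name, dimensions, and random embeddings below are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def info_nce_loss(view_a: np.ndarray, view_b: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE loss for a batch of paired embeddings (row i of each view = same molecule)."""
    # L2-normalize so the dot product is cosine similarity.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # pairwise similarity matrix
    # The matching pair for row i is column i; all other columns are negatives.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))  # hypothetical SMILES-view embeddings
# Perfectly aligned views give a low loss; unrelated views give a higher one.
aligned_loss = info_nce_loss(anchors, anchors)
random_loss = info_nce_loss(anchors, rng.normal(size=(4, 8)))
```

Minimizing such a loss encourages an encoder to map every description of a molecule, regardless of notation, to the same point in embedding space.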

A Systematic Survey of Molecular Pre-trained Models

A systematic survey of pre-trained models for molecular representations from several key perspectives including molecular descriptors, encoder architectures, pre-training strategies, and applications is provided.

MORN: Molecular Property Prediction Based on Textual-Topological-Spatial Multi-View Learning

Predicting molecular properties has significant implications for the discovery and generation of drugs and further research in the domain of medicinal chemistry. Learning representations of molecules

Multilingual and Multimodal Topic Modelling with Pretrained Embeddings

This paper presents M3L-Contrast—a novel multimodal multilingual (M3L) neural topic model for comparable data that maps texts from multiple languages and images into a shared topic space and demonstrates that the model is competitive with a zero-shot topic model in predicting topic distributions for comparable multilingual data.

Dual-view Molecule Pre-training

This work proposes to leverage both types of molecule representations (SMILES strings and molecular graphs) and designs a new pre-training algorithm, dual-view molecule pre-training (DMP for short), that effectively combines the strengths of both.

MolCLR: Molecular Contrastive Learning of Representations via Graph Neural Networks

This work presents MolCLR: Molecular Contrastive Learning of Representations via Graph Neural Networks (GNNs), a self-supervised learning framework for large unlabeled molecule datasets and proposes three novel molecule graph augmentations: atom masking, bond deletion, and subgraph removal.
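
The three augmentations named above can be illustrated on a toy graph stored as an atom list plus a bond list. The graph encoding, mask token, and ratios here are illustrative choices, not MolCLR's actual implementation.

```python
import random

MASK = "*"  # placeholder token for a masked atom (illustrative)

def atom_masking(atoms, ratio, rng):
    """Replace a random fraction of atom labels with a mask token."""
    n = max(1, int(len(atoms) * ratio))
    masked = set(rng.sample(range(len(atoms)), n))
    return [MASK if i in masked else a for i, a in enumerate(atoms)], masked

def bond_deletion(bonds, ratio, rng):
    """Drop a random fraction of bonds (edges)."""
    n = max(1, int(len(bonds) * ratio))
    dropped = set(rng.sample(range(len(bonds)), n))
    return [b for i, b in enumerate(bonds) if i not in dropped]

def subgraph_removal(atoms, bonds, start, size):
    """Remove a connected subgraph grown by BFS from a start atom."""
    frontier, removed = [start], {start}
    while frontier and len(removed) < size:
        cur = frontier.pop(0)
        for u, v in bonds:
            nxt = v if u == cur else u if v == cur else None
            if nxt is not None and nxt not in removed and len(removed) < size:
                removed.add(nxt)
                frontier.append(nxt)
    kept_atoms = [a for i, a in enumerate(atoms) if i not in removed]
    kept_bonds = [(u, v) for u, v in bonds if u not in removed and v not in removed]
    return kept_atoms, kept_bonds

# Ethanol-like toy graph: C-C-O
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
```

In a contrastive setup, two different augmented views of the same molecule form a positive pair, while views of other molecules in the batch serve as negatives.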

MolGPT: Molecular Generation Using a Transformer-Decoder Model

The model, MolGPT, performs on par with other previously proposed modern machine learning frameworks for molecular generation in terms of generating valid, unique, and novel molecules and it is demonstrated that the model can be trained conditionally to control multiple properties of the generated molecules.

Self-Supervised Graph Transformer on Large-Scale Molecular Data

GROVER (Graph Representation frOm self-supervised mEssage passing tRansformer) is a novel framework that can be trained efficiently on large-scale molecular datasets without requiring any supervision, sidestepping the scarcity of labeled molecular data.

ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction

This work makes one of the first attempts to systematically evaluate transformers on molecular property prediction tasks via the ChemBERTa model, and suggests that transformers offer a promising avenue of future work for molecular representation learning and property prediction.

FragNet, a Contrastive Learning-Based Transformer Model for Clustering, Interpreting, Visualizing, and Navigating Chemical Space

Transformers, contrastive learning, and an embedded autoencoder are brought together to create a disentangled representation of molecular latent space that uses the entire training set in its construction while allowing "similar" molecules to cluster together in an effective and interpretable way.

MoleculeNet: A Benchmark for Molecular Machine Learning

MoleculeNet benchmarks demonstrate that learnable representations are powerful tools for molecular machine learning and broadly offer the best performance; however, this result comes with caveats.

VAE-Sim: A Novel Molecular Similarity Measure Based on a Variational Autoencoder

Distances between VAE latent vectors provide a novel metric for molecular similarity that is easy and rapid to calculate.
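
The metric itself is simple: once molecules are encoded into latent vectors, similarity is just a distance in latent space. The sketch below uses fixed hypothetical latent codes in place of a trained encoder, since the point is the metric rather than the model.

```python
import math

def latent_distance(z1, z2):
    """Euclidean distance between two latent vectors; smaller = more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z1, z2)))

# Hypothetical latent codes for three molecules (values are made up).
ethanol  = [0.10, 0.95, -0.30]
methanol = [0.12, 0.90, -0.28]
benzene  = [-0.80, 0.10, 0.60]
```

Under a well-trained encoder, structurally related molecules (ethanol and methanol) should sit closer together in latent space than unrelated ones (ethanol and benzene).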

SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery

Inspired by Transformer and pre-trained language models from natural language processing, SMILES Transformer learns molecular fingerprints through unsupervised pre-training of a sequence-to-sequence language model on a huge corpus of SMILES, a text representation system for molecules.
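
Before any sequence model can pre-train on SMILES, the strings must be split into chemically meaningful tokens: multi-character atoms like "Cl" and bracket atoms like "[NH4+]" must stay whole rather than being split character by character. A regex tokenizer along these lines is a common convention in SMILES language models; the pattern below is a simplified sketch, not the paper's own tokenizer.

```python
import re

# Order matters: bracket atoms and two-letter atoms must match before
# single-letter atoms; %NN handles ring-bond numbers above 9.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFIbcnosp]|[=#\-+\\/:~@?>*$().\d]"
)

def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES string into tokens; the tokens round-trip to the input."""
    tokens = SMILES_TOKEN.findall(smiles)
    assert "".join(tokens) == smiles, "untokenizable character in input"
    return tokens
```

For example, acetyl chloride `CC(=O)Cl` tokenizes into seven tokens with "Cl" kept intact, which is the granularity a sequence-to-sequence model would consume.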

Translating the Molecules: Adapting Neural Machine Translation to Predict IUPAC Names from a Chemical Identifier

The model uses two stacks of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in state-of-the-art machine translation, and performs particularly well on organics, with the exception of macrocycles.