Robust representation of semantically constrained graphs, in particular for molecules in chemistry

SELFIES


Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation
Mario Krenn, Florian Haese, AkshatKumar Nigam, Pascal Friederich, Alan Aspuru-Guzik
Machine Learning: Science and Technology 1, 045024 (2020), extensive blog post January 2021.
Talk on YouTube about SELFIES.
A community paper with 31 authors on SELFIES and the future of molecular string representations.
Blog post explaining SELFIES in Japanese.
Code paper, February 2023.
Major contributors of v1.0.n: Alston Lo and Seyone Chithrananda
Main developer of v2.0.0: Alston Lo
Chemistry Advisor: Robert Pollice


A main objective is to use SELFIES as direct input into machine learning models, in particular in generative models, for the generation of molecular graphs which are syntactically and semantically valid.

SELFIES validity in a VAE latent space

Installation

Use pip to install selfies.

pip install selfies

To check if the correct version of selfies is installed, use the following pip command.

pip show selfies
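
The same check can also be done from inside Python. The sketch below relies only on the standard-library importlib.metadata module; the helper name installed_version is hypothetical and is not part of selfies.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "selfies") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("selfies is not installed; run: pip install selfies")
else:
    print(f"selfies {version} is installed")
```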

If you are using an older version, upgrade to the latest release of selfies with the following pip command. Before upgrading, please see the CHANGELOG to review the changes between versions:

pip install selfies --upgrade

Usage

Overview

Please refer to the documentation, which contains a thorough tutorial for getting started with selfies and detailed descriptions of the functions that selfies provides. We summarize some key functions below.

Function                            Description
selfies.encoder                     Translates a SMILES string into its corresponding SELFIES string.
selfies.decoder                     Translates a SELFIES string into its corresponding SMILES string.
selfies.set_semantic_constraints    Configures the semantic constraints that selfies operates on.
selfies.len_selfies                 Returns the number of symbols in a SELFIES string.
selfies.split_selfies               Tokenizes a SELFIES string into its individual symbols.
selfies.get_alphabet_from_selfies   Constructs an alphabet from an iterable of SELFIES strings.
selfies.selfies_to_encoding         Converts a SELFIES string into its label and/or one-hot encoding.
selfies.encoding_to_selfies         Converts a label or one-hot encoding into a SELFIES string.

Examples

Translation between SELFIES and SMILES representations:

import selfies as sf

benzene = "c1ccccc1"

# SMILES -> SELFIES -> SMILES translation
try:
    benzene_sf = sf.encoder(benzene)  # [C][=C][C][=C][C][=C][Ring1][=Branch1]
    benzene_smi = sf.decoder(benzene_sf)  # C1=CC=CC=C1
except sf.EncoderError:
    pass  # sf.encoder error!
except sf.DecoderError:
    pass  # sf.decoder error!

len_benzene = sf.len_selfies(benzene_sf)  # 8

symbols_benzene = list(sf.split_selfies(benzene_sf))
# ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[=Branch1]']
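
To make the notion of a "symbol" concrete, here is a minimal standard-library sketch of the tokenization idea. It is not the library's implementation, and sf.split_selfies and sf.len_selfies remain the supported way to do this. The helper name split_symbols is hypothetical, and treating "." (the separator between disconnected fragments) as its own symbol is an assumption of this sketch.

```python
import re
from typing import List

# A symbol is a bracketed token such as "[C]" or "[=Branch1]".
# Assumption of this sketch: a "." between fragments counts as a symbol too.
_SYMBOL = re.compile(r"\[[^\[\]]*\]|\.")


def split_symbols(selfies: str) -> List[str]:
    """Split a SELFIES string into its bracketed symbols (illustrative only)."""
    return _SYMBOL.findall(selfies)


benzene_sf = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"
symbols = split_symbols(benzene_sf)
print(len(symbols))  # 8
print(symbols)
```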

Very simple creation of random valid molecules:

A key property of SELFIES is that valid random molecules can be created very simply. This example was inspired by a tweet by Rajarshi Guha:

import selfies as sf
import random

alphabet = sf.get_semantic_robust_alphabet()  # Gets the alphabet of robust symbols
rnd_selfies = "".join(random.sample(list(alphabet), 9))
rnd_smiles = sf.decoder(rnd_selfies)
print(rnd_smiles)

These few lines give unusual molecules, but all of them are valid. They can serve as a starting point for more advanced filtering techniques or for machine learning models.
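
Note that random.sample draws without replacement, so no symbol appears twice in a generated string. A possible variant that draws with replacement and accepts a seed for reproducibility is sketched below. The helper random_selfies and the small hand-picked alphabet are hypothetical. In practice the alphabet would come from sf.get_semantic_robust_alphabet() and the result would be passed to sf.decoder.

```python
import random
from typing import Iterable, Optional


def random_selfies(alphabet: Iterable[str], length: int, seed: Optional[int] = None) -> str:
    """Join `length` symbols drawn uniformly, with replacement, from `alphabet`."""
    rng = random.Random(seed)   # private generator, so global random state is untouched
    symbols = sorted(alphabet)  # sets have no stable order; sorting keeps seeded runs repeatable
    return "".join(rng.choices(symbols, k=length))


# Hypothetical toy alphabet, used here only so the sketch runs without selfies installed.
toy_alphabet = {"[C]", "[=C]", "[N]", "[O]", "[F]", "[Branch1]", "[Ring1]"}
rnd_selfies = random_selfies(toy_alphabet, length=9, seed=0)
print(rnd_selfies)
```

Sorting the alphabet before sampling matters because get_semantic_robust_alphabet returns a set, whose iteration order is not guaranteed to be the same between runs.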

Integer and one-hot encoding SELFIES:

In this example, we first build an alphabet from a dataset of SELFIES strings, and then convert a SELFIES string into its padded encoding. Note that we use the [nop] (no operation) symbol to pad our SELFIES, which is a special SELFIES symbol that is always ignored and skipped over by selfies.decoder, making it a useful padding character.

import selfies as sf

dataset = ["[C][O][C]", "[F][C][F]", "[O][=O]", "[C][C][O][C][C]"]
alphabet = sf.get_alphabet_from_selfies(dataset)
alphabet.add("[nop]")  # [nop] is a special padding symbol
alphabet = list(sorted(alphabet))  # ['[=O]', '[C]', '[F]', '[O]', '[nop]']

pad_to_len = max(sf.len_selfies(s) for s in dataset)  # 5
symbol_to_idx = {s: i for i, s in enumerate(alphabet)}

dimethyl_ether = dataset[0]  # [C][O][C]

label, one_hot = sf.selfies_to_encoding(
    selfies=dimethyl_ether,
    vocab_stoi=symbol_to_idx,
    pad_to_len=pad_to_len,
    enc_type="both"
)
# label = [1, 3, 1, 4, 4]
# one_hot = [[0, 1, 0, 0, 0], [0, 0, 0, 1, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [0, 0, 0, 0, 1]]
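
For readers who want to see what the label and one-hot encodings amount to, the following standard-library sketch reproduces the values above by hand. It is an illustration under stated assumptions, not the implementation behind sf.selfies_to_encoding: the helper names to_encoding and from_label are hypothetical, and padding is assumed to be appended on the right.

```python
import re
from typing import Dict, List, Tuple

PAD = "[nop]"  # padding symbol, as in the example above


def to_encoding(selfies: str, vocab_stoi: Dict[str, int],
                pad_to_len: int) -> Tuple[List[int], List[List[int]]]:
    """Return (label, one_hot) for a SELFIES string, right-padded with [nop]."""
    symbols = re.findall(r"\[[^\[\]]*\]", selfies)
    # Multiplying a list by a negative number gives [], so long inputs are left unpadded.
    symbols += [PAD] * (pad_to_len - len(symbols))
    label = [vocab_stoi[s] for s in symbols]
    width = len(vocab_stoi)
    one_hot = [[1 if i == idx else 0 for i in range(width)] for idx in label]
    return label, one_hot


def from_label(label: List[int], vocab_itos: Dict[int, str]) -> str:
    """Map a label encoding back to a (still padded) SELFIES string."""
    return "".join(vocab_itos[i] for i in label)


alphabet = ["[=O]", "[C]", "[F]", "[O]", "[nop]"]
symbol_to_idx = {s: i for i, s in enumerate(alphabet)}
idx_to_symbol = {i: s for s, i in symbol_to_idx.items()}

label, one_hot = to_encoding("[C][O][C]", symbol_to_idx, pad_to_len=5)
print(label)                             # [1, 3, 1, 4, 4]
print(from_label(label, idx_to_symbol))  # [C][O][C][nop][nop]
```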

Customizing SELFIES:

In this example, we relax the semantic constraints of selfies to allow for hypervalence. Caution: the rules governing hypervalence are much less well understood than the octet rule. Some hypervalent molecules are important, but in general it is not known which of them are stable and reasonable.

import selfies as sf

hypervalent_sf = sf.encoder('O=I(O)(O)(O)(O)O', strict=False)  # orthoperiodic acid
standard_derived_smi = sf.decoder(hypervalent_sf)
# OI (the default constraints for I allow only 1 bond)

sf.set_semantic_constraints("hypervalent")
relaxed_derived_smi = sf.decoder(hypervalent_sf)
# O=I(O)(O)(O)(O)O (the hypervalent constraints for I allow 7 bonds)

Explaining Translation:

You can get an "attribution" list that traces the connection between input and output tokens. For example, let's see which tokens in the SELFIES string [C][N][C][Branch1][C][P][C][C][Ring1][=Branch1] are responsible for each token of the output SMILES.

import selfies as sf

selfies = "[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]"
smiles, attr = sf.decoder(selfies, attribute=True)
print('SELFIES', selfies)
print('SMILES', smiles)
print('Attribution:')
for smiles_token in attr:
    print(smiles_token)

# output
SELFIES [C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]
SMILES C1NC(P)CC1
Attribution:
AttributionMap(index=0, token='C', attribution=[Attribution(index=0, token='[C]')])
AttributionMap(index=2, token='N', attribution=[Attribution(index=1, token='[N]')])
AttributionMap(index=3, token='C', attribution=[Attribution(index=2, token='[C]')])
AttributionMap(index=5, token='P', attribution=[Attribution(index=3, token='[Branch1]'), Attribution(index=5, token='[P]')])
AttributionMap(index=7, token='C', attribution=[Attribution(index=6, token='[C]')])
AttributionMap(index=8, token='C', attribution=[Attribution(index=7, token='[C]')])

attr is a list of AttributionMap objects, each containing an output token, its index, and the input tokens that led to it. For example, the P in the output SMILES (index 5) results from both the [Branch1] token at index 3 and the [P] token at index 5. Attribution works for both encoding and decoding. For finer control over tracking the translation (such as tracking rings), you can access attributions in the underlying molecular graph with get_attribution.

More Usages and Examples

Tests

selfies uses pytest with tox as its testing framework. All tests can be found in the tests/ directory. To run the test suite for SELFIES, install tox and run:

tox -- --trials=10000 --dataset_samples=10000

By default, selfies is tested against a random subset (of size dataset_samples=10000) of each of the following datasets:

  • 130K molecules from QM9
  • 250K molecules from ZINC
  • 50K molecules from a dataset of non-fullerene acceptors for organic solar cells
  • 160K+ molecules from various MoleculeNet datasets
  • 36M+ molecules from the eMolecules Database. Due to its large size, this dataset is not included in the repository. To run tests on it, please download the dataset into the tests/test_sets directory and run the tests/run_on_large_dataset.py script.

Version History

See CHANGELOG.

Credits

We thank Jacques Boitreaud, Andrew Brereton, Nessa Carson (supersciencegrl), Matthew Carbone (x94carbone), Vladimir Chupakhin (chupvl), Nathan Frey (ncfrey), Theophile Gaudin, HelloJocelynLu, Hyunmin Kim (hmkim), Minjie Li, Vincent Mallet, Alexander Minidis (DocMinus), Kohulan Rajan (Kohulan), Kevin Ryan (LeanAndMean), Benjamin Sanchez-Lengeling, Andrew White, Zhenpeng Yao and Adamo Young for their suggestions and bug reports, and Robert Pollice for chemistry advice.

License

Apache License 2.0
