  • Stars: 153
  • Rank: 243,368 (Top 5 %)
  • Language: HTML
  • License: MIT License
  • Created over 4 years ago
  • Updated over 3 years ago

Repository Details

Reaction fingerprints, atlases and classification. Code complementing our Nature Machine Intelligence publication on "Mapping the space of chemical reactions using attention-based neural networks" (http://rdcu.be/cenmd).

RXNFP - chemical reaction fingerprints

This library generates chemical reaction fingerprints from reaction SMILES.
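For readers unfamiliar with the input format: a reaction SMILES has the form reactants>agents>products, with the molecules inside each part separated by dots. The sketch below only illustrates that layout. The helper name split_reaction_smiles is hypothetical and not part of rxnfp, and it assumes plain reaction SMILES without extensions such as fragment grouping.

```python
def split_reaction_smiles(rxn_smiles):
    """Split a reaction SMILES into reactant, agent and product molecule lists.

    Hypothetical helper for illustration only; it is not part of rxnfp.
    """
    parts = rxn_smiles.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected 'reactants>agents>products', got {rxn_smiles!r}")
    # Molecules inside each part are separated by '.'; an empty part gives an empty list.
    return tuple([mol for mol in part.split(".") if mol] for part in parts)


# Esterification of acetic acid with ethanol, written without agents.
example = "CC(=O)O.OCC>>CC(=O)OCC.O"
reactants, agents, products = split_reaction_smiles(example)
print(reactants)
print(agents)
print(products)
```

Here the middle part is empty, so the agents list comes back empty, which is also the case for the example reaction used in the "How to use" section.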

Install

For all installations, we recommend using conda to get the necessary rdkit and tmap dependencies:

From pypi

conda create -n rxnfp python=3.6 -y
conda activate rxnfp
conda install -c rdkit rdkit=2020.03.3 -y
conda install -c tmap tmap -y
pip install rxnfp

From github

conda create -n rxnfp python=3.6 -y
conda activate rxnfp
conda install -c rdkit rdkit=2020.03.3 -y
conda install -c tmap tmap -y
git clone git@github.com:rxn4chemistry/rxnfp.git
cd rxnfp
pip install -e .

How to use

Compute a fingerprint from a reaction SMILES

from rxnfp.transformer_fingerprints import (
    RXNBERTFingerprintGenerator, get_default_model_and_tokenizer, generate_fingerprints
)

model, tokenizer = get_default_model_and_tokenizer()

rxnfp_generator = RXNBERTFingerprintGenerator(model, tokenizer)

example_rxn = "Nc1cccc2cnccc12.O=C(O)c1cc([N+](=O)[O-])c(Sc2c(Cl)cncc2Cl)s1>>O=C(Nc1cccc2cnccc12)c1cc([N+](=O)[O-])c(Sc2c(Cl)cncc2Cl)s1"

fp = rxnfp_generator.convert(example_rxn)
print(len(fp))  # 256
print(fp[:5])   # [-2.0174953937530518, 1.7602033615112305, -1.3323537111282349, -1.1095019578933716, 1.2254549264907837]

Or for a list of reactions:

rxns = [example_rxn, example_rxn]
fps = rxnfp_generator.convert_batch(rxns)
print(len(fps), len(fps[0]))  # 2 256
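Once fingerprints are available as lists of floats, as in the output above, they can be compared like any other dense vectors. The following is a minimal sketch using cosine similarity, which is one common choice and not necessarily the metric used in the publication. The short vectors are made-up stand-ins for real 256-dimensional fingerprints, and cosine_similarity is a hypothetical helper, not an rxnfp function.

```python
import math


def cosine_similarity(fp_a, fp_b):
    """Cosine similarity between two equal-length fingerprint vectors."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    dot = sum(a * b for a, b in zip(fp_a, fp_b))
    norm_a = math.sqrt(sum(a * a for a in fp_a))
    norm_b = math.sqrt(sum(b * b for b in fp_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 4-dimensional stand-ins; real fingerprints would come from
# rxnfp_generator.convert(...) and have 256 entries.
fp_1 = [1.0, 2.0, -1.0, 0.5]
fp_2 = [1.0, 2.0, -1.0, 0.5]   # identical to fp_1
fp_3 = [-2.0, 1.0, 0.0, 0.0]   # orthogonal to fp_1

print(cosine_similarity(fp_1, fp_2))
print(cosine_similarity(fp_1, fp_3))
```

Identical fingerprints, such as the two copies of example_rxn in the batch above, score at the maximum of the scale, while orthogonal vectors score zero.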

Reaction Atlas

Pistachio

The fingerprints can be used to map the space of chemical reactions:

Figure: Annotated Atlas of the Pistachio test set generated with TMAP.

Schneider 50k set - tutorial

In the notebooks, we show how to generate an interactive reaction atlas for the Schneider 50k set. The end result is similar to this interactive Reaction Atlas.

In the atlas, different reaction properties are highlighted in separate layers:

Figure: Reaction atlas of 50k data set with different properties highlighted.

USPTO 1k TPL (reaction classification data set)

We introduce a new data set for chemical reaction classification called USPTO 1k TPL. USPTO 1k TPL is derived from the USPTO database by Lowe. It consists of 445k reactions divided into 1000 template labels. The data set was randomly split into a combined training/validation set (90%) and a test set (10%). The labels were obtained by atom-mapping the USPTO data set with RXNMapper, then applying the template extraction workflow by Thakkar et al., and finally selecting the reactions belonging to the 1000 most frequent template hashes. Those template hashes were taken as class labels. Like the Pistachio data set, USPTO 1k TPL is strongly imbalanced.
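The last two steps of that description, keeping only the most frequent template hashes and splitting randomly 90/10, can be sketched on toy data as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names, the toy reaction identifiers, the hash strings and the fixed seed are all hypothetical, and atom-mapping and template extraction are assumed to have happened already.

```python
import random
from collections import Counter


def select_top_k_classes(reactions, k):
    """Keep (reaction, template_hash) pairs whose hash is among the k most frequent."""
    counts = Counter(template_hash for _, template_hash in reactions)
    top_hashes = {template_hash for template_hash, _ in counts.most_common(k)}
    return [(rxn, h) for rxn, h in reactions if h in top_hashes]


def random_split(items, test_fraction, seed=0):
    """Shuffle a copy of items and return (train_valid, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy, imbalanced data: placeholder reaction ids with placeholder template hashes.
toy = [(f"rxn_{i}", "hash_a") for i in range(12)]
toy += [(f"rxn_{i}", "hash_b") for i in range(12, 18)]
toy += [(f"rxn_{i}", "hash_c") for i in range(18, 20)]

kept = select_top_k_classes(toy, k=2)  # the real data set uses k=1000
train_valid, test = random_split(kept, test_fraction=0.1)
print(len(kept), len(train_valid), len(test))
```

Because the split is purely random rather than stratified, rare classes may be thinly represented in the test portion, which is worth keeping in mind for a strongly imbalanced data set.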

The data set can be downloaded from: MappingChemicalReactions.

Citation

Our work was first presented at the NeurIPS 2019 workshop on Machine Learning and the Physical Sciences. After multiple updates, it was published in 2021 in Nature Machine Intelligence (free access link).

@article{schwaller2021mapping,
  title={Mapping the space of chemical reactions using attention-based neural networks},
  author={Schwaller, Philippe and Probst, Daniel and Vaucher, Alain C and Nair, Vishnu H and Kreutter, David and Laino, Teodoro and Reymond, Jean-Louis},
  journal={Nature Machine Intelligence},
  volume={3},
  number={2},
  pages={144--152},
  year={2021},
  publisher={Nature Publishing Group}
}

RXNFP has been developed in a collaboration between IBM Research Europe and the Reymond group at the University of Bern. The classification models are used on the RXN for Chemistry platform.

Our publication is part of the Nature Portfolio "Synthesis and enabling technologies" collection and was featured in a News & Views on Transformers for future medicinal chemists.

Moreover, the rxnfp code was reused to train new models on different data as described in Reusability report: Learning the language of synthetic methods used in medicinal chemistry.

More Repositories

1. rxnmapper (Python, 279 stars)
   RXNMapper: Unsupervised attention-guided atom-mapping. Code complementing our Science Advances publication on "Extraction of organic chemistry grammar from unsupervised learning of chemical reactions" (https://advances.sciencemag.org/content/7/15/eabe4166).
2. rxn4chemistry (Python, 172 stars)
   Python wrapper for the IBM RXN for Chemistry API
3. rxn_yields (Jupyter Notebook, 97 stars)
   Code complementing our manuscript on the prediction of chemical reaction yields (https://iopscience.iop.org/article/10.1088/2632-2153/abc81d) and data augmentation strategies (https://doi.org/10.26434/chemrxiv.13286741).
4. biocatalysis-model (Python, 60 stars)
   RXN for biochemical reactions
5. paragraph2actions (Python, 36 stars)
   Extraction of action sequences from experimental procedures
6. rxnaamapper (Python, 29 stars)
   Reaction SMILES-AA mapping via language modelling
7. disconnection_aware_retrosynthesis (Python, 28 stars)
8. smiles2actions (Python, 25 stars)
   Action sequence prediction for arbitrary chemical equations
9. rxn-chemutils (Python, 20 stars)
   Chemistry-related Python utilities used in the RXN universe
10. rxn-ir-to-structure (Python, 13 stars)
    Predicting molecular structure from Infrared (IR) Spectra
11. nmr-to-structure (Python, 11 stars)
    Prediction molecular structure from NMR spectra
12. rxn-reaction-preprocessing (Python, 9 stars)
    Preprocessing of datasets of chemical reactions: standardization, filtering, augmentation, tokenization, etc.
13. rxn-utilities (Python, 7 stars)
    General Python utilities commonly used in the RXN universe
14. rxn-standardization (Python, 7 stars)
    Standardizing chemical compounds with language models
15. rxn_cluster_token_prompt (Python, 5 stars)
    Code to train high diversity retrosynthesis models with cluster token prompt
16. multimodal-spectroscopic-dataset (Python, 4 stars)
    Code for generation and benchmarks of the Multimodal Spectroscopic Dataset
17. sac-action-extraction (Python, 3 stars)
    Extraction of single-atom catalyst synthesis actions with transformers.
18. rxn-onmt-models (Python, 2 stars)
    Training of OpenNMT-based RXN models
19. rxn-models (2 stars)
    Open-source RXN models page
20. rxn-models-for-polymerization (1 star)
    RXN models for polymerization
21. rxn-metrics (Python, 1 star)
    Metrics for RXN models