MolBERT
This repository contains the implementation of the MolBERT, a state-of-the-art representation learning method based on the modern language model BERT.
The details are described in "Molecular representation learning with language models and domain-relevant auxiliary tasks", presented at the Machine Learning for Molecules Workshop @ NeurIPS 2020.
Work done by Benedek Fabian, Thomas Edlich, Héléna Gaspar, Marwin Segler, Joshua Meyers, Marco Fiscato, Mohamed Ahmed
Installation
Create your conda environment first:
conda create -y -q -n molbert -c rdkit rdkit=2019.03.1.0 python=3.7.3
Then install the package by running the following commands from the cloned directory:
conda activate molbert
pip install -e .
Run tests
To verify your installation, execute the tests:
python -m pytest . -p no:warnings
Load pretrained model
You can download the pretrained model here
After downloading the weights, you can follow scripts/featurize.py
to load the model and use it as a featurizer (you just need to replace the path in the script).
Train model from scratch:
You can use the guacamol dataset (links at the bottom)
python molbert/apps/smiles.py \
--train_file data/guacamol_baselines/guacamol_v1_train.smiles \
--valid_file data/guacamol_baselines/guacamol_v1_valid.smiles \
--max_seq_length 128 \
--batch_size 16 \
--masked_lm 1 \
--num_physchem_properties 200 \
--is_same_smiles 0 \
--permute 1 \
--max_epochs 20 \
--num_workers 8 \
--val_check_interval 1
Add the --tiny
flag to train a smaller model on a CPU, or the --fast_dev_run
flag for testing purposes. For full list of options see molbert/apps/args.py
and molbert/apps/smiles.py
.
Finetune
After you have trained a model, and you would like to finetune on a certain training set, you can use the FinetuneSmilesMolbertApp
class to further specialize your model to your task.
For classification you can set can set the mode to classification
and the output_size
to 2.
python molbert/apps/finetune.py \
--train_file path/to/train.csv \
--valid_file path/to/valid.csv \
--test_file path/to/test.csv \
--mode classification \
--output_size 2 \
--pretrained_model_path path/to/lightning_logs/version_0/checkpoints/last.ckpt \
--label_column my_label_column
For regression set the mode to regression
and the output_size
to 1.
python molbert/apps/finetune.py \
--train_file path/to/train.csv \
--valid_file path/to/valid.csv \
--test_file path/to/test.csv \
--mode regression \
--output_size 1 \
--pretrained_model_path path/to/lightning_logs/version_0/checkpoints/last.ckpt \
--label_column pIC50
To reproduce the finetuning experiments we direct you to use scripts/run_qsar_test_molbert.py
and scripts/run_finetuning.py
.
Both scripts rely on the Chembench and optionally the CDDD repositories.
Please follow the installation instructions described in their READMEs.
Data
Guacamol datasets
You can download pre-built datasets here:
md5 05ad85d871958a05c02ab51a4fde8530
training
md5 e53db4bff7dc4784123ae6df72e3b1f0
validation
md5 677b757ccec4809febd83850b43e1616
test
md5 7d45bc95c33c10cb96ef5e78c38ac0b6
all