  • Language: Python
  • License: MIT License
  • Created about 4 years ago
  • Updated over 3 years ago

MolBERT

This repository contains the implementation of MolBERT, a state-of-the-art representation learning method based on the modern language model BERT.

The details are described in "Molecular representation learning with language models and domain-relevant auxiliary tasks", presented at the Machine Learning for Molecules Workshop @ NeurIPS 2020.

Work done by Benedek Fabian, Thomas Edlich, Héléna Gaspar, Marwin Segler, Joshua Meyers, Marco Fiscato, and Mohamed Ahmed.

Installation

Create your conda environment first:

conda create -y -q -n molbert -c rdkit rdkit=2019.03.1.0 python=3.7.3

Then install the package by running the following commands from the cloned directory:

conda activate molbert
pip install -e . 

Run tests

To verify your installation, execute the tests:

python -m pytest . -p no:warnings

Load pretrained model

You can download the pretrained model here

After downloading the weights, you can follow scripts/featurize.py to load the model and use it as a featurizer (you just need to replace the path in the script).

Train model from scratch

You can use the Guacamol dataset (download links are in the Data section at the bottom).

python molbert/apps/smiles.py \
    --train_file data/guacamol_baselines/guacamol_v1_train.smiles \
    --valid_file data/guacamol_baselines/guacamol_v1_valid.smiles \
    --max_seq_length 128 \
    --batch_size 16 \
    --masked_lm 1 \
    --num_physchem_properties 200 \
    --is_same_smiles 0 \
    --permute 1 \
    --max_epochs 20 \
    --num_workers 8 \
    --val_check_interval 1

Add the --tiny flag to train a smaller model on a CPU, or the --fast_dev_run flag for testing purposes. For the full list of options, see molbert/apps/args.py and molbert/apps/smiles.py.

Finetune

Once you have a trained model, you can use the FinetuneSmilesMolbertApp class to finetune it on a task-specific training set.

For classification, set the mode to classification and the output_size to 2.

python molbert/apps/finetune.py \
    --train_file path/to/train.csv \
    --valid_file path/to/valid.csv \
    --test_file path/to/test.csv \
    --mode classification \
    --output_size 2 \
    --pretrained_model_path path/to/lightning_logs/version_0/checkpoints/last.ckpt \
    --label_column my_label_column

For regression, set the mode to regression and the output_size to 1.

python molbert/apps/finetune.py \
    --train_file path/to/train.csv \
    --valid_file path/to/valid.csv \
    --test_file path/to/test.csv \
    --mode regression \
    --output_size 1 \
    --pretrained_model_path path/to/lightning_logs/version_0/checkpoints/last.ckpt \
    --label_column pIC50
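
Both finetuning commands take separate train, validation, and test CSV files. If your data lives in a single CSV, the following is a minimal sketch of one way to produce the three files with a seeded random split, using only the Python standard library. The function name split_csv, the 80/10/10 fractions, and the output file names are illustrative assumptions and are not part of MolBERT; the sketch copies all columns unchanged, so the label column you pass to --label_column is preserved.

```python
import csv
import random


def split_csv(source_path, out_prefix, train_frac=0.8, valid_frac=0.1, seed=0):
    """Shuffle the rows of one CSV and write train/valid/test CSV files.

    Hypothetical helper, not part of MolBERT. Whatever rows remain after
    the train and validation fractions go to the test split.
    Returns a dict mapping split name to the number of rows written.
    """
    with open(source_path, newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        rows = list(reader)

    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(rows)

    n_train = int(len(rows) * train_frac)
    n_valid = int(len(rows) * valid_frac)
    splits = {
        "train": rows[:n_train],
        "valid": rows[n_train:n_train + n_valid],
        "test": rows[n_train + n_valid:],
    }

    counts = {}
    for name, subset in splits.items():
        # Writes e.g. <out_prefix>train.csv with the original header.
        with open(f"{out_prefix}{name}.csv", "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(subset)
        counts[name] = len(subset)
    return counts
```

For example, split_csv("path/to/all.csv", "path/to/") would write path/to/train.csv, path/to/valid.csv, and path/to/test.csv, matching the placeholder paths in the commands above.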

To reproduce the finetuning experiments, use scripts/run_qsar_test_molbert.py and scripts/run_finetuning.py. Both scripts rely on the Chembench repository and, optionally, the CDDD repository. Please follow the installation instructions in their READMEs.

Data

Guacamol datasets

You can download pre-built datasets here:

md5 05ad85d871958a05c02ab51a4fde8530 training
md5 e53db4bff7dc4784123ae6df72e3b1f0 validation
md5 677b757ccec4809febd83850b43e1616 test
md5 7d45bc95c33c10cb96ef5e78c38ac0b6 all
