  • Language
    Python
  • License
    Apache License 2.0
HuggingMolecules

We envision models that are pre-trained on a vast range of domain-relevant tasks to become key for molecule property prediction. This repository aims to give easy access to state-of-the-art pre-trained models.

Quick tour

To quickly fine-tune a model on a dataset with the pytorch lightning package, follow the example below, which uses the MAT model and the freesolv dataset:

from huggingmolecules import MatModel, MatFeaturizer

# The following import works only from the source code directory:
from experiments.src import TrainingModule, get_data_loaders

from torch.nn import MSELoss
from torch.optim import Adam

from pytorch_lightning import Trainer
from pytorch_lightning.metrics import MeanSquaredError

# Build and load the pre-trained model and the appropriate featurizer:
model = MatModel.from_pretrained('mat_masking_20M')
featurizer = MatFeaturizer.from_pretrained('mat_masking_20M')

# Build the pytorch lightning training module:
pl_module = TrainingModule(model,
                           loss_fn=MSELoss(),
                           metric_cls=MeanSquaredError,
                           optimizer=Adam(model.parameters()))

# Build the data loader for the freesolv dataset:
train_dataloader, _, _ = get_data_loaders(featurizer,
                                          batch_size=32,
                                          task_name='ADME',
                                          dataset_name='hydrationfreeenergy_freesolv')

# Build the pytorch lightning trainer and fine-tune the module on the train dataset:
trainer = Trainer(max_epochs=100)
trainer.fit(pl_module, train_dataloader=train_dataloader)

# Make the prediction for the batch of SMILES strings:
batch = featurizer(['C/C=C/C', '[C]=O'])
output = pl_module.model(batch)

Installation

Create your conda environment and install the rdkit package:

conda create -n huggingmolecules python=3.8.5
conda activate huggingmolecules
conda install -c conda-forge rdkit==2020.09.1

Then install huggingmolecules from the cloned directory:

conda activate huggingmolecules
pip install -e ./src

Huggingmolecules caches weights and configs of the models. To avoid issues with incompatibility of different package versions, it is recommended to clean up the cache directory after every package update:

python -m src.clean_cache --all

Project Structure

The project consists of two main modules, src/ and experiments/:

  • The src/ module contains abstract interfaces for pre-trained models along with their implementations based on the pytorch library. This module makes it easy to configure, download and run existing models out of the box.
  • The experiments/ module uses the abstract interfaces defined in the src/ module and implements scripts, based on the pytorch lightning package, for running various experiments. This module makes training, benchmarking and hyper-parameter tuning of models straightforward and easily extensible.

Supported model architectures

Huggingmolecules currently provides the following model architectures:

  • MAT
  • GROVER
  • R-MAT (weights were obtained by joint efforts with Nvidia)

For ease of benchmarking, we also include wrappers in the experiments/ module for three other model architectures:

  • Chemprop (D-MPNN)
  • ChemBERTa
  • MolBERT

The src/ module

The implementations of the models in the src/ module are divided into three modules: configuration, featurization and models. The relation between these modules is shown in the following examples based on the MAT model:

Configuration examples

from huggingmolecules import MatConfig

# Build the config with default parameters values, 
# except 'd_model' parameter, which is set to 1200:
config = MatConfig(d_model=1200)

# Build the pre-defined config:
config = MatConfig.from_pretrained('mat_masking_20M')

# Build the pre-defined config with 'init_type' parameter set to 'normal':
config = MatConfig.from_pretrained('mat_masking_20M', init_type='normal')

# Save the pre-defined config with the previous modification:
config.save_to_cache('mat_masking_20M_normal.json')

# Restore the previously saved config:
config = MatConfig.from_pretrained('mat_masking_20M_normal.json')

Featurization examples

from huggingmolecules import MatConfig, MatFeaturizer

# Build the featurizer with pre-defined config:
config = MatConfig.from_pretrained('mat_masking_20M')
featurizer = MatFeaturizer(config)

# Build the featurizer in one line:
featurizer = MatFeaturizer.from_pretrained('mat_masking_20M')

# Encode (featurize) the batch of two SMILES strings: 
batch = featurizer(['C/C=C/C', '[C]=O'])

Models examples

from huggingmolecules import MatConfig, MatFeaturizer, MatModel

# Build the model with the pre-defined config:
config = MatConfig.from_pretrained('mat_masking_20M')
model = MatModel(config)

# Load the pre-trained weights 
# (which do not include the last layer of the model)
model.load_weights('mat_masking_20M')

# Build the model and load the pre-trained weights in one line:
model = MatModel.from_pretrained('mat_masking_20M')

# Encode (featurize) the batch of two SMILES strings: 
featurizer = MatFeaturizer.from_pretrained('mat_masking_20M')
batch = featurizer(['C/C=C/C', '[C]=O'])

# Feed the model with the encoded batch:
output = model(batch)

# Save the weights of the model (usually after the fine-tuning process):
model.save_weights('tuned_mat_masking_20M.pt')

# Load the previously saved weights
# (which now includes all layers of the model):
model.load_weights('tuned_mat_masking_20M.pt')

# Load the previously saved weights, but without 
# the last layer of the model ('generator' in the case of the 'MatModel')
model.load_weights('tuned_mat_masking_20M.pt', excluded=['generator'])

# Build the model and load the previously saved weights:
config = MatConfig.from_pretrained('mat_masking_20M')
model = MatModel.from_pretrained('tuned_mat_masking_20M.pt',
                                 excluded=['generator'],
                                 config=config)
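
The excluded argument used above can be pictured as a filter over the saved parameter names. The snippet below is a minimal sketch of that idea on plain dictionaries, not the library's implementation: the helper filter_excluded, the parameter names and the values are all hypothetical, and matching on the top-level module name is an assumption.

```python
from typing import Dict, Iterable, List


def filter_excluded(state: Dict[str, List[float]],
                    excluded: Iterable[str]) -> Dict[str, List[float]]:
    """Drop every entry whose top-level module name is listed in `excluded`."""
    excluded = set(excluded)
    return {name: value for name, value in state.items()
            if name.split('.', 1)[0] not in excluded}


# Hypothetical saved weights: parameter name -> values (stand-ins for tensors).
saved = {
    'encoder.layer0.weight': [0.1, 0.2],
    'encoder.layer0.bias': [0.0],
    'generator.proj.weight': [0.3],
    'generator.proj.bias': [0.4],
}

# Keep everything except the last layer ('generator' in this toy example).
kept = filter_excluded(saved, excluded=['generator'])
print(sorted(kept))
```

With pytorch, a dictionary filtered this way could be passed to load_state_dict(..., strict=False), so that the excluded layer keeps its fresh initialization while all other layers receive the saved weights.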

Running tests

To run the base tests for the src/ module, type:

pytest src/ --ignore=src/tests/downloading/

To additionally run the tests for the downloading module (which download all models to your local machine and may therefore be slow), type:

pytest src/tests/downloading

The experiments/ module

Requirements

In addition to the dependencies of the src/ module, the experiments/ module requires a few others. To install them, run:

pip install -r experiments/requirements.txt

The experiments/ module relies in particular on the pytorch lightning (training), optuna (hyper-parameter tuning) and gin-config (configuration) packages.

Neptune.ai

In addition, we recommend installing the neptune.ai package:

  1. Sign up to neptune.ai at https://neptune.ai/.

  2. Get your Neptune API token (see getting-started for help).

  3. Export your Neptune API token to the NEPTUNE_API_TOKEN environment variable.

  4. Install neptune-client: pip install neptune-client.

  5. Enable neptune.ai in the experiments/configs/setup.gin file.

  6. Update neptune.project_name parameters in experiments/configs/bases/*.gin files.

Running scripts:

We recommend running the experiment scripts from the source code directory. Three scripts are currently implemented:

  • experiments/scripts/train.py - for training with the pytorch lightning package
  • experiments/scripts/tune_hyper.py - for hyper-parameters tuning with the optuna package
  • experiments/scripts/benchmark.py - for benchmarking based on the hyper-parameters tuning (grid-search)

In general, scripts are run with the following syntax:

python -m experiments.scripts.<script_name> \
       -d <dataset_name> \
       -m <model_name> \
       -b <parameters_bindings>

The script <script_name>.py then runs with the function and method parameter values defined in the following gin-config files:

  1. experiments/configs/bases/<script_name>.gin
  2. experiments/configs/datasets/<dataset_name>.gin
  3. experiments/configs/models/<model_name>.gin

If the binding flag -b is used, the bindings given in <parameters_bindings> override the corresponding bindings defined in the above gin-config files.

So for instance, to fine-tune the MAT model (pre-trained on masking_20M task) on the freesolv dataset using GPU 1, simply run:

python -m experiments.scripts.train \
       -d freesolv \
       -m mat \
       -b model.pretrained_name=\"mat_masking_20M\"#train.gpus=[1]

or equivalently:

python -m experiments.scripts.train \
       -d freesolv \
       -m mat \
       --model.pretrained_name mat_masking_20M \
       --train.gpus [1]
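
The override behaviour of the -b flag can be illustrated with plain dictionaries. This is a minimal sketch and not how gin-config resolves bindings internally: the helpers parse_bindings and resolve, the file contents and the train.num_epochs key are hypothetical, and the precedence among the three gin-config files is an assumption (the text above only states that -b bindings override the files).

```python
import ast
from typing import Any, Dict


def parse_bindings(bindings: str) -> Dict[str, Any]:
    """Split 'a.b=value#c.d=value' into a {name: value} dictionary."""
    parsed: Dict[str, Any] = {}
    for binding in bindings.split('#'):
        if not binding.strip():
            continue  # ignore empty segments
        name, _, raw = binding.partition('=')
        try:
            parsed[name.strip()] = ast.literal_eval(raw.strip())
        except (ValueError, SyntaxError):
            parsed[name.strip()] = raw.strip()  # keep bare words as strings
    return parsed


def resolve(*layers: Dict[str, Any]) -> Dict[str, Any]:
    """Merge the layers from left to right; later layers win."""
    merged: Dict[str, Any] = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Hypothetical contents of the three gin-config files (values are made up).
base = {'train.gpus': [0], 'train.num_epochs': 100}
dataset = {'data.dataset_name': 'freesolv'}
model = {'model.pretrained_name': None}

# Bindings as they would reach the script after shell unescaping.
cli = parse_bindings('model.pretrained_name="mat_masking_20M"#train.gpus=[1]')
config = resolve(base, dataset, model, cli)
print(config)
```

Values that are not overridden on the command line (here train.num_epochs) keep the value from the files.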

Local dataset

To use a local dataset, create an appropriate gin-config file in the experiments/configs/datasets directory and specify the data.data_path parameter within. For details see the get_data_split implementation.

Benchmarking

For the moment there is one benchmark available. It works as follows:

  • experiments/scripts/benchmark.py: on the given dataset, we fine-tune the given model with 10 learning rates on 6 seeded data splits (60 fine-tunings in total). We then choose the learning rate that minimizes the validation metric (e.g. RMSE computed on the validation dataset) averaged over the 6 data splits. The result is the test metric for the chosen learning rate, averaged over the data splits.
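
The selection rule described above can be sketched in a few lines. This is an illustration only, not the code of benchmark.py: the function benchmark_result and all numbers are made up, the example uses 3 learning rates and 2 splits instead of 10 and 6, and it assumes a metric where lower is better.

```python
from statistics import mean
from typing import Dict, List, Tuple

# For each learning rate: one (validation metric, test metric) pair per split.
Results = Dict[float, List[Tuple[float, float]]]


def benchmark_result(results: Results) -> Tuple[float, float]:
    """Return (chosen learning rate, mean test metric at that rate)."""
    # Select on the validation metric only, averaged over the data splits.
    best_lr = min(results, key=lambda lr: mean(v for v, _ in results[lr]))
    # Report the test metric of the selected rate, averaged over the splits.
    return best_lr, mean(t for _, t in results[best_lr])


# Made-up numbers for illustration.
results = {
    1e-3: [(1.20, 1.30), (1.10, 1.25)],
    1e-4: [(0.90, 1.00), (0.95, 1.10)],
    1e-5: [(1.00, 0.95), (1.05, 0.90)],
}
lr, score = benchmark_result(results)
print(lr, round(score, 3))
```

In this toy data the learning rate 1e-5 has the lowest test metric, but 1e-4 is chosen because the selection never looks at the test split.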

Running a benchmark is essentially the same as running any other script from the experiments/ module. So for instance to benchmark the vanilla MAT model (without pre-training) on the Caco-2 dataset using GPU 0, simply run:

python -m experiments.scripts.benchmark \
       -d caco2 \
       -m mat \
       --model.pretrained_name None \
       --train.gpus [0]

However, the above script only performs the 60 fine-tunings. It does not compute the final benchmark result. To do that, we need to run:

python -m experiments.scripts.benchmark --results_only \
       -d caco2 \
       -m mat

The above script does not perform any fine-tuning; it only computes the benchmark result. If neptune is enabled in experiments/configs/setup.gin, all data needed to compute the result is fetched from the neptune server.

Benchmark results

We ran the benchmark described in the Benchmarking section (experiments/scripts/benchmark.py) for various model architectures and pre-training tasks.

Summary

We report the mean rank of each tested model across all datasets (both regression and classification), together with the standard deviation of the rank. For detailed results, see the Regression and Classification sections.

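| model        | mean rank | rank std |
|--------------|-----------|----------|
| MAT 200k     | 5.6       | 3.5      |
| MAT 2M       | 5.3       | 3.4      |
| MAT 20M      | 4.1       | 2.2      |
| GROVER Base  | 3.8       | 2.7      |
| GROVER Large | 3.6       | 2.4      |
| ChemBERTa    | 7.4       | 2.8      |
| MolBERT      | 5.9       | 2.9      |
| D-MPNN       | 6.3       | 2.3      |
| D-MPNN 2d    | 6.4       | 2.0      |
| D-MPNN mc    | 5.3       | 2.1      |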

Regression

We used MAE as the metric for QM7 and RMSE for the remaining datasets (lower is better).

| model           | FreeSolv      | Caco-2        | Clearance     | QM7              | Mean rank |
|-----------------|---------------|---------------|---------------|------------------|-----------|
| MAT 200k        | 0.913 ± 0.196 | 0.405 ± 0.030 | 0.649 ± 0.341 | 87.578 ± 15.375  | 5.25      |
| MAT 2M          | 0.898 ± 0.165 | 0.471 ± 0.070 | 0.655 ± 0.327 | 81.557 ± 5.088   | 6.75      |
| MAT 20M         | 0.854 ± 0.197 | 0.432 ± 0.034 | 0.640 ± 0.335 | 81.797 ± 4.176   | 5.0       |
| Grover Base     | 0.917 ± 0.195 | 0.419 ± 0.029 | 0.629 ± 0.335 | 62.266 ± 3.578   | 3.25      |
| Grover Large    | 0.950 ± 0.202 | 0.414 ± 0.041 | 0.627 ± 0.340 | 64.941 ± 3.616   | 2.5       |
| ChemBERTa       | 1.218 ± 0.245 | 0.430 ± 0.013 | 0.647 ± 0.314 | 177.242 ± 1.819  | 8.0       |
| MolBERT         | 1.027 ± 0.244 | 0.483 ± 0.056 | 0.633 ± 0.332 | 177.117 ± 1.799  | 8.0       |
| Chemprop        | 1.061 ± 0.168 | 0.446 ± 0.064 | 0.628 ± 0.339 | 74.831 ± 4.792   | 5.5       |
| Chemprop 2d [1] | 1.038 ± 0.235 | 0.454 ± 0.049 | 0.628 ± 0.336 | 77.912 ± 10.231  | 6.0       |
| Chemprop mc [2] | 0.995 ± 0.136 | 0.438 ± 0.053 | 0.627 ± 0.337 | 75.575 ± 4.683   | 4.25      |

[1] chemprop with additional rdkit_2d_normalized features generator
[2] chemprop with additional morgan_count features generator
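
The Mean rank column of the regression table can be recomputed from the reported mean scores. The sketch below assumes that models are ranked per dataset in ascending order of the metric and that tied scores share the best rank (competition ranking); the tie rule is an assumption, chosen because it reproduces the reported column, and the helper names are hypothetical.

```python
from typing import Dict, List

# Mean scores copied from the regression table (lower is better).
# Columns: FreeSolv, Caco-2, Clearance, QM7.
scores: Dict[str, List[float]] = {
    'MAT 200k':     [0.913, 0.405, 0.649, 87.578],
    'MAT 2M':       [0.898, 0.471, 0.655, 81.557],
    'MAT 20M':      [0.854, 0.432, 0.640, 81.797],
    'Grover Base':  [0.917, 0.419, 0.629, 62.266],
    'Grover Large': [0.950, 0.414, 0.627, 64.941],
    'ChemBERTa':    [1.218, 0.430, 0.647, 177.242],
    'MolBERT':      [1.027, 0.483, 0.633, 177.117],
    'Chemprop':     [1.061, 0.446, 0.628, 74.831],
    'Chemprop 2d':  [1.038, 0.454, 0.628, 77.912],
    'Chemprop mc':  [0.995, 0.438, 0.627, 75.575],
}


def competition_ranks(values: List[float]) -> List[int]:
    """Rank ascending; tied values share the best rank (1, 1, 3, ...)."""
    return [1 + sum(other < value for other in values) for value in values]


def mean_ranks(table: Dict[str, List[float]]) -> Dict[str, float]:
    """Average each model's per-dataset rank over all datasets."""
    models = list(table)
    n_datasets = len(next(iter(table.values())))
    totals = dict.fromkeys(models, 0)
    for column in range(n_datasets):
        ranks = competition_ranks([table[m][column] for m in models])
        for model, rank in zip(models, ranks):
            totals[model] += rank
    return {m: totals[m] / n_datasets for m in models}


ranks = mean_ranks(scores)
for model, rank in ranks.items():
    print(f'{model:<13} {rank}')
```

For example, Grover Large ranks 5, 2, 1 and 2 on the four datasets (sharing first place on Clearance with Chemprop mc), which gives the mean rank 2.5 shown in the table.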

Classification

We used ROC AUC as the metric (higher is better).

| model        | HIA           | Bioavailability | PPBR          | Tox21 (NR-AR) | BBBP          | Mean rank |
|--------------|---------------|-----------------|---------------|---------------|---------------|-----------|
| MAT 200k     | 0.943 ± 0.015 | 0.660 ± 0.052   | 0.896 ± 0.027 | 0.775 ± 0.035 | 0.709 ± 0.022 | 5.8       |
| MAT 2M       | 0.941 ± 0.013 | 0.712 ± 0.076   | 0.905 ± 0.019 | 0.779 ± 0.056 | 0.713 ± 0.022 | 4.2       |
| MAT 20M      | 0.935 ± 0.017 | 0.732 ± 0.082   | 0.891 ± 0.019 | 0.779 ± 0.056 | 0.735 ± 0.006 | 3.4       |
| Grover Base  | 0.931 ± 0.021 | 0.750 ± 0.037   | 0.901 ± 0.036 | 0.750 ± 0.085 | 0.735 ± 0.006 | 4.0       |
| Grover Large | 0.932 ± 0.023 | 0.747 ± 0.062   | 0.901 ± 0.033 | 0.757 ± 0.057 | 0.757 ± 0.057 | 4.2       |
| ChemBERTa    | 0.923 ± 0.032 | 0.666 ± 0.041   | 0.869 ± 0.032 | 0.779 ± 0.044 | 0.717 ± 0.009 | 7.0       |
| MolBERT      | 0.942 ± 0.011 | 0.737 ± 0.085   | 0.889 ± 0.039 | 0.761 ± 0.058 | 0.742 ± 0.020 | 4.6       |
| Chemprop     | 0.924 ± 0.069 | 0.724 ± 0.064   | 0.847 ± 0.052 | 0.766 ± 0.040 | 0.726 ± 0.008 | 7.0       |
| Chemprop 2d  | 0.923 ± 0.015 | 0.712 ± 0.067   | 0.874 ± 0.030 | 0.775 ± 0.041 | 0.724 ± 0.006 | 6.8       |
| Chemprop mc  | 0.924 ± 0.082 | 0.740 ± 0.060   | 0.869 ± 0.033 | 0.772 ± 0.041 | 0.722 ± 0.008 | 6.2       |
