• Stars
    star
    325
  • Rank 129,350 (Top 3 %)
  • Language
    Python
  • License
    Other
  • Created almost 4 years ago
  • Updated over 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

This is a Pytorch implementation of the paper: Self-Supervised Graph Transformer on Large-Scale Molecular Data

Self-Supervised Graph Transformer on Large-Scale Molecular Data

This is a Pytorch implementation of the paper: Self-Supervised Graph Transformer on Large-Scale Molecular Data.

Requirements

  • Python 3.6.8
  • For the other packages, please refer to the requirements.txt. To resolve PackageNotFoundError, please add the following channels before creating the environment.
   conda config --add channels pytorch
   conda config --add channels rdkit
   conda config --add channels conda-forge
   conda config --add channels rmg

You can just execute following command to create the conda environment.

conda create --name chem --file requirements.txt
  • We also provide the Dockerfile to build the environment, please refer to the Dockerfile for more details.

Pretained Model Download

We provide the pretrained models used in paper.

Usage

The whole framework supports pretraining, finetuning, prediction, fingerprint generation, and evaluation functions.

Pretraining

Pretrain GTransformer model given the unlabelled molecular data.

Data Preparation

We provide an input example of unlabelled molecular data at exampledata/pretrain/tryout.csv.

Semantic Motif Label Extraction

The semantic motif label is extracted by scripts/save_feature.py with feature generator fgtasklabel.

python scripts/save_features.py --data_path exampledata/pretrain/tryout.csv  \
                                --save_path exampledata/pretrain/tryout.npz   \
                                --features_generator fgtasklabel \
                                --restart

Contributing guide: you are welcomed to register your own feature generator to add more semantic motif for the graph-level prediction task. For more details, please refer to grover/data/task_labels.py.

Atom/Bond Contextual Property (Vocabulary)

The atom/bond Contextual Property (Vocabulary) is extracted by scripts/build_vocab.py.

python scripts/build_vocab.py --data_path exampledata/pretrain/tryout.csv  \
                             --vocab_save_folder exampledata/pretrain  \
                             --dataset_name tryout

The outputs of this script are vocabulary dicts of atoms and bonds, tryout_atom_vocab.pkl and tryout_bond_vocab.pkl, respectively. For more options for contextual property extraction, please refer to scripts/build_vocab.py.

Data Splitting

To accelerate the data loading and reduce the memory cost in the multi-gpu pretraining scenario, the unlabelled molecular data need to be spilt into several parts using scrpits/split_data.py.

Note: This step is required for single-gpu pretraining scenario as well.

python scripts/split_data.py --data_path exampledata/pretrain/tryout.csv  \
                             --features_path exampledata/pretrain/tryout.npz  \
                             --sample_per_file 100  \
                             --output_path exampledata/pretrain/tryout

It's better to set a larger sample_per_file for the large dataset.

The output dataset folder will look like this:

tryout
  |- feature # the semantic motif labels
  |- graph # the smiles
  |- summary.txt

Running Pretraining on Single GPU

Note: There are more hyper-parameters which can be tuned during pretraining. Please refer to add_pretrain_args inutil/parsing.py .

python main.py pretrain \
               --data_path exampledata/pretrain/tryout \
               --save_dir model/tryout \
               --atom_vocab_path exampledata/pretrain/tryout_atom_vocab.pkl \
               --bond_vocab_path exampledata/pretrain/tryout_bond_vocab.pkl \
               --batch_size 32 \
               --dropout 0.1 \
               --depth 5 \
               --num_attn_head 1 \
               --hidden_size 100 \
               --epochs 3 \
               --init_lr 0.0002 \
               --max_lr 0.0004 \
               --final_lr 0.0001 \
               --weight_decay 0.0000001 \
               --activation PReLU \
               --backbone gtrans \
               --embedding_output_type both

Running Pretraining on Multiple GPU

We have implemented distributed pretraining on multiple GPU using horovod. To start the distributed pretraining, please refer to this link. To enable the multi-GPU training of the pretraining model, --enable_multi_gpu flag should be proposed in the above command line.

Training & Finetuning

The finetune dataset is organized as a .csv file. This file should contain a column named as smiles.

(Optional) Molecular Feature Extraction

Given a labelled molecular dataset, it is possible to extract the additional molecular features in order to train & finetune the model from the existing pretrained model. The feature matrix is stored as .npz.

python scripts/save_features.py --data_path exampledata/finetune/bbbp.csv \
                                --save_path exampledata/finetune/bbbp.npz \
                                --features_generator rdkit_2d_normalized \
                                --restart 

Finetuning with Existing Data

Given the labelled dataset and the molecular features, we can use finetune function to finetunning the pretrained model.

Note: There are more hyper-parameters which can be tuned during finetuning. Please refer to add_finetune_args inutil/parsing.py .

python main.py finetune --data_path exampledata/finetune/bbbp.csv \
                        --features_path exampledata/finetune/bbbp.npz \
                        --save_dir model/finetune/bbbp/ \
                        --checkpoint_path model/tryout/model.ep3 \
                        --dataset_type classification \
                        --split_type scaffold_balanced \
                        --ensemble_size 1 \
                        --num_folds 3 \
                        --no_features_scaling \
                        --ffn_hidden_size 200 \
                        --batch_size 32 \
                        --epochs 10 \
                        --init_lr 0.00015

The final finetuned model is stored in model/bbbp and will be used in the subsequent prediction and evaluation tasks.

Prediction

Given the finetuned model, we can use it to make the prediction of the target molecules. The final prediction is made by the averaging the prediction of all sub models (num_folds * ensemble_size).

(Optional) Molecular Feature Extraction

Note: If the finetuned model uses the molecular feature as input, we need to generate the molecular feature for the target molecules as well.

python scripts/save_features.py --data_path exampledata/finetune/bbbp.csv \
                                --save_path exampledata/finetune/bbbp.npz \
                                --features_generator rdkit_2d_normalized \
                                --restart 

Prediction with Finetuned Model

python main.py predict --data_path exampledata/finetune/bbbp.csv \
               --features_path exampledata/finetune/bbbp.npz \
               --checkpoint_dir ./model \
               --no_features_scaling \
               --output data_pre.csv

Generating Fingerprints

The pretrained model can also be used to generate the molecular fingerprints.

Note: We provide three ways to generate the fingerprint.

  • atom: The mean pooling of atom embedding from node-view GTransformer and edge-view GTransformer.
  • bond: The mean pooling of bond embedding from node-view GTransformer and edge-view GTransformer.
  • both: The concatenation of atom and bond fingerprints. Moreover, the additional molecular features are appended to the output of GTransformer as well if provided.
python main.py fingerprint --data_path exampledata/finetune/bbbp.csv \
                           --features_path exampledata/finetune/bbbp.npz \
                           --checkpoint_path model/tryout/model.ep3 \
                           --fingerprint_source both \
                           --output model/fingerprint/fp.npz

The Results

  • The classification datasets.
Model BBBP SIDER ClinTox BACE Tox21 ToxCast
GROVERbase 0.936(0.008) 0.656(0.006) 0.925(0.013) 0.878(0.016) 0.819(0.020) 0.723(0.010)
GROVERlarge 0.940(0.019) 0.658(0.023) 0.944(0.021) 0.894(0.028) 0.831(0.025) 0.737(0.010)
  • The regression datasets.
Model FreeSolv ESOL Lipo QM7 QM8
GROVERbase 1.592(0.072) 0.888(0.116) 0.563(0.030) 72.5(5.9) 0.0172(0.002)
GROVERlarge 1.544(0.397) 0.831(0.120) 0.560(0.035) 72.6(3.8) 0.0125(0.002)

The Reproducibility Issue

Due to the non-deterministic behavior of the function index_select_nd( See link), it is hard to exactly reproduce the training process of finetuning. Therefore, we provide the finetuned model for eleven datasets to guarantee the reproducibility of the experiments.

We provide the eval function to reproduce the experiments. Suppose the finetuned model is placed in model/finetune/.

python main.py eval --data_path exampledata/finetune/bbbp.csv \
                    --features_path exampledata/finetune/bbbp.npz \
                    --checkpoint_dir model/finetune/bbbp \
                    --dataset_type classification \
                    --split_type scaffold_balanced \
                    --ensemble_size 1 \
                    --num_folds 3 \
                    --metric auc \
                    --no_features_scaling

Note: The defualt metric setting is rmse for regression tasks. For QM7 and QM8 datasets, you need to set metric as mae to reproduce the results. For classification tasks, you need to set metric as auc.

Known Issues

  • Comparing with the original implementation, we add the dense connection in MessagePassing layer in GTransformer . If you do not want to add the dense connection in MessagePasssing layer, please fix it at L256 of model/layers.py.

Roadmap

  • Implementation of GTransformer in DGL / PyG.
  • The improvement of self-supervised tasks, e.g. more semantic motifs.

Reference

@article{rong2020self,
  title={Self-Supervised Graph Transformer on Large-Scale Molecular Data},
  author={Rong, Yu and Bian, Yatao and Xu, Tingyang and Xie, Weiyang and Wei, Ying and Huang, Wenbing and Huang, Junzhou},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}

Disclaimer

This is not an officially supported Tencent product.

More Repositories

1

IP-Adapter

The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images with image prompt.
Jupyter Notebook
5,177
star
2

V-Express

V-Express aims to generate a talking head video under the control of a reference image, an audio, and a sequence of V-Kps images.
Python
2,182
star
3

persona-hub

Official repo for the paper "Scaling Synthetic Data Creation with 1,000,000,000 Personas"
Python
768
star
4

hifi3dface

Code and data for our paper "High-Fidelity 3D Digital Human Creation from RGB-D Selfies".
Python
758
star
5

hok_env

Honor of Kings AI Open Environment of Tencent
Python
616
star
6

pika

a lightweight speech processing toolkit based on Pytorch and (Py)Kaldi
Python
338
star
7

bddm

BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis
Python
217
star
8

FRA-RIR

Python
169
star
9

PCDMs

Implementation code:Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models
Jupyter Notebook
150
star
10

DrugOOD

OOD Dataset Curator and Benchmark for AI-aided Drug Discovery
Python
149
star
11

Frequency_Aug_VAE_MoESR

Latent-based SR using MoE and frequency augmented VAE decoder
Python
145
star
12

tleague_projpage

Jinja
135
star
13

3m-asr

3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition
Python
119
star
14

TLeague

Python
79
star
15

RLogist

RLogist = RL (reinforcement learning) + Pathologist
Python
65
star
16

CogKernel

Python
44
star
17

MDM

MDM
Python
43
star
18

UltraDualPathCompression

A Pytorch-based implementation of the compression and decompression module in "Ultra Dual-Path Compression For Joint Echo Cancellation And Noise Suppression".
Jupyter Notebook
36
star
19

Lodoss

Python
34
star
20

mini-hok

Mini HoK: a novel MARL benchmark based on the popular mobile game, Honor of Kings, to address limitations in existing environments such as complexity and accessibility.
Python
29
star
21

TriNet

TriNet: stabilizing self-supervised learning from complete or slow collapse on ASR.
Python
26
star
22

ICML21_OAXE

Python
25
star
23

season

[EMNLP 2022] Salience Allocation as Guidance for Abstractive Summarization
Python
22
star
24

hokoff

Python
21
star
25

Leopard

The repository for the paper titled "Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks"
18
star
26

hifi3dface_projpage

Project page for our paper "High-Fidelity 3D Digital Human Creation from RGB-D Selfies".
HTML
16
star
27

GrndPodcastSum

(ACL 2022) The source code for the paper "Towards Abstractive Grounded Summarization of Podcast Transcripts"
Python
15
star
28

OASum

13
star
29

EMNLP21_SemEq

This repo is the code release of EMNLP 2021 conference paper "Connect-the-Dots: Bridging Semantics between Words and Definitions via Aligning Word Sense Inventories".
Python
12
star
30

learning_singing_from_speech

Project page for our paper "DurIAN : DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System".
10
star
31

valuationgame

Jupyter Notebook
9
star
32

Arena

Python
8
star
33

MetaLogic

Python
8
star
34

ZED

This is the repository for EMNLP 2022 paper "Efficient Zero-shot Event Extraction with Context-Definition Alignment"
Python
8
star
35

machine-translation

Open source on machine translation
7
star
36

TPolicies

Python
6
star
37

zebra-inference

Python
5
star
38

Interformer

Jupyter Notebook
5
star
39

FOLNet

This repository includes the code for First-Order Logic Network (FOLNet).
Python
4
star
40

TLeagueAutoBuild

Python
4
star
41

TImitate

Python
2
star
42

siam

2
star