MolRep: A Deep Representation Learning Library for Molecular Property Prediction
Summary
MolRep is a Python package for fairly measuring algorithmic progress on chemical property datasets. It currently provides a complete re-evaluation of 16 state-of-the-art deep representation models over 16 benchmark property datasets.
If you find this package useful, please cite our papers, MolRep and Mol-XAI:
@article{rao2021molrep,
title={MolRep: A Deep Representation Learning Library for Molecular Property Prediction},
author={Rao, Jiahua and Zheng, Shuangjia and Song, Ying and Chen, Jianwen and Li, Chengtao and Xie, Jiancong and Yang, Hui and Chen, Hongming and Yang, Yuedong},
journal={bioRxiv},
year={2021},
publisher={Cold Spring Harbor Laboratory}
}
@article{rao2021quantitative,
title={Quantitative Evaluation of Explainable Graph Neural Networks for Molecular Property Prediction},
author={Rao, Jiahua and Zheng, Shuangjia and Yang, Yuedong},
journal={arXiv preprint arXiv:2107.04119},
year={2021}
}
Install & Usage
We provide a script to install the environment. You will need the conda package manager, which can be installed from here.
To install the required packages, follow these instructions (tested on a Linux terminal):
- Clone the repository: git clone https://github.com/biomed-AI/MolRep
- cd into the cloned directory: cd MolRep
- Run the install script: source install.sh
Where <your_conda_path> is your conda path, and <CUDA_VERSION> is an optional argument that can be one of cpu, cu92, cu100, cu101, or cu110. If you do not provide a CUDA version, the script will default to cu110. The script will create a virtual environment named MolRep, with all the packages required to run our code. Important: do NOT run this command using bash instead of source!
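The argument rule described above (an optional CUDA tag with a default of cu110) can be sketched in Python as follows. This is only an illustration of the documented behaviour: `resolve_cuda_version` is a hypothetical helper, not part of MolRep or of install.sh, and the actual script may validate its arguments differently.

```python
from typing import Optional

# Values listed in the install notes above.
SUPPORTED_CUDA_VERSIONS = ("cpu", "cu92", "cu100", "cu101", "cu110")
DEFAULT_CUDA_VERSION = "cu110"


def resolve_cuda_version(arg: Optional[str] = None) -> str:
    """Return the CUDA tag to use, falling back to the documented default.

    Hypothetical helper for illustration only; it is not part of MolRep.
    """
    if arg is None or arg == "":
        # No CUDA version supplied: use the documented default.
        return DEFAULT_CUDA_VERSION
    if arg not in SUPPORTED_CUDA_VERSIONS:
        raise ValueError(
            f"Unsupported CUDA version {arg!r}; "
            f"expected one of {SUPPORTED_CUDA_VERSIONS}"
        )
    return arg
```

For example, `resolve_cuda_version()` falls back to the default tag, while `resolve_cuda_version("cpu")` selects a CPU-only setup.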
Data
Data (including the Explainable Dataset) can be downloaded from Google Drive.
[!NEWS] The human experiments for the explainability task (molecules and results) are available here.
Current Dataset
Dataset | # Tasks | Task type | # Molecules | Splits | Metric | Reference |
---|---|---|---|---|---|---|
QM7 | 1 | Regression | 7160 | Stratified | MAE | Wu et al. |
QM8 | 12 | Regression | 21786 | Random | MAE | Wu et al. |
QM9 | 12 | Regression | 133885 | Random | MAE | Wu et al. |
ESOL | 1 | Regression | 1128 | Random | RMSE | Wu et al. |
FreeSolv | 1 | Regression | 642 | Random | RMSE | Wu et al. |
Lipophilicity | 1 | Regression | 4200 | Random | RMSE | Wu et al. |
BBBP | 1 | Classification | 2039 | Scaffold | ROC-AUC | Wu et al. |
Tox21 | 12 | Classification | 7831 | Random | ROC-AUC | Wu et al. |
SIDER | 27 | Classification | 1427 | Random | ROC-AUC | Wu et al. |
ClinTox | 2 | Classification | 1478 | Random | ROC-AUC | Wu et al. |
Liver injury | 1 | Classification | 2788 | Random | ROC-AUC | Xu et al. |
Mutagenesis | 1 | Classification | 6511 | Random | ROC-AUC | Hansen et al. |
hERG | 1 | Classification | 4813 | Random | ROC-AUC | Li et al. |
MUV | 17 | Classification | 93087 | Random | PRC-AUC | Wu et al. |
HIV | 1 | Classification | 41127 | Random | ROC-AUC | Wu et al. |
BACE | 1 | Classification | 1513 | Random | ROC-AUC | Wu et al. |
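As a rough guide to the Metric column, the sketch below computes MAE, RMSE, and ROC-AUC for a single task in plain Python. It is a minimal illustration under stated assumptions, not MolRep's evaluation code: the function names are hypothetical, the ROC-AUC uses a simple pairwise comparison that is quadratic in the number of samples, and PRC-AUC (used for MUV) is not covered.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between targets and predictions."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a random positive outscores a
    random negative, with ties counted as one half."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For instance, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` evaluates to 0.75, because three of the four positive/negative pairs are ranked correctly.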
Methods
Current Methods
Self-/unsupervised Models
Methods | Descriptions | Reference |
---|---|---|
Mol2Vec | Mol2Vec is an unsupervised approach that learns vector representations of molecular substructures, which point in similar directions for chemically related substructures. | Jaeger et al. |
N-Gram graph | N-gram graph is a simple unsupervised representation for molecules that first embeds the vertices in the molecular graph and then constructs a compact representation of the graph by assembling the vertex embeddings along short walks in the graph. | Liu et al. |
FP2Vec | FP2Vec is a molecular featurizer that represents a chemical compound as a set of trainable embedding vectors and combines them with a CNN model. | Jeon et al. |
VAE | VAE is a framework for training two neural networks (an encoder and a decoder) to learn a mapping from a high-dimensional molecular representation into a lower-dimensional space. | Kingma et al. |
Sequence Models
Methods | Descriptions | Reference |
---|---|---|
BiLSTM | BiLSTM is a recurrent neural network (RNN) architecture that encodes sequences from compound SMILES strings. | Hochreiter et al. |
SALSTM | SALSTM combines a self-attention mechanism with an improved BiLSTM for molecule representation. | Zheng et al. |
Transformer | Transformer is a network based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, that encodes compound SMILES strings. | Vaswani et al. |
MAT | MAT is a molecule attention transformer that uses inter-atomic distances and the molecular graph structure to augment the attention mechanism. | Maziarka et al. |
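The sequence models above all consume SMILES strings, which first have to be split into tokens and mapped to integer ids. The sketch below shows one common way to do this with a regular expression. It is an assumption for illustration, not MolRep's tokenizer: the pattern, the `<pad>`/`<unk>` tokens, and the function names are all hypothetical.

```python
import re
from typing import Dict, List

# Assumed token pattern: bracket atoms such as [NH3+], the two-letter
# halogens Br and Cl, %NN ring closures, then any other single character.
SMILES_TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens using the assumed pattern."""
    return SMILES_TOKEN_PATTERN.findall(smiles)


def build_vocab(smiles_list: List[str]) -> Dict[str, int]:
    """Assign an integer id to every token seen in the given SMILES strings."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for smiles in smiles_list:
        for token in tokenize_smiles(smiles):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(smiles: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map tokens to ids, truncate to max_len, and right-pad with <pad>."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokenize_smiles(smiles)]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))
```

With this pattern, `tokenize_smiles("CC(=O)Cl")` keeps `Cl` as a single token instead of splitting it into `C` and `l`, which matters because the two readings denote different atoms.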
Graph Models
Methods | Descriptions | Reference |
---|---|---|
DGCNN | DGCNN is a deep graph convolutional neural network whose SortPooling layer sorts graph vertices in a consistent order to learn the embedding of the molecular graph. | Zhang et al. |
GraphSAGE | GraphSAGE is a framework for inductive representation learning on molecular graphs that generates low-dimensional representations for atoms and performs sum, mean, or max-pooling neighborhood aggregation to update the atom and molecular representations. | Hamilton et al. |
GIN | GIN is the Graph Isomorphism Network, which addresses the limitations of GraphSAGE in capturing different graph structures by building on the Weisfeiler-Lehman graph isomorphism test. | Xu et al. |
ECC | ECC is an Edge-Conditioned Convolution Network that learns a different parameter for each edge label (bond type) on the molecular graph, and neighbor aggregation is weighted according to the edge-specific parameters. | Simonovsky et al. |
DiffPool | DiffPool combines a differentiable graph encoder with an adaptive pooling mechanism that collapses nodes on the basis of a supervised criterion to learn the representation of molecular graphs. | Ying et al. |
MPNN | MPNN is a message-passing graph neural network that learns the representation of a compound's molecular graph. It mainly focuses on obtaining effective vertex (atom) embeddings. | Gilmer et al. |
D-MPNN | D-MPNN is another message-passing graph neural network that passes messages associated with directed edges (bonds) rather than with vertices, so it can make use of bond attributes. | Yang et al. |
CMPNN | CMPNN is a graph neural network that improves the molecular graph embedding by strengthening the message interactions between edges (bonds) and nodes (atoms). | Song et al. |
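The idea shared by the message-passing models in this table can be illustrated with a single round of neighbor aggregation on a toy molecular graph. The sketch below is deliberately simplified and is not the implementation of any model listed here: it has no learned weights, uses plain sum aggregation over atoms, and the function names are hypothetical.

```python
from typing import List, Tuple


def message_passing_step(
    node_features: List[List[float]],
    edges: List[Tuple[int, int]],
) -> List[List[float]]:
    """One round of sum aggregation: each atom adds the summed features
    of its bonded neighbors to its own feature vector."""
    dim = len(node_features[0])
    aggregated = [[0.0] * dim for _ in node_features]
    for u, v in edges:  # each bond is undirected, so messages flow both ways
        for k in range(dim):
            aggregated[u][k] += node_features[v][k]
            aggregated[v][k] += node_features[u][k]
    return [
        [h + m for h, m in zip(feat, msg)]
        for feat, msg in zip(node_features, aggregated)
    ]


def readout(node_features: List[List[float]]) -> List[float]:
    """Sum-pool the atom vectors into a single molecule-level vector."""
    dim = len(node_features[0])
    return [sum(feat[k] for feat in node_features) for k in range(dim)]


# Toy example: ethanol heavy atoms C-C-O with one-hot [is_C, is_O] features.
atoms = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
bonds = [(0, 1), (1, 2)]
updated = message_passing_step(atoms, bonds)
molecule_vector = readout(updated)
```

After one step, the middle carbon's vector reflects both of its neighbors, while the terminal atoms only see the middle carbon. Real models replace the plain sums with learned, often bond-aware, update functions and repeat the step several times.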
Training
To train a model with K-fold cross-validation, run 5-fold-training_example.ipynb.
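For readers unfamiliar with K-fold cross-validation, the sketch below shows how a dataset can be partitioned into K random folds so that each fold serves once as the held-out set. It is a minimal illustration and does not reproduce the notebook: `k_fold_indices` is a hypothetical helper, and the notebook may use different splitting logic (for example, the stratified or scaffold splits listed in the dataset table).

```python
import random
from typing import List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 5, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs from a random partition."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducible folds
    # Deal shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idx, test_idx))
    return splits


# Usage: train and evaluate once per fold, then average the fold scores.
for fold_id, (train_idx, test_idx) in enumerate(k_fold_indices(10, k=5)):
    print(f"fold {fold_id}: {len(train_idx)} train / {len(test_idx)} test")
```

Fixing the seed keeps the folds identical across runs, which helps when comparing different models on the same splits.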
Testing
To test a pretrained model, run testing-example.ipynb.
Explainable
To explain a GNN model, run Explainer_Experiments.py.
More results will be added soon.