• Language: Python
• License: MIT License

Repository Details

EquiBind: geometric deep learning for fast predictions of the 3D structure in which a small molecule binds to a protein

EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction

Paper on arXiv

Before using EquiBind, consider our newer approach, DiffDock, which improves over EquiBind in multiple ways. See the DiffDock GitHub repository and paper.

EquiBind is an SE(3)-equivariant geometric deep learning model that performs direct-shot prediction of both i) the receptor binding location (blind docking) and ii) the ligand's bound pose and orientation. EquiBind achieves significant speed-ups compared to traditional and recent baselines. If you have questions, don't hesitate to open an issue or ask me via [email protected] or social media or Octavian Ganea via [email protected]. We are happy to hear from you!
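To make the term SE(3)-equivariant concrete: a function f on point coordinates is equivariant if rotating and translating the input and then applying f gives the same result as applying f first and then rotating and translating the output. The sketch below checks this property numerically for a toy function. The function `toy_model` is a hypothetical stand-in and has nothing to do with EquiBind's actual network, and only rotations about the z-axis are tested, which is a subset of SE(3).

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def rigid_transform(points, angle, shift):
    """Rotate every point about z, then translate it by `shift`."""
    out = []
    for p in points:
        rx, ry, rz = rotate_z(p, angle)
        out.append((rx + shift[0], ry + shift[1], rz + shift[2]))
    return out

def toy_model(points):
    """Hypothetical stand-in for a model: move each point halfway to the centroid."""
    n = len(points)
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    return [tuple(0.5 * (p[i] + centroid[i]) for i in range(3)) for p in points]

def max_deviation(a, b):
    """Largest absolute coordinate difference between two point lists."""
    return max(abs(u - v) for p, q in zip(a, b) for u, v in zip(p, q))

points = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0), (1.0, 1.0, 1.0)]
angle, shift = 0.7, (4.0, -2.0, 1.5)

# Equivariance: transform-then-predict should equal predict-then-transform.
transform_then_predict = toy_model(rigid_transform(points, angle, shift))
predict_then_transform = rigid_transform(toy_model(points), angle, shift)
deviation = max_deviation(transform_then_predict, predict_then_transform)
print(f"max deviation: {deviation:.2e}")
```

The deviation should be at the level of floating-point round-off. A function that, for example, added a fixed offset along the x-axis to every point would fail this check under rotation.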

Dataset

Our preprocessed data (see the dataset section in the paper's Appendix) is available from Zenodo.
The files in data contain the names for the time-based data split.

To train one of our models with this data:

  1. Download it from Zenodo.
  2. Unzip the directory and place it into data such that you have the path data/PDBBind.

Use the provided model weights to predict the binding structure of your own protein-ligand pairs:

Step 1: What you need as input

Ligand files in .mol2, .sdf, .pdbqt, or .pdb format whose names contain the string ligand (your ligand files should contain all hydrogens).
Receptor files in .pdb format whose names contain the string protein. We ran reduce on our training proteins, so you may want to run it on your protein as well.
For each complex you want to predict, you need a directory containing the ligand and receptor files, like this:

my_data_folder
└───name1
    │   name1_protein.pdb
    │   name1_ligand.sdf
└───name2
    │   name2_protein.pdb
    │   name2_ligand.mol2
...
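Before running inference, it can help to check that a data folder follows the layout above. The following is a minimal sketch, not part of the repository: the helper names `check_complex_dir` and `check_data_folder` are hypothetical, and the checks only mirror the naming rules stated in this step (a .pdb file containing "protein" and a ligand file containing "ligand" with one of the listed extensions).

```python
from pathlib import Path

# Ligand extensions listed in Step 1.
LIGAND_EXTENSIONS = {".mol2", ".sdf", ".pdbqt", ".pdb"}

def check_complex_dir(complex_dir):
    """Return a list of problems found in one complex directory (empty if none)."""
    files = [f for f in Path(complex_dir).iterdir() if f.is_file()]
    receptors = [f for f in files if "protein" in f.name and f.suffix == ".pdb"]
    ligands = [f for f in files
               if "ligand" in f.name and f.suffix in LIGAND_EXTENSIONS]
    problems = []
    if not receptors:
        problems.append("no .pdb file with 'protein' in its name")
    if not ligands:
        problems.append("no ligand file with 'ligand' in its name")
    return problems

def check_data_folder(data_folder):
    """Map each complex directory name to its list of problems."""
    return {
        d.name: check_complex_dir(d)
        for d in sorted(Path(data_folder).iterdir())
        if d.is_dir()
    }
```

Calling `check_data_folder("my_data_folder")` returns a dictionary such as `{"name1": [], "name2": []}` when every complex directory is complete. The sketch does not verify that ligands contain all hydrogens or that the files parse.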

Step 2: Setup Environment

We will set up the environment using Anaconda. Clone the current repo:

git clone https://github.com/HannesStark/EquiBind

Create a new environment with all required packages using environment.yml. If you have a CUDA GPU, run:

conda env create -f environment.yml

If you only have a CPU, run instead:

conda env create -f environment_cpuonly.yml

Activate the environment

conda activate equibind

If you prefer to install the requirements manually instead of using environment.yml, these are the packages for the CUDA GPU case:

python=3.7
pytorch 1.10
torchvision
cudatoolkit=10.2
torchaudio
dgl-cuda10.2
rdkit
openbabel
biopython
biopandas
pot
dgllife
joblib
pyaml
icecream
matplotlib
tensorboard
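After a manual install, a quick way to see which of these packages are importable is to probe them without importing them. This is a sketch under assumptions: the mapping from package name to import name below is our best understanding (for example `pot` imported as `ot`, `biopython` as `Bio`, and `pyaml` providing `yaml` through its PyYAML dependency) and should be adjusted if your installation differs.

```python
from importlib.util import find_spec

# Assumed mapping: package name from the list above -> Python import name.
IMPORT_NAMES = {
    "pytorch": "torch",
    "torchvision": "torchvision",
    "torchaudio": "torchaudio",
    "dgl-cuda10.2": "dgl",
    "rdkit": "rdkit",
    "openbabel": "openbabel",
    "biopython": "Bio",
    "biopandas": "biopandas",
    "pot": "ot",
    "dgllife": "dgllife",
    "joblib": "joblib",
    "pyaml": "yaml",
    "icecream": "icecream",
    "matplotlib": "matplotlib",
    "tensorboard": "tensorboard",
}

def missing_packages(import_names=IMPORT_NAMES):
    """Return the package names whose import name cannot be found."""
    return sorted(
        pkg for pkg, module in import_names.items() if find_spec(module) is None
    )

print("missing:", missing_packages() or "none")
```

The check only confirms that a module can be located. It does not verify versions such as python=3.7 or cudatoolkit=10.2.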

Step 3: Predict Binding Structures!

In the config file configs_clean/inference.yml, set inference_path to the path of your input data folder, for example inference_path: path_to/my_data_folder.
Then run:

python inference.py --config=configs_clean/inference.yml

Done! 🎉
Your results are saved as .sdf files in the directory specified in the config file under output_directory: 'data/results/output' and as tensors at runs/flexible_self_docking/predictions_RDKitFalse.pt!
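If you run inference on many data folders, the inference_path edit from this step can be scripted. The sketch below rewrites a single top-level `key: value` line using only the standard library. The helper `set_yaml_scalar` is hypothetical, and the two-line config excerpt is an illustration only, since the real inference.yml contains more keys.

```python
import re

def set_yaml_scalar(config_text, key, value):
    """Replace the value of a top-level `key: value` line, keeping other lines."""
    pattern = re.compile(rf"^{re.escape(key)}:.*$", flags=re.MULTILINE)
    new_line = f"{key}: {value}"
    # A function replacement avoids backslash escapes being interpreted in `value`.
    updated, count = pattern.subn(lambda _match: new_line, config_text, count=1)
    if count == 0:
        raise KeyError(f"{key!r} not found as a top-level key")
    return updated

# Hypothetical excerpt of a config file, for illustration only.
example = (
    "inference_path: 'data/your_input_path'\n"
    "output_directory: 'data/results/output'\n"
)
print(set_yaml_scalar(example, "inference_path", "'path_to/my_data_folder'"))
```

To apply it to the real file, read configs_clean/inference.yml as text, pass it through the function, and write the result back. Any trailing comment on the replaced line is dropped, so a YAML library is the better choice when comments matter.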

Inference for multiple ligands in the same .sdf file and a single receptor

python multiligand_inference.py -o path/to/output_directory -r path/to/receptor.pdb -l path/to/ligands.sdf

This runs EquiBind on every ligand in ligands.sdf against the protein in receptor.pdb. The outputs are 3 files in output_directory with the following names and contents:

failed.txt - contains the index (in the file ligands.sdf) and name of every molecule for which inference failed in a way that was caught and handled.
success.txt - contains the index (in the file ligands.sdf) and name of every molecule for which inference succeeded.
output.sdf - contains the conformers produced by EquiBind in .sdf format.
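To get a quick overview of what ended up in output.sdf, you can count the records without any chemistry toolkit. The sketch below relies only on the general SDF layout, in which each record starts with a title line and ends with a `$$$$` line. The helper `read_sdf_names` is hypothetical, and whether the title line carries a meaningful name depends on your input ligands file.

```python
def read_sdf_names(sdf_text):
    """Return the title line of every complete record in SDF-formatted text."""
    names = []
    record = []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            # End of a record: its first line is the molecule title.
            if record:
                names.append(record[0].strip())
            record = []
        else:
            record.append(line)
    return names
```

For example, `read_sdf_names(open("path/to/output_directory/output.sdf").read())` returns one entry per predicted conformer, so its length can be compared with the number of lines you expect in success.txt. For anything beyond counting, a toolkit such as RDKit is the more robust option.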

Reproducing paper numbers

Download the data and place it as described in the "Dataset" section above.

Using the provided model weights

To predict binding structures using the provided model weights run:

python inference.py --config=configs_clean/inference_file_for_reproduce.yml

This will give you the results of EquiBind-U and then those of EquiBind after running the fast ligand point cloud fitting corrections.
The numbers are a bit better than what is reported in the paper. We will put the improved numbers into the next update of the paper.

Training a model yourself and using those weights

To train the model yourself, run:

python train.py --config=configs_clean/RDKitCoords_flexible_self_docking.yml

The model weights are saved in the runs directory.
You can also start a TensorBoard server with tensorboard --logdir=runs and watch the model train.
To evaluate the model on the test set, change the run_dirs: entry of the config file inference_file_for_reproduce.yml to point to the directory produced in runs. Then you can run python inference.py --config=configs_clean/inference_file_for_reproduce.yml as above!
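To find the directory that a training run just produced, one option is to pick the most recently modified subdirectory of runs. This is a heuristic sketch, not repository code: the helper `latest_run_dir` is hypothetical, and modification time can be misleading if you touched an older run afterwards, so check the returned name before putting it into run_dirs.

```python
from pathlib import Path

def latest_run_dir(runs_root="runs"):
    """Return the most recently modified subdirectory of `runs_root`, or None."""
    root = Path(runs_root)
    if not root.is_dir():
        return None
    run_dirs = [d for d in root.iterdir() if d.is_dir()]
    # `default=None` covers an existing but empty runs directory.
    return max(run_dirs, key=lambda d: d.stat().st_mtime, default=None)

print(latest_run_dir())
```

The printed path (or None when no runs directory exists) is the candidate to enter under run_dirs in inference_file_for_reproduce.yml.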

Reference

📃 Paper on arXiv

@inproceedings{equibind,
  title={Equibind: Geometric deep learning for drug binding structure prediction},
  author={St{\"a}rk, Hannes and Ganea, Octavian and Pattanaik, Lagnajit and Barzilay, Regina and Jaakkola, Tommi},
  booktitle={International Conference on Machine Learning},
  pages={20503--20521},
  year={2022},
  organization={PMLR}
}
