  • Language
    Python
  • License
    MIT License
  • Created about 5 years ago
  • Updated about 2 months ago

Repository Details

Protein-Ligand Benchmark Dataset for Free Energy Calculations

ProteinLigandBenchmarks

Protein-Ligand Benchmark Dataset for testing Parameters and Methods of Free Energy Calculations.

Documentation

Documentation for the protein-ligand-benchmark package is hosted at readthedocs.

Related Publication

The LiveCoMS article on "Best practices for constructing, preparing, and evaluating protein-ligand binding affinity benchmarks" accompanies this benchmark dataset and explains how to use it for alchemical free energy calculations. To suggest improvements to the article, please raise an issue in its GitHub repository, protein-ligand-benchmark-livecoms.

Installation

The repository uses git-lfs (Git Large File Storage) to store all data files. Ideally, install git-lfs before cloning the repository.

conda create -n plbenchmark python=3.7 git-lfs
conda activate plbenchmark
git lfs clone https://github.com/openforcefield/protein-ligand-benchmark.git
cd protein-ligand-benchmark
conda env update --file environment.yml
pip install -e .
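
If the repository is cloned without git-lfs, the data files are checked out as small text pointer stubs instead of the real content. The following is a minimal sketch for spotting such stubs after cloning. The helper names `is_lfs_pointer` and `find_lfs_pointers` are illustrative and not part of the plbenchmark package; the only fact relied on is that a Git LFS pointer file starts with the line `version https://git-lfs.github.com/spec/v1`.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` looks like an unresolved Git LFS pointer file."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def find_lfs_pointers(root):
    """List files below `root` that are still LFS pointers rather than real data."""
    return sorted(
        p for p in Path(root).rglob("*") if p.is_file() and is_lfs_pointer(p)
    )
```

Calling `find_lfs_pointers("data")` from the repository root should return an empty list once the data has been fetched; if it does not, running `git lfs pull` inside the clone fetches the missing files.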

Getting Started

Example notebooks can be found in the Documentation and in examples. Paper repository here.

Data file tree and file description

The data is organized as follows:

data
├── targets.yml                                # list of all targets and their directories
├── <date>_<target_name_1>                     # directory for target 1
│   ├── 00_data                                #     metadata for target 1
│   │   ├── edges.yml                          #         edges/perturbations
│   │   ├── ligands.yml                        #         ligands and activities
│   │   └── target.yml                         #         target
│   ├── 01_protein                             #     protein data
│   │   ├── crd                                #         coordinates
│   │   │   ├── cofactors_crystalwater.pdb     #             cofactors and crystal waters (might be empty if there are none)
│   │   │   └── protein.pdb                    #             amino acid residues
│   │   └── top                                #         topology(s)
│   │       └── amber99sb-star-ildn-mut.ff     #             force field spec.
│   │           ├── cofactors_crystalwater.top #                 Gromacs TOP file of cofactors and crystal water (might be empty if there are none)
│   │           ├── protein.top                #                 Gromacs TOP file of amino acid residues
│   │           └── *.itp                      #                 Gromacs ITP file(s) to be included in TOP files
│   ├── 02_ligands                             #     ligands
│   │   ├── lig_<name_1>                       #         ligand 1
│   │   │   ├── crd                            #             coordinates
│   │   │   │   └── lig_<name_1>.sdf           #                 SDF file
│   │   │   └── top                            #             topology(s)
│   │   │       └── openff-1.0.0.offxml        #                 force field spec.
│   │   │           ├── fflig_<name_1>.itp     #                     Gromacs ITP file: atom types
│   │   │           ├── lig_<name_1>.itp       #                     Gromacs ITP file
│   │   │           ├── lig_<name_1>.top       #                     Gromacs TOP file
│   │   │           └── posre_lig_<name_1>.itp #                     Gromacs ITP file: position restraint file
│   │   ├── lig_<name_2>                       #         ligand 2
│   │   …
│   └── 03_hybrid                              #     edges (perturbations)
│       ├── edge_<name_1>_<name_2>             #         edge between ligand 1 and ligand 2
│       │   └── water                          #             edge in water
│       │       ├── crd                        #                 coordinates
│       │       │   ├── mergedA.pdb            #                     merged conf based on coords of ligand 1
│       │       │   ├── mergedB.pdb            #                     merged conf based on coords of ligand 2
│       │       │   ├── pairs.dat              #                     atom mapping
│       │       │   └── score.dat              #                     similarity score
│       │       └── top                        #                 topology(s)
│       │           └── openff-1.0.0.offxml    #                     force field spec.
│       │               ├── ffmerged.itp       #                         Gromacs ITP file
│       │               ├── ffMOL.itp          #                         Gromacs ITP file
│       │               └── merged.itp         #                         Gromacs ITP file
│       …
├── <date>_<target_name_2>                     # directory for target 2
…
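
Because the layout is regular, file locations can be derived from a target directory and a ligand or edge name. The sketch below builds a few such paths with `pathlib`; the function names are hypothetical helpers written for illustration and are not the plbenchmark API. Following the tree above, ligand directories carry the `lig_` prefix, and edge directories join the two ligand identifiers without that prefix.

```python
from pathlib import Path


def target_dir(data_root, date, target_name):
    """Directory of one target: <data_root>/<date>_<target_name>."""
    return Path(data_root) / f"{date}_{target_name}"


def ligand_sdf(tdir, ligand_name):
    """SDF coordinates of a ligand; `ligand_name` includes the 'lig_' prefix."""
    return Path(tdir) / "02_ligands" / ligand_name / "crd" / f"{ligand_name}.sdf"


def edge_mapping_file(tdir, name_a, name_b):
    """Atom mapping (pairs.dat) of the edge between two ligands in water.

    `name_a` and `name_b` are the identifiers without the 'lig_' prefix.
    """
    edge = f"edge_{name_a}_{name_b}"
    return Path(tdir) / "03_hybrid" / edge / "water" / "crd" / "pairs.dat"


tdir = target_dir("data", "2020-08-26", "mcl1_sample")
print(ligand_sdf(tdir, "lig_23"))
print(edge_mapping_file(tdir, "50", "60"))
```

The paths are only constructed, not checked for existence; the ligand and edge names in the usage lines are taken from the YAML examples in this README and may not exist in every target.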

Description of metadata YAML files

targets.yml

This file lists all the registered targets in the benchmark set. Each entry denotes one target and contains the following information:

mcl1_sample:
  name:     mcl1_sample
  date:     2020-08-26
  dir:      2020-08-26_mcl1_sample

mcl1_sample is the entry name and each entry has three sub-entries:

  • name is the target name, which is usually the same as the entry name of the target.
  • date is the date when the target was initially added to the benchmark set.
  • dir is the name of the directory where all the data for the target is found. Usually it is the date and the name fields joined by an underscore _.
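
The naming convention can be checked with a few lines of Python once targets.yml has been parsed into a dictionary (for example with a YAML library; parsing is left out here so the sketch needs no third-party packages). `expected_dir` and `unconventional_dirs` are hypothetical helper names. Since the convention holds only "usually", a non-empty result flags entries to look at, not necessarily errors.

```python
import datetime


def expected_dir(entry):
    """Directory name implied by the convention <date>_<name>."""
    # Works for both string dates and datetime.date objects, which some
    # YAML parsers produce for values such as 2020-08-26.
    return f"{entry['date']}_{entry['name']}"


def unconventional_dirs(targets):
    """Return the entry names whose dir differs from <date>_<name>."""
    return [
        key for key, entry in targets.items()
        if entry["dir"] != expected_dir(entry)
    ]


targets = {
    "mcl1_sample": {
        "name": "mcl1_sample",
        "date": datetime.date(2020, 8, 26),
        "dir": "2020-08-26_mcl1_sample",
    }
}
print(unconventional_dirs(targets))
```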

target.yml

This file is found in the metadata directory of each target: <date>_<target_name>/00_data/target.yml. It contains additional information about the target:

alternate:
  iridium_classifier: HT
  iridium_score: 0.3
  pdb: 6O6F
associated_sets:
- Schrodinger JACS
comments: hydrophobic interactions contributing to binding
date: 2019-12-13
dpi: 0.26
id: 9
iridium_classifier: HT
iridium_score: 0.41
name: mcl1
netcharge: 4 e
pdb: 4HW3
references:
  calculation:
  - 10.1021/ja512751q
  - 10.1021/acs.jcim.9b00105
  - 10.1039/C9SC03754C
  measurement:
  - 10.1021/jm301448p

Explanation of the entries:

  • alternate: Alternate X-ray structure which could be used
    • iridium_classifier: Iridium classifier of the alternate structure
    • iridium_score: Iridium score of the alternate structure
    • pdb: PDB ID of the alternate structure
  • associated_sets: list of tags of the benchmark sets this target belongs to (e.g. "Schrodinger JACS")
  • comments: free-text comments about the target (e.g. "hydrophobic interactions contributing to binding")
  • date: date when the target was initially added to the benchmark set.
  • dpi: diffraction precision index of the used structure (quality metric for the structure)
  • id: an ID assigned to the target
  • iridium_classifier: Iridium classifier of the used structure
  • iridium_score: Iridium score of the used structure
  • name: name/identifier of the target
  • netcharge: total charge of the prepared protein (this should be neutralized with counterions during preparation of the simulation system)
  • pdb: PDB ID of the used structure
  • references: DOIs of references
    • calculation: list of references where this target was used in calculations
    • measurement: list of references of affinity measurements

ligands.yml

This file is found in the metadata directory of each target: <date>_<target_name>/00_data/ligands.yml. It contains information about the ligands of one target. One entry looks like this:

lig_23:
  measurement:
    comment: Table 2, entry 23
    doi: 10.1021/jm301448p
    error: 0.03
    type: ki
    unit: uM
    value: 0.37
  name: lig_23
  smiles: '[H]c1c(c(c2c(c1[H])c(c(c(c2OC([H])([H])C([H])([H])C([H])([H])C3=C(Sc4c3c(c(c(c4[H])[H])[H])[H])C(=O)[O-])[H])[H])[H])[H])[H]'

Explanation of the entries:

  • measurement: affinity measurement entry
    • comment: comment about the measurement
    • doi: DOI (digital object identifier) pointing to the reference for this measurement
    • error: Error of measurement, null if not reported
    • type: type of the measured observable. Accepted entries are ki (binding equilibrium constant), ic50 (IC50 value), pic50 (pIC50 value), and dg (free energy of binding).
    • unit: Unit of value and error entries.
    • value: Value of the measurement.
  • name: name of ligand, which always starts with lig_, followed by a unique identifier.
  • smiles: SMILES string of the ligand, with charge state information and chirality information.
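
To compare a measurement with a calculated free energy, a ki entry is commonly converted to a binding free energy via ΔG = RT ln(Ki / 1 M). The following is a minimal sketch of that conversion for the lig_23 entry above. The temperature of 298.15 K, the unit table, and the function names are assumptions made for illustration; the plbenchmark package has its own conversion routines, which may differ in temperature and error handling.

```python
import math

GAS_CONSTANT = 0.0019872041  # kcal / (mol K)

# Conversion factors to mol/L for the units handled in this sketch (assumed set).
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def ki_to_dg(value, unit, temperature=298.15):
    """Binding free energy in kcal/mol from Ki, relative to a 1 M standard state."""
    molar = value * UNIT_TO_MOLAR[unit]
    return GAS_CONSTANT * temperature * math.log(molar)


def ki_error_to_dg_error(value, error, temperature=298.15):
    """First-order propagated error in kcal/mol; None if no error was reported."""
    if error is None:
        return None
    # d(ln Ki) = dKi / Ki, so the unit of value and error cancels.
    return GAS_CONSTANT * temperature * error / value


measurement = {"type": "ki", "unit": "uM", "value": 0.37, "error": 0.03}
dg = ki_to_dg(measurement["value"], measurement["unit"])
ddg = ki_error_to_dg_error(measurement["value"], measurement["error"])
print(f"dG = {dg:.2f} +/- {ddg:.2f} kcal/mol")
```

For the example entry this gives roughly -8.77 kcal/mol. Converting ic50 or pic50 entries requires further assumptions about the assay and is deliberately not covered here.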

edges.yml

This file is found in the metadata directory of each target: <date>_<target_name>/00_data/edges.yml. It contains information about the edges of one target. One entry looks like this:

edge_50_60:
  ligand_a: lig_50
  ligand_b: lig_60

Each entry simply names the two ligands, ligand_a and ligand_b, that the edge connects.
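
Taken together, the edges form a perturbation graph over the ligands. Relative free energies can only be related between ligands that are linked through some chain of edges, so a quick connectivity check is a useful sanity test. The sketch below works on the parsed content of edges.yml as a plain dictionary; `build_graph` and `is_connected` are illustrative names, not functions of the plbenchmark package, and the example edges other than edge_50_60 are made up.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from parsed edges.yml content."""
    graph = defaultdict(set)
    for edge in edges.values():
        a, b = edge["ligand_a"], edge["ligand_b"]
        graph[a].add(b)
        graph[b].add(a)
    return graph


def is_connected(graph):
    """Return True if every ligand is reachable from every other (breadth-first search)."""
    if not graph:
        return True
    start = next(iter(graph))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(graph)


edges = {
    "edge_50_60": {"ligand_a": "lig_50", "ligand_b": "lig_60"},
    "edge_60_61": {"ligand_a": "lig_60", "ligand_b": "lig_61"},  # hypothetical edge
}
print(is_connected(build_graph(edges)))
```

Ligands that appear in ligands.yml but in no edge are not part of this graph; comparing the graph's nodes with the ligand list would reveal them.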

Summary

The table below summarizes the contents of the Protein-Ligand Benchmark Dataset: the available protein targets with the corresponding PDB ID and number of ligands.

Target      PDB ID   Number of ligands
bace        4DJW     36
bace_hunt   4JPC     32
bace_p2     3IN4     12
cdk2        1H1Q     16
cdk8        5HNB     33
cmet        4R1Y     12
eg5         3L9H     28
galectin    5E89     8
hif2a       5TBM     42
jnk1        2GMX     21
mcl1        4HW3     42
p38         3FLY     34
pde10       4BBX     35
pde2        6EZF     21
pfkfb3      6HVI     40
ptp1b       2QBS     23
shp2        5EHR     26
syk         4PV0     44
thrombin    2ZFF     11
tnks2       4UI5     27
tyk2        4GIH     16

Release History

Releases follow the major.minor.micro scheme recommended by PEP 440, where

  • major increments denote a change that may break API compatibility with previous major releases
  • minor increments denote the addition of new targets, or additions and larger changes to the API
  • micro increments denote bugfixes, addition of API features, changes of coordinates or topologies, and changes of metadata
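
As a small illustration of the scheme, the sketch below reports which component changed between two release strings. The helper names and the version numbers in the usage lines are hypothetical and do not refer to actual releases of this package; pre-release or development suffixes allowed by PEP 440 are not handled.

```python
def parse_version(text):
    """Split a plain 'major.minor.micro' string into a tuple of integers."""
    major, minor, micro = (int(part) for part in text.split("."))
    return (major, minor, micro)


def change_level(old, new):
    """Return 'major', 'minor', or 'micro' for the first component that differs, else None."""
    labels = ("major", "minor", "micro")
    for label, a, b in zip(labels, parse_version(old), parse_version(new)):
        if a != b:
            return label
    return None


print(change_level("0.2.0", "0.2.1"))  # only the micro component differs
print(change_level("0.2.1", "0.3.0"))  # the minor component differs
```

Under the scheme above, a result of "micro" can still mean changed coordinates, topologies, or metadata, so pinning the exact version remains the safest way to reproduce results.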

Contributions

License

MIT. See the License File for more information.

CC-BY-4.0 for data (content of directory data). See the License File for more information.

Copyright

Copyright (c) 2021, Open Force Field Consortium, David F. Hahn

Acknowledgements

Project based on the Computational Molecular Science Python Cookiecutter version 1.1.
