Language: Python
License: MIT License

Repository Details

Baseline models for GuacaMol benchmarks

GuacaMol Baselines

A series of baseline model implementations for the guacamol benchmark for generative chemistry.
A more in-depth explanation of the benchmarks and the scores for these baselines can be found in our paper.

Dependencies

To install all dependencies:

pip install -r requirements.txt

We also provide a Dockerfile that containerizes the baselines from this repo. This may be a useful starting point when implementing your own generative models.

docker build -f dockers/Dockerfile . -t guacamol-baselines

Dataset

Some baselines require the guacamol dataset. To download it, run:

bash fetch_guacamol_dataset.sh

Random Sampler

Dummy baseline that always returns random molecules from the guacamol training set.

To execute the goal-directed generation benchmarks:

python -m random_smiles_sampler.goal_directed_generation

To execute the distribution learning benchmarks:

python -m random_smiles_sampler.distribution_learning
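
The idea behind this baseline can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in random_smiles_sampler: the class name, the load_smiles helper, and the choice to sample with replacement are all assumptions made for the example.

```python
import random
from typing import List, Optional


class RandomSmilesSampler:
    """Toy sampler (hypothetical name): draws molecules at random from a fixed set."""

    def __init__(self, smiles: List[str], seed: Optional[int] = None) -> None:
        if not smiles:
            raise ValueError("training set must not be empty")
        self.smiles = list(smiles)
        self.rng = random.Random(seed)

    def generate(self, number_samples: int) -> List[str]:
        # Assumption: sample with replacement, so any count can be requested.
        return self.rng.choices(self.smiles, k=number_samples)


def load_smiles(path: str) -> List[str]:
    """Hypothetical helper: read one SMILES string per line, skipping blanks."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


if __name__ == "__main__":
    sampler = RandomSmilesSampler(["CCO", "c1ccccc1", "CC(=O)O"], seed=0)
    print(sampler.generate(5))
```

Because the sampler never looks at a scoring function, its results indicate what a benchmark awards to molecules that are valid and drug-like but not optimized.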

Best from ChEMBL

Dummy baseline that simply returns the molecules from the guacamol training set that best satisfy the score of a goal-directed benchmark.
There is no model and no training; its only purpose is to establish a lower bound on the benchmark scores.

To execute the goal-directed generation benchmarks:

python -m best_from_chembl.goal_directed_generation

No distribution learning benchmark available.
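
As a rough sketch of what "best from the training set" means, the snippet below scores every molecule and keeps the top few. The function name and the toy objective are hypothetical and stand in for a real goal-directed scoring function; this is not the code in best_from_chembl.

```python
import heapq
from typing import Callable, Iterable, List


def best_from_training_set(
    smiles: Iterable[str],
    score: Callable[[str], float],
    number_molecules: int,
) -> List[str]:
    """Return the `number_molecules` highest-scoring SMILES, best first."""
    return heapq.nlargest(number_molecules, smiles, key=score)


def toy_score(smiles: str) -> float:
    # Hypothetical objective for illustration only: prefer longer strings.
    return float(len(smiles))


if __name__ == "__main__":
    training_set = ["CCO", "C", "CCCCO", "CC"]
    print(best_from_training_set(training_set, toy_score, 2))
```

Any generative model that cannot beat this selection has not improved on simply screening the training data, which is why the baseline serves as a lower bound.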

SMILES GA

Genetic algorithm on SMILES as described in: https://www.journal.csj.jp/doi/10.1246/cl.180665

Implementation adapted from: https://github.com/tsudalab/ChemGE

To execute the goal-directed generation benchmarks:

python -m smiles_ga.goal_directed_generation

No distribution learning benchmark available.

Graph GA

Genetic algorithm on molecule graphs as described in: https://doi.org/10.26434/chemrxiv.7240751

Implementation adapted from: https://github.com/jensengroup/GB-GA

To execute the goal-directed generation benchmarks:

python -m graph_ga.goal_directed_generation

No distribution learning benchmark available.

Graph MCTS

Monte Carlo Tree Search on molecule graphs as described in: https://doi.org/10.26434/chemrxiv.7240751

Implementation adapted from: https://github.com/jensengroup/GB-GB

To execute the goal-directed generation benchmarks:

python -m graph_mcts.goal_directed_generation

To execute the distribution learning benchmarks:

python -m graph_mcts.distribution_learning

To re-generate the distribution statistics as pickle files:

python -m graph_mcts.analyze_dataset

SMILES LSTM Hill Climbing

Long short-term memory (LSTM) on SMILES as described in: https://arxiv.org/abs/1701.01329

This implementation optimizes using a hill-climbing algorithm.

Implementation by BenevolentAI

A pre-trained model is provided in: smiles_lstm/pretrained_model

To execute the goal-directed generation benchmarks:

python -m smiles_lstm_hc.goal_directed_generation

To execute the distribution learning benchmark:

python -m smiles_lstm_hc.distribution_learning

To train a model from scratch:

python -m smiles_lstm_hc.train_smiles_lstm_model
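
For readers unfamiliar with hill climbing applied to a generative model, one common reading is: sample molecules, keep the best-scoring ones, fine-tune the model on them, and repeat. The sketch below illustrates that loop with a toy stand-in for the LSTM. Every name here (hill_climb, ToyModel, the parameters and their defaults) is hypothetical, and whether smiles_lstm_hc follows this loop in detail is an assumption.

```python
import random
from typing import Callable, List


def hill_climb(
    sample: Callable[[int], List[str]],
    fine_tune: Callable[[List[str]], None],
    score: Callable[[str], float],
    n_rounds: int = 5,
    n_samples: int = 20,
    keep_top: int = 5,
) -> List[str]:
    """Generic hill-climbing loop; returns the best molecules, best first."""
    best: List[str] = []
    for _ in range(n_rounds):
        # Merge new samples with the previous best so quality never regresses.
        candidates = set(sample(n_samples)) | set(best)
        best = sorted(candidates, key=score, reverse=True)[:keep_top]
        # Shift the model towards the current best molecules.
        fine_tune(best)
    return best


class ToyModel:
    """Toy stand-in for a generative model: extends strings from a pool."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.pool = ["C"]

    def sample(self, n: int) -> List[str]:
        return [self.rng.choice(self.pool) + self.rng.choice("CNO") for _ in range(n)]

    def fine_tune(self, molecules: List[str]) -> None:
        # "Training" here just replaces the pool with the selected molecules.
        self.pool = list(molecules)


if __name__ == "__main__":
    model = ToyModel(seed=0)
    count_oxygens = lambda s: float(s.count("O"))  # toy objective
    print(hill_climb(model.sample, model.fine_tune, count_oxygens))
```

In a real setting, sample and fine_tune would wrap the LSTM's sampling and training steps, and score would be the benchmark's scoring function.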

SMILES LSTM PPO

Long short-term memory (LSTM) on SMILES as described in: https://arxiv.org/abs/1701.01329

This implementation optimizes using the proximal policy optimization (PPO) algorithm.

Implementation by BenevolentAI

A pre-trained model is provided in: smiles_lstm/pretrained_model

To execute the goal-directed generation benchmarks:

python -m smiles_lstm_ppo.goal_directed_generation

Frag GT

Fragment-based evolutionary algorithm for generating molecules.

See the frag-gt README for installation instructions and a description.

Implementation by BenevolentAI

Pre-computed fragment libraries are available from Zenodo (https://zenodo.org/record/6038464)

To execute the goal-directed generation benchmarks:

python frag_gt/goal_directed_generation.py --fragstore_path frag_gt/data/fragment_libraries/guacamol_v1_all_fragstore_brics.pkl --smiles_file data/guacamol_v1_all.smiles

Change log

  • 15 Oct 2020: upgrade guacamol version to 0.5.3
  • 10 Nov 2021: upgrade guacamol version to 0.5.4. Migrate RDKit install conda->pip. Update dependencies.
  • 21 Feb 2022: addition of frag-gt baseline.
