
BGC Detection and Classification Using Deep Learning

DeepBGC: Biosynthetic Gene Cluster detection and classification

DeepBGC detects BGCs in bacterial and fungal genomes using deep learning. DeepBGC employs a Bidirectional Long Short-Term Memory Recurrent Neural Network and a word2vec-like vector embedding of Pfam protein domains. Product class and activity of detected BGCs are predicted using a Random Forest classifier.


DeepBGC architecture

πŸ“Œ News πŸ“Œ

  • DeepBGC 0.1.23: Predicted BGCs can now be uploaded for visualization in antiSMASH using a JSON output file
    • Install and run DeepBGC as usual based on instructions below
    • Upload antismash.json from the DeepBGC output folder using "Upload extra annotations" on the antiSMASH page
    • Predicted BGC regions and their prediction scores will be displayed alongside antiSMASH BGCs

Publications

A deep learning genome-mining strategy for biosynthetic gene cluster prediction
Geoffrey D Hannigan, David Prihoda et al., Nucleic Acids Research, gkz654, https://doi.org/10.1093/nar/gkz654

Install using conda (recommended)

You can install DeepBGC using Conda or one of the alternatives (Miniconda, Miniforge).

Set up Bioconda and Conda-Forge channels:

conda config --add channels bioconda
conda config --add channels conda-forge

Install DeepBGC using:

# Create a separate DeepBGC environment and install dependencies
conda create -n deepbgc python=3.7 hmmer prodigal

# Install DeepBGC into the environment using pip
conda activate deepbgc
pip install deepbgc

# Alternatively, install everything using conda (currently unstable due to conda conflicts)
conda install deepbgc

Install dependencies manually (if conda is not available)

If you don't mind installing the HMMER and Prodigal dependencies manually, you can also install DeepBGC using pip:
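As a minimal sketch (assuming the hmmer and prodigal binaries are already on your PATH):

```shell
# Install the DeepBGC Python package itself; HMMER and Prodigal must be installed separately
pip install deepbgc

# Optional sanity check that the external dependencies are available
hmmscan -h
prodigal -v
```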

Use DeepBGC

Download models and Pfam database

Before you can use DeepBGC, download the trained models and Pfam database:

deepbgc download

You can display downloaded dependencies and models using:

deepbgc info

Detection and classification

DeepBGC pipeline

Detect and classify BGCs in a genomic sequence. Proteins and Pfam domains are detected automatically if not already annotated (HMMER and Prodigal are required).

# Show command help docs
deepbgc pipeline --help

# Detect and classify BGCs in mySequence.fa using DeepBGC detector.
deepbgc pipeline mySequence.fa

# Detect and classify BGCs in mySequence.fa using custom DeepBGC detector trained on your own data.
deepbgc pipeline --detector path/to/myDetector.pkl mySequence.fa

This will produce a mySequence directory with multiple files and a README.txt with file descriptions.

See Train DeepBGC on your own data section below for more information about training a custom detector or classifier.

Example output

See the DeepBGC Example Result Notebook. Data can be downloaded from the releases page.

Detected BGC Regions

Train DeepBGC on your own data

You can train your own BGC detection and classification models, see deepbgc train --help for documentation and examples.
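As an illustrative sketch only (all file names here are hypothetical, and the exact flags should be confirmed against deepbgc train --help), training a custom detector might look like:

```shell
# Train a custom detector from positive (BGC) and negative Pfam TSV files.
# --config fills the #{PFAM2VEC} variable referenced in the JSON model template.
deepbgc train --model myDetector.json --output myDetector.pkl \
    --config PFAM2VEC pfam2vec.tsv \
    BGCs.pfam.tsv negatives.pfam.tsv
```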

Training and validation data can be found in release 0.1.0 and release 0.1.5.

If you have any questions about using or training DeepBGC, feel free to submit an issue.

Preparing training data

The training examples need to be provided in Pfam TSV format, which can be generated from your sequence using deepbgc prepare.

First, you will need to manually add an in_cluster column that contains 0 for pfams outside a BGC and 1 for pfams inside a BGC. We recommend preparing a separate negative TSV and positive TSV file, where the column is all 0s or all 1s, respectively.
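One way to append such a constant column to a tab-separated file is with awk (column placement and file names here are illustrative, not prescribed by DeepBGC):

```shell
# Append an in_cluster column: 1 for every row of the positive TSV, 0 for the negative one
awk 'BEGIN{FS=OFS="\t"} NR==1{print $0, "in_cluster"; next} {print $0, 1}' positives.pfam.tsv > positives.labeled.tsv
awk 'BEGIN{FS=OFS="\t"} NR==1{print $0, "in_cluster"; next} {print $0, 0}' negatives.pfam.tsv > negatives.labeled.tsv
```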

Finally, you will need to manually add a sequence_id column, which identifies a continuous sequence of Pfams from a single sample (a BGC or a negative sequence). Samples are shuffled during training to present the model with a random order of positive and negative samples; Pfams with the same sequence_id value are kept together. For example, if your training set contains multiple BGCs, the sequence_id column should contain the BGC ID.

! New in version 0.1.17 ! You can now prepare protein FASTA sequences into a Pfam TSV file using deepbgc prepare --protein.
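For example (file names are illustrative, and the output flag is an assumption — confirm with deepbgc prepare --help):

```shell
# Annotate a nucleotide FASTA with proteins and Pfam domains, writing a Pfam TSV
deepbgc prepare --output-tsv mySequence.pfam.tsv mySequence.fa

# Since 0.1.17: start from protein FASTA sequences instead
deepbgc prepare --protein --output-tsv myProteins.pfam.tsv myProteins.fa
```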

JSON model training template files

DeepBGC uses JSON template files to define model architecture and training parameters. All templates can be downloaded from release 0.1.0.

The JSON template for the DeepBGC LSTM detector with pfam2vec is structured as follows:

{
  "type": "KerasRNN", - Model architecture (KerasRNN/DiscreteHMM/GeneBorderHMM)
  "build_params": { - Parameters for model architecture
    "batch_size": 16, - Number of splits of training data that is trained in parallel 
    "hidden_size": 128, - Size of vector storing the LSTM inner state
    "stateful": true - Remember previous sequence when training next batch
  },
  "fit_params": {
    "timesteps": 256, - Number of pfam2vec vectors trained in one batch
    "validation_size": 0, - Fraction of training data to use for validation (if validation data is not provided explicitly). Use 0.2 for 20% data used for testing.
    "verbose": 1, - Verbosity during training
    "num_epochs": 1000, - Number of passes over your training set during training. You probably want to use a lower number if not using early stopping on validation data.
    "early_stopping" : { - Stop model training when at certain validation performance
      "monitor": "val_auc_roc", - Use validation AUC ROC to observe performance
      "min_delta": 0.0001, - Stop training when the improvement in the last epochs did not improve more than 0.0001
      "patience": 20, - How many of the last epochs to check for improvement
      "mode": "max" - Stop training when given metric stops increasing (use "min" for decreasing metrics like loss)
    },
    "shuffle": true, - Shuffle samples in each epoch. Will use "sequence_id" field to group pfam vectors belonging to the same sample and shuffle them together 
    "optimizer": "adam", - Optimizer algorithm
    "learning_rate": 0.0001, - Learning rate
    "weighted": true - Increase weight of less-represented class. Will give more weight to BGC training samples if the non-BGC set is larger.
  },
  "input_params": {
    "features": [ - Array of features to use in model, see deepbgc/features.py
      {
        "type": "ProteinBorderTransformer" - Add two binary flags for pfam domains found at beginning or at end of protein
      },
      {
        "type": "Pfam2VecTransformer", - Convert pfam_id field to pfam2vec vector using provided pfam2vec table
        "vector_path": "#{PFAM2VEC}" - PFAM2VEC variable is filled in using command line argument --config
      }
    ]
  }
}

The JSON template for the Random Forest classifier is structured as follows:

{
  "type": "RandomForestClassifier", - Type of classifier (RandomForestClassifier)
  "build_params": {
    "n_estimators": 100, - Number of trees in random forest
    "random_state": 0 - Random seed used to get same result each time
  },
  "input_params": {
    "sequence_as_vector": true, - Convert each sample into a single vector
    "features": [
      {
        "type": "OneHotEncodingTransformer" - Convert each sequence of Pfams into a single binary vector (Pfam set)
      }
    ]
  }
}

Using your trained model

Since version 0.1.10 you can provide a direct path to the detector or classifier model like so:

deepbgc pipeline \
    mySequence.fa \
    --detector path/to/myDetector.pkl \
    --classifier path/to/myClassifier.pkl 
