
DeepFRI

Deep functional residue identification

Citing

@article {Gligorijevic2019,
	author = {Gligorijevic, Vladimir and Renfrew, P. Douglas and Kosciolek, Tomasz and Leman,
	Julia Koehler and Cho, Kyunghyun and Vatanen, Tommi and Berenberg, Daniel
	and Taylor, Bryn and Fisk, Ian M. and Xavier, Ramnik J. and Knight, Rob and Bonneau, Richard},
	title = {Structure-Based Function Prediction using Graph Convolutional Networks},
	year = {2019},
	doi = {10.1101/786236},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2019/10/04/786236},
	journal = {bioRxiv}
}

Dependencies

DeepFRI is tested to work under Python 3.7.

The required dependencies for DeepFRI are TensorFlow, Biopython and scikit-learn. To install all dependencies run:

pip install .

Protein function prediction

To predict protein functions, use the predict.py script with the following options:

  • seq str, Protein sequence as a string
  • cmap str, Name of a file storing a protein contact map and sequence in *.npz file format (with the following numpy array variables: C_alpha, seqres. See examples/pdb_cmaps/)
  • pdb str, Name of a PDB file (cleaned)
  • pdb_dir str, Directory with cleaned PDB files (see examples/pdb_files/)
  • cmap_csv str, Filename of the catalogue (in *.csv file format) containing the mapping between protein names and the directory with *.npz files (see examples/catalogue_pdb_chains.csv)
  • fasta_fn str, Fasta filename (see examples/pdb_chains.fasta)
  • model_config str, JSON file with model filenames (see trained_models/)
  • ont str, Ontology (mf - Molecular Function, bp - Biological Process, cc - Cellular Component, ec - Enzyme Commission)
  • output_fn_prefix str, Output filename prefix (the same prefix is used for the predictions and saliency files)
  • verbose bool, Whether or not to print function prediction results
  • saliency bool, Whether or not to compute class activation maps (outputs a *.json file)

Generated files (see examples/outputs/):

  • output_fn_prefix_MF_predictions.csv Predictions in the *.csv file format with columns: Protein, GO-term/EC-number, Score, GO-term/EC-number name
  • output_fn_prefix_MF_pred_scores.json Predictions in the *.json file with keys: pdb_chains, Y_hat, goterms, gonames
  • output_fn_prefix_MF_saliency_maps.json JSON file storing a dictionary of saliency maps for each predicted function of every protein
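
A minimal sketch of consuming the predictions table, using only the standard library. It assumes the *.csv file is comma-separated with the four columns in the order listed above. The helper name top_predictions and the 0.5 cutoff are illustrative and not part of DeepFRI, and the sample rows are copied from the Option 3 output further down. Columns are read by position, and lines starting with "#" are skipped in case the real file carries a comment line.

```python
import csv
import io

def top_predictions(csv_text, min_score=0.5):
    """Return (protein, term, score, name) tuples with score >= min_score."""
    rows = []
    reader = csv.reader(io.StringIO(csv_text))
    for row in reader:
        # Skip blank lines, comment lines, and the header row.
        if not row or row[0].startswith("#") or row[0] == "Protein":
            continue
        # Columns are read by position (assumed order: protein, term, score, name).
        protein, term, score, name = row[0], row[1], float(row[2]), row[3]
        if score >= min_score:
            rows.append((protein, term, score, name))
    # Highest-scoring predictions first.
    return sorted(rows, key=lambda r: r[2], reverse=True)

# Sample rows written as CSV (values taken from the Option 3 example output).
sample = """Protein,GO-term/EC-number,Score,GO-term/EC-number name
1S3P-A,GO:0005509,0.99769,calcium ion binding
2J9H-A,GO:0004364,0.46937,glutathione transferase activity
2PE5-B,GO:0003677,0.53502,DNA binding
"""

for hit in top_predictions(sample, min_score=0.5):
    print(hit)
```

To read a real file, pass its contents, for example top_predictions(open(path).read()), where path points at a *_predictions.csv file.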

DeepFRI offers 6 possible options for predicting functions. See examples below.

Option 1: predicting functions of a protein from its contact map

Example: predicting MF-GO terms for Parvalbumin alpha protein using its sequence and contact map (PDB: 1S3P):

>> python predict.py --cmap ./examples/pdb_cmaps/1S3P-A.npz -ont mf --verbose

Output:

Protein GO-term/EC-number Score GO-term/EC-number name
query_prot GO:0005509 0.99824 calcium ion binding

Option 2: predicting functions of a protein from its sequence

Example: predicting MF-GO terms for Parvalbumin alpha protein using its sequence (PDB: 1S3P):

>> python predict.py --seq 'SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKDGFIDEDELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES' -ont mf --verbose

Output:

Protein GO-term/EC-number Score GO-term/EC-number name
query_prot GO:0005509 0.99769 calcium ion binding
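
When the sequence comes from another program rather than being typed by hand, the command above can be assembled in Python. The sketch below uses only the flags shown in this README (--seq, -ont, --verbose). The helper name build_predict_cmd is hypothetical, and restricting input to the 20 standard amino-acid letters is an assumption made here for early error checking, not a documented DeepFRI requirement.

```python
import shlex

VALID_ONTOLOGIES = {"mf", "bp", "cc", "ec"}
# Assumption: only the 20 standard amino-acid letters are accepted here.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def build_predict_cmd(seq, ont="mf", verbose=True):
    """Build the argv list for `python predict.py --seq ... -ont ...`."""
    seq = seq.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - AMINO_ACIDS
    if unknown:
        raise ValueError(f"unexpected residue letters: {sorted(unknown)}")
    if ont not in VALID_ONTOLOGIES:
        raise ValueError(f"unknown ontology: {ont!r}")
    cmd = ["python", "predict.py", "--seq", seq, "-ont", ont]
    if verbose:
        cmd.append("--verbose")
    return cmd

cmd = build_predict_cmd(
    "SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKDGFIDEDELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES"
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be handed to subprocess.run(cmd, check=True) from the DeepFRI project directory. Passing a list rather than a single string avoids shell quoting problems.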

Option 3: predicting functions of proteins from a fasta file

>> python predict.py --fasta_fn examples/pdb_chains.fasta -ont mf -v

Output:

Protein GO-term/EC-number Score GO-term/EC-number name
1S3P-A GO:0005509 0.99769 calcium ion binding
2J9H-A GO:0004364 0.46937 glutathione transferase activity
2J9H-A GO:0016765 0.19910 transferase activity, transferring alkyl or aryl (other than methyl) groups
2J9H-A GO:0097367 0.10537 carbohydrate derivative binding
2PE5-B GO:0003677 0.53502 DNA binding
2W83-E GO:0032550 0.99260 purine ribonucleoside binding
2W83-E GO:0001883 0.99242 purine nucleoside binding
2W83-E GO:0005525 0.99231 GTP binding
2W83-E GO:0019001 0.99222 guanyl nucleotide binding
2W83-E GO:0032561 0.99194 guanyl ribonucleotide binding
2W83-E GO:0032549 0.99149 ribonucleoside binding
2W83-E GO:0001882 0.99135 nucleoside binding
2W83-E GO:0017076 0.98687 purine nucleotide binding
2W83-E GO:0032555 0.98641 purine ribonucleotide binding
2W83-E GO:0035639 0.98611 purine ribonucleoside triphosphate binding
2W83-E GO:0032553 0.98573 ribonucleotide binding
2W83-E GO:0097367 0.98168 carbohydrate derivative binding
2W83-E GO:0003924 0.52355 GTPase activity
2W83-E GO:0016817 0.36863 hydrolase activity, acting on acid anhydrides
2W83-E GO:0016818 0.36683 hydrolase activity, acting on acid anhydrides, in phosphorus-containing anhydrides
2W83-E GO:0017111 0.35465 nucleoside-triphosphatase activity
2W83-E GO:0016462 0.35303 pyrophosphatase activity
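
Before running Option 3 on your own file, it can help to check which record IDs and sequences the FASTA file contains, since the IDs appear in the Protein column of the output. The sketch below is a small standard-library parser. Biopython, which is already a dependency, offers Bio.SeqIO for the same job. The function name read_fasta and the two toy records are illustrative only.

```python
def read_fasta(lines):
    """Return {record_id: sequence} from an iterable of FASTA lines."""
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            records[current] = []
        elif current is not None:
            # Sequence lines may be wrapped; collect them and join at the end.
            records[current].append(line)
    return {rid: "".join(parts) for rid, parts in records.items()}

# Toy records (made-up sequences, not taken from examples/pdb_chains.fasta).
toy = [
    ">toy-A first toy chain",
    "SMTDLLSAED",
    "IKKAIGAF",
    ">toy-B",
    "MKVLAAG",
]

for rid, seq in read_fasta(toy).items():
    print(rid, len(seq), seq)
```

For a file on disk, pass an open file handle: with open(path) as fh: records = read_fasta(fh).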

Option 4: predicting functions of proteins from contact map catalogue

>> python predict.py --cmap_csv examples/catalogue_pdb_chains.csv -ont mf -v

Output:

Protein GO-term/EC-number Score GO-term/EC-number name
1S3P-A GO:0005509 0.99824 calcium ion binding
2J9H-A GO:0004364 0.84826 glutathione transferase activity
2J9H-A GO:0016765 0.82014 transferase activity, transferring alkyl or aryl (other than methyl) groups
2PE5-B GO:0003677 0.89086 DNA binding
2PE5-B GO:0017111 0.12892 nucleoside-triphosphatase activity
2PE5-B GO:0004386 0.12847 helicase activity
2PE5-B GO:0032553 0.12091 ribonucleotide binding
2PE5-B GO:0097367 0.11961 carbohydrate derivative binding
2PE5-B GO:0016887 0.11331 ATPase activity
2W83-E GO:0097367 0.97069 carbohydrate derivative binding
2W83-E GO:0019001 0.96842 guanyl nucleotide binding
2W83-E GO:0017076 0.96737 purine nucleotide binding
2W83-E GO:0001882 0.96473 nucleoside binding
2W83-E GO:0035639 0.96439 purine ribonucleoside triphosphate binding
2W83-E GO:0032555 0.96294 purine ribonucleotide binding
2W83-E GO:0016818 0.96181 hydrolase activity, acting on acid anhydrides, in phosphorus-containing anhydrides
2W83-E GO:0032550 0.96142 purine ribonucleoside binding
2W83-E GO:0016817 0.96082 hydrolase activity, acting on acid anhydrides
2W83-E GO:0016462 0.95998 pyrophosphatase activity
2W83-E GO:0032553 0.95935 ribonucleotide binding
2W83-E GO:0032561 0.95930 guanyl ribonucleotide binding
2W83-E GO:0032549 0.95877 ribonucleoside binding
2W83-E GO:0003924 0.95453 GTPase activity
2W83-E GO:0001883 0.95271 purine nucleoside binding
2W83-E GO:0005525 0.94635 GTP binding
2W83-E GO:0017111 0.93942 nucleoside-triphosphatase activity
2W83-E GO:0044877 0.64519 protein-containing complex binding
2W83-E GO:0001664 0.31413 G protein-coupled receptor binding
2W83-E GO:0005102 0.20078 signaling receptor binding

Option 5: predicting functions of a protein from a PDB file

>> python predict.py -pdb ./examples/pdb_files/1S3P-A.pdb -ont mf -v

Output:

Protein GO-term/EC-number Score GO-term/EC-number name
query_prot GO:0005509 0.99824 calcium ion binding

Option 6: predicting functions of a protein from a directory with PDB files

>> python predict.py --pdb_dir ./examples/pdb_files -ont mf --saliency --use_backprop

Output:

See files in: examples/outputs/

Training DeepFRI

To train DeepFRI, use train_DeepFRI.py from the project directory. The following command lists its options:

>> python train_DeepFRI.py -h

or to launch jobs run the following script:

>> ./run_train_DeepFRI.sh

Output

Generated files:

  • model_name_prefix_ont_model.hdf5 trained model with architecture and weights saved in HDF5 format
  • model_name_prefix_ont_pred_scores.pckl pickle file with predicted GO term/EC number scores for test proteins
  • model_name_prefix_ont_model_params.json JSON file with metadata (GO terms/names, architecture params, etc.)

See examples of pre-trained models (*.hdf5) and model params (*.json) in: trained_models/.

Functional residue identification

To visualize class activation (saliency) maps, use the viz_gradCAM.py script with the following options:

  • saliency_fn str, JSON filename with saliency maps generated by predict.py script (see Option 6 above)
  • list_all bool, list all proteins and their predicted GO terms with corresponding class activation (saliency) maps
  • protein_id str, protein (PDB chain) whose saliency maps are to be visualized for each predicted function
  • go_id str, GO term whose saliency maps are to be visualized
  • go_name str, GO name whose saliency maps are to be visualized

Generated files:

  • saliency_fig_PDB-chain_GOterm.png class activation (saliency) map profile over sequence (see fig below, right)
  • pymol_viz.py pymol script for mapping salient residues onto 3D structure (pymol output is shown in fig below, left)
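
A saliency profile assigns one score per residue, so the most salient positions can also be listed as text. The sketch below assumes the profile is available as a plain list of floats aligned with the sequence. The exact layout of the saliency *.json file is not described in this README, so the helper top_salient_residues works on already-extracted values, and the numbers shown are toy values rather than DeepFRI output.

```python
def top_salient_residues(sequence, saliency, k=5):
    """Return the k highest-scoring residues as (position, letter, score).

    Positions are 1-based, matching the usual residue numbering of a sequence.
    """
    if len(sequence) != len(saliency):
        raise ValueError("sequence and saliency must have the same length")
    # Rank residue indices by score, highest first (ties keep sequence order).
    ranked = sorted(enumerate(saliency), key=lambda pair: pair[1], reverse=True)
    return [(i + 1, sequence[i], score) for i, score in ranked[:k]]

# Toy values, not real DeepFRI output.
seq = "DKDGDG"
scores = [0.10, 0.20, 0.90, 0.30, 0.80, 0.05]
print(top_salient_residues(seq, scores, k=2))
```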

Example:

>> python viz_gradCAM.py -i ./examples/outputs/DeepFRI_MF_saliency_maps.json -p 1S3P-A -go GO:0005509

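Output: [Figure: PyMOL rendering of salient residues mapped onto the 3D structure (left); class activation (saliency) map profile over the sequence (right)]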

Data

The training and validation data used to train the DeepFRI model are provided as TensorFlow TFRecord files and can be downloaded from:

  • PDB: Gene Ontology (19GB), Enzyme Commission (13GB)
  • SWISS-MODEL: Gene Ontology (165GB), Enzyme Commission (117GB)

Pretrained models

Pretrained models can be downloaded from:

  • Models (use these models if you run DeepFRI on GPU)
  • Newest Models (use these models if you run DeepFRI on CPU)

Uncompress the tar.gz file into the DeepFRI directory (tar xvzf trained_models.tar.gz -C /path/to/DeepFRI).
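
On systems without a tar command, the same extraction can be done from Python. This is a sketch using the standard tarfile module. The function name extract_models is hypothetical, and the archive name and destination in the usage comment mirror the placeholders in the command above.

```python
import tarfile
from pathlib import Path

def extract_models(archive, dest):
    """Extract a .tar.gz archive into dest and return the member names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        # Newer Python releases support extraction filters that reject
        # unsafe paths; use the "data" filter when it is available.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return names

# Usage (paths are placeholders, as in the tar command above):
# extract_models("trained_models.tar.gz", "/path/to/DeepFRI")
```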
