  • Language
    Python
  • License
    MIT License

DrugCell: a visible neural network model for drug response prediction

DrugCell is an interpretable neural network model that predicts cell response to a wide range of drugs. Unlike fully connected neural networks, the connectivity of neurons in DrugCell mirrors a biological hierarchy (e.g., Gene Ontology), so that during training information travels only between subsystems (or pathways) with a known hierarchical relationship. This design makes it possible to identify the subsystems in the hierarchy that are important to the model's prediction, which can then be investigated further for the underlying biological mechanisms of cell response to treatment.

The current version (v1.0) of the DrugCell model was trained using 509,294 (cell line, drug) pairs across 1,235 tumor cell lines and 684 drugs. The training data were retrieved from the Genomics of Drug Sensitivity in Cancer (GDSC) database and the Cancer Therapeutics Response Portal (CTRP) v2.

DrugCell characterizes each cell line using its genotype; the feature vector for each cell is a binary vector representing mutational status of the top 15% most frequently mutated genes (n = 3,008) in cancer. Drugs are encoded using Morgan Fingerprint (radius = 2), and the resulting feature vectors are binary vectors of length 2,048.
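
The genotype encoding described above can be illustrated with a minimal sketch. The function name `encode_genotype` and the three-gene toy mapping are hypothetical and used only for illustration; the real mapping covers 3,008 genes. The sketch also assumes that gene indices start at 0 and are contiguous.

```python
def encode_genotype(mutated_genes, gene2ind):
    """Return a binary list with 1 at the index of every mutated gene."""
    vector = [0] * len(gene2ind)
    for gene in mutated_genes:
        index = gene2ind.get(gene)
        if index is not None:  # genes outside the mapping are ignored
            vector[index] = 1
    return vector


# Toy mapping for illustration only (hypothetical indices).
toy_gene2ind = {"AKT2": 0, "IL1B": 1, "SRC": 2}

# TP53 is not in the toy mapping, so it does not contribute a bit.
toy_vector = encode_genotype(["SRC", "AKT2", "TP53"], toy_gene2ind)
print(toy_vector)  # [1, 0, 1]
```

Ignoring genes that are absent from the mapping keeps the vector length fixed, which matters because the model expects the same input width for every cell.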

Environment setup for training and testing DrugCell

DrugCell training/testing scripts require the following environmental setup:

  • Hardware required for training a new model

    • GPU server with CUDA>=10 installed
  • Software

    • Python 2.7 or >=3.6
    • Anaconda
    • PyTorch >=0.4
      • Depending on the specification of your machine, run the appropriate command to install PyTorch. The installation command can be found at https://pytorch.org/. Select Conda as the package type.
      • Example 1: if you are working with a CPU machine running on MAC OS X, execute the following command line:
      conda install pytorch torchvision -c pytorch
      
      • Example 2: for a LINUX machine without GPUs, run the following command line:
      conda install pytorch torchvision cpuonly -c pytorch
      
      • Example 3: for a LINUX-based GPU server with CUDA version 10.1, run the following command line:
      conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
      
    • networkx
    • numpy
  • Set up a virtual environment

    • If you are testing the pre-trained model using a CPU machine, run the following command line to set up an appropriate virtual environment (pytorch3drugcellcpu) using the .yml files in environment_setup.
      • MAC OS X
      conda env create -f environment_cpu_mac.yml
      
      • LINUX
      conda env create -f environment_cpu_linux.yml
      
    • If you are training a new model or testing the pre-trained model on a GPU server, run the following command line to set up a virtual environment (pytorch3drugcell).
       conda env create -f environment.yml
      
    • After setting up the conda virtual environment, make sure to activate it before executing DrugCell scripts. When testing in the sample directory, you do not need to run this command because the example bash scripts already include it.
      source activate pytorch3drugcell (or pytorch3drugcellcpu)
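
Once the environment is active, a quick check that the required packages are visible to the interpreter can save a failed run. The following is a minimal sketch, assuming Python 3; the helper name `missing_modules` is hypothetical and not part of DrugCell.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages listed in the software requirements above.
required = ["torch", "networkx", "numpy"]

print("Python version:", sys.version.split()[0])
print("Missing modules:", missing_modules(required) or "none")
```

The check only looks the modules up without importing them, so it runs quickly even on machines where importing PyTorch is slow.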
      

DrugCell release v1.0

DrugCell v1.0 was trained using (cell line, drug) pairs, but it can be generalized to estimate the response of any cell to any drug if:

  1. The cell feature vector is built as a binary vector representing the mutational status of 3,008 genes (the list of gene indices and names is provided in gene2ind.txt).
  2. The drug feature vector is encoded as a binary vector of length 2,048 using Morgan Fingerprint (radius = 2). We also provide pre-computed feature vectors for the 684 drugs in our training data (drug2fingerprint.txt).

The pre-trained DrugCell v1.0 model and the drug response data for the 509,294 (cell line, drug) pairs used to train it are available at http://drugcell.ucsd.edu/downloads.

Required input files:

  1. Cell feature files: gene2ind.txt, cell2ind.txt, cell2mutation.txt
    • gene2ind.txt: make sure you are using the gene2ind.txt file provided in this repository.
    • cell2ind.txt: a tab-delimited file where the 1st column is the index of each cell and the 2nd column is the name of the cell (genotype).
    • cell2mutation.txt: a comma-delimited file where each row has 3,008 binary values indicating whether each gene is mutated (1) or not (0). The column index of each gene should match its index in gene2ind.txt. The line number of each row should match the index of the corresponding cell in cell2ind.txt.
  2. Drug feature files: drug2ind.txt, drug2fingerprint.txt
    • drug2ind.txt: a tab-delimited file where the 1st column is the index of each drug and the 2nd column is the identifier of the drug (e.g., SMILES representation or name). The drug identifiers should match those in the drug2fingerprint.txt file.
    • drug2fingerprint.txt: a comma-delimited file where each row has 2,048 binary values that together form the Morgan Fingerprint representation of a drug. The line number of each row should match the index of the corresponding drug in drug2ind.txt.
  3. Test data file: drugcell_test.txt
    • A tab-delimited file containing all data points that you want to estimate drug response for. The 1st column is identification of cells (genotypes) and the 2nd column is identification of drugs.
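
Before running a prediction, it can help to confirm that an index file and its feature matrix agree. The following is a minimal sketch using in-memory toy lines rather than real files; the helper names `load_index` and `load_binary_matrix` are hypothetical, and the sketch assumes that indices are 0-based and that matrix rows are ordered by index.

```python
def load_index(lines):
    """Parse tab-delimited 'index<TAB>name' lines into a name -> index dict."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        index, name = line.split("\t")
        mapping[name] = int(index)
    return mapping


def load_binary_matrix(lines, expected_width):
    """Parse comma-delimited 0/1 rows, checking row width and values."""
    rows = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        values = [int(value) for value in line.split(",")]
        if len(values) != expected_width:
            raise ValueError(
                f"line {line_number}: {len(values)} values, expected {expected_width}"
            )
        if any(value not in (0, 1) for value in values):
            raise ValueError(f"line {line_number}: non-binary value found")
        rows.append(values)
    return rows


# Toy stand-ins for cell2ind.txt and cell2mutation.txt (3 genes, not 3,008).
toy_cell2ind = ["0\tCELL_A", "1\tCELL_B"]
toy_cell2mutation = ["1,0,1", "0,0,1"]

cells = load_index(toy_cell2ind)
matrix = load_binary_matrix(toy_cell2mutation, expected_width=3)

# Every indexed cell needs exactly one feature row.
if len(matrix) != len(cells):
    raise ValueError("cell index and mutation matrix have different sizes")

features_of_cell_b = matrix[cells["CELL_B"]]
print(features_of_cell_b)  # [0, 0, 1]
```

The same two helpers would apply to the drug files by passing an expected width of 2,048.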

To load the pre-trained model used for the analyses in our manuscript and make predictions for (cell, drug) pairs of interest, do the following:

  1. Make sure you have gene2ind.txt, cell2ind.txt, cell2mutation.txt, drug2ind.txt, drug2fingerprint.txt, and your test data file in the proper format (examples are provided in the data and sample folders).

  2. To run the model on a GPU server, execute the following:

    python predict_drugcell.py -gene2id gene2ind.txt \
                                   -cell2id cell2ind.txt \
                                   -drug2id drug2ind.txt \
                                   -genotype cell2mutation.txt \
                                   -fingerprint drug2fingerprint.txt \
                                   -predict testdata.txt \
                                   -hidden <path_to_directory_to_store_hidden_values> \
                                   -result <path_to_directory_to_store_prediction_results> \
                                   -load <path_to_model_file> \
                                   -cuda <GPU_unit_to_use>    # optional
    
    • An example bash script (commandline_test_gpu.sh) is provided in the sample folder.
  3. To load and test the DrugCell model on a CPU, run predict_drugcell_cpu.py (instead of predict_drugcell.py) with the same set of parameters as in step 2. The -cuda option is not available in this scenario.
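
If you launch predictions from Python rather than from a bash script, the command can be assembled as an argument list. The following is a minimal sketch that only builds the list and does not run anything. The function name `build_predict_command` and the placeholder paths `hidden_dir`, `result_dir`, and `model_file` are hypothetical; the flags are the ones listed above.

```python
def build_predict_command(files, hidden_dir, result_dir, model_file,
                          cuda=None, use_cpu=False):
    """Build the argument list for the GPU or CPU prediction script."""
    if use_cpu and cuda is not None:
        raise ValueError("-cuda is not available with the CPU script")
    script = "predict_drugcell_cpu.py" if use_cpu else "predict_drugcell.py"
    command = ["python", script]
    # Input files, in the same order as the command line shown above.
    for flag in ("gene2id", "cell2id", "drug2id", "genotype", "fingerprint", "predict"):
        command += ["-" + flag, files[flag]]
    command += ["-hidden", hidden_dir, "-result", result_dir, "-load", model_file]
    if cuda is not None:  # -cuda is optional
        command += ["-cuda", str(cuda)]
    return command


toy_files = {
    "gene2id": "gene2ind.txt",
    "cell2id": "cell2ind.txt",
    "drug2id": "drug2ind.txt",
    "genotype": "cell2mutation.txt",
    "fingerprint": "drug2fingerprint.txt",
    "predict": "testdata.txt",
}

gpu_command = build_predict_command(toy_files, "hidden_dir", "result_dir", "model_file", cuda=0)
cpu_command = build_predict_command(toy_files, "hidden_dir", "result_dir", "model_file", use_cpu=True)
print(" ".join(gpu_command))
```

The resulting list can be passed to `subprocess.run` once the placeholder paths are replaced with real ones.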

Train a new DrugCell model

To train a new DrugCell model using a custom data set, first make sure that you have a proper virtual environment set up. Also make sure that you have all the required files to run the training scripts:

  1. Cell feature files: gene2ind.txt, cell2ind.txt, cell2mutation.txt

    • A detailed description of the file contents is given in the DrugCell release v1.0 section.
  2. Drug feature files: drug2ind.txt, drug2fingerprint.txt

    • A detailed description of the file contents is given in the DrugCell release v1.0 section.
  3. Training data file: drugcell_train.txt

    • A tab-delimited file containing all data points that you want to use to train the model. The 1st column is the identifier of the cell (genotype), the 2nd column is the identifier of the drug, and the 3rd column is the observed drug response as a floating-point number. The current version of the DrugCell code uses a loss function suited to regression (mean squared error; MSE), so we recommend using the code to train a regressor rather than a classifier.
  4. Validation data file: drugcell_val.txt

    • A tab-delimited file in the same format as the training data. The DrugCell training script evaluates the model after each training iteration using the data in this file. Model performance on the validation data may be used as an early-termination condition.
  5. Ontology (hierarchy) file: drugcell_ont.txt

    • A tab-delimited file that contains the ontology (hierarchy) defining the structure of the branch of the DrugCell model that encodes genotypes. The first column is always a term (subsystem or pathway), and the second column is either a term or a gene. The third column should be set to "default" when the line represents a link between two terms and to "gene" when the line represents an annotation link between a term and a gene. The following example describes a sample hierarchy.

     GO:0045834	GO:0045923	default
     GO:0045834	GO:0043552	default
     GO:0045923	AKT2	gene
     GO:0045923	IL1B	gene
     GO:0043552	PIK3R4	gene
     GO:0043552	SRC	gene
     GO:0043552	FLT1	gene       
    
    • An example file (drugcell_ont.txt) is provided in the data folder.
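
To make the ontology format concrete, the following minimal sketch parses the sample hierarchy shown above into child-term and gene-annotation maps. The function names are hypothetical. The input-width calculation is an assumption made for illustration: it supposes that each term receives a fixed number of hidden values from each child term plus one input per directly annotated gene, which is not taken from the DrugCell source.

```python
from collections import defaultdict


def parse_ontology(lines):
    """Split ontology lines into term -> child terms and term -> genes maps."""
    child_terms = defaultdict(list)
    direct_genes = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        parent, child, link_type = fields
        if link_type == "gene":
            direct_genes[parent].append(child)
        else:  # "default" marks a term-to-term link
            child_terms[parent].append(child)
    return child_terms, direct_genes


def term_input_width(term, child_terms, direct_genes, hidden_per_term=6):
    """Assumed input width: hidden values from child terms plus direct genes."""
    n_children = len(child_terms.get(term, []))
    n_genes = len(direct_genes.get(term, []))
    return hidden_per_term * n_children + n_genes


sample_ontology = [
    "GO:0045834\tGO:0045923\tdefault",
    "GO:0045834\tGO:0043552\tdefault",
    "GO:0045923\tAKT2\tgene",
    "GO:0045923\tIL1B\tgene",
    "GO:0043552\tPIK3R4\tgene",
    "GO:0043552\tSRC\tgene",
    "GO:0043552\tFLT1\tgene",
]

child_terms, direct_genes = parse_ontology(sample_ontology)
root_width = term_input_width("GO:0045834", child_terms, direct_genes)
leaf_width = term_input_width("GO:0043552", child_terms, direct_genes)
print(root_width, leaf_width)  # 12 3
```

In the sample hierarchy, GO:0045834 has two child terms and no direct genes, while GO:0043552 has three directly annotated genes and no child terms.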

There are a few optional parameters that you can provide in addition to the input files:

  1. -model: the name of the directory where you want to store the trained models. The default is "MODEL" in the current working directory.

  2. -genotype_hiddens: the number of neurons to assign to each subsystem in the hierarchy. The default is 6.

  3. -drug_hiddens: a string listing the number of neurons in each layer of the drug-encoding branch of DrugCell. The numbers should be comma-delimited. The default value is "100,50,6"; with this default, the drug branch of the resulting DrugCell model is a fully connected neural network with 3 layers of 100, 50, and 6 neurons.

  4. -final_hiddens: the number of neurons in the top layer of DrugCell that combines the genotype-encoding and the drug-encoding branches. The default is 6.

  5. -epoch: the number of epochs to run during the training phase. The default is 300.

  6. -batchsize: the size of each batch to process at a time. The default is 5000. You may increase this number to speed up training, within the memory capacity of your GPU server.

  7. -cuda: the ID of GPU unit that you want to use for the model training. The default setting is to use GPU 0.
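
As an illustration of how the -drug_hiddens string maps to layer sizes, the following minimal sketch turns it into (input, output) pairs for a fully connected stack. The function name `drug_layer_shapes` is hypothetical, and the sketch assumes the first layer takes the 2,048-bit drug fingerprint as input.

```python
def drug_layer_shapes(drug_hiddens, input_size=2048):
    """Turn a string such as "100,50,6" into (in_features, out_features) pairs."""
    sizes = [int(size) for size in drug_hiddens.split(",")]
    shapes = []
    previous = input_size
    for size in sizes:
        shapes.append((previous, size))
        previous = size  # the next layer consumes this layer's output
    return shapes


default_shapes = drug_layer_shapes("100,50,6")
print(default_shapes)  # [(2048, 100), (100, 50), (50, 6)]
```

With the default value, the drug branch therefore narrows from 2,048 inputs to 6 outputs over three layers.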

Finally, to train a DrugCell model, execute a command line similar to the example provided in sample/commandline_cuda.sh:

python -u train_drugcell.py -onto drugcell_ont.txt \
                            -gene2id gene2ind.txt \
                            -cell2id cell2ind.txt \
                            -drug2id drug2ind.txt \
                            -genotype cell2mutation.txt \
                            -fingerprint drug2fingerprint.txt \
                            -train drugcell_train.txt \
                            -test drugcell_val.txt \
                            -model ./MODEL \
                            -genotype_hiddens 6 \
                            -drug_hiddens "100,50,6" \
                            -final_hiddens 6 \
                            -epoch 100 \
                            -batchsize 5000 \
                            -cuda 1

Example data files in sample directory

Three subsets of our training data are provided as toy examples: drugcell_train.txt, drugcell_test.txt, and drugcell_val.txt contain 10,000, 1,000, and 1,000 (cell line, drug) pairs, respectively, along with the corresponding drug response (area under the dose-response curve).
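
To show how a file in this three-column format can be read and scored, the following minimal sketch parses tab-delimited lines and computes the mean squared error of a constant baseline. The helper names, the cell and drug identifiers, and the response values are hypothetical toy data, not taken from the sample files.

```python
def parse_response_lines(lines):
    """Parse 'cell<TAB>drug<TAB>response' lines into (cell, drug, float) tuples."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        cell, drug, response = line.split("\t")
        records.append((cell, drug, float(response)))
    return records


def mean_squared_error(observed, predicted):
    """Average of squared differences between observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical toy lines in the same layout as drugcell_train.txt.
toy_lines = [
    "CELL_A\tDRUG_1\t0.5",
    "CELL_A\tDRUG_2\t1.0",
    "CELL_B\tDRUG_1\t0.0",
]

records = parse_response_lines(toy_lines)
observed = [response for _, _, response in records]

# Baseline: predict the mean observed response for every pair.
baseline = sum(observed) / len(observed)
baseline_mse = mean_squared_error(observed, [baseline] * len(observed))
print(round(baseline_mse, 4))
```

A constant baseline like this gives a reference error that a trained model should improve on for the same data split.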
