A hybrid convolutional and recurrent neural network for predicting the function of DNA sequences

README for DanQ

DanQ is a hybrid convolutional and recurrent neural network model for predicting the function of DNA de novo from sequence.

Citing DanQ

Quang, D. and Xie, X. "DanQ: a hybrid convolutional and recurrent neural network for predicting the function of DNA sequences", NAR, 2015.

INSTALL

DanQ depends on several bleeding-edge software packages, and these packages are often not backwards compatible after an update. I have therefore listed the version numbers of the packages in the configuration that worked for me. For the record, I am using Ubuntu Linux 14.04 LTS with an NVIDIA Titan Z GPU.

Required

  • [Python](https://www.python.org) (2.7.10). The easiest way to install Python and all of the necessary dependencies is to download and install [Anaconda](https://www.continuum.io) (2.3.0). I listed the versions of Python and Anaconda I used, but the latest versions should be fine. If you're curious which Anaconda packages are used, they are [numpy](http://www.numpy.org/) (1.10.1), [scipy](http://www.scipy.org/) (0.16.0), and [h5py](http://www.h5py.org) (2.5.0).
  • [Theano](https://github.com/Theano/Theano) (latest). At the time of writing, Anaconda already includes Theano 0.7.0, but that release is missing some crucial helper functions. You need to git clone the latest bleeding-edge version, which has no version number of its own:
$ git clone https://github.com/Theano/Theano.git
$ cd Theano
$ python setup.py develop
  • [keras](https://github.com/fchollet/keras/releases/tag/0.2.0) (0.2.0). Deep learning package that uses the Theano backend. I'm in the process of upgrading to version 0.3.0 with the TensorFlow backend.

  • [seya](https://github.com/EderSantana/seya) (???). I had to modify the source code of this package slightly. You can try getting the latest version from GitHub, but for your convenience I've uploaded my copy of the package. You can install it as follows:

$ tar zxvf DanQ_seya.tar.gz
$ cd DanQ_seya
$ python setup.py install

I will likely improve DanQ soon and drop the dependency on seya.
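Because the packages above are sensitive to version changes, it can help to compare what is installed against the versions quoted in the list before training. The following is a minimal sketch, not part of DanQ: the helper names `installed_version` and `report_versions` are hypothetical, and the reference versions are simply copied from the list above.

```python
import importlib

# Versions quoted in the list above; treat them as a known-good reference,
# not as a strict requirement.
REFERENCE_VERSIONS = {
    "numpy": "1.10.1",
    "scipy": "0.16.0",
    "h5py": "2.5.0",
}


def installed_version(module_name):
    """Return the module's __version__ string, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def report_versions(reference=REFERENCE_VERSIONS):
    """Return (name, installed, reference) tuples, sorted by package name."""
    return [
        (name, installed_version(name), wanted)
        for name, wanted in sorted(reference.items())
    ]


if __name__ == "__main__":
    for name, found, wanted in report_versions():
        print("%-6s installed=%s reference=%s" % (name, found or "missing", wanted))
```

A mismatch does not necessarily mean DanQ will fail; it only tells you where your environment differs from the one described here.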

Optional

USAGE

You first need to download the training, validation, and testing sets from DeepSEA. You can download the datasets from [here](http://deepsea.princeton.edu/media/code/deepsea_train_bundle.v0.9.tar.gz). After you have extracted the contents of the tar.gz file, move the three .mat files into the data/ folder.
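The extract-and-move step can also be scripted. This is a minimal sketch using only the Python 3 standard library, run separately from the DanQ scripts; the function name `stage_mat_files` and the scratch directory `deepsea_extracted` are hypothetical, and the code makes no assumption about how the archive lays out its folders.

```python
import shutil
import tarfile
from pathlib import Path


def stage_mat_files(bundle_path, data_dir="data", work_dir="deepsea_extracted"):
    """Extract the bundle and move every .mat file it contains into data_dir.

    Returns the sorted list of destination paths.
    """
    work = Path(work_dir)
    target = Path(data_dir)
    work.mkdir(parents=True, exist_ok=True)
    target.mkdir(parents=True, exist_ok=True)

    # Only extract archives obtained from a source you trust.
    with tarfile.open(str(bundle_path), "r:gz") as bundle:
        bundle.extractall(str(work))

    moved = []
    # rglob searches all subfolders, so the archive layout does not matter.
    for mat_file in work.rglob("*.mat"):
        destination = target / mat_file.name
        shutil.move(str(mat_file), str(destination))
        moved.append(destination)
    return sorted(moved)


if __name__ == "__main__":
    for path in stage_mat_files("deepsea_train_bundle.v0.9.tar.gz"):
        print(path)
```

After it runs, the data/ folder should contain the three .mat files mentioned above.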

If you have everything installed, you can train a model as follows:

$ python DanQ_train.py

On my system, each epoch took about 6 hours. Whenever the validation loss reaches a new minimum at the end of a training epoch, the best weights are stored in [DanQ_bestmodel.hdf5](https://cbcl.ics.uci.edu/public_data/DanQ/DanQ_bestmodel.hdf5). I've already uploaded the fully trained model at that link. You can see motif results, including visualizations and TOMTOM comparisons to known motifs, in the motifs/ folder. Likewise, you can also train a much larger model where about half of the motifs are initialized with JASPAR motifs:

$ python DanQ-JASPAR_train.py

Weights are saved to the file [DanQ-JASPAR_bestmodel.hdf5](https://cbcl.ics.uci.edu/public_data/DanQ/DanQ-JASPAR_bestmodel.hdf5) whenever the validation loss is lowered. Motif results for this model are also stored in the motifs/ folder.
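The "save only when the validation loss improves" rule used by both training scripts can be illustrated without any deep learning library. The sketch below is not the code in the training scripts: `BestWeightsKeeper` and `save_fn` are hypothetical names, and the loss values are made up. Keras provides a `ModelCheckpoint` callback with a `save_best_only` option for this purpose; whether the scripts use exactly that callback is an assumption here.

```python
class BestWeightsKeeper(object):
    """Call save_fn only when the validation loss reaches a new minimum."""

    def __init__(self, save_fn):
        self.save_fn = save_fn         # hypothetical callable that writes weights to disk
        self.best_loss = float("inf")  # no epoch has been evaluated yet

    def on_epoch_end(self, epoch, val_loss):
        """Return True if this epoch's weights were saved."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.save_fn(epoch)
            return True
        return False


if __name__ == "__main__":
    saved_epochs = []
    keeper = BestWeightsKeeper(saved_epochs.append)
    # Made-up validation losses, for illustration only.
    for epoch, loss in enumerate([0.070, 0.065, 0.066, 0.061]):
        keeper.on_epoch_end(epoch, loss)
    print(saved_epochs)  # [0, 1, 3]
```

Epoch 2 is skipped because its loss is higher than the best seen so far, so the file on disk always holds the weights from the best epoch rather than the most recent one.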

For your convenience, I've posted the current ROC AUC and PR AUC statistics comparing DanQ and DanQ-JASPAR with DeepSEA.
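For readers unfamiliar with the first of these metrics, ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on toy data; it is not the evaluation code behind the posted statistics, `roc_auc` is a hypothetical helper name, and PR AUC is not covered.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outscores a negative.

    Ties count as half. The cost is quadratic in the number of examples,
    so this is for illustration rather than for large test sets.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy labels and scores, not DanQ output.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

In the toy example, three of the four positive/negative pairs are ranked correctly, which gives 0.75.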

If you do not want to train a model from scratch and just want to do predictions, I've included test scripts for both models and the file example.h5 in the data folder. This is the same hdf5 file that is generated using the example from the DeepSEA package. The test scripts here have the same input and output formats as the prediction script from DeepSEA, so you can replace the prediction step of the DeepSEA pipeline (i.e. the 2_DeepSEA.lua script) with the test scripts here:

$ python DanQ_test.py data/example.h5 data/example_DanQ_pred.h5

To-Do

  • Annotate genetic variation (xgboost model files are currently included, but not detailed at the moment)
  • Improve DanQ architecture
