
Deep learning for gene expression inference

README for D-GEX

INTRODUCTION

Large-scale gene expression profiling has been widely used to characterize cellular states in response to various disease conditions, genetic perturbations, and so on. Although the cost of whole-genome expression profiling has been dropping steadily, generating a compendium of expression profiles over thousands of samples is still very expensive. Recognizing that gene expression levels are often highly correlated, researchers from the NIH LINCS program have developed a cost-effective strategy of profiling only ~1,000 carefully selected landmark genes and relying on computational methods to infer the expression of the remaining target genes. However, the computational approach adopted by the LINCS program is currently based on linear regression, limiting its accuracy since it does not capture the complex nonlinear relationships among gene expression levels.

We present a deep learning method (abbreviated as D-GEX) to infer the expression of target genes from the expression of landmark genes. We used the microarray-based GEO dataset, consisting of 111K expression profiles, to train our model and compared its performance to that of other methods. In terms of mean absolute error averaged across all genes, deep learning significantly outperforms linear regression, with a 15.33% relative improvement. A gene-wise comparative analysis shows that deep learning achieves lower error than linear regression in 99.97% of the target genes. We also tested the performance of our learned model on an independent RNA-Seq-based GTEx dataset, which consists of 2,921 expression profiles. Deep learning still outperforms linear regression, with a 6.57% relative improvement, and achieves lower error in 81.31% of the target genes.
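The evaluation metrics above (gene-wise mean absolute error, relative improvement, fraction of genes with lower error) can be sketched in a few lines of NumPy. This is an illustrative sketch with synthetic data standing in for real predictions, not the paper's evaluation code:

```python
import numpy as np

# Synthetic stand-ins: 100 profiles, 5 hypothetical target genes.
rng = np.random.default_rng(0)
y_true = rng.normal(size=(100, 5))
y_dl = y_true + rng.normal(scale=0.3, size=y_true.shape)  # stand-in "deep learning"
y_lr = y_true + rng.normal(scale=0.4, size=y_true.shape)  # stand-in "linear regression"

def gene_wise_mae(y_true, y_pred):
    """Mean absolute error for each target gene (genes along columns)."""
    return np.abs(y_true - y_pred).mean(axis=0)

mae_dl = gene_wise_mae(y_true, y_dl)
mae_lr = gene_wise_mae(y_true, y_lr)

# Relative improvement of overall MAE, and fraction of genes where DL wins.
rel_improvement = (mae_lr.mean() - mae_dl.mean()) / mae_lr.mean() * 100
frac_better = (mae_dl < mae_lr).mean() * 100
```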

This code base provides all the necessary pieces to reproduce the main results of D-GEX. If you have any questions, please email [email protected]

PREREQUISITES

DATA

The original data files are not provided within this codebase, as some of them require applying for access. Once you have downloaded all of them, please place them in this codebase.

GEO and GTEx

The GEO and GTEx data we used in our paper are a preliminary version predating their official publication, and are not publicly available. If you are interested in the data, please email us ([email protected]) with your basic information from an academic institution email address, and we will provide you with a private download link. The files you will download are bgedv2_QNORM.gctx and GTEx_RNASeq_RPKM_n2921x55993.gctx.

1000G

The 1000 Genomes RNA-Seq expression data can be accessed from EMBL-EBI. The file downloaded is GD462.GeneQuantRPKM.50FN.samplename.resk10.txt.

L1000

The predicted expression of L1000 data based on D-GEX can be downloaded at l1000_n1328098x22268.gctx. It consists of 1,328,098 expression profiles of 22,268 genes. The first 978 genes are landmark genes that were directly measured by the L1000 platform. The other 21,290 genes are target genes inferred by D-GEX based on the GEO data. The expression profile of each gene was standardized to mean 0 and standard deviation 1.
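The per-gene standardization described above can be sketched as follows. This is a minimal illustration on a toy genes-by-profiles matrix; the real file is in .gctx format and far larger:

```python
import numpy as np

def standardize_genes(expr):
    """Scale each gene (row) to mean 0 and standard deviation 1."""
    mean = expr.mean(axis=1, keepdims=True)
    std = expr.std(axis=1, keepdims=True)
    return (expr - mean) / std

# Toy expression matrix: 10 genes x 50 profiles of non-negative values.
rng = np.random.default_rng(0)
expr = rng.gamma(shape=2.0, scale=3.0, size=(10, 50))
expr_std = standardize_genes(expr)
```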

PREPROCESS

The whole preprocessing step can be done by running:

$ ./preprocess.sh

Specifically, there are four steps.

  1. Removing duplicates by k-means: kmeans.py, nodup_idx.py.
  2. Converting data into numpy format: bgedv2.py, GTEx.py, 1000G.py.
  3. Quantile normalization: bgedv2_reqnorm.py, GTEx_reqnorm.py, 1000G_reqnorm.py.
  4. Standardization: bgedv2_norm.py, GTEx_norm.py, 1000G_norm.py.
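Quantile normalization (step 3) forces every sample to share one common value distribution. The steps can be sketched in NumPy as below, assuming a genes-by-samples matrix; the actual scripts (bgedv2_reqnorm.py etc.) may handle ties and file I/O differently:

```python
import numpy as np

def quantile_normalize(X):
    """Make every sample (column) of X share the same value distribution."""
    order = np.argsort(X, axis=0)                 # sort order within each sample
    ranks = np.argsort(order, axis=0)             # rank of each entry in its sample
    reference = np.sort(X, axis=0).mean(axis=1)   # mean distribution across samples
    return reference[ranks]                       # substitute reference value by rank

# Toy matrix: 100 genes x 3 samples with very different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[0.0, 2.0, 5.0], scale=[1.0, 3.0, 0.5], size=(100, 3))
Xn = quantile_normalize(X)
```

After normalization, every column has exactly the same sorted values while the within-sample ranking of genes is preserved.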

TRAINING

Training D-GEX is done by running H1_0-4760.py, H1_4760-9520.py, H2_0-4760.py, H2_4760-9520.py, H3_0-4760.py and H3_4760-9520.py. Each script trains half of the target genes (0-4760 or 4760-9520) with a certain architecture (1, 2 or 3 hidden layers).

A training example using 200 epochs, a 0.75 include rate (0.25 dropout rate), and 1 hidden layer with 9000 hidden units, for target genes 0-4760:

$ ./H1_0-4760.py 9000_H1_0-4760_75 200 9000 0.75

Here, 9000_H1_0-4760_75 is the base name for all the output files.
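To make the hyperparameters concrete, here is a minimal NumPy sketch of a 1-hidden-layer forward pass with dropout, as in the H1 configuration. This is illustrative only: the dimensions are shrunk, and the real training scripts, loss, and optimizer are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(X, W1, b1, W2, b2, include_rate=0.75, train=True):
    """Forward pass: landmark genes -> hidden layer -> target genes."""
    H = np.tanh(X @ W1 + b1)
    if train:
        # Dropout keeps each hidden unit with probability `include_rate`
        # (an include rate of 0.75 corresponds to a dropout rate of 0.25).
        mask = rng.random(H.shape) < include_rate
        H = H * mask / include_rate
    return H @ W2 + b2

# Toy sizes (the real setup uses 9000 hidden units and 4760 target genes).
n_landmark, n_hidden, n_target = 978, 200, 50
W1 = rng.normal(scale=0.01, size=(n_landmark, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.01, size=(n_hidden, n_target))
b2 = np.zeros(n_target)

Y_hat = forward(rng.normal(size=(8, n_landmark)), W1, b1, W2, b2)
```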

OUTPUT

Each training instance outputs 7 files. For example, running

$ ./H1_0-4760.py 9000_H1_0-4760_75 200 9000 0.75

outputs:

9000_H1_0-4760_75.log, the log file of the training instance.

9000_H1_0-4760_75_bestva_model.pkl, the model saved at the best performance on Y_va (GEO microarray validation data).

9000_H1_0-4760_75_bestva_Y_va_hat.npy, the Y_va_hat predicted at the best performance on Y_va (GEO microarray validation data).

9000_H1_0-4760_75_bestva_Y_te_hat.npy, the Y_te_hat predicted at the best performance on Y_va (GEO microarray validation data).

9000_H1_0-4760_75_best1000G_model.pkl, the model saved at the best performance on Y_1000G (1000G RNA-Seq data).

9000_H1_0-4760_75_best1000G_Y_1000G_hat.npy, the Y_1000G_hat predicted at the best performance on Y_1000G (1000G RNA-Seq data).

9000_H1_0-4760_75_best1000G_Y_GTEx_hat.npy, the Y_GTEx_hat predicted at the best performance on Y_1000G (1000G RNA-Seq data).
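The *_hat.npy outputs are plain NumPy arrays and can be loaded directly for downstream evaluation. A minimal sketch follows, with toy arrays written to a temporary directory standing in for the real files; the ground-truth file name (Y_te.npy) is an assumption for illustration:

```python
import os
import tempfile
import numpy as np

# Create toy stand-ins for the real files (names follow the convention above).
tmp = tempfile.mkdtemp()
rng = np.random.default_rng(0)
Y_te = rng.normal(size=(20, 4760))
np.save(os.path.join(tmp, 'Y_te.npy'), Y_te)
np.save(os.path.join(tmp, '9000_H1_0-4760_75_bestva_Y_te_hat.npy'),
        Y_te + rng.normal(scale=0.1, size=Y_te.shape))

# Load predictions and compute mean absolute error against the ground truth.
Y_te_hat = np.load(os.path.join(tmp, '9000_H1_0-4760_75_bestva_Y_te_hat.npy'))
mae = np.abs(np.load(os.path.join(tmp, 'Y_te.npy')) - Y_te_hat).mean()
```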

Reference

Chen Y, Li Y, Narayan R, Subramanian A, Xie X. Gene expression inference with deep learning. Bioinformatics, 2016 (also on bioRxiv).
