RNAseqDB

Background

A multitude of large-scale studies, e.g. TCGA and GTEx, have recently generated an unprecedented volume of RNA-seq data. RNA-seq expression data from different studies are typically not directly comparable because of differences in sample handling, data processing, and other batch effects. Here, we developed a pipeline that processes and unifies RNA-seq data from different studies. Using the pipeline, we have processed data from GTEx and TCGA and have successfully corrected for study-specific biases, allowing comparative analysis across studies.

Methods

The input of the pipeline is paired-end raw sequencing reads (in FASTQ format). The raw reads of the RNA-seq samples for the TCGA and GTEx projects were retrieved from the Cancer Genomics Hub (CGHub, https://cghub.ucsc.edu) and the Database of Genotypes and Phenotypes (dbGaP, http://www.ncbi.nlm.nih.gov/gap), respectively.

We used STAR to align sequencing reads, RSEM and featureCounts to quantify gene expression, mRIN to evaluate sample degradation, RSeQC to measure sample strandedness and quality, and SVAseq to correct batch biases.

Related software

The pipeline requires the following third-party tools, whose full paths need to be specified in a configuration file, config.txt. Several example configuration files are provided in the folder configuration.

STAR aligner v2.4.2a

rsem v1.2.20 (http://deweylab.biostat.wisc.edu/rsem/src/rsem-1.2.20.tar.gz)

SubRead v1.5.0 (http://sourceforge.net/projects/subread/files/subread-1.5.0-p1/subread-1.5.0-p1-source.tar.gz/download)

SVAseq (R package downloaded from http://bioconductor.org/biocLite.R)

samtools (v1.2)

RSeQC v1.1.8 (http://www.broadinstitute.org/cancer/cga/tools/rnaseqc/RNA-SeQC_v1.1.8.jar)

ubu v1.2 (https://github.com/mozack/ubu/archive/v1.2.tar.gz)

mRIN

picard-tools v1.126

FastQC v0.11.3

bedtools v2.23

These tools are already installed on two computer clusters at MSKCC, HAL and LUNA, so users of these two clusters do not need to install them (except Rsubread) to run the pipeline.

Finally, the pipeline wraps gtdownload and sratoolkit so that users can conveniently download TCGA and GTEx samples from CGHub and dbGaP, respectively. To download data through the pipeline, users need to install gtdownload and sratoolkit.

Availability

The pipeline was designed to run on various computer clusters, e.g. the HAL cluster hal.cbio.mskcc.org and LUNA cluster luna.cbio.mskcc.org.

For LUNA users, the complete pipeline is located under the directory /ifs/e63data/schultzlab/wangq/bin/RNAseqDB.

To run the pipeline on LUNA, users need to set some paths using the environment variables PATH, PYTHONPATH, and PERL5LIB. The following is a summary of the paths that can be added to a personal '.bashrc' file. These paths are also listed in the file configuration/sourceme.

  1. export PERL5LIB=/ifs/e63data/schultzlab/wangq/perl5:/opt/common/CentOS_6/perl/perl-5.22.0/lib/5.22.0:/ifs/e63data/schultzlab/opt/perl5/lib/perl5:/ifs/e63data/schultzlab/opt/perl5/lib/perl5/czplib
  2. export PATH=/opt/common/CentOS_6-dev/perl/perl-5.22.0/bin:$PATH
  3. export PATH=/opt/common/CentOS_6/python/python-2.7.8/bin/:$PATH
  4. export PYTHONPATH=/ifs/e63data/schultzlab/bin/RSeQC-2.6.1/opt/common/CentOS_6/python/python-2.7.8/lib/python2.7/site-packages:$PYTHONPATH
  5. export PATH=/ifs/e63data/schultzlab/bin/RSeQC-2.6.1/opt/common/CentOS_6/python/python-2.7.8/bin:$PATH
  6. export PATH=/ifs/e63data/schultzlab/wangq/bin/GeneTorrent-download-3.8.7-207/bin:$PATH

On the HAL cluster hal.cbio.mskcc.org, a copy of the pipeline is located under the directory /cbio/ski/schultz/home/wangq/scripts

For users outside MSKCC, the source code of the pipeline is freely accessible through GitHub (at https://github.com/mskcc/RNAseqDB/).

Quick start

To demonstrate how to run the pipeline, suppose we have a set of RNA-seq samples under a directory ~/data/RNA-seq/ and we want to quantify expression levels of genes in them.

If the samples are in SRA format, put them directly in ~/data/RNA-seq/. For FASTQ or BAM files, save each sample in its own sub-directory under ~/data/RNA-seq/. The script will automatically find and analyze all the samples under the given directory. For samples downloaded from GTEx or TCGA, also put the metadata files under ~/data/RNA-seq/: SraRunTable.txt for GTEx, and manifest.xml and summary.tsv for TCGA.
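The layout described above can be checked before submitting jobs. The following is a minimal Python sketch, not part of RNAseqDB: the function name discover_samples and the list of accepted file extensions are assumptions made for illustration, and pipeline.pl may recognize samples differently.

```python
from pathlib import Path

# Hypothetical helper, not part of RNAseqDB. The extension list is an
# assumption; the pipeline itself may accept other file names.
SEQ_SUFFIXES = (".fastq", ".fastq.gz", ".fq", ".fq.gz", ".bam")


def discover_samples(root):
    """Return {sample_name: [file names]} for the layout described above.

    SRA files are expected directly under root; FASTQ/BAM files are
    expected in one sub-directory per sample.
    """
    root = Path(root).expanduser()
    samples = {}
    for entry in sorted(root.iterdir()):
        if entry.is_file() and entry.suffix == ".sra":
            # SRA samples sit directly in the top-level directory.
            samples[entry.stem] = [entry.name]
        elif entry.is_dir():
            # FASTQ/BAM samples: the sub-directory name is the sample name.
            files = sorted(
                f.name
                for f in entry.iterdir()
                if f.is_file() and f.name.endswith(SEQ_SUFFIXES)
            )
            if files:
                samples[entry.name] = files
    return samples
```

Calling discover_samples("~/data/RNA-seq/") lists what would be treated as a sample under these assumptions. Metadata files such as SraRunTable.txt are ignored because they are neither SRA files nor sub-directories.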

To run the pipeline, first initialize the environment variables using the following command:

source /ifs/e63data/schultzlab/wangq/bin/RNAseqDB/sourceme

Then, use the command below to analyze all the samples under the directory ~/data/RNA-seq/:

perl /ifs/e63data/schultzlab/wangq/bin/RNAseqDB/pipeline.pl -i ~/data/RNA-seq/ -s

The argument '-s' tells the script to submit a job for each sample.

The script pipeline.pl requires a configuration file to find and execute other software it needs. If we do not specify it in the command line (like above), the script will look for a default one, config.txt, under its own directory /ifs/e63data/schultzlab/wangq/bin/RNAseqDB/.

Another script pipeline-wrapper.pl provides more functionality, e.g. batch bias correction, than pipeline.pl. When executed with no argument or with the argument '-h', pipeline-wrapper.pl, as well as other script files, will print detailed instructions on how to use it.

After all the jobs terminate, you can (optionally) use another script, collect-qc.pl, to create a report on the quality of the samples. Downstream scripts read the QC report created by collect-qc.pl and filter out low-quality samples by default.

perl /ifs/e63data/schultzlab/wangq/bin/RNAseqDB/collect-qc.pl -i ~/data/RNA-seq/

To create a sample-gene matrix of gene expression across all samples, run create-matrix.pl:

perl /ifs/e63data/schultzlab/wangq/bin/RNAseqDB/create-matrix.pl -i ~/data/RNA-seq/ -o data-matrix-file.txt -p

The script create-matrix.pl can output raw or normalized values (read count, TPM, and FPKM) for either gene or transcript expression.
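For readers unfamiliar with these units, the sketch below shows how FPKM and TPM relate to raw read counts using their textbook definitions. It is an illustration only and not the code used by create-matrix.pl. The function names and the toy numbers are made up, and plain gene length is used here where RSEM uses an effective length.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    rate_sum = sum(rates)
    return [r * 1e6 / rate_sum for r in rates]


# Toy example with three genes (illustrative numbers only).
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 1000]
fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
```

The first two genes have the same count per base, so they get identical FPKM and TPM values even though their raw counts differ. TPM values always sum to one million within a sample, which is not true of FPKM.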

Batch bias correction

Our pipeline can correct biases specific to the TCGA and GTEx projects so that TCGA samples are directly comparable with the GTEx samples.

To do so, you need a configuration file, e.g. config-luna.txt, that specifies the paths of the data. In config-luna.txt, two variables, 'gtex_path' and 'tcga_path', point to the directories that store the GTEx and TCGA data, respectively.

The scripts post-process.pl and run-combat.R do the actual work of batch effect correction. Another script, pipeline-wrapper.pl, wraps all the necessary steps together to make the analysis easy. The following command was used to analyze bladder tissue:

perl /ifs/e63data/schultzlab/wangq/bin/RNAseqDB/pipeline-wrapper.pl -t bladder -s

Running the command above does the following. The script pipeline-wrapper.pl first searches the configuration file tissue-conf.txt for lines containing the word 'bladder'. The directories storing the GTEx and TCGA samples are then obtained by concatenating 'gtex_path' and 'tcga_path' with the keywords found in tissue-conf.txt. Next, the script submits jobs for the samples and performs batch bias correction after all jobs terminate. Finally, the script creates sample-gene matrices from the samples.

Data

We have applied the pipeline to GTEx and TCGA. The resulting data have been deposited in the directory data. This directory contains three subdirectories, each corresponding to a dataset as described below.

  1. expected_count: the maximum likelihood gene expression levels computed using RSEM, i.e. the expected_count in RSEM’s output. There are 52 data files in this subdirectory, each being a sample-gene matrix of a certain tissue type. These files can be provided to programs such as EBSeq, DESeq, or edgeR for identifying differentially expressed genes.

  2. unnormalized: the gene expression levels calculated from the FPKM values in RSEM’s output. The data matrices here, however, are not the direct output of RSEM. They underwent quantile normalization but were not corrected for batch effects.

  3. normalized: the normalized gene expression levels (FPKM). This set of data files was not only quantile normalized, but also was corrected for batch effects (using tool ComBat).
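Quantile normalization, mentioned for the last two datasets, forces every sample to share the same distribution of values. The sketch below implements the standard algorithm in plain Python as an illustration. It is not the pipeline's own code, the function name is hypothetical, and ties are broken by row order, which is a simplifying assumption.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes x samples matrix given as a list of rows.

    Each value is replaced by the mean, across samples, of the values that
    share its within-sample rank. Ties are broken by row order.
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]

    # Mean of the k-th smallest value across all samples.
    sorted_columns = [sorted(col) for col in columns]
    rank_means = [
        sum(col[k] for col in sorted_columns) / n_samples for k in range(n_genes)
    ]

    result = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        # Gene indices ordered from lowest to highest value in sample j.
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, gene_index in enumerate(order):
            result[gene_index][j] = rank_means[rank]
    return result


# Toy matrix: 4 genes (rows) x 3 samples (columns), illustrative values only.
toy = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalized_toy = quantile_normalize(toy)
```

After normalization every column of normalized_toy contains the same set of values, so differences in overall scale between samples are removed. Study-specific batch effects are a separate issue, which the pipeline addresses with ComBat.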

Handling replicates

If a sample has more than two FASTQ files or has multiple replicates, the user should provide a file named SampleSheet.csv together with the FASTQ files. SampleSheet.csv should be in CSV format and include at least two columns: 'SampleID' and 'Lane'.

Below is the content of an example SampleSheet.csv file. It is the simplest sample sheet file allowed, as it only contains two required columns. Given this file, the program will look for all FASTQ files with file names matching prefix '130723_7001407_0116_AC2AEVACXX' and lane number <= 8 (each lane is treated as a replicate).

SampleID,Lane
130723_7001407_0116_AC2AEVACXX,8
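The matching rule described above can be sketched in Python as follows. This is not the pipeline's implementation. Both function names are hypothetical, and the sketch assumes that FASTQ file names encode the lane Illumina-style as "_L<digits>_" (for example "_L001_"), a convention the text above does not specify.

```python
import csv
import io
import re


def read_sample_sheet(text):
    """Parse SampleSheet.csv content into a list of (sample_id, max_lane)."""
    reader = csv.DictReader(io.StringIO(text))
    missing = {"SampleID", "Lane"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError("missing required column(s): " + ", ".join(sorted(missing)))
    return [(row["SampleID"], int(row["Lane"])) for row in reader]


def match_fastq(file_names, sample_id, max_lane):
    """Group FASTQ names by lane for files that start with sample_id.

    ASSUMPTION: the lane appears in the file name as "_L<digits>_".
    Only lanes from 1 up to max_lane are kept.
    """
    pattern = re.compile(re.escape(sample_id) + r".*_L(\d+)_")
    by_lane = {}
    for name in file_names:
        match = pattern.match(name)
        if match and 1 <= int(match.group(1)) <= max_lane:
            by_lane.setdefault(int(match.group(1)), []).append(name)
    return by_lane


sheet = "SampleID,Lane\n130723_7001407_0116_AC2AEVACXX,8\n"
entries = read_sample_sheet(sheet)
```

Each key of the dictionary returned by match_fastq would correspond to one replicate, since each lane is treated as a replicate.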

pipeline.pl provides an argument, '-m | --merge-replicates', that lets the user choose between merging all replicates of a sample (if they are technical replicates) and analyzing each replicate separately (if they are biological replicates).

The following is an example command that merges all replicates of each sample under the directory ~/data/RNA-seq/:

perl /ifs/e63data/schultzlab/wangq/bin/RNAseqDB/pipeline.pl -i ~/data/RNA-seq/ -s -m

Handling species other than human

The pipeline can be applied to species other than human. An example configuration file for mouse, config-luna-mouse.txt, is provided under the folder configuration. To quantify gene/transcript expression for mouse samples, the user needs to specify a mouse configuration file (not necessarily on the command line):

perl /ifs/e63data/schultzlab/wangq/bin/RNAseqDB/pipeline.pl -i ~/data/RNA-seq/ -s -m -c /ifs/e63data/schultzlab/wangq/bin/RNAseqDB/config-luna-mouse.txt

Citation

Q. Wang, J. Armenia, C. Zhang, A.V. Penson, E. Reznik, L. Zhang, T. Minet, A. Ochoa, B.E. Gross, C.A. Iacobuzio-Donahue, D. Betel, B.S. Taylor, J. Gao, N. Schultz. Unifying cancer and normal RNA sequencing data from different sources. Scientific Data 5:180061, 2018.

Qingguo Wang, Joshua Armenia, Chao Zhang, Alexander V Penson, Ed Reznik, Liguo Zhang, Thais Minet, Angelica Ochoa, Benjamin E Gross, Christine A Iacobuzio-Donahue, Doron Betel, Barry S Taylor, Jianjiong Gao, Nikolaus Schultz. Enabling cross-study analysis of RNA-Sequencing data. bioRxiv 110734, 2017.

Contact

Jianjiong Gao, Qingguo Wang
Nikolaus Schultz Lab
Memorial Sloan Kettering Cancer Center
New York, NY 10065
