• Stars
    star
    146
  • Rank 252,769 (Top 5 %)
  • Language
    C++
  • License
    Other
  • Created over 7 years ago
  • Updated 3 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Fast, efficient RNA-Seq metrics for quality control and process optimization

RNA-SeQC

Version CI

RNA-SeQC 2 is described in A. Graubert*, F. Aguet*, A. Ravi, K.G. Ardlie, Gad Getz, "RNA-SeQC 2: efficient RNA-seq quality control and quantification for large cohorts," Bioinformatics, 2021.

Installing

The latest stable build of RNA-SeQC is available on the GitHub Releases page, and contains static binaries for Linux and OSX.

RNA-SeQC is also available as a docker image: gcr.io/broad-cga-aarong-gtex/rnaseqc:latest which is automatically updated with any code change. Older versions of the docker image are tagged using the full commit SHA of any commit which introduced a code change.

To checkout the source of RNA-SeQC run git clone --recursive https://github.com/getzlab/rnaseqc.git. If you do not use the --recursive flag, you'll need to run git submodule update --init --recursive or you will be missing SeqLib.

Unit Tests

Input data for RNA-SeQC's testing suite is not stored in the repository due to size constraints. The current test data is available here, and must be unpacked within the test_data/ directory. Please note that the location of the test data is subject to change. The test resources use ~1.2 GB of space.

You can download and unpack test data with:

cd test_data
wget https://storage.googleapis.com/agraubert/broadinstitute/rnaseqc/test_inputs.tar.gz
tar xzf test_inputs.tar.gz

You can run the unit tests with make test

Usage

NOTE: This tool requires that the provided GTF be collapsed in such a way that there are no overlapping transcripts on the same strand and that each gene have a single transcript whose id matches the parent gene id. This is not a transcript-quantification method. Readcounts and coverage are made towards exons and genes only if all aligned segments of a read fully align to exons of a gene, but keep in mind that coverage may be counted towards multiple transcripts (and its exons) if these criteria are met. Beyond this, no attempt will be made to disambiguate which transcript a read belongs to. You can collapse an existing GTF using the GTEx collapse annotation script

Command Line Usage:

rnaseqc [OPTIONS] gtf bam output

Example: ./rnaseqc test_data/downsampled.gtf test_data/downsampled.bam --bed test_data/downsampled.bed --coverage .

OPTIONS:
  -h, --help                        Display this message and quit

  --version                         Display the version and quit

  gtf                               The input GTF file containing features
                                    to check the bam against

  bam                               The input SAM/BAM file containing reads
                                    to process

  output                            Output directory

  -s[sample], --sample=[sample]     The name of the current sample. Default:
                                    The bam's filename

  --bed=[BEDFILE]                   Optional input BED file containing
                                    non-overlapping exons used for fragment
                                    size calculations

  --fasta=[fasta]                   Optional input FASTA/FASTQ file
                                    containing the reference sequence used
                                    for parsing CRAM files

  --chimeric-distance=[DISTANCE]    Set the maximum accepted distance
                                    between read mates. Mates beyond this
                                    distance will be counted as chimeric
                                    pairs. Default: 2000000 [bp]

  --fragment-samples=[SAMPLES]      Set the number of samples to take when
                                    computing fragment sizes. Requires the
                                    --bed argument. Default: 1000000

  -q[QUALITY],
  --mapping-quality=[QUALITY]       Set the lower bound on read quality for
                                    exon coverage counting. Reads below this
                                    number are excluded from coverage
                                    metrics. Default: 255

  --base-mismatch=[MISMATCHES]      Set the maximum number of allowed
                                    mismatches between a read and the
                                    reference sequence. Reads with more than
                                    this number of mismatches are excluded
                                    from coverage metrics. Default: 6

  --offset=[OFFSET]                 Set the offset into the gene for the 3'
                                    and 5' windows in bias calculation. A
                                    positive value shifts the 3' and 5'
                                    windows towards eachother, while a
                                    negative value shifts them apart.
                                    Default: 150 [bp]

  --window-size=[SIZE]              Set the size of the 3' and 5' windows in
                                    bias calculation. Default: 100 [bp]

  --gene-length=[LENGTH]            Set the minimum size of a gene for bias
                                    calculation. Genes below this size are
                                    ignored in the calculation. Default: 600
                                    [bp]

  --legacy                          Use legacy counting rules. Gene and exon
                                    counts match output of RNA-SeQC 1.1.9

  --stranded=[stranded]             Use strand-specific metrics. Only
                                    features on the same strand of a read
                                    will be considered. Allowed values are
                                    'RF', 'rf', 'FR', and 'fr'

  -v, --verbose                     Give some feedback about what's going
                                    on. Supply this argument twice for
                                    progress updates while parsing the bam

  -t[TAG...], --tag=[TAG...]        Filter out reads with the specified tag.

  --chimeric-tag=[TAG]              Reads maked with the specified tag will
                                    be labeled as Chimeric. Defaults to 'ch'
                                    for STAR

  --exclude-chimeric                Exclude chimeric reads from the read
                                    counts

  -u, --unpaired                    Allow unpaired reads to be quantified.
                                    Required for single-end libraries

  --rpkm                            Output gene RPKM values instead of TPMs

  --coverage                        If this flag is provided, coverage
                                    statistics for each transcript will be
                                    written to a table. Otherwise, only
                                    summary coverage statistics are
                                    generated and added to the metrics table

  --coverage-mask=[SIZE]            Sets how many bases at both ends of a
                                    transcript are masked out when computing
                                    per-base exon coverage. Default: 500bp

  -d[threshold],
  --detection-threshold=[threshold] Number of counts on a gene to consider
                                    the gene 'detected'. Additionally, genes
                                    below this limit are excluded from 3'
                                    bias computation. Default: 5 reads

  "--" can be used to terminate flag options and force all following
  arguments to be treated as positional options

Output files:

The following output files are generated in the output directory you provide:

  • {sample}.metrics.tsv : A tab-delimited list of (Statistic, Value) pairs of all statistics and metrics recorded.
  • {sample}.exon_reads.gct : A tab-delimited GCT file with (Exon ID, Gene Name, coverage) tuples for all exons which had at least part of one read mapped.
  • {sample}.gene_reads.gct : A tab-delimited GCT file with (Gene ID, Gene Name, coverage) tuples for all genes which had at least one read map to at least one of its exons. This file contains the gene-level read counts used, e.g., for differential expression analyses.
  • {sample}.gene_tpm.gct : A tab-delimited GCT file with (Gene ID, Gene Name, TPM) tuples for all genes reported in the gene_reads.gct file, with expression values in transcript per million (TPM) units. Note: this file is renamed to .gene_rpkm.gct if the --rpkm flag is present.
  • {sample}.fragmentSizes.txt : A list of fragment sizes recorded, if a BED file was provided
  • {sample}.coverage.tsv : A tab-delimited list of (Gene ID, Transcript ID, Mean Coverage, Coverage Std, Coverage CV) tuples for all transcripts encountered in the GTF.

Metrics reported:

See Metrics.md for a description of all metrics reported in the metrics.tsv, coverage.tsv, and fragmentSizes.txt files.

Legacy mode differences

The --legacy flag enables compatibility with RNASeQC 1.1.9. This ensures that exon and gene readcounts match exactly the counts which would have been produced by running that version. This also adds an extra condition to classify reads as chimeric (see "Chimeric Reads", above). Any metrics which existed in 1.1.9 will also match within Java's floating point precision.

More Repositories

1

SignatureAnalyzer

Updated SignatureAnalyzer-GPU with mutational spectra & RNA expression compatibility.
Python
71
star
2

deTiN

DeTiN is designed to measure tumor-in-normal contamination and improve somatic variant detection sensitivity when using a contaminated matched control.
Jupyter Notebook
49
star
3

MutSig2CV

MutSig2CV from Lawrence et al. 2014
MATLAB
30
star
4

dalmatian

dalmatian is a collection of high-level companion functions for Firecloud and FISS.
Python
18
star
5

CPTAC_PanCan_PTM_2023

Jupyter Notebook
14
star
6

scrinvex

Single Cell RNA Intron-Exon counting
C++
11
star
7

Haradhvala_et_al_2022

Jupyter Notebook
8
star
8

10x-cd45-isoform-quantification

Jupyter Notebook
8
star
9

CLUMPS-PTM

CLUMPS adapted to post-translational modifications.
Python
7
star
10

ABSOLUTE

ABSOLUTE source code that works with allelic copy ratio on both hg19 and hg38
R
6
star
11

canine

A modular, high-performance computing solution to run jobs using SLURM
Python
6
star
12

AnnoMate

Review anything (purities, mutations, etc) within a jupyter notebook with plotly dash and jupyter widgets
Python
6
star
13

Boiarsky-etal-2022

Code to reproduce methods & results from Boiarsky et. al., Nature Communications 2022
Jupyter Notebook
6
star
14

igv_remote

A Python interface for interacting with IGV desktop over a port using IGV batch commands
Python
5
star
15

SMM_clustering_2020

Repo for analysis of SMM DNA + RNA data.
Jupyter Notebook
3
star
16

LUAD_subtypes

HTML
3
star
17

CApy

Cancer Analysis tools in Python
Python
3
star
18

CLUMPS

Python
3
star
19

rnaseqc2-paper

RNA-SeQC 2 manuscript materials
Jupyter Notebook
3
star
20

pathway_based_biomarker

MATLAB
3
star
21

CLUMPS2

Re-factored code for CLUMPS.
Python
2
star
22

mudi

A collection of downstream python single-cell utilities.
Python
2
star
23

mutect-benchmark

Code repo for master thesis
Jupyter Notebook
2
star
24

MSMuTect2

Accurate detection of somatic MS (microsatellite) indels.
Python
2
star
25

CLLmap

R
2
star
26

PurityReviewer

Suite of purity reviewers and review components
Python
2
star
27

igv_gdc_buddy

stream & update gdc signed url to IGV
Python
2
star
28

gbm_icb_manuscript

Code for reproducing the figures in the Ghannam et al. GBM ICB manuscript
Jupyter Notebook
1
star
29

GISTIC2_postprocessing

The repo contains codes (in matlab) to force call GISTIC2 driver events at sample level
MATLAB
1
star
30

lapdog

A python module, CLI, and browser-based UI to run jobs outside of Firecloud
Python
1
star
31

MSMuTect1

Shell
1
star
32

MSMuTect

Calling indels in microsatellite loci
1
star
33

REBC_tools

Repo for code & tasks used for REBC analysis
MATLAB
1
star
34

AsymTools2

AsymTools is a package for analyzing transcriptional and replicative mutational strand asymmetries in cancer sequencing datasets.
MATLAB
1
star
35

MSMutSig

Detects loci with a significantly higher-than-expected frequency of somatic MS (microsatellite) indels.
R
1
star
36

PhylogicNDTReviewer

Interactive app to review Phylogic solutions and data.
Jupyter Notebook
1
star
37

bladder_cancer_classifier

R
1
star
38

DLBclass-tool

Jupyter Notebook
1
star