• Stars
    star
    867
  • Rank 52,618 (Top 2 %)
  • Language
    Python
  • License
    MIT License
  • Created about 9 years ago
  • Updated about 3 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

RNAseq analysis notes from Ming Tang

RNA-seq analysis

General sequencing data analysis materials

RNA-seq specific

RNA-seq experimental design

Quality Control

  • QoRTs: a comprehensive toolset for quality control and data processing of RNA-Seq experiments
  • QUaCRS
  • RSeQC RNA-seq data QC
  • RNA-SeqQC

Normalization, quantification, and differential expression

Normalization is essential for RNAseq analysis. However, one needs to understand the underlining assumptions for each methods. Most methods assume there is no global changes between conditions (e.g. TMM normalization). However, this may not be true when global effect occurs. For example, if you delete a gene that controls transcription, you expect to see global gene expression reduction. In that case, other normalization methods need to be considered. (e.g. spike-in controls). The same principle applies to other high-throughput sequencing data such as ChIPseq.

read this very important paper by Rafael A Irizarry: Genome-wide repressive capacity of promoter DNA methylation is revealed through epigenomic manipulation

DESseq2 normalization by Simon Anders:

To estimate the library size, simply taking the total number of (mapped or unmapped) reads is, in our experience, not a good idea. Sometimes, a few very strongly expressed genes are differentially expressed, and as they make up a good part of the total counts, they skew this number. After you divide by total counts, these few strongly expressed genes become equal, and the whole rest looks differentially expressed.

The following simple alternative works much better:

  • Construct a "reference sample" by taking, for each gene, the geometric mean of the counts in all samples.
  • To get the sequencing depth of a sample relative to the reference, calculate for each gene the quotient of the counts in your sample divided by the counts of the reference sample. Now you have, for each gene, an estimate of the depth ratio.
  • Simply take the median of all the quotients to get the relative depth of the library.

This is what the estimateSizeFactors function of our DESeq package doese.

If one wants to use a set of genes that are not affected by the global change, do

dds = newCountDataSet(CountTable, Design$condition )
dds <- estimateSizeFactors(dds, 
                           controlGenes = rownames(dds) %in% norm_genes)
dds_global <- estimateSizeFactors(dds)
dds_global <- DESeq(dds_global)
res_global <- results(dds_global)

or give self-defined size factors.

sizeFactors(dds) = c(my_Values)

Traditional way of RNA-seq analysis

A post from Nextgeneseek

QuickRNASeq lifts large-scale RNA-seq data analyses to the next level of automation and interactive visualization

The three papers kind of replaces earlier tools from Salzberg’s group (Bowtie/TopHat,Cufflinks, and Cuffmerge)
they offer a totally new way to go from raw RNA-seq reads to differential expression analysis:
align RNA-seq reads to genome (HISATinstead of Bowtie/TopHat, STAR),
assemble transcripts and estimate expression (StringTie instead of Cufflinks), and
perform differential expression analysis (Ballgown instead of Cuffmerge).

nature protocol:Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown 08/11/2016

Simulation-based comprehensive benchmarking of RNA-seq aligners A nature method paper.

We found that performance varied by genome complexity, and accuracy and popularity were poorly correlated. The most widely cited tool underperforms for most metrics, particularly when using default settings

RapMap: A Rapid, Sensitive and Accurate Tool for Mapping RNA-seq Reads to Transcriptomes. From Sailfish group.

For mapping based methods, usually the raw reads are mapped to transcriptome or genome (need to model gaps by exon-exon junction), and then a gene/transcript level counts are obtained by:

Finally, differential expression is carried out by

  • DESeq2

  • EdgeR

  • limma Voom

  • EBseq An R package for gene and isoform differential expression analysis of RNA-seq data

  • JunctionSeq differential usage of exons and splice junctions in High-Throughput, Next-Generation RNA-Seq datasets. The methodology is heavily based on the DEXSeq bioconductor package.The core advantage of JunctionSeq over other similar tools is that it provides a powerful automated tools for generating readable and interpretable plots and tables to facilitate the interpretation of the results. An example results report is available here.

  • MetaSeq Meta-analysis of RNA-Seq count data in multiple studies

  • derfinder Annotation-agnostic differential expression analysis of RNA-seq data at base-pair resolution

  • DGEclust is a program for clustering and differential expression analysis of expression data generated by next-generation sequencing assays, such as RNA-seq, CAGE and others

  • Degust: Perform RNA-seq analysis and visualisation. Simply upload a CSV file of read counts for each replicate; then view your DGE data.

  • Vennt Dynamic Venn diagrams for Differential Gene Expression.

  • GlimmaInteractive HTML graphics for RNA-seq data.

Extra Notes

Benchmarking

bcbio.rnaseq
RNAseqGUI. I have used several times. looks good.
compcodeR
paper: Benchmark Analysis of Algorithms for Determining and Quantifying Full-length mRNA Splice Forms from RNA-Seq Data
paper: Comparative evaluation of isoform-level gene expression estimation algorithms for RNA-seq and exon-array platforms
paper:A benchmark for RNA-seq quantification pipelines

Map free

It provides a method to detect changes in the relative abundance of the alternative transcripts (isoforms) of genes. This is called Differential Transcript Usage (DTU).

Detecting DTU is supplementary to the quantification of transcripts by tools like Salmon, Sailfish and Kallisto and the detection of Differential Transcript Expression (DTE) by tools such as Sleuth.

I particularly like the figure in the tutorial showing the differences among DTU, DTE and DEG. The paper transcript-level estimates improve gene-level inferences above also talks about the differences:

  1. differential gene expression (DGE) studies, where the overall transcriptional output of each gene is compared between conditions; 2) differential transcript/exon usage (DTU/DEU) studies, where the composition of a gene’s isoform abundance spectrum is compared between conditions, or
  2. differential transcript expression (DTE) studies, where the interest lies in whether individual transcripts show differential expression between conditions.

! MATS is a computational tool to detect differential alternative splicing events from RNA-Seq data. The statistical model of MATS calculates the P-value and false discovery rate that the difference in the isoform ratio of a gene between two conditions exceeds a given user-defined threshold.

Blog posts on Kallisto/Salmon

  1. Comparing unpublished RNA-Seq gene expression quantifiers
  2. Kallisto, a new ultra fast RNA-seq quantitation method from Next GEN SEEK
  3. kallisto paper summary: Near-optimal RNA-seq quantification from Next GEN SEEK
  4. Not-quite alignments: Salmon, Kallisto and Efficient Quantification of RNA-Seq data
  5. Using Kallisto for gene expression analysis of published RNAseq data
  6. How accurate is Kallisto? from Mark Ziemann
  7. ALIGNMENT FREE TRANSCRIPTOME QUANTIFICATION
  8. A sleuth for RNA-seq
  9. Using Salmon, Sailfish and Sleuth for differential expression
  10. Road-testing Kallisto
  11. Why you should use alignment-independent quantification for RNA-Seq

A biostar post: Do not feed rounded estimates of gene counts from kallisto into DESeq2 (please make sure you read through all the comments, and now there is a suggested workflow for feeding rounded estimates of gene counts to DESeq etc)

There is some confusion in the answers to this question that hopefully I can clarify with the three comments below:

  1. kallisto produces estimates of transcript level counts, and therefore to obtain an estimate of the number of reads from a gene the correct thing to do is to sum the estimated counts from the constituent transcripts of that gene. Of note in the language above is the word "estimate", which is necessary because in many cases reads cannot be mapped uniquesly to genes. However insofar as obtaining a good estimate, the approach of kallisto (and before it Cufflinks, RSEM, eXpress and other "transcript level quantification tools") is superior to naïve "counting" approaches for estimating the number of reads originating from a gene. This point has been argued in many papers; among my own papers it is most clearly explained and demonstrated in Trapnell et al. 2013.
  1. Although estimated counts for a gene can be obtained by summing the estimated counts of the constituent transcripts from tools such as kallisto, and the resulting numbers can be rounded to produce integers that are of the correct format for tools such as DESeq, the numbers produced by such an approach do not satisfy the distributional assumptions made in DESeq and related tools. For example, in DESeq2, counts are modeled "as following a negative binomial distribution". This assumption is not valid when summing estimated counts of transcripts to obtain gene level counts, hence the justified concern of Michael Love that plugging in sums of estimated transcript counts could be problematic for DESeq2. In fact, even the estimated transcript counts themselves are not negative binomial distributed, and therefore also those are not appropriate for plugging into DESeq2. His concern is equally valid with many other "count based" differential expression tools.
  1. Fortunately there is a solution for performing valid statistical testing of differential abundance of individual transcripts, namely the method implemented in sleuth. The approach is described here. To test for differential abundance of genes, one must first address the question of what that means. E.g. is a gene differential if at least one isoform is? or if all the isoforms are? The tests of sleuth are performed at the granularity of transcripts, allowing for downstream analysis that can capture the varied questions that might make biological sense in specific contexts.

In summary, please do not plug in rounded estimates of gene counts from kallisto into DESeq2 and other tools. While it is technically possible, it is not statistically advisable. Instead, you should use tools that make valid distributional assumptions about the estimates.

However, Charlotte Soneson, Mike Love and Mark Robinson showed in a f1000 paper: Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences that rounded values from transcript level can be fed into DESeq2 etc for gene-level differential expression, and it is valid and preferable in many ways.

Thanks Rob Patro for pointing it out!

  • artemis: RNAseq analysis, from raw reads to pathways, typically in a few minutes. Mostly by wrapping Kallisto and caching everything we possibly can.
  • isolator:Rapid and robust analysis of RNA-Seq experiments.

Isolator has a particular focus on producing stable, consistent estimates. Maximum likelihood approaches produce unstable point estimates: small changes in the data can result in drastically different results, conflating downstream analysis like clustering or PCA. Isolator produces estimates that are in general, simultaneously more stable and more accurate other methods

Circular RNA

  • a bunch of tools from Dieterich Lab in github.

Batch effects

Databases

This package is for searching for datasets in EMBL-EBI Expression Atlas, and downloading them into R for further analysis. Each Expression Atlas dataset is represented as a SimpleList object with one element per platform. Sequencing data is contained in a SummarizedExperiment object, while microarray data is contained in an ExpressionSet or MAList object.

Gene Set enrichment analysis

Pathway analysis

Fusion gene detection

Alternative splicing

  • SplicePlot: a tool for visualizing alternative splicing Sashimi plots
  • Multivariate Analysis of Transcript Splicing (MATS)
  • SNPlice is a software tool to find and evaluate the co-occurrence of single-nucleotide-polymorphisms (SNP) and altered splicing in next-gen mRNA sequence reads. SNPlice requires, as input: genome aligned reads, exon-intron-exon junctions, and SNPs. exon-intron-exon junctions and SNPs may be derived from the reads directly, using, for example, TopHat2 and samtools, or they may be derived from independent sources
  • Visualizing Alternative Splicing github page
  • spladder Tool for the detection and quantification of alternative splicing events from RNA-Seq data
  • SUPPA This tool generates different Alternative Splicing (AS) events and calculates the PSI ("Percentage Spliced In") value for each event exploiting the fast quantification of transcript abundances from multiple samples.
  • IntSplice: Upload a VCF (variant call format) file to predict if an SNV (single nucleotide variation) from intronic positions -50 to -3 is pathogenic or not.
  • Whippet: an efficient method for the detection and quantification of alternative splicing reveals extensive transcriptomic complexity
  • RNA splicing analysis using heterogeneous and large RNA-seq datasets MAJIQ v2.

microRNAs and non-coding RNAs

transcriptional pausing

intron retention

Allel specific expression

immnune related

  • ImReP is a computational method for rapid and accurate profiling of the adaptive immune repertoire from regular RNA-Seq data.
  • Comprehensive analyses of tumor immunity: implications for cancer immunotherapy
  • pVAC-Seq is a cancer immunotherapy pipeline for the identification of personalized Variant Antigens by Cancer Sequencing (pVAC-Seq) that integrates tumor mutation and expression data (DNA- and RNA-Seq). It enables cancer immunotherapy research by using massively parallel sequence data to predicting tumor-specific mutant peptides (neoantigens) that can elicit anti-tumor T cell immunity.
  • [JingleBells] (http://jinglebells.bgu.ac.il/) - A repository of standardized single cell RNA-Seq datasets for analysis and visualization in IGV of the raw reads at the single cell level. Currently focused on immune cells. (http://www.jimmunol.org/content/198/9/3375.long)
  • immunedeconv - an R package for unified access to computational methods for estimating immune cell fractions from bulk RNA sequencing data.

Reads from xenografts

  • Xenosplit XenoSplit is a fast computational tool to detect the true origin of the graft RNA-Seq and DNA-Seq libraries prior to profiling of patient-derived xenografts (PDXs).

single cell tutorials

single cell RNA-seq normalization

single cell batch effect

Single cell RNA-seq

Considerable differences are found between the methods in terms of the number and characteristics of the genes that are called differentially expressed. Pre-filtering of lowly expressed genes can have important effects on the results, particularly for some of the methods originally developed for analysis of bulk RNA-seq data. Generally, however, methods developed for bulk RNA-seq analysis do not perform notably worse than those developed specifically for scRNA-seq.

single cell RNA-seq clustering

  • Geometry of the Gene Expression Space of Individual Cells
  • pcaReduce: Hierarchical Clustering of Single Cell Transcriptional Profiles.
  • CountClust: Clustering and Visualizing RNA-Seq Expression Data using Grade of Membership Models. Fits grade of membership models (GoM, also known as admixture models) to cluster RNA-seq gene expression count data, identifies characteristic genes driving cluster memberships, and provides a visual summary of the cluster memberships
  • FastProject: A Tool for Low-Dimensional Analysis of Single-Cell RNA-Seq Data
  • SNN-Cliq Identification of cell types from single-cell transcriptomes using a novel clustering method
  • Compare clusterings for single-cell sequencing bioconductor package.The goal of this package is to encourage the user to try many different clustering algorithms in one package structure. We give tools for running many different clusterings and choices of parameters. We also provide visualization to compare many different clusterings and algorithm tools to find common shared clustering patterns.
  • CIDR: Ultrafast and accurate clustering through imputation for single cell RNA-Seq data
  • SC3- consensus clustering of single-cell RNA-Seq data. SC3 achieves high accuracy and robustness by consistently integrating different clustering solutions through a consensus approach. Tests on twelve published datasets show that SC3 outperforms five existing methods while remaining scalable, as shown by the analysis of a large dataset containing 44,808 cells. Moreover, an interactive graphical implementation makes SC3 accessible to a wide audience of users, and SC3 aids biological interpretation by identifying marker genes, differentially expressed genes and outlier cells.

nanostring

advance of scRNA-seq tech

More Repositories

1

getting-started-with-genomics-tools-and-resources

Unix, R and python tools for genomics and data science
Shell
1,143
star
2

ChIP-seq-analysis

ChIP-seq analysis notes from Ming Tang
Python
670
star
3

scRNAseq-analysis-notes

scRNAseq analysis notes from Ming Tang
626
star
4

bioinformatics-one-liners

Bioinformatics one liners from Ming Tang
452
star
5

awesome_spatial_omics

tools and notes for spatial omics
209
star
6

The-world-of-faculty

resources for faculty
207
star
7

TCR-BCR-seq-analysis

T/B cell receptor sequencing analysis notes
202
star
8

DNA-seq-analysis

DNA sequencing analysis notes from Ming Tang
Shell
139
star
9

scATACseq-analysis-notes

my notes for scATACseq analysis
113
star
10

pyflow-ChIPseq

a snakemake pipeline to process ChIP-seq files from GEO or in-house
Python
101
star
11

scclusteval

Single Cell Cluster Evaluation
R
85
star
12

pyflow-ATACseq

ATAC-seq snakemake pipeline
Python
82
star
13

DNA-methylation-analysis

DNA methylation analysis notes from Ming Tang
78
star
14

machine-learning-resource

70
star
15

papers_with_data_to_mine

published papers with a lot of data
61
star
16

oneliner_100day_challenge

Bioinformatics one-liner for 100 days
43
star
17

scATACutils

R/Bioconductor package for working with 10x scATACseq data
R
38
star
18

scRNA-seq-workshop-Fall-2019

Harvard FAS informatics scRNAseq workshop website
R
36
star
19

biotech_resource

some resources for startup companies
36
star
20

compbio_tutorials

My youtube programming scripts
HTML
33
star
21

compbio_resources_chatomics

24
star
22

pyflow-RNAseq

RNAseq pipeline based on snakemake
Python
22
star
23

Machine_learning_drug_discovery

21
star
24

awesome-long-reads

tools and notes for long reads analysis
19
star
25

pyflow-scATACseq

snakemake workflow for post-processing scATACseq data
Python
19
star
26

crazyhottommy

17
star
27

Coursera_Bioinformatics_for_Beginners

python scripts for the Coursera Bioinformatics for Beginners
Python
17
star
28

pyflow-cellranger

A Snakemake pipeline for cellranger to process 10x single-cell RNAseq data
Python
15
star
29

scripts-general-use

HTML
15
star
30

single-cell-DNAseq-notes

14
star
31

pyflow_seurat_parameter

cluster stability measurement by subsampling and reclustering with Seurat V3 and V4
R
11
star
32

immunotherapy_scRNAseq_papers

11
star
33

CV

my CV using pagedown
JavaScript
10
star
34

awesome-single-cell-proteomics

9
star
35

mixed_histology_lung_cancer

8
star
36

cloud_computing_resources

7
star
37

MIT6.00.1x-Introduction-to-Computer-Science-and-Programming-Using-Python

my notes for the homework
Python
5
star
38

immunology_tools

5
star
39

pyflow-single-cell

single-cell RNAseq ATACseq processing pipeline
Python
5
star
40

writing-tips

5
star
41

scATACtools

R, python, unix tools for 10x scATACseq data
R
5
star
42

wholebrain_docker

docker file for wholebrain http://www.wholebrainsoftware.org/cms/installing-wholebrain-on-ubuntudebian/
Dockerfile
5
star
43

Genrich_compare

snakemake workflow comparing Genrich and MACS2
Python
5
star
44

phantompeakqualtools

Automatically exported from code.google.com/p/phantompeakqualtools
R
4
star
45

computation_wiki

Tommy's computation wiki
HTML
4
star
46

flowcytometry_analysis_notes

4
star
47

mixing_histology_lung_cancer

HTML
3
star
48

pyflow-chromForest

snakemake workflow for random forest based feature selection on chromHMM data
Python
3
star
49

odyssey_dot_files

my dot files on Harvard Odyssey HPC
Shell
3
star
50

primer3_scATAC_peaks

batch design primers for scATACseq differential peaks
Shell
3
star
51

seurat_v3_dockerfile

docker file for seurat v3
Dockerfile
2
star
52

PRADA_pipeline_Verhaak_lab

Shell
2
star
53

STAT115_HW

Tommy's homework
2
star
54

rocker_tidyvese_jpeg_cairo

docker file to extend rocker tidyverse
Dockerfile
2
star
55

ucn3_neuron_microarray_analysis

2
star
56

epigenomics_concept_learning

2
star
57

CIDC_single_cell

snakemake single cell pipeline for CIDC
Python
2
star
58

ucn3_neuron_microarray

2
star
59

EvaluateSingleCellClustering

examples for using scclusteval
R
2
star
60

bulk-RNAseq-workshop

HTML
2
star
61

compbio_challenges

2
star
62

machine_learning_datasets

2
star
63

rosalind_problems_python_solutions

Python
1
star
64

Epigenome_RoadTrip

my RoadTrip project
Python
1
star
65

data-science-machine-learning-project

HTML
1
star
66

ChIP-seq-carpentry

Development of the ChIPseq workshop for Data Carpentry
Python
1
star
67

one-click-hugo-cms

SCSS
1
star
68

nextjs-blog-theme

JavaScript
1
star
69

hodgkin_lymphoma_publication_scRNAseq_analysis

Jupyter Notebook
1
star