• Stars
    star
    626
  • Rank 71,755 (Top 2 %)
  • Language
  • License
    MIT License
  • Created over 6 years ago
  • Updated 8 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

scRNAseq analysis notes from Ming Tang

scRNAseq-analysis-notes

my scRNAseq analysis notes. Please also check https://github.com/mdozmorov/scRNA-seq_notes from Mikhail Dozmorov and https://github.com/seandavi/awesome-single-cell

I have another repo for awesome spatial omics

The reason

Single cell RNAseq is becoming more and more popular, and as a technique, it might become as common as PCR. I just got some 10x genomics single cell RNAseq data to play with, it is a good time for me to take down notes here. I hope it is useful for other people as well.

readings before doing anything

single cell tutorials

database

  • CuratedAtlasQueryR is a query interface that allow the programmatic exploration and retrieval of the harmonised, curated and reannotated CELLxGENE single-cell human cell atlas. Data can be retrieved at cell, sample, or dataset levels based on filtering criteria.

scRNAseq experimental design

New technology

remove ambient RNA

single cell RNA-seq normalization

single cell impute

single cell batch effect

Benchmark single cell pipeline

Differential expression

for multi-sample multi-group differential analysis

Single cell RNA-seq

Considerable differences are found between the methods in terms of the number and characteristics of the genes that are called differentially expressed. Pre-filtering of lowly expressed genes can have important effects on the results, particularly for some of the methods originally developed for analysis of bulk RNA-seq data. Generally, however, methods developed for bulk RNA-seq analysis do not perform notably worse than those developed specifically for scRNA-seq.

Above about 15,000 reads per cell the benefit of increased sequencing depth is minor.

perturbation

  • Single cell perturbation prediction https://scgen.readthedocs.io A tensorflow implementation of scGen. scGen is a generative model to predict single-cell perturbation response across cell types, studies and species.

cell type annotation

single cell RNA-seq clustering

Overall, methods such as Seurat, SingleR, CP, RPC and SingleCellNet performed well, with Seurat being the best at annotating major cell types. Also, Seurat, SingleR and CP are more robust against down-sampling. However, Seurat does have a major drawback at predicting rare cell populations, and it is suboptimal at differentiating cell types that are highly similar to each other, while SingleR and CP are much better in these aspects

dimension reduction and visualization of clusters

fast PCA for big matrix

See https://t.co/yxCb85ctL1: "MDS best choice for preserving outliers, PCA for variance, & T-SNE for clusters" @mikelove @AndrewLBeam

— Rileen Sinha (@RileenSinha) August 25, 2016
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

paper: Outlier Preservation by Dimensionality Reduction Techniques

"MDS best choice for preserving outliers, PCA for variance, & T-SNE for clusters"

t-SNE is a great piece of Machine Learning but one can find many reasons to use PCA instead of it. Of the top of my head, I will mention five. As most other computational methodologies in use, t-SNE is no silver bullet and there are quite a few reasons that make it a suboptimal choice in some cases. Let me mention some points in brief:

Stochasticity of final solution. PCA is deterministic; t-SNE is not. One gets a nice visualisation and then her colleague gets another visualisation and then they get artistic which looks better and if a difference of 0.03% in the KL(P||Q) divergence is meaningful... In PCA the correct answer to the question posed is guaranteed. t-SNE might have multiple minima that might lead to different solutions. This necessitates multiple runs as well as raises questions about the reproducibility of the results.

Interpretability of mapping. This relates to the above point but let's assume that a team has agreed in a particular random seed/run. Now the question becomes what this shows... t-SNE tries to map only local / neighbours correctly so our insights from that embedding should be very cautious; global trends are not accurately represented (and that can be potentially a great thing for visualisation). On the other hand, PCA is just a diagonal rotation of our initial covariance matrix and the eigenvectors represent a new axial system in the space spanned by our original data. We can directly explain what a particular PCA does.

Application to new/unseen data. t-SNE is not learning a function from the original space to the new (lower) dimensional one and that's a problem. On that matter, t-SNE is a non-parametric learning algorithm so approximating with parametric algorithm is an ill-posed problem. The embedding is learned by directly moving the data across the low dimensional space. That means one does not get an eigenvector or a similar construct to use in new data. In contrast, using PCA the eigenvectors offer a new axes system what can be directly used to project new data. [Apparently one could try training a deep-network to learn the t-SNE mapping (you can hear Dr. van der Maaten at ~46' of this video suggesting something along this lines) but clearly no easy solution exists.]

Incomplete data. Natively t-SNE does not deal with incomplete data. In fairness, PCA does not deal with them either but numerous extensions of PCA for incomplete data (eg. probabilistic PCA) are out there and are almost standard modelling routines. t-SNE currently cannot handle incomplete data (aside obviously training a probabilistic PCA first and passing the PC scores to t-SNE as inputs).

The k is not (too) small case. t-SNE solves a problem known as the crowding problem, effectively that somewhat similar points in higher dimension collapsing on top of each other in lower dimensions (more here). Now as you increase the dimensions used the crowding problem gets less severe ie. the problem you are trying to solve through the use of t-SNE gets attenuated. You can work around this issue but it is not trivial. Therefore if you need a k dimensional vector as the reduced set and k is not quite small the optimality of the produce solution is in question. PCA on the other hand offer always the k best linear combination in terms of variance explained. (Thanks to @amoeba for noticing I made a mess when first trying to outline this point.)

I do not mention issues about computational requirements (eg. speed or memory size) nor issues about selecting relevant hyperparameters (eg. perplexity). I think these are internal issues of the t-SNE methodology and are irrelevant when comparing it to another algorithm.

To summarise, t-SNE is great but as all algorithms has its limitations when it comes to its applicability. I use t-SNE almost on any new dataset I get my hands on as an explanatory data analysis tool. I think though it has certain limitations that do not make it nearly as applicable as PCA. Let me stress that PCA is not perfect either; for example, the PCA-based visualisations are often inferior to those of t-SNE.

You can’t add samples to an existing tSNE plot because there is no function outputed by the initial tSNE that maps from the higher dimensional space to the lower dimensions

UMAP is faster, the embeddings are often ++better, and you can use the result to project new data.

  • PCA loadings can be used to project new data

e.g. from this paper Multi-stage Differentiation Defines Melanoma Subtypes with Differential Vulnerability to Drug-Induced Iron-Dependent Oxidative Stress Fig 1D.

diffStagePCA = prcomp(t(diffStageDataCentered))

# Diff stage PCA (scores for top panel)
diffStagePCA_scores = diffStagePCA$x

# Cell line projected to diff stage PCA (scores for bottom panel)
diffStagePCA_rotation = diffStagePCA$rotation
cellLineProjected_scores <- as.matrix(t(cellLineDataCentered)) %*% as.matrix(diffStagePCA_rotation)

co-expression network

cell type prioritization/differential abundance test

  • why not compare groups of percentages?

some wisdom collected from https://twitter.com/tangming2005/status/1299927761826590720

by Nicola.

The problem with % is that they're bound to [0,1] and groups sum to 1. A multinomial logistic regression is probably your best friend here. You will see people using ANOVA, I would avoid that, as assumptions would not be met. You can use nnet::multinom to do it. see https://esajournals.onlinelibrary.wiley.com/doi/10.1890/10-0340.1

by Tom Kelly

I’m skeptical of doing this because it assumes that you don’t have bias in your sampling prep. How do you know the difference in cell proportions is biological not technical? Some cell types are harder to dissociate than others.

me:

A good point. We found exactly that the macrophage monocytes are hard to get single cell suspension in the tumor type we are studying.

Tom:

Yeah we collaborate with immunologists and heard similar issues with macrophages. With microfluidics, also important to consider cell size as large cells will be filtered out or block the instrument. Some cells more likely to clump together too. Perhaps single-nuclei is better?

Bottom line, please consider the techinical limitations and make sure the result is not confounded by techinical issues. For solid tumors, immunostaining to check if the cell population differences are true on slides.

large scale scRNAseq analysis

demultiplex

Marker gene pannel

  • COMET Single-Cell Marker Detection tool COMET’s goal is to make it easier to isolate a specified cluster of cells from a larger population. We attempt to find the best set of ‘marker’ surface proteins that occur in the specified cluster, but not in the rest of the population. Given this information, researchers can isolate the specified cluster using antibodies which bind to these ‘marker’ proteins.

Why does my output contain genes that are not relevant (e.g. are secreted rather than cell-surface)?? Our current marker list is inclusive rather than exclusive. If you find irrelevant non surface markers (e.g. secreted), you can manualy delete them from the list you used and upload the new list.

determine the clustering resolution or how do you know the clusters are real

cell-cell interactions

extension of 3UTR for better counting for scRNAseq data

A. Derr, et al., End Sequence Analysis Toolkit (ESAT) expands the extractable information from single-cell RNA-seq data. Genome Res. 26, 1397–1410 (2016)

  • https://academic.oup.com/endo/article/159/12/3910/5133692 "we show that incomplete transcriptome annotation can cause false negatives on some scRNAseq platforms that only generate 3′ transcript end sequences, and we use in vivo data to recover reads of the pituitary transcription factor Prop1."

regulatory network

Gene signature/sets analysis

Alternative polyadenylation (APA)

  • We have written a Python+R pipeline called "polyApipe" for identifying alternative polyadenylation (APA) sites in 10X Genomics scRNA-seq, based on the presence of polyadenylated reads. Once sites are identified, UMIs are counted for each site and the APA state of genes in cells can be determined. Given the sparse and noisy nature of this data, we have developed an R package "weitrix" to identify principal components of variation in APA based on measurements of varying accuracy and with many missing values. We then use varimax rotation to obtain independently interpretable components. In an embryonic mouse brain dataset, we identify 8 distinct components of APA variation, and assign biological meaning to each component in terms of the genes, cell type, and cell phase.

  • scraps: an end-to-end pipeline for measuring alternative polyadenylation at high resolution using single-cell RNA-seq

Alternative splicing

differential transcript usage

useful databases

interesting papers to read

interactive visulization

  • VISION A high-throughput and unbiased module for interpreting scRNA-seq data.
  • DISCO: Deep Integration of Single-Cell Omics. Want to visual millions of cell online and annotate cell type automatically? Try it!!! Make single cell easier and make life easier!
  • TISCH Tumor Immune Single-cell Hub (TISCH) is a scRNA-seq database focusing on tumor microenvironment (TME).
  • CancerSCEM To date, CancerSCE version 1.0 consists of 208 cancer samples across 28 studies and 20 human cancer types

single cell RNAseq copy-number variation

advance of scRNA-seq tech

single cell multi-omics

Allele specific scRNAseq

mutation from (sc)RNAseq and scATAC

single-cell specific

"Yes, I'm only really convinced by mutations called from "tumour-normal" comparisons for identifying somatic mutations. In our cardelino paper (out today!!) we put a lot of effort into modelling ref and alt allele counts observed from scRNA-seq reads, but used calls fr WES data" Davis McCarthy.

pseudotemporal modelling

large scale single cell analysis

zero inflation

We show that the primary causes of zero inflation are not technical but rather biological in nature. We also demonstrate that parameter estimates from the zero-inflated negative binomial distribution are an unreliable indicator of zero inflation. Despite the existence of zero inflation of scRNA-Seq counts, we recommend the generalized linear model with negative binomial count distribution (not zero-inflated) as a suitable reference model for scRNA-Seq analysis.

The field is advancing so fast!!

check this website for the tools being added:
https://www.scrna-tools.org/

paper published:
Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database

contamination of 10x data

https://twitter.com/constantamateur/status/994832241107849216?s=11

Did you know that droplet based single cell RNA-seq data (like 10X) is contaminated by ambient mRNAs? Good news though, we've written a paper (https://www.biorxiv.org/content/early/2018/04/20/303727 …) and created an R package called SoupX (https://github.com/constantAmateur/SoupX) to fix this problem.

Is this really a problem? It depends on your experiment. Contamination ranges from 2% - 50%. 10% seems common; it's 8% for 10X PBMC data. Solid tissues are typically worse, but there's no way to know in advance. Wouldn't you like to know how contaminated your data are?

These mRNAs come from the single cell suspension fed into the droplet creation system. They mostly get their from lysed cells and so resemble the cells being studied. This means the profile of the contamination is experiment specific and creates a batch effect.

cellranger is the toolkit developed by the 10x genomics company to deal with the data.

process raw fastq from GEO

https://twitter.com/sinabooeshaghi/status/1571902572369432576 https://github.com/pachterlab/ffq

pip install kb-python gget ffq
kb-ref -i index.idx -g t2g.txt -f transcriptome.fa $(gget ref --ftp -w dna,gtf homo_sapiens)
kb-count -i index.idx -g t2g.txt -x 10xv3 -o out $(ffq --ftp SRR10668798 | jq -r '.[] | .url' | tr '\n' ' ' )

some tools for 10x

DropletUtils Provides a number of utility functions for handling single-cell (RNA-seq) data from droplet technologies such as 10X Genomics. This includes data loading, identification of cells from empty droplets, removal of barcode-swapped pseudo-cells, and downsampling of the count matrix.

More Repositories

1

getting-started-with-genomics-tools-and-resources

Unix, R and python tools for genomics and data science
Shell
1,143
star
2

RNA-seq-analysis

RNAseq analysis notes from Ming Tang
Python
867
star
3

ChIP-seq-analysis

ChIP-seq analysis notes from Ming Tang
Python
670
star
4

bioinformatics-one-liners

Bioinformatics one liners from Ming Tang
452
star
5

awesome_spatial_omics

tools and notes for spatial omics
209
star
6

The-world-of-faculty

resources for faculty
207
star
7

TCR-BCR-seq-analysis

T/B cell receptor sequencing analysis notes
202
star
8

DNA-seq-analysis

DNA sequencing analysis notes from Ming Tang
Shell
139
star
9

scATACseq-analysis-notes

my notes for scATACseq analysis
113
star
10

pyflow-ChIPseq

a snakemake pipeline to process ChIP-seq files from GEO or in-house
Python
101
star
11

scclusteval

Single Cell Cluster Evaluation
R
85
star
12

pyflow-ATACseq

ATAC-seq snakemake pipeline
Python
82
star
13

DNA-methylation-analysis

DNA methylation analysis notes from Ming Tang
78
star
14

machine-learning-resource

70
star
15

papers_with_data_to_mine

published papers with a lot of data
61
star
16

oneliner_100day_challenge

Bioinformatics one-liner for 100 days
43
star
17

scATACutils

R/Bioconductor package for working with 10x scATACseq data
R
38
star
18

scRNA-seq-workshop-Fall-2019

Harvard FAS informatics scRNAseq workshop website
R
36
star
19

biotech_resource

some resources for startup companies
36
star
20

compbio_tutorials

My youtube programming scripts
HTML
33
star
21

compbio_resources_chatomics

24
star
22

pyflow-RNAseq

RNAseq pipeline based on snakemake
Python
22
star
23

Machine_learning_drug_discovery

21
star
24

awesome-long-reads

tools and notes for long reads analysis
19
star
25

pyflow-scATACseq

snakemake workflow for post-processing scATACseq data
Python
19
star
26

crazyhottommy

17
star
27

Coursera_Bioinformatics_for_Beginners

python scripts for the Coursera Bioinformatics for Beginners
Python
17
star
28

pyflow-cellranger

A Snakemake pipeline for cellranger to process 10x single-cell RNAseq data
Python
15
star
29

scripts-general-use

HTML
15
star
30

single-cell-DNAseq-notes

14
star
31

pyflow_seurat_parameter

cluster stability measurement by subsampling and reclustering with Seurat V3 and V4
R
11
star
32

immunotherapy_scRNAseq_papers

11
star
33

CV

my CV using pagedown
JavaScript
10
star
34

awesome-single-cell-proteomics

9
star
35

mixed_histology_lung_cancer

8
star
36

cloud_computing_resources

7
star
37

MIT6.00.1x-Introduction-to-Computer-Science-and-Programming-Using-Python

my notes for the homework
Python
5
star
38

immunology_tools

5
star
39

pyflow-single-cell

single-cell RNAseq ATACseq processing pipeline
Python
5
star
40

writing-tips

5
star
41

scATACtools

R, python, unix tools for 10x scATACseq data
R
5
star
42

wholebrain_docker

docker file for wholebrain http://www.wholebrainsoftware.org/cms/installing-wholebrain-on-ubuntudebian/
Dockerfile
5
star
43

Genrich_compare

snakemake workflow comparing Genrich and MACS2
Python
5
star
44

phantompeakqualtools

Automatically exported from code.google.com/p/phantompeakqualtools
R
4
star
45

computation_wiki

Tommy's computation wiki
HTML
4
star
46

flowcytometry_analysis_notes

4
star
47

mixing_histology_lung_cancer

HTML
3
star
48

pyflow-chromForest

snakemake workflow for random forest based feature selection on chromHMM data
Python
3
star
49

odyssey_dot_files

my dot files on Harvard Odyssey HPC
Shell
3
star
50

primer3_scATAC_peaks

batch design primers for scATACseq differential peaks
Shell
3
star
51

seurat_v3_dockerfile

docker file for seurat v3
Dockerfile
2
star
52

PRADA_pipeline_Verhaak_lab

Shell
2
star
53

STAT115_HW

Tommy's homework
2
star
54

rocker_tidyvese_jpeg_cairo

docker file to extend rocker tidyverse
Dockerfile
2
star
55

ucn3_neuron_microarray_analysis

2
star
56

epigenomics_concept_learning

2
star
57

CIDC_single_cell

snakemake single cell pipeline for CIDC
Python
2
star
58

ucn3_neuron_microarray

2
star
59

EvaluateSingleCellClustering

examples for using scclusteval
R
2
star
60

bulk-RNAseq-workshop

HTML
2
star
61

compbio_challenges

2
star
62

machine_learning_datasets

2
star
63

rosalind_problems_python_solutions

Python
1
star
64

Epigenome_RoadTrip

my RoadTrip project
Python
1
star
65

data-science-machine-learning-project

HTML
1
star
66

ChIP-seq-carpentry

Development of the ChIPseq workshop for Data Carpentry
Python
1
star
67

one-click-hugo-cms

SCSS
1
star
68

nextjs-blog-theme

JavaScript
1
star
69

hodgkin_lymphoma_publication_scRNAseq_analysis

Jupyter Notebook
1
star