Single-cell current best practices tutorial case study for the paper: Luecken and Theis, "Current best practices in single-cell RNA-seq analysis: a tutorial"

Scripts for "Current best-practices in single-cell RNA-seq: a tutorial"

Note: The "current" best practices detailed in this workflow were set up in 2019 and therefore do not necessarily reflect the latest best practices for scRNA-seq analysis. For an up-to-date version of the best practices for single-cell RNA-seq analysis (and more modalities), please see our continuously updated online book: https://www.sc-best-practices.org.

For more information and contribution guidelines, please visit the associated GitHub repository: https://github.com/theislab/single-cell-best-practices

This repository is complementary to the publication:

M.D. Luecken, F.J. Theis, "Current best practices in single-cell RNA-seq analysis: a tutorial", Molecular Systems Biology 15(6) (2019): e8746

The paper was recommended on F1000Prime as being of special significance in the field.

Access the recommendation on F1000Prime

The repository contains:

  • scripts to generate the paper figures
  • a case study which complements the manuscript
  • the code for the marker gene detection study from the supplementary material

The main part of this repository is a case study in which the best practices established in the manuscript are applied to a mouse intestinal epithelium regions dataset from Haber et al., Nature 551 (2018), available from GEO under accession GSE92332. Different versions of this case study can be found in the latest_notebook/ and old_releases/ directories.

The scripts in the plotting_scripts/ folder reproduce the figures that are shown in the manuscript and the supplementary materials. These scripts contain comments to explain each step. Each figure that does not have a corresponding script in the plotting_scripts/ folder was taken from the case study or the marker gene study.

In case of questions or issues, please get in touch by posting an issue in this repository.

If the materials in this repo are of use to you, please consider citing the above publication.

Environment set up

A docker container with a working sc-tutorial environment is now available here thanks to Leander Dony. If you would like to set up the environment via conda or manually outside of the docker container, please follow the instructions below.

To run the tutorial case study, several packages must be installed. As both R and Python packages are required, we prefer using a conda environment. To facilitate the setup of a conda environment, we have provided the sc_tutorial_environment.yml file, which contains all conda- and pip-installable dependencies. R dependencies that are not already available as conda packages must be installed into the environment itself.

To set up a conda environment, the following instructions must be followed.

  1. Set up the conda environment from the sc_tutorial_environment.yml file.

    conda env create -f sc_tutorial_environment.yml
    
  2. Ensure that the environment can find the gsl libraries from R. This is done by setting the CFLAGS and LDFLAGS environment variables (see https://bit.ly/2CjJsgn). Here we set them so that they are correctly set every time the environment is activated.

    cd YOUR_CONDA_ENV_DIRECTORY
    mkdir -p ./etc/conda/activate.d
    mkdir -p ./etc/conda/deactivate.d
    touch ./etc/conda/activate.d/env_vars.sh
    touch ./etc/conda/deactivate.d/env_vars.sh
    

    Here, YOUR_CONDA_ENV_DIRECTORY can be found by running conda info --envs and using the directory that corresponds to your conda environment name (default: sc-tutorial).

    WHILE NOT IN THE ENVIRONMENT(!!!!) open the env_vars.sh file at ./etc/conda/activate.d/env_vars.sh and enter the following into the file:

    #!/bin/sh
    
    CFLAGS_OLD=$CFLAGS
    export CFLAGS_OLD
    export CFLAGS="`gsl-config --cflags` ${CFLAGS_OLD}"
     
    LDFLAGS_OLD=$LDFLAGS
    export LDFLAGS_OLD
    export LDFLAGS="`gsl-config --libs` ${LDFLAGS_OLD}"
    

    Also change the ./etc/conda/deactivate.d/env_vars.sh file to:

    #!/bin/sh
     
    CFLAGS=$CFLAGS_OLD
    export CFLAGS
    unset CFLAGS_OLD
     
    LDFLAGS=$LDFLAGS_OLD
    export LDFLAGS
    unset LDFLAGS_OLD
    

    Note again that these files should be written WHILE NOT IN THE ENVIRONMENT. Otherwise you may overwrite the CFLAGS and LDFLAGS environment variables in the base environment!

  3. Enter the environment with conda activate sc-tutorial, or with conda activate ENV_NAME if you changed the environment name in the sc_tutorial_environment.yml file.

  4. Open R and install the dependencies via the commands:

    install.packages(c('devtools', 'gam', 'RColorBrewer', 'BiocManager'))
    update.packages(ask=FALSE)
    BiocManager::install(c("scran","MAST","monocle","ComplexHeatmap","slingshot"), version = "3.8")
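
Step 2 above can also be scripted. The following is a minimal sketch in Python, not part of the repository: the function name write_gsl_hooks is hypothetical, and the environment directory is assumed to be passed in explicitly. It writes the same two env_vars.sh files shown in step 2. As with the manual steps, run it while the environment is not active.

```python
from pathlib import Path

# Contents copied from the activate.d/env_vars.sh listing in step 2.
ACTIVATE_SH = """#!/bin/sh

CFLAGS_OLD=$CFLAGS
export CFLAGS_OLD
export CFLAGS="`gsl-config --cflags` ${CFLAGS_OLD}"

LDFLAGS_OLD=$LDFLAGS
export LDFLAGS_OLD
export LDFLAGS="`gsl-config --libs` ${LDFLAGS_OLD}"
"""

# Contents copied from the deactivate.d/env_vars.sh listing in step 2.
DEACTIVATE_SH = """#!/bin/sh

CFLAGS=$CFLAGS_OLD
export CFLAGS
unset CFLAGS_OLD

LDFLAGS=$LDFLAGS_OLD
export LDFLAGS
unset LDFLAGS_OLD
"""


def write_gsl_hooks(env_dir):
    """Write the activate/deactivate hooks into a conda environment directory.

    env_dir is the directory reported by `conda info --envs` for the
    environment (YOUR_CONDA_ENV_DIRECTORY in the text above).
    """
    written = []
    for hook, content in (("activate.d", ACTIVATE_SH), ("deactivate.d", DEACTIVATE_SH)):
        hook_dir = Path(env_dir) / "etc" / "conda" / hook
        hook_dir.mkdir(parents=True, exist_ok=True)  # equivalent of mkdir -p
        target = hook_dir / "env_vars.sh"
        target.write_text(content)
        written.append(target)
    return written
```

A call such as write_gsl_hooks("YOUR_CONDA_ENV_DIRECTORY") returns the two paths that were written, which makes it easy to inspect the files before activating the environment.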
    

These steps should set up an environment to perform single-cell analysis with the tutorial workflow on a Linux system. Please note that we have encountered issues with conda environments on macOS. When using macOS, we recommend installing the packages without conda, using separately installed Python and R versions. Alternatively, you can try using the base conda environment and installing all packages as described in the conda_env_instructions_for_mac.txt file. In the base environment, R should be able to find the relevant gsl libraries, so LDFLAGS and CFLAGS should not need to be set.

Also note that conda and pip don't always work well together. Conda developers have suggested first installing all conda packages and then installing pip packages on top of these where conda packages are not available. Thus, installing further conda packages into the environment afterwards may cause issues. Instead, start a new environment and reinstall all conda packages first.

If you prefer to set up an environment manually, a list of all package requirements is given at the end of this document.

Downloading the data

As mentioned above, the data for the case study comes from GSE92332. To run the case study as shown, you must download this data and place it in the correct folder. Unpacking the data requires tar and gunzip, which should already be available on most systems. If you have cloned the GitHub repository and have the case study script in a latest_notebook/ folder, then this can be done via the following commands, run from the location where you store the case study ipynb file:

cd ../  #To get to the main github repo folder
mkdir -p data/Haber-et-al_mouse-intestinal-epithelium/
cd data/Haber-et-al_mouse-intestinal-epithelium/
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE92nnn/GSE92332/suppl/GSE92332_RAW.tar
mkdir GSE92332_RAW
tar -C GSE92332_RAW -xvf GSE92332_RAW.tar
gunzip GSE92332_RAW/*_Regional_*
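
Where tar and gunzip are not available, the unpacking steps above can be reproduced with the Python standard library. This is a minimal sketch, not part of the repository: the function name unpack_regional is hypothetical, and it assumes that GSE92332_RAW.tar has already been downloaded and that the Regional files in the archive are gzip-compressed with a .gz suffix.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def unpack_regional(tar_path="GSE92332_RAW.tar", out_dir="GSE92332_RAW"):
    """Extract the archive and decompress the *_Regional_* files in place."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Equivalent of: tar -C GSE92332_RAW -xvf GSE92332_RAW.tar
    # Only extract archives from a source you trust.
    with tarfile.open(tar_path) as archive:
        archive.extractall(out)

    # Equivalent of: gunzip GSE92332_RAW/*_Regional_*
    unpacked = []
    for gz_file in sorted(out.glob("*_Regional_*.gz")):
        target = gz_file.with_suffix("")  # drop the trailing .gz
        with gzip.open(gz_file, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz_file.unlink()  # gunzip also removes the compressed file
        unpacked.append(target)
    return unpacked
```

Files that do not match the Regional pattern are left compressed, as in the shell version.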

The annotated dataset with which we briefly compare the results at the end of the notebook is available from the same GEO accession ID (GSE92332). It can be obtained with the following commands, run from the main repository folder:

cd data/Haber-et-al_mouse-intestinal-epithelium/
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE92nnn/GSE92332/suppl/GSE92332_Regional_UMIcounts.txt.gz
gunzip GSE92332_Regional_UMIcounts.txt.gz

Case study notes

We have noticed that steps such as visualization, dimensionality reduction, and clustering (and hence all downstream analyses) can give slightly different results on different systems. This is due to the numerical libraries that are used in the backend. Thus, we cannot guarantee that a rerun of the notebook will generate exactly the same clusters.

While all results are qualitatively similar, the assignment of cells to clusters, especially for stem cells, TA cells, and enterocyte progenitors, can differ between runs across systems. To show the diversity that can be expected, we have uploaded shortened case study notebooks to the alternative_clustering_results/ folder.

Note that running sc.pp.pca() with the parameter svd_solver='arpack' drastically reduces the variability between systems; however, the output is still not exactly the same.
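
The solver setting mentioned above can be fixed explicitly when calling PCA. The following is a minimal sketch, assuming scanpy is installed and adata is an AnnData object that has already been preprocessed. The wrapper name pca_arpack, the default of 50 components, and the fixed random_state are illustrative choices and are not taken from the case study.

```python
def pca_arpack(adata, n_comps=50, random_state=0):
    """Run scanpy's PCA with the ARPACK solver and a fixed seed."""
    import scanpy as sc  # imported lazily so scanpy is only needed when called

    # svd_solver="arpack" is the setting named in the note above.
    # Fixing random_state is an additional assumption of this sketch.
    sc.pp.pca(adata, n_comps=n_comps, svd_solver="arpack", random_state=random_state)
    return adata
```

Even with these settings, clusters computed on different systems may not match exactly, so downstream annotations should still be checked after a rerun.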

Adapting the pipeline for other datasets:

The pipeline was designed to be easily adaptable to new datasets. However, there are several limitations to the general applicability of the current workflow. When adapting the pipeline for your own dataset please take into account the following:

  1. Sparse data formats are not supported by rpy2 and therefore do not work with any of the integrated R commands. Datasets can be turned into a dense format using the code: adata.X = adata.X.toarray()

  2. The case study assumes that the input data is count data obtained from a single-cell protocol with UMIs. If the input data is full-length read data, then one could consider replacing the normalization method with another method that includes gene length normalization (e.g., TPM).
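
Both points can be illustrated with a short sketch. The helper names densify and tpm are hypothetical and not part of the case study. densify uses a duck-typed check for a toarray method before applying the conversion given in point 1 (scipy.sparse.issparse would be the more explicit test). tpm shows the standard transcripts-per-million calculation for a single cell, assuming gene lengths are supplied in kilobases; it is one example of a gene-length-aware normalization, not a recommendation from the manuscript.

```python
def densify(adata):
    """Replace a sparse adata.X with a dense array, as described in point 1."""
    if hasattr(adata.X, "toarray"):  # scipy sparse matrices provide toarray()
        adata.X = adata.X.toarray()
    return adata


def tpm(counts, gene_lengths_kb):
    """Transcripts per million for one cell.

    counts: read counts per gene.
    gene_lengths_kb: gene lengths in kilobases, in the same gene order.
    """
    # Length-normalize first, then scale so that the values sum to one million.
    rates = [count / length for count, length in zip(counts, gene_lengths_kb)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    return [rate / total * 1e6 for rate in rates]
```

For example, tpm([10, 20], [1.0, 2.0]) assigns both genes the same value, because the second gene has twice the reads but is also twice as long.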

Manual installation of package requirements

The following packages are required to run the first version of the case study notebook. For further versions see the README.md in the latest_notebook/ and old_releases/ folders.

General:

  • Jupyter notebook
  • IRKernel
  • rpy2
  • R >= 3.4.3
  • Python >= 3.5

Python:

  • scanpy
  • numpy
  • scipy
  • pandas
  • seaborn
  • louvain>=0.6
  • python-igraph
  • gprofiler-official (from Case study notebook 1906 version)
  • python-gprofiler from Valentine Svensson's github (vals/python-gprofiler)
    • only needed for notebooks before version 1906
  • ComBat python implementation from Maren Buettner's github (mbuttner/maren_codes/combat.py)
    • only needed for scanpy versions before 1.3.8 which don't include sc.pp.combat()

R:

  • scater
  • scran
  • MAST
  • gam
  • slingshot (change DESCRIPTION file for R version 3.4.3)
  • monocle 2
  • limma
  • ComplexHeatmap
  • RColorBrewer
  • clusterExperiment
  • ggplot2
  • IRkernel

Possible sources of error in the manual installation:

For R 3.4.3:

When using Slingshot in R 3.4.3, you must pull a local copy of slingshot via the github repository and change the DESCRIPTION file to say R>=3.4.3 instead of R>=3.5.0.

For R >= 3.5 and bioconductor >= 3.7:

The clusterExperiment version that comes with Bioconductor 3.7 uses a slightly changed naming convention: clusterExperiment() is now called ClusterExperiment(). The latest version of the notebook includes this change, but the original notebook may throw an error with this version.

For rpy2 < 3.0.0:

pandas 0.24.0 is not compatible with rpy2 < 3.0.0. When using old versions of rpy2, please downgrade pandas to 0.23.X. Please also note that pandas 0.24.0 requires anndata version 0.6.18 and scanpy version > 1.3.7.

For enrichment analysis with g:profiler:

Ensure that the correct g:profiler package is used for the notebook. Notebooks until 1904 use python-gprofiler from Valentine Svensson's GitHub, and notebooks from 1906 onward use the gprofiler-official package from the g:profiler team.

If no R packages can be found:

Ensure that IRkernel has linked the correct version of R with your jupyter notebook. Check instructions at https://github.com/IRkernel/IRkernel.
