singleCellNet
Table of content
- Introduction
- Data
- Train SCN claissfier
- Assess SCN claissfier with heldout data
- Query
- Visualization
- Train cross-species SCN classifier
- Query for cross-species data
- Assess SCN claissfier with external dataset
- More detailed visualization examples
- Explore important celltype-specific top-pairs
- SCN score calibration
- Loom integration
- Seurat integration
- SCE integration
- Available training datasets
Introduction
SingleCellNet enables the classifcation of single cell RNA-Seq data across species and platforms. See our recent publication for more details. Additionally, we have a vignette to guide you through the steps as well.
Here, we illustrate ...
-
how to build and assess single cell classifiers
-
how to build and assess cross-species single cell classifiers
-
how to use these classifiers to quantify 'cell identity' from query scRNA-Seq data
If you want to use the bulk RNA-Seq version of CellNet, go to bulk CellNet.
Our singleCellNet is available on Python pySCN which is Scanpy and AnnData compatible.
Data
In this example, we use a subset of the Tabula Muris data to train singleCellNet. To learn more about the Tabula Muris project, see the manuscript. As query data, we use scRNA-Seq of kidney cells as reported in Park et al 2018. We also provide an example of classifying human, bead enriched PBMCs (from https://www.ncbi.nlm.nih.gov/pubmed/28091601). You can download this data here:
APPLICATION | METADATA | EXPRESSION |
---|---|---|
Query | metadata | expression data |
Training | metadata | expression data |
cross-species | human-mouse orthologs | |
cross-species | metadata | expression data |
*more training datasets (metadata and expression data) are provided at the bottom of the page.
Training
Setup
install.packages("devtools")
devtools::install_github("pcahan1/singleCellNet")
library(singleCellNet)
Optional set up if you are working with loom files
devtools::install_github(repo = "hhoeflin/hdf5r")
devtools::install_github(repo = "mojaveazure/loomR", ref = "develop")
library(loomR)
Fetch the data if you have not already done so
download.file("https://s3.amazonaws.com/cnobjects/singleCellNet/examples/sampTab_Park_MouseKidney_062118.rda", "sampTab_Park_MouseKidney_062118.rda")
download.file("https://s3.amazonaws.com/cnobjects/singleCellNet/examples/expMatrix_Park_MouseKidney_Oct_12_2018.rda", "expMatrix_Park_MouseKidney_Oct_12_2018.rda")
download.file("https://s3.amazonaws.com/cnobjects/singleCellNet/examples/expMatrix_TM_Raw_Oct_12_2018.rda", "expMatrix_TM_Raw_Oct_12_2018.rda")
download.file("https://s3.amazonaws.com/cnobjects/singleCellNet/examples/sampTab_TM_053018.rda", "sampTab_TM_053018.rda")
## For cross-species analyis:
download.file("https://s3.amazonaws.com/cnobjects/singleCellNet/examples/human_mouse_genes_Jul_24_2018.rda", "human_mouse_genes_Jul_24_2018.rda")
download.file("https://s3.amazonaws.com/cnobjects/singleCellNet/examples/6k_beadpurfied_raw.rda", "6k_beadpurfied_raw.rda")
download.file("https://s3.amazonaws.com/cnobjects/singleCellNet/examples/stDat_beads_mar22.rda", "stDat_beads_mar22.rda")
## To demonstrate how to integrate loom files to SCN
download.file("https://s3.amazonaws.com/cnobjects/singleCellNet/examples/pbmc_6k.loom", "pbmc_6k.loom")
Load query data
stPark = utils_loadObject("sampTab_Park_MouseKidney_062118.rda")
expPark = utils_loadObject("expMatrix_Park_MouseKidney_Oct_12_2018.rda")
dim(expPark)
[1] 16272 43745
genesPark = rownames(expPark)
rm(expPark)
gc()
Load the training data
expTMraw = utils_loadObject("expMatrix_TM_Raw_Oct_12_2018.rda")
dim(expTMraw)
[1] 23433 24936
stTM = utils_loadObject("sampTab_TM_053018.rda")
dim(stTM)
[1] 24936 17
stTM<-droplevels(stTM)
Find genes in common to the data sets and limit analysis to these
commonGenes = intersect(rownames(expTMraw), genesPark)
length(commonGenes)
[1] 13831
expTMraw = expTMraw[commonGenes,]
Split for training and assessment, and transform training data
set.seed(100) #can be any random seed number
stList = splitCommon(sampTab=stTM, ncells=100, dLevel="newAnn")
stTrain = stList[[1]]
expTrain = expTMraw[,rownames(stTrain)]
Train the classifier
- If you increase nTopGenes and nTopGenePairs, you may get a even better classifier performance on query data!
system.time(class_info<-scn_train(stTrain = stTrain, expTrain = expTrain, nTopGenes = 10, nRand = 70, nTrees = 1000, nTopGenePairs = 25, dLevel = "newAnn", colName_samp = "cell"))
user system elapsed
476.839 25.809 503.351
Assessing the classifier with heldout data
Apply to held out data
#validate data
stTestList = splitCommon(sampTab=stList[[2]], ncells=100, dLevel="newAnn") #normalize validation data so that the assessment is as fair as possible
stTest = stTestList[[1]]
expTest = expTMraw[commonGenes,rownames(stTest)]
#predict
classRes_val_all = scn_predict(cnProc=class_info[['cnProc']], expDat=expTest, nrand = 50)
Assess classifier
tm_heldoutassessment = assess_comm(ct_scores = classRes_val_all, stTrain = stTrain, stQuery = stTest, dLevelSID = "cell", classTrain = "newAnn", classQuery = "newAnn", nRand = 50)
plot_PRs(tm_heldoutassessment)
plot_metrics(tm_heldoutassessment)
Classification result heatmap
#Create a name vector label used later in classification heatmap where the values are cell types/ clusters and names are the sample names
nrand = 50
sla = as.vector(stTest$newAnn)
names(sla) = as.vector(stTest$cell)
slaRand = rep("rand", nrand)
names(slaRand) = paste("rand_", 1:nrand, sep='')
sla = append(sla, slaRand) #include in the random cells profile created
sc_hmClass(classMat = classRes_val_all,grps = sla, max=300, isBig=TRUE)
Attribution plot
plot_attr(classRes=classRes_val_all, sampTab=stTest, nrand=nrand, dLevel="newAnn", sid="cell")
Viusalize average top pairs genes expression for training data
gpTab = compareGenePairs(query_exp = expTest, training_exp = expTrain, training_st = stTrain, classCol = "newAnn", sampleCol = "cell", RF_classifier = class_info$cnProc$classifier, numPairs = 20, trainingOnly= TRUE)
train = findAvgLabel(gpTab = gpTab, stTrain = stTrain, dLevel = "newAnn")
hm_gpa_sel(gpTab, genes = class_info$cnProc$xpairs, grps = train, maxPerGrp = 50)
Query
Apply to Park et al query data
expPark = utils_loadObject("expMatrix_Park_MouseKidney_Oct_12_2018.rda")
nqRand = 50
system.time(crParkall<-scn_predict(class_info[['cnProc']], expPark, nrand=nqRand))
user system elapsed
89.633 5.010 95.041
Visualization
sgrp = as.vector(stPark$description1)
names(sgrp) = as.vector(stPark$sample_name)
grpRand =rep("rand", nqRand)
names(grpRand) = paste("rand_", 1:nqRand, sep='')
sgrp = append(sgrp, grpRand)
# heatmap classification result
sc_hmClass(crParkall, sgrp, max=5000, isBig=TRUE, cCol=F, font=8)
Classification annotation assignment
# This classifies a cell with the catgory with the highest classification score or higher than a classification score threshold of your choosing.
# The annotation result can be found in a column named category in the query sample table.
stPark <- get_cate(classRes = crParkall, sampTab = stPark, dLevel = "description1", sid = "sample_name", nrand = nqRand)
Classification result violin plot
sc_violinClass(sampTab = stPark, classRes = crParkall, sid = "sample_name", dLevel = "description1", addRand = nqRand)
Skyline plot of classification results
library(viridis)
stKid2 = addRandToSampTab(crParkall, stPark, "description1", "sample_name")
skylineClass(crParkall, "T cell", stKid2, "description1",.25, "sample_name")
Cross-species classification
Load the mouse training and human query data
stQuery = utils_loadObject("stDat_beads_mar22.rda")
expQuery = utils_loadObject("6k_beadpurfied_raw.rda") # use Matrix if RAM low
dim(expQuery)
[1] 32643 6000
stTM = utils_loadObject("sampTab_TM_053018.rda")
expTMraw = utils_loadObject("expMatrix_TM_Raw_Oct_12_2018.rda") # reload training
Load the ortholog table and convert human gene names to mouse ortholog names, and limit analysis to genes in common between the training and query data.
oTab = utils_loadObject("human_mouse_genes_Jul_24_2018.rda")
dim(oTab)
[1] 16688 3
aa = csRenameOrth(expQuery, expTMraw, oTab)
expQueryOrth = aa[['expQuery']]
expTrainOrth = aa[['expTrain']]
Limit anlaysis to a subset of the TM cell types
cts = c("B cell", "cardiac muscle cell", "endothelial cell", "erythroblast", "granulocyte", "hematopoietic precursor cell", "late pro-B cell", "limb_mesenchymal", "macrophage", "mammary_basal_cell", "monocyte", "natural killer cell", "T cell", "trachea_epithelial", "trachea_mesenchymal")
stTM2 = filter(stTM, newAnn %in% cts)
stTM2 = droplevels(stTM2)
rownames(stTM2) = as.vector(stTM2$cell) # filter strips rownames
expTMraw2 = expTrainOrth[,rownames(stTM2)]
dim(expTMraw2)
[1] 14550 15161
Train Classifier
stList = splitCommon(stTM2, ncells=100, dLevel="newAnn")
stTrain = stList[[1]]
expTrain = expTMraw2[,rownames(stTrain)]
system.time(class_info2<-scn_train(stTrain = stTrain, expTrain = expTrain, nTopGenes = 10, nRand = 70, nTrees = 1000, nTopGenePairs = 25, dLevel = "newAnn", colName_samp = "cell"))
user system elapsed
41.029 6.747 47.963
Apply to held out data
#validate data
stTestList = splitCommon(stList[[2]], ncells=100, dLevel="newAnn")
stTest = stTestList[[1]]
expTest = expTMraw2[,rownames(stTest)]
#predict
system.time(classRes_val_all2 <- scn_predict(class_info2[['cnProc']], expTest, nrand = 50))
user system elapsed
0.691 0.032 0.724
Assess classifier
tm_heldoutassessment = assess_comm(ct_scores = classRes_val_all2, stTrain = stTrain, stQuery = stTest, dLevelSID = "cell", classTrain = "newAnn", classQuery = "newAnn", nRand = 50)
plot_PRs(tm_heldoutassessment)
plot_metrics(tm_heldoutassessment)
Classification result heatmap
nrand=50
sla = as.vector(stTest$newAnn)
names(sla) = as.vector(stTest$cell)
slaRand = rep("rand", nrand)
names(slaRand) = paste("rand_", 1:nrand, sep='')
sla = append(sla, slaRand)
# heatmap classification result
sc_hmClass(classRes_val_all2, sla, max=300, font=7, isBig=TRUE)
Attribute plot
plot_attr(classRes_val_all2, stTest, nrand=nrand, dLevel="newAnn", sid="cell")
Apply to human query data
stQuery$description = as.character(stQuery$description)
stQuery[which(stQuery$description == "NK cell"), "description"] = "natural killer cell"
nqRand = 50
system.time(crHS <- scn_predict(class_info2[['cnProc']], expQueryOrth, nrand=nqRand))
user system elapsed
3.566 0.548 4.166
Assess classifier with external dataset
tm_pbmc_assessment = assess_comm(ct_scores = crHS, stTrain = stTrain, stQuery = stQuery, classTrain = "newAnn",classQuery="description",dLevelSID="sample_name")
plot_PRs(tm_pbmc_assessment)
plot_metrics(tm_pbmc_assessment)
More visualization
Classification result heatmap
sgrp = as.vector(stQuery$prefix)
names(sgrp) = as.vector(stQuery$sample_name)
grpRand = rep("rand", nqRand)
names(grpRand) = paste("rand_", 1:nqRand, sep='')
sgrp = append(sgrp, grpRand)
sc_hmClass(crHS, sgrp, max=5000, isBig=TRUE, cCol=F, font=8)
Note that the macrophage category seems to be promiscuous in the mouse held out data, too.
Classification violin plot
sc_violinClass(sampTab = stQuery, classRes = crHS, sid = "sample_name", dLevel = "description")
Classification violin plot with adjusted width
sc_violinClass(sampTab = stQuery,classRes = crHS, sid = "sample_name", dLevel = "description", ncol = 12)
Classification violin plot with selected cluster
sc_violinClass(stQuery, crHS, sid = "sample_name", dLevel = "description", ncol = 12, sub_cluster = "B cell")
Attribution plot
plot_attr(crHS, stQuery, nrand=nqRand, sid="sample_name", dLevel="description")
Attribution plot with subcluster focus
plot_attr(sampTab = stQuery, classRes = crHS, sid = "sample_name", dLevel = "description", nrand = 50, sub_cluster = c("B cell", "T cell"))
UMAP by category
system.time(umPrep_HS<-prep_umap_class(crHS, stQuery, nrand=nqRand, dLevel="description", sid="sample_name", topPC=5))
user system elapsed
25.703 0.740 26.450
plot_umap(umPrep_HS)
Heatmap top pairs genes for training sample average
system.time(gpTab2 <- compareGenePairs(query_exp = expQueryOrth, training_exp = expTrainOrth, training_st = stTrain, classCol = "newAnn", sampleCol = "cell", RF_classifier = class_info2$cnProc$classifier, numPairs = 20, trainingOnly = FALSE))
user system elapsed
84.130 0.677 84.826
sgrp = as.vector(stQuery$prefix)
names(sgrp) = rownames(stQuery)
train2 = findAvgLabel(gpTab2, stTrain = stTrain, dLevel = "newAnn")
sgrp = append(sgrp, train2)
hm_gpa_sel(gpTab2, genes = class_info2$cnProc$xpairs, grps = sgrp, maxPerGrp = 5)
How to calibrate/make sense of a given SCN score
#this function aims to give you a sense of how precise/sensitive SCN is with the assigned score of a given cell type for a cell
#tm_assess_matrix = tm_heldoutassessment$nonNA_PR
#tm_assess_matrix is a held_out assessment metric extracted from tm_heldoutassessment, which is already stored in SCN.
#e_assess_matrix is also provided for a gastrulation SCN classifier
score = 0.6
celltype = "B cell"
calibration = scn_calibration(score = score, celltype = celltype, matrix=tm_assess_matrix)
#[1] "SCN score of 0.6 for cell type B cell has precision of 0.979 ~ 0.979 and sensitivity of 0.93 ~ 0.93"
calibration
#$score
#[1] 0.6
#$celltype
#[1] "B cell"
#$precision
#[1] 0.979 0.979
#$recall
#[1] 0.93 0.93
How to integrate loom file to SCN
lfile = loadLoomExpCluster("pbmc_6k.loom", cellNameCol = "obs_names", xname = "description")
stQuery = lfile$sampTab
dim(stQuery)
[1] 6000 2
expQuery = lfile$expDat
dim(expQuery)
[1] 32643 6000
#With this you can rerun the cross-species analysis and follow the exact same steps
Integrate Seurat object to SCN analysis
#exp_type options can be: counts, normcounts, and logcounts, if they are available in your sce object
seuratfile = extractSeurat(seurat_object, exp_slot_name = "counts")
sampTab = seuratfile$sampTab
expDat = seuratfile$expDat
Integrate SCE object to SCN analysis
#exp_type options can be: counts, data, and scale.data if they are available in your sce object
scefile = extractSCE(sce_object, exp_slot_name = "counts")
sampTab = scefile$sampTab
expDat = scefile$expDat
More training data for your own analysis
study | species | organ/tissue | seq method | data |
---|---|---|---|---|
Baron | mouse | pancreas | inDrop | data |
Baron | human | pancreas | inDrop | data |
Murano* | human | pancreas | Cel-Seq2 | data |
Segerstolp | human | pancreas | Smart-Seq | data |
Park | human | kidney | 10x | data |
Haber | mouse | intestine | Smart-Seq2 | data |
TM10x | mouse | atlas subset | 10x | data |
TM10x | mouse | atlas | 10x | data |
TMfacs | mouse | atlas subset | Smart-Seq | data |
TMfacs | mouse | atlas | Smart-Seq | data |
MWS | mouse | atlas | microwell-seq | data |
Zeisel | mouse | barin altas | 10x | data |
Loo | mouse | cortex(e14.5) | Dropseq | data |
Darmanis | human | cortex | C1 | data |
Gokce* | human | striatum | C1 and Smart-Seq2 | data |
*the expresion data is log-transformed.