License: Apache License 2.0

An R package to test for batch effects in high-dimensional single-cell RNA sequencing data.
title: kBET short introduction
author: Maren Büttner
date: 9/18/2017

kBET - k-nearest neighbour batch effect test

The R package provides a test for batch effects in high-dimensional single-cell RNA sequencing data. It evaluates the accordance of replicates based on Pearson's $\chi^2$ test. First, the algorithm creates a k-nearest neighbour matrix and chooses 10% of the samples to check the batch label distribution in their neighbourhoods. If the local batch label distribution is sufficiently similar to the global batch label distribution, the $\chi^2$ test does not reject the null hypothesis (that is, "all batches are well-mixed"). The neighbourhood size k is fixed for all tests. Next, the test returns a binary result for each of the tested samples. Finally, the result of kBET is the average test rejection rate. The lower the test result, the less bias is introduced by the batch effect. kBET is very sensitive to any kind of bias. If kBET returns an average rejection rate of 1 for your batch-corrected data, you may also consider computing the average silhouette width and PCA-based batch-effect measures to explore the degree of the batch effect. Learn more about kBET and batch effect correction in our publication.
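The procedure described above can be sketched in a few lines of base R. This is only an illustration of the idea, not the implementation inside kBET: the toy data, the brute-force neighbour search via dist(), the neighbourhood size k = 20, and the significance level of 0.05 are all assumptions made for this example.

```r
set.seed(42)

# Toy data (assumption): 200 samples, 5 features, two randomly assigned batches
n <- 200
data <- matrix(rnorm(n * 5), nrow = n)
batch <- factor(sample(c("b1", "b2"), n, replace = TRUE))

k <- 20                                       # fixed neighbourhood size (assumption)
dist_mat <- as.matrix(dist(data))             # brute-force pairwise distances
global_freq <- as.numeric(table(batch)) / n   # global batch label frequencies

# Test 10% of the samples
tested <- sample.int(n, size = floor(0.1 * n))

rejected <- vapply(tested, function(i) {
  # k nearest neighbours of sample i, skipping the sample itself
  neighbours <- order(dist_mat[i, ])[2:(k + 1)]
  local_counts <- as.numeric(table(batch[neighbours]))
  # Pearson's chi-squared test of local counts against global frequencies
  chisq.test(local_counts, p = global_freq)$p.value < 0.05
}, logical(1))

# Average rejection rate over the tested samples
rejection_rate <- mean(rejected)
```

Because the batch labels in this toy example are assigned at random, the rejection rate should stay low; data with a real batch effect would push it towards 1.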

Installation

Installation should take less than 5 min.

Via GitHub and devtools

If you want to install the package directly from GitHub, I recommend using the devtools package.

library(devtools)
install_github('theislab/kBET')

Manually

Please download the package as a zip archive and install it via

install.packages('kBET.zip', repos = NULL, type = 'source')

Usage of the kBET function:

#data: a matrix (rows: samples, columns: features (genes))
#batch: vector or factor with batch label of each cell 
batch.estimate <- kBET(data, batch)

If plot = TRUE, kBET creates a boxplot of the kBET rejection rates (for neighbourhoods and for randomly chosen subsets of size k). It returns a list with several parts:

  • summary: summarizes the test results (with 95% confidence interval)
  • results: the p-values of all tested samples
  • average.pval: an average over all p-values of the tested samples
  • stats: the results for each of n_repeat runs - they can be used to reproduce the boxplot that is returned by kBET
  • params: the parameters used in kBET
  • outsider: samples without mutual nearest neighbour, their batch labels, and a p-value for whether their batch label composition differs from the global batch label frequencies

For a single-cell RNAseq dataset with fewer than 1,000 samples, the estimated run time is less than 2 minutes.

Plot kBET's rejection rate

By default (plot = TRUE), kBET returns a boxplot of observed and expected rejection rates for a data set. You might want to turn off the display of these plots and create them elsewhere. All the information needed for this is in the stats part of the results.

library(ggplot2)
# run kBET without its built-in plot
batch.estimate <- kBET(data, batch, plot=FALSE)
# long-format table of observed and expected rejection rates
plot.data <- data.frame(class=rep(c('observed', 'expected'), 
                                  each=length(batch.estimate$stats$kBET.observed)), 
                        data =  c(batch.estimate$stats$kBET.observed,
                                  batch.estimate$stats$kBET.expected))
g <- ggplot(plot.data, aes(class, data)) + geom_boxplot() + 
     labs(x='Test', y='Rejection rate',title='kBET test results') +
     theme_bw() +  
     scale_y_continuous(limits=c(0,1))
print(g) # display the plot

Variations:

The standard implementation of kBET performs a k-nearest neighbour search (if knn = NULL) with a pre-defined neighbourhood size k0, computes an optimal neighbourhood size (heuristics = TRUE), and finally chooses 10% of the samples at random to compute the test statistics (repeatedly by default, n_repeat = 100, to derive a confidence interval). For repeated runs of kBET, we recommend running the k-nearest neighbour search separately:

library(FNN)
# data: a matrix (rows: samples, columns: features (genes))
k0 <- floor(mean(table(batch))) # neighbourhood size: mean batch size
knn <- get.knn(data, k = k0, algorithm = 'cover_tree')

# now run kBET with pre-defined nearest neighbours
batch.estimate <- kBET(data, batch, k = k0, knn = knn)

Note that the get.knn function from the FNN package initializes a variable with n * k entries, where n is the sample size and k is the neighbourhood size. If n * k > 2^31, get.knn aborts the k-nearest neighbour search. The initial neighbourhood size in kBET (k0) is ~ 1/4 * mean(batch size), which can already be too large, for example for mass cytometry data. In such cases, we recommend subsampling the data.
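Before starting a long neighbour search, it can help to check the n * k limit stated above. The following is a minimal sketch; the helper names knn_fits and max_k are hypothetical and not part of kBET or FNN, and the limit of 2^31 is taken from the paragraph above.

```r
# Hypothetical helper: TRUE if n * k stays within the limit
# (the text above states that get.knn aborts if n * k > 2^31)
knn_fits <- function(n, k, limit = 2^31) {
  as.numeric(n) * as.numeric(k) <= limit
}

# Hypothetical helper: largest neighbourhood size that still fits for n samples
max_k <- function(n, limit = 2^31) {
  floor(limit / n)
}

small_ok <- knn_fits(1000, 250)    # small scRNA-seq sized data set
large_ok <- knn_fits(1e6, 5000)    # one million cells with k = 5000
k_limit  <- max_k(1e6)             # upper bound on k for one million cells
```

If knn_fits returns FALSE, either reduce k below max_k(n) or subsample the data as described in the next section.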

Subsampling:

Currently (July 2019), kBET operates only on dense matrices, which results in memory issues for large datasets. Furthermore, k-nearest neighbour search with FNN is limited (see above). We recommend subsampling in these cases. We have thought of several options. One option is to subsample the data irrespective of the substructure:

#data: a matrix (rows: samples, columns: features (genes))
#batch: vector or factor with batch label of each cell 
subset_size <- 0.1 #subsample to 10% of the data
subset_id <- sample.int(n = length(batch), size = floor(subset_size * length(batch)), replace=FALSE)
batch.estimate <- kBET(data[subset_id,], batch[subset_id])

In case of differently sized batches, one should consider stratified sampling in order to keep more samples from smaller batches.
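One possible way to do stratified sampling is to cap the number of cells drawn from each batch, so that small batches are kept (almost) completely while large batches are reduced. This is a sketch under assumptions: the toy batch vector and the cap per_batch = 40 are made up for illustration, and other stratification schemes are equally valid.

```r
set.seed(1)

# Toy batch labels (assumption): three batches of very different size
batch <- factor(rep(c("small", "medium", "large"), times = c(50, 200, 1000)))

per_batch <- 40  # hypothetical cap on the number of cells kept per batch

# Draw up to per_batch indices from each batch, without replacement
subset_id <- unlist(lapply(split(seq_along(batch), batch), function(idx) {
  idx[sample.int(length(idx), size = min(per_batch, length(idx)))]
}), use.names = FALSE)

sampled_sizes <- table(batch[subset_id])

# With real data, kBET would then be run on the subset:
# batch.estimate <- kBET(data[subset_id, ], batch[subset_id])
```

Indexing with idx[sample.int(...)] instead of sample(idx, ...) avoids the surprising behaviour of sample() when a batch contains a single cell.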

The second option of subsampling is to take into account the substructure of the data (i.e. clusters). We observed that the batch label frequencies may vary in clusters. For example, such changes are due to inter-individual variability, or due to targeted population enrichment in some batches (e.g. by FACS), in contrast to unbiased cell sampling. In these cases, we compute the rejection rates for each cluster separately and average the results afterwards.

#data: a matrix (rows: samples, columns: features (genes))
#batch: vector or factor with batch label of each cell 
#clusters: vector or factor with cluster label of each cell 

kBET_result_list <- list()
sum_kBET <- 0
for (cluster_level in unique(clusters)){
   batch_tmp <- batch[clusters == cluster_level]
   data_tmp <- data[clusters == cluster_level, , drop=FALSE] #keep matrix shape
   kBET_tmp <- kBET(df=data_tmp, batch=batch_tmp, plot=FALSE)
   #store by name so that numeric cluster labels (e.g. 0) also work
   kBET_result_list[[as.character(cluster_level)]] <- kBET_tmp
   sum_kBET <- sum_kBET + kBET_tmp$summary$kBET.observed[1]
}

#averaging
mean_kBET <- sum_kBET/length(unique(clusters))

Compute a silhouette width and PCA-based measure:

#data: a matrix (rows: samples, columns: features (genes))
#batch: vector or factor with batch label of each cell 
pca.data <- prcomp(data, center=TRUE) #compute PCA representation of the data
batch.silhouette <- batch_sil(pca.data, batch)
batch.pca <- pcRegression(pca.data, batch)

For a single-cell RNAseq dataset with fewer than 1,000 samples, the estimated run time is less than 2 minutes.
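To give some intuition for what an average silhouette width with respect to batch labels measures, here is a minimal base-R sketch. It is not the implementation of batch_sil; the function name avg_silhouette and the toy data are assumptions for illustration. Values near 0 suggest well-mixed batches, while values near 1 suggest clearly separated batches.

```r
# Hypothetical helper: mean silhouette width of samples grouped by label
avg_silhouette <- function(x, labels) {
  d <- as.matrix(dist(x))
  labels <- factor(labels)              # also drops unused levels
  groups <- levels(labels)
  stopifnot(length(groups) >= 2)        # silhouette needs at least two groups
  widths <- vapply(seq_len(nrow(d)), function(i) {
    own <- labels == labels[i]
    own[i] <- FALSE                     # exclude the sample itself
    if (!any(own)) return(0)            # singleton group: width set to 0
    a <- mean(d[i, own])                # mean distance within own group
    others <- setdiff(groups, as.character(labels[i]))
    # mean distance to the closest other group
    b <- min(vapply(others, function(g) mean(d[i, labels == g]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(widths)
}

set.seed(1)
batch <- factor(rep(c("b1", "b2"), each = 20))
mixed <- matrix(rnorm(40 * 2), ncol = 2)    # batches drawn from the same distribution
separated <- mixed
separated[batch == "b2", ] <- separated[batch == "b2", ] + 10  # shift one batch

sil_mixed <- avg_silhouette(mixed, batch)
sil_separated <- avg_silhouette(separated, batch)
```

In this toy example, sil_separated should be clearly larger than sil_mixed, because the second batch was shifted away from the first.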
