Language: Python
License: MIT License

Batch balanced KNN

BBKNN is a fast and intuitive batch effect removal tool that can be directly used in the scanpy workflow. It serves as an alternative to scanpy.pp.neighbors(), with both functions creating a neighbour graph for subsequent use in clustering, pseudotime and UMAP visualisation. The standard approach begins by identifying the k nearest neighbours for each cell across the entire dataset; these candidates are then transformed into exponentially related connectivities, which serve as the basis for further analyses. If technical artifacts are present in the data (whether due to differing data acquisition technologies, protocol alterations or even particularly severe operator effects), they make it challenging to link corresponding cell types across different batches.

[Figure: KNN]

BBKNN combats this effect by taking each cell and identifying its k nearest neighbours (with a smaller k) in each batch separately, rather than in the dataset as a whole. The nearest neighbours from each batch are then merged into a final neighbour list for the cell. This helps create connections between analogous cells in different batches without altering the counts or PCA space.

[Figure: BBKNN]
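
The difference between the two neighbour searches can be illustrated with a small, dependency-free sketch. This is not BBKNN's implementation (the package depends on libraries such as annoy and pynndescent and goes on to compute connectivities). The helper names knn, global_neighbours and batch_balanced_neighbours and the toy data are hypothetical, and the query cell is left out of its own neighbour list for simplicity.

```python
import math

def knn(points, query, candidates, k):
    """Indices of the k candidates nearest to points[query] (brute force)."""
    others = [i for i in candidates if i != query]
    others.sort(key=lambda i: math.dist(points[query], points[i]))
    return others[:k]

def global_neighbours(points, query, k):
    """Standard approach: search the whole dataset at once."""
    return knn(points, query, range(len(points)), k)

def batch_balanced_neighbours(points, batches, query, k_within_batch):
    """Batch balanced approach: search each batch separately, then merge."""
    merged = []
    for batch in sorted(set(batches)):
        members = [i for i, b in enumerate(batches) if b == batch]
        merged.extend(knn(points, query, members, k_within_batch))
    return merged

# Toy data (an assumption for illustration): two cell types centred at
# x = 0 and x = 10, with batch "b" shifted by 3 along the second axis.
points, batches = [], []
for batch, shift in (("a", 0.0), ("b", 3.0)):
    for centre in (0.0, 10.0):
        for j in range(5):
            points.append((centre + 0.1 * j, shift))
            batches.append(batch)

plain = global_neighbours(points, 0, 4)
balanced = batch_balanced_neighbours(points, batches, 0, 2)

print("global:", plain, {batches[i] for i in plain})
print("batch balanced:", balanced, {batches[i] for i in balanced})
```

In this toy example the plain search for cell 0 only finds cells from its own batch, while the batch balanced search returns two neighbours from each batch, so the total neighbour count is the per-batch k times the number of batches.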

Citation

If you use BBKNN in your work, please cite the paper:

@article{polanski2019bbknn,
  title={BBKNN: Fast Batch Alignment of Single Cell Transcriptomes},
  author={Pola{\'n}ski, Krzysztof and Young, Matthew D and Miao, Zhichao and Meyer, Kerstin B and Teichmann, Sarah A and Park, Jong-Eun},
  doi={10.1093/bioinformatics/btz625},
  journal={Bioinformatics},
  year={2019}
}

Installation

BBKNN depends on Cython, numpy, scipy, annoy, pynndescent, umap-learn and scikit-learn. The package is available on pip and conda, and can be easily installed as follows:

pip3 install bbknn

or

conda install -c bioconda bbknn

BBKNN can also make use of faiss. Consult the official installation instructions; the easiest way to get it is via conda.

Usage and Documentation

BBKNN can slot directly into the spot occupied by scanpy.pp.neighbors() in the Seurat-inspired scanpy workflow. It computes a batch aligned variant of the neighbourhood graph, with its uses within scanpy including clustering, diffusion map pseudotime inference and UMAP visualisation. The basic syntax to run BBKNN on scanpy's AnnData object (with PCA computed via scanpy.tl.pca()) is as follows:

import bbknn
bbknn.bbknn(adata)

You can provide which adata.obs column to use for batch discrimination via the batch_key parameter. This defaults to 'batch', which is created by scanpy when you merge multiple AnnData objects (e.g. if you were to import multiple samples separately and then concatenate them).

Integration can be improved by using ridge regression on both a technical effect and a biological grouping prior to BBKNN, following a workflow from Park et al., 2020. In the event of not having a biological grouping at hand, a coarse clustering obtained from a BBKNN-corrected graph can be used in its place. This creates the following basic workflow syntax:

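import bbknn
import scanpy
# first pass: BBKNN graph, then a coarse clustering to stand in for the biological grouping
bbknn.bbknn(adata)
scanpy.tl.leiden(adata)
# ridge regression with the batch as technical effect and the clustering as biological grouping
bbknn.ridge_regression(adata, batch_key=['batch'], confounder_key=['leiden'])
# recompute PCA on the regressed data and rerun BBKNN on it
scanpy.tl.pca(adata)
bbknn.bbknn(adata)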
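
To illustrate why the regression uses both a technical effect and a biological grouping, here is a minimal, dependency-free sketch of the idea on a single toy gene. It is not the code of bbknn.ridge_regression(): the helpers solve, ridge_fit and mean_where, the toy values and the choice of fitting without an intercept are all assumptions made for illustration. The model is fitted on one-hot columns for both covariates, but only the part explained by the batch columns is subtracted.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ridge_fit(design, y, alpha=1.0):
    """Coefficients minimising ||y - design @ beta||^2 + alpha * ||beta||^2."""
    p = len(design[0])
    gram = [[sum(row[i] * row[j] for row in design) + (alpha if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    moment = [sum(row[i] * t for row, t in zip(design, y)) for i in range(p)]
    return solve(gram, moment)

# Toy gene (hypothetical): the cell type adds 0 or 5, the batch adds 0 or 2.
cells = [(batch, group) for batch in (0, 1) for group in (0, 1) for _ in range(2)]
expression = [2.0 * batch + 5.0 * group for batch, group in cells]

# One-hot design; columns are [batch 0, batch 1, group 0, group 1].
design = [[float(b == 0), float(b == 1), float(g == 0), float(g == 1)]
          for b, g in cells]
beta = ridge_fit(design, expression, alpha=1.0)

# Subtract only the batch component; the group component stays in the data.
corrected = [value - (row[0] * beta[0] + row[1] * beta[1])
             for value, row in zip(expression, design)]

def mean_where(values, predicate):
    picked = [v for v, cell in zip(values, cells) if predicate(cell)]
    return sum(picked) / len(picked)

batch_gap_before = (mean_where(expression, lambda c: c == (1, 0))
                    - mean_where(expression, lambda c: c == (0, 0)))
batch_gap_after = (mean_where(corrected, lambda c: c == (1, 0))
                   - mean_where(corrected, lambda c: c == (0, 0)))
group_gap_after = (mean_where(corrected, lambda c: c[1] == 1)
                   - mean_where(corrected, lambda c: c[1] == 0))

print(batch_gap_before, batch_gap_after, group_gap_after)
```

In this sketch the gap between batches within one cell type shrinks after the correction, while the gap between the two cell types is kept. Because the ridge penalty shrinks the coefficients, a small batch gap remains.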

Alternatively, you can provide a PCA matrix with cells as rows, along with a matching vector of batch assignments for the cells, and call BBKNN as follows (with connectivities being the primary graph output of interest):

import bbknn.matrix
distances, connectivities, parameters = bbknn.matrix.bbknn(pca_matrix, batch_list)

An HTML render of the BBKNN function docstring, detailing all the parameters, can be accessed at ReadTheDocs. BBKNN use, along with using ridge regression to improve the integration, is shown in a demonstration notebook.

BBKNN in R

At this point, there is no plan to create a BBKNN R package. However, it can be run quite easily via reticulate. Using the base functions is the same as in Python. If you have a PCA matrix and a batch assignment vector and want to get UMAP coordinates out of them, you can use the following code snippet. The odd step of computing a PCA and then replacing it with your original values is unfortunately necessary due to how AnnData innards operate from a reticulate level. Provide your Python path in use_python().

library(reticulate)
use_python("/usr/bin/python3")

anndata = import("anndata",convert=FALSE)
bbknn = import("bbknn", convert=FALSE)
sc = import("scanpy",convert=FALSE)

adata = anndata$AnnData(X=pca, obs=batch)
sc$tl$pca(adata)
adata$obsm$X_pca = pca
bbknn$bbknn(adata,batch_key=0)
sc$tl$umap(adata)
umap = py_to_r(adata$obsm[["X_umap"]])

If you wish to change any integer arguments (such as neighbors_within_batch), you'll have to wrap the value in as.integer() so Python understands it as an integer.

When testing locally, faiss refused to work when BBKNN was reticulated. As such, provide use_faiss=FALSE to the BBKNN call if you run into this problem.

Example Notebooks

demo.ipynb is the main demonstration, applying BBKNN to some pancreas data with a batch effect. The notebook also uses ridge regression to improve the integration.

The BBKNN paper makes use of the following analyses:

  • simulation.ipynb applies BBKNN to simulated data with a known ground truth, and demonstrates the utility of graph trimming by introducing an unrelated cell population. This simulated data is then used to benchmark BBKNN against mnnCorrect, CCA, Scanorama and Harmony in benchmark.ipynb. benchmark2.ipynb finishes off by benchmarking a BBKNN variant that is reluctant to work within R/reticulate and visualising the findings. benchmark3-new-R-methods.ipynb adds some newer R approaches to the benchmark.
  • mouse.ipynb runs a collection of murine atlases through BBKNN. mouse-harmony.ipynb applies Harmony to the same data.

The BBKNN preprint performed some additional analyses that got left out of the final manuscript. Archival notebooks are stored in a separate repository.
