  • Stars: 141
  • Rank: 258,442 (Top 6 %)
  • Language: Python
  • License: BSD 2-Clause "Sim...
  • Created almost 5 years ago
  • Updated 2 months ago

Repository Details

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing

kb-python

kb-python is a Python package for processing single-cell RNA-sequencing data. It wraps the kallisto | bustools single-cell RNA-seq command-line tools in order to unify multiple processing workflows.

kb-python was developed by Kyung Hoi (Joseph) Min and A. Sina Booeshaghi while in Lior Pachter's lab at Caltech. If you use kb-python in a publication, please cite:

Melsted, P., Booeshaghi, A.S., et al.
Modular, efficient and constant-memory single-cell RNA-seq preprocessing.
Nat Biotechnol 39, 813–818 (2021).
https://doi.org/10.1038/s41587-021-00870-2

Installation

The latest release can be installed with

pip install kb-python

The development version can be installed with

pip install git+https://github.com/pachterlab/kb_python

There are no prerequisite packages to install. The kallisto and bustools binaries are included with the package.

Usage

kb consists of four subcommands:

$ kb
usage: kb [-h] [--list] <CMD> ...
positional arguments:
  <CMD>
    info      Display package and citation information
    compile   Compile `kallisto` and `bustools` binaries from source
    ref       Build a kallisto index and transcript-to-gene mapping
    count     Generate count matrices from a set of single-cell FASTQ files

kb ref: generate a pseudoalignment index

The kb ref command takes in a species annotation file (GTF) and associated genome (FASTA) and builds a species-specific index for pseudoalignment of reads. This must be run before kb count. Internally, kb ref extracts the coding regions from the GTF and builds a transcriptome FASTA that is then indexed with kallisto index.

kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa <GENOME> <GENOME_ANNOTATION>
  • <GENOME> refers to a genome file (FASTA).
    • For example, the zebrafish genome is hosted by Ensembl.
  • <GENOME_ANNOTATION> refers to a genome annotation file (GTF).
    • For example, the zebrafish genome annotation file is hosted by Ensembl.
  • Note: The latest genome and genome annotation files for every species on Ensembl can be found with the gget command-line tool.

Examples

# Index the zebrafish transcriptome from genome.fa.gz and annotation.gtf.gz
$ kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa genome.fa.gz annotation.gtf.gz
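
When kb ref is driven from a Python script rather than typed at a shell, the command above can be assembled as an argument list. The following is a minimal sketch, not part of kb-python's API: build_kb_ref_command and run_kb_ref are hypothetical helper names, the default file names are the ones used in the example above, and only the flags shown in this section (-i, -g, -f1) are used. Actually running the command assumes the kb executable is on the PATH.

```python
import subprocess


def build_kb_ref_command(genome, annotation, index="index.idx",
                         t2g="t2g.txt", cdna="transcriptome.fa"):
    """Return the argv list for the `kb ref` call shown above.

    Hypothetical helper for illustration; it only uses the flags
    documented in this section (-i, -g, -f1).
    """
    return ["kb", "ref", "-i", index, "-g", t2g, "-f1", cdna,
            str(genome), str(annotation)]


def run_kb_ref(genome, annotation, **outputs):
    """Run `kb ref`; raises CalledProcessError on a non-zero exit code."""
    cmd = build_kb_ref_command(genome, annotation, **outputs)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch also works
    # on a machine where kb is not installed.
    print(" ".join(build_kb_ref_command("genome.fa.gz", "annotation.gtf.gz")))
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.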

kb count: pseudoalign and count reads

The kb count command takes the pseudoalignment index (built with kb ref) and the sequencing reads produced by a sequencer, and generates a count matrix. Internally, kb count runs a series of kallisto and bustools commands that make up the single-cell workflow for the technology that generated the reads.

kb count -i index.idx -g t2g.txt -o out/ -x <TECHNOLOGY> <FASTQ FILE[s]>
  • <TECHNOLOGY> refers to the assay that generated the sequencing reads.
    • For a list of supported assays run kb --list
  • <FASTQ FILE[s]> refers to the list of FASTQ files generated by the sequencer
    • Different assays will have a different number of FASTQ files
    • Different assays will place the different features in different FASTQ files
      • For example, sequencing a 10xv3 library on a NextSeq Illumina sequencer usually results in two FASTQ files.
      • The R1.fastq.gz file (colloquially called "read 1") contains a 16 basepair cell barcode and a 12 basepair unique molecular identifier (UMI).
      • The R2.fastq.gz file (colloquially called "read 2") contains the cDNA associated with the cell barcode-UMI pair in read 1.
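
To make the read 1 layout described above concrete, here is a small sketch that splits a 10xv3 read 1 sequence into its cell barcode and UMI. It assumes the 16-base barcode comes first and the 12-base UMI follows directly, which the text above does not state explicitly. split_read1 is a hypothetical helper and the example read is made up.

```python
BARCODE_LEN = 16  # 10xv3 cell barcode length, as described above
UMI_LEN = 12      # 10xv3 UMI length, as described above


def split_read1(sequence):
    """Split a 10xv3 read 1 sequence into (cell_barcode, umi).

    Assumes the barcode occupies the first 16 bases and the UMI the
    next 12; any bases after that are ignored.
    """
    needed = BARCODE_LEN + UMI_LEN
    if len(sequence) < needed:
        raise ValueError(
            f"read 1 has {len(sequence)} bases, expected at least {needed}")
    barcode = sequence[:BARCODE_LEN]
    umi = sequence[BARCODE_LEN:needed]
    return barcode, umi


# Made-up 28-base read, for illustration only.
read1 = "AAACCCAAGAAACACT" + "GATTACAGATTA"
barcode, umi = split_read1(read1)
print(barcode, umi)
```

kb count performs this step internally for the chosen technology; the sketch is only meant to show what the two fields in read 1 are.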

Examples

# Quantify 10xv3 reads read1.fastq.gz and read2.fastq.gz
$ kb count -i index.idx -g t2g.txt -o out/ -x 10xv3 read1.fastq.gz read2.fastq.gz
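
The same call can be assembled from Python when the FASTQ files are collected programmatically. This is a sketch under stated assumptions: build_kb_count_command is a hypothetical helper, not a kb-python function, and it assumes that when there are several read 1/read 2 pairs they can be listed one pair after another, which this section does not spell out.

```python
def build_kb_count_command(read_pairs, technology, index="index.idx",
                           t2g="t2g.txt", out_dir="out/"):
    """Return the argv list for `kb count` on paired FASTQ files.

    `read_pairs` is a list of (read1, read2) tuples. Hypothetical helper:
    the pair-after-pair ordering for multiple pairs is an assumption.
    """
    fastqs = []
    for read1, read2 in read_pairs:
        fastqs.extend([str(read1), str(read2)])
    if not fastqs:
        raise ValueError("at least one (read1, read2) pair is required")
    return ["kb", "count", "-i", index, "-g", t2g, "-o", out_dir,
            "-x", technology, *fastqs]


# Reproduces the example command above as a single string.
print(" ".join(build_kb_count_command(
    [("read1.fastq.gz", "read2.fastq.gz")], "10xv3")))
```

The resulting list can be handed to subprocess.run(cmd, check=True) on a machine where kb is installed.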

kb info: display package and citation information

The kb info command prints out package information including the version of kb-python, kallisto, and bustools along with their installation location.

$ kb info
kb_python 0.27.3 ...
kallisto: 0.48.0 ...
bustools: 0.41.0 ...
...

kb compile: compile kallisto and bustools binaries from source

The kb compile command grabs the latest kallisto and bustools source and compiles the binaries. Note: this is not required to run kb-python.

Use cases

kb-python facilitates fast and uniform pre-processing of single-cell sequencing data to answer relevant research questions.

$ pip install kb-python gget ffq

# Goal: quantify publicly available scRNAseq data
$ kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa $(gget ref --ftp -w dna,gtf homo_sapiens)
$ kb count -i index.idx -g t2g.txt -x 10xv3 -o out $(ffq --ftp SRR10668798 | jq -r '.[] | .url' | tr '\n' ' ')
# -> count matrix in out/ folder

# Goal: quantify 10xv2 feature barcode data, feature_barcodes.txt is a tab-delimited file
# containing barcode_sequence<tab>barcode_name
$ kb ref -i index.idx -g f2g.txt -f1 features.fa --workflow kite feature_barcodes.txt
$ kb count -i index.idx -g f2g.txt -x 10xv2 -o out/ --workflow kite --h5ad R1.fastq.gz R2.fastq.gz
# -> count matrix in out/ folder
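
The feature barcode file used in the second example is described only as tab-delimited with barcode_sequence<tab>barcode_name per line. A minimal sketch of producing text in that layout follows; format_feature_barcodes is a hypothetical helper, and the feature names and sequences are invented for illustration. The check that barcodes contain only A, C, G, and T is an added assumption, not a documented requirement.

```python
def format_feature_barcodes(features):
    """Return text with one `barcode_sequence<TAB>barcode_name` line per feature.

    `features` maps a feature name to its barcode sequence.
    """
    lines = []
    for name, sequence in features.items():
        sequence = sequence.upper()
        # Assumed sanity check: only unambiguous DNA bases are allowed.
        if not sequence or set(sequence) - set("ACGT"):
            raise ValueError(f"{name}: barcode must contain only A, C, G, T")
        lines.append(f"{sequence}\t{name}\n")
    return "".join(lines)


# Invented feature names and barcode sequences, for illustration only.
example = {
    "feature_A": "AACAAGACCCTTGAG",
    "feature_B": "TACCCGTAATAGCGT",
}
print(format_feature_barcodes(example), end="")
```

Writing the returned text to feature_barcodes.txt gives a file in the layout that the kb ref --workflow kite command above takes as input.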

Submitted by @sbooeshaghi.

Do you have a cool use case for kb-python? Submit a PR (including the goal, code snippet, and your username) so that we can feature it here.

Tutorials

For a list of tutorials that use kb-python please see https://www.kallistobus.tools/.

Documentation

Developer documentation is hosted on Read the Docs.

Contributing

Thank you for wanting to improve kb-python! If you believe you've found a bug, please submit an issue.

If you have a new feature you'd like to add to kb-python, please create a pull request. Pull requests should contain a message detailing the exact changes made, the reasons for the change, and tests that check the correctness of those changes.

Cite

If you use kb-python in a publication, please cite the following papers:

kb-python & bustools

@article{melsted2021modular,
  title={\href{https://doi.org/10.1038/s41587-021-00870-2}{Modular, efficient and constant-memory single-cell RNA-seq preprocessing}},
  author={Melsted, P{\'a}ll and Booeshaghi, A. Sina and Liu, Lauren and Gao, Fan and Lu, Lambda and Min, Kyung Hoi Joseph and da Veiga Beltrame, Eduardo and Hj{\"o}rleifsson, Kristj{\'a}n Eldj{\'a}rn and Gehring, Jase and Pachter, Lior},
  author+an={1=first;2=first,highlight},
  journal={Nature biotechnology},
  year={2021},
  month={4},
  day={1},
  doi={https://doi.org/10.1038/s41587-021-00870-2}
}

kallisto

@article{bray2016near,
  title={Near-optimal probabilistic RNA-seq quantification},
  author={Bray, Nicolas L and Pimentel, Harold and Melsted, P{\'a}ll and Pachter, Lior},
  journal={Nature biotechnology},
  volume={34},
  number={5},
  pages={525--527},
  year={2016},
  publisher={Nature Publishing Group}
}

BUS format

@article{melsted2019barcode,
  title={The barcode, UMI, set format and BUStools},
  author={Melsted, P{\'a}ll and Ntranos, Vasilis and Pachter, Lior},
  journal={Bioinformatics},
  volume={35},
  number={21},
  pages={4472--4473},
  year={2019},
  publisher={Oxford University Press}
}

kb-python was inspired by Sten Linnarsson’s loompy fromfq command (http://linnarssonlab.org/loompy/kallisto/index.html).

More Repositories

1. gget (Python, 921 stars): 🧬 gget enables efficient querying of genomic reference databases
2. kallisto (C, 645 stars): Near-optimal RNA-Seq quantification
3. ffq (Python, 538 stars): A tool to find sequencing data and metadata from public databases.
4. BI-BE-CS-183-2023 (Jupyter Notebook, 389 stars): Introduction to Computational Biology and Bioinformatics Course at Caltech, 2023
5. sleuth (R, 305 stars): Differential analysis of RNA-Seq
6. poseidon (Jupyter Notebook, 168 stars): poseidon system - open source syringe pumps and microscope for laboratories
7. kallistobustools (114 stars): kallisto | bustools workflow for pre-processing single-cell RNA-seq data
8. seqspec (Python, 108 stars): machine-readable file format for genomic library sequence and structure
9. voyager (R, 70 stars): From geospatial to spatial -omics
10. picasso (Jupyter Notebook, 69 stars): Picasso: a methods for embedding points in 2D in a way that respects distances while fitting a user-specified shape.
11. scRNA-Seq-TCC-prep (Jupyter Notebook, 62 stars): Preprocessing of single-cell RNA-Seq (deprecated)
12. metakallisto (Python, 50 stars): Using kallisto for metagenomic analysis
13. LP_2021 (TeX, 45 stars)
14. kallisto-transcriptome-indices (41 stars): Reference transcriptome indices build from kallisto for popular organisms
15. sircel (Jupyter Notebook, 41 stars): Identify cell barcodes from single-cell genomics sequencing experiments
16. SpatialFeatureExperiment (R, 36 stars): Extension of SpatialExperiment with sf
17. MCML (Python, 33 stars)
18. kallisto_paper_analysis (HTML, 32 stars): Analysis from kallisto paper
19. NYMP_2018 (Jupyter Notebook, 29 stars)
20. kma (Python, 28 stars): Keep Me Around: Intron Retention Detection
21. monod (Python, 27 stars): The Monod package fits CME models to sequencing data.
22. gget_examples (Jupyter Notebook, 26 stars): Examples for gget (https://github.com/pachterlab/gget).
23. GFCP_2022 (Jupyter Notebook, 23 stars): RNA velocity validation
24. MBGBLHGP_2019 (Jupyter Notebook, 23 stars): Code for reproducing results from the paper "Modular and efficient pre-processing of single-cell RNA-seq data"
25. colosseum (Jupyter Notebook, 23 stars): colosseum system - open source fraction collector for laboratories
26. BBB (Jupyter Notebook, 22 stars): Bioinformatics for Benched Biologists
27. BHGP_2022 (Jupyter Notebook, 21 stars)
28. kite (Python, 20 stars): kallisto index tag extractor
29. aggregationDE (R, 20 stars): Scripts and software supplement for "Gene-level differential analysis at transcript-level resolution" by Yi, Pimentel, Bray and Pachter
30. qcbc (Jupyter Notebook, 19 stars)
31. splitcode (C, 18 stars): Flexible and efficient parsing, interpreting and editing of sequencing reads
32. PCCA (Python, 18 stars): Code for performing PCA followed by CCA
33. sleuth_paper_analysis (R, 17 stars): Code to reproduce analyses from the sleuth paper
34. concordex (Python, 17 stars): Identification of spatial homogeneous regions
35. RMEJLBASBMP_2024 (Jupyter Notebook, 17 stars): Repository for the paper "The impact of package selection and versioning on single-cell RNA-seq analysis"
36. CGCCP_2023 (Jupyter Notebook, 16 stars): scVI extension for unspliced RNA
37. CBP_2021 (Jupyter Notebook, 16 stars)
38. BYVSTZP_2020 (Jupyter Notebook, 15 stars): This repository contains the code for reproducing all the results and figures from the preprint "Isoform specificity in the mouse primary motor cortex".
39. SBP_2019 (Jupyter Notebook, 15 stars): Code for producing the analysis in the "Quantifying the tradeoff between sequencing depth and cell number in single-cell RNA-seq" manuscript
40. bears_analyses (Python, 11 stars): Examples of kallisto + sleuth
41. voyagerpy (Python, 11 stars)
42. GSP_2019 (Jupyter Notebook, 11 stars): Code for reproducing results from the paper "RNA velocity and protein acceleration from single-cell multiomics experiments."
43. kallisto-sleuth-workshop-2016 (CSS, 11 stars): materials and website for the 2016 kallisto sleuth workshop
44. lair (CSS, 10 stars): home of the bear's lair
45. sleuth_walkthroughs (HTML, 10 stars): Some sleuth walkthroughs to help you get started
46. scATAK (Jupyter Notebook, 10 stars)
47. concordexR (R, 10 stars): Compute the neighborhood consolidation matrix and identify SHRs
48. MBLGLMBHGP_2021 (Jupyter Notebook, 8 stars)
49. Bi-BE-CS-183-2022 (Jupyter Notebook, 8 stars): Website for the 2021-2022 Caltech class Bi/BE/CS 183: Introduction to Computational Biology and Bioinformatics
50. bcl2fastq (C++, 8 stars): source code for bcl2fastq2, files from illumina
51. CP_2023 (Jupyter Notebook, 7 stars)
52. bam2tcc (C++, 7 stars)
53. voyager-testing (Python, 7 stars)
54. GVP_2023 (Jupyter Notebook, 6 stars): scRNA-seq, regulation, and sysbio
55. monod_examples (Jupyter Notebook, 6 stars): Tutorials for the Monod package, which fits CME models to sequencing data.
56. BGP_2023 (Jupyter Notebook, 6 stars)
57. biophysics (Python, 6 stars): Repository for Pachter Lab Biophysics
58. museumst (R, 6 stars): Museum of Spatial Transcriptomics
59. GCCP_2022 (Roff, 5 stars)
60. SGYP_2019 (Jupyter Notebook, 5 stars)
61. BLCSBGLKP_2020 (Jupyter Notebook, 5 stars): Code for analysis of SARS-CoV-2 sequencing based diagnostic testing data
62. kallisto-D (C, 5 stars)
63. zika (R, 5 stars): sleuth workflow for processing zika RNA-seq dataset
64. HSSOHMP_2024 (C++, 5 stars): Code for reproducing the results in the second version of the preprint "Accurate quantification of single-nucleus and single-cell RNA-seq transcripts"
65. GBP_2024 (Jupyter Notebook, 5 stars)
66. pegasus (Python, 4 stars): modular stepper motor control with Arduino, CNC motor sheild, and Pololu stepper driver. also the workhorse of poseidon and colosseum
67. SP_2019 (Jupyter Notebook, 4 stars)
68. GVFP_2021 (Jupyter Notebook, 4 stars): SDE comparison preprint
69. COVID19-County (Jupyter Notebook, 4 stars): COVID-19 data from LA County
70. LP_2024 (Jupyter Notebook, 4 stars)
71. BGP_2024 (Jupyter Notebook, 4 stars)
72. HSHMP_2022 (Python, 3 stars)
73. bibecs183 (Jupyter Notebook, 3 stars): Bi/BE/CS 183 Winter 2019 - Introduction to Computational Biology and Bioinformatics
74. AAQuant (C++, 3 stars): Annotation-Agnostic RNA-seq Quantification
75. BSP_2023 (Jupyter Notebook, 3 stars)
76. FGP_2024 (Jupyter Notebook, 3 stars)
77. BP_2020_2 (Jupyter Notebook, 2 stars): log(x+1) and log(1+x)
78. CP_2021 (Python, 2 stars): Code for reproducing the results in "The Split Senate" paper
79. GRNP_2020 (Jupyter Notebook, 2 stars): Repository for reproducing the results and figures in Gustafsson et al. 2020
80. BKMGP_2021 (Jupyter Notebook, 2 stars)
81. isolate_transcripts (Python, 2 stars)
82. CWGFLHGCCHAP_2021 (Jupyter Notebook, 2 stars)
83. DBALLSMRDMCMGWSTPMBDKPFP_2023 (Jupyter Notebook, 2 stars)
84. GP_2020 (MATLAB, 2 stars): Code to reproduce results in the paper "Special Function Methods for Bursty Models of Transcription"
85. SFEData (R, 2 stars): Example SpatialFeatureExperiment datasets
86. PROBer (C++, 2 stars): PROBer: A general toolkit for analyzing sequencing-based ‘toeprinting’ assays
87. BTRBP_2020 (Jupyter Notebook, 2 stars)
88. BP_2020 (Jupyter Notebook, 2 stars): Decrease in ACE2 mRNA expression in aged mouse lung, bioRxiv, 2020.
89. GP_2021_4 (HTML, 2 stars)
90. CGP_2024_2 (Jupyter Notebook, 2 stars)
91. HPM_2022 (R, 1 star): Simulations of the robustness of AAQuant to noise
92. YLMP_2018 (Shell, 1 star): Scripts to reproduce analysis in YLMP, 2018
93. bcltools (Python, 1 star): (still in development only a few things work) tools for converting bcls to fastqs and fastqs to bcls
94. eXpress (1 star): Streaming fragment assignment for real-time analysis of sequencing experiments
95. make (1 star): Open source bioinstrumentation projects
96. GPCTP_2019-2 (1 star)
97. GP_2020_2 (Jupyter Notebook, 1 star): Intrinsic/extrinsic noise mini-project
98. GP_2021_3 (Jupyter Notebook, 1 star)
99. kallisto_tests (1 star): A set of (regression) tests for kallisto
100. KBP_2023 (1 star)