  • Language
    C++
  • License
    Apache License 2.0
  • Created about 7 years ago
  • Updated over 3 years ago

demuxlet

Genetic multiplexing of barcoded single cell RNA-seq

Citation

Demuxlet has been published: https://www.nature.com/articles/nbt.4042

If you find it useful, please cite: Kang et al., Nature Biotechnology 2017.

Tips for running

  • Set --alpha 0 --alpha 0.5 to get better estimates of doublets. This restricts the alpha grid to 0 and 0.5, which assumes that a doublet is a 50% genetic mixture of two individuals.
  • Set --group-list to a list of barcodes (e.g. barcodes.tsv from 10X) to speed things up and to demultiplex only the cells called by other methods.
  • To reproduce the results presented in Figure 2 of the demuxlet paper, please go to: https://github.com/yelabucsf/demuxlet_paper_code/tree/master/fig2 to download the vcf and the outputs of demuxlet.
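
As a rough illustration of the first two tips, the following Python sketch assembles a demuxlet command line. It uses only flags listed in the usage text of this README; the helper name build_demuxlet_cmd and the file names (pooled.bam, genotypes.vcf, barcodes.tsv, run1) are hypothetical placeholders, not part of demuxlet.

```python
import shlex


def build_demuxlet_cmd(sam, vcf, out_prefix, barcodes=None, alphas=(0, 0.5)):
    """Assemble a demuxlet argument list (hypothetical helper, not part of demuxlet)."""
    cmd = ["demuxlet", "--sam", sam, "--vcf", vcf, "--out", out_prefix]
    # --alpha is repeated once per grid value, as in "--alpha 0 --alpha 0.5".
    for alpha in alphas:
        cmd += ["--alpha", f"{alpha:g}"]
    # Restrict the run to barcodes already called as cells by another method.
    if barcodes is not None:
        cmd += ["--group-list", barcodes]
    return cmd


# Placeholder file names, for illustration only.
cmd = build_demuxlet_cmd("pooled.bam", "genotypes.vcf", "run1", barcodes="barcodes.tsv")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once the paths point at real files.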

Introduction

demuxlet is a software tool to deconvolute sample identity and identify multiplets when multiple samples are pooled by barcoded single cell sequencing. demuxlet takes (1) a SAM/BAM/CRAM file produced by the standard 10x sequencing platform, or by any other barcoded single cell RNA-seq protocol (with proper --tag-UMI and --tag-group options), and (2) a VCF/BCF file containing the genotype (GT), posterior probability (GP), or genotype likelihood (GL). It uses them to assign each barcode to a specific sample (or a pair of samples) in the VCF file.

Installing demuxlet

Before installing demuxlet, you need to install htslib in the same directory where you want to install demuxlet (i.e. demuxlet and htslib should be siblings).
NOTE: htslib 1.11 is not supported for now; use an earlier release (e.g. 1.10.x).

After installing htslib, clone the current snapshot of this repository and build it:

$ git clone https://github.com/statgen/demuxlet.git
$ cd demuxlet
$ autoreconf -vfi
$ ./configure  (with additional options such as --prefix)
$ make
$ make install (may require root privilege)

Using demuxlet

demuxlet uses a self-documentation utility. Run it with the -man or -help option to see the command-line usage.

$ ./demuxlet          (for short usage)
$ ./demuxlet -help    (for detailed usage)

The detailed usage is also pasted below.

Options for input SAM/BAM/CRAM
  --sam           [STR: ]             : Input SAM/BAM/CRAM file. Must be sorted by coordinates and indexed
  --tag-group     [STR: CB]           : Tag representing readgroup or cell barcodes, in the case to partition the BAM file into multiple groups. For 10x genomics, use CB
  --tag-UMI       [STR: UB]           : Tag representing UMIs. For 10x genomics, use UB

Options for input VCF/BCF
  --vcf           [STR: ]             : Input VCF/BCF file, containing the individual genotypes (GT), posterior probability (GP), or genotype likelihood (PL)
  --field         [STR: GP]           : FORMAT field to extract the genotype, likelihood, or posterior from
  --geno-error    [FLT: 0.01]         : Genotype error rate (must be used with --field GT)
  --min-mac       [INT: 1]            : Minimum minor allele count
  --min-callrate  [FLT: 0.50]         : Minimum call rate
  --sm            [V_STR: ]           : List of sample IDs to compare to (default: use all)
  --sm-list       [STR: ]             : File containing the list of sample IDs to compare

Output Options
  --out           [STR: ]             : Output file prefix
  --alpha         [V_FLT: ]           : Grid of alpha to search for (default is 0.1, 0.2, 0.3, 0.4, 0.5)
  --write-pair    [FLG: OFF]          : Writing the (HUGE) pair file
  --doublet-prior [FLT: 0.50]         : Prior of doublet
  --sam-verbose   [INT: 1000000]      : Verbose message frequency for SAM/BAM/CRAM
  --vcf-verbose   [INT: 10000]        : Verbose message frequency for VCF/BCF

Read filtering Options
  --cap-BQ        [INT: 40]           : Maximum base quality (higher BQ will be capped)
  --min-BQ        [INT: 13]           : Minimum base quality to consider (lower BQ will be skipped)
  --min-MQ        [INT: 20]           : Minimum mapping quality to consider (lower MQ will be ignored)
  --min-TD        [INT: 0]            : Minimum distance to the tail (lower will be ignored)
  --excl-flag     [INT: 3844]         : SAM/BAM FLAGs to be excluded

Cell/droplet filtering options
  --group-list    [STR: ]             : List of tag readgroup/cell barcode to consider in this run. All other barcodes will be ignored. This is useful for parallelized run
  --min-total     [INT: 0]            : Minimum number of total reads for a droplet/cell to be considered
  --min-uniq      [INT: 0]            : Minimum number of unique reads (determined by UMI/SNP pair) for a droplet/cell to be considered
  --min-snp       [INT: 0]            : Minimum number of SNPs with coverage for a droplet/cell to be considered
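
To make the default --excl-flag value of 3844 easier to read, the sketch below decodes it into the standard SAM FLAG bits. The bit names follow the SAM specification. The is_excluded helper assumes the usual convention that a read is dropped when any excluded bit is set; this README does not spell out that rule, so treat it as an assumption.

```python
# Standard SAM FLAG bits, as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(value):
    """Return the names of all SAM FLAG bits set in value, lowest bit first."""
    return [name for bit, name in SAM_FLAGS.items() if value & bit]


def is_excluded(read_flag, excl_flag=3844):
    """Assumed convention: exclude a read if it carries any of the excluded bits."""
    return (read_flag & excl_flag) != 0


print(decode_flag(3844))
# ['unmapped', 'secondary', 'qc_fail', 'duplicate', 'supplementary']
```

In other words, 3844 = 4 + 256 + 512 + 1024 + 2048, so the default skips unmapped, secondary, QC-failed, duplicate, and supplementary alignments.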

Interpretation of output files

demuxlet generates multiple output files, such as [prefix].best, [prefix].sing, [prefix].sing2, and optionally [prefix].pair (with the --write-pair argument). Each file contains the following information:

  • The [prefix].best file contains the best guess of the sample identity, with the detailed statistics used to reach the best guess.
  • The [prefix].sing file contains the statistics for matching each cell with each possible sample.
  • The [prefix].sing2 file contains statistics similar to those in the [prefix].sing file, but generated for sanity checking of the [prefix].pair results.
  • The [prefix].pair file contains the statistics for matching each cell with each possible configuration of doublet.

The [prefix].best file contains the following 22 columns.

  1. BARCODE - Cell barcode for the cell that is being assigned in this row
  2. RD.TOTL - The total number of reads overlapping with variant sites for each droplet.
  3. RD.PASS - The total number of reads that passed the quality threshold, such as mapping quality, base quality.
  4. RD.UNIQ - The total number of UMIs that passed the quality threshold. If a UMI is observed at a single variant multiple times, it is counted only once. If a UMI is observed across multiple variants, it is counted once for each variant.
  5. N.SNP - The total number of variants overlapping with any read in the droplet.
  6. BEST - The best assignment for sample ID.
    • For singlets, SNG-<sample ID>
    • For doublets, DBL-<sample ID1>-<sample ID2>-<mixture rate>
    • For ambiguous droplets, AMB-<best-singlet sample ID>-<next-best-singlet sample ID>-<doublet ID1/ID2>
  7. SNG.1ST - The best singlet assignment for sample ID
  8. SNG.LLK1 - The log(likelihood that the ID from SNG.1ST is the correct assignment)
  9. SNG.2ND - The next best singlet assignment for sample ID
  10. SNG.LLK2 - The log(likelihood that the ID from SNG.2ND is the correct assignment)
  11. SNG.LLK0 - The log-likelihood from allele frequencies only
  12. DBL.1ST - The sample ID that is most likely included if the assignment is a doublet
  13. DBL.2ND - The sample ID that is next most likely included if the assignment is a doublet
  14. ALPHA - % Mixture Proportion
  15. LLK12 - The log(likelihood that the ID is a doublet)
  16. LLK1 - The log(likelihood that the ID from DBL.1ST is the correct singlet assignment)
  17. LLK2 - The log(likelihood that the ID from DBL.2ND is the correct singlet assignment)
  18. LLK10 - The log(likelihood that the ID from DBL.1ST is one member of the doublet, and the other doublet identity is calculated from allele frequencies only)
  19. LLK20 - The log(likelihood that the ID from DBL.2ND is one member of the doublet, and the other doublet identity is calculated from allele frequencies only)
  20. LLK00 - The log(likelihood that the droplet is a doublet, but both identities are calculated from allele frequencies only)
  21. PRB.DBL - Posterior probability of the doublet assignment
  22. PRB.SNG1 - Posterior probability of the singlet assignment when excluding all possible doublets
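
As a minimal sketch of how the [prefix].best file might be summarized, the Python snippet below counts droplets by the prefix of the BEST column (SNG, DBL, or AMB). It assumes the file is tab-delimited with a header row that uses the column names listed above; the function name summarize_best and the example rows are made up for illustration and are not demuxlet output.

```python
import csv
import io
from collections import Counter


def summarize_best(handle):
    """Count droplets by the prefix of the BEST column (SNG, DBL, AMB)."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        # Split on the first hyphen only, since sample IDs may contain hyphens.
        counts[row["BEST"].split("-", 1)[0]] += 1
    return counts


# Made-up rows with a subset of the columns, for illustration only.
example = (
    "BARCODE\tBEST\tPRB.DBL\n"
    "AAAC-1\tSNG-sampleA\t0.001\n"
    "AAAG-1\tSNG-sampleB\t0.002\n"
    "AACT-1\tDBL-sampleA-sampleB-0.500\t0.999\n"
    "AAGT-1\tAMB-sampleA-sampleB-sampleA/sampleB\t0.450\n"
)

counts = summarize_best(io.StringIO(example))
print(dict(counts))
```

For a real run, open the output file (for example with open("run1.best")) and pass the handle to summarize_best in place of the in-memory example.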
