• Stars
    star
    180
  • Rank 213,097 (Top 5 %)
  • Language
    Nim
  • License
    MIT License
  • Created over 6 years ago
  • Updated almost 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"

somalier: extract informative sites, evaluate relatedness, and perform quality-control on BAM/CRAM/BCF/VCF/GVCF

Actions Status Cite

Quick Start

somalier makes checking any number of samples for identity easy directly from the alignments or from jointly-called VCFs:

The first step is to extract sites. For VCF just use:

somalier extract -d extracted/ --sites sites.vcf.gz -f /data/human/g1k_v37_decoy.fa $cohort.vcf.gz

with a sites file from releases

For BAM or CRAM, use: This is parallelizable by sample via cluster or cloud, but here, using a for loop:

for f in *.cram; do
    somalier extract -d extracted/ --sites sites.vcf.gz -f /data/human/g1k_v37_decoy.fa $f
done

--sites is a VCF of known polymorphic sites in VCF format. A good set is provided in the releases but any set of common variants will work.

⚠️ somalier can work on GVCF and individual VCFs, but it is recommended to extract from bam/cram when possible. It is also good to extract from a jointly-called VCF/BCF when only looking within that cohort. While extracting from a single-sample VCF is possible (with --unknown) and GVCF is also supported, these options are less accurate and more prone to problems.

The next step is to calculate relatedness on the extracted data:

somalier relate --ped $pedigree extracted/*.somalier

This will create text and interactive HTML output (similar to peddy) that makes it fast and easy to detect mismatched samples and sample-swaps.

Example output is here

Note that the somalier relate command runs extremely quickly (< 2 seconds for 600 samples and ~1 minute for 4,500 samples) so it's possible to add/remove samples or adjust a pedigree file and re-run iteratively.

For example to add the n + 1th samples, just run somalier extract on the new sample and then re-use the already extracted data from the n original samples.

For huge sample-sets, if you run into a bash error for argument list too long, you can pass the somalier files as quoted glob strings like: "/path/to/set-a/*.somalier" "/path/to/set-b/*.somalier".

Example Output

  • Interactive output from somalier relate is here
  • Interactive output from somalier ancestry is here

Infer

somalier can also infer first-degree relationships (parent-child) when both-parents are present and can often build entire pedigrees on high-qualty data. To do this, use somalier relate --infer ... and the samples.tsv output will be a pedigree file indicating the inferred relationships.

See wiki for more detail.

Usage

The usage is also described above. Briefly, after downloading the somalier binary and a sites vcf from the releases run:

somalier extract -d cohort/ --sites sites.hg38.vcf.gz -f $reference $sample.bam

for each sample to create a small binary file of the ref and alt counts for the variants listed in sites.hg38.vcf.gz.

for a vcf, run:

somalier extract -d cohort/ --sites sites.hg38.vcf.gz -f $reference $cohort.bcf

somalier can extract from a multi or single-sample VCF or a GVCF. This will be much faster, in cases where it's available, this would look like:

somalier extract -d extracted/ --sites sites.vcf.gz -f /data/human/g1k_v37_decoy.fa joint.vcf.gz

following this, there will be a $sample.somalier file for each sample in the joint.vcf.gz

Note that somalier uses the AD field to extract depth information. If that FORMAT field is not present in the header, then it will use the genotypes only and use a total depth of 20 (10,10 for heterozygote), etc.

Then run:

somalier relate --ped $pedigree_file cohort/*.somalier

This will create an html file for QC in a few seconds.

Note that if a new sample is added to the cohort, it's only necessary to perform the extract step on that sample and then run the (fast) relate step again with all of the extracted files.

Extended Usage

For each command of somalier, extended parameters are listed in --help of each subcommand.

$./somalier --help
Commands:
  extract      :   extract genotype-like information for a single sample from VCF/BAM/CRAM.
  relate       :   aggregate `extract`ed information and calculate relatedness among samples.
  ancestry     :   perform ancestry prediction on a set of samples, given a set of labeled samples
  find-sites   :   create a new sites.vcf.gz file from a population VCF (this is rarely needed).

somalier extract

extract genotype-like information for a single-sample at selected sites

Usage:
  somalier extract [options] sample_file

Arguments:
  sample_file      single-sample CRAM/BAM/GVCF file or multi/single-sample VCF from which to extract

Options:
  -s, --sites=SITES          sites vcf file of variants to extract
  -f, --fasta=FASTA          path to reference fasta file
  -d, --out-dir=OUT_DIR      path to output directory (default: .)
  --sample-prefix=SAMPLE_PREFIX
                             prefix for the sample name stored inside the digest

somalier relate

calculate relatedness among samples from extracted, genotype-like information

Usage:
  somalier relate [options] [extracted ...]

Arguments:
  [extracted ...]  $sample.somalier files for each sample. the first 10 are tested as a glob patterns

Options:
  -g, --groups=GROUPS        optional path  to expected groups of samples (e.g. tumor normal pairs).
specified as comma-separated groups per line e.g.:
    normal1,tumor1a,tumor1b
    normal2,tumor2a
  --sample-prefix=SAMPLE_PREFIX
                             optional sample prefixes that can be removed to find identical samples. e.g. batch1-sampleA batch2-sampleA
  -p, --ped=PED              optional path to a ped/fam file indicating the expected relationships among samples.
  -d, --min-depth=MIN_DEPTH  only genotype sites with at least this depth. (default: 7)
  --min-ab=MIN_AB            hets sites must be between min-ab and 1 - min_ab. set this to 0.2 for RNA-Seq data (default: 0.3)
  -u, --unknown              set unknown genotypes to hom-ref. it is often preferable to use this with VCF samples that were not jointly called
  -i, --infer                infer relationships (https://github.com/brentp/somalier/wiki/pedigree-inference)
  -o, --output-prefix=OUTPUT_PREFIX
                             output prefix for results. (default: somalier)

Note that for large cohorts, by default, somalier relate will subset to interesting sample-pairs so as not to balloon the size of the output. By default, pairs that are expected to be unrelated and have a relatedness <= 0.05. If you wish to force all samples to be reported, then set the environment variable SOMALIER_REPORT_ALL_PAIRS to any non-empty value, e.g. export SOMALIER_REPORT_ALL_PAIRS=1

find-sites

To create a set of new sites, use somalier find-sites on a population VCF. More info on this tool is available in the wiki

Install

get a static binary from here

Users can also get a docker image here which contains htslib and a somalier binary ready-for-use.

How it works

somalier takes a list of known polymorphic sites. Even a few hundred (or dozen) sites can be a very good indicator of relatedness. The best sites are those with a population allele frequency close to 0.5 as that maximizes the probability that any 2 samples will differ. A list of such sites is provided in the releases for GRCh37 and hg38.

In order to quickly calculate genotypes at these sites, somalier assays the exact base. The extraction step is done directly from the bam/cram files 1 sample at a time.

The relate step is run on the output of the extract commands. It runs extremely quickly so that new samples can be added and compared. It uses 3 bit-vectors per sample for hom-ref, het, hom-alt. Each bitvector is a sequence of 64 bit integers where each bit is set if the variant at that index in the sample is for example, heterozygous. With this setup, we can use fast bitwise operations and popcount hardware instructions to calculate relatedness extremely quickly.

For each sample-pair, it reports:

  1. IBS0 -- the number of sites where one sample is hom-ref and another is hom-alt
  2. IBS2 -- the number of sites where the samples have the same genotype
  3. shared-hets -- the number of sites where both samples are heterozygotes
  4. shared-hom-alts -- the number of sites where both samples are homozygous alternate

These are used to calculate relatedness and a measure of relatedness that is unaffected by loss-of-heterozygosity that is often seen in some cancers. The interactive output allows toggling between any of these measures.

It also reports depth information and the count of HET, HOM_REF, HOM_ALT, and unknown genotypes for each sample along with a number of metrics that are useful for general QC.

Example

example

Here, each point is a pair of samples. We can see that the expected identical sample-pairs (e.g. tumor-normal pairs) specified by the user and drawn in red mostly cluster together on the right. Unrelateds cluster on the lower left. The sample-swaps are the blue points that cluster with the red. In the somalier output, the user can hover to see which sample-pairs are involved each point

Ancestry Estimate

somalier can predict ancestry on a set of query samples given a set of labelled samples. You can read more about this in the wiki

Usage

Usage is intentionally very simple and running somalier extract or somalier relate will give sufficient help for nearly all cases.

By default somalier will only consider variants that have a "PASS" or "RefCall" FILTER. To extend this list, set the environment variable SOMALIER_ALLOWED_FILTERS to a comma-delimited list of additional filters to allow.

by default sites with an allele balance < 0.01 will be considered homozygous reference. To adjust this, use e.g. : SOMALIER_AB_HOM_CUTOFF=0.04 somalier relate ...

Other Work

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5499645/

https://academic.oup.com/bioinformatics/article/33/4/596/2624551

Acknowledgement

This work was motivated by interaction and discussions with Preeti Aahir and several early users who provided valuable feedback.

More Repositories

1

cyvcf2

cython + htslib == fast VCF and BCF processing
Cython
366
star
2

vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
Go
332
star
3

goleft

goleft is a collection of bioinformatics tools distributed under MIT license in a single static binary
Go
202
star
4

slivar

genetic variant expressions, annotation, and filtering for great good.
Nim
189
star
5

smoove

structural variant calling and genotyping with existing tools, but, smoothly.
Go
187
star
6

bio-playground

miscellaneous scripts for bioinformatics/genomics that dont merit their own repo.
C
181
star
7

hts-nim

nim wrapper for htslib for parsing genomics data files
Nim
149
star
8

cruzdb

python access to UCSC genomes database
Python
129
star
9

peddy

genotype :: ped correspondence check, ancestry check, sex check. directly, quickly on VCF
Jupyter Notebook
122
star
10

bwa-meth

fast and accurate alignment of BS-Seq reads using bwa-mem and a 3-letter genome
Python
116
star
11

jigv

igv.js standalone page generator and automatic configuration to view bam/cram/vcf/bed. "working in under 1 minute"
Nim
107
star
12

echtvar

using all the bits for echt rapid variant annotation and filtering
Rust
105
star
13

combat.py

python / numpy / pandas / patsy version of ComBat for removing batch effects.
Python
98
star
14

intintmap

fast int64-int64 map for go
Go
93
star
15

duphold

don't get DUP'ed or DEL'ed by your putative SVs.
Nim
90
star
16

slurmpy

submit jobs to slurm with quick-and-dirty python
Python
88
star
17

pyfasta

fast, memory-efficient, pythonic (and command-line) access to fasta sequence files
Python
86
star
18

vcfgo

a golang library to read, write and manipulate files in the variant call format.
Go
65
star
19

fishers_exact_test

Fishers Exact Test for Python (Cython)
Python
63
star
20

xopen

open files for buffered reading and writing in #golang
Go
60
star
21

interlap

fast, pure-python interval overlap testing
Python
50
star
22

seqcover

seqcover allows users to view coverage for hundreds of genes and dozens of samples
Nim
49
star
23

rare-disease-wf

(WIP) best-practices workflow for rare disease
Nextflow
49
star
24

hts-python

pythonic wrapper for libhts (moved to: https://github.com/quinlan-lab/hts-python)
Python
49
star
25

go-chartjs

golang library to make https://chartjs.org/ plots (this is vanilla #golang, not gopherjs)
Go
47
star
26

hts-nim-tools

useful command-line tools written to showcase hts-nim
Nim
46
star
27

irelate

Streaming relation (overlap, distance, KNN) of (any number of) sorted genomic interval sets. #golang
Go
45
star
28

bsub

python wrapper to submit jobs to bsub (and later qsub)
Python
43
star
29

combined-pvalues

combining p-values using modified stouffer-liptak for spatially correlated results (probes)
Python
40
star
30

align

sequence alignment. global, local, glocal.
Python
40
star
31

bigly

a pileup library that embraces the huge
Go
39
star
32

indelope

find large indels (in the blind spot between GATK/freebayes and SV callers)
Nim
39
star
33

tiwih

simple bioinformatics command-line (t)ools (i) (w)ished (i) (h)ad.
Nim
39
star
34

cigar

simple library for dealing with SAM cigar strings
Python
37
star
35

gsort

sort genomic data
Go
35
star
36

450k-analysis-guide

A Practical (And Opinionated) Guide To Analyzing 450K Data
TeX
34
star
37

geneimpacts

prioritize effects of variant annotations from VEP, SnpEff, et al.
Python
31
star
38

quicksect

a cythonized, extended version of the interval search tree in bx
Python
30
star
39

poverlap

significance testing over interval overlaps
Python
29
star
40

fastbit-python

pythonic access to fastbit
Python
26
star
41

nim-lapper

fast easy interval overlapping for nim-lang
Nim
24
star
42

lua-stringy

fast lua string operations
C
21
star
43

pybloomfaster

fast bloomfilter
C
21
star
44

vcfassoc

perform genotype-phenotype-association tests on a VCF with logistic regression.
Python
20
star
45

toolshed

python stuff I use
Python
19
star
46

bw-python

python wrapper to dpryan79's bigwig library using cffi
C
19
star
47

methylcode

Alignment and Tabulation of BiSulfite Treated Reads
C
16
star
48

tnsv

add true-negative SVs from a population callset to a truth-set.
Nim
15
star
49

hileup

horizontal pileup
C
15
star
50

nim-kmer

DNA kmer operations for nim
Nim
15
star
51

genoiser

use the noise
Nim
15
star
52

bigwig-nim

command-line querying+conversion of bigwigs and a nim wrapper for dpryan's libbigwig
Nim
15
star
53

vcf-bench

evaluating vcf parsing libraries
Zig
14
star
54

agoodle

numpy + GDAL == agoodle
Python
13
star
55

hts-zig

ziglang + htslib
Zig
12
star
56

aclust

streaming, flexible agglomerative clustering
Python
12
star
57

gobio

miscellaneous script-like stuff in go for bioinformatics
Go
12
star
58

excord

extract SV signal from a BAM
Go
11
star
59

fastahack-python

cython wrapper to fastahack
C++
11
star
60

spacepile

convert reads from repeated measures of same piece of DNA into spaced matricies for deep learners.
Rust
10
star
61

bwa-mips

Map sequence from Molecular Inversion Probes with BWA, strip arms, de-dup, ..., profit
Python
10
star
62

gobe

a fast, interactive, light-weight, customizable, web-based comparative genomics viewer with simple text input format.
Haxe
10
star
63

bix

tabix file access with golang using biogo machinery
Go
9
star
64

sveval

run multiple sv evalution tools
Python
8
star
65

clinical-components

Summarize the clinical (or lab) components and correlations of your dataset.
Python
8
star
66

bowfast

run bowtie then bfast on colorspace reads.
Shell
7
star
67

bcf

bcf parsing in golang
Go
7
star
68

inheritance

inheritance models for mendelian diseases
Python
7
star
69

skidmarks

find runs (non-randomness) in sequences
Python
7
star
70

falas

Fragment-Aware Local Assembly for Short-reads
Nim
7
star
71

bamject

DO NOT USE inject variants (snps/indels) from a vcf into a bam efficiently.
Nim
7
star
72

bpbio

basepair bio: a single binary with many useful genomics subtools.
Nim
6
star
73

kexpr-nim

nim wrapper for Heng Li's kexpr math expression evaluator library
C
5
star
74

clustermodel

fitting models to clustered correlated data
Python
5
star
75

go-blosc

go wrapper for blosc (blocked number compression with fast random access)
Go
5
star
76

celltypes450

adjust for cell-type composition in 450K data using houseman's and other methods.
R
5
star
77

crystal

find clusters and model correlated data from DNA methylation and other genomic sources.
Python
5
star
78

variantkey-nim

nim-wrapper for variantkey -- (chrom, pos, ref, alt) -> uint64
C
4
star
79

pedfile

pedigree file parsing and relatedness calculations for nim
Nim
4
star
80

shuffler

shuffle genome regions to determine probability of user-defined metric
Python
4
star
81

go-giggle

golang wrapper to giggle
Go
4
star
82

bedder-rs

an API for intersections of genomic data
Rust
4
star
83

nim-gzfile

simple reader and writer for gzipped (and regular) files
Nim
4
star
84

find_cns

find conserved non-coding sequences (CNS)
Python
4
star
85

nim-cgranges

nim wrapper for https://github.com/lh3/cgranges for faster interval tree
C
3
star
86

d4-nim

nim-lang wrapper for https://github.com/38/d4-format
Nim
3
star
87

flatfeature

python module for dealing with BED format for genomic data as a numpy array.
Python
3
star
88

faidx

faidx for golang
Go
3
star
89

tabix-py

interface to tabix using cffi
C
3
star
90

4bit

4bit fasta format.
C
3
star
91

ksw2-nim

nim wrapper for lhs/ksw2 for fast smith-waterman
C
3
star
92

htsuse

some C stuff that uses htslib
C
2
star
93

nim-minizip

nothing to see here.
C
2
star
94

pyfastx

unified access to fasta, fastx using kseq.h + ??
C
2
star
95

learnflash

all the stuff i want to remember how to do in haxe / flash
2
star
96

ififo

ififo provides a fast, sized, generic thread-safe FIFO in golang.
Go
2
star
97

dotfiles

my .bash, .vim, .* files
Shell
2
star
98

cysolar

copy of pysolar using cython
Python
2
star
99

cgotbx

yeah, another tabix wrapper for go.
Go
1
star
100

totable

simple python / cython wrapper to tokyo cabinet tables
C
1
star