• Stars
    star
    189
  • Rank 204,649 (Top 5 %)
  • Language
    Nim
  • License
    MIT License
  • Created about 6 years ago
  • Updated almost 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

genetic variant expressions, annotation, and filtering for great good.

slivar: filter/annotate variants in VCF/BCF format with simple expressions Build Status

downloads

If you use slivar, please cite the paper

slivar is a set of command-line tools that enables rapid querying and filtering of VCF files. It facilitates operations on trios and groups and allows arbitrary expressions using simple javascript.

use-cases for slivar

  • annotate variants with gnomad allele frequencies from combined exomes + whole genomes at > 30K variants/second using only a 1.5GB compressed annotation file.
  • call denovo variants with a simple expression that uses mom, dad, kid labels that is applied to each trio in a cohort (as inferred from a pedigree file). kid.het && mom.hom_ref && dad.hom_ref && kid.DP > 10 && mom.DP > 10 && dad.DP > 10
  • define and filter on arbitrary groups with labels. For example, 7 sets of samples each with 1 normal and 3 tumor time-points: normal.AD[0] = 0 && tumor1.AB < tumor2.AB && tumor2.AB < tumor3.AB
  • filter variants with simple expressions: variant.call_rate > 0.9 && variant.FILTER == "PASS" && INFO.AC < 22 && variant.num_hom_alt == 0
  • see using slivar for rare disease research

slivar logo

slivar has sub-commands:

  • expr: filter and/or annotate with INFO, trio, sample, group expressions
  • make-gnotate: make a compressed zip file of annotations for use by slivar
  • compound-hets: true compound hets using phase-by-inheritance within gene annotations

Table of Contents

Installation

get the latest binary from: https://github.com/brentp/slivar/releases/latest

slivar_static does not depend on any libraries and should work on any 64 bit linux system.

slivar_shared will require libhts.so (from htslib) to be in the usual places or in a directory indicated in LD_LIBRARY_PATH.

or use via docker from: brentp/slivar:latest

QuickStart

To get started quickly, grab a static binary for the latest release and then follow this example

So for hg38:

vcf=/path/to/your/vcf.vcf.gz
ped=/path/to/your/pedigree.ped
wget https://github.com/brentp/slivar/releases/download/v0.2.8/slivar
chmod +x ./slivar
wget https://raw.githubusercontent.com/brentp/slivar/master/js/slivar-functions.js
wget https://slivar.s3.amazonaws.com/gnomad.hg38.genomes.v3.fix.zip

# example command
./slivar expr --js slivar-functions.js -g gnomad.hg38.genomes.v3.fix.zip \
	--vcf $vcf --ped $ped \
	--info "INFO.gnomad_popmax_af < 0.01 && variant.FILTER == 'PASS'" \
	--trio "example_denovo:denovo(kid, dad, mom)" \
	--family-expr "denovo:fam.every(segregating_denovo)" \
	--trio "custom:kid.het && mom.het && dad.het && kid.GQ > 20 && mom.GQ > 20 && dad.GQ > 20" \
	--pass-only

The pedigree format is explained here

Commands

expr

expr allows filtering on (abstracted) trios and groups. For example, given a VCF (and ped/fam file) with 100 trios, slivar will apply an expression with kid, mom, dad identifiers to each trio that it automatically extracts.

expr can also be used, for example to annotate with population allele frequencies from a gnotate file without any sample filtering. See the wiki for more detail and the gnotate section for gnotation files that we distribute for slivar.

expr commands are quite fast, but can be parallelized using pslivar.

trio

when --trio is used, slivar finds all trios in a VCF, PED pair and let's the user specify an expression with indentifiers of kid, mom, dad that is applied to each possible trio. For example, a simple expression to call de novo variants:

variant.FILTER == 'PASS' && \                         # 
variant.call_rate > 0.95 && \                         # genotype must be known for most of cohort.
INFO.gnomad_af < 0.001 && \                           # rare in gnomad (must be in INFO [but see below])
kid.het && mom.hom_ref && dad.hom_ref && \            # also unknown
kid.DP > 7 && mom.DP > 7 && dad.DP > 7 && \           # sufficient depth in all
(mom.AD[1] + dad.AD[1]) == 0                          # no evidence for alternate in the parents

This requires passing variants that are rare in gnomad that have the expected genotypes and do not have any alternate evidence in the parents. If there are 200 trios in the ped::vcf given, then this expression will be tested on each of those 200 trios.

When trios are not sufficient, use Family Expressions which allow more heterogeneous family structures.

The expressions are javascript so the user can make these as complex as needed.

slivar expr \
   --pass-only \ # output only variants that pass one of the filters (default is to output all variants)
   --vcf $vcf \
   --ped $ped \
   # compressed zip that allows fast annotation so that `gnomad_af` is available in the expressions below.
   --gnotate $gnomad_af.zip \ 
   # any valid javascript is allowed in a file here. provide functions to be used below.
   --js js/slivar-functions.js \ 
   --out-vcf annotated.bcf \
   # this filter is applied before the trio filters and can speed evaluation if it is stringent.
   --info "variant.call_rate > 0.9" \ 
   --trio "denovo:kid.het && mom.hom_ref && dad.hom_ref \
                   && kid.AB > 0.25 && kid.AB < 0.75 \
                   && (mom.AD[1] + dad.AD[1]) == 0 \
                   && kid.GQ >= 20 && mom.GQ >= 20 && dad.GQ >= 20 \
                   && kid.DP >= 12 && mom.DP >= 12 && dad.DP >= 12" \
   --trio "informative:kid.GQ > 20 && dad.GQ > 20 && mom.GQ > 20 && kid.alts == 1 && \
           ((mom.alts == 1 && dad.alts == 0) || (mom.alts == 0 && dad.alts == 1))" \
   --trio "recessive:trio_autosomal_recessive(kid, mom, dad)"

Note that slivar does not give direct access to the genotypes, instead exposing hom_ref, het, hom_alt and unknown or via alts where 0 is homozygous reference, 1 is heterozygous, 2 is homozygous alternate and -1 when the genotype is unknown. It is recommended to decompose a VCF before sending to slivar

Here it is assumed that trio_autosomal_recessive is defined in slivar-functions.js; an example implementation of that and other useful functions is provided here. Note that it's often better to use --family-expr instead as it's more flexible than trio expressions.

Family Expressions

Trios are a nice abstraction for cohorts consisting of only trios, but for more general uses, there is --family-expr for example, given either a duo, or a quartet, we can find variants present only in affected samples with:

 --family-expr "aff_only:fam.every(function(s) { return s.het == s.affected && s.hom_ref == !s.affected && s.GQ > 5 })"

Note that this does not explicitly check for transmission or non-transmission between parents and off-spring so it is less transparent than the trio mode, but more flexible.

Groups

A trio is a special-case of a group that can be inferred from a pedigree. For more specialized use-cases, a group can be specified. For example we could, instead of using --trio, use a group file like:

#kid	mom	dad
sample1	sample2	sample3
sample4	sample5	sample6
sample7	sample8	sample9

Where, here we have specified 3 trios below a header with their "labels". This can be accomplished using --trio, but we can for example specify quartets like this:

#kid	mom	dad	sibling
sample1	sample2	sample3	sample10
sample4	sample5	sample6	sample11
sample7	sample8	sample9	sample12

where sample10 will be available as "sibling" in the first family and an expression like:

kid.alts == 1 && mom.alts == 0 && dad.alts == 0 and sibling.alts == 0

could be specified and it would automatically be applied to each of the 3 families.

Another example could be looking at somatic variants with 3 samples, each with a normal and 4 time-points of a tumor:

#normal	tumor1	tumor2	tumor3	tumor4
ss1	ss8	ss9	ss10	ss11
ss2	ss12	ss13	ss14	ss15	
ss3	ss16	ss17	ss18	ss19	

where, again each row is a sample and the ID's (starting with "ss") will be injected for each sample to allow a single expression like:

normal.hom_ref && normal.DP > 10 \
  && tumor1.AB > 0 \
  && tumor1.AB < tumor2.AB \
  && tumor2.AB < tumor3.AB \
  && tumor3.AB < tumor4.AB

to find a somatic variant that has increasing frequency (AB is allele balance) along the tumor time-points. More detail on groups is provided here

Sample Expressions

Users can specify a boolean expression that is tested against each sample using e.g.:

--sample-expr "hi_quality:sample.DP && sample.GQ > 10"

Each sample that passes this expression will be have its sample id appended to the INFO field of hi_quality which is added to the output VCF.

make-gnotate

Users can make their own gnotate files like:

slivar make-gnotate --prefix gnomad \
    --field AF_popmax:gnomad_popmax_af \
    --field nhomalt:gnomad_num_homalt \
    gnomad.exomes.r2.1.sites.vcf.gz gnomad.genomes.r2.1.sites.vcf.gz

this will pull AF_popmax and nhomalt from the INFO field and put them into gnomad.zip as gnomad_popmax_af and gnomad_num_homalt respectively. The resulting zip file will contain the union of values seen in the exome and genomes files with the maximum value for any intersection. Note that the names (gnomad_popmax_af and gnomad_num_homalt in this case) should be chosen carefully as those will be the names added to the INFO of any file to be annotated with the resulting gnomad.zip

More information on make-gnotate is in the wiki

compound-het

This command is used to find compound heterozygous variants (with phasing-by-inheritance) in trios. It is used after filtering to rare(-ish) heterozygotes.

See a full description of use here

NOTE that by default, this command limits to a subset of impacts; this is adjustable with the --skip flag. See more on the wiki

tsv

This command is used to convert a filtered and annotated VCF to a TSV (tab-separated value file) for final examination. An example use is:

slivar tsv -p $ped \
    -s denovo -s x_recessive \
    -c CSQ \
    -i gnomad_popmax_af -i gnomad_nhomalt \
    -g gene_desc.txt -g clinvar_gene_desc.txt \
    $vcf > final.tsv

where denovo and x_recessive indicate the INFO fields that contain lists of samples (as added by slivar) that should be extracted. and gnomad_popmax_af and gnomad_nhomalt are pulled from the INFO field. The -c arugment (CSQ) tells slivar that it can get gene, transcript and impact information from the CSQ field in the INFO. And the -g arguments are tab-delimited files of gene -> description where the description is added to the text output for quick inspection. Run slivar tsv without any arguments for examples on how to create these for pLI and clinvar.

Also see the wiki

duo-del

slivar duo-del finds structural deletions in parent-child duos using non-transmission of alleles. this can work to find deletions in exome data using genotypes, thereby avoiding the problems associated with depth-based CNV calling in exomes.

see: https://github.com/brentp/slivar/wiki/finding-deletions-in-parent-child-duos

Data Driven Cutoffs

slivar ddc is a tool to discover data-driven cutoffs from a VCF and pedigree information. It generates an interative VCF so a user can see how mendelian violation and transmissions are effected by varying cutoffs for values in the INFO and FORMAT fields.

See the wiki for more details.

Attributes

  • anything in the INFO is available as e.g. INFO.DP

  • INFO.impactful which, if CSQ (VEP), BCSQ (bcftools), or ANN (snpEff) is present indicates if the highest impact is "impactful". see wiki and INFO.genic which includes other gene impacts like synonymous. Also INFO.highest_impact_order explained in the wiki

  • variant consequences such as in INFO.CSQ can be parsed and used as object as described here

  • if FORMAT.AB is not present, it is added so one can filter with kid.AB > 0.25 && kid.AB < 0.75

  • variant attributes are: CHROM, POS, start, end, ID, REF, ALT, QUAL, FILTER, is_multiallelic

  • calculated variant attributes include: aaf, hwe_score, call_rate, num_hom_ref, num_het, num_hom_alt, num_unknown

  • numeric and flag sample attributes (via kid, mom, dad) included in the FORMAT. available as e.g. kid.AD[1], mom.DP, etc.

  • if the environment variable SLIVAR_FORMAT_STRINGS is not empty, then string sample fields will be available. these are not populated by default as they are used less often and impact performance.

  • sample attributes for hom_ref, het, hom_alt, unknown which are synonums for sample.alts of 0, 1, 2, -1 respectively.

  • sample attributes from the ped for affected, phenotype, sex, id are available as, e.g. kid.sex. phenotype is a string taken directly from the pedigree file while affected is a boolean.

  • sample relations are available as mom, dad, kids. mom and dad will be undefined if not available and kids will be an empty array.

  • a VCF object contains CSQ, BCSQ, ANN if those are present in the header (from VEP, BCFTOOLS, SnpEFF). The content is a list indicating the order of entries in the field e.g. ["CONSEQUENCE", "CODONS","AMINO_ACIDS", "GENE", ...]

How it works

slivar embeds the duktape javascript engine to allow the user to specify expressions. For each variant, each trio (and each sample), it fills the appropriate attributes. This can be intensive for VCFs with many samples, but this is done as efficiently as possible such that slivar can evaluate 10's of thousand of variants per second even with dozens of trios.

Summary Table

slivar outputs a summary table with rows of samples and columns of expression where each value indicates the number of variants that passed the expression in each sample. By default, this goes to STDOUT but if the environment variable SLIVAR_SUMMARY_FILE is set, slivar will write the summary to that file instead.

Gnotation Files

Users can create their own gnotation files with slivar make-gnotate, but we provide:

  • gnomad for hg37 with AF popmax, numhomalts (total and controls only) here

  • gnomad for hg38 (v3) genomes here

  • lifted gnomad exomes+genomes for hg38 with AF popmax, numhomalts (updated in release v0.1.2) here

The available fields can be seen with, for example:

$ unzip -l gnomad.hg38.v2.zip | grep -oP "gnotate-[^.]+" | sort -u
gnotate-gnomad_nhomalt
gnotate-gnomad_nhomalt_controls
gnotate-gnomad_popmax_af
gnotate-gnomad_popmax_af_controls
gnotate-variant

indicating that INFO.gnomad_nhomalt, INFO.gnomad_nhomalt_controls, INFO.gnomad_popmax_af and INFO.gnomad_popmax_af_controls will be the fields after they are added to the INFO.

More Repositories

1

cyvcf2

cython + htslib == fast VCF and BCF processing
Cython
366
star
2

vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
Go
332
star
3

goleft

goleft is a collection of bioinformatics tools distributed under MIT license in a single static binary
Go
202
star
4

smoove

structural variant calling and genotyping with existing tools, but, smoothly.
Go
187
star
5

bio-playground

miscellaneous scripts for bioinformatics/genomics that dont merit their own repo.
C
181
star
6

somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
Nim
180
star
7

hts-nim

nim wrapper for htslib for parsing genomics data files
Nim
149
star
8

cruzdb

python access to UCSC genomes database
Python
129
star
9

peddy

genotype :: ped correspondence check, ancestry check, sex check. directly, quickly on VCF
Jupyter Notebook
122
star
10

bwa-meth

fast and accurate alignment of BS-Seq reads using bwa-mem and a 3-letter genome
Python
116
star
11

jigv

igv.js standalone page generator and automatic configuration to view bam/cram/vcf/bed. "working in under 1 minute"
Nim
107
star
12

echtvar

using all the bits for echt rapid variant annotation and filtering
Rust
105
star
13

combat.py

python / numpy / pandas / patsy version of ComBat for removing batch effects.
Python
98
star
14

intintmap

fast int64-int64 map for go
Go
93
star
15

duphold

don't get DUP'ed or DEL'ed by your putative SVs.
Nim
90
star
16

slurmpy

submit jobs to slurm with quick-and-dirty python
Python
88
star
17

pyfasta

fast, memory-efficient, pythonic (and command-line) access to fasta sequence files
Python
86
star
18

vcfgo

a golang library to read, write and manipulate files in the variant call format.
Go
65
star
19

fishers_exact_test

Fishers Exact Test for Python (Cython)
Python
63
star
20

xopen

open files for buffered reading and writing in #golang
Go
60
star
21

interlap

fast, pure-python interval overlap testing
Python
50
star
22

seqcover

seqcover allows users to view coverage for hundreds of genes and dozens of samples
Nim
49
star
23

rare-disease-wf

(WIP) best-practices workflow for rare disease
Nextflow
49
star
24

hts-python

pythonic wrapper for libhts (moved to: https://github.com/quinlan-lab/hts-python)
Python
49
star
25

go-chartjs

golang library to make https://chartjs.org/ plots (this is vanilla #golang, not gopherjs)
Go
47
star
26

hts-nim-tools

useful command-line tools written to showcase hts-nim
Nim
46
star
27

irelate

Streaming relation (overlap, distance, KNN) of (any number of) sorted genomic interval sets. #golang
Go
45
star
28

bsub

python wrapper to submit jobs to bsub (and later qsub)
Python
43
star
29

combined-pvalues

combining p-values using modified stouffer-liptak for spatially correlated results (probes)
Python
40
star
30

align

sequence alignment. global, local, glocal.
Python
40
star
31

bigly

a pileup library that embraces the huge
Go
39
star
32

indelope

find large indels (in the blind spot between GATK/freebayes and SV callers)
Nim
39
star
33

tiwih

simple bioinformatics command-line (t)ools (i) (w)ished (i) (h)ad.
Nim
39
star
34

cigar

simple library for dealing with SAM cigar strings
Python
37
star
35

gsort

sort genomic data
Go
35
star
36

450k-analysis-guide

A Practical (And Opinionated) Guide To Analyzing 450K Data
TeX
34
star
37

geneimpacts

prioritize effects of variant annotations from VEP, SnpEff, et al.
Python
31
star
38

quicksect

a cythonized, extended version of the interval search tree in bx
Python
30
star
39

poverlap

significance testing over interval overlaps
Python
29
star
40

fastbit-python

pythonic access to fastbit
Python
26
star
41

nim-lapper

fast easy interval overlapping for nim-lang
Nim
24
star
42

lua-stringy

fast lua string operations
C
21
star
43

pybloomfaster

fast bloomfilter
C
21
star
44

vcfassoc

perform genotype-phenotype-association tests on a VCF with logistic regression.
Python
20
star
45

toolshed

python stuff I use
Python
19
star
46

bw-python

python wrapper to dpryan79's bigwig library using cffi
C
19
star
47

methylcode

Alignment and Tabulation of BiSulfite Treated Reads
C
16
star
48

tnsv

add true-negative SVs from a population callset to a truth-set.
Nim
15
star
49

hileup

horizontal pileup
C
15
star
50

nim-kmer

DNA kmer operations for nim
Nim
15
star
51

genoiser

use the noise
Nim
15
star
52

bigwig-nim

command-line querying+conversion of bigwigs and a nim wrapper for dpryan's libbigwig
Nim
15
star
53

vcf-bench

evaluating vcf parsing libraries
Zig
14
star
54

agoodle

numpy + GDAL == agoodle
Python
13
star
55

hts-zig

ziglang + htslib
Zig
12
star
56

aclust

streaming, flexible agglomerative clustering
Python
12
star
57

gobio

miscellaneous script-like stuff in go for bioinformatics
Go
12
star
58

excord

extract SV signal from a BAM
Go
11
star
59

fastahack-python

cython wrapper to fastahack
C++
11
star
60

spacepile

convert reads from repeated measures of same piece of DNA into spaced matricies for deep learners.
Rust
10
star
61

bwa-mips

Map sequence from Molecular Inversion Probes with BWA, strip arms, de-dup, ..., profit
Python
10
star
62

gobe

a fast, interactive, light-weight, customizable, web-based comparative genomics viewer with simple text input format.
Haxe
10
star
63

bix

tabix file access with golang using biogo machinery
Go
9
star
64

sveval

run multiple sv evalution tools
Python
8
star
65

clinical-components

Summarize the clinical (or lab) components and correlations of your dataset.
Python
8
star
66

bowfast

run bowtie then bfast on colorspace reads.
Shell
7
star
67

bcf

bcf parsing in golang
Go
7
star
68

inheritance

inheritance models for mendelian diseases
Python
7
star
69

skidmarks

find runs (non-randomness) in sequences
Python
7
star
70

falas

Fragment-Aware Local Assembly for Short-reads
Nim
7
star
71

bamject

DO NOT USE inject variants (snps/indels) from a vcf into a bam efficiently.
Nim
7
star
72

bpbio

basepair bio: a single binary with many useful genomics subtools.
Nim
6
star
73

kexpr-nim

nim wrapper for Heng Li's kexpr math expression evaluator library
C
5
star
74

clustermodel

fitting models to clustered correlated data
Python
5
star
75

go-blosc

go wrapper for blosc (blocked number compression with fast random access)
Go
5
star
76

celltypes450

adjust for cell-type composition in 450K data using houseman's and other methods.
R
5
star
77

crystal

find clusters and model correlated data from DNA methylation and other genomic sources.
Python
5
star
78

variantkey-nim

nim-wrapper for variantkey -- (chrom, pos, ref, alt) -> uint64
C
4
star
79

pedfile

pedigree file parsing and relatedness calculations for nim
Nim
4
star
80

shuffler

shuffle genome regions to determine probability of user-defined metric
Python
4
star
81

go-giggle

golang wrapper to giggle
Go
4
star
82

bedder-rs

an API for intersections of genomic data
Rust
4
star
83

nim-gzfile

simple reader and writer for gzipped (and regular) files
Nim
4
star
84

find_cns

find conserved non-coding sequences (CNS)
Python
4
star
85

nim-cgranges

nim wrapper for https://github.com/lh3/cgranges for faster interval tree
C
3
star
86

d4-nim

nim-lang wrapper for https://github.com/38/d4-format
Nim
3
star
87

flatfeature

python module for dealing with BED format for genomic data as a numpy array.
Python
3
star
88

faidx

faidx for golang
Go
3
star
89

tabix-py

interface to tabix using cffi
C
3
star
90

4bit

4bit fasta format.
C
3
star
91

ksw2-nim

nim wrapper for lhs/ksw2 for fast smith-waterman
C
3
star
92

htsuse

some C stuff that uses htslib
C
2
star
93

nim-minizip

nothing to see here.
C
2
star
94

pyfastx

unified access to fasta, fastx using kseq.h + ??
C
2
star
95

learnflash

all the stuff i want to remember how to do in haxe / flash
2
star
96

ififo

ififo provides a fast, sized, generic thread-safe FIFO in golang.
Go
2
star
97

dotfiles

my .bash, .vim, .* files
Shell
2
star
98

cysolar

copy of pysolar using cython
Python
2
star
99

cgotbx

yeah, another tabix wrapper for go.
Go
1
star
100

totable

simple python / cython wrapper to tokyo cabinet tables
C
1
star