  • Language: Go
  • License: MIT License
  • Created over 9 years ago
  • Updated about 2 years ago

annotate a VCF with other VCFs/BEDs/tabixed files

vcfanno

If you use vcfanno, please cite the paper

Overview

vcfanno allows you to quickly annotate your VCF with any number of INFO fields from any number of VCFs or BED files. It uses a simple conf file to allow the user to specify the source annotation files and fields and how they will be added to the info of the query VCF.

  • For VCF, values are pulled by name from the INFO field with special-cases of ID and FILTER to pull from those VCF columns.
  • For BED, values are pulled from (1-based) column number.
  • For BAM, depth (count), "mapq" and "seq" are currently supported.

vcfanno is written in Go and supports custom user scripts written in Lua. It can annotate more than 8,000 variants per second with 34 annotations from 9 files on a modest laptop, and over 30K variants per second using 12 processes on a server.

We are actively developing vcfanno and appreciate feedback and bug reports.

Usage

After downloading the binary for your system (see the Installation section below), usage looks like:

  ./vcfanno -lua example/custom.lua example/conf.toml example/query.vcf.gz

Where conf.toml looks like:

[[annotation]]
file="ExAC.vcf"
# ID and FILTER are special fields that pull the ID and FILTER columns from the VCF
fields = ["AC_AFR", "AC_AMR", "AC_EAS", "ID", "FILTER"]
ops=["self", "self", "min", "self", "self"]
names=["exac_ac_afr", "exac_ac_amr", "exac_ac_eas", "exac_id", "exac_filter"]

[[annotation]]
file="fitcons.bed"
columns = [4, 4]
names=["fitcons_mean", "lua_sum"]
# note the 2nd op here is lua that has access to `vals`
ops=["mean", "lua:function sum(t) local sum = 0; for i=1,#t do sum = sum + t[i] end return sum / #t end"]

[[annotation]]
file="example/ex.bam"
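# one name per entry in fields/ops (the last two names are illustrative)
names=["ex_bam_depth", "ex_bam_mapq", "ex_bam_seq"]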
fields=["depth", "mapq", "seq"]
ops=["count", "mean", "concat"]

So from ExAC.vcf we will pull the listed fields from the INFO field and apply the corresponding operation from the ops array. Users can add as many [[annotation]] blocks to the conf file as desired. Files can be local, as above, or available via http/https.

See the additional usage section at the bottom for more.

Example

The example directory contains the data and conf for a full example. To run, download the appropriate binary for your system.

Then, you can annotate with:

./vcfanno -p 4 -lua example/custom.lua example/conf.toml example/query.vcf.gz > annotated.vcf

An example INFO field row before annotation (pos 98683):

AB=0.282443;ABP=56.8661;AC=11;AF=0.34375;AN=32;AO=45;CIGAR=1X;TYPE=snp

and after:

AB=0.2824;ABP=56.8661;AC=11;AF=0.3438;AN=32;AO=45;CIGAR=1X;TYPE=snp;AC_AFR=0;AC_AMR=0;AC_EAS=0;fitcons_mean=0.061;lua_sum=0.061

Typecasting values

By default, using ops of mean, max, sum, div2 or min will result in type=Float; using self will get the type from the annotation VCF; other fields will have type=String. To change the type of a field, add the suffix _int or _float to its name. This suffix will be parsed and removed, and your field will be of the desired type.
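The suffix rule above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not vcfanno's own code; the function name split_type_suffix and the mapping of _int to the VCF type Integer are assumptions made for the example.

```python
def split_type_suffix(name):
    """Return (clean_name, vcf_type) for a name taken from the conf file.

    vcf_type is None when the name carries no type suffix, in which case
    the type would follow from the op as described above.
    """
    # Assumed mapping from suffix to VCF header type.
    for suffix, vcf_type in (("_int", "Integer"), ("_float", "Float")):
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)], vcf_type
    return name, None


print(split_type_suffix("exac_ac_afr_int"))
print(split_type_suffix("fitcons_mean"))
```

With this sketch, a name such as exac_ac_afr_int would be written to the output as exac_ac_afr with an integer type.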

Operations

In most cases, we will have a single annotation entry for each entry (variant) in the query VCF, in which case the self op is the best choice. However, it is possible that there will be multiple annotations from a single annotation file--in this case, the op determines how the many values are reduced. Valid operations are:

  • lua:$lua // see section below for more details
  • self // pull directly from the annotation and handle multi-allelics
  • concat // comma delimited list of output
  • count // count the number of overlaps
  • div2 // given two values a and b, return a / b
  • first // take only the first value
  • flag // presence/absence via VCF flag
  • max // numbers only
  • mean // numbers only
  • min // numbers only
  • sum // numbers only
  • uniq // comma-delimited list of unique values
  • by_alt // comma-delimited by alt (Number=A), pipe-delimited (|) for multiple annos for the same alt.

There are some operations that are only for postannotation:

  • delete // remove fields from the query VCF's INFO
  • setid // set the ID field of the query VCF with values from its INFO

In nearly all cases, if you are annotating with a VCF, use self.

Note that when the file is a BAM, the operation is determined by the field name ('seq', 'mapq', 'DP2', 'coverage' are supported).
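To make the idea of "reducing" several overlapping annotation values concrete, here is a small Python sketch of some of the ops listed above. It mirrors the descriptions in the list only; it is not vcfanno's implementation, and details such as uniq keeping first-seen order are assumptions. The names REDUCERS and reduce_values are hypothetical.

```python
from statistics import mean


def unique_in_order(values):
    """Drop repeated values, keeping the first occurrence (assumed order)."""
    seen = []
    for v in values:
        if v not in seen:
            seen.append(v)
    return seen


# Each reducer takes the list of values collected from all overlapping
# annotation records and returns the single value to store in INFO.
REDUCERS = {
    "first": lambda vals: vals[0],
    "count": lambda vals: len(vals),
    "min": min,
    "max": max,
    "sum": sum,
    "mean": mean,
    "concat": lambda vals: ",".join(str(v) for v in vals),
    "uniq": lambda vals: ",".join(str(v) for v in unique_in_order(vals)),
    "div2": lambda vals: vals[0] / vals[1],  # given a and b, return a / b
}


def reduce_values(op, values):
    """Apply the named op to the collected values."""
    return REDUCERS[op](values)


print(reduce_values("concat", [1, 2, 3]))
print(reduce_values("uniq", ["a", "b", "a"]))
```

When a query variant overlaps exactly one annotation record, most of these reducers return that one value unchanged, which is why self is usually the right choice for VCF annotations.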

PostAnnotation

One of the most powerful features of vcfanno is the embedded scripting language, Lua, combined with postannotation. [[postannotation]] blocks are applied after all the annotations have been added. They are similar to [[annotation]] blocks, but their fields entry requests a number of fields from the query file (including the new fields added during annotation). For example, if we have AC and AN fields indicating the alternate count and the number of chromosomes, respectively, we could create a new allele frequency field, AF, with this block:

[[postannotation]]
fields=["AC", "AN"]
op="lua:AC / AN"
name="AF"
type="Float"

where type is one of the types accepted in VCF format, name is the name of the field that is created, fields indicates the fields (from the INFO) that will be available to the op, and op indicates the action to perform. This can be quite powerful. For an extensive example that demonstrates the utility of this type of approach, see docs/examples/clinvar_exac.md.
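The effect of the AF block can be pictured with a short Python sketch that treats the INFO column as a dictionary. The function apply_postannotation is hypothetical, and skipping the op when a requested field is missing is an assumption of this sketch rather than documented vcfanno behavior.

```python
def apply_postannotation(info, fields, op, name):
    """Return a copy of info with info[name] = op(requested fields).

    If any requested field is absent, the INFO is returned unchanged
    (an assumption made for this sketch).
    """
    if any(f not in info for f in fields):
        return info
    out = dict(info)
    out[name] = op(*(info[f] for f in fields))
    return out


# AC and AN taken from the example INFO row shown earlier.
info = {"AC": 11, "AN": 32}
annotated = apply_postannotation(info, ["AC", "AN"], lambda ac, an: ac / an, "AF")
print(annotated["AF"])  # 0.34375
```

The lambda plays the role of the "lua:AC / AN" expression: it only sees the fields named in fields, and its return value becomes the new INFO entry.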

A user can set the ID field of the VCF in a [[postannotation]] block by using name=ID. For example:

[[postannotation]]
name="ID"
fields=["other_field", "ID"]
op="lua:other_field .. ';' .. ID"
type="String"

will take the value in other_field, concatenate it with the existing ID, and set the ID to that value.

See the setid function in example/custom.lua for a more robust method of doing this.

Additional Usage

-ends

For annotating large variants, such as CNVs or structural variants (SVs), it can be useful to annotate the ends of the variant in addition to the region itself. To do this, specify the -ends flag to vcfanno. e.g.:

vcfanno -ends example/conf.toml example/query.vcf.gz

In this case, the names field in the conf file contains "fitcons_mean". The output will contain fitcons_mean as before along with left_fitcons_mean and right_fitcons_mean for any variants that are longer than 1 base. The left end will be for the single-base at the lowest base of the variant and the right end will be for the single base at the higher numbered base of the variant.
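As a rough picture of what -ends adds, the following Python sketch derives the extra single-base intervals and the prefixed output names for one variant. It assumes 0-based, half-open coordinates and the helper names end_intervals and annotation_names, all of which are choices made for this example and not taken from vcfanno.

```python
def end_intervals(start, end):
    """Map name prefixes to intervals for a variant spanning [start, end).

    Coordinates are 0-based and half-open (an assumption of this sketch).
    Variants longer than 1 base also get single-base left/right intervals.
    """
    intervals = {"": (start, end)}
    if end - start > 1:
        intervals["left_"] = (start, start + 1)   # lowest base
        intervals["right_"] = (end - 1, end)      # highest base
    return intervals


def annotation_names(name, start, end):
    """List the INFO names that would be written for this variant."""
    return [prefix + name for prefix in end_intervals(start, end)]


print(annotation_names("fitcons_mean", 100, 5100))
print(annotation_names("fitcons_mean", 100, 101))
```

Each interval would then be annotated independently with the same conf entry, so a long deletion gets one value for the whole region plus one for each breakpoint.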

-permissive-overlap

By default, when annotating with a variant, in addition to the overlap requirement, the variants must share the same position, the same reference allele and at least one alternate allele (this is only used for variants, not for BED/BAM annotations). If this flag is specified, only overlap testing is used and shared REF/ALT are not required.
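The difference between the default matching and -permissive-overlap can be expressed as a small predicate. This Python sketch follows the wording above; the Variant class, the 1-based closed coordinates, and the function names are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    pos: int     # 1-based position of the first REF base
    ref: str
    alts: tuple  # alternate alleles, e.g. ("G", "C")

    @property
    def end(self):
        """Last base covered by the REF allele (1-based, inclusive)."""
        return self.pos + len(self.ref) - 1


def overlaps(a, b):
    """True if the REF spans of the two variants share at least one base."""
    return a.pos <= b.end and b.pos <= a.end


def is_match(query, anno, permissive=False):
    """Default: overlap + same position + same REF + at least one shared ALT.
    Permissive: overlap alone is enough."""
    if not overlaps(query, anno):
        return False
    if permissive:
        return True
    return (
        query.pos == anno.pos
        and query.ref == anno.ref
        and bool(set(query.alts) & set(anno.alts))
    )


query = Variant(100, "A", ("G",))
anno = Variant(100, "A", ("T",))
print(is_match(query, anno), is_match(query, anno, permissive=True))
```

In the usage lines, the two variants share position and REF but no ALT, so they match only in permissive mode.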

-p

Set to the number of processes that vcfanno can use during annotation. vcfanno parallelizes well up to 15 or so cores.

-lua

Custom ops written in Lua, for use when the built-in ops don't supply the needed reduction.

We embed the Lua engine go-lua so that it's possible to create a custom op if the one you need is not provided. For example, if the user wants to sum the values, the op could be:

"lua:function sum(t) local sum = 0; for i=1,#t do sum = sum + t[i] end return sum end"

where the last value (in this case sum) is returned as the annotation value. It is encouraged to instead define lua functions in a separate .lua file and point to it when calling vcfanno using the -lua flag. So, in an external file, "some.lua", instead put:

function sum(t)
    local sum = 0
    for i=1,#t do
        sum = sum + t[i]
    end
    return sum
end

And then the above custom op would be: "lua:sum(vals)". (Note that there's a sum op provided by vcfanno which will be faster.)

The variables vals, chrom, start, stop, ref, alt from the current variant will all be available in the Lua code. alt will be a table with length equal to the number of alternate alleles. Example usage could be:

op="lua:ref .. '/' .. alt[1]"

See example/conf.toml and example/custom.lua for more examples.

Installation

Please download a static binary (executable) from here and copy it into a directory on your $PATH. There are no dependencies.

If you use bioconda, you can install with: conda install -c bioconda vcfanno

Multi-Allelics

A multi-allelic variant is simply a site where there are multiple, non-reference alleles seen in the population. These will appear as e.g. REF="A", ALT="G,C". As of version 0.2, vcfanno will handle these fully with op="self" when the Number from the VCF header is A (Number=A).

For example, this table lists the ALT columns of the query and the annotation (assuming the REFs and positions match) along with the values from the annotation INFO, and shows how the query INFO will be filled:

query ALTS   anno ALTS   anno vals from INFO   result
C,G          C,G         22,23                 22,23
C,G          C,T         22,23                 22,.
C,G          T,G         22,23                 .,23
G,C          C,G         22,23                 23,22
C,G          C           YYY                   YYY,.
G,C,T        C           YYY                   .,YYY,.
C,T          G           YYY                   .,.
T,C          C,T         AA,BB                 BB,AA

Note the flipped values in the result column, and that values that are not present in the annotation are filled with '.' as a place-holder.
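The rows of the table can be reproduced with a short Python sketch that looks up each query ALT among the annotation ALTs. This is an illustration of the rule the table implies, not vcfanno's code; the function name fill_number_a is hypothetical, and the inputs are the comma-separated strings as they appear in the table.

```python
def fill_number_a(query_alts, anno_alts, anno_vals, missing="."):
    """Order Number=A annotation values by the query's ALT alleles.

    All three arguments are comma-separated strings. A query ALT that is
    absent from the annotation gets the placeholder '.'.
    """
    value_for_alt = dict(zip(anno_alts.split(","), anno_vals.split(",")))
    return ",".join(value_for_alt.get(alt, missing) for alt in query_alts.split(","))


print(fill_number_a("G,C", "C,G", "22,23"))  # 23,22
print(fill_number_a("G,C,T", "C", "YYY"))    # .,YYY,.
```

Because the lookup is keyed by allele rather than by position in the list, values are flipped when the query and annotation list the same alleles in a different order.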
