  • Language
    Perl
  • License
    Apache License 2.0
  • Created over 8 years ago
  • Updated about 1 year ago

Repository Details

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants

ensembl-vep

  • VEP (Variant Effect Predictor) predicts the functional effects of genomic variants.
  • Haplosaurus uses phased genotype data to predict whole-transcript haplotype sequences.
  • Variant Recoder translates between different variant encodings.

Installation and requirements

The VEP package requires:

  • gcc, g++ and make
  • Perl (>=5.10 recommended, tested on 5.10, 5.14, 5.18, 5.22, 5.26)
  • Perl libraries Archive::Zip and DBI

The remaining dependencies can be installed using the included INSTALL.pl script. Basic instructions:

git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
perl INSTALL.pl

The installer may also be used to check for updates to this and co-dependent packages; simply re-run INSTALL.pl.

See documentation for full installation instructions.

Additional CPAN modules

The following modules are optional but most users will benefit from installing them. We recommend using cpanminus to install.

  • DBD::mysql - required for database access (--database or --cache without --offline)
  • Set::IntervalTree - required for Haplosaurus, also speeds up VEP
  • JSON - required for writing JSON output
  • PerlIO::gzip - faster compressed file parsing
  • Bio::DB::BigFile - required for reading custom annotation data from BigWig files

Docker

A Docker image for VEP is available from DockerHub.

See documentation for the Docker installation instructions.


VEP

Usage

./vep -i input.vcf -o out.txt --offline

See documentation for full command line instructions.

Please report any bugs or issues by contacting Ensembl or creating a GitHub issue.


Haplosaurus

haplo is a local tool implementation of the same functionality that powers the Ensembl transcript haplotypes view. It takes phased genotypes from a VCF and constructs a pair of haplotype sequences for each overlapped transcript; these sequences are also translated into predicted protein haplotype sequences. Each variant haplotype sequence is aligned and compared to the reference, and an HGVS-like name is constructed representing its differences to the reference.

This approach offers an advantage over VEP's analysis, which treats each input variant independently. By considering the combined change contributed by all the variant alleles across a transcript, the compound effects the variants may have are correctly accounted for.

haplo shares much of the same command line functionality with vep, and can use VEP caches, Ensembl databases, GFF and GTF files as sources of transcript data; all vep command line flags relating to this functionality work the same with haplo.

Usage

Input data must be a VCF containing phased genotype data for at least one individual, and the file must be sorted by chromosome and genomic position. No other formats are currently supported.
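The input requirements can be checked before running haplo. The following is a minimal sketch, not part of the VEP package: the function name check_phased_sorted is hypothetical, and it assumes "phased" means every GT value uses the "|" separator and "sorted" means each chromosome forms one contiguous block with non-decreasing positions.

```python
def check_phased_sorted(lines):
    """Return a list of problems found in VCF text lines (empty if none)."""
    problems = []
    seen_chroms = set()
    last_chrom, last_pos = None, 0
    for n, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        cols = line.split("\t")
        if len(cols) < 10:
            problems.append(f"line {n}: no genotype columns")
            continue
        chrom, pos = cols[0], int(cols[1])
        if chrom != last_chrom:
            # A chromosome seen earlier must not reappear later in the file.
            if chrom in seen_chroms:
                problems.append(f"line {n}: chromosome {chrom} is not contiguous")
            seen_chroms.add(chrom)
            last_chrom, last_pos = chrom, 0
        if pos < last_pos:
            problems.append(f"line {n}: position {pos} is out of order")
        last_pos = pos
        fmt = cols[8].split(":")
        if "GT" not in fmt:
            problems.append(f"line {n}: no GT field")
            continue
        gt_index = fmt.index("GT")
        for sample in cols[9:]:
            gt = sample.split(":")[gt_index]
            if "/" in gt:  # "/" marks an unphased genotype
                problems.append(f"line {n}: unphased genotype {gt}")
    return problems


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as handle:
        for problem in check_phased_sorted(handle):
            print(problem)
```

The sketch reads uncompressed text only; a gzipped VCF would need to be opened with gzip.open in text mode instead.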

When using a VEP cache as the source of transcript annotation, the first time you run haplo with a particular cache it will spend some time scanning transcript locations in the cache.

./haplo -i input.vcf -o out.txt --cache

Output

The default output format is a simple tab-delimited file reporting all observed non-reference haplotypes. It has the following fields:

  1. Transcript stable ID
  2. CDS haplotype name
  3. Comma-separated list of flags for CDS haplotype
  4. Protein haplotype name
  5. Comma-separated list of flags for protein haplotype
  6. Comma-separated list of frequency data for protein haplotype
  7. Comma-separated list of contributing variants
  8. Comma-separated list of sample:count that exhibit this haplotype
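The eight columns above map naturally onto a record per line. Here is a minimal parsing sketch; the dictionary key names are made up for illustration, and treating an empty string or "." as "no values" is an assumption about how empty columns are written.

```python
# Hypothetical key names, one per column in the order listed above.
HAPLO_FIELDS = [
    "transcript_id",
    "cds_name",
    "cds_flags",
    "protein_name",
    "protein_flags",
    "protein_frequencies",
    "variants",
    "sample_counts",
]
LIST_FIELDS = HAPLO_FIELDS[2:3] + HAPLO_FIELDS[4:]


def split_list(value):
    """Split a comma-separated column; empty or '.' (assumed placeholder) gives []."""
    return [item for item in value.split(",") if item and item != "."]


def parse_haplo_line(line):
    """Turn one line of default haplo output into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(HAPLO_FIELDS):
        raise ValueError(f"expected {len(HAPLO_FIELDS)} columns, got {len(cols)}")
    record = dict(zip(HAPLO_FIELDS, cols))
    for key in LIST_FIELDS:
        record[key] = split_list(record[key])
    # sample:count pairs become a {sample: count} mapping.
    record["sample_counts"] = {
        sample: int(count)
        for sample, count in (item.rsplit(":", 1) for item in record["sample_counts"])
    }
    return record


if __name__ == "__main__":
    # Invented example line; real names and IDs will look different.
    example = "TRANSCRIPT1\tcds_hap\tindel\tprot_hap\tindel\t\tvar1,var2\tS1:1,S2:2"
    print(parse_haplo_line(example))
```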

The altered haplotype sequences can be obtained by switching to JSON output with --json, which includes them by default. Each transcript analysed is summarised as a JSON object written to one line of the output file.

The JSON output structure matches the format of the transcript haplotype REST endpoint.

You may exclude fields in the JSON from being exported with --dont_export field1,field2. This may be used, for example, to exclude the full haplotype sequence and aligned sequences from the output with --dont_export seq,aligned_sequences.

Note that JSON output does not currently include side-loaded frequency data.
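Because each transcript is written as one JSON object per line, the file can be read line by line rather than as a single JSON document. The sketch below assumes the top-level key protein_haplotypes and the per-haplotype key name, taken from the REST endpoint description in this README; the helper names are hypothetical.

```python
import json


def iter_transcripts(lines):
    """Yield one decoded JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def protein_haplotype_names(transcript):
    """Collect the 'name' of each protein haplotype in a transcript object."""
    return [hap.get("name") for hap in transcript.get("protein_haplotypes", [])]


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as handle:
        for transcript in iter_transcripts(handle):
            print(protein_haplotype_names(transcript))
```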

REST service

The transcript haplotype REST endpoint returns arrays of protein_haplotypes and cds_haplotypes for a given transcript. The default haplotype record includes:

  • population_counts: the number of times the haplotype is seen in each population
  • population_frequencies: the frequency of the haplotype in each population
  • contributing_variants: variants contributing to the haplotype
  • diffs: differences between the reference and this haplotype
  • hex: the md5 hex of this haplotype sequence
  • other_hexes: the md5 hexes of related haplotype sequences (the CDSHaplotypes that translate to this ProteinHaplotype, or the ProteinHaplotype that is the translation of this CDSHaplotype)
  • has_indel: does the haplotype contain insertions or deletions
  • type: the type of haplotype, either cds or protein
  • name: a human readable name for the haplotype (sequence id + REF or a change description)
  • flags: flags for the haplotype
  • frequency: haplotype frequency in full sample set
  • count: haplotype count in full sample set

By default, the REST service does not return raw sequences, sample-haplotype assignments, or the aligned sequences used to generate differences.
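The hex field gives each haplotype sequence a compact identifier, so identical sequences can be recognised without comparing the full strings. A minimal sketch of computing an md5 hex digest follows; exactly which string Ensembl hashes (case, terminal stop symbol and so on) is not specified here, so treat the input as an assumption.

```python
import hashlib


def sequence_hex(sequence):
    """Return the md5 hex digest of a sequence string."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


def group_by_hex(sequences):
    """Group identical sequences under their shared md5 hex."""
    groups = {}
    for seq in sequences:
        groups.setdefault(sequence_hex(seq), []).append(seq)
    return groups


if __name__ == "__main__":
    # Invented sequences, for illustration only.
    for digest, members in group_by_hex(["ATGGCC", "ATGGCA", "ATGGCC"]).items():
        print(digest, len(members))
```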

Flags

Haplotypes may be flagged with one or more of the following:

  • indel: haplotype contains an insertion or deletion (indel) relative to the reference.
  • frameshift: haplotype contains at least one indel that disrupts the reading frame of the transcript.
  • resolved_frameshift: haplotype contains two or more indels whose combined effect restores the reading frame of the transcript.
  • stop_changed: indicates either a STOP codon is gained (protein truncating variant, PTV) or the existing reference STOP codon is lost.
  • deleterious_sift_or_polyphen: haplotype contains at least one single amino acid substitution event flagged as deleterious (SIFT) or probably damaging (PolyPhen2).
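The flags make it easy to narrow a set of haplotypes to those of interest. This sketch assumes each haplotype is a dictionary with a flags list, as in the JSON record described above; the function name with_any_flag is hypothetical.

```python
KNOWN_FLAGS = {
    "indel",
    "frameshift",
    "resolved_frameshift",
    "stop_changed",
    "deleterious_sift_or_polyphen",
}


def with_any_flag(haplotypes, wanted):
    """Return the haplotypes carrying at least one of the wanted flags."""
    wanted = set(wanted)
    unknown = wanted - KNOWN_FLAGS
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    return [hap for hap in haplotypes if wanted & set(hap.get("flags", []))]


if __name__ == "__main__":
    # Invented records, for illustration only.
    haps = [
        {"name": "hapA", "flags": ["indel", "frameshift"]},
        {"name": "hapB", "flags": []},
        {"name": "hapC", "flags": ["stop_changed"]},
    ]
    for hap in with_any_flag(haps, ["frameshift", "stop_changed"]):
        print(hap["name"])
```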

bioperl-ext

haplo can make use of a fast compiled alignment algorithm from the bioperl-ext package; this can speed up analysis, particularly in longer transcripts where insertions and/or deletions are introduced. The bioperl-ext package is no longer maintained and requires some tweaking to install. The following instructions install the package in $HOME/perl5; edit PREFIX=[path] to change this. You may also need to edit the export command to point to the path created for the architecture on your machine.

git clone https://github.com/bioperl/bioperl-ext.git
cd bioperl-ext/Bio/Ext/Align/
perl -pi -e"s|(cd libs.+)CFLAGS=\\\'|\$1CFLAGS=\\\'-fPIC |" Makefile.PL
perl Makefile.PL PREFIX=~/perl5
make
make install
cd -
export PERL5LIB=${PERL5LIB}:${HOME}/perl5/lib/x86_64-linux-gnu/perl/5.22.1/

If successful the following should print OK:

perl -MBio::Tools::dpAlign -e"print qq{OK\n}"

Variant Recoder

variant_recoder is a tool for translating between different variant encodings. It accepts as input any format supported by VEP (VCF, variant ID, HGVS), with extensions to allow for parsing of potentially ambiguous HGVS notations. For each input variant, variant_recoder reports all possible encodings including variant IDs from all sources imported into the Ensembl database and HGVS (genomic, transcript and protein), reported on Ensembl, RefSeq and LRG sequences.

Usage

variant_recoder depends on database access for identifier lookup and, unlike VEP, cannot be used in offline mode. The output format is JSON, so the JSON Perl module is required.

./variant_recoder --id [input_data_string]
./variant_recoder -i [input_file] --species [species]

Output

Output is a JSON array of objects, one per input variant, with the following keys:

  • input: input string
  • id: variant identifiers
  • hgvsg: HGVS genomic nomenclature
  • hgvsc: HGVS transcript nomenclature
  • hgvsp: HGVS protein nomenclature
  • spdi: Genomic SPDI notation
  • vcf_string: VCF format (optional)
  • var_synonyms: Variation synonyms (optional)
  • mane_select: MANE Select transcripts (optional)
  • warnings: Warnings generated e.g. for invalid HGVS

Use --pretty to pre-format and indent JSON output.

Example output:

./variant_recoder --id "AGT:p.Met259Thr" --pretty
[
   {
     "warnings" : [
         "Possible invalid use of gene or protein identifier 'AGT' as HGVS reference; AGT:p.Met259Thr may resolve to multiple genomic locations"
      ],
     "C" : {
        "input" : "AGT:p.Met259Thr",
        "id" : [
           "rs699",
           "CM920010",
           "COSV64184214"
        ],
        "hgvsg" : [
           "NC_000001.11:g.230710048A>G"
        ],
        "hgvsc" : [
           "ENST00000366667.6:c.776T>C",
           "ENST00000679684.1:c.776T>C",
           "ENST00000679738.1:c.776T>C",
           "ENST00000679802.1:c.776T>C",
           "ENST00000679854.1:n.1287T>C",
           "ENST00000679957.1:c.776T>C",
           "ENST00000680041.1:c.776T>C",
           "ENST00000680783.1:c.776T>C",
           "ENST00000681269.1:c.776T>C",
           "ENST00000681347.1:n.1287T>C",
           "ENST00000681514.1:c.776T>C",
           "ENST00000681772.1:c.776T>C",
           "NM_001382817.3:c.776T>C",
           "NM_001384479.1:c.776T>C"
        ],
        "hgvsp" : [
           "ENSP00000355627.5:p.Met259Thr",
           "ENSP00000505981.1:p.Met259Thr",
           "ENSP00000505063.1:p.Met259Thr",
           "ENSP00000505184.1:p.Met259Thr",
           "ENSP00000506646.1:p.Met259Thr",
           "ENSP00000504866.1:p.Met259Thr",
           "ENSP00000506329.1:p.Met259Thr",
           "ENSP00000505985.1:p.Met259Thr",
           "ENSP00000505963.1:p.Met259Thr",
           "ENSP00000505829.1:p.Met259Thr",
           "NP_001369746.2:p.Met259Thr",
           "NP_001371408.1:p.Met259Thr"
        ],
        "spdi" : [
           "NC_000001.11:230710047:A:G"
        ]
     }
   }
]
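In the example above, each array element holds a warnings list plus one object per allele (here "C"). A minimal sketch for flattening that structure is shown below; it assumes every non-allele key maps to something other than an object, which holds for this example but is not guaranteed by the documentation.

```python
import json


def summarise(recoder_json):
    """Flatten variant_recoder JSON text into one row per input allele."""
    rows = []
    for entry in json.loads(recoder_json):
        warnings = entry.get("warnings", [])
        for allele, data in entry.items():
            if not isinstance(data, dict):
                continue  # skips non-allele keys such as "warnings"
            rows.append(
                {
                    "input": data.get("input"),
                    "allele": allele,
                    "ids": data.get("id", []),
                    "hgvsg": data.get("hgvsg", []),
                    "spdi": data.get("spdi", []),
                    "warnings": warnings,
                }
            )
    return rows


if __name__ == "__main__":
    import sys

    for row in summarise(sys.stdin.read()):
        print(row["input"], row["allele"], ",".join(row["hgvsg"]))
```

The script reads JSON from standard input, so the output of variant_recoder can be piped straight into it.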

Options

variant_recoder shares many of the same command line flags as VEP. Others are unique to variant_recoder.

  • -id|--input_data [input_string]: a single variant as a string.
  • -i|--input_file [input_file]: input file containing one or more variants, one per line. Mixed formats disallowed.
  • --species: species to use (default: homo_sapiens).
  • --grch37: use GRCh37 assembly instead of GRCh38.
  • --genomes: set database parameters for Ensembl Genomes species.
  • --pretty: write pre-formatted indented JSON.
  • --fields [field1,field2]: limit output fields. Comma-separated list, one or more of: id, hgvsg, hgvsc, hgvsp, spdi.
  • --vcf_string: report the variant in VCF format.
  • --var_synonyms: report variation synonyms.
  • --mane_select: report MANE Select transcripts in HGVS format.
  • --host [db_host]: change database host from default ensembldb.ensembl.org (UK); geographic mirrors are useastdb.ensembl.org (US East Coast) and asiadb.ensembl.org (Asia). --user, --port and --pass may also be set.
  • --pick, --per_gene, --pick_allele, --pick_allele_gene, --pick_order: set and customise the transcript selection process; see the VEP documentation.
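When variant_recoder is called from a script, the options above can be assembled into an argument list. The following is only a sketch: the helper recoder_command and its parameters are hypothetical, and it uses just the flags documented in this section.

```python
import shlex

# Field names accepted by --fields, as listed above.
ALLOWED_FIELDS = ("id", "hgvsg", "hgvsc", "hgvsp", "spdi")


def recoder_command(variant, species="homo_sapiens", fields=None,
                    pretty=False, grch37=False):
    """Build an argument list for a single-variant variant_recoder call."""
    cmd = ["./variant_recoder", "--id", variant, "--species", species]
    if fields:
        bad = [f for f in fields if f not in ALLOWED_FIELDS]
        if bad:
            raise ValueError(f"unsupported fields: {bad}")
        cmd += ["--fields", ",".join(fields)]
    if grch37:
        cmd.append("--grch37")
    if pretty:
        cmd.append("--pretty")
    return cmd


if __name__ == "__main__":
    cmd = recoder_command("AGT:p.Met259Thr", fields=["id", "hgvsg"], pretty=True)
    # Print a shell-safe rendering of the command instead of running it.
    print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) from the ensembl-vep directory; passing a list rather than a single string avoids shell quoting problems with HGVS notation.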
