SpliceAI: A deep learning-based tool to identify splice variants

This package annotates genetic variants with their predicted effect on splicing, as described in Jaganathan et al., Cell 2019 in press. The annotations for all possible substitutions, 1-base insertions, and 1-4 base deletions within genes are available here for download. These annotations are free for academic and not-for-profit use; other use requires a commercial license from Illumina, Inc.

License

SpliceAI source code is provided under the GPLv3 license. SpliceAI includes several third-party packages provided under other open source licenses; please see NOTICE for additional details. The trained models used by SpliceAI (located in this package at spliceai/models) are provided under the CC BY NC 4.0 license for academic and non-commercial use; other use requires a commercial license from Illumina, Inc.

Installation

The simplest way to install SpliceAI is through pip or conda:

pip install spliceai
# or
conda install -c bioconda spliceai

Alternatively, SpliceAI can be installed from the GitHub repository:

git clone https://github.com/Illumina/SpliceAI.git
cd SpliceAI
python setup.py install

SpliceAI requires tensorflow>=1.2.0, which is best installed separately via pip or conda (see the TensorFlow website for other installation options):

pip install tensorflow
# or
conda install tensorflow

Usage

SpliceAI can be run from the command line:

spliceai -I input.vcf -O output.vcf -R genome.fa -A grch37
# or you can pipe the input and output VCFs
cat input.vcf | spliceai -R genome.fa -A grch37 > output.vcf

Required parameters:

  • -I: Input VCF with variants of interest.
  • -O: Output VCF with SpliceAI predictions ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL included in the INFO column (see table below for details). Only SNVs and simple INDELs (REF or ALT is a single base) within genes are annotated. Variants in multiple genes have separate predictions for each gene.
  • -R: Reference genome fasta file. Can be downloaded from GRCh37/hg19 or GRCh38/hg38.
  • -A: Gene annotation file. Alternatively, provide grch37 or grch38 to use the GENCODE V24 canonical annotation files included with the package. To create custom gene annotation files, use spliceai/annotations/grch37.txt in the repository as a template.

Optional parameters:

  • -D: Maximum distance between the variant and gained/lost splice site (default: 50).
  • -M: Mask scores representing annotated acceptor/donor gain and unannotated acceptor/donor loss (default: 0).
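
For scripted runs, the command line described above can be assembled in Python. This is a minimal sketch and not part of SpliceAI: the helper name build_spliceai_command is hypothetical, and it uses only the flags documented in the lists above.

```python
def build_spliceai_command(input_vcf, output_vcf, reference, annotation,
                           distance=None, mask=None):
    """Assemble a spliceai argument list (hypothetical helper)."""
    # Required parameters: -I, -O, -R, -A
    cmd = ["spliceai", "-I", input_vcf, "-O", output_vcf,
           "-R", reference, "-A", annotation]
    # Optional parameters are added only when given, so the tool's
    # own defaults (-D 50, -M 0) apply otherwise.
    if distance is not None:
        cmd += ["-D", str(distance)]
    if mask is not None:
        cmd += ["-M", str(mask)]
    return cmd

cmd = build_spliceai_command("input.vcf", "output.vcf", "genome.fa", "grch37")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where spliceai is installed.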

Details of SpliceAI INFO field:

ID       Description
ALLELE   Alternate allele
SYMBOL   Gene symbol
DS_AG    Delta score (acceptor gain)
DS_AL    Delta score (acceptor loss)
DS_DG    Delta score (donor gain)
DS_DL    Delta score (donor loss)
DP_AG    Delta position (acceptor gain)
DP_AL    Delta position (acceptor loss)
DP_DG    Delta position (donor gain)
DP_DL    Delta position (donor loss)
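
The pipe-delimited prediction string can be split into the named fields listed in the table above. The following is a sketch rather than part of the package: the function name parse_spliceai_annotation is hypothetical, and it assumes a single prediction in which every score and position is numeric.

```python
FIELDS = ["ALLELE", "SYMBOL", "DS_AG", "DS_AL", "DS_DG", "DS_DL",
          "DP_AG", "DP_AL", "DP_DG", "DP_DL"]

def parse_spliceai_annotation(annotation):
    """Split one prediction string into a dict keyed by field ID.

    Hypothetical helper; assumes all ten fields are present and that
    scores and positions are numeric.
    """
    values = annotation.split("|")
    if len(values) != len(FIELDS):
        raise ValueError("expected {} fields, got {}".format(len(FIELDS), len(values)))
    record = dict(zip(FIELDS, values))
    for key in FIELDS[2:6]:   # delta scores
        record[key] = float(record[key])
    for key in FIELDS[6:]:    # delta positions
        record[key] = int(record[key])
    return record

record = parse_spliceai_annotation("T|RYR1|0.00|0.00|0.91|0.08|-28|-46|-2|-31")
print(record["SYMBOL"], record["DS_DG"], record["DP_DG"])
```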

Delta score of a variant, defined as the maximum of (DS_AG, DS_AL, DS_DG, DS_DL), ranges from 0 to 1 and can be interpreted as the probability of the variant being splice-altering. In the paper, a detailed characterization is provided for 0.2 (high recall), 0.5 (recommended), and 0.8 (high precision) cutoffs. Delta position conveys information about the location where splicing changes relative to the variant position (positive values are downstream of the variant, negative values are upstream).
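
The definition above can be written out directly. This is an illustrative sketch: the function names are hypothetical, and treating each cutoff as inclusive (score >= cutoff) is an assumption, not something stated in this README.

```python
# Cutoffs named in the text above, with their descriptions.
CUTOFFS = [(0.2, "high recall"), (0.5, "recommended"), (0.8, "high precision")]

def delta_score(ds_ag, ds_al, ds_dg, ds_dl):
    """Delta score of a variant: the maximum of the four delta scores."""
    return max(ds_ag, ds_al, ds_dg, ds_dl)

def cutoffs_met(score):
    """Return the labels of all cutoffs the score reaches.

    Assumption: a cutoff is met when score >= cutoff.
    """
    return [label for cutoff, label in CUTOFFS if score >= cutoff]

score = delta_score(0.00, 0.00, 0.91, 0.08)
print(score, cutoffs_met(score))
```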

Examples

A sample input file and the corresponding output file can be found at examples/input.vcf and examples/output.vcf respectively. The output T|RYR1|0.00|0.00|0.91|0.08|-28|-46|-2|-31 for the variant 19:38958362 C>T can be interpreted as follows:

  • The probability that the position 19:38958360 (=38958362-2) is used as a splice donor increases by 0.91.
  • The probability that the position 19:38958331 (=38958362-31) is used as a splice donor decreases by 0.08.

Similarly, the output CA|TTN|0.07|1.00|0.00|0.00|-7|-1|35|-29 for the variant 2:179415988 C>CA has the following interpretation:

  • The probability that the position 2:179415981 (=179415988-7) is used as a splice acceptor increases by 0.07.
  • The probability that the position 2:179415987 (=179415988-1) is used as a splice acceptor decreases by 1.00.
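
The arithmetic in both examples is the variant position plus the corresponding delta position. A small sketch, using a hypothetical helper name, reproduces the coordinates quoted above:

```python
def affected_positions(variant_pos, dp_ag, dp_al, dp_dg, dp_dl):
    """Convert delta positions to genomic coordinates (hypothetical helper).

    Each delta position is an offset relative to the variant position.
    """
    return {
        "acceptor_gain": variant_pos + dp_ag,
        "acceptor_loss": variant_pos + dp_al,
        "donor_gain": variant_pos + dp_dg,
        "donor_loss": variant_pos + dp_dl,
    }

# 19:38958362 C>T with delta positions -28|-46|-2|-31
ryr1 = affected_positions(38958362, -28, -46, -2, -31)
# 2:179415988 C>CA with delta positions -7|-1|35|-29
ttn = affected_positions(179415988, -7, -1, 35, -29)
print(ryr1["donor_gain"], ryr1["donor_loss"])
print(ttn["acceptor_gain"], ttn["acceptor_loss"])
```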

Frequently asked questions

1. Why are some variants not scored by SpliceAI?

SpliceAI only annotates variants within genes defined by the gene annotation file. Additionally, SpliceAI does not annotate variants that are close to chromosome ends (within 5 kb of either end), deletions longer than twice the input parameter -D, or variants inconsistent with the reference FASTA file.
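
As a rough pre-check, the three conditions in the second sentence can be approximated as follows. This is a sketch under stated assumptions and is not SpliceAI's own implementation: the function name, the exact boundary handling at chromosome ends, and the definition of deletion length as len(ref) - len(alt) are all assumptions, and the gene annotation check is not covered.

```python
def unscored_reason(pos, ref, alt, chrom_length, reference_bases,
                    distance=50, edge=5000):
    """Return a reason the variant may be left unscored, or None.

    Hypothetical helper. pos is 1-based; reference_bases are the bases
    read from the reference FASTA starting at pos, len(ref) long.
    """
    # Assumption: "close to chromosome ends" means within `edge` bases.
    if pos <= edge or pos > chrom_length - edge:
        return "near chromosome end"
    # Assumption: deletion length is the number of removed bases.
    if len(ref) > len(alt) and len(ref) - len(alt) > 2 * distance:
        return "deletion too long"
    if reference_bases.upper() != ref.upper():
        return "reference mismatch"
    return None

# Illustrative values only; the chromosome length here is made up.
print(unscored_reason(500000, "C", "T", 1000000, "C"))
print(unscored_reason(1000, "C", "T", 1000000, "C"))
```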

2. What are the differences between raw (-M 0) and masked (-M 1) precomputed files?

The raw files also include splicing changes corresponding to strengthening annotated splice sites and weakening unannotated splice sites, which are typically much less pathogenic than weakening annotated splice sites and strengthening unannotated splice sites. The delta scores of such splicing changes are set to 0 in the masked files. We recommend using raw files for alternative splicing analysis and masked files for variant interpretation.
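
The masking rule described above can be illustrated with a short sketch. The function name and the annotated flags are hypothetical: the flags stand in for whether the position affected by each score is an annotated splice site of that type, which the real tool derives from the gene annotation file.

```python
def mask_delta_scores(scores, annotated):
    """Zero out gains at annotated sites and losses at unannotated sites.

    Hypothetical helper. `scores` and `annotated` are dicts keyed by
    DS_AG, DS_AL, DS_DG, DS_DL; `annotated[key]` is True when the
    affected position is an annotated splice site.
    """
    masked = dict(scores)
    for key in ("DS_AG", "DS_DG"):
        if annotated[key]:          # strengthening an annotated site
            masked[key] = 0.0
    for key in ("DS_AL", "DS_DL"):
        if not annotated[key]:      # weakening an unannotated site
            masked[key] = 0.0
    return masked

# Made-up scores and flags for illustration.
raw = {"DS_AG": 0.30, "DS_AL": 0.90, "DS_DG": 0.40, "DS_DL": 0.20}
flags = {"DS_AG": True, "DS_AL": True, "DS_DG": False, "DS_DL": False}
print(mask_delta_scores(raw, flags))
```

With these made-up inputs, the acceptor gain and donor loss scores are set to 0 while the acceptor loss and donor gain scores are kept.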

3. Can SpliceAI be used to score custom sequences?

Yes, install SpliceAI and use the following script:

from keras.models import load_model
from pkg_resources import resource_filename
from spliceai.utils import one_hot_encode
import numpy as np

# Replace this with your custom sequence
input_sequence = 'CGATCTGACGTGGGTGTCATCGCATTATCGATATTGCAT'

# Total flanking context in bases; half is padded on each side with 'N'
context = 10000
# Load the five trained models shipped with the package
paths = ('models/spliceai{}.h5'.format(x) for x in range(1, 6))
models = [load_model(resource_filename('spliceai', x)) for x in paths]
# One-hot encode the padded sequence and add a batch dimension
x = one_hot_encode('N'*(context//2) + input_sequence + 'N'*(context//2))[None, :]
# Average the predictions of the five models
y = np.mean([models[m].predict(x) for m in range(5)], axis=0)

# Per-position acceptor and donor probabilities
acceptor_prob = y[0, :, 1]
donor_prob = y[0, :, 2]

Contact

Kishore Jaganathan: [email protected]
