Graph realignment tools for structural variants

Paragraph: a suite of graph-based genotyping tools

Introduction

Accurate genotyping of known variants is critical for the analysis of whole-genome sequencing data. Paragraph aims to facilitate this by providing an accurate genotyper for structural variants (SVs) with short-read data.

Please reference Paragraph using:

Genotyping data in this paper can be found at paper-data/download-instructions.txt

For details of population genotyping, please also refer to:

Installation

Please check doc/Installation.md for system requirements and installation instructions.

Run Paragraph from VCF

Test example

After installation, run the multigrmpy.py script from the build/bin directory on an example dataset as follows:

python3 bin/multigrmpy.py -i share/test-data/round-trip-genotyping/candidates.vcf \
                          -m share/test-data/round-trip-genotyping/samples.txt \
                          -r share/test-data/round-trip-genotyping/dummy.fa \
                          -o test

This runs a simple genotyping example for two test samples.

  • candidates.vcf: Specifies candidate SV events in VCF format.
  • samples.txt: Manifest that specifies some test BAM files. Tab or comma delimited.
  • dummy.fa: A short dummy reference that contains only chr1.

The output folder test then contains the final genotypes as gzipped VCF and JSON files:

$ tree test
test
├── grmpy.log            #  main workflow log file
├── genotypes.vcf.gz     #  Output VCF with individual genotypes
├── genotypes.json.gz    #  More detailed output than genotypes.vcf.gz
├── variants.vcf.gz      #  The input VCF with unique ID from Paragraph
└── variants.json.gz     #  The converted graphs from input VCF (no genotypes)

If successful, the last 3 lines of genotypes.vcf.gz will be the same as in the expected file.
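The check described above can be scripted. The following is a minimal sketch, not part of Paragraph: the helper names last_lines and tails_match are hypothetical, and the location of the expected file is not given here, so it is left as a placeholder in the usage note.

```python
import gzip


def last_lines(path, n=3):
    """Return the last n non-empty lines of a plain or gzipped text file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        lines = [line.rstrip("\n") for line in handle if line.strip()]
    return lines[-n:]


def tails_match(observed, expected, n=3):
    """Return True if the last n lines of both files are identical."""
    return last_lines(observed, n) == last_lines(expected, n)
```

For example, tails_match("test/genotypes.vcf.gz", "<path to expected file>") would report whether the test run reproduced the expected genotypes.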

Input requirements

VCF format

paraGRAPH will independently genotype each entry of the input VCF. You can use either indel-style representation (full REF and ALT allele sequence in 4th and 5th columns) or symbolic alleles, as long as they meet the format requirement of VCF 4.0+.

Currently we support 4 symbolic alleles:

  • <DEL> for deletion
    • Must have END key in INFO field.
  • <INS> for insertion
    • Must have a key in INFO field for insertion sequence (without padding base). The default key is SEQ.
    • For blockwise swaps, we strongly recommend using indel-style representation rather than symbolic alleles.
  • <DUP> for duplication
    • Must have END key in INFO field. paraGRAPH assumes that the sequence between POS and END is duplicated one additional time in the alternative allele.
  • <INV> for inversion
    • Must have END key in INFO field. paraGRAPH assumes that the sequence between POS and END is reverse-complemented in the alternative allele.
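
The requirements above can be checked before a run. This is a minimal pre-flight sketch and not Paragraph's own validation: the function names parse_info and check_record are hypothetical, and the check only looks at the ALT and INFO columns of a single VCF data line.

```python
# INFO key required for each symbolic allele, as listed above.
# <INS> is handled separately because its key is configurable (default SEQ).
REQUIRED_INFO = {"<DEL>": "END", "<DUP>": "END", "<INV>": "END"}


def parse_info(info_field):
    """Parse a VCF INFO column into a dict; flag-only keys map to True."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def check_record(vcf_line, ins_key="SEQ"):
    """Return a list of problems found in one VCF data line (empty if none)."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return ["fewer than 8 tab-separated columns"]
    alts, info = fields[4].split(","), parse_info(fields[7])
    problems = []
    for alt in alts:
        if not alt.startswith("<"):
            continue  # indel-style allele, nothing to check here
        if alt == "<INS>":
            needed = ins_key
        elif alt in REQUIRED_INFO:
            needed = REQUIRED_INFO[alt]
        else:
            problems.append(f"unsupported symbolic allele {alt}")
            continue
        if needed not in info:
            problems.append(f"{alt} requires {needed} in INFO")
    return problems
```

A record such as a <DEL> without END yields ["<DEL> requires END in INFO"], while indel-style records pass through unchecked.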

Sample Manifest

Must be tab-delimited.

Required columns:

  • id: Each sample must have a unique ID. The output VCF will include genotypes for all samples in the manifest
  • path: Path to the BAM/CRAM file.
  • depth: Average depth across the genome. Can be calculated with bin/idxdepth (faster than samtools).
  • read length: Average read length (bp) across the genome.

Optional columns:

  • depth sd: Specify standard deviation for genome depth. Used for the normal test of breakpoint read depth. Default is sqrt(5*depth).
  • depth variance: Square of depth sd.
  • sex: Affects chrX and chrY genotyping. Allowed values are "male" or "M", "female" or "F", and "unknown" (do not include the quotes in the manifest). If not specified, the sample is treated as unknown.
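
A manifest with these columns can be generated programmatically. The sketch below is illustrative only: write_manifest and default_depth_sd are hypothetical helpers, and it assumes the manifest starts with a header row naming the columns exactly as listed above. The sample path in the usage note is made up.

```python
import csv
import io
import math

REQUIRED = ["id", "path", "depth", "read length"]


def default_depth_sd(depth):
    """Default standard deviation stated above: sqrt(5 * depth)."""
    return math.sqrt(5 * depth)


def write_manifest(samples, handle):
    """Write a tab-delimited manifest and reject duplicate sample IDs."""
    seen = set()
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Assumption: a header row with the column names comes first.
    writer.writerow(REQUIRED + ["sex"])
    for sample in samples:
        if sample["id"] in seen:
            raise ValueError(f"duplicate sample id: {sample['id']}")
        seen.add(sample["id"])
        row = [sample[column] for column in REQUIRED]
        row.append(sample.get("sex", "unknown"))
        writer.writerow(row)
```

For example, write_manifest([{"id": "S1", "path": "/data/S1.bam", "depth": 30, "read length": 150}], open("S1.txt", "w", newline="")) writes a one-sample manifest with sex set to unknown.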

Run time

  • On a 30x HiSeqX sample, Paragraph typically takes 1-2 seconds to genotype a simple SV in confident regions.

  • If the SV is in a low-complexity region with abnormal read pileups, the running time could vary.

  • For efficiency, we recommend manually setting the "-M" option (maximum allowed read count for a variant) to skip these high-depth regions. A good value for "-M" is 20 times your mean sample depth.
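
The recommended "-M" value follows directly from the depth column of the manifest. A small sketch, assuming that "-M" expects an integer; the function name max_reads_per_event is hypothetical.

```python
def max_reads_per_event(mean_depth, factor=20):
    """Value for -M: factor times the mean sample depth, rounded to an int."""
    if mean_depth <= 0:
        raise ValueError("mean depth must be positive")
    return int(round(factor * mean_depth))
```

For a 30x sample this gives 600.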

Population-scale genotyping

To efficiently genotype SVs across a population, we recommend running Paragraph in single-sample mode as follows:

  • Create a manifest for each single sample
  • Run multigrmpy.py for each manifest. Be sure to set "-M" option for each sample according to its depth.
  • Multithreading (option "-t") is highly recommended for population-scale genotyping
  • Merge all genotypes.vcf.gz to create a big VCF of all samples. You can use either bcftools merge or your custom script.
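
The steps above can be sketched as a pair of helpers that only build the command lines, so they can be inspected or handed to a job scheduler. The helper names and the default of 4 threads are assumptions for illustration, as is the assumption that "-t" takes a thread count. The options themselves are the ones named in this document.

```python
def grmpy_command(vcf, manifest, reference, out_dir, depth, threads=4):
    """Argument list for one single-sample multigrmpy.py run."""
    return [
        "python3", "bin/multigrmpy.py",
        "-i", vcf,
        "-m", manifest,
        "-r", reference,
        "-o", out_dir,
        # -M set per sample to 20 times its mean depth, as recommended.
        "-M", str(int(round(20 * depth))),
        "-t", str(threads),
    ]


def merge_command(out_dirs, merged_vcf):
    """Argument list for merging the per-sample VCFs with bcftools merge."""
    inputs = [f"{d}/genotypes.vcf.gz" for d in out_dirs]
    return ["bcftools", "merge", "-O", "z", "-o", merged_vcf] + inputs
```

Each list can be passed to subprocess.run(cmd, check=True). Note that bcftools merge needs indexed inputs, so each genotypes.vcf.gz may have to be indexed (for example with bcftools index) before merging.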

Run Paragraph on complex variants

For more complicated events (e.g. genotyping a deletion together with its nearby SNP), you can provide a customized JSON to paraGRAPH:

Please follow the pattern in example JSON and make sure all required keys are provided. Here is a visualization of this sample graph.

To obtain graph alignments for this graph (including all reads), run:

bin/paragraph -b <input BAM> \
              -r <reference fasta> \
              -g <input graph JSON> \
              -o <output JSON path> \
              -E 1

To obtain the alignment summary, genotypes of each breakpoint, and the whole graph, run:

bin/grmpy -m <input manifest> \
          -r <reference fasta> \
          -i <input graph JSON> \
          -o <output JSON path> \
          -E 1

If you have multiple events listed in the input JSON, multigrmpy.py can help you to run multiple grmpy jobs together.

Further Information

Please check the GitHub wiki for common usage questions and errors.

Documentation

External links

  • The Illumina/Polaris repository provides the short-read sequencing data we used to test our method at population scale.

License

The LICENSE file contains information about libraries and other tools we use, and license information for these.
