Language: Python
License: Apache License 2.0

NGS-pipe: next-generation sequencing pipelines for precision oncology

NGS-Pipe

Description

NGS-pipe provides analyses for large-scale DNA and RNA sequencing experiments. The pre-implemented functions cover the detection of germline variants, somatic single nucleotide variants (SNVs), and insertions and deletions (InDels), as well as copy number event detection and differential expression analysis. It also provides pre-configured workflows, so that even inexperienced users can quickly generate the final mutational information, quality reports, and all intermediate results. The pipeline can run on a single computer or in a cluster environment, where independent steps are executed in parallel. If a step fails and produces incomplete or no results, the computation of all dependent steps is halted and an error message is shown. Once the issue is resolved, the pipeline resumes the analysis at the appropriate point on its own, so there is no need to rerun the complete analysis or manually delete erroneous files.
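The halt-and-resume behaviour described above can be illustrated with a small sketch. It is not NGS-pipe's implementation, which is built on Snakemake rules. The function `run_pipeline`, the `.done` marker files, and the three demo step names are all hypothetical and only show the idea: finished steps are skipped, a failed step leaves no marker, and steps depending on it are halted until a later run.

```python
import os
import tempfile


def run_pipeline(steps, workdir):
    """Run steps in order; skip finished ones and halt dependents of failures.

    `steps` is a list of (name, dependencies, action) tuples in dependency
    order. A step counts as finished if its marker file exists in `workdir`.
    """
    status = {}
    for name, deps, action in steps:
        marker = os.path.join(workdir, name + ".done")
        if os.path.exists(marker):
            status[name] = "skipped"  # result of an earlier run is reused
            continue
        if any(status[d] in ("failed", "halted") for d in deps):
            status[name] = "halted"  # an upstream step did not finish
            continue
        try:
            action()
        except Exception as err:
            print(f"step {name} failed: {err}")
            status[name] = "failed"  # no marker is written, so it reruns later
            continue
        with open(marker, "w"):
            pass  # an empty marker file records the finished step
        status[name] = "ran"
    return status


def make_demo_steps(fail_mapping):
    """Build a hypothetical three-step chain whose middle step may fail."""
    def mapping():
        if fail_mapping:
            raise RuntimeError("aligner crashed")

    return [
        ("trim", [], lambda: None),
        ("map", ["trim"], mapping),
        ("call_variants", ["map"], lambda: None),
    ]


with tempfile.TemporaryDirectory() as workdir:
    first = run_pipeline(make_demo_steps(fail_mapping=True), workdir)
    second = run_pipeline(make_demo_steps(fail_mapping=False), workdir)
    print(first)
    print(second)
```

In the first run the mapping step fails and variant calling is halted. In the second run trimming is skipped because its marker already exists, and only the two remaining steps are executed.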

See also the wiki pages of this repository for more information about NGS-pipe.

Workflows for WES, WGS, and RNA-seq data

We have implemented and tested predefined workflows for the automated analysis of WES, WGS, and RNA-seq data (Fig. 1).

Fig. 1: Workflows

The primary data analysis steps include Trimmomatic (Bolger, 2014) to process raw files, BWA (Li and Durbin, 2009) or STAR (Dobin, 2013) to align reads, and Picard tools (http://broadinstitute.github.io/picard), SAMtools (Li et al., 2009), and GATK (McKenna, 2010) to process the aligned reads.

Detecting genomic variants is highly dependent on properties of the input data, such as variant frequency, coverage, or contamination (Cai, 2016; Hofmann, 2017). For this reason, we included several variant callers in NGS-pipe, namely Mutect (Cibulskis, 2013), JointSNVMix2 (Roth, 2012), VarScan2 (Koboldt, 2012), VarDict (Lai, 2016), SomaticSniper (Larson, 2011), Strelka (Saunders, 2012), and deepSNV (Gerstung, 2012). Further, we included SomaticSeq (Fang, 2015), which combines the results of multiple variant callers and ranked high in the ICGC-TCGA DREAM Somatic Mutation Calling Challenge (Ewing, 2015), and the rank aggregation scheme introduced by Hofmann (2017).
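To give a feel for what combining several callers by rank can look like, here is a minimal sketch using a plain Borda count. This is a generic technique chosen for illustration. It is neither the SomaticSeq method nor the rank aggregation scheme of Hofmann (2017), and the caller names, variants, and the function `borda_aggregate` are made up.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Combine per-caller variant rankings with a Borda count.

    `rankings` maps a caller name to a list of variants ordered from most to
    least confident. A variant missing from a caller's list scores 0 there.
    """
    all_variants = {v for ranked in rankings.values() for v in ranked}
    n = len(all_variants)
    scores = defaultdict(int)
    for ranked in rankings.values():
        for position, variant in enumerate(ranked):
            scores[variant] += n - position  # top rank earns the most points
    # Sort by descending score; ties are broken by the variant itself.
    return sorted(all_variants, key=lambda v: (-scores[v], v))


# Hypothetical calls: (chromosome, position, reference, alternative).
calls = {
    "caller_a": [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C"), ("chr3", 300, "C", "A")],
    "caller_b": [("chr2", 200, "G", "C"), ("chr1", 100, "A", "T")],
    "caller_c": [("chr2", 200, "G", "C"), ("chr3", 300, "C", "A")],
}
consensus = borda_aggregate(calls)
for variant in consensus:
    print(variant)
```

The variant reported by all three callers ends up first, followed by the two variants that only two callers support, ordered by how highly those callers ranked them.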

Copy number events are detected by FACETS (Shen, 2016) or by BIC-seq2 (Xi, 2016), which has been designed specifically for whole-genome data.

The results of the experiments can be annotated and manipulated using SnpEff (Cingolani, 2012), SnpSift (Cingolani, 2012), and ANNOVAR (Wang, 2010).

RNA-seq data is analyzed to quantify gene expression levels. We include quality control, alignment, and gene counting using the Subread package (Liao, 2014). Output files are reformatted to serve as direct input to tools that perform differential gene expression analysis.
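The reformatting step is not described further here, so the following is only a sketch of one plausible version: merging per-sample count tables into a single gene-by-sample matrix. It assumes the tab-separated featureCounts layout (comment lines starting with `#`, a header row, gene ID in the first column, and the count in the seventh column). The function names, sample names, gene names, and numbers are invented.

```python
import csv
import io


def read_counts(handle):
    """Return {gene_id: count} from one featureCounts-style table."""
    data_lines = (line for line in handle if not line.startswith("#"))
    rows = csv.reader(data_lines, delimiter="\t")
    next(rows)  # skip the column header
    return {row[0]: int(row[6]) for row in rows if row}


def merge_counts(tables):
    """Merge {sample: {gene: count}} into a header row plus one row per gene."""
    samples = sorted(tables)
    genes = sorted({g for counts in tables.values() for g in counts})
    header = ["gene_id"] + samples
    # Genes absent from a sample's table are given a count of 0.
    body = [[g] + [tables[s].get(g, 0) for s in samples] for g in genes]
    return [header] + body


# Made-up input tables standing in for two per-sample count files.
tumor = io.StringIO(
    "# made-up comment line\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\ttumor.bam\n"
    "geneC\tchr17\t1\t100\t-\t100\t42\n"
    "geneA\tchr12\t1\t200\t-\t200\t7\n"
)
normal = io.StringIO(
    "# made-up comment line\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tnormal.bam\n"
    "geneC\tchr17\t1\t100\t-\t100\t40\n"
    "geneB\tchr8\t1\t300\t+\t300\t3\n"
)
matrix = merge_counts({"tumor": read_counts(tumor), "normal": read_counts(normal)})
for row in matrix:
    print("\t".join(str(cell) for cell in row))
```

A matrix of this shape, with one row per gene and one column per sample, is what count-based differential expression tools typically take as input.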

Example

The directory examples/wes/ contains a ready-to-go example for the analysis of three leukemia patients (Cifola, 2015). This example downloads matched tumor-control exome data sets from the Sequence Read Archive, installs the required programs, downloads the necessary reference files, and builds the essential indices. It then runs an analysis that starts with mapping the reads with BWA (Li and Durbin, 2009) and ends with somatic variant calling with VarScan2 (Koboldt, 2012). After installing all tools via conda, you can proceed as follows:

#1. Go to examples folder:
cd examples/dna
#2. Download test data: We provide an additional snakemake pipeline to 
#   download test sequences, databases and adapter files:
./run_prepare_data_locally.sh
# This will download 6 test data sets, the adapters, regions file,
# the human reference and build the BWA database index
#3. Execute the DNA Pipeline:
./run_analysis_locally.sh
# This will execute: RAW --> QC(Trimmomatic) --> Mapping(BWA) --> Sort(Picard)
# --> Merge(Picard) --> Remove Secondary Alignments(Samtools) --> MarkDuplicates(Picard)
# --> RemoveDuplicates(Samtools) --> SNV Calling (VarScan2)
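If you prefer to drive the two steps from Python, for example inside a larger script, a minimal sketch could look like this. The helper `run_in_order` and the constant `EXAMPLE_COMMANDS` are hypothetical and not part of NGS-pipe. The sketch only assumes that both shell scripts signal failure with a non-zero exit code and that it is started from the examples folder.

```python
import subprocess
import sys


def run_in_order(commands):
    """Run each command in turn and stop at the first non-zero exit code."""
    completed = []
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"stopping: {cmd} exited with {result.returncode}", file=sys.stderr)
            break
        completed.append(cmd)
    return completed


# The two scripts named in the walkthrough above.
EXAMPLE_COMMANDS = [
    ["./run_prepare_data_locally.sh"],
    ["./run_analysis_locally.sh"],
]
```

Calling `run_in_order(EXAMPLE_COMMANDS)` would then start the analysis only if the data preparation finished successfully.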

An example for RNA-seq data analysis can be found in examples/rna/.

References

Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), 2114-2120.

Cai, L., Yuan, W., Zhang, Z., He, L., & Chou, K. C. (2016). In-depth comparison of somatic point mutation callers based on different tumor next-generation sequencing depth data. Scientific reports, 6.

Cibulskis, K., Lawrence, M. S., Carter, S. L., Sivachenko, A., Jaffe, D., Sougnez, C., ... & Getz, G. (2013). Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature biotechnology, 31(3), 213-219.

Cifola, I., Lionetti, M., Pinatel, E., Todoerti, K., Mangano, E., Pietrelli, A., et al. (2015). Whole-exome sequencing of primary plasma cell leukemia discloses heterogeneous mutational patterns. Oncotarget, 6(19), 17543-17558.

Cingolani, P., Patel, V. M., Coon, M., Nguyen, T., Land, S. J., Ruden, D. M., & Lu, X. (2012). Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift. Frontiers in genetics, 3, 35.

Cingolani, P., Platts, A., Wang, L. L., Coon, M., Nguyen, T., Wang, L., ... & Ruden, D. M. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 6(2), 80-92.

Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., ... & Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15-21.

Ewing, A. D., Houlahan, K. E., Hu, Y., Ellrott, K., Caloian, C., Yamaguchi, T. N., ... & Boutros, P. C. (2015). Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nature methods, 12(7), 623-630.

Fang, L. T., Afshar, P. T., Chhibber, A., Mohiyuddin, M., Fan, Y., Mu, J. C., ... & Lam, H. Y. K. (2015). An ensemble approach to accurately detect somatic mutations using SomaticSeq. Genome biology, 16(1), 197.

Gerstung, M., Beisel, C., Rechsteiner, M., Wild, P., Schraml, P., Moch, H., & Beerenwinkel, N. (2012). Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nature communications, 3, 811.

Hofmann, A. L., Behr, J., Singer, J., Kuipers, J., Beisel, C., Schraml, P., ... & Beerenwinkel, N. (2017). Detailed simulation of cancer exome sequencing data reveals differences and common limitations of variant callers. BMC Bioinformatics, 18(1), 8.

Koboldt, D., Zhang, Q., Larson, D., Shen, D., McLellan, M., Lin, L., Miller, C., Mardis, E., Ding, L., & Wilson, R. (2012). VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome research, 22(3), 568-576.

Köster, J., & Rahmann, S. (2012). Snakemake–a scalable bioinformatics workflow engine. Bioinformatics, 28(19), 2520-2522.

Lai, Z., Markovets, A., Ahdesmaki, M., Chapman, B., Hofmann, O., McEwen, R., ... & Dry, J. R. (2016). VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic acids research, 44(11), e108-e108.

Larson, D. E., Harris, C. C., Chen, K., Koboldt, D. C., Abbott, T. E., Dooling, D. J., ... & Ding, L. (2011). SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics, 28(3), 311-317.

Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14), 1754-1760.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., ... & Durbin, R. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16), 2078-2079.

Liao, Y., Smyth, G. K., & Shi, W. (2014). featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7), 923-930.

McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., ... & DePristo, M. A. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research, 20(9), 1297-1303.

Roth, A., Ding, J., Morin, R., Crisan, A., Ha, G., Giuliany, R., ... & Marra, M. A. (2012). JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics, 28(7), 907-913.

Saunders, C. T., Wong, W. S., Swamy, S., Becq, J., Murray, L. J., & Cheetham, R. K. (2012). Strelka: accurate somatic small-variant calling from sequenced tumor–normal sample pairs. Bioinformatics, 28(14), 1811-1817.

Shen, R., & Seshan, V. E. (2016). FACETS: allele-specific copy number and clonal heterogeneity analysis tool for high-throughput DNA sequencing. Nucleic acids research, 44(16), e131-e131.

Wang, K., Li, M., & Hakonarson, H. (2010). ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic acids research, 38(16), e164-e164.

Xi, R., Lee, S., Xia, Y., Kim, T. M., & Park, P. J. (2016). Copy number analysis of whole-genome data using BIC-seq2 and its application to detection of cancer susceptibility variants. Nucleic acids research, gkw491.
