V-pipe is a pipeline designed for analysing NGS data of short viral genomes

License: Apache-2.0

V-pipe is a workflow designed for the analysis of next-generation sequencing (NGS) data from viral pathogens. It produces a number of results in a curated format (e.g., consensus sequences, SNV calls, local/global haplotypes). V-pipe is written using the Snakemake workflow management system.

Usage

Different ways of initializing V-pipe are presented below. We strongly encourage you to deploy it using the quick install script, as this is our preferred method.

To configure V-pipe, refer to the documentation in config/README.md.

V-pipe expects the input samples to be organized in a two-level directory hierarchy, and the sequencing reads must be provided in a sub-folder named raw_data. Further details can be found on the website. Check the utils subdirectory for mass-importer tools that can assist you in generating this hierarchy.
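As a minimal sketch of such a layout, the commands below create an empty two-level hierarchy. The first-level names (patient1, patient2) and the date-like second-level names are illustrative assumptions, not values required by V-pipe; only the samples/ directory and the raw_data sub-folder name come from this README.

```shell
# Hypothetical sample and batch names; only "raw_data" is prescribed.
for sample in patient1 patient2; do
    for batch in 20100113 20110202; do
        mkdir -p "samples/${sample}/${batch}/raw_data"
    done
done

# List the leaf folders that should receive the sequencing reads.
find samples -type d -name raw_data | sort
```

The sequencing reads for each sample would then be copied into the matching raw_data folder.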

We provide virus-specific base configuration files which contain handy defaults for, e.g., HIV and SARS-CoV-2. Set the virus in the general section of the configuration file:

general:
  virus_base_config: hiv

Also see Snakemake's documentation to learn more about the command-line options available when executing the workflow.

Tutorials introducing usage of V-pipe are available in the docs/ subdirectory.

Using quick install script

To deploy V-pipe, use the installation script with the following parameters:

curl -O 'https://raw.githubusercontent.com/cbg-ethz/V-pipe/master/utils/quick_install.sh'
# the downloaded file is not marked executable, so run it through bash
bash quick_install.sh -w work

This script will download and install Miniconda, check out the V-pipe Git repository (use -b to specify which branch/tag) and set up a work directory (specified with -w) containing an executable script that runs the workflow:

cd work
# edit config.yaml and provide samples/ directory
./vpipe --jobs 4 --printshellcmds --dry-run
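Before the dry run, it can be useful to confirm that the samples directory follows the expected two-level layout. The following is a minimal sketch under stated assumptions: check_samples is a hypothetical helper that is not shipped with V-pipe, and the demo directory names are placeholders.

```shell
# check_samples is a hypothetical helper, not part of V-pipe.
# It prints every second-level directory that has no raw_data sub-folder
# and returns non-zero if at least one is found.
check_samples() (
    root="$1"
    status=0
    for dir in "$root"/*/*/; do
        [ -d "$dir" ] || continue      # the glob matched nothing
        if [ ! -d "${dir}raw_data" ]; then
            echo "missing raw_data: ${dir%/}"
            status=1
        fi
    done
    exit "$status"                     # leaves the subshell, not the caller
)

# Demo on a throwaway tree: one complete sample, one incomplete.
demo="$(mktemp -d)"
mkdir -p "$demo/patient1/20100113/raw_data" "$demo/patient2/20110202"
check_samples "$demo" || echo "fix the layout before running ./vpipe"
```

In a real work directory the call would be check_samples samples, issued before ./vpipe.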

Using Docker

Note: the Docker image is set up only with the components needed to run the workflow for the HIV and SARS-CoV-2 virus base configurations. Using V-pipe with other viruses or configurations might require internet connectivity to fetch additional software components.

Create config.yaml or vpipe.config and then populate the samples/ directory. For example, the following config file could be used:

general:
  virus_base_config: hiv

output:
  snv: true
  local: true
  global: false
  visualization: true
  QA: true

Then execute:

docker run --rm -it -v "$PWD":/work ghcr.io/cbg-ethz/v-pipe:master --jobs 4 --printshellcmds --dry-run

Using Snakedeploy

First install mamba, then create and activate an environment with Snakemake and Snakedeploy:

mamba create -c bioconda -c conda-forge --name snakemake snakemake snakedeploy
conda activate snakemake

Snakemake's official workflow installer Snakedeploy can now be used:

snakedeploy deploy-workflow https://github.com/cbg-ethz/V-pipe --tag master .
# edit config/config.yaml and provide samples/ directory
snakemake --use-conda --jobs 4 --printshellcmds --dry-run

Dependencies

  • Conda

    Conda is a cross-platform package management system and an environment manager application. Snakemake uses mamba as a package manager.

  • Snakemake

    Snakemake is the central workflow and dependency manager of V-pipe. It determines the order in which individual tools are invoked and checks that programs do not exit unexpectedly.

  • VICUNA

    VICUNA is a de novo assembly software designed for populations with high mutation rates. It is used to build an initial reference for mapping reads with the ngshmmalign aligner when a references/cohort_consensus.fasta file is not provided. Further details can be found in the wiki pages.

Computational tools

Other dependencies are managed by using isolated conda environments per rule, and below we list some of the computational tools integrated in V-pipe:

  • FastQC

    FastQC gives an overview of the raw sequencing data. Flowcells that have been overloaded or that otherwise failed during sequencing can easily be identified with FastQC.

  • PRINSEQ

    Trimming and clipping of reads is performed by PRINSEQ, a versatile raw-read processor with many customization options.

  • ngshmmalign

    We perform the alignment of the curated NGS data using our custom ngshmmalign that takes structural variants into account. It produces multiple consensus sequences that include either majority bases or ambiguous bases.

  • bwa

    To detect specific cross-contaminations with other probes, the Burrows-Wheeler aligner is used. It quickly yields estimates of foreign genomic material in an experiment. Additionally, it can be used as an alternative aligner to ngshmmalign.

  • MAFFT

    To standardise multiple samples to the same reference genome (say HXB2 for HIV-1), the multiple sequence aligner MAFFT is employed. The multiple sequence alignment helps in determining regions of low conservation and thus makes standardisation of alignments more robust.

  • Samtools and bcftools

    Samtools and bcftools are the Swiss Army knife of alignment postprocessing and diagnostics. bcftools is also used to generate consensus sequences with indels.

  • SmallGenomeUtilities

    We perform genomic liftovers to standardised reference genomes using our in-house Python library of utilities for rewriting alignments.

  • ShoRAH

    ShoRAH performs SNV calling and local haplotype reconstruction by using Bayesian clustering.

  • LoFreq

    LoFreq (version 2) is an SNV and indel caller for next-generation sequencing data, and can be used as an alternative engine for SNV calling.

  • SAVAGE and HaploClique

    We use HaploClique or SAVAGE to perform global haplotype reconstruction for heterogeneous viral populations by using an overlap graph.

Citation

If you use this software in your research, please cite:

Posada-Céspedes S., Seifert D., Topolsky I., Jablonski K.P., Metzner K.J., and Beerenwinkel N. 2021. "V-pipe: a computational pipeline for assessing viral genetic diversity from high-throughput sequencing data." Bioinformatics, January. doi:10.1093/bioinformatics/btab015.

Contributions

* software maintainer ; ** group leader

Contact

We encourage users to use the issue tracker. For further enquiries, you can also contact the V-pipe Dev Team [email protected].
