  • Language: Python
  • License: Apache License 2.0

VIRify: detection of phages and eukaryotic viruses from metagenomic and metatranscriptomic assemblies

  1. The VIRify pipeline
  2. Nextflow execution
  3. CWL execution (discontinued)
  4. Pipeline overview
  5. Detour: Metatranscriptomics
  6. Resources
  7. Citations

[Figure: Sankey plot]

General

VIRify is a pipeline for the detection, annotation, and taxonomic classification of viral contigs in metagenomic and metatranscriptomic assemblies. The pipeline is part of the repertoire of analysis services offered by MGnify. VIRify's taxonomic classification relies on the detection of taxon-specific profile hidden Markov models (HMMs), built upon a set of 22,013 orthologous protein domains and referred to as ViPhOGs.

The pipeline is implemented in Nextflow. Besides Nextflow itself, only Docker or Singularity is needed to run VIRify. Details about installation and usage are given below.

Please note that until v1.0 the pipeline was also implemented in CWL as an alternative to Nextflow. Later updates, however, were included only in the Nextflow version of the pipeline.

Nextflow

A Nextflow implementation of the VIRify pipeline. In the backend, the same scripts are used as in the CWL implementation.

What do I need?

This implementation of the pipeline runs with the workflow manager Nextflow and needs either Docker or Singularity as a second dependency. Conda support is planned but currently blocked because the pipeline uses PPR-Meta. In any case, we highly recommend using the stable containers. All other programs and databases are automatically downloaded by Nextflow.

Attention: the first time it is executed, the workflow downloads containers and databases totaling roughly 19 GB (49 GB with --hmmextend and --blastextend)!
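
Because the first run needs this much space, it can be worth checking free disk space before starting. The following is a minimal Python sketch and not part of VIRify: the function names are hypothetical, and only the 19 GB and 49 GB figures come from the note above.

```python
import shutil

# Approximate first-run download sizes in GB, taken from the note above.
BASE_DOWNLOAD_GB = 19
EXTENDED_DOWNLOAD_GB = 49  # with --hmmextend and --blastextend


def required_gb(extended: bool = False) -> int:
    """Return the approximate space (GB) needed for the first run."""
    return EXTENDED_DOWNLOAD_GB if extended else BASE_DOWNLOAD_GB


def has_enough_space(path: str = ".", extended: bool = False) -> bool:
    """Check whether the filesystem holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb(extended)


if __name__ == "__main__":
    print("Enough space for a default run:", has_enough_space("."))
```

Point `path` at the location where containers and databases will be stored. Intermediate work files need additional space that this check does not account for.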

Install Nextflow

curl -s https://get.nextflow.io | bash

Install Docker

If you don't have experience with bioinformatics tools and their installation, just copy the commands into your terminal to set everything up (on a local machine with full permissions!):

sudo apt-get install -y docker-ce docker-ce-cli containerd.io
sudo usermod -a -G docker $USER

Install Singularity

While Singularity can be installed via Conda, we recommend setting up a true Singularity installation. For HPCs, ask the system administrator you trust. Here is also a good manual to get you started. Please note: you need only one of Docker or Singularity. However, due to security concerns it might not be possible to use Docker on your shared machine or HPC.

Basic Nextflow execution

Install

While it is possible to clone this repository and directly execute virify.nf, we recommend letting Nextflow handle the installation. Get the pipeline code via:

nextflow pull EBI-Metagenomics/emg-viral-pipeline

Test installation and get help:

nextflow run EBI-Metagenomics/emg-viral-pipeline --help

Run specific pipeline version

We highly recommend always running stable releases, which also helps reproducibility:

nextflow run EBI-Metagenomics/emg-viral-pipeline -r v0.4.0 --help

Check the release page to find the newest version of the pipeline. Or run:

nextflow info EBI-Metagenomics/emg-viral-pipeline

Example execution

Run annotation for a small assembly file (10 contigs, 0.78 Mbp) on your local Linux machine using Docker containers (by default --cores 4; takes approximately 10 min on an 8-core i7 laptop, plus time for the ~19 GB database download):

nextflow run EBI-Metagenomics/emg-viral-pipeline -r v0.4.0 --fasta "/home/$USER/.nextflow/assets/EBI-Metagenomics/emg-viral-pipeline/nextflow/test/assembly.fasta" --cores 4 -profile local,docker

Please note that the following parameters in particular control where Nextflow writes files:

  • --workdir or -w (here your work directories with intermediate data will be saved)
  • --databases (here your databases will be saved and the workflow checks if they are already available under this path)
  • --singularity_cachedir (here Singularity containers will be cached, not needed for Docker, default path: ./singularity)

Please clean up your work directory from time to time to save disk space!
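
To keep these locations explicit across runs, the command line can be assembled programmatically. The sketch below is a hedged illustration and not part of VIRify: `build_virify_command` is a hypothetical helper and the paths in the usage example are placeholders. The flags are the ones listed above.

```python
import shlex
from typing import List, Optional

PIPELINE = "EBI-Metagenomics/emg-viral-pipeline"


def build_virify_command(
    fasta: str,
    revision: str = "v0.4.0",
    profile: str = "local,docker",
    cores: int = 4,
    workdir: Optional[str] = None,
    databases: Optional[str] = None,
    singularity_cachedir: Optional[str] = None,
) -> List[str]:
    """Assemble a `nextflow run` argument list for VIRify (hypothetical helper)."""
    cmd = [
        "nextflow", "run", PIPELINE,
        "-r", revision,
        "--fasta", fasta,
        "--cores", str(cores),
        "-profile", profile,
    ]
    # Optional locations: only added when the caller sets them.
    if workdir:
        cmd += ["-w", workdir]
    if databases:
        cmd += ["--databases", databases]
    if singularity_cachedir:
        cmd += ["--singularity_cachedir", singularity_cachedir]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; replace them with locations on your system.
    cmd = build_virify_command(
        "assembly.fasta",
        workdir="/scratch/virify/work",
        databases="/scratch/virify/databases",
    )
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Building a list instead of a single string avoids quoting problems with paths that contain spaces.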

Profiles

Nextflow uses a merged profile handling system, so you have to define an executor (e.g., local, lsf, slurm) and an engine (docker, singularity) to run the pipeline according to your needs and infrastructure.

By default, the workflow runs locally (e.g., on your laptop) with Docker. When you execute the workflow on an HPC, you can for example switch to a specific job scheduler and use Singularity instead of Docker:

  • SLURM (-profile slurm,singularity)
  • LSF (-profile lsf,singularity)

Don't forget, especially on an HPC, to define further important parameters such as -w, --databases, and --singularity_cachedir as mentioned above.

The conda engine is not working at the moment and will not be until there is a conda recipe for PPR-Meta or we switch the tool. Sorry. Use Docker. Or Singularity. Please. Or install PPR-Meta yourself and then use the conda profile (not recommended).
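
The executor-plus-engine rule above can be captured in a small validation helper. This is a minimal sketch under stated assumptions: `make_profile` is a hypothetical name, and the allowed values are only the ones mentioned in this section (the pipeline may define further profiles). The conda engine is left out on purpose because it is currently not working.

```python
# Values named in this section; the pipeline may support more.
EXECUTORS = {"local", "lsf", "slurm"}
ENGINES = {"docker", "singularity"}


def make_profile(executor: str = "local", engine: str = "docker") -> str:
    """Return the value for -profile, e.g. 'slurm,singularity'."""
    if executor not in EXECUTORS:
        raise ValueError(f"unknown executor: {executor!r}")
    if engine not in ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    return f"{executor},{engine}"


if __name__ == "__main__":
    print("-profile", make_profile("slurm", "singularity"))
```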

Monitoring

Monitoring with Nextflow Tower

To monitor your Nextflow computations, VIRify can be connected to Nextflow Tower. You need a user access token to connect your Tower account with the pipeline. Simply generate a login using your email and then click the link sent to this address.

Once logged in, click on your avatar in the top right corner and select "Your tokens." Generate a token or copy the default one and set the following environment variable:

export TOWER_ACCESS_TOKEN=<YOUR_COPIED_TOKEN>

You can save this variable in your .bashrc or .profile so that you do not need to enter it again. Then refresh your terminal.

Now run:

nextflow run EBI-Metagenomics/emg-viral-pipeline -r v0.4.0 --fasta "/home/$USER/.nextflow/assets/EBI-Metagenomics/emg-viral-pipeline/nextflow/test/assembly.fasta" --cores 4 -profile local,docker -with-tower

Alternatively, you can also pull the code from this repository and activate the Tower connection within the nextflow.config file located in the root GitHub directory:

tower {
    accessToken = ''
    enabled = true
} 

You can also directly enter your access token here instead of generating the above-mentioned environment variable.
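
When runs are launched from a script, it can help to add -with-tower only if a token is actually available in the environment. The following Python sketch is an illustration, not part of VIRify: `tower_args` is a hypothetical helper, and it only inspects the TOWER_ACCESS_TOKEN variable, so it will not detect a token stored in nextflow.config.

```python
import os
from typing import List, Mapping, Optional


def tower_args(environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return ['-with-tower'] if TOWER_ACCESS_TOKEN is set, else []."""
    env = os.environ if environ is None else environ
    token = env.get("TOWER_ACCESS_TOKEN", "").strip()
    return ["-with-tower"] if token else []


if __name__ == "__main__":
    extra = tower_args()
    if not extra:
        print("TOWER_ACCESS_TOKEN is not set; running without Tower monitoring.")
    print("Extra Nextflow arguments:", extra)
```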

GFF output files

The outputs generated from the viral prediction tools, ViPhOG annotation, taxonomy assignment, and CheckV quality assessment are integrated and summarized in a validated GFF file. You can find this output in the 08-final/gff/ folder.

The labels used in the Type column of the gff file correspond to the following nomenclature according to the Sequence Ontology resource:

Type in gff file    Sequence ontology ID
viral_sequence      SO:0001041
prophage            SO:0001006
CDS                 SO:0000316

Note that CDS are reported only when a ViPhOG match has been found.
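
For a quick summary of such a file, the feature types can be counted with a few lines of Python. This is a minimal sketch that assumes the standard nine-column, tab-separated GFF layout with the type in the third column. The example records are made up for illustration and are not real VIRify output.

```python
from collections import Counter
from typing import Iterable


def count_feature_types(lines: Iterable[str]) -> Counter:
    """Count the values of the type column (column 3) of a GFF file."""
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, directives and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Skip anything that is not a nine-column feature line.
        if len(fields) < 9:
            continue
        counts[fields[2]] += 1
    return counts


# Made-up records, for illustration only.
EXAMPLE_GFF = [
    "##gff-version 3",
    "contig_1\texample\tviral_sequence\t1\t5000\t.\t.\t.\tID=viral_1",
    "contig_1\texample\tCDS\t100\t900\t.\t+\t0\tID=cds_1",
    "contig_1\texample\tCDS\t1000\t1800\t.\t-\t0\tID=cds_2",
    "contig_2\texample\tprophage\t200\t9000\t.\t.\t.\tID=prophage_1",
]

if __name__ == "__main__":
    for feature_type, n in sorted(count_feature_types(EXAMPLE_GFF).items()):
        print(feature_type, n)
```

To summarize a real file, pass an open file handle instead of the example list, for example `count_feature_types(open(path))`.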

Common Workflow Language (discontinued)

Until VIRify v1.0, VIRify was implemented in Common Workflow Language (CWL) next to the Nextflow implementation. Both Workflow Management Systems were previously supported.

What do I need?

The implementation until v1.0 of VIRify uses CWL version 1.2. It was tested using Toil version 5.3.0 as the workflow engine and conda to manage the software dependencies.

How?

For instructions go to the CWL README.

Pipeline overview

[Figure: VIRify overview]

For further details please check: doi.org/10.1101/2022.08.22.504484

A note about metatranscriptomes

Although VIRify has been benchmarked and validated with metagenomic data in mind, it is also possible to use this tool to detect RNA viruses in metatranscriptome assemblies (e.g. SARS-CoV-2). However, some additional considerations for this purpose are outlined below:

1. Quality control: As for metagenomic data, a thorough quality control of the FASTQ sequence reads to remove low-quality bases, adapters and host contamination (if appropriate) is required prior to assembly. This is especially important for metatranscriptomes as small errors can further decrease the quality and contiguity of the assembly obtained. We have used TrimGalore for this purpose.

2. Assembly: There are many assemblers available that are appropriate for either metagenomic or single-species transcriptomic data. However, to our knowledge, there is no assembler currently available specifically for metatranscriptomic data. From our preliminary investigations, we have found that transcriptome-specific assemblers (e.g. rnaSPAdes) generate more contiguous and complete metatranscriptome assemblies compared to metagenomic alternatives (e.g. MEGAHIT and metaSPAdes).

3. Post-processing: Metatranscriptomes generate highly fragmented assemblies. Therefore, filtering contigs based on a set minimum length has a substantial impact on the number of contigs processed in VIRify. It has also been observed that the number of false-positive detections of VirFinder (one of the tools included in VIRify) is lower among larger contigs. The choice of a length threshold will depend on the complexity of the sample and the sequencing technology used, but in our experience any contigs <2 kb should be analysed with caution.

4. Classification: The classification module of VIRify depends on the presence of a minimum number and proportion of phylogenetically-informative genes within each contig in order to confidently assign a taxonomic lineage. Therefore, short contigs typically obtained from metatranscriptome assemblies remain generally unclassified. For targeted classification of RNA viruses (for instance, to search for Coronavirus-related sequences), alternative DNA- or protein-based classification methods can be used. Two of the possible options are: (i) using MashMap to screen the VIRify contigs against a database of RNA viruses (e.g. Coronaviridae) or (ii) using hmmsearch to screen the proteins obtained in the VIRify contigs against marker genes of the taxon of interest.
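
The length filtering described in the post-processing step can be sketched in plain Python. This is a hedged illustration, not the filter VIRify applies internally: the function names are hypothetical, and the default of 2000 bp is taken only from the "<2 kb" remark above and should be adapted to your sample and sequencing technology.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Record], min_length: int = 2000) -> Iterator[Record]:
    """Keep only contigs with at least `min_length` bases."""
    for header, seq in records:
        if len(seq) >= min_length:
            yield header, seq


if __name__ == "__main__":
    # Tiny made-up assembly: one short and one long contig.
    example = [">short_contig", "ACGT" * 10, ">long_contig", "ACGT" * 600]
    for header, seq in filter_by_length(read_fasta(example)):
        print(f">{header} length={len(seq)}")
```

Running the filter before VIRify reduces the number of contigs to process. Keeping the discarded contigs in a separate file makes it possible to analyse them later with the caution recommended above.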

Resources

Additional material (assemblies used for benchmarking in the paper, ...) as well as the ViPhOG HMMs with model-specific bit score thresholds used in VIRify are available at osf.io/fbrxy.

Here, we also list the databases used and automatically downloaded by the pipeline (in v2.0.0) when it is first run. We deposited the database files on a separate FTP server to ensure their accessibility. The files can also be downloaded manually and then used as input for the pipeline to prevent the auto-download (see --help in the Nextflow pipeline).

Virus-specific protein profile HMMs

  • ViPhOGs (mandatory, used for taxonomy assignment)
    • wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/hmmer_databases/vpHMM_database_v3.tar.gz
    • Additional metadata file for filtering the ViPhOGs (according to taxonomy updates by the ICTV)
      • wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/additional_data_vpHMMs_v4.tsv
    • Publication
  • pVOGs (optional)
    • wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/hmmer_databases/pvogs.tar.gz
    • Publication
  • RVDB (optional)
    • wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/hmmer_databases/rvdb.tar.gz
    • Publication
  • VOGDB (optional)
    • wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/hmmer_databases/vogdb.tar.gz
    • Publication
  • VPF (optional)
    • wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/hmmer_databases/vpf.tar.gz
    • Publication

Initial virus prediction on contig level

  • VirSorter HMMs
    • wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/virsorter-data-v2.tar.gz
    • Publication
  • Virfinder model
    • wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/virfinder/VF.modEPV_k8.rda
    • Publication

Virus prediction QC

  • CheckV
    • wget https://portal.nersc.gov/CheckV/checkv-db-v1.0.tar.gz
    • Publication

Taxonomy annotation

  • NCBI taxonomy
    • wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/2022-11-01_ete3_ncbi_tax.sqlite.gz

Additional blast-based assignment (optional, super slow)

  • IMG/VR
    • wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/IMG_VR_2018-07-01_4.tar.gz
    • Publication

Cite

If you use the pipeline or ViPhOG HMMs in your work, please cite accordingly:

ViPhOGs:

Moreno-Gallego, Jaime Leonardo, and Alejandro Reyes. "Informative regions in viral genomes." Viruses 13.6 (2021): 1164.

VIRify:

Rangel-Pineros, Guillermo, et al. "VIRify: an integrated detection, annotation and taxonomic classification pipeline using virus-specific protein profile hidden Markov models." bioRxiv (2022)
