  • Stars: 274
  • Rank: 150,274 (Top 3 %)
  • Language: Python
  • Created over 3 years ago
  • Updated 3 months ago

Repository Details

Telomere-to-telomere assembly of accurate long reads (PacBio HiFi, Oxford Nanopore Duplex, HERRO corrected Oxford Nanopore Simplex) and Oxford Nanopore ultra-long reads.

Verkko

Verkko is a hybrid genome assembly pipeline developed for telomere-to-telomere assembly of PacBio HiFi and Oxford Nanopore reads. Verkko is Finnish for net, mesh and graph.

Verkko uses Canu to correct remaining errors in the HiFi reads, builds a multiplex de Bruijn graph using MBG, aligns the Oxford Nanopore reads to the graph using GraphAligner, progressively resolves loops and tangles first with the HiFi reads then with the aligned Oxford Nanopore reads, and finally creates contig consensus sequences using Canu's consensus module.

Install:

Installing with a 'package manager' is encouraged:

  • conda install -c conda-forge -c bioconda -c defaults verkko

or

  • conda create -n verkko -c conda-forge -c bioconda -c defaults verkko

if you prefer to install verkko in a separate environment. Alternatively, you can download the source for a recent release.

To install Verkko from GitHub (for developers only), run:

git clone https://github.com/marbl/verkko.git
cd verkko/src
git submodule init && git submodule update
make -j32

This will create the folders verkko/bin and verkko/lib/verkko. You can move the contents of these folders to a central installation location, or you can add verkko/bin to your path. If any of the dependencies (e.g. GraphAligner, MBG, winnowmap, mashmap) are not available in your path, you may also symlink them under verkko/lib/verkko/bin/. For development, make sure you are using the latest tip of MBG and GraphAligner rather than a conda install.
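A quick way to see which dependencies still need a symlink is to probe the PATH. The following is a minimal sketch and not part of Verkko: `missing_tools` is a hypothetical helper, and the executable names are assumed to match the tool names listed above.

```python
import shutil

# Executable names are assumptions taken from the tool names in the text.
DEPENDENCIES = ["GraphAligner", "MBG", "winnowmap", "mashmap"]


def missing_tools(tools):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    for tool in missing_tools(DEPENDENCIES):
        print(f"{tool} not found on PATH; consider symlinking it under verkko/lib/verkko/bin/")
```

An empty result means every listed tool was found on the PATH.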

Run:

Verkko is implemented as a Snakemake workflow, launched by a wrapper script to parse options and create a config.yml file.

verkko -d <work-directory> --hifi <hifi-read-files> [--nano <ont-read-files>]

By default, verkko will run the snakemake workflow and all compute on the local machine. Support for SGE, Slurm and LSF (untested) can be enabled with the options --sge, --slurm and --lsf, respectively. This will run the snakemake workflow on the local machine but submit all compute to the grid. To launch both the snakemake workflow and the compute on the grid, wrap the verkko command in a shell script and submit it using your scheduler. If you see errors that component scripts are not found, you may need to set the environment variable VERKKO to the installation directory of Verkko.
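When the command is generated from a script, the local and grid variants differ only by one flag. The sketch below assembles the argument list from the options described above; `build_verkko_command` and the file names are hypothetical, and only flags mentioned in this README are used.

```python
import shlex

# Scheduler flags as listed in the text above.
GRID_FLAGS = {"sge": "--sge", "slurm": "--slurm", "lsf": "--lsf"}


def build_verkko_command(workdir, hifi, nano=None, grid=None):
    """Return the verkko invocation as an argument list."""
    cmd = ["verkko", "-d", workdir, "--hifi", *hifi]
    if nano:
        cmd += ["--nano", *nano]
    if grid is not None:
        cmd.append(GRID_FLAGS[grid])  # raises KeyError for an unknown scheduler
    return cmd


cmd = build_verkko_command("asm", ["hifi.fastq.gz"], nano=["ont.fastq.gz"], grid="slurm")
print(shlex.join(cmd))
# To launch it, something like the following could be used (the path is a placeholder):
# subprocess.run(cmd, check=True, env={**os.environ, "VERKKO": "/path/to/verkko"})
```

Keeping the command as a list avoids quoting problems when read paths contain spaces.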

Verkko supports extended phasing with rukki, using either trio or Hi-C information.

To run in trio mode, you must first generate merqury hapmer databases and pass them to verkko. Please use git clone to pull the latest version of merqury (see the merqury documentation for details) and make sure that /path/to/verkko/lib/verkko/bin is in your path. Then, if you have a SLURM cluster, you can run:

$MERQURY/_submit_build.sh -c 30 maternal.fofn maternal_compress
$MERQURY/_submit_build.sh -c 30 paternal.fofn paternal_compress
$MERQURY/_submit_build.sh -c 30 child.fofn    child_compress

If not, you can run

meryl count compress k=30 threads=XX memory=YY maternal.*fastq.gz output maternal_compress.k30.meryl
meryl count compress k=30 threads=XX memory=YY paternal.*fastq.gz output paternal_compress.k30.meryl
meryl count compress k=30 threads=XX memory=YY    child.*fastq.gz output    child_compress.k30.meryl

replacing XX and YY with the threads and memory you want meryl to use. Once you have the databases, run:

$MERQURY/trio/hapmers.sh \
  maternal_compress.k30.meryl \
  paternal_compress.k30.meryl \
     child_compress.k30.meryl

verkko -d asm \
  --hifi hifi/*.fastq.gz \
  --nano  ont/*.fastq.gz \
  --hap-kmers maternal_compress.k30.hapmer.meryl \
              paternal_compress.k30.hapmer.meryl \
              trio

Make sure to count k-mers in compressed space. Child data is optional; without it, use maternal_compress.k30.only.meryl and paternal_compress.k30.only.meryl in the verkko command above.
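The choice between the two database names can be captured in a few lines. This is an illustrative sketch only: `hap_kmer_args` is a hypothetical helper, and the suffixes are copied from the file names used in this section.

```python
def hap_kmer_args(maternal_prefix, paternal_prefix, have_child):
    """Build the --hap-kmers arguments for trio mode.

    With child data the '.hapmer.meryl' databases are used; without it
    the '.only.meryl' databases are used, as described in the text.
    """
    suffix = "hapmer" if have_child else "only"
    return [
        "--hap-kmers",
        f"{maternal_prefix}.{suffix}.meryl",
        f"{paternal_prefix}.{suffix}.meryl",
        "trio",
    ]


print(" ".join(hap_kmer_args("maternal_compress.k30", "paternal_compress.k30", have_child=False)))
```

The returned list can be appended to the rest of the verkko argument list.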

To run in Hi-C mode, reads should be provided using the --hic1 and --hic2 options. For example:

verkko -d asm \
  --hifi hifi/*.fastq.gz \
  --nano ont/*.fastq.gz \
  --hic1 hic/*R1*fastq.gz  \
  --hic2 hic/*R2*fastq.gz

Hi-C integration is a beta release and has been tested mostly on human and primate genomes. Please see the --rdna-tangle, --uneven-depth and --haplo-divergence options if you want to assemble something distant from human and/or have uneven coverage. If you encounter issues or have questions about appropriate parameters, please open an issue.

You can pass snakemake options through to restrict CPU, memory or cluster resources by adding the --snakeopts option to verkko. For example, --snakeopts "--dry-run" will print which jobs would run, while --snakeopts "--cores 1000" restricts grid runs to at most 1000 cores across all submitted jobs.
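Because --snakeopts takes the snakemake options as one quoted string, a script that builds the command has to keep them together as a single argument. A minimal sketch, with `add_snakeopts` and the file names as hypothetical placeholders:

```python
import shlex


def add_snakeopts(cmd, snake_args):
    """Append snakemake options as the single string that --snakeopts receives."""
    return [*cmd, "--snakeopts", " ".join(snake_args)]


base = ["verkko", "-d", "asm", "--hifi", "hifi.fastq.gz"]

# Preview the jobs first, then cap the total number of cores for the real run.
print(shlex.join(add_snakeopts(base, ["--dry-run"])))
print(shlex.join(add_snakeopts(base, ["--cores", "1000"])))
```

`shlex.join` adds shell quoting where it is needed, so the printed lines can be pasted into a terminal.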

To test your installation, an E. coli K12 dataset is available:

curl -L https://obj.umiacs.umd.edu/sergek/shared/ecoli_hifi_subset24x.fastq.gz -o hifi.fastq.gz
curl -L https://obj.umiacs.umd.edu/sergek/shared/ecoli_ont_subset50x.fastq.gz -o ont.fastq.gz
verkko -d asm --hifi ./hifi.fastq.gz --nano ./ont.fastq.gz

The final assembly result is under asm/assembly.fasta. The final graph (in homopolymer-compressed space) is under asm/assembly.homopolymer-compressed.gfa along with coverage files in asm/assembly*csv. If you provided phasing information, you will also have asm/assembly.haplotype[12].fasta. You can find intermediate graphs and coverage files under asm/*/unitig-*gfa and asm/*/unitig-*csv.
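Homopolymer compression collapses every run of identical bases into a single base, so sequences in the homopolymer-compressed graph are generally shorter than the corresponding stretches of asm/assembly.fasta. The toy function below illustrates the idea only; it is not Verkko's implementation, and `homopolymer_compress` is a hypothetical name.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse each run of identical characters to a single character."""
    return "".join(base for base, _run in groupby(seq))


print(homopolymer_compress("GATTTACCCA"))  # GATACA
```

Coordinates taken from the compressed graph therefore do not map one-to-one onto the uncompressed assembly.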

More Repositories

1. CHM13 (914 stars): The complete sequence of a human genome
2. canu (C++, 600 stars): A single molecule sequence assembler for genomes large and small.
3. Krona (JavaScript, 419 stars): Interactively explore metagenomes and more from a web browser.
4. Mash (C++, 355 stars): Fast genome and metagenome distance estimation using MinHash
5. MashMap (C++, 210 stars): A fast approximate aligner for long DNA sequences
6. merqury (Shell, 201 stars): k-mer based assembly evaluation
7. Winnowmap (C, 187 stars): Long read / genome alignment software
8. SALSA (Python, 178 stars): SALSA: A tool to scaffold long read assemblies with Hi-C data
9. parsnp (C++, 124 stars): Parsnp was designed to align the core genome of hundreds to thousands of bacterial genomes within a few minutes to few hours. Input can be both draft assemblies and finished genomes, and output includes variant (SNP) calls, core genome phylogeny and multi-alignments. Parsnp leverages contextual information provided by multi-alignments surrounding SNP sites for filtration/cleaning, in addition to existing tools for recombination detection/filtration and phylogenetic reconstruction.
10. ModDotPlot (Python, 102 stars)
11. MHAP (Java, 95 stars): MinHash Alignment Process (MHAP, pronounced MAP): locality-sensitive hashing to detect long-read overlaps and utilities
12. HG002 (94 stars): A complete diploid human genome
13. metAMOS (Roff, 93 stars): A metagenomic and isolate assembly and analysis pipeline built with AMOS
14. meryl (C, 78 stars): A genomic k-mer counter (and sequence utility) with nice features.
15. harvest (50 stars)
16. MetaCompass (Python, 38 stars): MetaCompass: Reference-guided Assembly of Metagenomes
17. Primates (38 stars): Complete assemblies of non-human primate genomes
18. MetagenomeScope (JavaScript, 25 stars): Visualization tool for (meta)genome assembly graphs
19. seqrequester (C, 23 stars): A tool for summarizing, extracting, generating and modifying DNA sequences.
20. rukki (Rust, 22 stars): Extracting paths from assembly graphs
21. CHM13-issues (HTML, 18 stars): CHM13 human reference genome issue tracking
22. T2T-Browser (HTML, 15 stars): Genome browser hub for the T2T genomes and resources
23. VALET (TeX, 14 stars): A pipeline for detecting mis-assemblies in metagenomic assemblies.
24. gingr (C++, 13 stars)
25. MetaCarvel (C++, 13 stars): MetaCarvel: A scaffolder for metagenomes
26. MUMmer3 (C++, 11 stars): MUMmer3
27. binnacle (Python, 10 stars): Binnacle: Using Scaffolds to Improve the Contiguity and Quality of Metagenomic Bins
28. HG002-issues (10 stars): HG002 human reference genome issue tracking and polishing
29. harvest-tools (C++, 8 stars)
30. ATLAS (Python, 3 stars): outlier detection in BLAST hits