• Stars
    star
    201
  • Rank 194,491 (Top 4 %)
  • Language
    Shell
  • License
    Other
  • Created almost 5 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

k-mer based assembly evaluation

Merqury

Evaluate genome assemblies with k-mers and more

Often, genome assembly projects have illumina whole genome sequencing reads available for the assembled individual.
The k-mer spectrum of this read set can be used for independently evaluating assembly quality without the need of a high quality reference.
Merqury provides a set of tools for this purpose.

Dependency

  • gcc 10.2.0 or higher (for installing Meryl)
  • Meryl v1.4
  • Java run time environment (JRE)
  • R with argparse, ggplot2, and scales (recommend R 4.0.3+)
  • bedtools
  • samtools

Note that igvtools is no longer used. The .tdf files are replaced with .wig files, compatable to IGV and UCSC genome browser.

Installation

Version supporting homopolymer compressed hapmers for Verkko (Recommended)

Only the latest Merqury version supports the compressed option used in Verkko.

tar -xJf meryl-1.4.*.tar.xz
cd meryl-1.4/bin
export PATH=$pwd:$PATH

If the binary doesn't work, download the source and compile:

git clone https://github.com/marbl/meryl.git
cd meryl/src
make -j 24
export PATH=/path/to/meryl/โ€ฆ/bin:$PATH

See if we get help message with meryl.

  • Merqury
git clone https://github.com/marbl/merqury.git
cd merqury
export MERQURY=$PWD

Other dependency:

  • Java run time environment (JRE)
  • R with argparse, ggplot2, and scales (recommend R 4.0.3+)
  • bedtools
  • samtools

Previous Releases

The v1.3 Merqury is compatable with Meryl v1.3, however does not support homopolymer-compressed kmers. In addition, multiple issues were fixed (e.g. use of large k-mers, better memory utilization, minor bugs in the logs etc.) since then. Therefore, we recommend to use the latest Meryl and Merqury. The Conda version below is currently deploying v1.3.

Direct installation

  1. Get a working Meryl in your PATH

Download Meryl release: https://github.com/marbl/meryl/releases/tag/v1.3

tar -xJf meryl-1.3.*.tar.xz
cd meryl-1.3/bin
export PATH=$pwd:$PATH

If the binary doesn't work, download the source and compile:

cd meryl/src
make -j 24
export PATH=/path/to/meryl/โ€ฆ/bin:$PATH

See if we get help message with meryl.

  1. Download the release version and set env variable $MERQURY
wget https://github.com/marbl/merqury/archive/v1.3.tar.gz
tar -zxvf v1.3.tar.gz
cd merqury-1.3
export MERQURY=$PWD

Add the โ€œexportโ€ part to your environment for meryl and MERQURY (~/.bash_profile or ~/.profile).
Add installation dir paths for bedtools and samtools to your environment.
source it.

Through Conda

Thanks to @EdHarry, a conda recipe is now available: https://anaconda.org/bioconda/merqury
On a new conda environment, run:

conda install -c conda-forge -c bioconda merqury

Or, if you have a different version of jdk installed or want to have a separate conda environnment for merqury:

conda create -n merqury -c conda-forge -c bioconda merqury openjdk=11

You will then need to activate the merqury environment before using it with:

conda activate merqury

Test running

Rscript $MERQURY/plot/plot_spectra_cn.R --help

In case R complains for version mismatches of the R packages, try

conda update --all

It seems like R in conda isn't maintained anymore. Try to modify channel priority in .condarc.

Run

  • !! Merqury assumes all meryl dbs (dirs) are named with .meryl. !!

On a single machine:

ln -s $MERQURY/merqury.sh		# Link merqury
./merqury.sh <read-db.meryl> [<mat.meryl> <pat.meryl>] <asm1.fasta> [asm2.fasta] <out>

Usage: merqury.sh <read-db.meryl> [<mat.meryl> <pat.meryl>] <asm1.fasta> [asm2.fasta] <out>
	<read-db.meryl>	: k-mer counts of the read set
	<mat.meryl>		: k-mer counts of the maternal haplotype (ex. mat.only.meryl or mat.hapmer.meryl)
	<pat.meryl>		: k-mer counts of the paternal haplotype (ex. pat.only.meryl or pat.hapmer.meryl)
	<asm1.fasta>	: Assembly fasta file (ex. pri.fasta, hap1.fasta or maternal.fasta)
	[asm2.fasta]	: Additional fasta file (ex. alt.fasta, hap2.fasta or paternal.fasta)
	*asm1.meryl and asm2.meryl will be generated. Avoid using the same names as the hap-mer dbs
	<out>		: Output prefix

< > : required
[ ] : optional

Example

Below is showing examples how to run Merqury using the prebuilt meryl dbs on a. thaliana F1 hybrid. The fasta files are the trio-binned assemblies from Koren et al.

### Download assemblies ###
wget https://gembox.cbcb.umd.edu/triobinning/athal_COL.fasta
wget https://gembox.cbcb.umd.edu/triobinning/athal_CVI.fasta

### Download prebuilt meryl dbs ###
# read.meryl of the F1 hybrid between COL-0 and CVI-0
wget https://obj.umiacs.umd.edu/marbl_publications/merqury/athal/a_thal.k18.meryl.tar.gz
# hap-mers for COL-0 haplotype
wget https://obj.umiacs.umd.edu/marbl_publications/merqury/athal/a_thal.col0.hapmer.meryl.tar.gz
# hap-mers for CVI-0 haplotype
wget https://obj.umiacs.umd.edu/marbl_publications/merqury/athal/a_thal.cvi0.hapmer.meryl.tar.gz

# Untar
for gz in *.tar.gz
do
    tar -zxf $gz
done

# Run merqury
$MERQURY/merqury.sh F1.k18.meryl col0.hapmer.meryl cvi0.hapmer.meryl athal_COL.fasta athal_CVI.fasta test

1. I have one assembly (pseudo-haplotype or mixed-haplotype)

# I don't have the hap-mers
$MERQURY/merqury.sh read-db.meryl asm1.fasta out_prefix
# Using the example above
$MERQURY/merqury.sh F1.k18.meryl athal_COL.fasta test-1

# I have the hap-mers
$MERQURY/merqury.sh read-db.meryl mat.meryl pat.meryl asm1.fasta out_prefix
# Using the example above
$MERQURY/merqury.sh F1.k18.meryl col0.hapmer.meryl cvi0.hapmer.meryl athal_COL.fasta test-1

2. I have two assemblies (diploid)

# I don't have the hap-mers
$MERQURY/merqury.sh read-db.meryl asm1.fasta asm2.fasta out_prefix
# Using the example above
$MERQURY/merqury.sh F1.k18.meryl athal_COL.fasta athal_CVI.fasta test-2

# I have the hap-mers
$MERQURY/merqury.sh read-db.meryl mat.meryl pat.meryl asm1.fasta asm2.fasta out_prefix
# Using the example above
$MERQURY/merqury.sh F1.k18.meryl col0.hapmer.meryl cvi0.hapmer.meryl athal_COL.fasta athal_CVI.fasta test-2
  • Note there is no need to run merqury per-assemblies. Give two fasta files, Merqury generates stats for each and combined.

How to parallelize

Merqury starts with eval/spectra_cn.sh. When hap-mers are provided, merqury runs modules under trio/ in addition to eval/spectra_cn.sh.

The following can run at the same time. Modules with dependency are followed by arrows (->).

  • eval/spectra_cn.sh -> trio/spectra_hap.sh
  • trio/hap_blob.sh
  • trio/phase_block.sh per assembly -> trio/block_n_stats.sh

Meryl, the k-mer counter inside, uses the maximum cpus available. Set OMP_NUM_THREADS=24 for example to use 24 threads.

On slurm environment, simply run:

ln -s $MERQURY/_submit_merqury.sh	# Link merqury
./_submit_merqury.sh <read-db.meryl> [<mat.meryl> <pat.meryl>] <asm1.fasta> [asm2.fasta] <out>

Change the sbatch to match your environment. (ex. partition)

Outputs from each modules

  • eval/spectra_cn.sh: k-mer completeness, qv, spectra-cn and spectra-asm plots, asm-only .bed and .tdf for tracking errors
  • eval/qv.sh: just get the qv stats and quit.
  • trio/spectra_hap.sh: hap-mer level spectra-cn plots, hap-mer completeness
  • trio/hap_blob.sh: blob plots of the hap-mers in each contg/scaffold
  • trio/phase_block.sh: phase block statistics, phase block N* plots, hap-mer tracks (.bed and .tdf files)
  • trio/block_n_stats.sh: continuity plots (phase block N* or NG* plots, phase block vs. contig/scaffold plots)
  • trio/switch_error.sh: this is run part of phase_blck.sh, however can be re-run with desired short-range switch parameters. Run trio/block_n_stats.sh along with it to get the associated plots.

Tips for helps

Run each script without any parameters if not sure what to do. For example, ./trio/switch_error.sh will give a help message and quit.

Following wiki pages have more detailed examples.

1. Prepare meryl dbs (details)

  1. Get the right k size
  2. Build k-mer dbs with meryl
  3. Build hap-mers for trios

2. Overall assembly evaluation (details)

  1. Reference free QV estimate
  2. k-mer completeness (recovery rate)
  3. Spectra copy number analysis
  4. Track error bases in the assembly

3. Phasing assessment with hap-mers (details)

  1. Inherited hap-mer plots
  2. Hap-mer blob plots
  3. Hap-mer completeness (recovery rate)
  4. Spectra copy number analysis per hap-mers
  5. Phased block statistics and switch error rates
  6. Track each haplotype block in the assembly

Available pre-built meryl dbs

Meryl dbs from Illumina WGS and hapmers are available here for

  • A. thaliana COL-0 x CVI-0 F1
  • NA12878 (HG001)
  • HG002

Citing merqury

Please use this paper to cite Merqury:

Rhie, A., Walenz, B.P., Koren, S. et al. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245 (2020). https://doi.org/10.1186/s13059-020-02134-9

More Repositories

1

CHM13

The complete sequence of a human genome
914
star
2

canu

A single molecule sequence assembler for genomes large and small.
C++
600
star
3

Krona

Interactively explore metagenomes and more from a web browser.
JavaScript
419
star
4

Mash

Fast genome and metagenome distance estimation using MinHash
C++
355
star
5

verkko

Telomere-to-telomere assembly of accurate long reads (PacBio HiFi, Oxford Nanopore Duplex, HERRO corrected Oxford Nanopore Simplex) and Oxford Nanopore ultra-long reads.
Python
274
star
6

MashMap

A fast approximate aligner for long DNA sequences
C++
210
star
7

Winnowmap

Long read / genome alignment software
C
187
star
8

SALSA

SALSA: A tool to scaffold long read assemblies with Hi-C data
Python
178
star
9

parsnp

Parsnp was designed to align the core genome of hundreds to thousands of bacterial genomes within a few minutes to few hours. Input can be both draft assemblies and finished genomes, and output includes variant (SNP) calls, core genome phylogeny and multi-alignments. Parsnp leverages contextual information provided by multi-alignments surrounding SNP sites for filtration/cleaning, in addition to existing tools for recombination detection/filtration and phylogenetic reconstruction.
C++
124
star
10

ModDotPlot

Python
102
star
11

MHAP

MinHash Alignment Process (MHAP, pronounced MAP): locality-sensitive hashing to detect long-read overlaps and utilities
Java
95
star
12

HG002

A complete diploid human genome
94
star
13

metAMOS

A metagenomic and isolate assembly and analysis pipeline built with AMOS
Roff
93
star
14

meryl

A genomic k-mer counter (and sequence utility) with nice features.
C
78
star
15

harvest

50
star
16

MetaCompass

MetaCompass: Reference-guided Assembly of Metagenomes
Python
38
star
17

Primates

Complete assemblies of non-human primate genomes
38
star
18

MetagenomeScope

Visualization tool for (meta)genome assembly graphs
JavaScript
25
star
19

seqrequester

A tool for summarizing, extracting, generating and modifying DNA sequences.
C
23
star
20

rukki

Extracting paths from assembly graphs
Rust
22
star
21

CHM13-issues

CHM13 human reference genome issue tracking
HTML
18
star
22

T2T-Browser

Genome browser hub for the T2T genomes and resources
HTML
15
star
23

VALET

A pipeline for detecting mis-assemblies in metagenomic assemblies.
TeX
14
star
24

gingr

C++
13
star
25

MetaCarvel

MetaCarvel: A scaffolder for metagenomes
C++
13
star
26

MUMmer3

MUMmer3
C++
11
star
27

binnacle

Binnacle: Using Scaffolds to Improve the Contiguity and Quality of Metagenomic Bins
Python
10
star
28

HG002-issues

HG002 human reference genome issue tracking and polishing
10
star
29

harvest-tools

C++
8
star
30

ATLAS

outlier detection in BLAST hits
Python
3
star