  • Stars: 187
  • Rank: 206,464 (Top 5%)
  • Language: C
  • License: Other
  • Created almost 5 years ago
  • Updated over 1 year ago

Repository Details

Long read / genome alignment software

Winnowmap

Winnowmap is a long-read mapping algorithm optimized for mapping ONT and PacBio reads to repetitive reference sequences. Winnowmap development began on top of the minimap2 codebase, and we have since incorporated the following two ideas to improve mapping accuracy within repeats.

  • Winnowmap implements a novel weighted minimizer sampling algorithm (>=v1.0). This optimization was motivated by the need to efficiently avoid masking frequently occurring k-mers during the seeding stage, and to achieve better mapping accuracy in complex repeats (e.g., long tandem repeats) of the human genome. Using weighted minimizers, Winnowmap down-weights frequently occurring k-mers, thus reducing their chance of being selected as minimizers. This approach preserves the theoretical guarantee of the minimizer sampling technique, i.e., if two sequences share a substring of a specified length, then they are guaranteed to have a matching minimizer. Users can refer to this paper for more details.

  • We noticed that the highest-scoring alignment doesn't necessarily correspond to the correct placement of reads in repetitive regions of T2T human chromosomes. In the presence of a non-reference allele within a repeat, a read sampled from that region could be mapped to an incorrect repeat copy because the standard pairwise sequence alignment scoring system penalizes true variants. This is sometimes referred to as allelic bias. To address this bias, we introduced and implemented the idea of using minimal confidently alignable substrings (>=v2.0). These are minimal-length substrings in a read that align end-to-end to a reference with a mapping quality score above a user-specified threshold. This approach treats each read mapping as a collection of confident sub-alignments, which is more tolerant of structural variation and more sensitive to paralog-specific variants (PSVs). Our most recent paper describes this concept and benchmarking results.
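
The first idea can be illustrated with a small sketch. It is not Winnowmap's implementation: the hash function, the exponent used for down-weighting, and the names `order_value` and `weighted_minimizers` are assumptions made for illustration, and the exact weighting scheme is the one described in the paper. The sketch only shows that a deterministic, weight-aware ordering still picks the same k-mer from any window that two sequences share, while making listed repetitive k-mers less likely to win.

```python
import hashlib


def order_value(kmer, repetitive, exponent=8.0):
    """Deterministic pseudo-random order in [0, 1); the smallest value wins."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    u = (int.from_bytes(digest, "big") + 1) / 2**64  # u lies in (0, 1]
    # Assumed weighting: raising u to a power > 1 shrinks it, so after the
    # flip below a repetitive k-mer is less likely to be the window minimum.
    if kmer in repetitive:
        u = u ** exponent
    return 1.0 - u


def weighted_minimizers(seq, k, w, repetitive=frozenset()):
    """Return sorted (position, k-mer) pairs sampled from windows of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    values = [order_value(x, repetitive) for x in kmers]
    picked = set()
    for start in range(len(kmers) - w + 1):
        # Ties are broken by the leftmost position inside the window.
        best = min(range(start, start + w), key=lambda i: (values[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)
```

Because the order depends only on the k-mer itself, two sequences that share a substring of at least w + k - 1 bases contain an identical window and therefore select the same minimizer from it.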

Compile

Clone the source code from the master branch or download the latest release.

  git clone https://github.com/marbl/Winnowmap.git

Compiling Winnowmap requires a C++ compiler with C++11 and OpenMP support, both of which are available by default in GCC >= 4.8.

  cd Winnowmap
  make -j8

After a successful build, the winnowmap and meryl executables are placed in the bin folder.

Usage

For both mapping long reads and computing whole-genome alignments, Winnowmap requires pre-computing the high-frequency k-mers (e.g., the top 0.02% most frequent) in a reference. Winnowmap uses the meryl k-mer counting tool for this purpose.
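
To make the "top 0.02% most frequent" selection concrete, here is a toy sketch of picking the k-mers whose count exceeds the count found at a given fraction of the distinct k-mers, which is the intent behind a threshold such as `distinct=0.9998`. The function names are hypothetical, and meryl's exact threshold and tie handling may differ from this illustration.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every k-mer occurring in seq (forward strand only)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def most_frequent_kmers(counts, distinct_fraction):
    """Keep k-mers whose count is strictly greater than the count found at
    the given fraction of distinct k-mers, ordered from rarest to commonest."""
    ordered = sorted(counts.values())
    if not ordered:
        return {}
    index = int(distinct_fraction * len(ordered)) - 1
    index = max(0, min(len(ordered) - 1, index))
    threshold = ordered[index]
    return {kmer: c for kmer, c in counts.items() if c > threshold}
```

For example, `most_frequent_kmers(count_kmers(reference, 15), 0.9998)` would keep roughly the 0.02% most frequent distinct 15-mers of a string `reference`, provided their counts are not tied with the threshold count.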

  • Mapping ONT or PacBio HiFi WGS reads
  meryl count k=15 output merylDB ref.fa
  meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt

  winnowmap -W repetitive_k15.txt -ax map-ont ref.fa ont.fq.gz > output.sam
  # or, for PacBio HiFi reads:
  winnowmap -W repetitive_k15.txt -ax map-pb ref.fa hifi.fq.gz > output.sam
  • Mapping genome assemblies
  meryl count k=19 output merylDB asm1.fa
  meryl print greater-than distinct=0.9998 merylDB > repetitive_k19.txt

  winnowmap -W repetitive_k19.txt -ax asm20 asm1.fa asm2.fa > output.sam
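
The commands above follow the same three steps in both use cases, so they can be scripted. The following is a minimal sketch, assuming `meryl` and `winnowmap` are on the PATH; the helper names `build_commands` and `run_pipeline` are hypothetical, and every flag is copied from the commands above rather than taken from any additional Winnowmap option.

```python
import subprocess


def build_commands(ref, query, preset, k, db="merylDB"):
    """Return the argument lists for k-mer counting, selection, and mapping."""
    count = ["meryl", "count", f"k={k}", "output", db, ref]
    select = ["meryl", "print", "greater-than", "distinct=0.9998", db]
    align = ["winnowmap", "-W", f"repetitive_k{k}.txt",
             "-ax", preset, ref, query]
    return count, select, align


def run_pipeline(ref, query, preset, k, out_sam="output.sam"):
    """Run the three steps, redirecting stdout the way the shell '>' does."""
    count, select, align = build_commands(ref, query, preset, k)
    subprocess.run(count, check=True)
    with open(f"repetitive_k{k}.txt", "w") as kmers:
        subprocess.run(select, check=True, stdout=kmers)
    with open(out_sam, "w") as sam:
        subprocess.run(align, check=True, stdout=sam)
```

With these helpers, `run_pipeline("ref.fa", "ont.fq.gz", "map-ont", 15)` corresponds to the ONT read-mapping example, and `run_pipeline("asm1.fa", "asm2.fa", "asm20", 19)` to the assembly example.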

For the genome-to-genome use case, it may be useful to visualize the alignments as a dot plot. This Perl script can be used to generate a dot plot from PAF-formatted output. In both use cases, pre-computing repetitive k-mers using meryl is fast; e.g., it typically takes 2-3 minutes for the human genome reference.
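
For readers who want to see what a dot plot needs from PAF output, here is a minimal sketch. It is not the Perl script mentioned above. It relies only on the twelve mandatory PAF columns (query name, length, start, end, strand, target name, length, start, end, matches, block length, mapping quality); the names `Segment`, `parse_paf_line`, and `dotplot_endpoints` are hypothetical.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    query: str
    target: str
    q_start: int
    q_end: int
    t_start: int
    t_end: int
    reverse: bool


def parse_paf_line(line):
    """Extract the coordinates of one alignment from a PAF record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("a PAF record has at least 12 tab-separated columns")
    return Segment(fields[0], fields[5], int(fields[2]), int(fields[3]),
                   int(fields[7]), int(fields[8]), fields[4] == "-")


def dotplot_endpoints(seg):
    """Return ((x0, y0), (x1, y1)) with the target on x and the query on y."""
    if seg.reverse:
        # On the reverse strand the target start pairs with the query end.
        return (seg.t_start, seg.q_end), (seg.t_end, seg.q_start)
    return (seg.t_start, seg.q_start), (seg.t_end, seg.q_end)
```

Each pair of endpoints can then be drawn as a line segment with any plotting library; reverse-strand alignments appear as descending diagonals.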

Benchmarking

When comparing Winnowmap (v1.0) to minimap2 (v2.17-r954), we observed a reduction in the mapping error rate from 0.14% to 0.06% on the recently finished human X chromosome, and from 3.6% to 0% within the highly repetitive X centromere (3.1 Mbp). Winnowmap improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes. By avoiding masking, Winnowmap also maintains a uniform minimizer density.


Minimizer sampling density using a human X chromosome as the reference, with the centromere positioned between 58 Mbp and 61 Mbp. The ‘Standard’ method refers to the classic minimizer sampling algorithm from Roberts et al., without any masking or modification.
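
A density profile of this kind can be computed from the positions of the sampled minimizers. The sketch below is an illustration only, under the assumption that density means the number of sampled minimizers per base within fixed-size bins; the function name and the bin size are hypothetical and are not taken from the benchmark.

```python
def minimizer_density(positions, seq_len, bin_size):
    """Return minimizers per base for consecutive bins along a sequence."""
    n_bins = (seq_len + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in positions:
        if not 0 <= pos < seq_len:
            raise ValueError(f"position {pos} lies outside the sequence")
        counts[pos // bin_size] += 1
    densities = []
    for b, count in enumerate(counts):
        # The last bin may be shorter than bin_size.
        width = min(bin_size, seq_len - b * bin_size)
        densities.append(count / width)
    return densities
```

A masked region would show up in such a profile as bins with a density near zero, whereas uniform sampling yields similar values in every bin.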

Publications
