• Stars
    star
    210
  • Rank 187,585 (Top 4 %)
  • Language
    C++
  • License
    Other
  • Created almost 8 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A fast approximate aligner for long DNA sequences

MashMap

BioConda Install GitHub Downloads

MashMap implements a fast and approximate algorithm for computing local alignment boundaries between long DNA sequences. It can be useful for mapping genome assembly or long reads (PacBio/ONT) to reference genome(s). Given a minimum alignment length and an identity threshold for the desired local alignments, Mashmap computes alignment boundaries and identity estimates using k-mers. It does not compute the alignments explicitly, but rather estimates an unbiased k-mer based Jaccard similarity using a combination of minmers (a novel winnowing scheme) and MinHash. This is then converted to an estimate of sequence identity using the Mash distance. An appropriate k-mer sampling rate is automatically determined using the given minimum local alignment length and identity thresholds.

As an example, Mashmap can map a human genome assembly to the human reference genome in about one minute total execution time and < 4 GB memory using just 8 CPU threads, achieving more than an order of magnitude improvement in both runtime and memory over alternative methods. We describe the algorithms associated with Mashmap, and report on speed, scalability, and accuracy of the software in the publications listed below. Unlike traditional mappers, MashMap does not compute exact sequence alignments. In future, we plan to add an optional alignment support to generate base-to-base alignments.

MashMap3 important changes

  • The automatic sampling rate has increased slightly relative to MashMap2, resulting in more accurate mappings and ANI prediction at the cost of more RAM. For a more dense sampling rate, users can provide the --dense flag.
  • The default output file format is now PAF, but the MashMap2 output format can be obtained with the --legacy flag. The id tag in the PAF output corresponds to the predicted ANI.
  • More details of the MashMap3 changes can be found in the v3.0.1 changelog.

Installation

Follow INSTALL.txt to compile and install MashMap. We also provide dependency-free linux and OSX binaries for user convenience through the latest release.

Usage

  • Map set of query sequences against a reference genome:

    mashmap -r reference.fna -q query.fa

    The output is space-delimited with each line consisting of query name, length, 0-based start, end, strand, target name, length, start, end and mapping nucleotide identity.

  • Map set of query seqences against a list of reference genomes:

    mashmap --rl referenceList.txt -q query.fa

    File 'referenceList.txt' containing the list of reference genomes should contain path to the reference genomes, one per line.

Parameters

For most of the use cases, default values should be appropriate. However, different parameters and their purpose can be checked using the help page mashmap -h. Important ones are mentioned below:

  • Identity threshold (--perc_identity, --pi) : By default, it is set to 85, implying mappings with 85% or more identity should be reported. For example, it can be set to 80% to account for more noisy long-read datasets or 95% for mapping human genome assembly to human reference.

  • Minimum segment length (-s, --segLength) : Default is 5,000 bp. Sequences below this length are ignored. Mashmap provides guarantees on reporting local alignments of length twice this value.

  • Filtering options (-f, --filter_mode) : Mashmap implements a plane-sweep based algorithm to perform the alignment filtering. Similar to delta-filter in nucmer, different filtering options are provided that are suitable for long read or assembly mapping. Option -f map is suitable for reporting the best mappings for long reads, whereas -f one-to-one is suitable for reporting orthologous mappings among all computed assembly to genome mappings.

  • Sketch size (-J, --sketchSize) : This parameter sets the seed density of the winnowing scheme, gauranteeing that the minhash will be calculated from a sample of sketchSize k-mers for each segment. It is set automatically based on --pi but can be manually set as well.

  • Dense sketching (--dense) : This flag will increase the seed density substantially, resulting in a density of roughly 0.02 * (1 + (1 - pi) / .05) where pi is the perc_identity threshold. This leads to longer runtimes and higher RAM usage, but significantly more accurate estimates of ANI.

Visualize

We provide a perl script for generating dot-plots to visualize mappings. It takes Mashmap's mapping output as its input. This script requires availability of gnuplot. Below is an example demonstrating mapping of canu NA12878 human genome assembly (y-axis) to hg38 reference (x-axis).

Release

Use the latest release for a stable version.

Publications

More Repositories

1

CHM13

The complete sequence of a human genome
914
star
2

canu

A single molecule sequence assembler for genomes large and small.
C++
600
star
3

Krona

Interactively explore metagenomes and more from a web browser.
JavaScript
419
star
4

Mash

Fast genome and metagenome distance estimation using MinHash
C++
355
star
5

verkko

Telomere-to-telomere assembly of accurate long reads (PacBio HiFi, Oxford Nanopore Duplex, HERRO corrected Oxford Nanopore Simplex) and Oxford Nanopore ultra-long reads.
Python
274
star
6

merqury

k-mer based assembly evaluation
Shell
201
star
7

Winnowmap

Long read / genome alignment software
C
187
star
8

SALSA

SALSA: A tool to scaffold long read assemblies with Hi-C data
Python
178
star
9

parsnp

Parsnp was designed to align the core genome of hundreds to thousands of bacterial genomes within a few minutes to few hours. Input can be both draft assemblies and finished genomes, and output includes variant (SNP) calls, core genome phylogeny and multi-alignments. Parsnp leverages contextual information provided by multi-alignments surrounding SNP sites for filtration/cleaning, in addition to existing tools for recombination detection/filtration and phylogenetic reconstruction.
C++
124
star
10

ModDotPlot

Python
102
star
11

MHAP

MinHash Alignment Process (MHAP, pronounced MAP): locality-sensitive hashing to detect long-read overlaps and utilities
Java
95
star
12

HG002

A complete diploid human genome
94
star
13

metAMOS

A metagenomic and isolate assembly and analysis pipeline built with AMOS
Roff
93
star
14

meryl

A genomic k-mer counter (and sequence utility) with nice features.
C
78
star
15

harvest

50
star
16

MetaCompass

MetaCompass: Reference-guided Assembly of Metagenomes
Python
38
star
17

Primates

Complete assemblies of non-human primate genomes
38
star
18

MetagenomeScope

Visualization tool for (meta)genome assembly graphs
JavaScript
25
star
19

seqrequester

A tool for summarizing, extracting, generating and modifying DNA sequences.
C
23
star
20

rukki

Extracting paths from assembly graphs
Rust
22
star
21

CHM13-issues

CHM13 human reference genome issue tracking
HTML
18
star
22

T2T-Browser

Genome browser hub for the T2T genomes and resources
HTML
15
star
23

VALET

A pipeline for detecting mis-assemblies in metagenomic assemblies.
TeX
14
star
24

gingr

C++
13
star
25

MetaCarvel

MetaCarvel: A scaffolder for metagenomes
C++
13
star
26

MUMmer3

MUMmer3
C++
11
star
27

binnacle

Binnacle: Using Scaffolds to Improve the Contiguity and Quality of Metagenomic Bins
Python
10
star
28

HG002-issues

HG002 human reference genome issue tracking and polishing
10
star
29

harvest-tools

C++
8
star
30

ATLAS

outlier detection in BLAST hits
Python
3
star