
MMseqs2: ultra fast and sensitive sequence search and clustering suite

MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets. MMseqs2 is open source GPL-licensed software implemented in C++ for Linux, macOS, and (as a beta version, via Cygwin) Windows. The software is designed to run on multiple cores and servers and exhibits very good scalability. MMseqs2 can run 10000 times faster than BLAST. At 100 times the speed of BLAST, it achieves almost the same sensitivity. It can perform profile searches with the same sensitivity as PSI-BLAST at over 400 times its speed.

Publications

Steinegger M and Soeding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi: 10.1038/nbt.3988 (2017).

Steinegger M and Soeding J. Clustering huge protein sequence sets in linear time. Nature Communications, doi: 10.1038/s41467-018-04964-5 (2018).

Mirdita M, Steinegger M and Soeding J. MMseqs2 desktop and local web server app for fast, interactive sequence searches. Bioinformatics, doi: 10.1093/bioinformatics/bty1057 (2019).

Mirdita M, Steinegger M, Breitwieser F, Soeding J and Levy Karin E. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics, doi: 10.1093/bioinformatics/btab184 (2021).


Documentation

The MMseqs2 user guide is available in our GitHub Wiki or as a PDF file (thanks to pandoc!). The wiki also contains tutorials that show how to use MMseqs2 with real data. For questions, please open an issue on GitHub or ask in our chat. Stay up to date with MMseqs2/Linclust updates by following Martin on Twitter.

Installation

MMseqs2 can be installed by compiling from source, by downloading a statically compiled binary, or through Homebrew, conda or Docker.

# install via Homebrew
brew install mmseqs2
# install via conda
conda install -c conda-forge -c bioconda mmseqs2
# pull the Docker image
docker pull ghcr.io/soedinglab/mmseqs2
# static build with AVX2 (fastest)
wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz; tar xvfz mmseqs-linux-avx2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH
# static build with SSE4.1
wget https://mmseqs.com/latest/mmseqs-linux-sse41.tar.gz; tar xvfz mmseqs-linux-sse41.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH
# static build with SSE2 (slowest, for very old systems)
wget https://mmseqs.com/latest/mmseqs-linux-sse2.tar.gz; tar xvfz mmseqs-linux-sse2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH

MMseqs2 requires an AMD or Intel 64-bit system (check with uname -a | grep x86_64). We recommend using a system with at least the SSE4.1 instruction set (check by executing cat /proc/cpuinfo | grep sse4_1 on Linux or sysctl -a | grep machdep.cpu.features | grep SSE4.1 on macOS). The AVX2 version is faster than the SSE4.1 version (check if AVX2 is supported by executing cat /proc/cpuinfo | grep avx2 on Linux or sysctl -a | grep machdep.cpu.leaf7_features | grep AVX2 on macOS). An SSE2 version is also available for very old systems.
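The checks above can be combined into a small script that suggests which static build to download. This is only a sketch: pick_mmseqs_build is a hypothetical helper, not part of MMseqs2, and the /proc/cpuinfo lookup assumes an x86_64 Linux system. The download URL follows the pattern of the static builds listed above.

```shell
# Map a space-separated list of CPU flags to the fastest matching build name.
# pick_mmseqs_build is a hypothetical helper for illustration only.
pick_mmseqs_build() {
    case " $1 " in
        *" avx2 "*)   echo "avx2" ;;
        *" sse4_1 "*) echo "sse41" ;;
        *)            echo "sse2" ;;
    esac
}

# On x86_64 Linux, read the flags of the first CPU core and print a suggestion.
if [ "$(uname -m)" = "x86_64" ] && [ -r /proc/cpuinfo ]; then
    cpu_flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)
    build=$(pick_mmseqs_build "$cpu_flags")
    echo "Suggested download: https://mmseqs.com/latest/mmseqs-linux-${build}.tar.gz"
fi
```

The helper takes the flag list as an argument, so it can be tried with any flag string without touching the machine it runs on.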

MMseqs2 also works on ARM64 systems and on PPC64LE systems with POWER8 ISA or newer.

We provide static binaries for all supported platforms at mmseqs.com/latest.

MMseqs2 comes with bash auto-completion for commands and parameters, which can be activated by adding the following lines to your $HOME/.bash_profile:

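# replace "/Path to MMseqs2" with your MMseqs2 directory; the quotes keep paths with spaces intact
if [ -f "/Path to MMseqs2/util/bash-completion.sh" ]; then
    source "/Path to MMseqs2/util/bash-completion.sh"
fi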

Getting started

We provide easy workflows to cluster, search and assign taxonomy. These easy workflows are a shorthand to deal directly with FASTA/FASTQ files as input and output. MMseqs2 provides many modules to transform, filter, execute external programs and search. However, these modules use the MMseqs2 database formats, instead of the FASTA/FASTQ format. For maximum flexibility, we recommend using MMseqs2 workflows and modules directly. Please read more about this in the documentation.

Cluster

For clustering, MMseqs2 easy-cluster and easy-linclust are available.

easy-cluster by default clusters the entries of a FASTA/FASTQ file using a cascaded clustering algorithm.

mmseqs easy-cluster examples/DB.fasta clusterRes tmp --min-seq-id 0.5 -c 0.8 --cov-mode 1

easy-linclust clusters the entries of a FASTA/FASTQ file. The runtime scales linearly with input size. This mode is recommended for huge datasets.

mmseqs easy-linclust examples/DB.fasta clusterRes tmp

Read more about the clustering format in our user guide.
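As a rough illustration of working with the clustering result, the sketch below counts the members of each cluster. It assumes that easy-cluster writes a two-column, tab-separated file (representative, then member) next to the other clusterRes outputs; check the user guide for the exact file name and layout. The sample file and the cluster_sizes helper are hypothetical.

```shell
# Hypothetical sample in the assumed layout: representative <TAB> member.
printf 'seqA\tseqA\nseqA\tseqB\nseqA\tseqC\nseqD\tseqD\n' > example_cluster.tsv

# Count the members of each cluster and list the largest cluster first.
cluster_sizes() {
    awk -F '\t' '{ n[$1]++ } END { for (r in n) print r "\t" n[r] }' "$1" |
        sort -t "$(printf '\t')" -k2,2nr -k1,1
}

cluster_sizes example_cluster.tsv
```

For the sample above, the first output line is the representative seqA with three members.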

Please adjust the clustering criteria and check that the temporary directory provides enough free space. For disk space requirements, see the user guide.

Search

The easy-search workflow searches directly with a FASTA/FASTQ file against either another FASTA/FASTQ file or an already existing MMseqs2 database.

mmseqs easy-search examples/QUERY.fasta examples/DB.fasta alnRes.m8 tmp

It is also possible to pre-compute the index for the target database. This reduces overhead when searching repeatedly against the same database.

mmseqs createdb examples/DB.fasta targetDB
mmseqs createindex targetDB tmp
mmseqs easy-search examples/QUERY.fasta targetDB alnRes.m8 tmp

The databases workflow provides download and setup procedures for many public reference databases, such as UniRef, NR, NT, Pfam and many more (see Downloading databases). For example, to download and search against a database containing the Swiss-Prot reference proteins, run:

mmseqs databases UniProtKB/Swiss-Prot swissprot tmp
mmseqs easy-search examples/QUERY.fasta swissprot alnRes.m8 tmp

The speed and sensitivity of the search can be adjusted with the -s parameter and should be adapted to your use case (see setting sensitivity -s parameter). A very fast search would use a sensitivity of -s 1.0, while a very sensitive search would use a sensitivity of up to -s 7.0. A detailed guide on how to speed up searches is also available.

The output can be customized with the --format-output option. For example, --format-output "query,target,qaln,taln" returns the query and target accessions and the pairwise alignments in tab-separated format. Many different output columns are available.

❗ By default, easy-search computes the sequence identity by dividing the number of identical residues by the alignment length (numIdentical/alnLen). However, search only estimates the identity by default. To output the real sequence identity, use --alignment-mode 3 or -a.
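The tab-separated result file can be post-processed with standard tools. The sketch below keeps only hits at or above a minimum sequence identity. It assumes the default column order (query, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits), with the identity given as a fraction in the third column; adjust the column index if you use --format-output. The sample hits and the filter_hits helper are hypothetical.

```shell
# Hypothetical hits in the assumed default column order:
# query target fident alnlen mismatch gapopen qstart qend tstart tend evalue bits
printf 'q1\tt1\t0.950\t100\t5\t0\t1\t100\t1\t100\t1.0E-50\t180\n' >  example_alnRes.m8
printf 'q1\tt2\t0.311\t90\t62\t0\t5\t94\t3\t92\t2.0E-05\t45\n'    >> example_alnRes.m8
printf 'q2\tt3\t0.620\t50\t19\t0\t1\t50\t10\t59\t3.0E-12\t70\n'   >> example_alnRes.m8

# Print query, target and identity for hits with identity >= the given minimum.
filter_hits() {
    awk -F '\t' -v min="$2" '$3 + 0 >= min + 0 { print $1 "\t" $2 "\t" $3 }' "$1"
}

filter_hits example_alnRes.m8 0.5
```

In the sample, the identity of the first hit corresponds to 95 identical residues over an alignment length of 100, which matches the numIdentical/alnLen definition above.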

Taxonomy

The easy-taxonomy workflow can be used to assign taxonomic labels to sequences. It performs a search against a sequence database with taxonomy information (seqTaxDb), chooses the most representative sets of aligned target sequences using the strategy selected with --lca-mode, and computes the lowest common ancestor among those.

mmseqs createdb examples/DB.fasta targetDB
mmseqs createtaxdb targetDB tmp
mmseqs createindex targetDB tmp
mmseqs easy-taxonomy examples/QUERY.fasta targetDB alnRes tmp

By default, createtaxdb maps the UniProt accession of every sequence to a taxonomic identifier and downloads the NCBI taxonomy. We also support BLAST, SILVA or custom taxonomic databases. Many common taxonomic reference databases can be easily downloaded and set up with the databases workflow.

Read more about the taxonomy format and the classification in our user guide.

Supported search modes

MMseqs2 provides many additional search modes, and many of them can be combined. You can, for example, do a translated nucleotide against protein profile search.

Memory requirements

The minimum memory requirement of MMseqs2 for cluster or linclust is 1 byte per sequence residue; search needs 1 byte per target residue. Sequence databases can be compressed using the --compress flag: DNA sequences can be reduced by a factor of ~3.5 and proteins by ~1.7.
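The 1-byte-per-residue rule gives a quick lower bound that can be computed from a FASTA file before starting a run. The sketch below counts residues by summing the lengths of all non-header lines. The sample file and the count_residues helper are hypothetical, and the result is only the stated minimum, not a prediction of actual memory use.

```shell
# Hypothetical two-sequence FASTA file used only for illustration.
printf '>seq1\nMKTAYIAKQR\nQISFVKSHFS\n>seq2\nMSDNE\n' > example.fasta

# Count residues: sum the lengths of all lines that are not FASTA headers.
count_residues() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); total += length($0) } END { print total + 0 }' "$1"
}

residues=$(count_residues example.fasta)
echo "Residues: ${residues}; minimum memory at 1 byte per residue: ${residues} bytes"
```

For the sample file this reports 25 residues, that is, a minimum of 25 bytes.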

MMseqs2 checks the available system memory and automatically divides the target database into parts that fit into memory. Splitting the database will increase the runtime slightly. It is possible to control the memory usage using --split-memory-limit.

How to run MMseqs2 on multiple servers using MPI

MMseqs2 can run on multiple cores and servers using OpenMP and Message Passing Interface (MPI). MPI assigns database splits to each compute node, which are then computed with multiple cores (OpenMP).

Make sure that MMseqs2 was compiled with MPI by using the -DHAVE_MPI=1 flag (cmake -DHAVE_MPI=1 -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=. ..). Our precompiled static version of MMseqs2 cannot use MPI. The version string of MMseqs2 will have a -MPI suffix if it was built successfully with MPI support.

To search with multiple servers, call the search or cluster workflow with the MPI command exported in the RUNNER environment variable. The databases and temporary folder have to be shared between all nodes (e.g. through NFS):

RUNNER="mpirun -pernode -np 42" mmseqs search queryDB targetDB resultDB tmp
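Before launching an MPI run, it can be worth checking the version string for the -MPI suffix mentioned above. The sketch below assumes that mmseqs version prints the version string; has_mpi_suffix is a hypothetical helper, and the check is skipped when no mmseqs binary is on the PATH.

```shell
# has_mpi_suffix is a hypothetical helper: it succeeds if the string ends in -MPI.
has_mpi_suffix() {
    case "$1" in
        *-MPI) return 0 ;;
        *)     return 1 ;;
    esac
}

# Only query the binary if it is installed (assumes "mmseqs version" prints the version string).
if command -v mmseqs >/dev/null 2>&1; then
    if has_mpi_suffix "$(mmseqs version)"; then
        echo "MPI support detected"
    else
        echo "No MPI support detected; rebuild with -DHAVE_MPI=1"
    fi
fi
```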

Contributors

MMseqs2 exists thanks to all the people who contribute.
