sensitive and precise assembly of short sequencing reads

PLASS - Protein-Level ASSembler

Plass (Protein-Level ASSembler) is software for assembling short-read sequencing data at the protein level. Its main purpose is the assembly of complex metagenomic datasets. It assembles 10 times more protein residues in soil metagenomes than Megahit. Plass is GPL-licensed open source software that is implemented in C++ and available for Linux and macOS. The software is designed to run on multiple cores. Plass was used to create a Soil Reference Catalog (SRC) and a Marine Eukaryotic Reference Catalog (MERC).

Steinegger M, Mirdita M and Soeding J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nature Methods, doi: 10.1038/s41592-019-0437-4 (2019).

Soil Reference Catalog (SRC) and Marine Eukaryotic Reference Catalog (MERC)

SRC was created by assembling 640 soil metagenome samples. MERC was assembled from the metatranscriptomics datasets created by the TARA ocean expedition. Both catalogues were redundancy-reduced to 90% sequence identity at 90% coverage. Each catalogue is a single FASTA file containing the sequences; the header identifiers contain the Sequence Read Archive (SRA) identifiers. The catalogues can be downloaded here. We also provide an HH-suite3 database called "BFD", containing sequences from Metaclust, SRC, MERC and UniProt, here.
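
Because the headers carry SRA identifiers, a catalogue can be summarized per sequencing run. The following is a minimal sketch, not part of Plass. It assumes that each header contains a run accession of the form SRR/ERR/DRR followed by digits, and the file name SRC.fasta in the usage comment is hypothetical.

```shell
# Count catalogue sequences per SRA run accession, most frequent first.
# Assumption: headers contain an accession such as SRR1234567
# (the exact header layout of the catalogues is not specified here).
count_by_sra_run() {
    grep '^>' "$1" \
        | grep -oE '[SED]RR[0-9]+' \
        | sort \
        | uniq -c \
        | sort -k1,1nr
}

# Usage (hypothetical file name):
#   count_by_sra_run SRC.fasta | head
```

A header that happens to contain several accessions is counted once for each of them.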

Install Plass

Plass can be installed via conda or as a statically compiled Linux version. Plass requires a 64-bit Linux/macOS system (check with uname -a | grep x86_64) with at least the SSE2 instruction set.

 # install from bioconda
 conda install -c conda-forge -c bioconda plass 
 # static build with AVX2 (fastest)
 wget https://mmseqs.com/plass/plass-linux-avx2.tar.gz; tar xvfz plass-linux-avx2.tar.gz; export PATH=$(pwd)/plass/bin/:$PATH
 # static build with SSE4.1
 wget https://mmseqs.com/plass/plass-linux-sse41.tar.gz; tar xvfz plass-linux-sse41.tar.gz; export PATH=$(pwd)/plass/bin/:$PATH
 # static build with SSE2 (slowest, for very old systems)
 wget https://mmseqs.com/plass/plass-linux-sse2.tar.gz; tar xvfz plass-linux-sse2.tar.gz; export PATH=$(pwd)/plass/bin/:$PATH
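
To decide which static build to download, the CPU flags can be inspected first. This is a sketch for Linux only, and the helper name pick_plass_build is made up for illustration. It assumes the flag names avx2, sse4_1 and sse2 as they appear in /proc/cpuinfo.

```shell
# Map a space-separated list of CPU flags to the fastest matching static build.
# Prints avx2, sse41 or sse2 (the suffixes used in the download URLs above).
pick_plass_build() {
    case " $1 " in
        *" avx2 "*)   echo "avx2" ;;
        *" sse4_1 "*) echo "sse41" ;;
        *" sse2 "*)   echo "sse2" ;;
        *)            echo "unsupported"; return 1 ;;
    esac
}

# Usage on Linux:
#   flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)
#   build=$(pick_plass_build "$flags")
#   wget "https://mmseqs.com/plass/plass-linux-${build}.tar.gz"
```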

How to assemble

Plass can assemble both paired-end reads (FASTQ) and single reads (FASTA or FASTQ):

  # assemble paired-end reads 
  plass assemble examples/reads_1.fastq.gz examples/reads_2.fastq.gz assembly.fas tmp

  # assemble single-end reads 
  plass assemble examples/reads_1.fastq.gz assembly.fas tmp

  # assemble single-end reads using stdin
  cat examples/reads_1.fastq.gz | plass assemble stdin assembly.fas tmp

Important parameters:

 --min-seq-id         Adjusts the overlap sequence identity threshold
 --min-length         Minimum codon length for ORF prediction (default: 40)
 -e                   E-value threshold for overlaps 
 --num-iterations     Number of iterations of assembly
 --filter-proteins    Switches the neural network protein filter off/on
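
Since --min-length is given in codons, a read of L nucleotides can hold at most L/3 complete codons. The sketch below derives a value from the read length. The helper names are hypothetical, the result is only an upper bound (a lower value may work better in practice), and placing the option after the positional arguments is an assumption.

```shell
# Largest number of complete codons that fit in a read of the given length.
max_codons_for_read() {
    echo $(( $1 / 3 ))
}

# Keep the default of 40 unless the reads are too short to contain 40 codons.
suggest_min_length() {
    max=$(max_codons_for_read "$1")
    if [ "$max" -lt 40 ]; then
        echo "$max"
    else
        echo 40
    fi
}

# Usage for 100 nt reads:
#   plass assemble examples/reads_1.fastq.gz assembly.fas tmp \
#       --min-length "$(suggest_min_length 100)"
```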

Modules:

  plass assemble      Assembles proteins (i:Nucleotides -> o:Proteins)
  plass nuclassemble  Assembles nucleotides *experimental* (i:Nucleotides -> o:Nucleotides)

Assemble using MPI

Plass can be distributed over several homogeneous computers. However, the TMP folder has to be shared between all nodes (e.g. via NFS). The following command runs the assembly across several nodes:

RUNNER="mpirun -np 42" plass assemble examples/reads_1.fastq.gz examples/reads_2.fastq.gz assembly.fas tmp

Compile from source

Compiling PLASS from source has the advantage that it will be optimized for the specific system, which should improve its performance. To compile PLASS, git, g++ (4.6 or higher) and cmake (3.0 or higher) are required. After compilation, the PLASS binary will be located in the build/bin directory.

  git clone https://github.com/soedinglab/plass.git
  cd plass
  git submodule update --init
  mkdir build && cd build
  cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..
  make -j 4 && make install
  export PATH="$(pwd)/bin/:$PATH"

❗ If you want to compile PLASS on macOS, please install and use gcc from Homebrew. The default macOS clang compiler does not support OpenMP, so a PLASS binary built with it will not be able to run multithreaded. Use the following cmake call:

  CXX="$(brew --prefix)/bin/g++-8" cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..

Dependencies

When compiling from source, PLASS requires zlib and bzip2.
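
Before starting a build, it can help to confirm that the required command-line tools are on the PATH. This is a small sketch with a hypothetical helper name. It only checks executables, not tool versions or the presence of the compression library headers.

```shell
# Print every tool from the argument list that is not found on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   missing_tools git g++ cmake make || echo "install the tools listed above first"
```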

Use the docker image

We also provide a Docker image of Plass. You can mount the current directory containing the reads to be assembled and run plass with the following command:

  docker pull soedinglab/plass
  docker run -ti --rm -v "$(pwd):/app" -w /app soedinglab/plass assemble reads_1.fastq reads_2.fastq assembly.fas tmp

Hardware requirements

Plass needs roughly 1 byte of memory per residue to work efficiently. Plass will scale its memory consumption based on the available main memory of the machine. Plass needs a CPU with at least the SSE4.1 instruction set to run.

Known problems

  • The assembly of Plass includes all ORFs that have a start and an end codon, including very short ORFs (< 60 amino acids). Many of these short ORFs are spurious, since our neural network cannot distinguish them well. We recommend using another method to verify their coding potential. Assemblies above 100 amino acids are mostly genuine protein sequences.
  • By default, Plass searches for ORFs of 40 amino acids or longer. This limits the read length to > 120 nucleotides. To assemble shorter reads, you need to lower the --min-length threshold. Be aware that using short reads (< 100 nucleotides) might result in lower sensitivity.
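
One simple way to act on the first point is to split off the longer assemblies after the run. The following is a minimal sketch using awk, not a Plass feature. The helper name and the output file name are hypothetical, and the output is written with each sequence on a single line.

```shell
# Keep only FASTA records whose sequence has at least MIN residues.
# Multi-line sequences are joined before their length is measured.
filter_fasta_min_len() {
    awk -v min="$1" '
        function emit() {
            if (hdr != "" && length(seq) >= min) print hdr "\n" seq
        }
        /^>/ { emit(); hdr = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    ' "$2"
}

# Usage: keep assemblies of 100 amino acids or more
#   filter_fasta_min_len 100 assembly.fas > assembly.min100.fas
```

A length cutoff does not verify coding potential; it only separates the assemblies that are described above as mostly genuine from the shorter ones that need further checks.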
