🐍🟡♦️🟦 PyHMMER

Cython bindings and Python interface to HMMER3.

🗺️ Overview

HMMER is a biological sequence analysis tool that uses profile hidden Markov models to search for sequence homologs. HMMER3 is developed and maintained by the Eddy/Rivas Laboratory at Harvard University.

pyhmmer is a Python package, implemented using the Cython language, that provides bindings to HMMER3. It directly interacts with the HMMER internals, which has the following advantages over CLI wrappers (like hmmer-py):

  • single dependency: If your software or your analysis pipeline is distributed as a Python package, you can add pyhmmer as a dependency to your project, and stop worrying about the HMMER binaries being properly set up on the end user's machine.
  • no intermediate files: Everything happens in memory, in Python objects you control, making it easier to pass your inputs to HMMER without needing to write them to a temporary file. Output retrieval is also done in memory, via instances of the pyhmmer.plan7.TopHits class.
  • no input formatting: The Easel object model is exposed in the pyhmmer.easel module, and you can build a DigitalSequence object yourself to pass to the HMMER pipeline. This is useful if your sequences are already loaded in memory, for instance because you obtained them from another Python library (such as Pyrodigal or Biopython).
  • no output formatting: HMMER3 is notorious for its numerous output files and its fixed-width tabular output, which is hard to parse (even Bio.SearchIO.HmmerIO struggles on some sequences). With pyhmmer, results are read directly from Python objects instead.
  • efficient: Using pyhmmer to launch hmmsearch on sequences and HMMs stored on disk is typically as fast as directly using the hmmsearch binary (see the Benchmarks section). pyhmmer.hmmer.hmmsearch uses a different parallelisation strategy compared to the hmmsearch binary from HMMER, which can help get the most out of multiple CPUs when annotating smaller sequence databases.

This library is still a work-in-progress, and in an experimental stage, but it should already pack enough features to run biological analyses or workflows involving hmmsearch, hmmscan, nhmmer, phmmer, hmmbuild and hmmalign.

🔧 Installing

pyhmmer can be installed from PyPI, which hosts some pre-built CPython wheels for Linux and macOS on x86-64 and Arm64, as well as the code required to compile from source with Cython:

$ pip install pyhmmer

Compilation for UNIX PowerPC is not tested in CI, but should work out of the box. Note that non-UNIX operating systems (such as Windows) are not supported by HMMER.

A Bioconda package is also available:

$ conda install -c bioconda pyhmmer

🔖 Citation

PyHMMER is scientific software, with a published paper in Bioinformatics. Please cite both PyHMMER and HMMER if you are using it in academic work, for instance as:

PyHMMER (Larralde et al., 2023), a Python library binding to HMMER (Eddy, 2011).

Detailed references are available on the Publications page of the online documentation.

📖 Documentation

A complete API reference can be found in the online documentation, or directly from the command line using pydoc:

$ pydoc pyhmmer.easel
$ pydoc pyhmmer.plan7

💡 Example

Use pyhmmer to run hmmsearch to search for Type 2 PKS domains (t2pks.hmm) inside proteins extracted from the genome of Anaerococcus provencensis (938293.PRJEB85.HG003687.faa). This will produce an iterable over TopHits that can be used for further sorting/querying in Python. Processing happens in parallel using Python threads, and a TopHits object is yielded for every HMM passed in the input iterable.

import pyhmmer

# Load the target protein sequences into memory in digital (encoded) form.
with pyhmmer.easel.SequenceFile("pyhmmer/tests/data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seq_file:
    sequences = list(seq_file)

# Search every HMM in the file against the sequences; one TopHits is yielded per HMM.
with pyhmmer.plan7.HMMFile("pyhmmer/tests/data/hmms/txt/t2pks.hmm") as hmm_file:
    for hits in pyhmmer.hmmsearch(hmm_file, sequences, cpus=4):
        print(f"HMM {hits.query_name.decode()} found {len(hits)} hits in the target sequences")

Have a look at more in-depth examples such as building an HMM from an alignment, analysing the active site of a hit, or fetching marker genes from a genome in the Examples page of the online documentation.

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

🏗️ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

โฑ๏ธ Benchmarks

Benchmarks were run on an i7-10710U CPU running at 1.10 GHz with 6 physical / 12 logical cores, using a FASTA file containing 4,489 protein sequences extracted from the genome of Escherichia coli (562.PRJEB4685) and version 33.1 of the Pfam HMM library containing 18,259 domains. Commands were run 3 times on a warm SSD. Solid lines show the times for pressed HMMs, and dashed lines the times for HMMs in text format.

[Figure: Benchmarks]

Raw numbers can be found in the benches folder. They suggest that phmmer should be run with the number of logical cores, while hmmsearch should be run with the number of physical cores (or fewer). A possible explanation for this observation would be that HMMER's platform-specific code requires too many SIMD registers per thread to benefit from simultaneous multi-threading.
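
The recommendation above can be turned into a small helper for picking a thread count. This is only a sketch: suggest_cpus and its threads_per_core default are hypothetical names introduced for illustration, not part of pyhmmer, and since the Python standard library only reports logical cores, the physical core count is estimated by assuming two hardware threads per core.

```python
import os


def suggest_cpus(tool, logical_cores=None, threads_per_core=2):
    """Suggest a thread count following the benchmark observations.

    `threads_per_core` is an assumption (2 matches CPUs with two-way
    simultaneous multi-threading); adjust it for the actual hardware.
    """
    if logical_cores is None:
        logical_cores = os.cpu_count() or 1
    # Estimate physical cores from logical cores, never going below one.
    physical_cores = max(1, logical_cores // threads_per_core)
    # phmmer benefited from all logical cores in the benchmarks, while
    # hmmsearch did best at (or below) the physical core count.
    if tool == "phmmer":
        return logical_cores
    if tool == "hmmsearch":
        return physical_cores
    raise ValueError(f"no benchmark-based suggestion for {tool!r}")


# The benchmark machine had 6 physical / 12 logical cores.
print(suggest_cpus("phmmer", logical_cores=12))     # 12
print(suggest_cpus("hmmsearch", logical_cores=12))  # 6
```

The returned value could then be passed as the cpus argument of pyhmmer.hmmsearch, as in the Example section. These numbers come from a single machine, so it is worth re-checking them on other hardware.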

To read more about how PyHMMER achieves better parallelism than HMMER for many-to-many searches, have a look at the Performance page of the documentation.

🔍 See Also

Building an HMM from scratch? Then you may be interested in the pyfamsa package, providing bindings to FAMSA, a very fast multiple sequence aligner. In addition, you may want to trim alignments: in that case, consider pytrimal, which wraps trimAl 2.0.

If, despite all the advantages listed earlier, you would rather use HMMER through its CLI, this package will not be of great help. You can instead check the hmmer-py package developed by Danilo Horta at the EMBL-EBI.

⚖️ License

This library is provided under the MIT License. The HMMER3 and Easel code is available under the BSD 3-clause license. See vendor/hmmer/LICENSE and vendor/easel/LICENSE for more information.

This project is in no way affiliated with, sponsored by, or otherwise endorsed by the original HMMER authors. It was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.
