Cython bindings and Python interface to Prodigal, an ORF finder for genomes and metagenomes. Now with SIMD!

🔥 Pyrodigal

🗺️ Overview

Pyrodigal is a Python module that provides bindings to Prodigal using Cython. It directly interacts with the Prodigal internals, which has the following advantages:

  • single dependency: Pyrodigal is distributed as a Python package, so you can add it as a dependency to your project, and stop worrying about the Prodigal binary being present on the end-user machine.
  • no intermediate files: Everything happens in memory, in a Python object you fully control, so you don't have to invoke the Prodigal CLI using a sub-process and temporary files. Sequences can be passed directly as strings or bytes, which avoids the overhead of formatting your input to FASTA for Prodigal.
  • better memory usage: Pyrodigal uses more compact data structures than the original Prodigal implementation, so it needs less memory to store the same information. A heuristic estimates the number of nodes to allocate from the sequence GC% in order to minimize reallocations.
  • better performance: Pyrodigal uses SIMD instructions to compute which dynamic programming nodes can be ignored when scoring connections. This can save from a third to half the runtime depending on the sequence. The Benchmarks page of the documentation contains comprehensive comparisons. See the JOSS paper for details about how this is achieved.
  • same results: Pyrodigal is tested to make sure it produces exactly the same results as Prodigal v2.6.3+31b300a. This was verified extensively by Julian Hahnfeld and can be checked with his comparison repository.

📋 Features

The library now features everything from the original Prodigal CLI:

  • run mode selection: Choose between single mode, using a training sequence to count nucleotide hexamers, or metagenomic mode, using pre-trained data from different organisms (prodigal -p).
  • region masking: Prevent genes from being predicted across regions containing unknown nucleotides (prodigal -m).
  • closed ends: Genes will be identified as running over edges if they are larger than a certain size, but this can be disabled (prodigal -c).
  • training configuration: During the training process, a custom translation table can be given (prodigal -g), and the Shine-Dalgarno motif search can be forcefully bypassed (prodigal -n).
  • output files: Output files can be written in a format mostly compatible with the Prodigal binary, including the protein translations in FASTA format (prodigal -a), the gene sequences in FASTA format (prodigal -d), or the potential gene scores in tabular format (prodigal -s).
  • training data persistence: Getting training data from a sequence and using it for other sequences is supported; in addition, a training data file can be saved and loaded transparently (prodigal -t).
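
As a rough illustration of how these flags line up with the Python API, the sketch below parses a few Prodigal-style options with `argparse` and turns them into keyword arguments. The helper `parse_prodigal_flags` is hypothetical and not part of Pyrodigal. The keyword names (`meta`, `mask` and `closed` for `GeneFinder`, `translation_table` and `force_nonsd` for `train`) are assumed to match the installed version and should be checked against the API reference.

```python
import argparse


def parse_prodigal_flags(argv):
    """Map a subset of Prodigal CLI flags to keyword arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="prodigal", add_help=False)
    parser.add_argument("-p", dest="procedure", choices=("single", "meta"), default="single")
    parser.add_argument("-m", dest="mask", action="store_true")          # region masking
    parser.add_argument("-c", dest="closed", action="store_true")        # closed ends
    parser.add_argument("-g", dest="translation_table", type=int, default=11)
    parser.add_argument("-n", dest="force_nonsd", action="store_true")   # bypass the SD motif search
    args = parser.parse_args(argv)

    # Assumed keyword arguments of the GeneFinder constructor.
    finder_kwargs = {
        "meta": args.procedure == "meta",
        "mask": args.mask,
        "closed": args.closed,
    }
    # Assumed keyword arguments of the train method (only relevant in single mode).
    train_kwargs = {
        "translation_table": args.translation_table,
        "force_nonsd": args.force_nonsd,
    }
    return finder_kwargs, train_kwargs


finder_kwargs, train_kwargs = parse_prodigal_flags(["-p", "meta", "-m"])
print(finder_kwargs)
print(train_kwargs)
```

Assuming Pyrodigal is installed, the result could then be passed on as `pyrodigal.GeneFinder(**finder_kwargs)`, followed in single mode by `gene_finder.train(sequence, **train_kwargs)`.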

In addition, the following new features are available:

  • custom gene size threshold: While Prodigal uses a minimum gene size of 90 nucleotides (60 if on edge), Pyrodigal lets you customize this threshold, so that smaller ORFs can be identified if needed.
  • custom metagenomic models: Since v3.0.0, you can use your own metagenomic models to run Pyrodigal in meta-mode. Check for instance pyrodigal-gv, which provides additional models for giant viruses and gut phages.
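
A minimal sketch of how a lower gene size threshold might be configured is given below. The helper `small_orf_kwargs` is hypothetical. The keyword names `min_gene`, `min_edge_gene` and `max_overlap`, the `max_overlap` default of 60, and the rule that the overlap may not exceed the minimum gene size are assumptions about the `GeneFinder` constructor that should be verified against the API reference.

```python
# Defaults quoted in the text above: 90 nucleotides, or 60 for genes on a sequence edge.
PRODIGAL_MIN_GENE = 90
PRODIGAL_MIN_EDGE_GENE = 60


def small_orf_kwargs(min_gene=PRODIGAL_MIN_GENE,
                     min_edge_gene=PRODIGAL_MIN_EDGE_GENE,
                     max_overlap=60):
    """Build keyword arguments for a GeneFinder accepting shorter genes (hypothetical helper)."""
    if min_gene <= 0 or min_edge_gene <= 0:
        raise ValueError("gene size thresholds must be positive")
    # Assumption: the allowed overlap between two genes may not exceed
    # the minimum gene size, so it is clamped here.
    return {
        "min_gene": min_gene,
        "min_edge_gene": min_edge_gene,
        "max_overlap": min(max_overlap, min_gene),
    }


print(small_orf_kwargs(min_gene=30, min_edge_gene=30))
```

With Pyrodigal installed, these values could be passed as `pyrodigal.GeneFinder(meta=True, **small_orf_kwargs(min_gene=30, min_edge_gene=30))`.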

🐏 Memory

Pyrodigal makes several changes compared to the original Prodigal binary regarding memory management:

  • Sequences are stored as raw bytes instead of compressed bitmaps. This means that the sequence itself takes 3/8th more space, but since the memory used for storing the sequence is often negligible compared to the memory used to store dynamic programming nodes, this is an acceptable trade-off for better performance when extracting said nodes.
  • Node fields use smaller data types to fit into 128 bytes, compared to the 176 bytes of the original Prodigal data structure.
  • Node arrays are pre-allocated based on the sequence GC% to extrapolate the probability to find a start or stop codon.
  • Genes are stored in a more compact data structure than in Prodigal (which reserves a buffer to store string data), saving around 1KiB per gene.
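
To give a sense of scale, the figures above can be turned into a back-of-the-envelope estimate. The node and gene counts in this sketch are hypothetical inputs chosen for illustration, not measurements; only the 176-byte and 128-byte node sizes and the figure of roughly 1 KiB per gene come from the list above.

```python
PRODIGAL_NODE_BYTES = 176   # node size in the original Prodigal, per the text
PYRODIGAL_NODE_BYTES = 128  # node size in Pyrodigal, per the text
GENE_SAVING_BYTES = 1024    # "around 1KiB per gene", treated here as exactly 1 KiB


def estimated_savings(n_nodes, n_genes):
    """Return the estimated (node, gene) memory savings in bytes."""
    node_saving = n_nodes * (PRODIGAL_NODE_BYTES - PYRODIGAL_NODE_BYTES)
    gene_saving = n_genes * GENE_SAVING_BYTES
    return node_saving, gene_saving


# Hypothetical workload: 100,000 dynamic programming nodes and 4,000 predicted genes.
node_saving, gene_saving = estimated_savings(n_nodes=100_000, n_genes=4_000)
print(f"nodes: {node_saving / 2**20:.2f} MiB saved")
print(f"genes: {gene_saving / 2**20:.2f} MiB saved")
```

Each node shrinks by 48 bytes, a little over a quarter of its original size, so the node array accounts for most of the saving when a sequence produces many nodes.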

🧢 Thread-safety

pyrodigal.GeneFinder instances are thread-safe. In addition, the find_genes method is re-entrant. This means you can train a GeneFinder instance once, and then use a pool to process sequences in parallel:

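import multiprocessing.pool
import pyrodigal

# `training_sequence` and `sequences` are assumed to be defined by the caller
gene_finder = pyrodigal.GeneFinder()
gene_finder.train(training_sequence)

# find_genes is re-entrant, so the trained instance can be shared across threads
with multiprocessing.pool.ThreadPool() as pool:
    predictions = pool.map(gene_finder.find_genes, sequences)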

🔧 Installing

Pyrodigal can be installed directly from PyPI, which hosts some pre-built wheels for the x86-64 architecture (Linux/MacOS/Windows) and the Aarch64 architecture (Linux/MacOS), as well as the code required to compile from source with Cython:

$ pip install pyrodigal

Otherwise, Pyrodigal is also available as a Bioconda package:

$ conda install -c bioconda pyrodigal

Check the install page of the documentation for other ways to install Pyrodigal on your machine.
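
After installing, one way to confirm that the package is visible to the current interpreter is to query its metadata with the standard library. This is a small sketch; `installed_version` is a hypothetical helper name, and it only reports what the packaging metadata says, without importing the compiled extension.

```python
from importlib import metadata


def installed_version(package="pyrodigal"):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("pyrodigal does not appear to be installed")
else:
    print(f"pyrodigal {version} is installed")
```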

💡 Example

Let's load a sequence from a GenBank file, use a GeneFinder to find all the genes it contains, and print the proteins in two-line FASTA format.

🔬 Biopython

To use the GeneFinder in single mode (corresponding to prodigal -p single, the default operation mode of Prodigal), you must explicitly call the train method with the sequence you want to use for training before trying to find genes, or you will get a RuntimeError:

import Bio.SeqIO
import pyrodigal

record = Bio.SeqIO.read("sequence.gbk", "genbank")

orf_finder = pyrodigal.GeneFinder()
orf_finder.train(bytes(record.seq))
genes = orf_finder.find_genes(bytes(record.seq))

However, in meta mode (corresponding to prodigal -p meta), you can find genes directly:

import Bio.SeqIO
import pyrodigal

record = Bio.SeqIO.read("sequence.gbk", "genbank")

orf_finder = pyrodigal.GeneFinder(meta=True)
for i, pred in enumerate(orf_finder.find_genes(bytes(record.seq))):
    print(f">{record.id}_{i+1}")
    print(pred.translate())

On older versions of Biopython (before 1.79) you will need to use record.seq.encode() instead of bytes(record.seq).
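
Beyond printing translations by hand, the predictions can also be serialized in memory. The following is a minimal sketch, assuming the object returned by find_genes exposes `write_gff` and `write_translations` methods that take a text file handle and a sequence identifier, as the output-file feature above suggests. `export_predictions` is a hypothetical helper name, and the method signatures should be checked against the API reference for your version.

```python
import io


def export_predictions(genes, sequence_id):
    """Render predictions as GFF and protein FASTA text without touching the disk.

    `genes` is expected to be the object returned by `GeneFinder.find_genes`
    (assumed to provide `write_gff` and `write_translations`).
    """
    gff_buffer = io.StringIO()
    faa_buffer = io.StringIO()
    genes.write_gff(gff_buffer, sequence_id)           # gene coordinates
    genes.write_translations(faa_buffer, sequence_id)  # protein translations
    return gff_buffer.getvalue(), faa_buffer.getvalue()
```

With the Biopython example above, this could be called as `gff, faa = export_predictions(orf_finder.find_genes(bytes(record.seq)), record.id)`.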

🧪 Scikit-bio

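import skbio.io
import pyrodigal

seq = next(skbio.io.read("sequence.gbk", "genbank"))
# scikit-bio keeps GenBank header fields in `metadata`; fall back to a placeholder if LOCUS is absent
seq_id = seq.metadata.get("LOCUS", {}).get("locus_name", "sequence")

orf_finder = pyrodigal.GeneFinder(meta=True)
for i, pred in enumerate(orf_finder.find_genes(seq.values.view('B'))):
    print(f">{seq_id}_{i+1}")
    print(pred.translate())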

We need to use the view method to get the sequence viewable by Cython as an array of unsigned char.

🔖 Citation

Pyrodigal is scientific software, with a published paper in the Journal of Open Source Software. Please cite both Pyrodigal and Prodigal if you use it in academic work, for instance as:

Pyrodigal (Larralde, 2022), a Python library binding to Prodigal (Hyatt et al., 2010).

Detailed references are available on the Publications page of the online documentation.

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

🏗️ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the GNU General Public License v3.0. The Prodigal code was written by Doug Hyatt and is distributed under the terms of the GPLv3 as well. See vendor/Prodigal/LICENSE for more information.

This project is in no way affiliated with, sponsored by, or otherwise endorsed by the original Prodigal authors. It was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.
