NCBI Prokaryotic Genome Annotation Pipeline

PGAP

The NCBI Prokaryotic Genome Annotation Pipeline is designed to annotate bacterial and archaeal genomes (chromosomes and plasmids).

Genome annotation is a multi-level process that includes prediction of protein-coding genes, as well as other functional genome units such as structural RNAs, tRNAs, small RNAs and pseudogenes.

NCBI has developed an automatic prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology-based methods. The first version of the NCBI Prokaryotic Genome Annotation Pipeline was developed in 2001, and the pipeline is regularly upgraded to improve structural and functional annotation quality (Li W, O'Neill KR, et al. 2021). Recent improvements include the use of curated protein profile hidden Markov models (HMMs) and curated complex domain architectures for functional annotation of proteins, as well as annotation of Enzyme Commission numbers and Gene Ontology terms. After annotation, the completeness of the annotated gene set is estimated with CheckM.

The workflow provided here also offers the option to confirm or correct the organism associated with the genome assembly prior to starting the annotation, using the Average Nucleotide Identity tool.

Get started by watching this webinar!

Need to assemble the genome too? Use RAPT to produce an annotated genome starting from short reads.

Instructions

To run the PGAP pipeline you will need Linux, or some compatible container technology, CWL (Common Workflow Language), and about 30 GB of supplemental data. The instructions provided here cover running under the CWL reference implementation, cwltool. Full instructions for installing, running, and interpreting the results can be found in our wiki.
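Before installing, it can help to check whether a machine appears to meet the requirements listed above. The following is a minimal sketch using only the Python standard library; it is not part of PGAP. The function name `check_prerequisites`, the list of container runtime names it looks for, and the choice of directory to measure free space in are all assumptions made for illustration. The wiki remains the authoritative source for the actual requirements.

```python
import platform
import shutil

# "about 30 GB of supplemental data", as stated in the instructions above
REQUIRED_FREE_GB = 30

# Assumption: these executable names are illustrative examples of container
# runtimes; they are not an official list of what PGAP supports.
DEFAULT_CONTAINER_TOOLS = ("docker", "podman", "singularity", "apptainer")


def check_prerequisites(data_dir=".", container_tools=DEFAULT_CONTAINER_TOOLS):
    """Report whether each stated prerequisite appears to be met on this host."""
    # Free space on the filesystem that holds data_dir, in decimal gigabytes
    free_gb = shutil.disk_usage(data_dir).free / 1e9
    # First container runtime found on PATH, or None if none is found
    runtime = next((tool for tool in container_tools if shutil.which(tool)), None)
    return {
        "linux": platform.system() == "Linux",
        "container_runtime": runtime,
        "cwltool": shutil.which("cwltool"),  # path to cwltool, or None
        "free_gb": round(free_gb, 1),
        "enough_space": free_gb >= REQUIRED_FREE_GB,
    }


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print(f"{name}: {value}")
```

Pass the directory where the supplemental data will be stored as `data_dir`, so that free space is measured on the correct filesystem. A value of `None` for `cwltool` or `container_runtime` only means that the executable was not found on `PATH`.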

References

NCBI

Expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation.
Li W, O'Neill KR, Haft DH, DiCuccio M, Chetvernin V, Badretdin A, Coulouris G, Chitsaz F, Derbyshire MK, Durkin AS, Gonzales NR, Gwadz M, Lanczycki CJ, Song JS, Thanki N, Wang J, Yamashita RA, Yang M, Zheng C, Marchler-Bauer A, Thibaud-Nissen F.
Nucleic Acids Res. 2021 Jan 8;49(D1):D1020-D1028.

RefSeq: an update on prokaryotic genome annotation and curation.
Haft DH, DiCuccio M, Badretdin A, Brover V, Chetvernin V, O'Neill K, Li W, Chitsaz F, Derbyshire MK, Gonzales NR, Gwadz M, Lu F, Marchler GH, Song JS, Thanki N, Yamashita RA, Zheng C, Thibaud-Nissen F, Geer LY, Marchler-Bauer A, Pruitt KD.
Nucleic Acids Res. 2018 Jan 4;46(D1):D851-D860.

NCBI prokaryotic genome annotation pipeline.
Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Nawrocki EP, Zaslavsky L, Lomsadze A, Pruitt KD, Borodovsky M, Ostell J.
Nucleic Acids Res. 2016 Aug 19;44(14):6614-24. Epub 2016 Jun 24.

Using average nucleotide identity to improve taxonomic assignments in prokaryotic genomes at the NCBI.
Ciufo S, Kannan S, Sharma S, Badretdin A, Clark K, Turner S, Brover S, Schoch CL, Kimchi A, DiCuccio M.
Int J Syst Evol Microbiol. 2018 Jul;68(7):2386-2392.

GeneMarkS-2+

Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes
Lomsadze A, Gemayel K, Tang S, Borodovsky M.
Genome Research. 2018; 28(7):1079-1089.

CheckM

CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes
Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW.
Genome Research. 2015; 25(7):1043-1055.

TIGRFAMs

TIGRFAMs: a protein family resource for the functional identification of proteins.
Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O.
Nucleic Acids Res. 2001 Jan 1;29(1):41-3.

The TIGRFAMs database of protein families.
Haft DH, Selengut JD, White O.
Nucleic Acids Res. 2003 Jan 1;31(1):371-3.

TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes.
Selengut JD, Haft DH, Davidsen T, Ganapathy A, Gwinn-Giglio M, Nelson WC, Richter AR, White O.
Nucleic Acids Res. 2007 Jan;35(Database issue):D260-4. Epub 2006 Dec 6.

TIGRFAMs and Genome Properties in 2013.
Haft DH, Selengut JD, Richter RA, Harkins D, Basu MK, Beck E.
Nucleic Acids Res. 2013 Jan;41(Database issue):D387-95. doi: 10.1093/nar/gks1234. Epub 2012 Nov 28.

LICENSING TERMS

NCBI PGAP CWL

The NCBI PGAP CWL and other code authored by NCBI is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties as United States Government employees and thus cannot be copyrighted. This software is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction.

Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose.

Please cite NCBI in any work or product based on this material.

Third-party tools

The Docker image contains third-party tools distributed under the licensing terms of the respective license holders.

GeneMarkS-2+

GeneMarkS-2+ is distributed as part of PGAP with limited rights of use and redistribution from the Georgia Tech Research Corporation. See the full text of the license.

CheckM

GNU General Public License v3.0

Permissions of this strong copyleft license are conditioned on making available complete source code of licensed works and modifications, which include larger works using a licensed work, under the same license. Copyright and license notices must be preserved. Contributors provide an express grant of patent rights. See the full text of the license.

TIGRFAMs

The original TIGRFAMs database was a research project of the J. Craig Venter Institute (JCVI). TIGRFAMs, short for The Institute for Genomic Research's database of protein families, is a collection of manually curated protein families focusing primarily on prokaryotic sequences. It consists of hidden Markov models (HMMs), multiple sequence alignments, Gene Ontology (GO) terminology, Enzyme Commission (EC) numbers, gene symbols, protein family names, descriptive text, cross-references to related models in TIGRFAMs and other databases, and pointers to the literature. The work has been described in the articles listed in the References section above and use of the TIGRFAMs database must grant proper attribution by citing those four articles.

As of April 2018, rights were transferred to the National Center for Biotechnology Information (NCBI), National Library of Medicine, NIH, for the data to be made available for distribution under a Creative Commons Attribution-ShareAlike 4.0 license. Please see (https://creativecommons.org/licenses/by-sa/4.0/) for a brief summary of the license and (https://creativecommons.org/licenses/by-sa/4.0/legalcode) to see the full text.
