  • Stars: 946
  • Rank: 48,319 (Top 1.0 %)
  • Language: Python
  • License: Apache License 2.0
  • Created over 8 years ago
  • Updated 3 months ago

Repository Details

Scripts to download genomes from the NCBI FTP servers

NCBI Genome Downloading Scripts

Some scripts to download bacterial and fungal genomes from NCBI after they restructured their FTP a while ago.

Idea shamelessly stolen from Mick Watson's Kraken downloader scripts that can also be found in Mick's GitHub repo. However, Mick's scripts are written in Perl and are specific to actually building a Kraken database (as advertised).

So this is a set of scripts that focuses on the actual genome downloading.

Installation

pip install ncbi-genome-download

Alternatively, clone this repository from GitHub, then run (in a Python virtual environment):

pip install .

If this fails on older versions of Python, try updating your pip tool first:

pip install --upgrade pip

and then rerun the ncbi-genome-download install.

Alternatively, ncbi-genome-download is packaged in conda. Refer to the Anaconda/Miniconda site to install a distribution (highly recommended). With that installed, you can run:

conda install -c bioconda ncbi-genome-download

ncbi-genome-download is only developed and tested on Python releases still under active support by the Python project. At the moment, this means versions 3.7, 3.8, 3.9, 3.10 and 3.11. Specifically, no attempt at testing under Python versions older than 3.7 is being made.

If your system is stuck on an older version of Python, consider using a tool like Homebrew to obtain a more up-to-date version.

ncbi-genome-download 0.2.12 was the last version to support Python 2.

Usage

To download all bacterial RefSeq genomes in GenBank format from NCBI, run the following:

ncbi-genome-download bacteria

Downloading multiple groups is also possible:

ncbi-genome-download bacteria,viral

Note: To see all available groups, see ncbi-genome-download --help, or simply use all to check all groups. Naming a more specific group will reduce the download size and the time needed to find the sequences to download.

If you're on a reasonably fast connection, you might want to try running multiple downloads in parallel:

ncbi-genome-download bacteria --parallel 4

To download all fungal GenBank genomes from NCBI in GenBank format, run:

ncbi-genome-download --section genbank fungi

To download all viral RefSeq genomes in FASTA format, run:

ncbi-genome-download --formats fasta viral

It is possible to download multiple formats by supplying a list of formats or simply downloading all formats:

ncbi-genome-download --formats fasta,assembly-report viral
ncbi-genome-download --formats all viral

To download only completed bacterial RefSeq genomes in GenBank format, run:

ncbi-genome-download --assembly-levels complete bacteria

It is possible to download multiple assembly levels at once by supplying a list:

ncbi-genome-download --assembly-levels complete,chromosome bacteria

To download only bacterial reference genomes from RefSeq in GenBank format, run:

ncbi-genome-download --refseq-categories reference bacteria

To download bacterial RefSeq genomes of the genus Streptomyces, run:

ncbi-genome-download --genera Streptomyces bacteria

Note: This is a simple string match on the organism name provided by NCBI only.

With a slight trick, you can also use this option to download genomes of a certain species:

ncbi-genome-download --genera "Streptomyces coelicolor" bacteria

Note: The quotes are important. Again, this is a simple string match on the organism name provided by the NCBI.

Multiple genera are also possible:

ncbi-genome-download --genera "Streptomyces coelicolor,Escherichia coli" bacteria

You can also put genus names into a file, one organism per line, e.g.:

Streptomyces
Amycolatopsis

Then, pass the path to that file (e.g. my_genera.txt) to the --genera option, like so:

ncbi-genome-download --genera my_genera.txt bacteria

Note: The above command will download all Streptomyces and Amycolatopsis genomes from RefSeq.

You can make the string match fuzzy using the --fuzzy-genus option. This can be handy if you need to match a value in the middle of the NCBI organism name, like so:

ncbi-genome-download --genera coelicolor --fuzzy-genus bacteria

Note: The above command will download all bacterial genomes containing "coelicolor" anywhere in their organism name from RefSeq.
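
The difference between the default match and --fuzzy-genus can be pictured with the short sketch below. It is an illustration only, not the tool's code: it assumes the default match is anchored at the start of the organism name and the fuzzy match is a case-insensitive substring search. The function name matches_genus and the sample names are hypothetical.

```python
def matches_genus(organism_name, genera, fuzzy=False):
    """Return True if organism_name matches any entry in genera.

    Assumption: the default match is anchored at the start of the name,
    while the fuzzy match looks for the term anywhere, ignoring case.
    """
    name = organism_name.lower()
    for genus in genera:
        term = genus.strip().lower()
        if not term:
            continue  # ignore empty entries
        if fuzzy and term in name:
            return True
        if not fuzzy and name.startswith(term):
            return True
    return False


# Hypothetical organism names in the style of the NCBI assembly summary
names = [
    "Streptomyces coelicolor A3(2)",
    "Escherichia coli str. K-12 substr. MG1655",
]

print([n for n in names if matches_genus(n, ["Streptomyces"])])
print([n for n in names if matches_genus(n, ["coelicolor"], fuzzy=True)])
```

Under these assumptions, "coelicolor" only matches once fuzzy matching is switched on, because it does not appear at the start of any organism name.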

To download bacterial RefSeq genomes based on their NCBI species taxonomy ID, run:

ncbi-genome-download --species-taxids 562 bacteria

Note: The above command will download all RefSeq genomes belonging to Escherichia coli.

To download a specific bacterial RefSeq genome based on its NCBI taxonomy ID, run:

ncbi-genome-download --taxids 511145 bacteria

Note: The above command will download the RefSeq genome belonging to Escherichia coli str. K-12 substr. MG1655.

It is also possible to download multiple species taxids or taxids by supplying the numbers in a comma-separated list:

ncbi-genome-download --taxids 9606,9685 --assembly-levels chromosome vertebrate_mammalian

Note: The above command will download the reference genomes for cat and human.

In addition, you can put multiple species taxids or taxids into a file, one per line, and pass that filename to the --species-taxids or --taxids parameters, respectively.

Assuming you had a file my_taxids.txt with the following contents:

9606
9685

You could download the reference genomes for cat and human like this:

ncbi-genome-download --taxids my_taxids.txt --assembly-levels chromosome vertebrate_mammalian
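
One way to picture how a single option value can be either a comma-separated list or a file name is the sketch below. It is hypothetical and not taken from the tool's source: the helper read_id_list is an invented name, and it assumes that a value naming an existing file is read line by line, while anything else is split on commas.

```python
import os
import tempfile


def read_id_list(value):
    """Return a list of entries from a file path or a comma-separated string.

    Assumption: if value names an existing file, it holds one entry per line;
    otherwise value itself is treated as a comma-separated list.
    """
    if os.path.isfile(value):
        with open(value, encoding="utf-8") as handle:
            raw = handle.read().splitlines()
    else:
        raw = value.split(",")
    # Drop surrounding whitespace and skip blank entries
    return [item.strip() for item in raw if item.strip()]


# Inline list, as in --taxids 9606,9685
print(read_id_list("9606,9685"))

# File with one taxid per line, like my_taxids.txt
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "my_taxids.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("9606\n9685\n")
    print(read_id_list(path))
```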

It is possible to also create a human-readable directory structure in parallel to mirroring the layout used by NCBI:

ncbi-genome-download --human-readable bacteria

This will use links to point to the appropriate files in the NCBI directory structure, so it saves file space. Note that links are not supported on some Windows file systems and some older versions of Windows.
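
The space saving comes from linking rather than copying. The sketch below shows the general idea with a symbolic link. The directory names, the placeholder accession and the helper link_readable are all hypothetical and do not reproduce the layout the tool actually creates.

```python
import os
import tempfile


def link_readable(source_file, readable_dir):
    """Create a symbolic link to source_file inside readable_dir.

    Returns the path of the link. An existing entry of the same name is
    left untouched, so re-running is harmless.
    """
    os.makedirs(readable_dir, exist_ok=True)
    link_path = os.path.join(readable_dir, os.path.basename(source_file))
    if not os.path.lexists(link_path):
        os.symlink(os.path.abspath(source_file), link_path)
    return link_path


with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical mirrored file and human-readable target directory
    mirror = os.path.join(tmpdir, "mirror", "GCF_000000000.1")
    os.makedirs(mirror)
    genome = os.path.join(mirror, "genome.gbff.gz")
    with open(genome, "wb") as handle:
        handle.write(b"placeholder")

    readable = os.path.join(tmpdir, "human_readable", "Escherichia", "coli")
    link = link_readable(genome, readable)
    # The link resolves to the one real copy of the file
    print(os.path.islink(link), os.path.realpath(link) == os.path.realpath(genome))
```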

It is also possible to re-run a previous download with the --human-readable option. In this case, ncbi-genome-download will not download any new genome files and will just create the human-readable directory structure. Note that if any files have been changed on the NCBI side, a file download will be triggered.

There is a "dry-run" option to show which accessions would be downloaded, given your filters:

ncbi-genome-download --dry-run bacteria

If you want to filter on the "relation to type material" column of the assembly summary file, you can use the --type-materials option. Possible values are "any", "all", "type", "reference", "synonym", "proxytype", and/or "neotype". "any" will include assemblies with no relation to type material value defined, while "all" will download only assemblies with a defined value. Multiple values can be given, separated by commas:

ncbi-genome-download --type-materials type,reference

By default, ncbi-genome-download caches the assembly summary files for the respective taxonomic groups for one day. You can skip using the cache file by using the --no-cache option. The output of --help also shows the cache directory, should you want to remove any of the cached files.
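
A one-day cache of this kind is commonly implemented by comparing a file's modification time with the current time. The sketch below shows that general pattern. It is an assumption about the approach, not the tool's actual code, and cache_is_fresh and the file name are hypothetical.

```python
import os
import tempfile
import time
from datetime import timedelta

CACHE_LIFETIME = timedelta(days=1)


def cache_is_fresh(path, now=None, lifetime=CACHE_LIFETIME):
    """Return True if path exists and was modified within lifetime."""
    if not os.path.isfile(path):
        return False
    if now is None:
        now = time.time()
    age_seconds = now - os.path.getmtime(path)
    return age_seconds < lifetime.total_seconds()


with tempfile.TemporaryDirectory() as tmpdir:
    cache_file = os.path.join(tmpdir, "bacteria_assembly_summary.txt")
    with open(cache_file, "w", encoding="utf-8") as handle:
        handle.write("# placeholder summary\n")

    print(cache_is_fresh(cache_file))  # just written, so still fresh

    # Pretend the file was written two days ago
    two_days_ago = time.time() - 2 * 24 * 3600
    os.utime(cache_file, (two_days_ago, two_days_ago))
    print(cache_is_fresh(cache_file))  # older than one day, so stale
```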

To get an overview of all options, run:

ncbi-genome-download --help

As a method

You can also use it as a method call. Pass the pythonised keyword arguments (_ instead of -) as described above or in the --help:

import ncbi_genome_download as ngd
ngd.download()

Note: To specify a taxonomic group, like bacteria, use the group keyword.
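
The naming rule can be illustrated with a tiny helper that turns an option name from the command line into its keyword form and back. This is only a sketch of the convention. to_keyword and to_flag are hypothetical names and are not part of the ncbi_genome_download module.

```python
def to_keyword(flag):
    """Convert a command-line flag such as '--assembly-levels' to a keyword name."""
    return flag.lstrip("-").replace("-", "_")


def to_flag(keyword):
    """Convert a keyword name such as 'assembly_levels' to a command-line flag."""
    return "--" + keyword.replace("_", "-")


for flag in ("--assembly-levels", "--refseq-categories", "--dry-run"):
    print(flag, "->", to_keyword(flag))
```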

Contributed Scripts: gimme_taxa.py

This script lets you find out what TaxIDs to pass to ngd, and will write a simple one-item-per-line file to pass in to it. It utilises the ete3 toolkit, so refer to their site to install the dependency if it's not already satisfied.

You can query the database using a particular TaxID, or a scientific name. The primary function of the script is to return all the child taxa of the specified parent taxa. The script has various options for what information is written in the output.

A basic invocation may look like:

# Fetch all descendent taxa for Escherichia (taxid 561):
python gimme_taxa.py -o ~/mytaxafile.txt 561

# Alternatively, just provide the taxon name
python gimme_taxa.py -o all_descendent_taxids.txt Escherichia

# You can provide multiple taxids and/or names
python gimme_taxa.py -o all_descendent_taxids.txt 561,Methanobrevibacter

On first use, a small sqlite database will be created in your home directory by default (change the location with the --database flag). You can update this database by using the --update flag. Note that if the database is not in your home directory, you must specify it with --database or a new database will be created in your home directory.

To see all help:

python gimme_taxa.py
python gimme_taxa.py -h
python gimme_taxa.py --help

Citing ncbi-genome-download

You can cite ncbi-genome-download via the Zenodo deposit under DOI: 10.5281/zenodo.8192432 or the specific DOI for the version you used.

License

All code is available under the Apache License version 2, see the LICENSE file for details.
