This repository contains documentation for the NCBI BLAST+ command line applications in a Docker image. We will demonstrate how to use the Docker image to run BLAST analysis on the Google Cloud Platform (GCP) and Amazon Web Services (AWS) using a small basic example and a more advanced production-level example. Some basic knowledge of Unix/Linux commands and BLAST+ is useful in completing this tutorial.
- What Is NCBI BLAST?
- What Is Cloud Computing?
- What Is Docker?
- Google Cloud Platform Setup
- Section 1 - Getting Started Using the BLAST+ Docker Image with A Small Example
- Section 2 - A Step-by-Step Guide Using the BLAST+ Docker Image
- Section 3 - Using the BLAST+ Docker Image at Production Scale
- Amazon Web Services Setup
- BLAST Databases
- BLAST Database Metadata
- Additional Resources
- Maintainer
- License
- Appendix
The National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
Introduced in 2009, BLAST+ is an improved version of BLAST command line applications. For a full description of the features and capabilities of BLAST+, please refer to the BLAST Command Line Applications User Manual.
Cloud computing offers potential cost savings by using on-demand, scalable, and elastic computational resources. While a detailed description of various cloud technologies and benefits is out of the scope for this repository, the following sections contain information needed to get started running the BLAST+ Docker image on the Google Cloud Platform (GCP).
Docker is a tool to perform operating-system level virtualization using software containers. In containerization technology*, an image is a snapshot of an analytical environment encapsulating application(s) and dependencies. An image, which is essentially a file built from a list of instructions, can be saved and easily shared for others to recreate the exact analytical environment across platforms and operating systems. A container is a runtime instance of an image. By using containerization, users can bypass the often-complicated steps in compiling, configuring, and installing a Unix-based tool like BLAST+. In addition to portability, containerization is a lightweight approach to make analysis more findable, accessible, interoperable, reusable (F.A.I.R.) and, ultimately, reproducible.
*There are many containerization tools and standards, such as Docker and Singularity. We will focus solely on Docker, which is considered the de facto standard by many in the field.
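To make these concepts concrete, the commands below pull the BLAST+ image and create a container from it (a minimal sketch; Docker installation and usage are covered in detail in the sections that follow):

## Download (pull) the ncbi/blast image, a static snapshot, from Docker Hub
docker pull ncbi/blast
## List the images available locally
docker images
## Each "docker run" creates a new container (runtime instance) from the image
docker run --rm ncbi/blast blastn -version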
The following sections include instructions to create a Google virtual machine, install Docker, and run BLAST+ commands using the Docker image.
This section provides a quick run-through of a BLAST analysis in the Docker environment on a Google instance. It is intended as an overview for those who just want an understanding of the principles of the solution. If you work with Amazon instances, please go to the Amazon Web Services Setup section of this documentation. The Google Cloud Shell, an interactive shell environment, will be used for this example, which makes it possible to run the following small example without having to perform additional setup, such as creating a billing account or compute instance. More detailed descriptions of analysis steps, alternative commands, and more advanced topics are covered in the later sections of this documentation.
Requirements: A Google account
Input data:
- Query – 1 sequence, 44 nucleotides, file size 0.2 KB
- Databases
  - Custom database (nurse-shark-proteins) – 7 sequences, 922 nucleotides, file size 1.7 KB
  - PDB protein database (pdbaa) – 0.2831 GB
First, in a separate browser window or tab, sign in at https://console.cloud.google.com/
Click the Activate Cloud Shell button at the top right corner of the Google Cloud Platform Console.
You now will see your Cloud Shell session window:
The next step is to copy-and-paste the commands below in your Cloud Shell session.
Please note: In GitHub you can use your mouse to copy; however, in the command shell you must use your keyboard. In Windows or Unix/Linux, use the shortcut Control+C to copy and Control+V to paste. On macOS, use Command+C to copy and Command+V to paste. To scroll in the Cloud Shell, enable the scrollbar in Terminal settings with the wrench icon.
# Time needed to complete this section: <10 minutes
# Step 1. Retrieve sequences
## Create directories for analysis
cd ; mkdir blastdb queries fasta results blastdb_custom
## Retrieve query sequence
docker run --rm ncbi/blast efetch -db protein -format fasta \
-id P01349 > queries/P01349.fsa
## Retrieve database sequences
docker run --rm ncbi/blast efetch -db protein -format fasta \
-id Q90523,P80049,P83981,P83982,P83983,P83977,P83984,P83985,P27950 \
> fasta/nurse-shark-proteins.fsa
## Step 2. Make BLAST database
docker run --rm \
-v $HOME/blastdb_custom:/blast/blastdb_custom:rw \
-v $HOME/fasta:/blast/fasta:ro \
-w /blast/blastdb_custom \
ncbi/blast \
makeblastdb -in /blast/fasta/nurse-shark-proteins.fsa -dbtype prot \
-parse_seqids -out nurse-shark-proteins -title "Nurse shark proteins" \
-taxid 7801 -blastdb_version 5
## Step 3. Run BLAST+
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
-v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins
## Output on screen
## Scroll up to see the entire output
## Type "exit" to leave the Cloud Shell or continue to the next section
At this point, you should see the output on the screen. With your query, BLAST identified the protein sequence P80049.1 as a match with a score of 14.2 and an E-value of 0.96.
For larger analyses, it is recommended to use the -out flag to save the output to a file. For example, append -out /blast/results/blastp.out to the last command in Step 3 above, then view the content of this output file using more $HOME/results/blastp.out.
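For example, Step 3 with the output saved to a file becomes:

## Run BLAST+ and save the output to a file
docker run --rm \
    -v $HOME/blastdb:/blast/blastdb:ro \
    -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
    -v $HOME/queries:/blast/queries:ro \
    -v $HOME/results:/blast/results:rw \
    ncbi/blast \
    blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins \
    -out /blast/results/blastp.out
## View the results
more $HOME/results/blastp.out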
You can also query P01349.fsa against the PDB as shown in the following code block.
## Extend the example to query against the Protein Data Bank
## Time needed to complete this section: <10 minutes
## Confirm query
ls queries/P01349.fsa
## Download Protein Data Bank amino acid database (pdbaa)
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:rw \
-w /blast/blastdb \
ncbi/blast \
update_blastdb.pl --source gcp pdbaa
## Run BLAST+
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
-v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
blastp -query /blast/queries/P01349.fsa -db pdbaa
## Output on screen
## Scroll up to see the entire output
## Leave the Cloud Shell
exit
You have now completed a simple task and seen how BLAST+ with Docker works. To learn about Docker and BLAST+ at production scale, please proceed to the next section.
In Section 2 - A Step-by-Step Guide Using the BLAST+ Docker Image, we will use the same small example from the previous section and discuss alternative approaches, additional useful Docker and BLAST+ commands, and Docker command options and structures. In Section 3, we will demonstrate how to run the BLAST+ Docker image at production scale.
First, you need to set up a Google Cloud Platform (GCP) virtual machine (VM) for analysis.
- A GCP account linked to a billing account
- A GCP VM running Ubuntu 18.04 LTS
1. Creating your GCP account and registering for the free $300 credit program. (If you already have a GCP billing account, you can skip to step 2.)
- First, in a separate browser window or tab, sign in at https://console.cloud.google.com/
- If you need to create one, go to https://cloud.google.com/ and click "Get started for free" to sign up for a trial account.
- If you have multiple Google accounts, sign in using an Incognito Window (Chrome) or Private Window (Safari) or any other private browser window.
GCP is currently offering a $300 credit, which expires 12 months from activation, to incentivize new cloud users. The following steps will show you how to activate this credit. You will be asked for billing information, but GCP will not auto-charge you once the trial ends; you must elect to manually upgrade to a paid account.
- After signing in, click Activate to activate the $300 credit.
- Enter your country, for example, United States, and check the box indicating that you have read and accept the terms of service.
- Under "Account type," select "Individual." (This may be pre-selected in your Google account.)
- Enter your name and address.
- Under "How you pay," select "Automatic payments." (This may be pre-selected in your Google account.) This indicates that you will pay costs after you have used the service, either when you have reached your billing threshold or every 30 days, whichever comes first.
- Under "Payment method," select "add a credit or debit card" and enter your credit card information. You will not be automatically charged once the trial ends. You must elect to upgrade to a paid account before your payment method will be charged.
- Click "Start my free trial" to finish registration. When this process is completed, you should see a GCP welcome screen.
- On the GCP welcome screen from the last step, click "Compute Engine" or navigate to the "Compute Engine" section by clicking on the navigation menu with the "hamburger icon" (three horizontal lines) on the top left corner.
- Click on the blue "CREATE INSTANCE" button on the top bar.
- Create an instance with the following parameters (if a parameter is not listed below, keep the default setting):
- Name: keep the default or enter a name
- Region: us-east4 (Northern Virginia)
- For Section 2, change these settings -
- Machine Type: micro (1 shared vCPU), 0.6 GB memory, f1-micro
- Boot Disk: Click "Change," select Ubuntu 18.04 LTS, and click "Select" (the boot disk size defaults to 10 GB).
- For Section 3, change these settings -
- Machine Type: 16 vCPU, 104 GB memory, n1-highmem-16
- Boot Disk: Click "Change" and select Ubuntu 18.04 LTS, change the "Boot disk size" to 200 GB Standard persistent disk, and click "Select."
At this point, you should see a cost estimate for this instance on the right side of your window.
- Click the blue "Create" button. This will create and start the VM.
Please note: Creating a VM in the same region as storage can provide better performance. We recommend creating a VM in the us-east4 region. If you have a job that will take several hours, but less than 24 hours, you can potentially take advantage of preemptible VMs.
Detailed instructions for creating a GCP account and launching a VM can be found here.
Once you have your VM created, you must access it from your local computer. There are many methods to access your VM, depending on the ways in which you would like to use it. On the GCP, the most straightforward way is to SSH from the browser.
You now have a command shell running and you are ready to proceed.
Remember to stop or delete the VM to prevent incurring additional cost.
In this section, we will cover Docker installation, discuss various docker run command options, and examine the structure of a Docker command. We will use the same small example from Section 1 and explore alternative approaches to running the BLAST+ Docker image. However, we are now using a real VM instance, which provides greater performance and functionality than the Google Cloud Shell.
Input data
- Query – 1 sequence, 44 nucleotides, file size 0.2 KB
- Database – 7 sequences, 922 nucleotides, file size 1.7 KB
On a production system, Docker must be installed as an application.
## Run these commands to install Docker and add non-root users to run Docker
sudo snap install docker
sudo apt update
sudo apt install -y docker.io
sudo usermod -aG docker $USER
exit
# exit and SSH back in for changes to take effect
To confirm the correct installation of Docker, run the command docker run hello-world. If correctly installed, you should see "Hello from Docker!..." (https://docs.docker.com/samples/library/hello-world/).
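For example:

## Verify the Docker installation
docker --version
docker run --rm hello-world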
This section is optional.
Below is a list of docker run command-line options used in this tutorial.

| Name, shorthand (if available) | Description |
|---|---|
| --rm | Automatically remove the container when it exits |
| --volume, -v | Bind mount a volume |
| --workdir, -w | Working directory inside the container |
This section is optional.
For this tutorial, it would be useful to understand the structure of a Docker command. The following command consists of three parts.
docker run --rm \
    -v $HOME/blastdb_custom:/blast/blastdb_custom:rw \
    -v $HOME/fasta:/blast/fasta:ro \
    -w /blast/blastdb_custom \
    ncbi/blast \
    makeblastdb -in /blast/fasta/nurse-shark-proteins.fsa -dbtype prot \
    -parse_seqids -out nurse-shark-proteins -title "Nurse shark proteins" \
    -taxid 7801 -blastdb_version 5
The first part of the command, docker run --rm ncbi/blast, is an instruction to run the Docker image ncbi/blast and remove the container when the run is completed.

The second part of the command makes the query sequence data accessible in the container. Docker bind mounts use -v to mount local directories to directories inside the container and set the access permission to rw (read and write) or ro (read only). For instance, assuming your subject sequences are stored in the $HOME/fasta directory on the local host, you can use the parameter -v $HOME/fasta:/blast/fasta:ro to make that directory accessible inside the container at /blast/fasta as a read-only directory. The -w /blast/blastdb_custom flag sets the working directory inside the container.
The third part of the command is the BLAST+ command. In this case, it is executing makeblastdb to create BLAST database files.
You can start an interactive bash session for this image by using docker run -it ncbi/blast /bin/bash. In the BLAST+ Docker image, the executables are in the directories /blast/bin and /root/edirect, which are added to the $PATH environment variable.
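For example, a brief interactive session might look like this (a sketch; type exit to leave the container):

docker run --rm -it ncbi/blast /bin/bash
# Inside the container (note the root prompt):
which blastp        # should print /blast/bin/blastp
blastp -version
# Leave the container
exit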
For additional documentation on the docker run command, please refer to the Docker documentation.
This section is optional.
| Docker command | Description |
|---|---|
| docker ps -a | Displays a list of containers |
| docker rm $(docker ps -q -f status=exited) | Removes all exited containers, if you have at least one exited container |
| docker rm <CONTAINER_ID> | Removes a container |
| docker images | Displays a list of images |
| docker rmi <REPOSITORY (IMAGE_NAME)> | Removes an image |
This section is optional.
With this Docker image you can run BLAST+ in an isolated container, facilitating reproducibility of BLAST results. As a user of this Docker image, you are expected to provide BLAST databases and query sequence(s) to run BLAST as well as a location outside the container to save the results. The following is a list of directories used by BLAST+. You will create them in Step 2.
| Directory | Purpose | Notes |
|---|---|---|
| $HOME/blastdb | Stores NCBI-provided BLAST databases | If set to a single, absolute path, the $BLASTDB environment variable could be used instead (see Configuring BLAST via environment variables) |
| $HOME/queries | Stores user-provided query sequence(s) | |
| $HOME/fasta | Stores user-provided FASTA sequences to create BLAST database(s) | |
| $HOME/results | Stores BLAST results | Mount with rw permissions |
| $HOME/blastdb_custom | Stores user-provided BLAST databases | |
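As an illustration of the $BLASTDB note in the table above, the sketch below passes the environment variable into the container with Docker's -e option instead of relying on the default database search path (this assumes pdbaa has already been downloaded to $HOME/blastdb, as shown later in this section):

docker run --rm \
    -v $HOME/blastdb:/blast/blastdb:ro \
    -v $HOME/queries:/blast/queries:ro \
    -e BLASTDB=/blast/blastdb \
    ncbi/blast \
    blastp -query /blast/queries/P01349.fsa -db pdbaa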
This section is optional.
The following command displays the latest BLAST version.
docker run --rm ncbi/blast blastn -version
Appending a tag to the image name (ncbi/blast) allows you to use a different version of BLAST+ (see the "Supported Tags and Respective Release Notes" section for supported versions).
Different versions of BLAST+ exist in different Docker images. The following command will initiate download of the BLAST+ version 2.9.0 Docker image.
docker run --rm ncbi/blast:2.9.0 blastn -version
## Display a list of images
docker images
For example, to use the BLAST+ version 2.9.0 Docker image instead of the latest version, replace the first part of the command, docker run --rm ncbi/blast, with docker run --rm ncbi/blast:2.9.0.
This section is optional.
- 2.11.0: release notes
- 2.10.1: release notes
- 2.10.0: release notes
- 2.9.0: release notes
- 2.8.1: release notes
In this example, we will start by fetching query and database sequences and then create a custom BLAST database.
# Start in a directory where you want to perform the analysis
## Create directories for analysis
cd ; mkdir blastdb queries fasta results blastdb_custom
## Retrieve query sequences
docker run --rm ncbi/blast efetch -db protein -format fasta \
-id P01349 > queries/P01349.fsa
## Retrieve database sequences
docker run --rm ncbi/blast efetch -db protein -format fasta \
-id Q90523,P80049,P83981,P83982,P83983,P83977,P83984,P83985,P27950 \
> fasta/nurse-shark-proteins.fsa
## Make BLAST database
docker run --rm \
-v $HOME/blastdb_custom:/blast/blastdb_custom:rw \
-v $HOME/fasta:/blast/fasta:ro \
-w /blast/blastdb_custom \
ncbi/blast \
makeblastdb -in /blast/fasta/nurse-shark-proteins.fsa -dbtype prot \
-parse_seqids -out nurse-shark-proteins -title "Nurse shark proteins" \
-taxid 7801 -blastdb_version 5
To verify the newly created BLAST database above, you can run the following command to display the accession, sequence length, and taxonomy ID of each sequence in the database.
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
-v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
ncbi/blast \
blastdbcmd -entry all -db nurse-shark-proteins -outfmt "%a %l %T"
As an alternative, you can also download preformatted BLAST databases from NCBI or the NCBI Google storage bucket.
docker run --rm ncbi/blast update_blastdb.pl --showall pretty --source gcp
For a detailed description of update_blastdb.pl, please refer to the documentation. By default, update_blastdb.pl will download from the cloud provider you are connected to, or from NCBI if you are not using a supported cloud provider.
This section is optional.
docker run --rm ncbi/blast update_blastdb.pl --showall --source ncbi
This section is optional.
The command below mounts the $HOME/blastdb path on the local machine as /blast/blastdb in the container, and blastdbcmd shows the available BLAST databases at this location.
## Download Protein Data Bank amino acid database (pdbaa)
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:rw \
-w /blast/blastdb \
ncbi/blast \
update_blastdb.pl pdbaa
## Display database(s) in $HOME/blastdb
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
ncbi/blast \
blastdbcmd -list /blast/blastdb -remove_redundant_dbs
You should see the output /blast/blastdb/pdbaa Protein.
## For the custom BLAST database used in this example -
docker run --rm \
-v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
ncbi/blast \
blastdbcmd -list /blast/blastdb_custom -remove_redundant_dbs
You should see the output /blast/blastdb_custom/nurse-shark-proteins Protein.
When running BLAST in a Docker container, note the mounts specified to the docker run command to make the inputs and outputs accessible. In the examples below, the first two mounts provide access to the BLAST databases, the third mount provides access to the query sequence(s), and the fourth mount provides a directory to save the results. (Note the :ro and :rw options, which mount the directories as read-only and read-write, respectively.)
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
-v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins \
-out /blast/results/blastp.out
At this point, you should see the output file $HOME/results/blastp.out. With your query, BLAST identified the protein sequence P80049.1 as a match with a score of 14.2 and an E-value of 0.96. To view the content of this output file, use the command more $HOME/results/blastp.out.
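If you prefer machine-readable results, BLAST+ also supports other report formats through the -outfmt option (see blastp -help for the full list); for example, the following sketch writes a tabular report with comment lines:

docker run --rm \
    -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
    -v $HOME/queries:/blast/queries:ro \
    -v $HOME/results:/blast/results:rw \
    ncbi/blast \
    blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins \
    -outfmt 7 -out /blast/results/blastp.tab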
Remember to stop or delete the VM to prevent incurring additional cost. You can do this at the GCP Console as shown below.
One of the promises of cloud computing is scalability. In this section, we will demonstrate how to use the BLAST+ Docker image at production scale on the Google Cloud Platform. We will perform a BLAST analysis similar to the approach described in this publication to compare de novo aligned contigs from bacterial 16S-23S sequencing against the nucleotide collection (nt) database.
To test scalability, we will use inputs of different sizes to estimate the amount of time to download the nucleotide collection database and run BLAST search using the latest version of the BLAST+ Docker image. Expected results are summarized in the following tables.
Input files: 28 samples (multi-FASTA files) containing de novo aligned contigs from the publication.
(Instructions to download and create the input files are described in the code block below.)
Database: Pre-formatted BLAST nucleotide collection database, version 5 (nt): 68.7217 GB (from May 2019)
| | Input file name | File content | File size | Number of sequences | Number of nucleotides | Expected output size |
|---|---|---|---|---|---|---|
| Analysis 1 | query1.fa | Only sample 1 | 59 KB | 121 | 51,119 | 3.1 GB |
| Analysis 2 | query5.fa | Only samples 1-5 | 422 KB | 717 | 375,154 | 10.4 GB |
| Analysis 3 | query.fa | All 28 samples | 2.322 MB | 3798 | 2,069,892 | 47.8 GB |
| VM type/zone | CPUs | Memory (GB) | Hourly cost* | Download nt (min) | Analysis 1 (min) | Analysis 2 (min) | Analysis 3 (min) | Total cost** |
|---|---|---|---|---|---|---|---|---|
| n1-standard-8, us-east4-c | 8 | 30 | $0.312 | 9 | 22 | - | - | - |
| n1-standard-16, us-east4-c | 16 | 60 | $0.611 | 9 | 14 | 53 | 205 | $2.86 |
| n1-highmem-16, us-east4-c | 16 | 104 | $0.767 | 9 | 9 | 30 | 143 | $2.44 |
| n1-highmem-16, us-west2-a | 16 | 104 | $0.809 | 11 | 9 | 30 | 147 | $2.60 |
| n1-highmem-16, us-west1-b | 16 | 104 | $0.674 | 11 | 9 | 30 | 147 | $2.17 |
| BLAST website (blastn) | - | - | - | - | Searches exceed current restrictions on usage | Searches exceed current restrictions on usage | Searches exceed current restrictions on usage | - |
All GCP instances are configured with 200 GB of persistent standard disk.
*Hourly costs were provided by Google Cloud Platform (May 2019) when VMs were created and are subject to change.
**Total costs were estimated using the hourly cost and total time to download nt and run Analysis 1, Analysis 2, and Analysis 3. Estimates are used for comparison only; your costs may vary and are your responsibility to monitor and manage.
Please refer to GCP for more information on machine types, regions and zones, and compute cost.
Please note that running the blastn binary without specifying its -task parameter invokes the MegaBLAST algorithm.
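If you need the traditional BLASTN algorithm instead (more sensitive but slower), add -task blastn to the command; a sketch using the Section 3 setup below:

docker run --rm \
    -v $HOME/blastdb:/blast/blastdb:ro \
    -v $HOME/queries:/blast/queries:ro \
    -v $HOME/results:/blast/results:rw \
    ncbi/blast \
    blastn -task blastn -query /blast/queries/query1.fa -db nt -num_threads 16 \
    -out /blast/results/blastn.task.query1.out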
## Install Docker if not already done
## This section assumes using recommended hardware requirements below
## 16 CPUs, 104 GB memory and 200 GB persistent hard disk
## Modify the number of CPUs (-num_threads) in Step 3 if another type of VM is used.
## Step 1. Prepare for analysis
## Create directories
cd ; mkdir -p blastdb queries fasta results blastdb_custom
## Import and process input sequences
sudo apt install unzip
wget https://ndownloader.figshare.com/articles/6865397?private_link=729b346eda670e9daba4 -O fa.zip
unzip fa.zip -d fa
### Create three input query files
### All 28 samples
cat fa/*.fa > query.fa
### Sample 1
cat fa/'Sample_1 (paired) trimmed (paired) assembly.fa' > query1.fa
### Sample 1 to Sample 5
cat fa/'Sample_1 (paired) trimmed (paired) assembly.fa' \
fa/'Sample_2 (paired) trimmed (paired) assembly.fa' \
fa/'Sample_3 (paired) trimmed (paired) assembly.fa' \
fa/'Sample_4 (paired) trimmed (paired) assembly.fa' \
fa/'Sample_5 (paired) trimmed (paired) assembly.fa' > query5.fa
### Copy query sequences to $HOME/queries folder
cp query* $HOME/queries/.
## Step 2. Display BLAST databases on the GCP
docker run --rm ncbi/blast update_blastdb.pl --showall pretty --source gcp
## Download nt (nucleotide collection version 5) database
## This step takes approximately 10 min. The following command runs in the background.
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:rw \
-w /blast/blastdb \
ncbi/blast \
update_blastdb.pl --source gcp nt &
## At this point, confirm query/database have been properly provisioned before proceeding
## Check the size of the directory containing the BLAST database
## nt should be around 68 GB (this was in May 2019)
du -sk $HOME/blastdb
## Check for queries, there should be three files - query.fa, query1.fa and query5.fa
ls -al $HOME/queries
## From this point forward, it may be easier if you run these steps in a script.
## Simply copy and paste all the commands below into a file named script.sh
## Then run the script in the background `nohup bash script.sh > script.out &`
## Step 3. Run BLAST
## Run BLAST using query1.fa (Sample 1)
## This command will take approximately 9 minutes to complete.
## Expected output size: 3.1 GB
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
blastn -query /blast/queries/query1.fa -db nt -num_threads 16 \
-out /blast/results/blastn.query1.denovo16s.out
## Run BLAST using query5.fa (Samples 1-5)
## This command will take approximately 30 minutes to complete.
## Expected output size: 10.4 GB
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
blastn -query /blast/queries/query5.fa -db nt -num_threads 16 \
-out /blast/results/blastn.query5.denovo16s.out
## Run BLAST using query.fa (All 28 samples)
## This command will take approximately 147 minutes to complete.
## Expected output size: 47.8 GB
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
blastn -query /blast/queries/query.fa -db nt -num_threads 16 \
-out /blast/results/blastn.query.denovo16s.out
## Stdout and stderr will be in script.out
## BLAST output will be in $HOME/results
You have completed the entire tutorial. At this point, if you do not need the downloaded data for further analysis, please delete the VM to prevent incurring additional cost.
To delete an instance, follow instructions in the section Stop the GCP instance.
For additional information, please refer to Google Cloud Platform's documentation on instance life cycle.
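If you prefer the command line, the GCP Cloud SDK can stop or delete an instance; a sketch, assuming the instance is named instance-1 in zone us-east4-c:

# Stop the instance (disk storage charges still apply)
gcloud compute instances stop instance-1 --zone us-east4-c
# Or delete the instance entirely
gcloud compute instances delete instance-1 --zone us-east4-c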
To run these examples you'll need an Amazon Web Services (AWS) account. If you don't have one already, you can create an account that provides the ability to explore and try out AWS services free of charge up to specified limits for each service. To get started, visit the Free Tier site. Signing up requires a valid credit card, but the card will not be charged as long as your compute usage stays within the Free Tier limits. When choosing a Free Tier product, be sure it is in the Compute product category.
- An AWS account
- An EC2 VM running Linux, on an instance type of t2.micro
- An SSH client, such as the native Terminal application on macOS, the CMD prompt on Windows 8 or later, or PuTTY on Windows
These instructions create an EC2 VM based on an Amazon Machine Image (AMI) that includes Docker and its dependencies.
- Log into the AWS console and select the EC2 service.
- Start the instance creation process by selecting Launch Instance (a virtual machine)
- In Step 1: Choose an Amazon Machine Image (AMI) select the AWS Marketplace tab
- In the search box enter the value ECS-Optimized Amazon Linux AMI
- Select a Free tier eligible AMI, the Amazon ECS-Optimized Amazon Linux AMI, and select Continue
- In Step 2: Choose an Instance Type choose the t2.micro Type; select Next: Review and Launch
- Select Launch
- To allow SSH connection to this VM you'll need a key pair. When prompted, select an existing, or create a new, key pair. Be sure to record the location (directory) in which you place the associated .pem file, then select Launch Instances.
- Select View Instances
With the VM created, you access it from your local computer using SSH. Your key pair / .pem file serves as your credential.
There are several ways to establish an SSH connection. From the EC2 Instance list in the AWS Console, select Connect, then follow the instructions for the Connection Method A standalone SSH client.
The detailed instructions for connecting to a Linux VM can be found here.
Specify ec2-user (instead of root) as the username in your ssh command line or when prompted to log in.
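A sketch of the standalone SSH command; the .pem path and the public DNS name are placeholders that you should replace with your own values:

ssh -i ~/my-key-pair.pem ec2-user@ec2-198-51-100-1.compute-1.amazonaws.com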
In this example, we will start by fetching query and database sequences and then create a custom BLAST database.
## Retrieve sequences
## Create directories for analysis
cd $HOME; sudo mkdir bin blastdb queries fasta results blastdb_custom; sudo chown ec2-user:ec2-user *
## Retrieve query sequence
docker run --rm ncbi/blast efetch -db protein -format fasta \
-id P01349 > queries/P01349.fsa
## Retrieve database sequences
docker run --rm ncbi/blast efetch -db protein -format fasta \
-id Q90523,P80049,P83981,P83982,P83983,P83977,P83984,P83985,P27950 \
> fasta/nurse-shark-proteins.fsa
## Make BLAST database
docker run --rm \
-v $HOME/blastdb_custom:/blast/blastdb_custom:rw \
-v $HOME/fasta:/blast/fasta:ro \
-w /blast/blastdb_custom \
ncbi/blast \
makeblastdb -in /blast/fasta/nurse-shark-proteins.fsa -dbtype prot \
-parse_seqids -out nurse-shark-proteins -title "Nurse shark proteins" \
-taxid 7801 -blastdb_version 5
To verify the newly created BLAST database above, you can run the following command to display the accession, sequence length, and taxonomy ID of each sequence in the database.
## Verify BLAST DB
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
-v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
ncbi/blast \
blastdbcmd -entry all -db nurse-shark-proteins -outfmt "%a %l %T"
When running BLAST in a Docker container, note the mounts (-v option) specified to the docker run command to make the inputs and outputs accessible. In the examples below, the first two mounts provide access to the BLAST databases, the third mount provides access to the query sequence(s), and the fourth mount provides a directory to save the results. (Note the :ro and :rw options, which mount the directories as read-only and read-write, respectively.)
## Run BLAST+
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
-v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins \
-out /blast/results/blastp.out
At this point, you should see the output file $HOME/results/blastp.out. With your query, BLAST identified the protein sequence P80049.1 as a match with a score of 14.2 and an E-value of 0.96. To view the content of this output file, use the command more $HOME/results/blastp.out.
docker run --rm ncbi/blast update_blastdb.pl --showall pretty --source aws
The expected output is a list of BLAST DBs, including their name, description, size, and last updated date.
For a detailed description of update_blastdb.pl, please refer to the documentation. By default, update_blastdb.pl will download from the cloud provider you are connected to, or from NCBI if you are not using a supported cloud provider.
docker run --rm ncbi/blast update_blastdb.pl --showall --source ncbi
The expected output is a list of the names of BLAST DBs.
Remember to stop or terminate the VM to prevent incurring additional cost. You can do this from the EC2 Instance list in the AWS Console as shown below.
This example requires a multi-core host, so EC2 compute charges will be incurred by executing it. The current rate for the instance type used, t2.large, is $0.093/hr.
These instructions create an EC2 VM based on an Amazon Machine Image (AMI) that includes Docker and its dependencies.
- Log into the AWS console and select the EC2 service.
- Start the instance creation process by selecting Launch Instance (a virtual machine)
- In Step 1: Choose an Amazon Machine Image (AMI) select the AWS Marketplace tab
- In the search box enter the value ECS-Optimized Amazon Linux AMI
- Select a Free tier eligible AMI, the Amazon ECS-Optimized Amazon Linux AMI, and select Continue
- In Step 2: Choose an Instance Type choose the t2.large Type; select Next: Review and Launch
- Select Launch
- To allow SSH connection to this VM you'll need a key pair. When prompted, select an existing, or create a new, key pair. Be sure to record the location (directory) in which you place the associated .pem file, then select Launch Instances. You can use the same key pair as used in Example 1.
- Select View Instances
With the VM created, you access it from your local computer using SSH. Your key pair / .pem file serves as your credential.
There are several ways to establish an SSH connection. From the EC2 Instance list in the AWS Console, select Connect, then follow the instructions for the Connection Method A standalone SSH client.
The detailed instructions for connecting to a Linux VM can be found here.
Specify ec2-user (instead of root) as the username in your ssh command line or when prompted to log in.
## Create directories for analysis
cd $HOME; sudo mkdir bin blastdb queries fasta results blastdb_custom; sudo chown ec2-user:ec2-user *
## Retrieve query sequence
docker run --rm ncbi/blast efetch -db protein -format fasta \
-id P01349 > queries/P01349.fsa
The command below mounts (using the -v option) the $HOME/blastdb path on the local machine as /blast/blastdb in the container, and blastdbcmd shows the available BLAST databases at this location.
## Download Protein Data Bank amino acid database (pdbaa)
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:rw \
-w /blast/blastdb \
ncbi/blast \
update_blastdb.pl pdbaa
## Display database(s) in $HOME/blastdb
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
ncbi/blast \
blastdbcmd -list /blast/blastdb -remove_redundant_dbs
You should see the output /blast/blastdb/pdbaa Protein.
## Run BLAST+
docker run --rm \
-v $HOME/blastdb:/blast/blastdb:ro \
-v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
blastp -query /blast/queries/P01349.fsa -db pdbaa \
-out /blast/results/blastp_pdbaa.out
At this point, you should see the output file $HOME/results/blastp_pdbaa.out. To view the content of this output file, use the command more $HOME/results/blastp_pdbaa.out.
One way to transfer files between your local computer and a Linux instance is to use the secure copy protocol (SCP).
The section Transferring files to Linux instances from Linux using SCP of the Amazon EC2 User Guide for Linux Instances provides detailed instructions for this process.
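A sketch of both directions; the key pair and public DNS name are placeholders to replace with your own values:

# Local machine -> EC2 instance
scp -i ~/my-key-pair.pem $HOME/script.out ec2-user@ec2-198-51-100-1.compute-1.amazonaws.com:~
# EC2 instance -> local machine
scp -i ~/my-key-pair.pem ec2-user@ec2-198-51-100-1.compute-1.amazonaws.com:~/script.out $HOME/.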
The NCBI hosts the same databases on AWS, GCP, and the NCBI FTP site. The table below lists the databases current as of November 2022.
It is also possible to obtain the current list with the command:
docker run --rm ncbi/blast update_blastdb.pl --showall pretty
or
update_blastdb.pl --showall pretty # after downloading the BLAST+ package.
As shown above, update_blastdb.pl can also be used to download these databases. It will automatically select the appropriate resource (e.g., GCP if you are within that provider).
These databases can also be searched with ElasticBLAST on GCP and AWS.
Accessing the databases on AWS or GCP outside of the cloud provider will likely result in egress charges to your account. If you are not on the cloud provider, you should use the databases at the NCBI FTP site.
Name | Type | Title |
---|---|---|
16S_ribosomal_RNA | DNA | 16S ribosomal RNA (Bacteria and Archaea type strains) |
18S_fungal_sequences | DNA | 18S ribosomal RNA sequences (SSU) from Fungi type and reference material |
28S_fungal_sequences | DNA | 28S ribosomal RNA sequences (LSU) from Fungi type and reference material |
Betacoronavirus | DNA | Betacoronavirus |
GCF_000001405.38_top_level | DNA | Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds |
GCF_000001635.26_top_level | DNA | Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds |
ITS_RefSeq_Fungi | DNA | Internal transcribed spacer region (ITS) from Fungi type and reference material |
ITS_eukaryote_sequences | DNA | ITS eukaryote BLAST |
LSU_eukaryote_rRNA | DNA | Large subunit ribosomal nucleic acid for Eukaryotes |
LSU_prokaryote_rRNA | DNA | Large subunit ribosomal nucleic acid for Prokaryotes |
SSU_eukaryote_rRNA | DNA | Small subunit ribosomal nucleic acid for Eukaryotes |
env_nt | DNA | environmental samples |
nt | DNA | Nucleotide collection (nt) |
patnt | DNA | Nucleotide sequences derived from the Patent division of GenBank |
pdbnt | DNA | PDB nucleotide database |
ref_euk_rep_genomes | DNA | RefSeq Eukaryotic Representative Genome Database |
ref_prok_rep_genomes | DNA | Refseq prokaryote representative genomes (contains refseq assembly) |
ref_viroids_rep_genomes | DNA | Refseq viroids representative genomes |
ref_viruses_rep_genomes | DNA | Refseq viruses representative genomes |
refseq_rna | DNA | NCBI Transcript Reference Sequences |
refseq_select_rna | DNA | RefSeq Select RNA sequences |
tsa_nt | DNA | Transcriptome Shotgun Assembly (TSA) sequences |
env_nr | Protein | Proteins from WGS metagenomic projects |
landmark | Protein | Landmark database for SmartBLAST |
nr | Protein | All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects |
pdbaa | Protein | PDB protein database |
pataa | Protein | Protein sequences derived from the Patent division of GenBank |
refseq_protein | Protein | NCBI Protein Reference Sequences |
refseq_select_prot | Protein | RefSeq Select proteins |
swissprot | Protein | Non-redundant UniProtKB/SwissProt sequences |
tsa_nr | Protein | Transcriptome Shotgun Assembly (TSA) sequences |
cdd | Protein | Conserved Domain Database (CDD), a collection of well-annotated multiple sequence alignment models represented as position-specific score matrices |
The NCBI provides metadata for the available BLAST databases at AWS, GCP and the NCBI FTP site.
Accessing the databases on AWS or GCP outside of the cloud provider will likely result in egress charges to your account. If you are not on the cloud provider, you should use the databases at the NCBI FTP site.
On AWS and GCP, the metadata file is in a date-dependent subdirectory alongside the databases. To find the latest valid subdirectory, first read s3://ncbi-blast-databases/latest-dir (on AWS) or gs://blast-db/latest-dir (on GCP). latest-dir is a text file containing a date stamp (e.g., 2020-09-29-01-05-01) that specifies the most recent directory. The proper directory is the AWS or GCP base URI for the BLAST databases (e.g., s3://ncbi-blast-databases/ for AWS) plus the text in the latest-dir file. An example URI on AWS would be s3://ncbi-blast-databases/2020-09-29-01-05-01; the GCP URI would be similar.
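For example, you can read latest-dir directly with the cloud CLIs (a sketch, assuming the AWS CLI and gsutil are installed; --no-sign-request allows anonymous access to the public bucket):

# On AWS
aws s3 cp --no-sign-request s3://ncbi-blast-databases/latest-dir -
# On GCP
gsutil cat gs://blast-db/latest-dir
# List the files in the latest directory on AWS
aws s3 ls --no-sign-request s3://ncbi-blast-databases/$(aws s3 cp --no-sign-request s3://ncbi-blast-databases/latest-dir -)/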
An excerpt from a metadata file is shown below. Most fields have obvious meanings. The files listed comprise the BLAST database. The bytes-total field gives the total BLAST database size in bytes and indicates how much disk space is required.
The example below is from AWS, but the metadata files on GCP have the same format. Databases on the FTP site are in gzipped tarfiles, one per volume of the BLAST database, so those are listed rather than the individual files.
"16S_ribosomal_RNA": {
"version": "1.2",
"dbname": "16S_ribosomal_RNA",
"dbtype": "Nucleotide",
"db-version": 5,
"description": "16S ribosomal RNA (Bacteria and Archaea type strains)",
"number-of-letters": 32435109,
"number-of-sequences": 22311,
"last-updated": "2022-03-07T11:23:00",
"number-of-volumes": 1,
"bytes-total": 14917073,
"bytes-to-cache": 8495841,
"files": [
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.ndb",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nog",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nni",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nnd",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nsq",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nin",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.ntf",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.not",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nhr",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nos",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/16S_ribosomal_RNA.nto",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/taxdb.btd",
"s3://ncbi-blast-databases/2020-09-26-01-05-01/taxdb.bti"
]
}
- BLAST:
- Docker:
- Other:
- Common Workflow Language (CWL) is a specification to describe tools and workflows. This GitHub Repository contains sample CWL workflows using containerized BLAST+.
- Google Cloud Platform
- NIH/STRIDES
- GitHub
For help, please create an issue on GitHub or email us.
National Center for Biotechnology Information (NCBI)
National Library of Medicine (NLM)
National Institutes of Health (NIH)
Please refer to the license and copyright information for the software contained in this image.
As with all Docker images, these likely also contain other software which may be under other licenses (such as bash, etc., from the base distribution, along with any direct or indirect dependencies of the primary software being contained).
As with any pre-built image usage, it is the image user's responsibility to ensure that any use of this image complies with any relevant licenses for all software contained within.
Figure 1. Docker and Cloud Computing Concept. Users can access compute resources provided by cloud service providers (CSPs), such as the Google Cloud Platform, using SSH tunneling (1). When you create a VM (2), a hard disk (also called a boot/persistent disk) (3) is attached to that VM. With the right permissions, VMs can also access other storage buckets (4) or other data repositories in the public domain. Once inside a VM with Docker installed, you can run a Docker image (5), such as NCBI's BLAST image. An image can be used to create multiple running instances or containers (6). Each container is in an isolated environment. In order to make data accessible inside the container, you need to use Docker bind mounts (7) described in this tutorial.
A Docker image can be used to create a Singularity image. Please refer to Singularity's documentation for more detail.
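For example (a sketch, assuming Singularity or Apptainer is installed on the host):

# Build a Singularity image from the Docker Hub image
singularity pull docker://ncbi/blast
# Run a BLAST+ command from the resulting image file
singularity exec blast_latest.sif blastn -version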
As an alternative to what is described above, you can also run BLAST interactively inside a container.
When to use: This is useful for running a few (e.g., fewer than 5-10) BLAST searches on small BLAST databases where you expect the search to complete in seconds/minutes.
docker run --rm -it \
-v $HOME/blastdb:/blast/blastdb:ro -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
/bin/bash
# Once you are inside the container (note the root prompt), run the following BLAST commands.
blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins \
-out /blast/results/blastp.out
# To view output, run the following command
more /blast/results/blastp.out
# Leave container
exit
In addition, you can run BLAST in detached mode by running a container in the background.
When to use: This is a more practical approach if you have many (e.g., 10 or more) BLAST searches to run or you expect the search to take a long time to execute. In this case it may be better to start the BLAST container in detached mode and execute commands on it.
NOTE: Be sure to mount all required directories, as these need to be specified when the container is started.
# Start a container named 'blast' in detached mode
docker run --rm -dit --name blast \
-v $HOME/blastdb:/blast/blastdb:ro -v $HOME/blastdb_custom:/blast/blastdb_custom:ro \
-v $HOME/queries:/blast/queries:ro \
-v $HOME/results:/blast/results:rw \
ncbi/blast \
sleep infinity
# Check the container is running in the background
docker ps -a
docker ps --filter "status=running"
Once the container is confirmed to be running in detached mode, run the following BLAST command.
docker exec blast blastp -query /blast/queries/P01349.fsa \
-db nurse-shark-proteins -out /blast/results/blastp.out
# View output
more $HOME/results/blastp.out
# stop the container
docker stop blast
If you run into issues with the docker stop blast command, reset the VM from the GCP Console or restart the SSH session.
The following example copies the file $HOME/script.out from the home directory on a local machine to the home directory of a GCP VM named instance-1 in the project My First Project using the GCP Cloud SDK.
GCP documentation
First install GCP Cloud SDK command line tools for your operating system.
# First, set up gcloud tools
# From local machine's terminal
gcloud init
# Enter a configuration name
# Select the sign-in email account
# Select a project, for example βmy-first-projectβ
# Select a compute engine zone, for example, βus-east4-cβ
# To copy the file $HOME/script.out to the home directory of GCP instance-1
# Instance name can be found in your Google Cloud Console -> Compute Engine -> VM instances
gcloud compute scp $HOME/script.out instance-1:~
# Optional - to transfer the file from the GCP instance to a local machine's home directory
gcloud compute scp instance-1:~/script.out $HOME/.