• Stars
    star
    271
  • Rank 151,717 (Top 3 %)
  • Language
    Python
  • License
    MIT License
  • Created about 5 years ago
  • Updated about 1 month ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Download FASTQ files from SRA or ENA repositories.

GitHub release (latest by date) Anaconda-Server Badge Gitpod ready-to-code

fastq-dl

Download FASTQ files from the European Nucleotide Archive or the Sequence Read Archive repositories.

Introduction

fastq-dl takes an ENA/SRA accession (Study, Sample, Experiment, or Run) and queries ENA (via Data Warehouse API) to determine the associated metadata. It then downloads FASTQ files for each Run. For Samples or Experiments with multiple Runs, users can optionally merge the runs.

Installation

Bioconda

fastq-dl is available from Bioconda and I highly recommend you go this route to for installation.

conda create -n fastq-dl -c conda-forge -c bioconda fastq-dl
conda activate fastq-dl 

Usage

fastq-dl --help
                                                                                          
 Usage: fastq-dl [OPTIONS]                                                                
                                                                                          
 Download FASTQ files from ENA or SRA.                                                    
                                                                                          
โ•ญโ”€ Required Options โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ *  --accession  -a  TEXT  ENA/SRA accession to query. (Study, Sample, Experiment, Run  โ”‚
โ”‚                           accession) [required]                                        โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
โ•ญโ”€ Additional Options โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ --provider                 [ena|sra]  Specify which provider (ENA or SRA) to use.      โ”‚
โ”‚                                       [default: ena]                                   โ”‚
โ”‚ --group-by-experiment                 Group Runs by experiment accession.              โ”‚
โ”‚ --group-by-sample                     Group Runs by sample accession.                  โ”‚
โ”‚ --outdir               -o  TEXT       Directory to output downloads to. [default: ./]  โ”‚
โ”‚ --prefix                   TEXT       Prefix to use for naming log files.              โ”‚
โ”‚                                       [default: fastq]                                 โ”‚
โ”‚ --cpus                     INTEGER    Total cpus used for downloading from SRA.        โ”‚
โ”‚                                       [default: 1]                                     โ”‚
โ”‚ --max-attempts         -m  INTEGER    Maximum number of download attempts.             โ”‚
โ”‚                                       [default: 10]                                    โ”‚
โ”‚ --only-provider        -F             Only attempt download from specified provider.   โ”‚
โ”‚ --silent                              Only critical errors will be printed.            โ”‚
โ”‚ --version              -V             Show the version and exit.                       โ”‚
โ”‚ --verbose              -v             Print debug related text.                        โ”‚
โ”‚ --help                 -h             Show this message and exit.                      โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

fastq-dl requires a single ENA/SRA Study, Sample, Experiment, or Run accession and FASTQs for all Runs that fall under the given accession will be downloaded. For example, if a Study accession is given all Runs under that studies umbrella will be downloaded. By default, fastq-dl will try to download from ENA first, then SRA.

--accession

The accession you would like to download associated FASTQS for. Currently the following types of accessions are accepted.

Accession Type Prefixes Example
BioProject PRJEB, PRJNA, PRJDB PRJEB42779, PRJNA480016, PRJDB14838
Study ERP, DRP, SRP ERP126685, DRP009283, SRP158268
BioSample SAMD, SAME, SAMN SAMD00258402, SAMEA7997453, SAMN06479985
Sample ERS, DRS, SRS ERS5684710, DRS259711, SRS2024210
Experiment ERX, DRX, SRX ERX5050800, DRX406443, SRX4563689
Run ERR, DRR, SRR ERR5260405, DRR421224, SRR7706354

The accessions are using regular expressions from the ENA Training Modules - Accession Numbers section.

--provider

fastq-dl gives you the option to download from ENA or SRA. the --provider option will specify which provider you would like to attempt downloads from first. If a download fails from the first provider, additional attempts will be made using the other provider.

ENA was selected as the default provider because the FASTQs are available directly without the need for conversion.

--only-provider

By default, fastq-dl will fallback on a secondary provider to attempt downloads. There may be cases where you would prefer to disable this feature, and that is exactly the purpose of --only-provider. When provided, if a FASTQ cannot be downloaded from the original provider, no additional attempts will be made.

--group-by-experiment & --group-by-sample

There maybe times you might want to group Run accessions based on a Experiment or Sample accessions. This will merge FASTQs associated with a Run accession based its associated Experiment accession (--group-by-experiment) or Sample accession (--group-by-sample).

Output Files

Extension Description
-run-info.tsv Tab-delimited file containing metadata for each Run downloaded
-run-mergers.tsv Tab-delimited file merge information from --group-by-experiment or --group-by-sample
.fastq.gz FASTQ files downloaded from ENA or SRA

Example Usage

Download FASTQs associated with a Study

Sometimes you might be reading a paper and they very kindly provided a Bioproject of all the samples they sequenced. So, you decide you want to download FASTQs for all the samples asscociated with the Bioproject. fastq-dl can help you with that!

fastq-dl --accession PRJNA248678 --provider SRA
fastq-dl --accession PRJNA248678

The above commands will download the 3 Runs that fall under Study accession PRJNA248678 from either SRA (--provider SRA) or ENA (without --provider).

Download FASTQs associated with an Experiment

Let's say instead of the whole Bioproject you just want a single Experiment. You can do that as well.

fastq-dl --accession SRX477044

The above command would download the Run accessions from ENA that fall under Experiment SRX477044.

The relationship of Experiment to Run is a 1-to-many relationship, or there can be many Run accessions associated with a single Experiment Accession (e.g. re-sequencing the same sample). Although in most cases, it is a 1-to-1 relationship, you can use --group-by-experiment to merge multiple runs associated with an Experiment accession into a single FASTQ file.

Download FASTQs associated with an Sample

Ok, this time you just want a single Sample, or Biosample.

fastq-dl --accession SRS1904245 --provider SRA

The above command would download the Run accessions from SRA that fall under Sample SRS1904245.

Similar to Experiment accessions, the relationship of Sample to Run is a 1-to-many relationship, or there can be many Run accessions associated with a single Sample Accession. Although in most cases, it is a 1-to-1 relationship, you can use --group-by-sample to merge multiple runs associated with an Sample accession into a single FASTQ file.

_Warning! For some type strains (e.g. S. aureus USA300) a Biosample accession might be associated with 100s or 1000s of Run accessions. These Runs are likely associated with many different conditions and really should not fall under a single BioSample accession. Please consider this when using --group-by-sample.

Download FASTQs associated with a Run

Let's keep it super simple and just download a Run.

fastq-dl --accession SRR1178105 --provider SRA

The above command would download the Run SRR1178105 from SRA. Run accessions are the end of the line (1-to-1 relationship), so you will always get the expected Run.

Alternatives

fastq-dl, is a spin-off of ena-dl, that has been developed for usage with Bactopia. With this in mind, EBI/NCBI and provide their own tools (enaBrowserTools and SRA Toolkit) that offer more extensive access to their databases.

More Repositories

1

dragonflye

๐Ÿ‰ ๐Ÿชฐ Assemble bacterial isolate genomes from Nanopore reads
Perl
114
star
2

fastq-scan

Output FASTQ summary statistics in JSON format
C++
27
star
3

vcf-annotator

Add biological annotations to variants in a given VCF file.
Python
22
star
4

pasty

A tool easily taken advantage of for in silico serogrouping of Pseudomonas aeruginosa isolates
Shell
11
star
5

pbptyper

In silico Penicillin Binding Protein (PBP) typer for Streptococcus pneumoniae assemblies
Python
9
star
6

assembly-scan

Generate basic stats for an assembly.
Python
8
star
7

camlhmp

๐ŸชClassification through yAML Heuristic Mapping Protocol ๐Ÿช
Python
7
star
8

illumina-cleanup

A simple pipeline for pre-processing Illumina FASTQ files.
Nextflow
3
star
9

goblin

GOBLIN - Generate trusted prOteins to supplement BacteriaL annotatIoN
Python
2
star
10

sccmec

A tool for typing SCCmec cassettes in assemblies
Python
2
star
11

any2tsv

Convert various bioinformatic outputs to TSV
Python
2
star
12

call_variants

A ruffus pipeline to call variants following GATK best practices.
Python
2
star
13

steamboat-py

A collection of tools/scripts for microbial bioinformatics
Python
1
star
14

sizemeup

A simple tool to determine the genome size of an organism
Python
1
star
15

workshops

A repository of trainings and workshops
1
star
16

StarCluster-Plugins

A few plugins created for Staphopia's StarCluster instances
Python
1
star
17

django-kmer-test

Tests to determine the most efficient way to store DNA k-mers.
Python
1
star
18

fitbit-steps

Pull yesterday's total steps from FitBit
Python
1
star
19

binf-scripts

Personal bioinformatic related scripts that need a home.
Python
1
star
20

admin

basic scripts for server administration
Shell
1
star