  • Language
    Python
  • License
    MIT License
  • Created over 4 years ago
  • Updated 3 months ago



A tool to find sequencing data and metadata from public databases.

ffq


! NCBI is deprecating .SRA file links. This may result in an empty list with `--ncbi`.
+ Have a cool use case for ffq? Submit a PR to the `Use cases` section and we'll feature it!

Fetch metadata information from the following databases:

  • GEO: Gene Expression Omnibus,
  • SRA: Sequence Read Archive,
  • EMBL-EBI: European Molecular Biology Laboratory’s European Bioinformatics Institute,
  • DDBJ: DNA Data Bank of Japan,
  • NIH Biosample: Biological source materials used in experimental assays,
  • ENCODE: The Encyclopedia of DNA Elements.

ffq receives an accession and returns the metadata for that accession as well as the metadata for all downstream accessions, following the connections between GEO, SRA, EMBL-EBI, DDBJ, and Biosample. If you use ffq in a publication, please cite:

Gálvez-Merchán, Á., et al. (2022). Metadata retrieval from sequence databases with ffq. bioRxiv 2022.05.18.492548.

The manuscript is available here: https://doi.org/10.1101/2022.05.18.492548.

By default, ffq returns all downstream metadata down to the level of the SRR record. However, the desired level of resolution can be specified.

ffq can also skip returning the metadata, and instead return the raw data download links from any available host (FTP, AWS, GCP or NCBI) for GEO and SRA ids.

Installation

The latest release can be installed with

pip install ffq

The development version can be installed with

pip install git+https://github.com/pachterlab/ffq

Usage

Fetch information of an accession and display it in the terminal

ffq [accession]

where [accession] is either:

  • an SRA/EBI/DDBJ accession

    • (SRR, SRX, SRS or SRP)
    • (ERR, ERX, ERS or ERP)
    • (DRR, DRS, DRX or DRP)
  • a GEO accession (GSE or GSM)

  • an ENCODE accession (ENCSR, ENCSB or ENCSD)

  • a Bioproject accession (CXR)

  • a Biosample accession (SAMN)

  • a DOI

Examples:
$ ffq SRR9990627
#=> Returns metadata for the SRR9990627 run.

$ ffq SRX7347523
#=> Returns metadata for the experiment SRX7347523 and for its associated SRR run.

$ ffq GSE129845
#=> Returns metadata for GSE129845 and for its 5 associated GSM, SRS, SRX and SRR ids.

$ ffq DRP004583
#=> Returns metadata for the study DRP004583 and its 104 associated DRS, DRX and DRR ids.

$ ffq ENCSR998WNE
#=> Returns metadata for the ENCODE experiment ENCSR998WNE.

Fetch information of multiple accessions and display it in the terminal

ffq [accession 1] [accession 2] ...

where [accession 1] and [accession 2] are accessions belonging to any of the above usage example categories.

Examples:
$ ffq SRR11181954 SRR11181955 SRR11181956
#=> Returns metadata for the three SRR runs.

$ ffq GSM4339769 GSM4339770 GSM4339771
#=> Returns metadata for the three GSM accessions, as well as for their corresponding downstream SRS, SRX and SRR accessions.

Fetch information of an accession only down to specified level

ffq -l [level] [accession]

where [level] is the number of accession levels to fetch, counting the queried accession itself as level 1.

Examples:
$ ffq -l 1 GSM4339769
#=> Returns metadata only for GSM4339769, and not for any downstream accession.

$ ffq -l 3 GSE115469
#=> Returns metadata for GSE115469 and its downstream GSM and SRS accessions.

Fetch only raw data links from the host of your choice and display them in the terminal

FTP host

ffq --ftp [accession(s)]

where [accession(s)] is either a single accession or a space-delimited list of accessions.

AWS host

ffq --aws [accession(s)]

GCP host

ffq --gcp [accession(s)]

NCBI host

ffq --ncbi [accession(s)]

Examples:
# FTP with an SRR
$ ffq --ftp SRR10668798
[
    {
        "accession": "SRR10668798",
        "filename": "SRR10668798_1.fastq.gz",
        "filetype": "fastq",
        "filesize": 31876537192,
        "filenumber": 1,
        "md5": "bf8078b5a9cc62b0fee98059f5b87fa7",
        "urltype": "ftp",
        "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR106/098/SRR10668798/SRR10668798_1.fastq.gz"
    },
...

# FTP with a GSE
$ ffq --ftp GSE115469
[
    {
        "accession": "SRR7276474",
        "filename": "P1TLH.bam",
        "filetype": "bam",
        "filesize": 48545467653,
        "filenumber": 1,
        "md5": "d0fde6bf21d9f97bdf349a3d6f0a8787",
        "urltype": "ftp",
        "url": "ftp://ftp.sra.ebi.ac.uk/vol1/SRA716/SRA716608/bam/P1TLH.bam"
    },
...

# AWS with SRX
$ ffq --aws SRX7347523
[
    {
        "accession": "SRR10668798",
        "filename": "T84_S1_L001_R1_001.fastq.1",
        "filetype": "fastq",
        "filesize": null,
        "filenumber": 1,
        "md5": null,
        "urltype": "aws",
        "url": "s3://sra-pub-src-6/SRR10668798/T84_S1_L001_R1_001.fastq.1"
    },
...

# GCP with ERS
$ ffq --gcp ERS3861775
[
    {
        "accession": "ERR3585496",
        "filename": "4834STDY7002879.bam.1",
        "filetype": "bam",
        "filesize": null,
        "filenumber": 1,
        "md5": null,
        "urltype": "gcp",
        "url": "gs://sra-pub-src-17/ERR3585496/4834STDY7002879.bam.1"
    }
]

# NCBI with GSM
$ ffq --ncbi GSM2905292
[
    {
        "accession": "SRR6425163",
        "filename": "SRR6425163.1",
        "filetype": "sra",
        "filesize": null,
        "filenumber": 1,
        "md5": null,
        "urltype": "ncbi",
        "url": "https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-13/SRR6425163/SRR6425163.1"
    }
]
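Because every host flag returns the same list-of-records layout, the output is easy to post-process in a script. The following is a minimal sketch, not part of ffq: `urls_by_accession` and `total_size` are hypothetical helper names, and the sample records are the SRR10668798 FTP records shown in this README.

```python
import json
from collections import defaultdict

# Sample records as printed by `ffq --ftp SRR10668798` in this README.
SAMPLE = """
[
    {"accession": "SRR10668798", "filename": "SRR10668798_1.fastq.gz",
     "filetype": "fastq", "filesize": 31876537192, "filenumber": 1,
     "md5": "bf8078b5a9cc62b0fee98059f5b87fa7", "urltype": "ftp",
     "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR106/098/SRR10668798/SRR10668798_1.fastq.gz"},
    {"accession": "SRR10668798", "filename": "SRR10668798_2.fastq.gz",
     "filetype": "fastq", "filesize": 43760586944, "filenumber": 2,
     "md5": "351df47dca211c1f66ef327e280bd4fd", "urltype": "ftp",
     "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR106/098/SRR10668798/SRR10668798_2.fastq.gz"}
]
"""


def urls_by_accession(records, filetype=None):
    """Group URLs by accession, optionally keeping a single file type.

    Hypothetical helper for illustration; it is not provided by ffq.
    """
    grouped = defaultdict(list)
    for record in records:
        if filetype is None or record["filetype"] == filetype:
            grouped[record["accession"]].append(record["url"])
    return dict(grouped)


def total_size(records):
    """Sum the reported file sizes in bytes; a null filesize counts as 0."""
    return sum(record["filesize"] or 0 for record in records)


records = json.loads(SAMPLE)
links = urls_by_accession(records, filetype="fastq")
print(links)
print(total_size(records))
```

The `or 0` guard matters for the AWS, GCP and NCBI examples above, where `filesize` is `null`.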

Write accession information to a single JSON file

ffq -o [JSON_PATH] [accession(s)]

where [JSON_PATH] is the path to the JSON file that will contain the information and [accession(s)] is either a single accession or a space-delimited list of accessions.

Write accession information to multiple JSON files, one file per accession

ffq -o [OUT_DIR] --split [accessions]

where [OUT_DIR] is the path to the directory in which the JSON files will be written and [accessions] is a space-delimited list of accessions. Information about each accession will be written to its own separate JSON file named [accession].json.
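A directory written with --split can be read back into a single dictionary keyed by accession. This is a minimal sketch that relies only on the [accession].json naming described above; `load_split_output` is a hypothetical helper name, and nothing is assumed about the structure inside each file.

```python
import json
from pathlib import Path


def load_split_output(out_dir):
    """Map each accession (the file name without .json) to its parsed JSON.

    Hypothetical helper for illustration; it is not provided by ffq.
    """
    return {
        path.stem: json.loads(path.read_text())
        for path in sorted(Path(out_dir).glob("*.json"))
    }
```

For example, after `ffq -o out --split GSM4339769 GSM4339770`, calling `load_split_output("out")` should return a dictionary with one entry per accession.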

Fetch information of all studies (and all of their runs) in one or more papers

ffq [DOIS]

where [DOIS] is a space-delimited list of one or more DOIs. The output is a JSON-formatted string (or a JSON file if -o is provided) with SRA study accessions as keys. When --split is also provided, each study is written to its own separate JSON.

Complete output examples

Examples of complete outputs are available in the examples directory.

Downloading data

ffq is specifically designed to download metadata and to facilitate obtaining links to sequence files. To download raw data from the links obtained with ffq you can use one of the following:

FTP

By default, cURL is installed on most computers and can be used to download files with FTP links. Alternatively, wget can be used.

# Obtain FTP links
$ ffq --ftp SRR10668798
[
    {
        "accession": "SRR10668798",
        "filename": "SRR10668798_1.fastq.gz",
        "filetype": "fastq",
        "filesize": 31876537192,
        "filenumber": 1,
        "md5": "bf8078b5a9cc62b0fee98059f5b87fa7",
        "urltype": "ftp",
        "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR106/098/SRR10668798/SRR10668798_1.fastq.gz"
    },
    {
        "accession": "SRR10668798",
        "filename": "SRR10668798_2.fastq.gz",
        "filetype": "fastq",
        "filesize": 43760586944,
        "filenumber": 2,
        "md5": "351df47dca211c1f66ef327e280bd4fd",
        "urltype": "ftp",
        "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR106/098/SRR10668798/SRR10668798_2.fastq.gz"
    }
]

# Download the files one-by-one
$ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR106/098/SRR10668798/SRR10668798_1.fastq.gz 
$ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR106/098/SRR10668798/SRR10668798_2.fastq.gz 

Alternatively, the URLs can be extracted from the JSON output with jq and then piped into cURL. Because cURL applies -O to a single URL, xargs -n 1 is used to run cURL once per URL.

$ ffq --ftp SRR10668798 | jq -r '.[] | .url' | xargs -n 1 curl -O

If you don't have jq installed, you can use grep, which is available by default on most systems.

$ ffq --ftp SRR10668798 | grep -Eo '"url": "[^"]*"' | grep -o '"[^"]*"$' | xargs -n 1 curl -O
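The FTP records include an `md5` field, which can be used to check a finished download. The sketch below assumes that the reported value is the MD5 checksum of the file exactly as downloaded from the `url` field; `md5_of_file` and `verify_download` are hypothetical helper names, not part of ffq.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(record, directory="."):
    """Compare a downloaded file against the md5 reported in an ffq record.

    Returns True on a match, False on a mismatch, and None when the record
    carries no checksum (md5 is null for some hosts).
    Assumption: the md5 field describes the file as downloaded from `url`.
    """
    expected = record.get("md5")
    if expected is None:
        return None
    path = os.path.join(directory, record["filename"])
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use flat, which matters for FASTQ files that are tens of gigabytes in size.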

AWS

In order to download files from AWS, the aws tool must be installed and credentials must be set up.

# Pipe AWS links to aws s3 cp and download
$ ffq --aws SRX7347523 | jq -r '.[] | .url' | xargs -I {} aws s3 cp {} .

GCP

In order to download files from GCP, the gsutil tool must be installed and credentials must be set up.

# Pipe GCP links to gsutil cp and download
$ ffq --gcp ERS3861775 | jq -r '.[] | .url' | xargs -I {} gsutil cp {} .

NCBI-SRA

SRA files downloaded from NCBI can be converted to FASTQ files using fastq-dump or the improved fasterq-dump, both of which are installed as part of the SRA Toolkit.

# Pipe SRA link to curl and download the SRA file
$ ffq --ncbi GSM2905292 | jq -r '.[] | .url' | xargs curl -O

# Convert the SRA file to FASTQ files with one of the following
$ fastq-dump   ./SRR6425163.1 --split-files --include-technical -O ./SRR6425163 --gzip 
$ fasterq-dump ./SRR6425163.1 --split-files --include-technical -O ./SRR6425163        # fasterq-dump does not have gzip option

Use cases

ffq facilitates the acquisition of publicly available sequencing data to help answer relevant research questions.

The following was submitted by @sbooeshaghi.

# Goal: quantify publicly available scRNAseq data
$ pip install kb-python gget ffq
$ kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa $(gget ref --ftp -w dna,gtf homo_sapiens)
$ kb count -i index.idx -g t2g.txt -x 10xv3 -o out $(ffq --ftp SRR10668798 | jq -r '.[] | .url' | tr '\n' ' ')
# -> count matrix in out/ folder

# Goal: count the total number of reads
$ ffq SRR10668798 | jq '.. | ."ENA-SPOT-COUNT"? | select(. != null)' |  paste -sd+ - | bc
624886427

# Goal: check the total size of the FASTQ files
$ ffq --ftp SRR10668798 | jq '.[] | .filesize ' | paste -sd+ - | bc | numfmt --to=iec-i --suffix=B
71GiB

# Goal: count the number of FASTQ files
$ ffq --ftp SRR10668798 | jq -r 'length'
2

# Goal: get sequence stats for the first 100 entries with seqkit
$ curl -s $(ffq --ftp SRR10668798 | jq -r '.[0] | .url') | zcat | head -400 | seqkit stats -a
file  format  type  num_seqs  sum_len  min_len  avg_len  max_len  Q1  Q2  Q3  sum_gap  N50  Q20(%)  Q30(%)
-     FASTQ   DNA        100    2,600       26       26       26  13  26  13        0   26   95.31   92.92

The following was submitted by @agalvezm.

# Goal: print the first 3 sequences of read 1 to the screen
$ curl -s $(ffq --ftp SRR10668798 | jq -r '.[0] | .url') | zcat | awk '(NR-2)%4==0' | head -n 3
NCCAAATAGGAATTACATACACCCCC
NAACCTGAGTAGATGTGTTGTTAACT
NGATCTGAGAACTCGGAACTATTTTC

# Goal: get number of counts per unique read sequence from the first 10000 reads
$ curl -s $(ffq --ftp accession | jq -r '.[0] | .url') | zcat | awk '(NR-2)%4==0'| head -n 10000 | sort | uniq -c | sort -r
4 TACACGACACTTAACGATCGGCCTTC
4 GTACTTTAGGCCCGTTTGTGTGCGAT
4 GACGGCTAGTACATGATATAACAAGC
...

The following was submitted by @telatin.

# Goal: concurrent download of a set of FASTQ files given a list of IDs (list.txt)
# (Requires Nextflow and Docker, or Conda, to be installed. Pipeline and dependencies will be installed automatically)
$ nextflow run telatin/getreads -r main -profile docker --list list.txt --outdir downloaded-reads/

For instructions on how to install Nextflow and Docker, or Conda, see the installation instructions.

Do you have a cool use case for ffq? Submit a PR (including the goal, code snippet, and your username) so that we can feature it here.

Failure modes

Many factors, independent of ffq, may result in failure to fetch metadata or in missing metadata, including:

  1. broken internet connection
  2. improperly formatted accession
  3. recently submitted data to SRA (not synced with ENA)
  4. exceeded request rate for servers
  5. missing metadata from online database

If you believe you have identified a bug in ffq, please see the section on contributing.
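Transient failures such as a dropped connection or an exceeded request rate can often be handled by retrying after a pause. The following is a minimal sketch of retrying with exponential backoff around a command-line call. It assumes that ffq exits with a non-zero status when a fetch fails, which this README does not state; `run_with_retries` is a hypothetical helper name.

```python
import subprocess
import time


def run_with_retries(command, attempts=4, base_delay=2.0,
                     runner=subprocess.run, sleep=time.sleep):
    """Run `command`, retrying with exponential backoff on a non-zero exit.

    Assumption: the command signals failure through its exit status.
    `runner` and `sleep` are injectable so the logic can be tested offline.
    """
    for attempt in range(attempts):
        result = runner(command, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        if attempt < attempts - 1:
            # Wait 2 s, 4 s, 8 s, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(
        f"{command!r} failed after {attempts} attempts: {result.stderr.strip()}"
    )

# Example (requires ffq on PATH and network access):
# output = run_with_retries(["ffq", "SRR9990627"])
```

Retrying does not help with an improperly formatted accession or with metadata that is missing from the database, so the number of attempts is kept small.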

Contributing

Thank you for wanting to improve ffq! If you have a bug that is related to ffq please create an issue. The issue should contain

  1. the ffq command, run with --verbose,
  2. the error message, and
  3. the ffq and Python versions.

Please make all Pull Requests against the devel branch and include a message detailing the exact changes made, the reasons for the change, and tests that check for the correctness of those changes.

Some tips for improving the ffq code base:

  • the developer dependencies can be installed with pip install -r dev-requirements.txt
  • unit tests can be added to the ./tests/test_*.py files
  • code reformatting can be performed by running black ffq/
  • code quality can be checked by running make check
  • tests can be performed by running make test

Caveats and limitations

ffq relies on the information provided by the different APIs it uses to retrieve metadata (hosted by ENA, NCBI, ENCODE, etc.). Therefore, returning consistent and accurate metadata depends on the accuracy and consistency of those databases. Unfortunately, we have observed instances where some APIs are updated without notice. This leads to inconsistent metadata retrieval by ffq that cannot be solved on our end.

For example, as of May 29th, the command:

ffq --ncbi SRR6835844

returned:

[{'accession': 'SRR6835844',
  'filename': 'SRR6835844.1',
  'filenumber': 1,
  'filesize': None,
  'filetype': 'sra',
  'md5': None,
  'url': 'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-13/SRR6835844/SRR6835844.1',
  'urltype': 'ncbi'}]

On June 1st, we detected an error in one of ffq’s tests. Running the same command led to the following output:

[]

Investigating this issue, we discovered that the output of the E-utilities efetch tool had changed (compare the files SRR6835844_altlinks_old.txt and SRR6835844_altlinks_new.txt in tests/fixtures). In the new output, NCBI-hosted links were no longer provided. This affects a large number of accessions, not only SRR6835844. We have updated our tests accordingly and will continue to monitor the situation.
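Since a host can stop providing links without notice, a script that depends on ffq may want to fall back to another host when one returns an empty list. The sketch below tries the four host flags documented above in turn. It assumes that ffq is on the PATH and that its standard output contains only the JSON list; `ffq_links` and `first_available_links` are hypothetical helper names.

```python
import json
import subprocess

# Host flags documented in this README, in the order they will be tried.
HOST_FLAGS = ("--ncbi", "--ftp", "--aws", "--gcp")


def ffq_links(flag, accession):
    """Run `ffq <flag> <accession>` and parse the JSON list it prints.

    Assumption: stdout holds only the JSON output shown in the examples.
    """
    completed = subprocess.run(
        ["ffq", flag, accession], capture_output=True, text=True, check=True
    )
    return json.loads(completed.stdout)


def first_available_links(accession, flags=HOST_FLAGS, fetch=ffq_links):
    """Return (flag, links) for the first host with a non-empty list.

    Returns (None, []) when every host comes back empty.
    """
    for flag in flags:
        links = fetch(flag, accession)
        if links:
            return flag, links
    return None, []

# Example (requires ffq on PATH and network access):
# flag, links = first_available_links("SRR6835844")
```

The `fetch` parameter exists so that the fallback order can be exercised without network access, for instance in a unit test.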

Naming

ffq is short for FetchFastQ.

Cite

@article{galvez2022metadata,
  title={Metadata retrieval from sequence databases with ffq},
  author={G{\'a}lvez-Merch{\'a}n, {\'A}ngel and Min, Kyung Hoi Joseph and Pachter, Lior and Booeshaghi, A. Sina},
  year={2022}
}
