
rasusa


Randomly subsample sequencing reads to a specified coverage.

Hall, M. B., (2022). Rasusa: Randomly subsample sequencing reads to a specified coverage. Journal of Open Source Software, 7(69), 3941, https://doi.org/10.21105/joss.03941


Motivation

I couldn't find a tool for subsampling reads that met my requirements. All the strategies I could find fell short: they either just wanted a number or percentage of reads to subsample to, or, if they did subsample to a coverage, they assumed all reads were the same length (i.e. Illumina). As I mostly work with long-read data, this posed a problem if I wanted to subsample a file to a certain coverage, because read length was never taken into account. rasusa addresses this shortcoming.

A workaround I had been using for a while was filtlong. It was simple enough: I just work out the number of bases I need to achieve a (theoretical) coverage for my sample. Say I have a fastq from an E. coli sample with 5 million reads and I want to subsample it to 50x coverage. I just multiply the expected size of the sample's genome, 4.6 million base pairs, by the coverage I want, and I have my target bases - 230 million base pairs. In filtlong, I can do the following

target=230000000
filtlong --target_bases "$target" reads.fq > reads.50x.fq
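
If you'd rather not do the multiplication by hand, the target can also be computed in the shell first (a small sketch; the variable names are my own)

GENOME_SIZE=4600000                # E. coli genome size in base pairs
COVG=50                            # desired coverage
target=$(( GENOME_SIZE * COVG ))   # 230,000,000
filtlong --target_bases "$target" reads.fq > reads.50x.fq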

However, this is technically not the intended function of filtlong; it's a quality filtering tool. What you get in the end is a subset of the "highest scoring" reads at a (theoretical) coverage of 50x. Depending on your circumstances, this might be what you want. However, you bias yourself towards the best/longest reads in the dataset - not a fair representation of your dataset as a whole. There is also the possibility of favouring regions of the genome that produce longer/higher quality reads. De Maio et al. even found that by randomly subsampling nanopore reads you achieve better genome assemblies than if you had filtered.

So, depending on your circumstances, an unbiased subsample of your reads might be what you need. And if this is the case, rasusa has you covered.

Install

tl;dr: precompiled binary

curl -sSL rasusa.mbh.sh | sh
# or with wget
wget -nv -O - rasusa.mbh.sh | sh

You can also pass options to the script like so

$ curl -sSL rasusa.mbh.sh | sh -s -- --help
install.sh [option]

Fetch and install the latest version of rasusa, if rasusa is already
installed it will be updated to the latest version.

Options
        -V, --verbose
                Enable verbose output for the installer

        -f, -y, --force, --yes
                Skip the confirmation prompt during installation

        -p, --platform
                Override the platform identified by the installer [default: apple-darwin]

        -b, --bin-dir
                Override the bin installation directory [default: /usr/local/bin]

        -a, --arch
                Override the architecture identified by the installer [default: x86_64]

        -B, --base-url
                Override the base URL used for downloading releases [default: https://github.com/mbhall88/ssubmit/releases]

        -h, --help
                Display this help message

cargo

Crates.io

Prerequisite: rust toolchain (min. v1.64.0)

cargo install rasusa

conda


Prerequisite: conda (and bioconda channel correctly set up)

conda install rasusa

Thank you to Devon Ryan (@dpryan79) for help debugging the bioconda recipe.

Container

Docker images are hosted at quay.io. For versions 0.3.0 and earlier, the images were hosted on Dockerhub.

singularity

Prerequisite: singularity

URI="docker://quay.io/mbhall88/rasusa"
singularity exec "$URI" rasusa --help

The above will use the latest version. If you want to specify a version then use a tag (or commit) like so.

VERSION="0.7.1"
URI="docker://quay.io/mbhall88/rasusa:${VERSION}"

docker


Prerequisite: docker

docker pull quay.io/mbhall88/rasusa
docker run quay.io/mbhall88/rasusa --help

You can find all the available tags on the quay.io repository. Note: versions prior to 0.4.0 were housed on Docker Hub.

Build locally

Prerequisite: rust toolchain

git clone https://github.com/mbhall88/rasusa.git
cd rasusa
cargo build --release
target/release/rasusa --help
# if you want to check everything is working ok
cargo test --all

Usage

Basic usage

rasusa --input in.fq --coverage 30 --genome-size 4.6mb

The above command will output the subsampled file to stdout.

Or, if you have paired Illumina reads

rasusa -i r1.fq -i r2.fq --coverage 30 --genome-size 4g -o out.r1.fq -o out.r2.fq

For more details on the above options, and additional options, see below.

Required parameters

There are three required options to run rasusa.

Input

-i, --input

This option specifies the file(s) containing the reads you would like to subsample. The file(s) must be valid fasta or fastq format and can be compressed (with a tool such as gzip).
Illumina paired files can be passed in two ways.

  1. Using --input twice -i r1.fq -i r2.fq
  2. Using --input once, but passing both files immediately after -i r1.fq r2.fq

Bash wizard tip 🧙: Let globs do the work for you -i r*.fq
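
For instance, the following two commands are equivalent ways of passing a read pair (a sketch; the file names and the 30x/4.6mb values are purely illustrative)

rasusa -i r1.fq -i r2.fq --coverage 30 --genome-size 4.6mb -o out.r1.fq -o out.r2.fq
rasusa -i r1.fq r2.fq --coverage 30 --genome-size 4.6mb -o out.r1.fq -o out.r2.fq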

Coverage

-c, --coverage

Not required if --bases is present

This option is used to determine the minimum coverage to subsample the reads to. It can be specified as an integer (100), a decimal/float (100.7), or either of the previous suffixed with an 'x' (100x).

Note: Due to the method used to determine how many bases are required to achieve the desired coverage, the actual coverage may end up slightly higher than requested - for example, if the last included read is very long. The log messages will tell you the actual coverage obtained.
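
As an example, all of the following are valid ways of specifying coverage (a sketch; the file name and genome size are illustrative)

rasusa -i in.fq -g 4.6mb -c 100
rasusa -i in.fq -g 4.6mb -c 100.7
rasusa -i in.fq -g 4.6mb -c 100x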

Genome size

-g, --genome-size

Not required if --bases is present

The genome size of the input is also required. It is used to determine how many bases are necessary to achieve the desired coverage. This can, of course, be as precise or rough as you like.
Genome size can be passed in many ways: as a plain old integer (1600), or with a metric suffix (1.6kb). All metric suffixes can have an optional 'b' suffix and be lower, upper, or mixed case. So 'Kb', 'kb' and 'k' would all be inferred as 'kilo'. Valid metric suffixes include:

  • Base (b) - multiplies by 1
  • Kilo (k) - multiplies by 1,000
  • Mega (m) - multiplies by 1,000,000
  • Giga (g) - multiplies by 1,000,000,000
  • Tera (t) - multiplies by 1,000,000,000,000

Alternatively, a FASTA/Q index file can be given and the genome size will be set to the sum of all reference sequences in it.
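
For example, the first three commands below all set the genome size to 4,600,000 base pairs, while the last sums the sequence lengths in a fasta index instead (a sketch; the file names, including ref.fa.fai, are assumed for illustration)

rasusa -i in.fq -c 30 -g 4600000
rasusa -i in.fq -c 30 -g 4600k
rasusa -i in.fq -c 30 -g 4.6mb
rasusa -i in.fq -c 30 -g ref.fa.fai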

Optional parameters

Output

-o, --output

NOTE: This parameter is required if passing paired Illumina data.

By default, rasusa will output the subsampled file to stdout (if one file is given). If you would prefer to specify an output file path, then use this option.

Output for Illumina paired files can be specified in the same manner as --input

  1. Using --output twice -o out.r1.fq -o out.r2.fq
  2. Using --output once, but passing both files immediately after -o out.r1.fq out.r2.fq

The ordering of the output files is assumed to be the same as the input.
Note: The output will always be in the same format as the input. You cannot pass fastq as input and ask for fasta as output.

rasusa will also attempt to automatically infer whether compression of the output file(s) is required. It does this by detecting any of the supported extensions (see the example after this list):

  • .gz: will compress the output with gzip
  • .bz or .bz2: will compress the output with bzip2
  • .lzma: will compress the output with the xz LZMA algorithm
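
For example, based on the extension alone, the following sketch writes gzip-compressed output (file names are illustrative)

rasusa -i in.fq -c 30 -g 4.6mb -o out.fq.gz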

Output compression format

-O, --output-type

Use this option to manually set the compression algorithm to use for the output file(s). It will override any format automatically detected from the output path (an example follows the list of valid options).

Valid options are:

  • g: gzip
  • b: bzip2
  • l: xz LZMA algorithm
  • u: no compression

Note: these options are case insensitive.
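
For example, to force bzip2 compression even though the output extension suggests gzip (a sketch; file names are illustrative)

rasusa -i in.fq -c 30 -g 4.6mb -o out.fq.gz -O b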

Compression level

-l, --compress-level

Compression level to use if compressing the output. 1 is for fastest/least compression and 9 is for slowest/best. By default this is set to 6, which is also the default for most compression programs.
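
For instance, to ask for the best (slowest) gzip compression (a sketch; file names are illustrative)

rasusa -i in.fq -c 30 -g 4.6mb -o out.fq.gz -l 9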

Target number of bases

-b, --bases

Explicitly set the number of bases required in the subsample. This option takes the number in the same format as genome size.

Note: if this option is given, genome size and coverage are not required; if they are provided, they are ignored.
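
For example, to take 230 megabases of reads without specifying a genome size or coverage (a sketch; file names are illustrative)

rasusa -i in.fq -b 230mb -o out.fq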

Number of reads

-n, --num

Explicitly set the number of reads in the subsample. This option takes the number in the same format as genome size.

When providing paired reads as input, this option will sample this many total read pairs. For example, when passing -n 20 -i r1.fq r2.fq, the two output files will have 20 reads each, and the read ids will be the same in both.

Note: if this option is given, genome size and coverage are not required.
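
For example, to take 20,000 read pairs from paired Illumina input (a sketch; file names are illustrative)

rasusa -i r1.fq r2.fq -n 20k -o out.r1.fq -o out.r2.fq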

Fraction of reads

-f, --frac

Explicitly set the fraction of total reads in the subsample. The value given to this option can be a float or a percentage - i.e., -f 0.5 and -f 50 will both take half of the reads.

Note: if this option is given, genome size and coverage are not required.
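
For example, both of the following keep half of the reads (a sketch; file names are illustrative)

rasusa -i in.fq -f 0.5 -o out.fq
rasusa -i in.fq -f 50 -o out.fq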

Random seed

-s, --seed

This option allows you to specify the random seed used by the random subsampler. By explicitly setting this parameter, you make the subsample for the input reproducible. The seed is an integer, and by default it is not set, meaning the operating system will seed the random subsampler. You should only pass this parameter if you are likely to want to subsample the same input file again in the future and want the same subset of reads.
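
For example, rerunning the following command on the same input will always produce the same subsample (a sketch; the seed value and file names are illustrative)

rasusa -i in.fq -c 30 -g 4.6mb -s 88 -o out.fq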

Verbosity

-v

Adding this optional flag will make the logging more verbose. By default, logging will produce messages considered "info" or above (see here for more details). If verbosity is switched on, you will additionally get "debug" level logging messages.

Full usage

$ rasusa --help

rasusa 0.7.1
Michael Hall <[email protected]>
Randomly subsample reads to a specified coverage

USAGE:
    rasusa [OPTIONS] --input <INPUT>...

OPTIONS:
    -b, --bases <bases>
            Explicitly set the number of bases required e.g., 4.3kb, 7Tb, 9000, 4.1MB

            If this option is given, --coverage and --genome-size are ignored

    -c, --coverage <FLOAT>
            The desired coverage to sub-sample the reads to

            If --bases is not provided, this option and --genome-size are required

    -f, --frac <FLOAT>
            Subsample to a fraction of the reads - e.g., 0.5 samples half the reads

            Values >1 and <=100 will be automatically converted - e.g., 25 => 0.25

    -g, --genome-size <size|faidx>
            Genome size to calculate coverage with respect to. e.g., 4.3kb, 7Tb, 9000, 4.1MB

            Alternatively, a FASTA/Q index file can be provided and the genome size will be set to
            the sum of all reference sequences.

            If --bases is not provided, this option and --coverage are required

    -h, --help
            Print help information

    -i, --input <INPUT>...
            The fast{a,q} file(s) to subsample.

            For paired Illumina you may either pass this flag twice `-i r1.fq -i r2.fq` or give two
            files consecutively `-i r1.fq r2.fq`.

    -l, --compress-level <1-9>
            Compression level to use if compressing output

            [default: 6]

    -n, --num <INT>
            Subsample to a specific number of reads

            If paired-end reads are passed, this is the number of (matched) reads from EACH file.
            This option accepts the same format as genome size - e.g., 1k will take 1000 reads

    -o, --output <OUTPUT>...
            Output filepath(s); stdout if not present.

            For paired Illumina you may either pass this flag twice `-o o1.fq -o o2.fq` or give two
            files consecutively `-o o1.fq o2.fq`. NOTE: The order of the pairs is assumed to be the
            same as that given for --input. This option is required for paired input.

    -O, --output-type <u|b|g|l>
            u: uncompressed; b: Bzip2; g: Gzip; l: Lzma

            Rasusa will attempt to infer the output compression format automatically from the
            filename extension. This option is used to override that. If writing to stdout, the
            default is uncompressed

    -s, --seed <INT>
            Random seed to use

    -v
            Switch on verbosity

    -V, --version
            Print version information

Snakemake

If you want to use rasusa in a snakemake pipeline, it is advised to use the wrapper.

rule subsample:
    input:
        r1="{sample}.r1.fq",
        r2="{sample}.r2.fq",
    output:
        r1="{sample}.subsampled.r1.fq",
        r2="{sample}.subsampled.r2.fq",
    params:
        options="--seed 15",  # optional
        genome_size="3mb",  # required
        coverage=20,  # required
    log:
        "logs/subsample/{sample}.log",
    wrapper:
        "0.70.0/bio/rasusa"

See the latest wrapper documentation for the most up-to-date version number.

Benchmark

“Time flies like an arrow; fruit flies like a banana.”
― Anthony G. Oettinger

The real question is: will rasusa just needlessly eat away at your precious time on earth?

To do this benchmark, I am going to use hyperfine.

The data I used comes from

Bainomugisa, Arnold, et al. "A complete high-quality MinION nanopore assembly of an extensively drug-resistant Mycobacterium tuberculosis Beijing lineage strain identifies novel variation in repetitive PE/PPE gene regions." Microbial genomics 4.7 (2018).

Single long read input

Download and rename the fastq

URL="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR649/008/SRR6490088/SRR6490088_1.fastq.gz"
wget "$URL" -O - | gzip -d -c > tb.fq

The file size is 2.9G, and it has 379,547 reads.
We benchmark against filtlong using the same strategy outlined in Motivation.

TB_GENOME_SIZE=4411532
COVG=50
TARGET_BASES=$(( TB_GENOME_SIZE * COVG ))
FILTLONG_CMD="filtlong --target_bases $TARGET_BASES tb.fq"
RASUSA_CMD="rasusa -i tb.fq -c $COVG -g $TB_GENOME_SIZE -s 1"
hyperfine --warmup 3 --runs 10 --export-markdown results-single.md \
     "$FILTLONG_CMD" "$RASUSA_CMD" 

Results

Command | Mean [s] | Min [s] | Max [s] | Relative
filtlong --target_bases 220576600 tb.fq | 21.685 ± 0.055 | 21.622 | 21.787 | 21.77 ± 0.29
rasusa -i tb.fq -c 50 -g 4411532 -s 1 | 0.996 ± 0.013 | 0.983 | 1.023 | 1.00

Summary: rasusa ran 21.77 ± 0.29 times faster than filtlong.

Paired-end input

Download and then deinterleave the fastq with pyfastaq

URL="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR648/008/SRR6488968/SRR6488968.fastq.gz"
wget "$URL" -O - | gzip -d -c - | fastaq deinterleave - r1.fq r2.fq

Each file is 179M in size and has 283,590 reads.
For this benchmark, we will use seqtk. We will also test seqtk's 2-pass mode as this is analogous to rasusa.

NUM_READS=140000
SEQTK_CMD_1="seqtk sample -s 1 r1.fq $NUM_READS > /tmp/r1.fq; seqtk sample -s 1 r2.fq $NUM_READS > /tmp/r2.fq;"
SEQTK_CMD_2="seqtk sample -2 -s 1 r1.fq $NUM_READS > /tmp/r1.fq; seqtk sample -2 -s 1 r2.fq $NUM_READS > /tmp/r2.fq;"
RASUSA_CMD="rasusa -i r1.fq r2.fq -n $NUM_READS -s 1 -o /tmp/r1.fq -o /tmp/r2.fq"
hyperfine --warmup 10 --runs 100 --export-markdown results-paired.md \
     "$SEQTK_CMD_1" "$SEQTK_CMD_2" "$RASUSA_CMD"

Results

Command | Mean [ms] | Min [ms] | Max [ms] | Relative
seqtk sample -s 1 r1.fq 140000 > /tmp/r1.fq; seqtk sample -s 1 r2.fq 140000 > /tmp/r2.fq; | 907.7 ± 23.6 | 875.4 | 997.8 | 1.84 ± 0.62
seqtk sample -2 -s 1 r1.fq 140000 > /tmp/r1.fq; seqtk sample -2 -s 1 r2.fq 140000 > /tmp/r2.fq; | 870.8 ± 54.9 | 818.2 | 1219.8 | 1.77 ± 0.61
rasusa -i r1.fq r2.fq -n 140000 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq | 492.2 ± 165.4 | 327.4 | 887.4 | 1.00

Summary: rasusa ran 1.84 ± 0.62 times faster than seqtk (1-pass) and 1.77 ± 0.61 times faster than seqtk (2-pass).

So, rasusa is faster than seqtk, but it doesn't require you to specify a fixed number of reads - allowing you to avoid doing maths to determine how many reads you need to downsample to a specific coverage. 🤓

Contributing

If you would like to help improve rasusa you are very welcome!

For changes to be accepted, they must pass the CI and coverage checks. These include:

  • Code is formatted with rustfmt. This can be done by running cargo fmt in the project directory.
  • There are no compiler errors/warnings. You can check this by running cargo clippy --all-features --all-targets -- -D warnings
  • Code coverage has not decreased. If you want to check coverage before pushing changes, I use kcov.

Citing

If you use rasusa in your research, it would be very much appreciated if you could cite it.


Hall, M. B., (2022). Rasusa: Randomly subsample sequencing reads to a specified coverage. Journal of Open Source Software, 7(69), 3941, https://doi.org/10.21105/joss.03941

Bibtex

@article{Hall2022,
  doi = {10.21105/joss.03941},
  url = {https://doi.org/10.21105/joss.03941},
  year = {2022},
  publisher = {The Open Journal},
  volume = {7},
  number = {69},
  pages = {3941},
  author = {Michael B. Hall},
  title = {Rasusa: Randomly subsample sequencing reads to a specified coverage},
  journal = {Journal of Open Source Software}
}
