  • Language
    Python
  • License
    MIT License
  • Created over 7 years ago
  • Updated 28 days ago

Repository Details

Plotting scripts for long read sequencing data

NanoPlot

Plotting tool for long read sequencing data and alignments.

NanoPlot is also available as a web service.

Example plot

The example plot above shows a bivariate plot comparing log transformed read length with average basecall Phred quality score. More examples can be found in the gallery on my blog 'Gigabase Or Gigabyte'.
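
As a rough illustration of the two axes in such a plot, the sketch below computes a log-transformed read length and an average Phred quality for a single read. It assumes Phred+33 encoded quality strings and averages over error probabilities rather than over raw scores; the function names are hypothetical and this is not necessarily how NanoPlot computes its values internally.

```python
import math


def mean_phred(quality_string, offset=33):
    """Average Phred score of one read, averaged over error probabilities.

    Assumes Phred+33 (Sanger) encoding; `offset` is an illustrative parameter.
    """
    # Decode each ASCII character into a Phred score.
    scores = [ord(ch) - offset for ch in quality_string]
    # Convert scores to error probabilities and average those, because
    # Phred scores are logarithmic and a plain mean would overstate quality.
    error_probs = [10 ** (-q / 10) for q in scores]
    mean_error = sum(error_probs) / len(error_probs)
    return -10 * math.log10(mean_error)


def plot_coordinates(sequence, quality_string):
    """Return (log10 read length, mean Phred quality) for one read."""
    return math.log10(len(sequence)), mean_phred(quality_string)


# A 1000 bp read in which every base has Phred quality 20 ("5" is ASCII 53).
x, y = plot_coordinates("ACGT" * 250, "5" * 1000)
print(round(x, 2), round(y, 2))
```

Averaging error probabilities means that a few low-quality bases pull the read average down more than a plain arithmetic mean of the scores would.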

In addition to the plots, NanoPlot creates a NanoStats file summarizing key features of the dataset.

This script performs data extraction from Oxford Nanopore sequencing data in the following formats:

  • fastq files
    (can be bgzip, bzip2 or gzip compressed)
  • fastq files generated by albacore, guppy or MinKNOW containing additional information
    (can be bgzip, bzip2 or gzip compressed)
  • sorted bam files
  • sequencing_summary.txt output table generated by albacore, guppy or MinKNOW basecalling (can be gzip, bz2, zip or xz compressed)
  • fasta files (can be bgzip, bzip2 or gzip compressed)

Multiple files of the same type can be offered simultaneously.

INSTALLATION

pip install NanoPlot

Upgrade to a newer version using:
pip install NanoPlot --upgrade

or

conda install -c bioconda nanoplot

The script is written for Python 3.

OUTPUT

NanoPlot creates:

  • a statistical summary
  • a number of plots
  • an HTML summary file

USAGE

usage: NanoPlot [-h] [-v] [-t THREADS] [--verbose] [--store] [--raw] [--huge] [-o OUTDIR] [--no_static] [-p PREFIX] [--tsv_stats] [--info_in_report] [--maxlength N]
                [--minlength N] [--drop_outliers] [--downsample N] [--loglength] [--percentqual] [--alength] [--minqual N] [--runtime_until N] [--readtype {1D,2D,1D2}]
                [--barcoded] [--no_supplementary] [-c COLOR] [-cm COLORMAP] [-f [{png,jpg,jpeg,webp,svg,pdf,eps,json} ...]] [--plots [{kde,hex,dot} ...]]
                [--legacy [{kde,dot,hex} ...]] [--listcolors] [--listcolormaps] [--no-N50] [--N50] [--title TITLE] [--font_scale FONT_SCALE] [--dpi DPI] [--hide_stats]
                (--fastq file [file ...] | --fasta file [file ...] | --fastq_rich file [file ...] | --fastq_minimal file [file ...] | --summary file [file ...] | --bam file [file ...] | --ubam file [file ...] | --cram file [file ...] | --pickle pickle | --feather file [file ...])

CREATES VARIOUS PLOTS FOR LONG READ SEQUENCING DATA.

General options:
  -h, --help            show the help and exit
  -v, --version         Print version and exit.
  -t, --threads THREADS
                        Set the allowed number of threads to be used by the script
  --verbose             Write log messages also to terminal.
  --store               Store the extracted data in a pickle file for future plotting.
  --raw                 Store the extracted data in tab separated file.
  --huge                Input data is one very large file.
  -o, --outdir OUTDIR   Specify directory in which output has to be created.
  --no_static           Do not make static (png) plots.
  -p, --prefix PREFIX   Specify an optional prefix to be used for the output files.
  --tsv_stats           Output the stats file as a properly formatted TSV.
  --info_in_report      Add NanoPlot run info in the report.

Options for filtering or transforming input prior to plotting:
  --maxlength N         Hide reads longer than length specified.
  --minlength N         Hide reads shorter than length specified.
  --drop_outliers       Drop outlier reads with extreme long length.
  --downsample N        Reduce dataset to N reads by random sampling.
  --loglength           Additionally show logarithmic scaling of lengths in plots.
  --percentqual         Use qualities as theoretical percent identities.
  --alength             Use aligned read lengths rather than sequenced length (bam mode)
  --minqual N           Drop reads with an average quality lower than specified.
  --runtime_until N     Only take the N first hours of a run
  --readtype {1D,2D,1D2}
                        Which read type to extract information about from summary. Options are 1D, 2D,
                        1D2
  --barcoded            Use if you want to split the summary file by barcode
  --no_supplementary    Use if you want to remove supplementary alignments

Options for customizing the plots created:
  -c, --color COLOR     Specify a valid matplotlib color for the plots
  -cm, --colormap COLORMAP
                        Specify a valid matplotlib colormap for the heatmap
  -f, --format [{png,jpg,jpeg,webp,svg,pdf,eps,json} ...]
                        Specify the output format of the plots, which are in addition to the html files
  --plots [{kde,hex,dot} ...]
                        Specify which bivariate plots have to be made.
  --legacy [{kde,dot,hex} ...]
                        Specify which bivariate plots have to be made (legacy mode).
  --listcolors          List the colors which are available for plotting and exit.
  --listcolormaps       List the colors which are available for plotting and exit.
  --no-N50              Hide the N50 mark in the read length histogram
  --N50                 Show the N50 mark in the read length histogram
  --title TITLE         Add a title to all plots, requires quoting if using spaces
  --font_scale FONT_SCALE
                        Scale the font of the plots by a factor
  --dpi DPI             Set the dpi for saving images
  --hide_stats          Not adding Pearson R stats in some bivariate plots

Input data sources, one of these is required.:
  --fastq file [file ...]
                        Data is in one or more default fastq file(s).
  --fasta file [file ...]
                        Data is in one or more fasta file(s).
  --fastq_rich file [file ...]
                        Data is in one or more fastq file(s) generated by albacore, MinKNOW or guppy
                        with additional information concerning channel and time.
  --fastq_minimal file [file ...]
                        Data is in one or more fastq file(s) generated by albacore, MinKNOW or guppy
                        with additional information concerning channel and time. Is extracted swiftly
                        without elaborate checks.
  --summary file [file ...]
                        Data is in one or more summary file(s) generated by albacore or guppy.
  --bam file [file ...]
                        Data is in one or more sorted bam file(s).
  --ubam file [file ...]
                        Data is in one or more unmapped bam file(s).
  --cram file [file ...]
                        Data is in one or more sorted cram file(s).
  --pickle pickle       Data is a pickle file stored earlier.
  --feather file [file ...]
                        Data is in one or more feather file(s).

EXAMPLES:
    NanoPlot --summary sequencing_summary.txt --loglength -o summary-plots-log-transformed
    NanoPlot -t 2 --fastq reads1.fastq.gz reads2.fastq.gz --maxlength 40000 --plots hex dot
    NanoPlot --color yellow --bam alignment1.bam alignment2.bam alignment3.bam --downsample 10000
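
Two quantities named in the options above, the N50 (--N50) and the theoretical percent identity (--percentqual), are simple to compute by hand. The sketch below uses the usual definition of N50 and assumes that "theoretical percent identity" means 100 × (1 − 10^(−Q/10)) for a Phred score Q; both function names are hypothetical and the code is not taken from NanoPlot.

```python
def n50(lengths):
    """Return the length L such that reads of length >= L contain
    at least half of all sequenced bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest read down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length
    raise ValueError("n50() needs at least one read length")


def phred_to_percent_identity(q):
    """Convert a Phred score to a theoretical percent identity.

    Assumption: identity = 100 * (1 - error probability).
    """
    return 100 * (1 - 10 ** (-q / 10))


print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
print(phred_to_percent_identity(10))
```

In the first call the two longest reads (8 + 8 = 16 bases) already cover half of the 32 bases in total, so the N50 is 8.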

NOTES

  • --downsample won't save you much time, because downsampling is only done after all data has been collected; it probably only makes a difference for a huge amount of data. If you want to save time, you could downsample your data upfront. Note also that extracting information from a summary file is faster than from other formats, and that you can extract from multiple files simultaneously (these are then processed in parallel). Some plot types (especially kde) are slower than others, so take a look at the choices for --plots to speed things up (the default is to make both the kde and dot plots). If you are only interested in, say, the read length histogram, it is possible to write a script that produces just that and avoids wasting time on the rest. Let me know if you need any help here.
  • --plots uses the plotly package to plot kde and dot plots. The hex option will be ignored.
  • --legacy: plotting a hex plot is currently only possible using this option, which uses the seaborn and matplotlib packages, since there is no support for it in plotly (yet). Plots like kde and dot are also possible with this option.
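
Downsampling upfront, as suggested in the first note, could look like the sketch below. It uses reservoir sampling so that the input is read only once and only N records are held in memory. It assumes plain or gzip compressed fastq files with exactly four lines per record; all function names are hypothetical and nothing here is part of NanoPlot.

```python
import gzip
import random


def read_fastq(path):
    """Yield (header, sequence, separator, quality) tuples.

    Assumes four-line records; files ending in .gz are read with gzip.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            record = [handle.readline().rstrip("\n") for _ in range(4)]
            if not record[0]:  # end of file
                break
            yield tuple(record)


def reservoir_sample(records, n, seed=None):
    """Keep a uniform random sample of at most n items from an iterable."""
    rng = random.Random(seed)
    sample = []
    for index, record in enumerate(records):
        if index < n:
            # Fill the reservoir with the first n records.
            sample.append(record)
        else:
            # Replace a random slot with probability n / (index + 1).
            slot = rng.randint(0, index)
            if slot < n:
                sample[slot] = record
    return sample


def downsample_fastq(in_path, out_path, n, seed=None):
    """Write a random subset of n reads from in_path to out_path."""
    with open(out_path, "w") as out:
        for record in reservoir_sample(read_fastq(in_path), n, seed):
            out.write("\n".join(record) + "\n")
```

For example, `downsample_fastq("reads1.fastq.gz", "reads1.sub.fastq", 10000, seed=42)` would write 10000 randomly chosen reads (file names are placeholders), and the smaller file can then be passed to NanoPlot with --fastq.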

EXAMPLE USAGE

NanoPlot --summary sequencing_summary.txt --loglength -o summary-plots-log-transformed  
NanoPlot -t 2 --fastq reads1.fastq.gz reads2.fastq.gz --maxlength 40000 --plots dot --legacy hex
NanoPlot -t 12 --color yellow --bam alignment1.bam alignment2.bam alignment3.bam --downsample 10000 -o bamplots_downsampled
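
When NanoPlot is called from a Python pipeline, a command like the ones above can be assembled as an argument list and run with the standard library. This is a minimal sketch using only flags listed under USAGE; the helper names and the output directory are hypothetical, and it assumes the NanoPlot executable is on the PATH.

```python
import shlex
import subprocess


def build_nanoplot_command(fastq_files, outdir, threads=2,
                           maxlength=None, plots=("dot",)):
    """Assemble a NanoPlot command line for one or more fastq files."""
    command = ["NanoPlot", "-t", str(threads),
               "--fastq", *fastq_files, "-o", outdir]
    if maxlength is not None:
        command += ["--maxlength", str(maxlength)]
    if plots:
        command += ["--plots", *plots]
    return command


def run_nanoplot(command):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(command, check=True)


command = build_nanoplot_command(
    ["reads1.fastq.gz", "reads2.fastq.gz"], "fastq-plots", maxlength=40000
)
# Print the command as it would be typed in a shell, without running it.
print(shlex.join(command))
```

Passing a list to `subprocess.run` avoids shell quoting problems with file names that contain spaces.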

ACKNOWLEDGMENTS/CONTRIBUTORS

  • Ilias Bukraa for tremendous improvements and maintenance of the code
  • Andreas Sjödin for building and maintaining conda recipes
  • Darrin Schultz @conchoecia for Pauvre code
  • @alexomics for fixing the indentation of the printed stats
  • Botond Sipos @bsipos for speeding up the calculation of average quality scores

CONTRIBUTING

I welcome all suggestions, bug reports, feature requests and contributions. Please leave an issue or open a pull request. I will usually respond within a day, or rarely within a few days.

PLOTS GENERATED

| Plot | Fastq | Fastq_rich | Fastq_minimal | Bam | Summary | Options | Style |
|---|---|---|---|---|---|---|---|
| Histogram of read length | x | x | x | x | x | N50 | |
| Histogram of (log transformed) read length | x | x | x | x | x | N50 | |
| Bivariate plot of length against base call quality | x | x | | x | x | log transformation | dot, hex, kde |
| Heatmap of reads per channel | | x | | | x | | |
| Cumulative yield plot | | x | x | | x | | |
| Violin plot of read length over time | | x | x | | x | | |
| Violin plot of base call quality over time | | x | | | x | | |
| Bivariate plot of aligned read length against sequenced read length | | | | x | | | dot, hex, kde |
| Bivariate plot of percent reference identity against read length | | | | x | | log transformation | dot, hex, kde |
| Bivariate plot of percent reference identity against base call quality | | | | x | | | dot, hex, kde |
| Bivariate plot of mapping quality against read length | | | | x | | log transformation | dot, hex, kde |
| Bivariate plot of mapping quality against basecall quality | | | | x | | | dot, hex, kde |

COMPANION SCRIPTS

  • NanoComp: comparing multiple runs
  • NanoStat: statistic summary report of reads or alignments
  • NanoFilt: filtering and trimming of reads
  • NanoLyse: removing contaminant reads (e.g. lambda control DNA) from fastq

CITATION

If you use this tool, please consider citing our publication.

Copyright: 2016-2020 Wouter De Coster
