Oxford Nanopore Technologies fast5 API software

ont_fast5_api

ont_fast5_api is a simple interface to HDF5 files of the Oxford Nanopore .fast5 file format.

It provides:

  • Concrete implementation of the fast5 file schema using the generic h5py library
  • Plain-English-named methods to interact with and reflect the fast5 file schema
  • Tools to convert between multi_read and single_read formats
  • Tools to compress/decompress raw data in files

Getting Started

The ont_fast5_api is available on PyPI and can be installed via pip:

pip install ont-fast5-api

Alternatively, it is available on GitHub, where it can be built from source:

git clone https://github.com/nanoporetech/ont_fast5_api
pip install ./ont_fast5_api

Dependencies

ont_fast5_api is a pure Python project and should run on most Python versions and operating systems.

It requires:

Interface - get_fast5_file

The ont_fast5_api provides a simple interface to access the data structures in .fast5 files of either single- or multi- read format using the same method calls.

For example, to print the raw data from all reads in a file:

from ont_fast5_api.fast5_interface import get_fast5_file

def print_all_raw_data():
    fast5_filepath = "test/data/single_reads/read0.fast5" # This can be a single- or multi-read file
    with get_fast5_file(fast5_filepath, mode="r") as f5:
        for read in f5.get_reads():
            raw_data = read.get_raw_data()
            print(read.read_id, raw_data)

Interface - Console Scripts

The ont_fast5_api provides terminal/command-line console_scripts for converting between files in the Oxford Nanopore single_read and multi_read .fast5 file formats. These are provided to ensure compatibility between tools which expect either the single_read or multi_read .fast5 file formats.

The scripts are added during installation and can be called from the terminal/command-line or from within Python.

single_to_multi_fast5

This script converts folders containing single_read_fast5 files into multi_read_fast5_files:

single_to_multi_fast5
[required]
    -i, --input_path    INPUT_PATH      <(path) folder containing single_read_fast5 files>
    -s, --save_path     SAVE_PATH       <(path) to folder where multi_read fast5 files will be output>

[optional]
    -t, --threads       THREADS         <(int) number of CPU threads to use; default=1>
    -f, --filename_base FILENAME_BASE   <(string) name for new multi_read file; default="batch" (see note-1)>
    -n, --batch_size    BATCH_SIZE      <(int) number of single_reads to include in each multi_read file; default=4000>
    --recursive                         <if included, recursively search sub-directories for single_read files>

note-1: newly created multi_read files require a name. This is the filename_base with the batch count and .fast5 appended to it; e.g. -f batch yields batch_0.fast5, batch_1.fast5, ...
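
As a rough illustration of note-1, the naming scheme can be sketched in plain Python. This is not the library's own code: batch_filenames is a hypothetical helper, and it assumes reads are packed in order into batches of batch_size, with the last batch possibly smaller.

```python
import math


def batch_filenames(num_reads, filename_base="batch", batch_size=4000):
    """Return the multi_read filenames expected for num_reads single reads.

    Hypothetical helper for illustration only: it assumes files are named
    <filename_base>_<batch count>.fast5, counting from 0, as in note-1.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    num_batches = math.ceil(num_reads / batch_size)
    return ["{}_{}.fast5".format(filename_base, i) for i in range(num_batches)]


if __name__ == "__main__":
    # 250 reads in batches of 100 need three output files
    print(batch_filenames(250, filename_base="batch_output", batch_size=100))
```

Under these assumptions, 250 reads with a batch size of 100 would give batch_output_0.fast5, batch_output_1.fast5 and batch_output_2.fast5, the last one holding the remaining 50 reads.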

example usage:

single_to_multi_fast5 --input_path /data/reads --save_path /data/multi_reads \
    --filename_base batch_output --batch_size 100 --recursive

Where /data/reads and/or its subfolders contain single_read .fast5 files. The output will be multi_read fast5 files each containing 100 reads, in the folder: /data/multi_reads with the names: batch_output_0.fast5, batch_output_1.fast5 etc.
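
One way to launch the same conversion from a Python program is the standard-library subprocess module. The sketch below is only one option and may not be what the maintainers mean by calling the scripts "from within python". build_single_to_multi_cmd and run_single_to_multi are hypothetical helper names, and only flags listed in the usage text above are used.

```python
import subprocess


def build_single_to_multi_cmd(input_path, save_path, filename_base="batch",
                              batch_size=4000, threads=1, recursive=False):
    """Build the argument list for the single_to_multi_fast5 console script.

    Hypothetical helper; defaults mirror those shown in the usage text.
    """
    cmd = [
        "single_to_multi_fast5",
        "--input_path", str(input_path),
        "--save_path", str(save_path),
        "--filename_base", filename_base,
        "--batch_size", str(batch_size),
        "--threads", str(threads),
    ]
    if recursive:
        cmd.append("--recursive")
    return cmd


def run_single_to_multi(**kwargs):
    """Run the conversion; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_single_to_multi_cmd(**kwargs), check=True)
```

For the example above, the call would be run_single_to_multi(input_path="/data/reads", save_path="/data/multi_reads", filename_base="batch_output", batch_size=100, recursive=True). Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.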

multi_to_single_fast5

This script converts folders containing multi_read_fast5 files into single_read_fast5 files:

multi_to_single_fast5
[required]
    -i, --input_path    INPUT_PATH  <(path) folder containing multi_read_fast5 files>
    -s, --save_path     SAVE_PATH   <(path) to folder where single_read fast5 files will be output>

[optional]
    -t, --threads       THREADS     <(int) number of CPU threads to use; default=1>
    --recursive                     <if included, recursively search sub-directories for multi_read files>

example usage:

multi_to_single_fast5 --input_path /data/multi_reads --save_path /data/single_reads \
    --recursive

Where /data/multi_reads and/or its subfolders contain multi_read .fast5 files. The output will be single_read .fast5 files in the folder /data/single_reads, with one subfolder per multi_read input file.

fast5_subset

This script extracts reads from multi_read_fast5_file(s) based on a list of read_ids:

fast5_subset
[required]
    -i, --input         INPUT_PATH      <(path) to folder containing multi_read_fast5 files or an individual multi_read_fast5 file>
    -s, --save_path     SAVE_PATH       <(path) to folder where multi_read fast5 files will be output>
    -l, --read_id_list  SUMMARY_PATH    <(file) either sequencing_summary.txt file or a file containing a list of read_ids>

[optional]
    -f, --filename_base FILENAME_BASE   <(string) name for new multi_read file; default="batch" (see note-1)>
    -n, --batch_size    BATCH_SIZE      <(int) number of single_reads to include in each multi_read file; default=4000>
    --recursive                         <if included, recursively search sub-directories for multi_read files>

example usage:

fast5_subset --input /data/multi_reads --save_path /data/subset \
    --read_id_list read_id_list.txt --batch_size 100 --recursive

Where /data/multi_reads and/or its subfolders contain multi_read .fast5 files and read_id_list.txt is either a text file containing one read_id per line or a TSV file with a column named read_id. The output will be multi_read .fast5 files each containing 100 reads, in the folder /data/subset, with the names batch_0.fast5, batch_1.fast5 etc.
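
The two accepted layouts for the read_id list can be told apart with a few lines of Python. The following is a minimal sketch, not the parser used by fast5_subset: parse_read_ids is a hypothetical helper, and the rule it applies (a tab-separated header containing read_id means TSV, anything else means one read_id per line) is an assumption made for illustration.

```python
import csv
import io


def parse_read_ids(text):
    """Collect read_ids from the contents of a read_id list file.

    Assumption for illustration: if the first line has a tab-separated
    column called "read_id", the file is treated as a TSV; otherwise every
    non-empty line is taken to be one read_id.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return set()
    header = lines[0].split("\t")
    if "read_id" in header:
        reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
        # Rows that are too short yield None for missing columns; skip them.
        return {row["read_id"] for row in reader if row["read_id"]}
    return {line.strip() for line in lines}
```

A file on disk can be passed in by reading it first, for example parse_read_ids(open("read_id_list.txt").read()). Returning a set makes later membership checks cheap when scanning many reads.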

demux_fast5

This script demultiplexes reads from multi_read_fast5_file(s).

It extracts reads into multiple directories based on a column value in a summary file:

demux_fast5
[required]
  -i, --input          INPUT_PATH    <Path to Fast5 file or directory of Fast5 files>
  -s, --save_path      SAVE_PATH     <Directory to output MultiRead subsets>
  -l, --summary_file   SUMMARY_PATH  <TSV file containing read_id and demultiplex columns>

[optional]
  --read_id_column     COLUMN_NAME   <Name of read_id column in summary file (default 'read_id')>
  --demultiplex_column COLUMN_NAME   <Name of column for demultiplexing in summary file (default 'barcoding_arrangement')>
  -f, --filename_base  FILENAME_BASE <Root of output filename, default='batch' -> 'batch_0.fast5'>
  -n, --batch_size     BATCH_SIZE    <Number of reads per multi-read file, default 4000>
  -t, --threads        THREADS       <Maximum number of processes to use>
  -r, --recursive                    <Flag to search recursively through input directory for MultiRead fast5 files>
  --ignore_symlinks                  <Ignore symlinks when searching recursively for fast5 files>
  -c, --compression    COMPRESSION   <Target output compression type (vbz,vbz_legacy_v0,gzip,None)>

Intended use is for multiplexed experiments, for reads with different barcodes or from different genomes.

example usage:

demux_fast5 --input /data/multi_reads --save_path /data/demultiplexed_reads --summary_file barcoding_summary.txt

Where /data/multi_reads and/or its subfolders contain fast5 files from a multiplexed experiment and barcoding_summary.txt is the output of guppy_barcoder. /data/demultiplexed_reads will contain one directory per barcode, each containing multi_read .fast5 files, with names such as /data/demultiplexed_reads/barcode01/batch_0.fast5, /data/demultiplexed_reads/barcode02/batch_0.fast5 etc. Directories are named after the values in the demultiplex column.
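
The grouping step behind this layout can be sketched with the standard csv module. This is an illustration of the idea rather than the demux_fast5 implementation: group_reads_by_column is a hypothetical helper, and its defaults simply mirror the option defaults listed above.

```python
import csv
import io
from collections import defaultdict


def group_reads_by_column(summary_text, read_id_column="read_id",
                          demultiplex_column="barcoding_arrangement"):
    """Map each value of the demultiplex column to its list of read_ids.

    Hypothetical helper; summary_text is the content of a TSV summary file
    whose header row names both columns.
    """
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(summary_text), delimiter="\t")
    for row in reader:
        groups[row[demultiplex_column]].append(row[read_id_column])
    return dict(groups)
```

Each key of the returned dictionary corresponds to one output directory (for example barcode01), and its list holds the read_ids that would be written there.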

compress_fast5

This script copies and converts raw data between vbz and gzip compression formats:

compress_fast5
[required]
    -i, --input_path    INPUT_PATH  <(path) folder containing fast5 files>
    -s, --save_path     SAVE_PATH   <(path) to folder where compressed fast5 files will be output>
    -c, --compression   COMPRESSION <(str) [vbz, gzip] target compression format>

[optional]
    -t, --threads       THREADS     <(int) number of CPU threads to use; default=1>
    --recursive                     <if included, recursively search sub-directories for fast5 files>
    --sanitize                      <flag to remove optional groups (such as basecalling and modified base information)>

example usage:

compress_fast5 --input_path /data/uncompressed_reads --save_path /data/compressed_reads \
    --compression vbz --recursive --threads 40

Where /data/uncompressed_reads and/or its subfolders contain .fast5 files. The output will be a copy of the input folder structure containing compressed reads preserving both the folder structure and file type.
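
To make "preserving the folder structure" concrete, the sketch below computes where a given input file would end up if the output folder mirrors the input folder. mirrored_output_path is a hypothetical helper written for this illustration and is not taken from compress_fast5. It uses PurePosixPath so that it only manipulates path strings and never touches the filesystem.

```python
from pathlib import PurePosixPath


def mirrored_output_path(input_file, input_root, save_root):
    """Return where input_file would land if save_root mirrors input_root.

    Hypothetical helper: the file keeps its path relative to input_root.
    Raises ValueError if input_file is not located under input_root.
    """
    relative = PurePosixPath(input_file).relative_to(input_root)
    return PurePosixPath(save_root) / relative
```

For example, /data/uncompressed_reads/run1/batch_0.fast5 would map to /data/compressed_reads/run1/batch_0.fast5 under this assumption.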

The optional --sanitize option can be used to greatly reduce file size when files contain optional data from the Guppy basecaller that could in principle be regenerated by running Guppy. The files output when using the sanitize option will be identical in structure to those output by MinKNOW when live basecalling is disabled.

NB: compress_fast5 will copy .fast5 files in order to compress them due to HDF5 implementation constraints. Further detail of HDF5 data management strategies can be found at: https://support.hdfgroup.org/HDF5/doc/Advanced/FileSpaceManagement/FileSpaceManagement.pdf

VBZ Compression

VBZ is a compression algorithm developed by Oxford Nanopore to reduce file size and improve read/write performance when handling raw data in Fast5 files. Previously, the default compression was GZIP; compared with GZIP, we see a compression improvement of >30% and a CPU performance improvement of >10X for compression and >5X for decompression. Further details of the implementation and benchmarks can be found here: https://github.com/nanoporetech/vbz_compression

Benchmarking the performance of compression within the ont_fast5_api against a normal file copy showed that compressing from gzip to vbz was approximately 2x slower than copying files. In other words, if it would take two hours to copy a set of files from an input folder to an output folder, then it should take about four hours to compress those files with VBZ. Running the script without compressing (i.e. the same type of compression in and out; gzip->gzip) was approximately 2x faster than a file copy, since it can utilise multiple threads.
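
The arithmetic in this paragraph can be written as a small planning aid. The ratios below are the approximate figures quoted above, and estimate_hours is a hypothetical helper. Real timings will depend on hardware, thread count and data, so treat the result as a rough guide only.

```python
# Approximate ratios relative to a plain file copy, taken from the
# benchmark paragraph above.
RELATIVE_TO_COPY = {
    "gzip->vbz": 2.0,   # about 2x slower than copying the files
    "gzip->gzip": 0.5,  # about 2x faster than copying the files
}


def estimate_hours(copy_hours, conversion):
    """Scale a measured file-copy time by the rough benchmark ratio."""
    return copy_hours * RELATIVE_TO_COPY[conversion]
```

With a measured copy time of two hours, estimate_hours(2, "gzip->vbz") gives the four hours mentioned above, and estimate_hours(2, "gzip->gzip") gives about one hour.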

Glossary of Terms:

HDF5 file format - a portable file format for storing and managing data. It is designed for flexible and efficient I/O and for high-volume and complex data.

Fast5 - an implementation of the HDF5 file format, with specific data schemas for Oxford Nanopore sequencing data

Single read fast5 - A fast5 file containing all the data pertaining to a single Oxford Nanopore read. This may include raw signal data, run metadata, fastq-basecalls and any other additional analyses

Multi read fast5 - A fast5 file containing data pertaining to multiple Oxford Nanopore reads.

Demultiplexing - A process of separating reads of an experiment where multiple samples were mixed together (multiplexed), into corresponding samples. Demultiplexing is based on markers that identify sample origin, e.g. unique barcodes or alignment to a reference genome.
