
Deepbinner

a signal-level demultiplexer for Oxford Nanopore reads

Deepbinner is a tool for demultiplexing barcoded Oxford Nanopore sequencing reads. It does this with a deep convolutional neural network classifier, using many of the architectural advances that have proven successful in image classification. Unlike other demultiplexers (e.g. Albacore and Porechop), Deepbinner identifies barcodes from the raw signal (a.k.a. squiggle) which gives it greater sensitivity and fewer unclassified reads.

  • Reasons to use Deepbinner:
    • To minimise the number of unclassified reads (use Deepbinner by itself).
    • To minimise the number of misclassified reads (use Deepbinner in conjunction with Albacore demultiplexing).
    • You plan on running signal-level downstream analyses, like Nanopolish. Deepbinner can demultiplex the fast5 files which makes this easier.
  • Reasons to not use Deepbinner:
    • You only have basecalled reads, not the raw fast5 files (which Deepbinner requires).
    • You have a small/slow computer. Deepbinner is more computationally intensive than Porechop.
    • You used a sequencing/barcoding kit other than the ones Deepbinner was trained on.

You can read more about Deepbinner in this preprint:
Wick RR, Judd LM, Holt KE. Deepbinner: Demultiplexing barcoded Oxford Nanopore reads with deep convolutional neural networks. bioRxiv. 2018; doi:10.1101/366526.

2021 update

I developed Deepbinner almost three years ago, which is a very long time in the fast-moving space of Nanopore sequencing! Since then, a lot has changed, and for most users, Deepbinner is probably no longer the best choice for demultiplexing your Nanopore reads.

When Deepbinner was published, it had a clear advantage over sequence-based demultiplexing: identifying barcodes in the raw signal gave better accuracy than identifying them in a basecalled sequence. But the last few years have seen large gains in Oxford Nanopore basecalling accuracy, which has made sequence-based demultiplexing more accurate as well, so Deepbinner's advantage has narrowed considerably. Guppy (Oxford Nanopore's production basecalling tool) now has integrated sequence-based demultiplexing, which makes it very convenient to use. Also, Deepbinner's models are out of date: they cover only 12 barcodes, but up to 96 native barcodes are now available.

The short version is this: I think most users should demultiplex with Guppy, not Deepbinner. Guppy is easier to run and will probably do nearly as well as Deepbinner (though I haven't tested this quantitatively).

On a final note, I don't think that the concept of raw-signal-based demultiplexing with a neural network is obsolete. Raw signals always contain more information than basecalled sequences, and neural networks can make very good classifiers. In a perfect world, I'd like to see raw-signal neural-network demultiplexing integrated into Guppy – a feature request in case any Guppy developers are reading this 😄. So I will leave Deepbinner's repo in place for any intrepid users who might want to modify it, train custom models, etc. But consider it deprecated.

Requirements

Deepbinner runs on macOS and Linux and requires Python 3.5+.

Its most complex requirement is TensorFlow, which powers the neural network. TensorFlow can run on CPUs (easy to install, supported on many machines) or on NVIDIA GPUs (better performance). If you're only going to use Deepbinner to classify reads, you may not need GPU-level performance (read more here). But if you want to train your own Deepbinner neural network, then using a GPU is a necessity.

The simplest way to install TensorFlow for your CPU is with pip3 install tensorflow. Building TensorFlow from source may give slightly better performance (because it will use all instruction sets supported by your CPU), but the installation is more complex. If you are using Ubuntu and have an NVIDIA GPU, check out these instructions for installing TensorFlow with GPU support.

Deepbinner uses some other Python packages (Keras, NumPy and h5py) but these should be taken care of by pip when installing Deepbinner. It also assumes that you have gzip available on your command line. If you are going to train your own Deepbinner network, then you'll need a few more Python packages as well (see the training instructions).

If you are using multi-read fast5 files (new in 2019), then you'll also need to have the multi_to_single_fast5 tool installed on your path. You can get it here: github.com/nanoporetech/ont_fast5_api.

Installation

Install from source

You can install Deepbinner using pip, either from a local copy:

git clone https://github.com/rrwick/Deepbinner.git
pip3 install ./Deepbinner
deepbinner --help

Or directly from GitHub:

pip3 install git+https://github.com/rrwick/Deepbinner.git
deepbinner --help

Run without installation

Deepbinner can be run directly from its repository by using the deepbinner-runner.py script, no installation required:

git clone https://github.com/rrwick/Deepbinner.git
Deepbinner/deepbinner-runner.py -h

If you run Deepbinner this way, it's up to you to make sure that all necessary Python packages are installed.

Quick usage

Demultiplex native barcoding reads that are already basecalled:

deepbinner classify --native fast5_dir > classifications
deepbinner bin --classes classifications --reads basecalled_reads.fastq.gz --out_dir demultiplexed_reads

Demultiplex rapid barcoding reads that are already basecalled:

deepbinner classify --rapid fast5_dir > classifications
deepbinner bin --classes classifications --reads basecalled_reads.fastq.gz --out_dir demultiplexed_reads

Demultiplex native barcoding raw fast5 reads (potentially in real-time during a sequencing run):

deepbinner realtime --in_dir fast5_dir --out_dir demultiplexed_fast5s --native

Demultiplex rapid barcoding raw fast5 reads (potentially in real-time during a sequencing run):

deepbinner realtime --in_dir fast5_dir --out_dir demultiplexed_fast5s --rapid

The sample_reads.tar.gz file in this repository contains a small test set: six fast5 files and a FASTQ of their basecalled sequences. When classified with Deepbinner, you should get two reads each from barcodes 1, 2 and 3.

Available trained models

Deepbinner currently only provides pre-trained models for the EXP-NBD103 native barcoding expansion and the SQK-RBK004 rapid barcoding kit. See more details here.

If you have different data, then pre-trained models aren't available. If you have lots of existing data, you can train your own network. Alternatively, if you can share your data with me, I could train a model and make it available as part of Deepbinner. Let me know!

Using Deepbinner after basecalling

If your reads are already basecalled, then running Deepbinner is a two-step process:

  1. Classify reads using the fast5 files
  2. Organise the basecalled FASTQ reads into bins using the classifications

Step 1: classifying fast5 reads

This is accomplished using the deepbinner classify command, e.g.:

deepbinner classify --native fast5_dir > classifications

Since the native barcoding kit puts barcodes on both the start and end of reads, Deepbinner will look for both. Most reads should have a barcode at the start, but barcodes at the end are less common. If a read has conflicting barcodes at the start and end, it will be put in the unclassified bin. The --require_both option makes Deepbinner only bin reads with a matching start and end barcode, but this is very stringent and will result in far more unclassified reads. See more on the wiki: Combining start and end barcodes. None of this applies if you are using rapid barcoding reads (--rapid), as they only have a barcode at the start.
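To make the start/end logic concrete, here is a minimal Python sketch of how two barcode calls could be reconciled into a final bin. This is an illustration of the behaviour described above, not Deepbinner's actual code; the `"none"` label and the `final_bin` function are assumptions for the example.

```python
# Illustrative sketch (not Deepbinner's actual code) of combining a
# start-barcode call and an end-barcode call into a final classification.
# Each call is a barcode label like "3", or "none" when no barcode was found.

def final_bin(start_call, end_call, require_both=False):
    """Combine start/end barcode calls into one classification."""
    if require_both:
        # Stringent mode: only bin reads whose start and end barcodes match.
        if start_call == end_call and start_call != "none":
            return start_call
        return "unclassified"
    if start_call != "none" and end_call != "none":
        # Conflicting start/end barcodes -> unclassified.
        return start_call if start_call == end_call else "unclassified"
    # Otherwise fall back to whichever end produced a call (if either did).
    if start_call != "none":
        return start_call
    if end_call != "none":
        return end_call
    return "unclassified"
```

Note how `--require_both` turns the "barcode at either end is enough" rule into "both ends must agree", which is why it produces far more unclassified reads.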

Here is the full usage for deepbinner classify.

Step 2: binning basecalled reads

This is accomplished using the deepbinner bin command, e.g.:

deepbinner bin --classes classifications --reads basecalled_reads.fastq.gz --out_dir demultiplexed_reads

This will leave your original basecalled reads in place, copying the sequences out to new files in your specified output directory. Both FASTA and FASTQ inputs are okay, gzipped or not. Deepbinner will gzip the binned reads at the end of the process.
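The binning step is conceptually simple: look up each read's barcode call and append its record to the matching output file. Here is a minimal Python sketch of that idea, not Deepbinner's actual implementation; the tab-separated "read_id, barcode" classification format and the output file names are assumptions for the example.

```python
# Illustrative sketch (not Deepbinner's actual code) of step 2: split a
# FASTQ into per-barcode gzipped files using a table of
# "read_id<TAB>barcode" classifications.
import gzip
import os

def bin_reads(classifications_path, fastq_path, out_dir):
    # Load read_id -> barcode calls.
    calls = {}
    with open(classifications_path) as f:
        for line in f:
            if not line.strip():
                continue
            read_id, barcode = line.rstrip("\n").split("\t")[:2]
            calls[read_id] = barcode
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    with open(fastq_path) as fq:
        while True:
            header = fq.readline()
            if not header:
                break
            # A FASTQ record is four lines: header, sequence, '+', qualities.
            record = header + fq.readline() + fq.readline() + fq.readline()
            read_id = header[1:].split()[0]
            barcode = calls.get(read_id, "none")
            if barcode not in handles:
                name = "unclassified" if barcode == "none" else f"barcode{barcode}"
                path = os.path.join(out_dir, name + ".fastq.gz")
                handles[barcode] = gzip.open(path, "wt")
            handles[barcode].write(record)
    for handle in handles.values():
        handle.close()
```

As in the real tool, the original reads file is left untouched and the binned copies come out gzipped.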

Here is the full usage for deepbinner bin.

Using Deepbinner before basecalling

If you haven't yet basecalled your reads, you can use deepbinner realtime to bin the fast5 files, e.g.:

deepbinner realtime --in_dir fast5s --out_dir demultiplexed_fast5s --native

This command will move (not copy) fast5 files from the --in_dir directory to the --out_dir directory. As the command name suggests, this can be run in real-time – Deepbinner will watch the input directory and wait for new reads. Just set --in_dir to where MinKNOW deposits its reads. Or if you sequence on a laptop and copy the reads to a server, you can run Deepbinner on the server, watching the directory where the reads are deposited. Use Ctrl-C to stop it.

This command doesn't have to be run in real-time – it works just as well on a directory of fast5 files from a finished sequencing run.
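The realtime mode boils down to a polling loop: scan the input directory, classify each new fast5, and move it into a per-barcode subdirectory. This Python sketch shows that shape only; it is not Deepbinner's actual code, and the `classify` callback stands in for the neural-network classifier.

```python
# Illustrative sketch (not Deepbinner's actual code) of the realtime idea:
# poll an input directory and move each fast5 into a per-barcode bin.
import os
import shutil
import time

def process_once(in_dir, out_dir, classify):
    """Move any fast5s currently in in_dir into per-barcode bins."""
    moved = 0
    for name in os.listdir(in_dir):
        if not name.endswith(".fast5"):
            continue  # ignore non-fast5 files
        barcode = classify(os.path.join(in_dir, name))
        bin_dir = os.path.join(out_dir, f"barcode{barcode}")
        os.makedirs(bin_dir, exist_ok=True)
        shutil.move(os.path.join(in_dir, name), bin_dir)  # move, not copy
        moved += 1
    return moved

def watch(in_dir, out_dir, classify, poll_seconds=5):
    """Real-time mode: keep polling until interrupted (Ctrl-C)."""
    try:
        while True:
            process_once(in_dir, out_dir, classify)
            time.sleep(poll_seconds)
    except KeyboardInterrupt:
        pass
```

Running `process_once` a single time corresponds to using the command on a finished run's directory; wrapping it in `watch` corresponds to real-time use during sequencing.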

Here is the full usage for deepbinner realtime (many of the same options as the classify command).

Using Deepbinner with Albacore demultiplexing

If you use both Deepbinner and Albacore to demultiplex reads, only keeping reads for which both tools agree on the barcode, you can achieve very low rates of misclassified reads (high precision, positive predictive value) but a larger proportion of reads will not be classified (put into the 'none' bin). This is what I usually do with my sequencing runs!

The easiest way to achieve this is to follow the Using Deepbinner before basecalling instructions above. Then run Albacore separately on each of Deepbinner's output directories, with its --barcoding option on. You should find that for each bin, Albacore puts most of the reads in the same bin (the reads we want to keep), some in the unclassified bin (slightly suspect reads, likely with lower quality basecalls) and a small number in a different bin (very suspect reads).
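The agreement filter itself is a one-liner per read: keep the barcode only when both tools call the same one. A minimal Python sketch of that rule (the dict-of-calls inputs and `"none"` label are assumptions for the example, not Deepbinner's or Albacore's actual output format):

```python
# Illustrative sketch of the Deepbinner + Albacore agreement strategy:
# a read keeps its barcode only when both classifiers agree on it.

def agreed_bins(deepbinner_calls, albacore_calls):
    """Map read_id -> barcode where both tools agree, else 'none'."""
    result = {}
    for read_id, db_call in deepbinner_calls.items():
        ab_call = albacore_calls.get(read_id, "none")
        if db_call == ab_call and db_call != "none":
            result[read_id] = db_call   # high-confidence read: keep it
        else:
            result[read_id] = "none"    # suspect or unclassified: discard
    return result
```

This is why the combined approach trades a larger 'none' bin for a very low misclassification rate: any disagreement sends the read to 'none'.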

Here are some instructions and Bash code to carry this out automatically.

Using Deepbinner with multi-read fast5s

Multi-read fast5s complicate matters for Deepbinner: if one fast5 file contains reads from more than one barcode, then it cannot simply be moved into a bin. The simplest solution is to first run the multi_to_single_fast5 tool available in the ont_fast5_api before running Deepbinner. This is necessary if you are running the deepbinner classify command.

If you are running the deepbinner realtime command, then Deepbinner can handle multi-read fast5 files. It will run the multi_to_single_fast5 tool, putting the single-read fast5s into a temporary directory, and then move those single-read fast5s into bins in the output directory. However, unlike running deepbinner realtime on single-read fast5s, where the fast5s are moved into the destination directory, running it on multi-read fast5s will leave the original input files in place (because it's the unpacked single-read fast5s which are moved). So you may want to delete the multi-read fast5s after Deepbinner finishes to save disk space.

Performance

Deepbinner lives up to its name by using a deep neural network. It's therefore not particularly fast, but should be fast enough to keep up with a typical MinION run. If you want to squeeze out a bit more performance, try adjusting the 'Performance' options. Read more here for a detailed description of these options. In my tests, it can classify about 15 reads/sec using 12 threads (the default). Giving it more threads helps a little, but not much.

Building TensorFlow from source may give better performance (because it can then use all available instruction sets on your CPU). Running TensorFlow on a GPU will definitely give better Deepbinner performance: my tests on a Tesla K80 could classify over 100 reads/sec.

Training

You can train your own neural network with Deepbinner, but you'll need two things:

  • Lots of training data using the same barcoding and sequencing kits. More is better, so ideally from more than one sequencing run.
  • A fast computer to train on, ideally with TensorFlow running on a big GPU.

If you can meet those requirements, then read on in the Deepbinner training instructions!

Contributing

As always, the wider community is welcome to contribute to Deepbinner by submitting issues or pull requests.

I also have a particular need for one kind of contribution: training reads! The lab where I work has mainly used R9.4/R9.5 flowcells with the SQK-LSK108 kit. If you have other types of reads that you can share, I'd be interested (see here for more info).

Acknowledgments

I would like to thank James Ferguson from the Garvan Institute. We met at the Nanopore Day Melbourne event in February 2018 where I saw him present on raw signal detection of barcodes. It was then that the seeds of Deepbinner were sown!

I'm also in debt to Matthew Croxen for sharing his SQK-RBK004 rapid barcoding reads with me – they were used to build Deepbinner's pre-trained model for that kit.

License

GNU General Public License, version 3
