

Benchmarking of long-read assembly tools for bacterial whole genomes

Benchmarking of long-read assemblers for prokaryote whole genome sequencing

This repo contains the supplementary figures, scripts and data used for our paper comparing long-read assemblers:
Wick RR, Holt KE. Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Research. 2019;8(2138).

Are you interested in the older version of this comparison which was hosted here on GitHub? You can still find it here.



Figures

Figure 1. Assembly results for the simulated read sets which cover a wide variety of parameters for length, depth and quality. 'Miniasm+' here refers to the entire Miniasm/Minipolish assembly pipeline. A: Proportion of each possible assembly outcome. B: Relative contiguity of the chromosome for each assembly, showing cleanliness of circularisation. C: Relative contiguity of all plasmids in the assemblies, showing cleanliness of circularisation. D: Sequence identity of each assembly's longest alignment to the chromosome. E: The maximum indel error size in each assembly's longest alignment to the chromosome. F: Total time taken (wall time) for each assembly. G: Maximum RAM usage for each assembly.





Figure 2. Assembly results for the real read sets, half containing ONT MinION reads (circles) and half PacBio RSII reads (X shapes). 'Miniasm+' here refers to the entire Miniasm/Minipolish assembly pipeline. A: Proportion of each possible assembly outcome. B: Relative contiguity of the chromosome for each assembly, showing cleanliness of circularisation. C: Relative contiguity of all plasmids in the assemblies, showing cleanliness of circularisation. D: Sequence identity of each assembly's longest alignment to the chromosome. E: The maximum indel error size in each assembly's longest alignment to the chromosome. F: Total time taken (wall time) for each assembly. G: Maximum RAM usage for each assembly.





Supplementary figures

Figure S1. Distributions of chromosome sizes (A), plasmid sizes (B) and per-genome plasmid counts (C) for the reference genomes used to make the simulated read sets.





Figure S2. Badread parameter histograms for the simulated read sets. A: Mean read depths were sampled from a uniform distribution ranging from 5x to 200x. B: Mean read lengths were sampled from a uniform distribution ranging from 100 to 20000 bp. C: Read length standard deviations were sampled from a uniform distribution ranging from 100 to twice that set's mean length (up to 40000 bp). D: Mean read identities were sampled from a uniform distribution ranging from 80% to 99%. E: Max read identities were sampled from a uniform distribution ranging from that set's mean identity plus 1% to 100%. F: Read identity standard deviations were sampled from a uniform distribution ranging from 1% to the max identity minus the mean identity. G, H and I: Junk, random and chimera rates were all sampled from an exponential distribution with a mean of 2%. J: Glitch sizes/skips were sampled from a uniform distribution ranging from 0 to 100. K: Glitch rates for each set were calculated from the size/skip according to this formula: 100000/(1.6986^(s/10)). L: Adapter lengths were sampled from an exponential distribution with a mean of 50.
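The sampling scheme described in the Figure S2 caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the script used for the paper: the function and dictionary key names are hypothetical, rates and identities are expressed in percent, and no upper cap is applied to the exponential draws because the caption does not mention one.

```python
import random


def glitch_rate(size_skip: float) -> float:
    """Glitch rate for a glitch size/skip s, using the caption's formula
    100000 / (1.6986 ^ (s / 10))."""
    return 100000 / (1.6986 ** (size_skip / 10))


def sample_read_set_parameters(rng: random.Random) -> dict:
    """Draw one set of simulation parameters (hypothetical helper)."""
    mean_depth = rng.uniform(5, 200)                 # A: 5x to 200x
    mean_length = rng.uniform(100, 20000)            # B: 100 to 20000 bp
    length_stdev = rng.uniform(100, 2 * mean_length) # C: up to twice the mean
    mean_identity = rng.uniform(80, 99)              # D: 80% to 99%
    max_identity = rng.uniform(mean_identity + 1, 100)               # E
    identity_stdev = rng.uniform(1, max_identity - mean_identity)    # F
    glitch_size_skip = rng.uniform(0, 100)           # J: 0 to 100
    return {
        "mean_depth": mean_depth,
        "mean_length": mean_length,
        "length_stdev": length_stdev,
        "mean_identity": mean_identity,
        "max_identity": max_identity,
        "identity_stdev": identity_stdev,
        # G, H, I: exponential with a mean of 2%, so the rate parameter is 1/2.
        "junk_rate": rng.expovariate(1 / 2),
        "random_rate": rng.expovariate(1 / 2),
        "chimera_rate": rng.expovariate(1 / 2),
        "glitch_size_skip": glitch_size_skip,
        "glitch_rate": glitch_rate(glitch_size_skip),  # K
        # L: exponential with a mean of 50, so the rate parameter is 1/50.
        "adapter_length": rng.expovariate(1 / 50),
    }


if __name__ == "__main__":
    params = sample_read_set_parameters(random.Random(0))
    for name, value in params.items():
        print(f"{name}: {value:.2f}")
```

With this formula, a size/skip of 0 gives a glitch rate of 100000 and a size/skip of 100 gives a rate of roughly 500, so larger glitches are paired with smaller rate values.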





Figure S3. Top: the target simulated depth of each replicon relative to the chromosome. The smaller the plasmid, the wider the range of possible depths. Bottom: the absolute read depth of each replicon after read simulation.





Figure S4. Commands used for each of the eight assemblers tested.





Figure S5. Possible states for the assembly of a circular replicon. Reference sequences are shown in the inner circles in black and aligned contig sequences are shown in the outer circles in colour (red at the contig start to violet at the contig end). A: Complete assembly with perfect circularisation. B: Complete assembly but with missing bases leading to a gapped circularisation. C: Complete assembly but with duplicated bases leading to overlapping circularisation. D: Incomplete assembly due to fragmentation (multiple contigs per replicon). E: Incomplete assembly due to missing sequence. F: Incomplete assembly due to misassembly (non-contiguous sequence in the contig).





Figure S6. Reference triplication for assembly assessment. A: Due to the ambiguous starting position of a circular replicon, a completely-assembled contig will typically not align to the reference in a single unbroken alignment. B: Doubling the reference sequence will allow for a single alignment, regardless of starting position. C: However, if the contig contains start/end overlap (i.e. contiguity >100%) then even a doubled reference may not be sufficient to achieve a single alignment, depending on the starting position. D: A tripled reference allows for an unbroken alignment, regardless of starting position, even in cases of >100% contiguity.
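The reasoning in the Figure S6 caption can be illustrated with a toy example. The sketch below uses exact substring matching as a stand-in for real alignment, a made-up 20 bp reference, and hypothetical function names. It only demonstrates why doubling can fail for contigs with start/end overlap while tripling does not; it is not the assessment code used in the paper.

```python
def rotate(seq: str, start: int) -> str:
    """Return the circular sequence re-written to begin at position `start`."""
    return seq[start:] + seq[:start]


def contig_with_overlap(reference: str, start: int, overlap: int) -> str:
    """Simulate a complete contig that begins at `start` on the circular
    reference and runs past its own start by `overlap` bases
    (so contiguity is above 100% whenever overlap > 0)."""
    rotated = rotate(reference, start)
    return rotated + rotated[:overlap]


def fits_in_single_match(contig: str, reference: str, copies: int) -> bool:
    """True if the contig matches the repeated reference in one unbroken piece.
    Exact substring search is an assumption standing in for alignment."""
    return contig in reference * copies


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTAACCGG"  # made-up circular replicon, 20 bp
    for start in (3, 18):
        contig = contig_with_overlap(reference, start, overlap=5)
        for copies in (1, 2, 3):
            result = fits_in_single_match(contig, reference, copies)
            print(f"start={start} copies={copies} single match: {result}")
```

In this toy case the contig starting near the beginning of the reference still fits inside the doubled reference, while the one starting near the end runs off it and needs the third copy.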





Figure S7. Contiguity of the simulated read set assemblies plotted against Badread parameters for each of the tested assemblers. These plots show how well the assemblers tolerate different problems in the read sets. A: Mean read depth (higher is better). B: Max read identity (higher is better). C: N50 read length (higher is better). D: The sum of random read rate and junk read rate (lower is better). E: Chimeric read rate (lower is better). F: Adapter sequence length (lower is better). G: Glitch size/skip (lower is better).





Figure S8. Plasmid completion for the simulated read set assemblies for each of the tested assemblers, plotted with plasmid length and read depth. Solid dots indicate completely assembled plasmids (contiguity ≥99%) while open dots indicate incomplete plasmids (contiguity <99%). Percentages in the plot titles give the proportion of plasmids which were completely assembled.
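The percentage shown in each plot title follows from the 99% contiguity threshold. A minimal sketch of that calculation is given here, assuming contiguity values are available as percentages; the function name and the example values are hypothetical and are not taken from the paper's data.

```python
def plasmid_completion_percentage(contiguities, threshold=99.0):
    """Percentage of plasmids whose contiguity (in percent) meets the
    threshold for a complete assembly."""
    if not contiguities:
        raise ValueError("at least one contiguity value is required")
    complete = sum(1 for c in contiguities if c >= threshold)
    return 100.0 * complete / len(contiguities)


if __name__ == "__main__":
    # Hypothetical contiguity values for four plasmids from one assembler.
    example = [100.0, 99.2, 98.9, 45.0]
    print(f"{plasmid_completion_percentage(example):.1f}% complete")
```

Values above 100% (overlapping circularisation) still count as complete under this threshold, since only the lower bound is tested.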





Figure S9. Plasmid completion for the real read set assemblies for each of the tested assemblers, plotted with plasmid length and read depth. Solid dots indicate completely assembled plasmids (contiguity ≥99%) while open dots indicate incomplete plasmids (contiguity <99%). Percentages in the plot titles give the proportion of plasmids which were completely assembled.





License

Creative Commons Attribution 4.0 International
