Repository Details

Benchmarking of long-read assembly tools for bacterial whole genomes

Benchmarking of long-read assemblers for prokaryote whole genome sequencing

This repo contains the supplementary figures, scripts and data used for our paper comparing long-read assemblers:
Wick RR, Holt KE. Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Research. 2019;8(2138).

Are you interested in the older version of this comparison which was hosted here on GitHub? You can still find it here.



Figures

Figure 1

Figure 1. Assembly results for the simulated read sets which cover a wide variety of parameters for length, depth and quality. 'Miniasm+' here refers to the entire Miniasm/Minipolish assembly pipeline. A: Proportion of each possible assembly outcome. B: Relative contiguity of the chromosome for each assembly, showing cleanliness of circularisation. C: Relative contiguity of all plasmids in the assemblies, showing cleanliness of circularisation. D: Sequence identity of each assembly's longest alignment to the chromosome. E: The maximum indel error size in each assembly's longest alignment to the chromosome. F: Total time taken (wall time) for each assembly. G: Maximum RAM usage for each assembly.





Figure 2

Figure 2. Assembly results for the real read sets, half containing ONT MinION reads (circles) and half PacBio RSII reads (X shapes). 'Miniasm+' here refers to the entire Miniasm/Minipolish assembly pipeline. A: Proportion of each possible assembly outcome. B: Relative contiguity of the chromosome for each assembly, showing cleanliness of circularisation. C: Relative contiguity of all plasmids in the assemblies, showing cleanliness of circularisation. D: Sequence identity of each assembly's longest alignment to the chromosome. E: The maximum indel error size in each assembly's longest alignment to the chromosome. F: Total time taken (wall time) for each assembly. G: Maximum RAM usage for each assembly.





Supplementary figures

Figure S1

Figure S1. Distributions of chromosome sizes (A), plasmid sizes (B) and per-genome plasmid counts (C) for the reference genomes used to make the simulated read sets.





Figure S2

Figure S2. Badread parameter histograms for the simulated read sets. A: Mean read depths were sampled from a uniform distribution ranging from 5x to 200x. B: Mean read lengths were sampled from a uniform distribution ranging from 100 to 20000 bp. C: Read length standard deviations were sampled from a uniform distribution ranging from 100 to twice that set's mean length (up to 40000 bp). D: Mean read identities were sampled from a uniform distribution ranging from 80% to 99%. E: Max read identities were sampled from a uniform distribution ranging from that set's mean identity plus 1% to 100%. F: Read identity standard deviations were sampled from a uniform distribution ranging from 1% to the max identity minus the mean identity. G, H and I: Junk, random and chimera rates were all sampled from an exponential distribution with a mean of 2%. J: Glitch sizes/skips were sampled from a uniform distribution ranging from 0 to 100. K: Glitch rates for each set were calculated from the size/skip according to this formula: 100000/(1.6986^(s/10)). L: Adapter lengths were sampled from an exponential distribution with a mean of 50.
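The sampling scheme in the Figure S2 caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the script used for the paper: the function and key names are hypothetical, glitch size/skip is drawn as a continuous value (the caption does not say whether it was an integer), and identities and rates are expressed in percent.

```python
import random


def glitch_rate(size_skip):
    """Glitch rate from the caption's formula: 100000 / (1.6986 ^ (s / 10))."""
    return 100000 / (1.6986 ** (size_skip / 10))


def sample_read_set_parameters(rng):
    """Draw one set of simulation parameters (hypothetical names, see lead-in)."""
    mean_depth = rng.uniform(5, 200)                    # A: 5x to 200x
    mean_length = rng.uniform(100, 20000)               # B: 100 to 20000 bp
    length_stdev = rng.uniform(100, 2 * mean_length)    # C: 100 to twice the mean
    mean_identity = rng.uniform(80, 99)                 # D: 80% to 99%
    max_identity = rng.uniform(mean_identity + 1, 100)  # E: mean + 1% to 100%
    identity_stdev = rng.uniform(1, max_identity - mean_identity)  # F
    glitch_size = rng.uniform(0, 100)                   # J: 0 to 100
    return {
        "mean_depth": mean_depth,
        "mean_length": mean_length,
        "length_stdev": length_stdev,
        "mean_identity": mean_identity,
        "max_identity": max_identity,
        "identity_stdev": identity_stdev,
        # G, H, I: exponential with a mean of 2%, so lambda = 1 / 2
        "junk_rate": rng.expovariate(1 / 2),
        "random_rate": rng.expovariate(1 / 2),
        "chimera_rate": rng.expovariate(1 / 2),
        "glitch_size": glitch_size,
        "glitch_rate": glitch_rate(glitch_size),        # K
        # L: exponential with a mean of 50, so lambda = 1 / 50
        "adapter_length": rng.expovariate(1 / 50),
    }


if __name__ == "__main__":
    for name, value in sample_read_set_parameters(random.Random(0)).items():
        print(f"{name}: {value:.2f}")
```

Under this formula a glitch size/skip of 0 gives a rate of 100000, and a size/skip of 100 gives a rate of roughly 500, so larger glitches are paired with smaller rate values.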





Figure S3

Figure S3. Top: the target simulated depth of each replicon relative to the chromosome. The smaller the plasmid, the wider the range of possible depths. Bottom: the absolute read depth of each replicon after read simulation.





Figure S4

Figure S4. Commands used for each of the eight assemblers tested.





Figure S5

Figure S5. Possible states for the assembly of a circular replicon. Reference sequences are shown in the inner circles in black and aligned contig sequences are shown in the outer circles in colour (red at the contig start to violet at the contig end). A: Complete assembly with perfect circularisation. B: Complete assembly but with missing bases leading to a gapped circularisation. C: Complete assembly but with duplicated bases leading to overlapping circularisation. D: Incomplete assembly due to fragmentation (multiple contigs per replicon). E: Incomplete assembly due to missing sequence. F: Incomplete assembly due to misassembly (non-contiguous sequence in the contig).





Figure S6

Figure S6. Reference triplication for assembly assessment. A: Due to the ambiguous starting position of a circular replicon, a completely-assembled contig will typically not align to the reference in a single unbroken alignment. B: Doubling the reference sequence will allow for a single alignment, regardless of starting position. C: However, if the contig contains start/end overlap (i.e. contiguity >100%) then even a doubled reference may not be sufficient to achieve a single alignment, depending on the starting position. D: A tripled reference allows for an unbroken alignment, regardless of starting position, even in cases of >100% contiguity.
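The reasoning in Figure S6 can be illustrated with a toy Python example. This is a sketch under simplifying assumptions: exact substring matching stands in for a real aligner, the 10 bp reference is invented, and the function name is hypothetical.

```python
def fits_unbroken(contig, reference, copies):
    """Return True if the contig matches the repeated reference in one piece.

    Exact substring search is a stand-in for a single unbroken alignment.
    """
    return contig in reference * copies


# Toy circular replicon (invented sequence, 10 bp).
reference = "ACGTTGCAAG"

# A complete contig that starts at position 5 of the circle (100% contiguity).
rotated = reference[5:] + reference[:5]

# A contig that starts at position 8 and overlaps its own start by 3 bp
# (13 bp in total, i.e. contiguity >100%).
overlapping = (reference * 3)[8:21]

if __name__ == "__main__":
    for name, contig in [("rotated", rotated), ("overlapping", overlapping)]:
        for copies in (1, 2, 3):
            result = fits_unbroken(contig, reference, copies)
            print(f"{name} contig vs {copies}x reference: {result}")
```

In this toy case the rotated contig fails against the single reference but fits the doubled one, while the overlapping contig only fits once the reference is tripled, mirroring panels A to D.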





Figure S7

Figure S7. Contiguity of the simulated read set assemblies plotted against Badread parameters for each of the tested assemblers. These plots show how well the assemblers tolerate different problems in the read sets. A: Mean read depth (higher is better). B: Max read identity (higher is better). C: N50 read length (higher is better). D: The sum of random read rate and junk read rate (lower is better). E: Chimeric read rate (lower is better). F: Adapter sequence length (lower is better). G: Glitch size/skip (lower is better).





Figure S8

Figure S8. Plasmid completion for the simulated read set assemblies for each of the tested assemblers, plotted with plasmid length and read depth. Solid dots indicate completely assembled plasmids (contiguity ≥99%) while open dots indicate incomplete plasmids (contiguity <99%). Percentages in the plot titles give the proportion of plasmids which were completely assembled.





Figure S9

Figure S9. Plasmid completion for the real read set assemblies for each of the tested assemblers, plotted with plasmid length and read depth. Solid dots indicate completely assembled plasmids (contiguity ≥99%) while open dots indicate incomplete plasmids (contiguity <99%). Percentages in the plot titles give the proportion of plasmids which were completely assembled.
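The completeness rule used in Figures S8 and S9 (contiguity of at least 99% counts as completely assembled) can be written as a short Python helper. This is an illustrative sketch only: the function name and the example contiguity values are hypothetical and are not taken from the paper's data.

```python
def plasmid_completion(contiguities, threshold=99.0):
    """Count complete plasmids and give their proportion as a percentage.

    contiguities: per-plasmid contiguity values in percent.
    threshold: contiguity at or above which a plasmid counts as complete.
    """
    complete = sum(1 for c in contiguities if c >= threshold)
    if not contiguities:
        return complete, 0.0
    return complete, 100.0 * complete / len(contiguities)


if __name__ == "__main__":
    # Invented contiguity values for four plasmids from one assembler.
    count, percent = plasmid_completion([100.0, 99.0, 98.9, 50.0])
    print(f"{count} complete plasmids ({percent:.1f}%)")
```

The percentage returned here corresponds to the kind of figure shown in the plot titles, computed per assembler.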





License

Creative Commons Attribution 4.0 International

More Repositories

1

Bandage

a Bioinformatics Application for Navigating De novo Assembly Graphs Easily
C++
574
star
2

Unicycler

hybrid assembly pipeline for bacterial genomes
C++
536
star
3

Porechop

adapter trimmer for Oxford Nanopore reads
C++
322
star
4

Basecalling-comparison

A comparison of different Oxford Nanopore basecallers
Python
313
star
5

Trycycler

A tool for generating consensus long-read assemblies for bacterial genomes
Python
299
star
6

Filtlong

quality filtering tool for long reads
C++
271
star
7

Badread

a long read simulator that can imitate many types of read problems
Python
150
star
8

Polypolish

a short-read polishing tool for long-read assemblies
Rust
138
star
9

Deepbinner

a signal-level demultiplexer for Oxford Nanopore reads
Python
124
star
10

Perfect-bacterial-genome-tutorial

Python
116
star
11

Metagenomics-Index-Correction

Python
77
star
12

Bacsort

a collection of scripts for organising bacterial genomes by species
Python
74
star
13

Minipolish

A tool for Racon polishing of miniasm assemblies
Python
71
star
14

Assembly-Dereplicator

A tool for removing redundant genomes from a set of assemblies
Python
64
star
15

August-2019-consensus-accuracy-update

A short analysis of Oxford Nanopore consensus accuracy for bacterial genome assemblies
Python
56
star
16

Verticall

Recombination-free trees
Python
54
star
17

Rebaler

reference-based long read assemblies of bacterial genomes
Python
45
star
18

MinION-desktop

Scripts and programs for the Holt Lab's MinION desktop
Python
32
star
19

Bacterial-genome-assemblies-with-multiplex-MinION-sequencing

Shell
32
star
20

Core-SNP-filter

a tool to filter sites in a FASTA-format whole-genome pseudo-alignment
Rust
26
star
21

Fast5-to-Fastq

A simple tool for extracting reads from Oxford Nanopore fast5 files
Python
26
star
22

Compare-annotations

A script for comparing old vs new versions of genome annotations
Python
20
star
23

LinesOfCodeCounter

A Python script to count lines of code in a directory for specific file extension, excluding blank/comment lines
Python
17
star
24

Catpac

a Contig Alignment Tool for Pairwise Assembly Comparison
Python
12
star
25

Small-plasmid-Nanopore

Python
11
star
26

DASCRUBBER-wrapper

Wrapper script for easier read scrubbing with DASCRUBBER
Python
10
star
27

GFA-dead-end-counter

a tool for counting dead ends in GFA assembly graphs
Rust
9
star
28

SPAdes-Contig-Graph

a tool for creating a FASTG contig graph from a SPAdes assembly
Python
9
star
29

Langtons-Ant-Animator

Program for creating Langton's Ant animations
C++
8
star
30

Klebsiella-assembly-species

a tool for assigning species to Klebsiella assemblies
Python
8
star
31

Circular-Contig-Extractor

Python
8
star
32

KleborateModular

A modular rewrite of Kleborate
Python
4
star
33

SRST2-table-from-assemblies

This is a tool for conducting a gene screen on assemblies, producing an SRST2-like output.
Python
3
star
34

IDBA-to-GFA

Python
3
star
35

Grovolve

Demonstration of evolution by natural selection
C++
3
star
36

Trycycler-paper

Supplementary figures, tables and scripts for the Trycycler paper
Python
3
star
37

Nanopore-barcode-binner

C++
3
star
38

Nanopore-read-processor

A script for sorting, assessing and converting Oxford Nanopore reads
Python
2
star
39

Adapter-assembler

C++
2
star
40

Bugraft

Demonstration of speciation and descent from a common ancestor
C++
2
star
41

Unicycler-assembly-tests

Shell
1
star
42

MLST-from-SRST2

This tool uses a table of compiled results from SRST2 to create an MLST-like scheme.
Python
1
star
43

SPAdes-completion-checker

Tool to assess SPAdes assembly graph paths using read depth
Python
1
star
44

Irsat

Iterative Read Subset Assembly Tool
Python
1
star
45

Polypolish-paper

Supplementary figures, tables and scripts for the Polypolish paper
Python
1
star
46

rrwick.github.io

SCSS
1
star