💾 📃 "Reads to report" for public health and clinical microbiology

Nullarbor

Pipeline to generate complete public health microbiology reports from sequenced isolates

⚠️ This documents the current Nullarbor 2.x version; previous 1.x is here

Motivation

Public health microbiology labs receive batches of bacterial isolates whenever there is a suspected outbreak. In modernised labs, each of these isolates will be whole genome sequenced, typically on an Illumina or Ion Torrent instrument. Each of these WGS samples needs to be quality checked for coverage, contamination and correct species. Genotyping (e.g. MLST) and resistome characterisation are also required. Finally, a phylogenetic tree needs to be generated to show the relationship and genomic distance between the strains. All this information is then combined with epidemiological information (metadata for each sample) to assess the situation and inform further action.

Example reports

Feel free to browse some example reports.

Pipeline

Limitations

Nullarbor currently only supports Illumina paired-end sequencing data; single-end reads, from either Illumina or Ion Torrent, are not supported. All jobs are run on a single compute node; there is no support yet for distributing the work across a high performance cluster.

Per isolate

  1. Clean reads
    • remove adaptors, low quality bases and reads (Trimmomatic)
  2. Species identification
    • k-mer analysis against known genome database (Kraken, Kraken2, Centrifuge)
  3. De novo assembly
    • User can select (SKESA, SPAdes, Megahit, shovill, Velvet)
  4. Annotation
    • Add features to assembly (Prokka)
  5. MLST
    • From assembly w/ automatic scheme detection (mlst + PubMLST)
  6. Resistome
  7. Virulome
  8. Variants
    • From reads aligned to reference (snippy)

Per isolate set

  1. Core genome SNPs
  2. Infer core SNP phylogeny
  3. Pan genome
    • From annotated contigs (Roary)
  4. Report
    • Summary isolate information (HTML + Plotly.JS + DataTables + PhyloCanvas)
    • More detailed per isolate pages (COMING SOON)

Installation

You need to install both the software and the databases separately.

Software

Conda

Install Conda or Miniconda, then run:

conda install -c conda-forge -c bioconda -c defaults nullarbor

Homebrew (coming soon)

Install Homebrew (macOS) or LinuxBrew (Linux), then run:

brew install brewsci/bio/nullarbor

Source

This is the hardest way to install Nullarbor.

cd $HOME
git clone https://github.com/tseemann/nullarbor.git

# keep running this command and installing stuff until it says everything is correct
./nullarbor/bin/nullarbor.pl --check

# For Perl modules (eg. YAML::Tiny), use one of the following methods
apt-get install libyaml-tiny-perl  # ubuntu/debian
yum install perl-YAML-Tiny      # centos/redhat
cpan YAML::Tiny
cpanm YAML::Tiny

Databases

Kraken

You need to install a Kraken database (~8 GB).

wget https://ccb.jhu.edu/software/kraken/dl/minikraken_20171019_8GB.tgz
tar -C $HOME -zxvf minikraken_20171019_8GB.tgz

Kraken 2

You need to install a Kraken2 database (~8 GB).

wget ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken2_v2_8GB_201904_UPDATE.tgz
tar -C $HOME -zxvf minikraken2_v2_8GB_201904_UPDATE.tgz

Centrifuge

Install a Centrifuge database (~8 GB):

wget ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed+h+v.tar.gz
mkdir $HOME/centrifuge-db
tar -C $HOME/centrifuge-db -zxvf p_compressed+h+v.tar.gz

Set global database locations

Then add the following to your $HOME/.bashrc so Nullarbor can find the databases:

export KRAKEN_DEFAULT_DB=$HOME/minikraken_20171019_8GB
export KRAKEN2_DEFAULT_DB=$HOME/minikraken2_v2_8GB_201904_UPDATE
export CENTRIFUGE_DEFAULT_DB=$HOME/centrifuge-db/p_compressed+h+v
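A quick way to confirm the exports took effect in a new shell is to print each variable. The following is a minimal sketch, not part of Nullarbor; the function name check_db_vars is hypothetical, and it only tests that the variables are non-empty, not that the databases themselves are intact.

```shell
# Hypothetical helper (not shipped with Nullarbor): report which of the
# three database variables are set in the current shell.
check_db_vars() {
    status=0
    for name in KRAKEN_DEFAULT_DB KRAKEN2_DEFAULT_DB CENTRIFUGE_DEFAULT_DB; do
        # Look up the value of the variable whose name is stored in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            status=1
        else
            echo "$name=$value"
        fi
    done
    # Non-zero exit status if any variable was missing
    return $status
}
```

Running check_db_vars after sourcing $HOME/.bashrc should list all three paths; the authoritative test remains nullarbor.pl --check.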

You should be good to go now. When you first run Nullarbor it will let you know of any missing dependencies or databases.

Usage

Check dependencies

Nullarbor does a self-check of all binaries, Perl modules and databases:

nullarbor.pl --check

Create a 'samples' file (TAB)

This is a file, one line per isolate, with 3 tab separated columns: ID, R1, R2.

Isolate1	/data/reads/Isolate1_R1.fq.gz	/data/reads/Isolate1_R2.fq.gz
Isolate2	/data/reads/Isolate2_R1.fq	/data/reads/Isolate2_R2.fq
Isolate3	/data/old/s_3_1_sequence.txt	/data/old/s_3_2_sequence.txt
Isolate3b	/data/reads/Isolate3b_R1.fastq	/data/reads/Isolate3b_R2.fastq
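Mistakes in this file are easy to make by hand, for example using spaces instead of tabs or pasting the same path into both read columns. The sketch below is one possible pre-flight check, assuming the three-column layout described above; check_samples is a hypothetical helper name and is not part of Nullarbor.

```shell
# Hypothetical helper (not shipped with Nullarbor): sanity-check a samples file.
# Usage: check_samples samples.tab
check_samples() {
    awk -F'\t' '
        /^#/ || /^[ \t]*$/ { next }   # skip commented-out and blank lines
        NF != 3 {
            printf "line %d: expected 3 tab-separated columns, found %d\n", NR, NF
            bad = 1
            next
        }
        $2 == $3 {
            printf "line %d: R1 and R2 are the same file\n", NR
            bad = 1
        }
        seen[$1]++ {                  # true from the second occurrence of an ID
            printf "line %d: duplicate ID %s\n", NR, $1
            bad = 1
        }
        END { exit bad + 0 }          # non-zero exit status if any problem was found
    ' "$1"
}
```

A line whose columns are separated by spaces rather than tabs is reported as having the wrong column count, since the whole line is then read as fewer than three fields.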

Choose a reference genome (FASTA, GENBANK)

This is just a regular FASTA or GENBANK file. Try to choose a reference that is phylogenomically similar to your isolates.
If you use a GENBANK or EMBL file the annotations will be used to annotate SNPs by Snippy.

Generate the run folder

This command will create a new folder with a Makefile in it:

nullarbor.pl --name PROJNAME --mlst saureus --ref US300.fna --input samples.tab --outdir OUTDIR

This will check that everything is okay. One of the last lines it prints is the command you need to run to actually perform the analysis, e.g.:

Run the pipeline with: nice make -j 4 -C OUTDIR

So you can just cut and paste that:

nice make -j 4 -C OUTDIR

The -C option just means to change into the OUTDIR folder first, so you could do this instead:

cd OUTDIR
make -j 4

View the report

firefox OUTDIR/report/index.html

Here are some example reports.

See some options

Once set up, a Nullarbor folder can be used in a few different ways. See what's available with this command:

make help

Advanced usage

Quick preview mode

You should not do a full run the first time, because your isolate set will probably contain outliers and QC failures. To build a quick "rough" tree:

make preview

This will create a mini-report in the same report/ folder. Use this to identify outliers, then comment them out of (or delete them from) the --input file. Then type the following to regenerate the report for a second round of inspection:

make again
make preview

When you are happy with the result, proceed with the full analysis:

make again
make

Prefilling data

Often you want to perform multiple analyses where some of the isolates have been used in previous Nullarbor runs. It is wasteful to recompute results you already have. The --prefill option allows you to "copy" existing result files into a new Nullarbor folder before commencing the run.

To set it up, add a prefill section to nullarbor.conf as follows:

# nullarbor.conf
prefill:
        contigs.fa: /home/seq/MDU/QC/{ID}/contigs.fa

The {ID} placeholder will be replaced by each isolate ID in your --input TAB file, and contigs.fa will be copied from the resulting source path. This saves Nullarbor from having to re-assemble the reads.
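To see which source paths a prefill template would resolve to before starting a run, the substitution can be previewed outside Nullarbor. This is a rough sketch of the {ID} expansion as described above, not Nullarbor's own code; preview_prefill is a hypothetical name, and the sketch assumes isolate IDs contain no "&" or backslash characters.

```shell
# Hypothetical helper (not shipped with Nullarbor): print each isolate ID
# next to the path its prefill template expands to.
# Usage: preview_prefill TEMPLATE SAMPLES_FILE
preview_prefill() {
    awk -F'\t' -v tpl="$1" '
        /^#/ || /^[ \t]*$/ { next }      # skip commented-out and blank lines
        {
            path = tpl
            gsub(/[{]ID[}]/, $1, path)   # replace every literal {ID} with the isolate ID
            print $1 "\t" path
        }
    ' "$2"
}

# Example:
#   preview_prefill '/home/seq/MDU/QC/{ID}/contigs.fa' samples.tab
```

Checking that each printed path actually exists is a cheap way to catch a mistyped template before the run starts.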

Using different components

Nullarbor 2.x has a plugin system for assembly and tree building. These can be changed using the --assembler and --treebuilder options.

Read trimming is off by default, because most sequences are now provided pre-trimmed, and retrimming occupies much disk space. To trim Illumina adaptors, use the --trim option.

Removing isolates from an existing run

After examining the report from your initial analysis, it is common to observe some outliers, or bad data. In this case, you want to remove those isolates from the analysis, but want to minimize the amount of recomputation needed.

Just go to the original --input TAB file and either (1) remove the offending lines; or (2) just add a # symbol to "comment out" the line and it will be ignored by Nullarbor.

Then go back into the Nullarbor folder and type make again and it should make a new report. Assemblies and SNPs won't be redone, but the tree-builder and pan-genome components will need to run again.
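Commenting out a line can be done in any text editor, but for scripted workflows something like the following sketch may help. The function name comment_out_isolate is hypothetical; it matches the ID column exactly, so removing Isolate3 leaves Isolate3b alone.

```shell
# Hypothetical helper (not shipped with Nullarbor): comment out one isolate.
# Usage: comment_out_isolate SAMPLES_FILE ISOLATE_ID   (result goes to stdout)
comment_out_isolate() {
    awk -F'\t' -v id="$2" '
        $1 == id { print "#" $0; next }   # prefix the matching line with "#"
        { print }                          # pass every other line through unchanged
    ' "$1"
}

# Example:
#   comment_out_isolate samples.tab Isolate3 > samples.new && mv samples.new samples.tab
#   make -C OUTDIR again
```

Writing to a new file and renaming it, as in the example comment, avoids truncating the samples file while it is still being read.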

Adding isolates to an existing run

As per "Removing isolates" above, you can also add in more isolates to your original --input TAB file when you want to expand the analysis. Then just type make again and it should only recalculate things it needs to, saving a lot of computation.

Immediate start

If you don't want to cut and paste the make .... instructions to start the analysis, just add the --run option to your nullarbor.pl command.

Influential environment variables

  • NULLARBOR_CONF - default --conf, the path to nullarbor.conf
  • NULLARBOR_CPUS - default --cpus
  • NULLARBOR_ASSEMBLER - default --assembler tool
  • NULLARBOR_TREEBUILDER - default --treebuilder tool
  • NULLARBOR_TAXONER - default --taxoner tool
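As an illustration, these defaults could be set in $HOME/.bashrc alongside the database variables. The values below are examples only: the configuration path is hypothetical, and the tool names are assumptions taken from the tools listed elsewhere in this document, so confirm the accepted names against Nullarbor's own usage output.

```shell
# Example defaults only; every value here is an assumption to adapt.
export NULLARBOR_CONF="$HOME/nullarbor.conf"   # hypothetical location of nullarbor.conf
export NULLARBOR_CPUS=8                        # default for --cpus
export NULLARBOR_ASSEMBLER=skesa               # default for --assembler (name assumed)
export NULLARBOR_TREEBUILDER=iqtree            # default for --treebuilder (name assumed)
export NULLARBOR_TAXONER=kraken2               # default for --taxoner (name assumed)
```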

Dependencies

Nullarbor has many dependencies, so you are best off using a package manager to install it. Type nullarbor.pl --check to see what you need.

Perl: Bio::Perl Time::Piece List::Util Path::Tiny YAML::Tiny Moo SVG Text::CSV List::MoreUtils IO::File

Tools: seqtk trimmomatic prokka roary mlst abricate seqret skesa megahit spades shovill snippy snp-dists newick-utils iqtree fasttree quicktree kraken kraken2 centrifuge

Databases: minikraken centrifuge-bacvirhum

Note that these are only the immediate dependencies and that the tools listed above will depend on various other tools, Perl modules, and Python modules.
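For a rough look at what is already on your PATH before running the full self-check, a small loop over tool names is enough. This is a sketch only: missing_tools is a hypothetical helper, and some entries in the list above are package names whose executables are named differently, so the example uses only names assumed to match their executables.

```shell
# Hypothetical helper (not shipped with Nullarbor): print every name given as
# an argument that cannot be found as a command on the current PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example with a subset of the tools listed above
missing_tools seqtk prokka roary mlst abricate snippy snp-dists
```

An empty result means every named tool was found; nullarbor.pl --check remains the authoritative test because it also covers Perl modules and databases.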

Etymology

The Nullarbor is a huge treeless plain that spans the area between south-west and south-east Australia. The name comes from the Latin "nullus" (no) and "arbor" (tree), or "no trees". As this software will generate a tree, there is an element of Australian irony in the name.

Issues

Submit problems to the Issues Page

License

GPL 2.0

Citation

Seemann T, Goncalves da Silva A, Bulach DM, Schultz MB, Kwong JC, Howden BP. Nullarbor Github https://github.com/tseemann/nullarbor
