mlst

Scan contig files against traditional PubMLST typing schemes

Quick Start

% mlst contigs.fa
contigs.fa  neisseria  11149  abcZ(672) adk(3) aroE(4) fumC(3) gdh(8) pdhC(4) pgm(6)

% mlst genome.gbk.gz
genome.gbk.gz  sepidermidis  184  arcC(16) aroE(1) gtr(2) mutS(1) pyrR(2) tpiA(1) yqiL(1)

% mlst --label Anthrax GCF_001941925.1_ASM194192v1_genomic.fna.bz2
Anthrax  bcereus  -  glp(24) gmk(1) ilv(~83) pta(1) pur(~71) pyc(37) tpi(41)

% mlst --nopath /opt/data/refseq/S_pyogenes/*.fna
NC_018936.fna  spyogenes  28   gki(4)   gtr(3)   murI(4)   mutS(4)  recP(4)    xpt(2)   yqiL(4)
NC_017596.fna  spyogenes  11   gki(2)   gtr(6)   murI(1)   mutS(2)  recP(2)    xpt(2)   yqiL(2)
NC_008022.fna  spyogenes  55   gki(11)  gtr(9)   murI(1)   mutS(9)  recP(2)    xpt(3)   yqiL(4)
NC_006086.fna  spyogenes  382  gki(5)   gtr(52)  murI(5)   mutS(5)  recP(5)    xpt(4)   yqiL(3)
NC_008024.fna  spyogenes  -    gki(5)   gtr(11)  murI(8)   mutS(5)  recP(15?)  xpt(2)   yqiL(1)
NC_017040.fna  spyogenes  172  gki(56)  gtr(24)  murI(39)  mutS(7)  recP(30)   xpt(2)   yqiL(33)

Installation

Conda

If you are using Conda:

% conda install -c conda-forge -c bioconda -c defaults mlst

Brew

If you are using the macOS Homebrew or Linuxbrew packaging system:

% brew install brewsci/bio/mlst

Source

% cd $HOME
% git clone https://github.com/tseemann/mlst.git
% $HOME/mlst/bin/mlst --help

Dependencies

  • Perl >= 5.26
  • NCBI BLAST+ blastn >= 2.9.0
    • You probably already have blastn installed.
    • If you use Brew or Conda, this will install the blast package for you.
  • Perl modules: Moo,List::MoreUtils,JSON
    • Debian: sudo apt-get install libmoo-perl liblist-moreutils-perl libjson-perl
    • Redhat: sudo yum install perl-Moo perl-List-MoreUtils perl-JSON
    • Most Unix: sudo cpan Moo List::MoreUtils JSON
  • any2fasta
    • Converts sequence files to FASTA, even compressed ones

Usage

Simply give it a genome file in FASTA/GenBank/EMBL format, optionally compressed with gzip, zip or bzip2.

% mlst contigs.fa
contigs.fa  neisseria  11149  abcZ(672) adk(3) aroE(4) fumC(3) gdh(8) pdhC(4) pgm(6)

It returns a tab-separated line containing:

  • the filename
  • the matching PubMLST scheme name
  • the ST (sequence type)
  • the allele IDs

You can give it multiple files at once, in any mix of FASTA/GenBank/EMBL formats, optionally compressed with gzip, bzip2 or zip.

% mlst genomes/*
genomes/6008.fna        saureus         239  arcc(2)   aroe(3)   glpf(1)   gmk_(1)   pta_(4)   tpi_(4)   yqil(3)
genomes/strep.fasta.gz  ssuis             1  aroA(1)   cpn60(1)  dpr(1)    gki(1)    mutS(1)   recA(1)   thrA(1)
genomes/NC_002973.gbk   lmonocytogenes    1  abcZ(3)   bglA(1)   cat(1)    dapE(1)   dat(3)    ldh(1)    lhkA(3)
genomes/L550.gbk.bz2    leptospira      152  glmU(26)  pntA(30)  sucA(28)  tpiA(35)  pfkB(39)  mreA(29)  caiB(29)

Without auto-detection

You can force a particular scheme (useful for reporting systems):

% mlst --scheme neisseria NM*
NM003.fa   neisseria  4821  abcZ(222)  adk(3)  aroE(58)  fumC(275)  gdh(30)  pdhC(5)  pgm(255)
NM005.gbk  neisseria  177   abcZ(7)    adk(8)  aroE(10)  fumC(38)   gdh(10)  pdhC(1)  pgm(20)
NM011.fa   neisseria  11    abcZ(2)    adk(3)  aroE(4)   fumC(3)    gdh(8)   pdhC(4)  pgm(6)
NMC.gbk.gz neisseria  8     abcZ(2)    adk(3)  aroE(7)   fumC(2)    gdh(8)   pdhC(5)  pgm(2)

You can make mlst behave like older versions, from before auto-detection existed, by providing the --legacy parameter together with the --scheme parameter. In that case it will print a fixed tabular output with a heading containing the allele names specific to that scheme:

% mlst --legacy --scheme neisseria *.fa
FILE      SCHEME     ST     abcZ  adk  aroE  fumC  gdh  pdhC  pgm
NM003.fa  neisseria  11     2     3    4     3     8    4     6
NM009.fa  neisseria  11149  672   3    4     3     8    4     6
MN043.fa  neisseria  11     2     3    4     3     8    4     6
NM051.fa  neisseria  11     2     3    4     3     8    4     6
NM099.fa  neisseria  1287   2     3    4     17    8    4     6
NM110.fa  neisseria  11     2     3    4     3     8    4     6

Available schemes

To see which PubMLST schemes are supported:

% mlst --list

abaumannii achromobacter aeromonas afumigatus cdifficile efaecium
hcinaedi hparasuis hpylori kpneumoniae leptospira
saureus xfastidiosa yersinia ypseudotuberculosis yruckeri

The above list is shortened. You can get more details using mlst --longlist.

achromobacter     nusA       rpoB      eno       gltB      lepA       nuoL      nrdA
abaumannii        Oxf_gltA   Oxf_gyrB  Oxf_gdhB  Oxf_recA  Oxf_cpn60  Oxf_gpi   Oxf_rpoD
abaumannii_2      Pas_cpn60  Pas_fusA  Pas_gltA  Pas_pyrG  Pas_recA   Pas_rplB  Pas_rpoB
aeromonas         gyrB       groL      gltA      metG      ppsA       recA
aphagocytophilum  pheS       glyA      fumC      mdh       sucA       dnaN      atpA
arcobacter        aspA       atpA      glnA      gltA      glyA       pgm       tkt
afumigatus        ANX4       BGT1      CAT1      LIP       MAT1_2     SODB      ZRF2
bcereus           glp        gmk       ilv       pta       pur        pyc       tpi
<snip>

Missing data

Version 2.x does not just look for exact matches to full length alleles. It attempts to tell you as much as possible about what it found using the notation below:

Symbol  Meaning                                Length      Identity
n       exact intact allele                    100%        100%
~n      novel full length allele similar to n  100%        ≥ --minid
n?      partial match to known allele          ≥ --mincov  ≥ --minid
-       allele missing                         < --mincov  < --minid
n,m     multiple alleles
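
The notation above can be recognised mechanically when post-processing results. The following is a minimal sketch, not part of mlst: classify_call is a hypothetical helper, and the category names it prints are labels chosen for this example. It takes the text found inside the parentheses of one allele column.

```shell
#!/bin/sh
# Classify one allele call (the text inside "gene(...)") using the
# notation table above. classify_call is a hypothetical helper name.
classify_call() {
  case "$1" in
    -)     echo missing ;;    # "-"   : allele missing
    *,*)   echo multiple ;;   # "n,m" : multiple alleles
    "~"*)  echo novel ;;      # "~n"  : novel full length allele
    *"?")  echo partial ;;    # "n?"  : partial match
    *)     echo exact ;;      # "n"   : exact intact allele
  esac
}

# Print each example call next to its category.
for call in 672 "~83" "15?" - "2,5"; do
  printf '%s\t%s\n' "$call" "$(classify_call "$call")"
done
```

The multiple-allele check comes before the others so that a combined call such as 2,5 is never mistaken for an exact match.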

Scoring system

Each MLST prediction gets a score out of 100. The score for a scheme with N alleles is as follows:

  • +90/N points for an exact allele match e.g. 42
  • +63/N points for a novel allele match (70% of an exact allele) e.g. ~42
  • +18/N points for a partial allele match (20% of an exact allele) e.g. 42?
  • 0 points for a missing allele e.g. -
  • +10 points if there is a matching ST type for the allele combination

It is possible to filter results using the --minscore option which takes a value between 1 and 100. If you only want to report known ST types, then use --minscore 100. To also include novel combinations of existing alleles with no ST type, use --minscore 90. The default is --minscore 50 which is an ad hoc value I have found allows for genuine partial ST matches but eliminates false positives.
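
To make the arithmetic concrete, here is a rough sketch that recomputes a score from one line of default (non-legacy) output, using only the point values listed above. It is an illustration, not mlst's own code: mlst_score is a hypothetical helper, truncating the result to an integer is an assumption, and because the list does not say how multiple-allele calls (n,m) are scored, this sketch gives them 0.

```shell
#!/bin/sh
# Recompute the score for each line of default mlst TSV read on stdin.
# Columns: FILE, SCHEME, ST, then one "gene(call)" column per locus.
# mlst_score is a hypothetical helper, not part of mlst.
mlst_score() {
  awk -F'\t' '{
    n = NF - 3                        # number of loci in the scheme
    if (n < 1) next                   # skip lines without allele columns
    pts = 0
    for (i = 4; i <= NF; i++) {
      call = $i
      sub(/^[^(]*\(/, "", call)       # drop the leading "gene("
      sub(/\)$/, "", call)            # drop the trailing ")"
      if (call == "-")        pts += 0    # missing
      else if (call ~ /,/)    pts += 0    # multiple: assumed 0 in this sketch
      else if (call ~ /^~/)   pts += 63   # novel
      else if (call ~ /\?$/)  pts += 18   # partial
      else                    pts += 90   # exact
    }
    score = pts / n
    if ($3 != "-") score += 10        # the allele combination has an ST
    printf "%s\t%d\n", $1, int(score) # truncation is an assumption
  }'
}

# The Anthrax line from the Quick Start: 5 exact and 2 novel alleles, no ST.
printf 'Anthrax\tbcereus\t-\tglp(24)\tgmk(1)\tilv(~83)\tpta(1)\tpur(~71)\tpyc(37)\ttpi(41)\n' | mlst_score
```

For that line the sketch gives (5 × 90 + 2 × 63) / 7 ≈ 82.3 with no ST bonus, so it prints 82, which would pass the default --minscore 50 but not --minscore 90.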

Tweaking the output

The output is TSV (tab-separated values). This makes it easy to parse and manipulate with Unix utilities like cut and sort etc. For example, if you only want the filename and ST you can do the following:

% mlst --scheme abaumannii AB*.fasta | cut -f1,3 > ST.tsv
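
The same TSV can be summarised with standard tools. As a minimal sketch, assuming default (non-legacy) output on stdin, the hypothetical helper st_counts below tallies how many files were assigned to each scheme and ST; it is not part of mlst.

```shell
#!/bin/sh
# Count input files per (scheme, ST) pair from default mlst TSV on stdin.
# Output columns: SCHEME, ST, COUNT. st_counts is a hypothetical helper.
st_counts() {
  awk -F'\t' '
    { n[$2 "\t" $3]++ }                              # key on columns 2 and 3
    END { for (k in n) printf "%s\t%d\n", k, n[k] }
  ' | LC_ALL=C sort                                  # fixed, locale-independent order
}

# Example with three made-up result lines (normally: mlst genomes/* | st_counts).
printf 'a.fa\tneisseria\t11\tabcZ(2)\nb.fa\tneisseria\t11\tabcZ(2)\nc.fa\tneisseria\t1287\tabcZ(2)\n' | st_counts
```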

If you prefer CSV because it loads more smoothly into MS Excel, use the --csv option:

% mlst --csv Peptobismol.fna.gz > mlst.csv

JSON output is available too; it returns an array of dictionaries, one per input file. The id will be the same as filename unless --label is used, but that only works when scanning a single file.

% mlst -q --json out.json test/example.gbk.gz test/novel.fasta.bz2
% cat out.json
[
   {
      "scheme" : "sepidermidis",
      "alleles" : {
         "mutS" : "1",
         "yqiL" : "1",
         "tpiA" : "1",
         "pyrR" : "2",
         "gtr" : "2",
         "aroE" : "1",
         "arcC" : "16"
      },
      "sequence_type" : "184",
      "filename" : "test/example.gbk.gz",
      "id" : "test/example.gbk.gz"
   },
   {
      "sequence_type" : "-",
      "filename" : "test/novel.fasta.bz2",
      "scheme" : "spneumoniae",
      "alleles" : {
         "gki" : "2",
         "aroE" : "7",
         "ddl" : "22",
         "gdh" : "15",
         "xpt" : "1",
         "recP" : "~10",
         "spi" : "6"
      },
      "id" : "test/novel.fasta.bz2"
   }
]

You can also save the "novel" alleles for submission to PubMLST:

% mlst -q --novel nouveau.fa s_myces.fasta

% cat nouveau.fa

>streptomyces.recA-e562a2cd93e701e3b58ba0670bcbba0c s_myces.fasta
GACGTGGCCCTCGGCGTCGGCGGTCTGCCGCGCGGCCGCGTCGTCGAGATCTACGGACCGGAGTCCTCC...

The format of the sequence IDs is scheme.allele-hash filename where hash is the hexadecimal MD5 digest of the allele DNA sequence.
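
If you need to match a sequence against these IDs yourself, the hash part can be reproduced with standard tools. This is a sketch under assumptions: seq_md5 and novel_id are hypothetical helpers, md5sum is the GNU coreutils command (on macOS the md5 command is the usual equivalent), and stripping whitespace before hashing is a guess, since the text above does not say how mlst normalises the sequence.

```shell
#!/bin/sh
# Hex MD5 digest of a sequence passed as the first argument.
# Assumption: whitespace is removed before hashing; no case conversion is done.
seq_md5() {
  printf '%s' "$1" | tr -d '[:space:]' | md5sum | cut -d' ' -f1
}

# Build an ID in the "scheme.allele-hash" form described above.
# usage: novel_id SCHEME LOCUS SEQUENCE
novel_id() {
  printf '%s.%s-%s\n' "$1" "$2" "$(seq_md5 "$3")"
}

# Example with a short made-up sequence, not a real allele.
novel_id streptomyces recA GACGTGGCC
```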

Mapping to genus/species

Included is a file called db/scheme_species_map.tab which has 3 tab-separated columns as follows:

#SCHEME GENUS   SPECIES
abaumannii      Acinetobacter   baumannii
abaumannii_2    Acinetobacter   baumannii
achromobacter   Achromobacter
aeromonas       Aeromonas
afumigatus      Aspergillus     afumigatus
arcobacter      Arcobacter
bburgdorferi    Borrelia        burgdorferi
bhampsonii      Brachyspira     hampsonii
bhenselae       Bartonella      henselae
borrelia        Borrelia
bpilosicoli     Brachyspira     pilosicoli
<snip>

Note that some schemes are species-specific while others are genus-specific; for genus-specific schemes the SPECIES column is empty. The same species/genus can apply to multiple schemes; see abaumannii above.
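
A lookup against this file is a one-liner in awk. The sketch below assumes the three tab-separated columns shown above; scheme_taxon is a hypothetical helper, and the default path is simply the file name mentioned in the text, relative to the mlst folder.

```shell
#!/bin/sh
# Print "GENUS SPECIES" (or just "GENUS") for a scheme name.
# usage: scheme_taxon SCHEME [MAPFILE]
# Returns non-zero if the scheme is not listed. Hypothetical helper.
scheme_taxon() {
  awk -F'\t' -v s="$1" '
    /^#/ { next }                                   # skip the header line
    $1 == s { print ($3 == "" ? $2 : $2 " " $3); found = 1; exit }
    END { exit !found }
  ' "${2:-db/scheme_species_map.tab}"
}

# Example against a small copy of the table above.
map=$(mktemp)
printf '#SCHEME\tGENUS\tSPECIES\nabaumannii\tAcinetobacter\tbaumannii\nborrelia\tBorrelia\t\n' > "$map"
scheme_taxon abaumannii "$map"
scheme_taxon borrelia "$map"
rm -f "$map"
```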

Updating the database

The mlst software comes bundled with the traditional MLST databases; namely those schemes with fewer than 10 genes. I strive to make regular releases with updated databases, but if this is not frequent enough you can update the databases yourself using some tools included in the scripts folder as follows:

# Figure out where mlst is installed
% which mlst
/home/user/sw/mlst/bin/mlst

# Go into the scripts folder (you need to have write access!)
% cd /home/user/sw/mlst/scripts

# Run the downloader script (you need 'wget' installed)
% ./mlst-download_pub_mlst | bash

# Check it downloaded everything ok
% find pubmlst | less

# Save the old database folder
% mv ../db/pubmlst ../db/pubmlst.old

# Put the new folder there
% mv ./pubmlst ../db/

# Regenerate the BLAST database
% ./mlst-make_blast_db

# Check schemes are installed
% ../bin/mlst --list

Adding a new scheme

If you are unable or unwilling to add your scheme to PubMLST via BIGSdb you can insert a new scheme into your local mlst database.

The directory structure

Each MLST scheme exists in a folder within the mlst/db/pubmlst folder. The name of the folder is the scheme name, say saureus for Staphylococcus aureus. It contains files like this:

% cd mlst/db/pubmlst/saureus
% ls -1
saureus.txt
arcC.tfa
aroE.tfa
glpF.tfa
gmk.tfa
pta.tfa
tpi.tfa
yqiL.tfa

The folder name (i.e. saureus) must be the same as the name of the scheme file (i.e. saureus.txt) or it will not work.

The scheme file

The saureus.txt file is a tab-separated file containing one ST definition per row. The header line must be present. Extra columns with names mlst_clade, clonal_complex, species, CC, Lineage are ignored.

% head -n 5 saureus.txt
ST      arcC    aroE    glpF    gmk     pta     tpi     yqiL    clonal_complex
1       1       1       1       1       1       1       1
2       2       2       2       2       2       2       26
3       1       1       1       9       1       1       12
4       10      10      8       6       10      3       2

The allele sequence files

Each .tfa file is a nucleotide FASTA file containing the allele sequences for one locus. There must be a .tfa file for each and every allele locus in the TSV scheme .txt file. Here is what the arcC.tfa file looks like:

% head -n 20 arcC.tfa
>arcC_1
TTATTAATCCAACAAGCTAAATCGAACAGTGACACAACGCCGGCAATGCCATTGGATACT
TGTGGTGCAATGTCACAGGGTATGATAGGCTATTGGTTGGAAACTGAAATCAATCGCATT
TTAACTGAAATGAATAGTGATAGAACTGTAGGCACAATCGTTACACGTGTGGAAGTAGAT
AAAGATGATCCACGATTCAATAACCCAACCAAACCAATTGGTCCTTTTTATACGAAAGAA
GAAGTTGAAGAATTACAAAAAGAACAGCCAGACTCAGTCTTTAAAGAAGATGCAGGACGT
GGTTATAGAAAAGTAGTTGCGTCACCACTACCTCAATCTATACTAGAACACCAGTTAATT
CGAACTTTAGCAGACGGTAAAAATATTGTCATTGCATGCGGTGGTGGCGGTATTCCAGTT
ATAAAAAAAGAAAATACCTATGAAGGTGTTGAAGCG
>arcC_2
TTATTAATCCAACAAGCTAAATCGAACAGTGACACAACGCCGGCAATGCCATTGGATACT
TGTGGTGCAATGTCACAAGGTATGATAGGCTATTGGTTGGAAACTGAAATCAATCGCATT
TTAACTGAAATGAATAGTGATAGAACTGTAGGCACAATCGTAACACGTGTGGAAGTAGAT
AAAGATGATCCACGATTTGATAACCCAACTAAACCAATTGGTCCTTTTTATACGAAAGAA
GAAGTTGAAGAATTACAAAAAGAACAGCCAGGCTCAGTCTTTAAAGAAGATGCAGGACGT
GGTTATAGAAAAGTAGTTGCGTCACCACTACCTCAATCTATACTAGAACACCAGTTAATT
CGAACTTTAGCAGACGGTAAAAATATTGTCATTGCATGCGGTGGTGGCGGTATTCCAGTT
ATAAAAAAAGAAAATACCTATGAAGGTGTTGAAGCG

The FASTA sequence IDs must be named as >allele_number or >allele-number. Ideally the sequences will not contain any ambiguous IUPAC symbols, i.e. just A, T, C, G.

Adding a new scheme

  1. Make a new folder in mlst/db/pubmlst/SCHEME
  2. Put your SCHEME.txt file in there
  3. Put your ALLELE.tfa files in there
  4. Run mlst/scripts/mlst-make_blast_db to update the BLAST indices
  5. Run mlst --longlist | grep SCHEME to see if it exists
  6. Run mlst --scheme SCHEME file.fasta to see if it works

If it doesn't, go back and check that you really did Step 4 above.
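
Before running Step 4, it can help to confirm that the folder matches the layout described earlier. The sketch below is not shipped with mlst: check_scheme is a hypothetical helper. It assumes the header of SCHEME.txt is tab-separated and treats every column other than ST and the ignored extras (mlst_clade, clonal_complex, species, CC, Lineage) as a locus that needs a .tfa file.

```shell
#!/bin/sh
# Check that a scheme folder contains SCHEME.txt and one LOCUS.tfa per locus.
# usage: check_scheme DIR   (the folder name is taken as the scheme name)
# Hypothetical helper; prints what is missing to stderr, returns non-zero on failure.
check_scheme() {
  dir=${1%/}
  scheme=$(basename "$dir")
  txt="$dir/$scheme.txt"
  [ -f "$txt" ] || { echo "missing $txt" >&2; return 1; }
  status=0
  # Header columns other than ST and the ignored extras are treated as loci.
  for locus in $(head -n 1 "$txt" | tr '\t' '\n' |
      grep -v -x -E 'ST|mlst_clade|clonal_complex|species|CC|Lineage'); do
    [ -f "$dir/$locus.tfa" ] || { echo "missing $dir/$locus.tfa" >&2; status=1; }
  done
  return $status
}

# Example: a throwaway scheme folder that lacks aroE.tfa.
tmp=$(mktemp -d)
mkdir -p "$tmp/saureus"
printf 'ST\tarcC\taroE\tclonal_complex\n1\t1\t1\n' > "$tmp/saureus/saureus.txt"
: > "$tmp/saureus/arcC.tfa"
check_scheme "$tmp/saureus" || echo "scheme incomplete"
rm -rf "$tmp"
```

This only checks file names; it does not verify that the FASTA sequence IDs follow the allele_number convention or that the BLAST indices have been rebuilt.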

Citations

The mlst software incorporates components of the PubMLST database which must be cited in any publications that use mlst:

"This publication made use of the PubMLST website (https://pubmlst.org/) developed by Keith Jolley (Jolley & Maiden 2010, BMC Bioinformatics, 11:595) and sited at the University of Oxford. The development of that website was funded by the Wellcome Trust".

You should also cite this software (currently unpublished) as:

Bugs

Please submit via the GitHub Issues page.

Licence

GPL v2

Author

Torsten Seemann
