  • Language
    Python
  • License
    GNU General Publi...

Repository Details

Processing pipeline for pan-genome visualization and exploration

panX: microbial pan-genome analysis and exploration

Wei Ding, Franz Baumdicker, Richard A Neher; panX: pan-genome analysis and exploration, Nucleic Acids Research, Volume 46, Issue 1, 9 January 2018, Pages e5, https://doi.org/10.1093/nar/gkx977

Overview: panX is a software package for microbial pan-genome analysis, visualization and exploration. The analysis pipeline is based on DIAMOND, MCL and phylogeny-aware post-processing. It takes a set of annotated bacterial strains as input (e.g. NCBI RefSeq records or the user's own data in GenBank format). All genes from all strains are compared to each other via DIAMOND and then clustered into orthologous groups using MCL and adaptive phylogenetic post-processing, which splits distantly related genes and paralogs if necessary. For each gene cluster, the corresponding alignment and phylogeny are constructed. All core gene SNPs are then used to build the strain/species phylogeny.

The results can be explored interactively using a powerful web-based visualization application (either hosted on a web server or run locally on a desktop). The web application integrates several interconnected components (pan-genome statistical charts, gene cluster table, alignment, comparative phylogenies, metadata table) and allows rapid searching and filtering of gene clusters by gene name, annotation, duplication, diversity, gene gain/loss events, etc. Strain-specific metadata are integrated into the strain phylogeny so that genes related to adaptation, antibiotic resistance or virulence can be readily identified.

Pipeline overview

[Figure: panX pipeline overview]

Quick start

git clone https://github.com/neherlab/pan-genome-analysis.git
cd pan-genome-analysis

Install dependencies easily via Conda and then run the test: sh run-TestSet.sh

The results can be explored using our interactive pan-genome-visualization application.

Installing dependencies

Conda

The required software and python packages can be readily installed using Conda.

wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh
export PATH=~/miniconda2/bin:$PATH
conda env create -f panX-environment.yml
source activate panX

Overview of dependencies:

How to run

To run the test set: sh run-TestSet.sh

In data/TestSet, you will find a small set of four Mycoplasma genitalium genomes that is used in this tutorial. Your own data should reside in a similar folder within data/; we refer to this folder as the run directory below. The name of the run directory is used as the species name in downstream analysis.

Omitting the -st option runs all steps in order, whereas -st 5 6 runs only steps 5 and 6. When running only specific steps, all earlier steps (here, those before step 5) must already be finished.

-t sets the number of CPU cores.

./panX.py -fn data/TestSet -sl TestSet -t 32 > TestSet.log 2> TestSet.err

This calls panX.py to run each step using scripts located in folder ./scripts/

./panX.py [-h] -fn folder_name -sl species_name
                   [-st steps [steps ...]] [-rt raxml_max_time]
                   [-t threads] [-bp blast_file_path]

Mandatory parameters: -fn folder_name / -sl species_name
NOTICE: species_name e.g.: S_aureus
Example: ./panX.py -fn ./data/TestSet -sl TestSet -t 32 > TestSet.log 2> TestSet.err
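
When panX is launched from another script, the command line above can be assembled programmatically. The following is a minimal sketch that uses only the flags documented in this section (-fn, -sl, -st, -t); the helper name build_panx_command is hypothetical and not part of panX.

```python
def build_panx_command(folder_name, species_name, threads=None, steps=None):
    """Assemble the panX.py argument list (hypothetical helper, not part of panX)."""
    cmd = ["./panX.py", "-fn", folder_name, "-sl", species_name]
    if steps:
        # -st accepts one or more step numbers, e.g. "-st 5 6"
        cmd += ["-st"] + [str(step) for step in steps]
    if threads is not None:
        # -t sets the number of CPU cores
        cmd += ["-t", str(threads)]
    return cmd


cmd = build_panx_command("data/TestSet", "TestSet", threads=32, steps=[5, 6])
print(" ".join(cmd))
# To actually launch the pipeline from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when folder names contain spaces.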

Directory structure and analysis output

The analysis generates the clustering result ./data/YourSpecies/allclusters_final.tsv and the files required for visualizing the pan-genome using pan-genome-visualization.

./data
    YourSpecies               # folder specific to your pan-genome
      - input_GenBank              # INPUT: genomes in GenBank format
        - strain1.gbk
        - strain2.gbk
        ...
      - vis
        - geneCluster.json       # for clusters table: gene clusters and their summary statistics
        - strainMetainfo.json    # for metadata table: strain-associated metadata
        - metaConfiguration.js   # metadata configuration file (a valid customized file is also accepted)
        - coreGenomeTree.json    # core genome SNP tree (json file)
        - strain_tree.nwk        # core genome SNP tree (newick file)

        - geneCluster/           # folder containing orthologous clusters
                                 # nucleotide and amino acid alignment in gzipped FASTA format
                                 # reduced alignment contains a consensus sequence and variable sites (identical sites shown as dots)
                                 # tree and presence/absence(gain/loss) pattern in json format
          - GC00000001_na_aln.fa.gz
          - GC00000001_aa_aln.fa.gz
          - GC00000001_na_aln_reduced.fa.gz
          - GC00000001_aa_aln_reduced.fa.gz
          - GC00000001_tree.json
          - GC00000001_patterns.json
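
The gzipped FASTA alignments in vis/geneCluster/ can be read with the Python standard library alone. The sketch below is illustrative rather than part of panX: read_fasta_gz and expand_dots are hypothetical helper names, and expand_dots assumes that a dot in a reduced-alignment row stands for the consensus character at the same position, as the directory listing above suggests. Which record holds the consensus is not stated here, so the caller has to supply it.

```python
import gzip


def read_fasta_gz(path):
    """Return {header: sequence} for a gzipped FASTA file (hypothetical helper)."""
    records = {}
    name = None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:]
                records[name] = []
            elif name is not None:
                records[name].append(line)
    # Join multi-line sequences into one string per record
    return {key: "".join(parts) for key, parts in records.items()}


def expand_dots(consensus, row):
    """Replace '.' in a reduced-alignment row with the consensus character.

    Assumption: dots mark sites identical to the consensus sequence.
    """
    return "".join(c if r == "." else r for c, r in zip(consensus, row))
```

For example, read_fasta_gz("data/YourSpecies/vis/geneCluster/GC00000001_na_aln.fa.gz") would return the nucleotide alignment of the first cluster as a dictionary keyed by sequence header.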

The file step-tutorials.md describes in more detail which step produces which files and directories.

Command line arguments

Soft core-gene:

-cg    core-genome threshold [e.g.: 0.7]: fraction of strains used to decide whether a gene is core
E.g.: ./panX.py -cg 0.7 -fn ...
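
To make the soft core-gene threshold concrete, the sketch below classifies a gene as core when it is present in at least the given fraction of strains. This is an assumption about how the threshold is applied, not panX's actual code; in particular, the pipeline may use a different comparison or rounding. The function name is_core_gene is hypothetical.

```python
def is_core_gene(strains_with_gene, all_strains, threshold):
    """Return True if the gene is present in at least `threshold` of all strains.

    Assumption: "core" means presence fraction >= threshold.
    """
    all_strains = set(all_strains)
    if not all_strains:
        raise ValueError("all_strains must not be empty")
    present = len(set(strains_with_gene) & all_strains)
    return present / len(all_strains) >= threshold


strains = ["strain%d" % i for i in range(1, 11)]  # 10 strains
carriers = strains[:8]                            # gene found in 8 of them

print(is_core_gene(carriers, strains, threshold=0.7))  # soft core
print(is_core_gene(carriers, strains, threshold=1.0))  # strict core
```

With a threshold of 0.7, a gene found in 8 of 10 strains counts as core, whereas a strict threshold of 1.0 excludes it.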

Large dataset (use the divide-and-conquer (DC) strategy, which scales approximately linearly with the number of genomes):

-dmdc  apply DC strategy to run DIAMOND on subsets and then combine the results
-dcs   subset size used in DC strategy [default:50]
E.g.: ./panX.py -dmdc -dcs 50 -fn ...
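
The splitting half of the divide-and-conquer idea can be sketched as follows: the strain list is cut into consecutive subsets of at most -dcs strains, each of which would then be compared on its own. How panX combines the per-subset DIAMOND results is not described here and is not shown. The function name split_into_subsets is hypothetical.

```python
def split_into_subsets(strains, subset_size=50):
    """Split a list of strains into consecutive subsets of at most subset_size."""
    if subset_size < 1:
        raise ValueError("subset_size must be at least 1")
    return [strains[i:i + subset_size]
            for i in range(0, len(strains), subset_size)]


strains = ["strain%03d" % i for i in range(120)]
subsets = split_into_subsets(strains, subset_size=50)
print([len(subset) for subset in subsets])
```

For 120 strains and a subset size of 50, this yields three subsets of 50, 50 and 20 strains.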

Calculate branch associations with metadata (e.g. drug concentration):

-iba  infer_branch_association
-mtf  ./data/yourSpecies/meta_config.tsv
E.g.: ./panX.py -iba -mtf ./data/yourSpecies/meta_config.tsv -fn ...

Example: meta_config.tsv

For the branch association to take effect in the visualization, add the generated file to the visualization repository as described in "Special feature: visualize branch association (BA) and presence/absence (PA) association".
