run_dbcan4
Status
A standalone version of the dbCAN3 web server.
Update Info
run_dbcan 4.0.0 is released.
- CAZyme substrate prediction based on dbCAN-sub.
- CGC substrate prediction based on dbCAN-PUL searching and dbCAN-sub majority voting. For details on CGC substrate prediction, please see our dbCAN-seq update paper. With these new functions (especially the dbCAN-sub search), run_dbcan 4.0 takes longer to return results. Please be patient!
- See #127 for an explanation of the different output files for substrate predictions.
- Please split your files if your input contains more than 1 million proteins.
Please update all of the databases.
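The advice above to split inputs larger than 1 million proteins could be followed with a small helper like the one below. This is only a sketch and not part of run_dbcan: the function names and the `.partN` naming scheme are hypothetical, and it assumes a plain, uncompressed FASTA file.

```python
from pathlib import Path
from typing import Iterator, List


def read_fasta_records(path: Path) -> Iterator[List[str]]:
    """Yield each FASTA record as a list of lines (header line first)."""
    record: List[str] = []
    with path.open() as handle:
        for line in handle:
            if line.startswith(">") and record:
                yield record
                record = []
            if line.strip():  # skip blank lines
                record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield record


def split_fasta(path: Path, out_dir: Path, max_records: int = 1_000_000) -> List[Path]:
    """Write chunks holding at most max_records sequences; return the chunk paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks: List[Path] = []
    out = None
    for index, record in enumerate(read_fasta_records(path)):
        if index % max_records == 0:
            # Start a new chunk file (hypothetical naming: <stem>.partN<suffix>).
            if out is not None:
                out.close()
            chunk_path = out_dir / f"{path.stem}.part{len(chunks) + 1}{path.suffix}"
            chunks.append(chunk_path)
            out = chunk_path.open("w")
        out.writelines(record)
    if out is not None:
        out.close()
    return chunks
```

Each resulting chunk can then be passed to run_dbcan as a separate input file.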
Function
- Accepts user input
- Predicts genes if needed
- Runs input against HMMER, DIAMOND, and dbCAN_sub
- Optionally predicts CGCs with CGCFinder
Supported Platforms
Linux (Ubuntu, CentOS), macOS
Installation via Bioconda
1. Please install Anaconda first.
2. Install NCBI BLAST+.
3. Create a virtual environment with dbcan and activate it.
conda create -n run_dbcan python=3.8 dbcan -c conda-forge -c bioconda
conda activate run_dbcan
If you are an existing user, just update the run_dbcan conda environment by running conda install dbcan.
4. Database Installation.
test -d db || mkdir db
cd db \
&& wget http://bcb.unl.edu/dbCAN2/download/Databases/fam-substrate-mapping-08252022.tsv \
&& wget http://bcb.unl.edu/dbCAN2/download/Databases/PUL.faa && makeblastdb -in PUL.faa -dbtype prot \
&& wget http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN-PUL_07-01-2022.xlsx \
&& wget http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN-PUL_07-01-2022.txt \
&& wget http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN-PUL.tar.gz && tar xvf dbCAN-PUL.tar.gz \
&& wget http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN_sub.hmm && hmmpress dbCAN_sub.hmm \
&& wget http://bcb.unl.edu/dbCAN2/download/Databases/V11/CAZyDB.08062022.fa && diamond makedb --in CAZyDB.08062022.fa -d CAZy \
&& wget https://bcb.unl.edu/dbCAN2/download/Databases/V11/dbCAN-HMMdb-V11.txt && mv dbCAN-HMMdb-V11.txt dbCAN.txt && hmmpress dbCAN.txt \
&& wget https://bcb.unl.edu/dbCAN2/download/Databases/V11/tcdb.fa && diamond makedb --in tcdb.fa -d tcdb \
&& wget http://bcb.unl.edu/dbCAN2/download/Databases/V11/tf-1.hmm && hmmpress tf-1.hmm \
&& wget http://bcb.unl.edu/dbCAN2/download/Databases/V11/tf-2.hmm && hmmpress tf-2.hmm \
&& wget https://bcb.unl.edu/dbCAN2/download/Databases/V11/stp.hmm && hmmpress stp.hmm \
&& cd ../ && wget http://bcb.unl.edu/dbCAN2/download/Samples/EscheriaColiK12MG1655.fna \
&& wget http://bcb.unl.edu/dbCAN2/download/Samples/EscheriaColiK12MG1655.faa \
&& wget http://bcb.unl.edu/dbCAN2/download/Samples/EscheriaColiK12MG1655.gff
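After the download commands above finish, it can be useful to confirm that the database directory is complete before a long run. The sketch below is not part of run_dbcan: the function name is hypothetical, and the list of expected files is an assumption derived from the commands above (hmmpress writes `.h3f`, `.h3i`, `.h3m`, and `.h3p` index files; `diamond makedb -d CAZy` writes `CAZy.dmnd`). The BLAST index files for PUL.faa are not checked because their names vary between BLAST+ versions.

```python
from pathlib import Path
from typing import List

# Index files written by hmmpress next to each HMM database.
HMMPRESS_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")

# File names taken from the download commands in step 4.
HMM_DATABASES = ("dbCAN.txt", "dbCAN_sub.hmm", "tf-1.hmm", "tf-2.hmm", "stp.hmm")
DIAMOND_DATABASES = ("CAZy.dmnd", "tcdb.dmnd")
OTHER_FILES = ("PUL.faa", "fam-substrate-mapping-08252022.tsv")


def missing_database_files(db_dir: Path) -> List[str]:
    """Return the expected database files that are absent from db_dir."""
    expected = list(DIAMOND_DATABASES) + list(OTHER_FILES)
    for hmm in HMM_DATABASES:
        expected.append(hmm)
        expected.extend(hmm + suffix for suffix in HMMPRESS_SUFFIXES)
    return sorted(name for name in expected if not (db_dir / name).is_file())


if __name__ == "__main__":
    missing = missing_database_files(Path("db"))
    print("All expected files found." if not missing else "Missing: " + ", ".join(missing))
```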
5. (Optional) SignalP Installation.
Our program includes signal peptide prediction with SignalP. To use it, set use_signalP=True. You must obtain your own academic license for SignalP and download it from here, then move the Perl file from the tarball (signalp-4.1g.Linux.tar.gz) to /usr/bin/signalp yourself. The SignalP 4.1 installation instructions follow.
Decompress signalp-4.1g.Linux.tar.gz, then open the directory:
tar -xvf signalp-4.1g.Linux.tar.gz && cd signalp-4.1
You will then find these files and directories in the signalp-4.1 directory:
(base) lehuang@lehuang:~/Downloads/signalp-4.1$ ls
bin lib signalp signalp.1 signalp-4.1.readme syn test
signalp is the Perl file that you will use in your program. Edit the paragraph labeled "GENERAL SETTINGS, CUSTOMIZE ..." at the top of the file 'signalp'. The following two mandatory variables need to be set, along with MAX_ALLOWED_ENTRIES:
**SIGNALP** full path to the signalp-4.1 directory on your system
**outputDir** where to store temporary files (writable to all users)
**MAX_ALLOWED_ENTRIES** the number of input sequences allowed per run.
Here is an example of my changes to line 13, line 17 and line 20 of the `signalp` file. I suggest setting MAX_ALLOWED_ENTRIES to 100000.
###############################################################################
# GENERAL SETTINGS: CUSTOMIZE TO YOUR SITE
###############################################################################
# full path to the signalp-4.1 directory on your system (mandatory)
BEGIN {
$ENV{SIGNALP} = '/home/lehuang/Downloads/signalp-4.1';
}
# determine where to store temporary files (must be writable to all users)
my $outputDir = "/home/lehuang/Downloads/signalp-4.1/output";
# max number of sequences per run (any number can be handled)
my $MAX_ALLOWED_ENTRIES=100000;
Then run these commands:
sudo cp signalp /usr/bin/signalp
sudo chmod 755 /usr/bin/signalp
If you don't have permission to access /usr/bin, you can use the -sp or --signalP_path parameter to give run_dbcan the path to your signalp file. Please see step 6.
6. Check Program.
run_dbcan EscheriaColiK12MG1655.fna prok --out_dir output_EscheriaColiK12MG1655
If you want to run the program with SignalP:
run_dbcan EscheriaColiK12MG1655.fna prok --out_dir output_EscheriaColiK12MG1655 --use_signalP=TRUE
If you don't have permission to access /usr/bin when running with SignalP, you can use the -sp or --signalP_path parameter to give run_dbcan the path to your signalp file.
run_dbcan EscheriaColiK12MG1655.fna prok --out_dir output_EscheriaColiK12MG1655 --use_signalP=TRUE -sp /home/lehuang/Downloads/signalp-4.1/signalp
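The choice between relying on a system-wide signalp and passing -sp explicitly can be sketched as a small wrapper that assembles the command line. The function name is hypothetical; the flags are the ones shown in the commands above, and the sketch assumes that without -sp the program looks up `signalp` on your PATH.

```python
import shutil
from typing import List, Optional

INPUT_TYPES = {"protein", "prok", "meta"}


def build_run_dbcan_command(
    input_file: str,
    input_type: str,
    out_dir: str,
    use_signalp: bool = False,
    signalp_path: Optional[str] = None,
) -> List[str]:
    """Assemble a run_dbcan argument list like the step 6 examples."""
    if input_type not in INPUT_TYPES:
        raise ValueError(f"input_type must be one of {sorted(INPUT_TYPES)}")
    command = ["run_dbcan", input_file, input_type, "--out_dir", out_dir]
    if use_signalp:
        command.append("--use_signalP=TRUE")
        if signalp_path is not None:
            # Explicit location, e.g. when /usr/bin is not writable.
            command += ["-sp", signalp_path]
        elif shutil.which("signalp") is None:
            # Assumption: without -sp, signalp must be discoverable on PATH.
            raise FileNotFoundError("signalp not found on PATH; pass signalp_path")
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` to launch the run.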
Installation via Docker
- Make sure Docker is installed successfully on your computer.
- Pull the Docker image:
docker pull haidyi/run_dbcan:latest
- Run. Mount the input sequence file and the output directory to the container.
docker run --name <preferred_name> -v <host-path>:<container-path> -it haidyi/run_dbcan:latest <input_file> {protein,meta,prok} [params] --out_dir <output_dir>
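To make the placeholders in the command above concrete, the sketch below assembles the same `docker run` argument list from Python. The function name is hypothetical, and it assumes that both the input file and the output directory live under the single mounted directory, so they are given to the container as paths below the container mount point.

```python
import posixpath
from typing import List, Sequence

IMAGE = "haidyi/run_dbcan:latest"


def build_docker_command(
    name: str,
    host_dir: str,
    container_dir: str,
    input_name: str,
    input_type: str,
    out_dir_name: str,
    extra_params: Sequence[str] = (),
) -> List[str]:
    """Assemble the docker run argument list shown above."""
    if input_type not in {"protein", "meta", "prok"}:
        raise ValueError("input_type must be protein, meta, or prok")
    return [
        "docker", "run", "--name", name,
        "-v", f"{host_dir}:{container_dir}",  # <host-path>:<container-path>
        "-it", IMAGE,
        posixpath.join(container_dir, input_name),  # input as seen inside the container
        input_type,
        *extra_params,
        "--out_dir", posixpath.join(container_dir, out_dir_name),
    ]
```

Because the output directory sits under the mount, results written inside the container appear in the host directory after the run.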
REQUIREMENTS
TOOLS
P.S.: You do not need to download CGCFinder and hmmscan-parser because they are included in run_dbcan V4. If you use the Python package or Docker, you do not need to download Prodigal either, because both include these dependencies. Otherwise, we recommend installing them and copying them into /usr/bin as system applications, or adding their paths to your system environment profile.
Python3--Be sure to use python3, not python2.
DIAMOND-- Included in run_dbcan4.
HMMER--Included in run_dbcan4.
hmmscan-parser--This is included in run_dbcan4.
dbCAN_sub--Included in run_dbcan4.
signalp--please download and install if you need.
Prodigal--Included in run_dbcan4.
CGCFinder--Included in run_dbcan4.
DATABASES Installation (these are included in step 4, Database Installation)
Database -- Database Folder
CAZy.fa--use diamond makedb --in CAZyDB.08062022.fa -d CAZy
dbCAN_sub--use hmmpress dbCAN_sub.hmm
dbCAN-PUL--The substrate files from dbCAN-PUL.
PUL--The PUL sequences; use makeblastdb -in PUL.faa -dbtype prot
dbCAN-HMMdb-V11.txt--First use mv dbCAN-HMMdb-V11.txt dbCAN.txt, then use hmmpress dbCAN.txt
tcdb.fa--use diamond makedb --in tcdb.fa -d tcdb
tf-1.hmm--use hmmpress tf-1.hmm
tf-2.hmm--use hmmpress tf-2.hmm
stp.hmm--use hmmpress stp.hmm
Params
Required arguments:
inputFile User input file. Must be in FASTA format.
{protein,prok,meta} Type of sequence input. protein=proteome; prok=prokaryote; meta=metagenome
optional arguments:
-h, --help show this help message and exit
--dbCANFile DBCANFILE
Indicate the file name of HMM database such as dbCAN.txt, please use the newest one from dbCAN2 website.
--dia_eval DIA_EVAL DIAMOND E Value
--dia_cpu DIA_CPU Number of CPU cores that DIAMOND is allowed to use
--hmm_eval HMM_EVAL HMMER E Value
--hmm_cov HMM_COV HMMER Coverage val
--hmm_cpu HMM_CPU Number of CPU cores that HMMER is allowed to use
--out_pre OUT_PRE Output files prefix
--out_dir OUT_DIR Output directory
--db_dir DB_DIR Database directory
--tools {hmmer,diamond,dbcansub,all} [{hmmer,diamond,dbcansub,all} ...], -t {hmmer,diamond,dbcansub,all} [{hmmer,diamond,dbcansub,all} ...]
Choose a combination of tools to run
--use_signalP USE_SIGNALP
Use signalP or not, remember, you need to setup signalP tool first. Because of signalP license, Docker version does not have signalP.
--signalP_path SIGNALP_PATH, -sp SIGNALP_PATH
The path for signalp. Default location is signalp
--gram {p,n,all}, -g {p,n,all}
Choose gram+(p) or gram-(n) for proteome/prokaryote nucleotide input. These are SignalP parameters and only apply if SignalP is used
-v VERSION, --version VERSION
dbCAN-sub parameters:
--dbcan_thread DBCAN_THREAD, -dt DBCAN_THREAD
--tf_eval TF_EVAL tf.hmm HMMER E Value
--tf_cov TF_COV tf.hmm HMMER Coverage val
--tf_cpu TF_CPU tf.hmm Number of CPU cores that HMMER is allowed to use
--stp_eval STP_EVAL stp.hmm HMMER E Value
--stp_cov STP_COV stp.hmm HMMER Coverage val
--stp_cpu STP_CPU stp.hmm Number of CPU cores that HMMER is allowed to use
CGC_Finder parameters:
--cluster CLUSTER, -c CLUSTER
Predict CGCs via CGCFinder. This argument requires an auxiliary locations file if a protein input is being used
--cgc_dis CGC_DIS CGCFinder Distance value
--cgc_sig_genes {tf,tp,stp,tp+tf,tp+stp,tf+stp,all}
CGCFinder Signature Genes value
CGC_Substrate parameters:
--cgc_substrate run cgc substrate prediction?
--pul PUL dbCAN-PUL PUL.faa
-o OUT, --out OUT
-w WORKDIR, --workdir WORKDIR
-env ENV, --env ENV
-oecami, --oecami out eCAMI prediction intermediate result?
-odbcanpul, --odbcanpul
output dbCAN-PUL prediction intermediate result?
dbCAN-PUL homologous searching parameters:
how to define homologous gene hits and PUL hits
-upghn UNIQ_PUL_GENE_HIT_NUM, --uniq_pul_gene_hit_num UNIQ_PUL_GENE_HIT_NUM
-uqcgn UNIQ_QUERY_CGC_GENE_NUM, --uniq_query_cgc_gene_num UNIQ_QUERY_CGC_GENE_NUM
-cpn CAZYME_PAIR_NUM, --CAZyme_pair_num CAZYME_PAIR_NUM
-tpn TOTAL_PAIR_NUM, --total_pair_num TOTAL_PAIR_NUM
-ept EXTRA_PAIR_TYPE, --extra_pair_type EXTRA_PAIR_TYPE
None[TC-TC,STP-STP]. Some like signature hits
-eptn EXTRA_PAIR_TYPE_NUM, --extra_pair_type_num EXTRA_PAIR_TYPE_NUM
specify signature pair cutoff.1,2
-iden IDENTITY_CUTOFF, --identity_cutoff IDENTITY_CUTOFF
identity to identify a homologous hit
-cov COVERAGE_CUTOFF, --coverage_cutoff COVERAGE_CUTOFF
query coverage cutoff to identify a homologous hit
-bsc BITSCORE_CUTOFF, --bitscore_cutoff BITSCORE_CUTOFF
bitscore cutoff to identify a homologous hit
-evalue EVALUE_CUTOFF, --evalue_cutoff EVALUE_CUTOFF
evalue cutoff to identify a homologous hit
dbCAN-sub majority voting parameters:
how to define dbsub hits and dbCAN-sub subfamily substrate
-hmmcov HMMCOV, --hmmcov HMMCOV
-hmmevalue HMMEVALUE, --hmmevalue HMMEVALUE
-ndsc NUM_OF_DOMAINS_SUBSTRATE_CUTOFF, --num_of_domains_substrate_cutoff NUM_OF_DOMAINS_SUBSTRATE_CUTOFF
define how many domains share substrates in a CGC, one protein may include several subfamily domains.
-npsc NUM_OF_PROTEIN_SUBSTRATE_CUTOFF, --num_of_protein_substrate_cutoff NUM_OF_PROTEIN_SUBSTRATE_CUTOFF
define how many sequences share substrates in a CGC, one protein may include several subfamily domains.
-subs SUBSTRATE_SCORS, --substrate_scors SUBSTRATE_SCORS
each CGC with a substrate must score more than this value
RUN & OUTPUT
Use the following command to run the program.
run_dbcan [inputFile] [inputType] [-c AuxiliaryFile] [-t Tools] etc.
Several files will be produced by run_dbcan. They are as follows:
uniInput - The unified input file for the rest of the tools
(created by prodigal if a nucleotide sequence was used)
dbsub.out - the output from the dbCAN_sub run
diamond.out - the output from the diamond blast
hmmer.out - the output from the hmmer run
tf.out - the output from the HMMER run predicting TFs for CGCFinder
tc.out - the output from the diamond blast predicting TC's for CGCFinder
cgc.gff - GFF input file for CGCFinder
cgc.out - output from the CGCFinder run
overview.txt - Details the CAZyme predictions across the three tools with signalp results
See #127 for an explanation of the different output files for substrate predictions.
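As one way to work with the outputs listed above, the sketch below loads overview.txt and keeps only the genes supported by at least two tools. It is a minimal example under stated assumptions: it assumes overview.txt is tab-separated with a header row, and that the header contains columns named `Gene ID` and `#ofTools`. Check the first line of your own overview.txt and adjust the column names if they differ; the function names are hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, List


def read_overview(path: Path) -> List[Dict[str, str]]:
    """Read a tab-separated overview file into one dict per gene."""
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def filter_by_tool_support(
    rows: List[Dict[str, str]],
    min_tools: int = 2,
    count_column: str = "#ofTools",  # assumed column name; verify in your file
) -> List[Dict[str, str]]:
    """Keep rows whose tool-count column is at least min_tools."""
    return [row for row in rows if int(row[count_column]) >= min_tools]


if __name__ == "__main__":
    rows = read_overview(Path("output_EscheriaColiK12MG1655") / "overview.txt")
    for row in filter_by_tool_support(rows):
        print(row["Gene ID"])  # assumed column name
```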
EXAMPLE
An example setup is available in the example directory. Included in this directory are two FASTA sequences (one protein, one nucleotide).
To run this example, type:
run_dbcan EscheriaColiK12MG1655.fna prok --out_dir output_EscheriaColiK12MG1655
or
run_dbcan EscheriaColiK12MG1655.faa protein --out_dir output_EscheriaColiK12MG1655
To run the examples with CGCFinder turned on, run:
run_dbcan EscheriaColiK12MG1655.fna prok -c cluster --out_dir output_EscheriaColiK12MG1655
or
run_dbcan EscheriaColiK12MG1655.faa protein -c EscheriaColiK12MG1655.gff --out_dir output_EscheriaColiK12MG1655
Notice that the protein command has a GFF file following the -c option. A GFF or BED format file with gene position information is required to run CGCFinder when using a protein input.
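The rule described above, where protein input needs a position file after -c while the nucleotide example passes the literal word `cluster`, can be captured in a small helper. The function name is hypothetical, and treating `meta` input the same way as `prok` is an assumption, since only the `prok` case is shown in the examples.

```python
from typing import List, Optional


def cluster_argument(input_type: str, position_file: Optional[str] = None) -> List[str]:
    """Return the -c arguments for CGCFinder for the given input type."""
    if input_type == "protein":
        if not position_file:
            # Protein FASTA carries no gene coordinates, so a GFF or BED file is required.
            raise ValueError("protein input requires a GFF or BED file for -c")
        return ["-c", position_file]
    if input_type in ("prok", "meta"):
        # Mirrors the nucleotide example: -c cluster
        return ["-c", "cluster"]
    raise ValueError("input_type must be protein, prok, or meta")
```

For example, `["run_dbcan", "EscheriaColiK12MG1655.faa", "protein", *cluster_argument("protein", "EscheriaColiK12MG1655.gff")]` reproduces the start of the protein command above.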
If you have any questions, please feel free to contact Dr. Yin ([email protected] or [email protected]) or me (Le Huang) on the Issue Dashboard.
Reference
This is the standalone version of the dbCAN annotation tool for automated CAZyme annotation (known as run_dbCAN), written by Le Huang and Tanner Yohe.
If you want to use our dbCAN3 web server, please go to http://bcb.unl.edu/dbCAN2/. Please cite us: Jinfang Zheng, Qiwei Ge, Yuchen Yan, Xinpeng Zhang, Le Huang, Yanbin Yin, dbCAN3: automated carbohydrate-active enzyme and substrate annotation, Nucleic Acids Research, 2023, gkad328, https://doi.org/10.1093/nar/gkad328
@article{10.1093/nar/gkad328,
author = {Zheng, Jinfang and Ge, Qiwei and Yan, Yuchen and Zhang, Xinpeng and Huang, Le and Yin, Yanbin},
title = "{dbCAN3: automated carbohydrate-active enzyme and substrate annotation}",
journal = {Nucleic Acids Research},
year = {2023},
month = {05},
issn = {0305-1048},
doi = {10.1093/nar/gkad328},
url = {https://doi.org/10.1093/nar/gkad328},
note = {gkad328},
eprint = {https://academic.oup.com/nar/advance-article-pdf/doi/10.1093/nar/gkad328/50150154/gkad328.pdf},
}
If you use the dbCAN standalone tool (run_dbcan) and/or our web server for a publication, please cite us:
Han Zhang, Tanner Yohe, Le Huang, Sarah Entwistle, Peizhi Wu, Zhenglu Yang, Peter K Busk, Ying Xu, Yanbin Yin; dbCAN2: a meta server for automated carbohydrate-active enzyme annotation, Nucleic Acids Research, Volume 46, Issue W1, 2 July 2018, Pages W95–W101, https://doi.org/10.1093/nar/gky418
@article{doi:10.1093/nar/gky418,
author = {Zhang, Han and Yohe, Tanner and Huang, Le and Entwistle, Sarah and Wu, Peizhi and Yang, Zhenglu and Busk, Peter K and Xu, Ying and Yin, Yanbin},
title = {dbCAN2: a meta server for automated carbohydrate-active enzyme annotation},
journal = {Nucleic Acids Research},
volume = {46},
number = {W1},
pages = {W95-W101},
year = {2018},
doi = {10.1093/nar/gky418},
URL = {http://dx.doi.org/10.1093/nar/gky418},
eprint = {/oup/backfile/content_public/journal/nar/46/w1/10.1093_nar_gky418/1/gky418.pdf}
}
If you want to use pre-computed bacterial CAZyme sequences/annotations directly, please go to http://bcb.unl.edu/dbCAN_seq/ and cite us:
Le Huang, Han Zhang, Peizhi Wu, Sarah Entwistle, Xueqiong Li, Tanner Yohe, Haidong Yi, Zhenglu Yang, Yanbin Yin; dbCAN-seq: a database of carbohydrate-active enzyme (CAZyme) sequence and annotation, Nucleic Acids Research, Volume 46, Issue D1, 4 January 2018, Pages D516–D521, https://doi.org/10.1093/nar/gkx894
@article{doi:10.1093/nar/gkx894,
author = {Huang, Le and Zhang, Han and Wu, Peizhi and Entwistle, Sarah and Li, Xueqiong and Yohe, Tanner and Yi, Haidong and Yang, Zhenglu and Yin, Yanbin},
title = {dbCAN-seq: a database of carbohydrate-active enzyme (CAZyme) sequence and annotation},
journal = {Nucleic Acids Research},
volume = {46},
number = {D1},
pages = {D516-D521},
year = {2018},
doi = {10.1093/nar/gkx894},
URL = {http://dx.doi.org/10.1093/nar/gkx894},
eprint = {/oup/backfile/content_public/journal/nar/46/d1/10.1093_nar_gkx894/2/gkx894.pdf}
}