PhyloWGS
This Python/C++ code is the accompanying software for the paper PhyloWGS: Reconstructing subclonal composition and evolution from whole-genome sequencing of tumors, with authors Amit G. Deshwar, Shankar Vembu, Christina K. Yung, Gun Ho Jang, Lincoln Stein, and Quaid Morris.
Input files
The input to evolve.py is two tab-delimited text files -- one for SSM data and
one for CNV data. Please see the files ssm_data.txt
and cnv_data.txt
included with PhyloWGS for examples.
To see how to generate ssm_data.txt
and cnv_data.txt
from a VCF file and
Battenberg CNV file, please see
the included parser.
ssm_data.txt
:
id
: identifier for each SSM. Identifiers must start ats0
and increment, so the first data row will haves0
, the second rows1
, and so forth.gene
: any string identifying the variant -- this need not be a gene name.<chr>_<pos>
(e.g.,2_234577
) works well.a
: number of reference-allele reads at the variant locus.d
: total number of reads at the variant locus.mu_r
: fraction of expected reference allele sampling from the reference population. E.g., if the tumor has an A->T somatic mutation at the locus, the genotype of the reference population should be AA. Thus,mu_r
should be1 - (sequencing error rate)
. Given the 0.001 error rate in Illumina sequencing, setting this column to 0.999 works well.mu_v
: fraction of expected reference allele sampling from variant population. Suppose an A->T somatic mutation occurred at the locus.mu_v
always uses normal ploidy (i.e., the copy number in non-CNV regions). As humans are diploid, copy number will thus always be 2. So, the variant population genotype should be AT, meaning we will observe the reference allele with frequency0.5 - (sequencing error rate)
. Given the 0.001 error rate in Illumina sequencing, setting this column to 0.499 works well.
cnv_data.txt
: Note that if you are running without any CNVs, this file should
be empty. You can create the empty file via the command touch cnv_data.txt
.
cnv
: identifier for each CNV. Identifiers must start atc0
and increment, so the first data row will havec0
, the second rowc1
, and so forth.a
: number of reference reads covering the CNV.d
: total number of reads covering the CNV. This will be affected by factors such as total copy number at the locus, sequencing depth, and the size of the chromosomal region spanned by the CNV.ssms
: SSMs that overlap with this CNV. Each entry is a comma-separated triplet consisting of SSM ID, maternal copy number, and paternal copy number. These triplets are separated by semicolons.
When running evolve.py
, the random seed used for the run will be written to
random_seed.txt
in the current directory. To choose this seed, you may give
the --random-seed <integer>
option to evolve.py
. If no random seed is
specified, but the random_seed.txt
file already exists in the current working
directory, the seed stored in that file will be used. This behaviour lets you
deterministically repeat runs by copying the random_seed.txt
files from a
previous batch.
Installing PhyloWGS
- Install dependencies.
- Install Python 2 versions of NumPy (www.numpy.org) and SciPy (www.scipy.org).
- Install Python 2 version of ETE2 (e.g.:
pip2 install --user ete2
). - Install GSL (http://www.gnu.org/software/gsl/).
-
Compile the C++ file.
g++ -o mh.o -O3 mh.cpp util.cpp `gsl-config --cflags --libs`
Running PhyloWGS with multiple MCMC chains (recommended)
To obtain MCMC samples that better approximate the true posterior distribution
over trees, we suggest running multiple concurrent MCMC chains using
multievolve.py
. To do so, run the following:
python2 multievolve.py --num-chains 4 --ssms ssm_data.txt --cnvs cnv_data.txt
Each chain is run as a separate process. Consequently, we suggest adjusting the
--num-chains
option to reflect the number of CPU cores you wish to dedicate
to PhyloWGS. Note that increasing --num-chains
will not decrease runtime,
but will increase the number of samples you take, and thus should yield a
better posterior approximation.
To decrease runtime, you can reduce the number of MCMC samples PhyloWGS takes as follows:
python2 multievolve.py --num-chains 4 --ssms ssm_data.txt --cnvs cnv_data.txt --burnin-samples 1 --mcmc-samples 1
By taking only one burnin and one true sample, PhyloWGS should complete in only a minute or so. This will, however, severely compromise the quality of your results. Use so few samples only so that you can test PhyloWGS before performing a proper run. To get proper results, we suggest using at least the number of burn-in and true samples that are specified by default (1000 and 2500, respectively).
Running PhyloWGS with only one MCMC chain (not recommended)
To run only a single MCMC chain, run the following:
python2 evolve.py ssm_data.txt cnv_data.txt
The quality of your posterior approximation will likely suffer relative to when you run multiple chains.
Viewing and interpreting results
-
Generate JSON results.
mkdir test_results cd test_results # To work with viewer in Step 5, the naming conventions used here must be # followed. # "example_data" is simply the name by which you want your results to be identified. python2 /path/to/phylowgs/write_results.py example_data ../trees.zip example_data.summ.json.gz example_data.muts.json.gz example_data.mutass.zip cd ..
All options:
usage: write_results.py [-h] [--include-ssm-names] [--min-ssms MIN_SSMS]
dataset_name tree_file tree_summary_output
mutlist_output mutass_output
Write JSON files describing trees
positional arguments:
dataset_name Name identifying dataset
tree_file File containing sampled trees
tree_summary_output Output file for JSON-formatted tree summaries
mutlist_output Output file for JSON-formatted list of mutations
mutass_output Output file for JSON-formatted list of SSMs and CNVs
assigned to each subclone
optional arguments:
-h, --help show this help message and exit
--include-ssm-names Include SSM names in output (which may be sensitive
data) (default: False)
--min-ssms MIN_SSMS Minimum number or percent of SSMs to retain a subclone
(default: 0.01)
-
View results.
mv test_results /path/to/phylowgs/witness/data cd /path/to/phylowgs/witness gunzip data/*/*.gz python2 index_data.py python2 -m SimpleHTTPServer # Open http://127.0.0.1:8000 in your web browser. Note that, by # default, the server listens for connections from any host.
Full option listing
The multi-chain executor multievolve.py
takes the following options. Note
that it will also accept all options that evolve.py
takes, which are listed
below.
usage: multievolve.py [-h] [-n NUM_CHAINS]
[-r RANDOM_SEEDS [RANDOM_SEEDS ...]]
[-I CHAIN_INCLUSION_FACTOR] [-O OUTPUT_DIR] --ssms
SSM_FILE --cnvs CNV_FILE
optional arguments:
-h, --help show this help message and exit
-n NUM_CHAINS, --num-chains NUM_CHAINS
Number of chains to run concurrently (default: 4)
-r RANDOM_SEEDS [RANDOM_SEEDS ...], --random-seeds RANDOM_SEEDS [RANDOM_SEEDS ...]
Space-separated random seeds with which to initialize
each chain. Specify one for each chain. (default:
None)
-I CHAIN_INCLUSION_FACTOR, --chain-inclusion-factor CHAIN_INCLUSION_FACTOR
Factor for determining which chains will be included
in the output "merged" folder. Default is 1.5, meaning
that the sum of the likelihoods of the trees found in
each chain must be greater than 1.5x the maximum of
that value across chains. Setting this value = inf
includes all chains and setting it = 1 will include
only the best chain. (default: 1.5)
-O OUTPUT_DIR, --output-dir OUTPUT_DIR
Directory where results from each chain will be saved.
We will create it if it does not exist. (default:
chains)
--ssms SSM_FILE File listing SSMs (simple somatic mutations, i.e.,
single nucleotide variants. For proper format, see
README.md. (default: None)
--cnvs CNV_FILE File listing CNVs (copy number variations). For proper
format, see README.md. (default: None)
The single-chain executor evolve.py
takes the following options. Note
that all such options can also be passed to multievolve.py
.
usage: evolve.py [-h] [-O OUTPUT_DIR] [-b WRITE_BACKUPS_EVERY]
[-S WRITE_STATE_EVERY] [-B BURNIN_SAMPLES] [-s MCMC_SAMPLES]
[-i MH_ITERATIONS] [-r RANDOM_SEED] [-t TMP_DIR]
[-p PARAMS_FILE]
ssm_file cnv_file
positional arguments:
ssm_file File listing SSMs (simple somatic mutations, i.e.,
single nucleotide variants. For proper format, see
README.md.
cnv_file File listing CNVs (copy number variations). For proper
format, see README.md.
optional arguments:
-h, --help show this help message and exit
-O OUTPUT_DIR, --output-dir OUTPUT_DIR
Path to output directory (default: None)
-b WRITE_BACKUPS_EVERY, --write-backups-every WRITE_BACKUPS_EVERY
Number of iterations to go between writing backups of
program state (default: 100)
-S WRITE_STATE_EVERY, --write-state-every WRITE_STATE_EVERY
Number of iterations between writing program state to
disk. Higher values reduce IO burden at the cost of
losing progress made if program is interrupted.
(default: 10)
-B BURNIN_SAMPLES, --burnin-samples BURNIN_SAMPLES
Number of burnin samples (default: 1000)
-s MCMC_SAMPLES, --mcmc-samples MCMC_SAMPLES
Number of MCMC samples (default: 2500)
-i MH_ITERATIONS, --mh-iterations MH_ITERATIONS
Number of Metropolis-Hastings iterations (default:
5000)
-r RANDOM_SEED, --random-seed RANDOM_SEED
Random seed for initializing MCMC sampler (default:
None)
-t TMP_DIR, --tmp-dir TMP_DIR
Path to directory for temporary files (default: None)
-p PARAMS_FILE, --params PARAMS_FILE
JSON file listing run parameters, generated by the
parser (default: None)
Resuming a previous PhyloWGS run
If PhyloWGS is interrupted for whatever reason, you can resume your existing
run by simply running multievolve.py
or evolve.py
(depending on which you
used to begin the run) from the same directory as the previous run, without any
command-line params:
# Start initial run.
python2 multievolve.py --ssms ssm_data.txt --cnvs cnv_data.txt
# Hit CTRL+C to send SIGINT, halting run partway through.
# Resume run (must be executed from same directory as initial invocation):
python2 multievolve.py
License
Copyright (C) 2018 Quaid Morris
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.