There are no reviews yet. Be the first to send feedback to the community and the maintainers!
This software package infers population size history from a diploid sequence using the Pairwise Sequentially Markovian Coalescent (PSMC) model. The detailed model is described in file `psmc.tex'. To compile the binaries, you may run make; (cd utils; make) After that, you may try utils/fq2psmcfa -q20 diploid.fq.gz > diploid.psmcfa psmc -N25 -t15 -r5 -p "4+25*2+4+6" -o diploid.psmc diploid.psmcfa utils/psmc2history.pl diploid.psmc | utils/history2ms.pl > ms-cmd.sh utils/psmc_plot.pl diploid diploid.psmc where `diploid.fq.gz' is typically the whole-genome diploid consensus sequence of one human individual, which can be generated by, for example: samtools mpileup -C50 -uf ref.fa aln.bam | bcftools view -c - \ | vcfutils.pl vcf2fq -d 10 -D 100 | gzip > diploid.fq.gz Here option -d sets and minimum read depth and -D sets the maximum. It is recommended to set -d to a third of the average depth and -D to twice. Program `fq2psmcfa' transforms the consensus sequence into a fasta-like format where the i-th character in the output sequence indicates whether there is at least one heterozygote in the bin [100i, 100i+100). For dipcall output, you may use the following to generate psmcfa: seqtk mutfa ref.fa <(gzip -dc prefix.dip.vcf.gz|utils/vcf2snp.pl -) \ | seqtk seq -cM prefix.dip.bed -l80 | utils/fq2psmcfa - > prefix.psmcfa Program `psmc' infers the population size history. In particular, the `-p' option specifies that there are 64 atomic time intervals and 28 (=1+25+1+1) free interval parameters. The first parameter spans the first 4 atomic time intervals, each of the next 25 parameters spans 2 intervals, the 27th spans 4 intervals and the last parameter spans the last 6 time intervals. The `-p' and `-t' options are manually chosen such that after 20 rounds of iterations, at least ~10 recombinations are inferred to occur in the intervals each parameter spans. Impropriate settings may lead to overfitting. The command line in the example above has been shown to be suitable for modern humans. The `psmc' program infers the scaled mutation rate, the recombination rate and the free population size parameters. All these parameters are scaled to 2N0. You may run `psmc2history.pl' combined with `history2ms.pl' to generate the ms command line that simulates the history inferred by PSMC, or visualize the result with `psmc_plot.pl'. To perform bootstrapping, one has to run splitfa first to split long chromosome sequences to shorter segments. When the `-b' option is applied, psmc will then randomly sample with replacement from these segments. As an example, the following command lines perform 100 rounds of bootstrapping: utils/fq2psmcfa -q20 diploid.fq.gz > diploid.psmcfa utils/splitfa diploid.psmcfa > split.psmcfa psmc -N25 -t15 -r5 -p "4+25*2+4+6" -o diploid.psmc diploid.psmcfa seq 100 | xargs -i echo psmc -N25 -t15 -r5 -b -p "4+25*2+4+6" \ -o round-{}.psmc split.fa | sh cat diploid.psmc round-*.psmc > combined.psmc utils/psmc_plot.pl -pY50000 combined combined.psmc One probably wants to modify the "xargs" command-line to parallelize PSMC. If you have questions about PSMC, please ask at <http://hengli.uservoice.com/>. You do not need to register unless you also want to modify your own questions. You may also post comments at github (if you have a github account). I want to make the question and the answer public such that others can see them and I do not need to answer the same question multiple times. Thank you for using PSMC. APPENDIX I: Scaling the PSMC output =================================== The PSMC output is scaled to the 2N_0. There are two ways of rescaling the time and the popuation size more meaningfully. Firstly, suppose we know the per-site per-generation mutation rate \mu, we can compute N_0 as: N_0 = \theta_0 / (4\mu) / s where \theta_0 is given at the 2nd column of "TR" lines, and s is the bin size we use for generating the PSMC input. Knowing N_0, we can scale time to generations and relative population size to effective size by T_k = 2N_0 * t_k N_k = N_0 * \lambda_k where t_k and \lambda_k are given at the 3rd and 4th columns of "RS" lines, respectively. A problem with the above strategy is that we do not know a definite answer of \mu and in fact it various with regions and mutation types. An alternative way is to use per-site pairwise sequence divergence to represent time: d_k = 2\mu * T_k = t_k * \theta_0 / s and use scaled mutation rate to represent population size: \theta_k = 4N_k * \mu = \lambda_k * \theta_0 / s where, again, t_k and \lambda_k are given at the "RS" line, \theta_0 at the "TR" line and s is the bin size, which defaults to 100 in fq2psmcfa. APPENDIX II: Correcting for low coverage ======================================== For diploid genomes sequenced to low coverage, heterozygotes will be randomly lost due to the lack of coverage of both alleles. This has the same effect as smaller mutation rate and can be corrected. If you know the fraction of hets missed due to low coverage, you can generate the PSMC plot with: psmc_plot.pl -M "sample1=0.1,sample2=0.2" prefix sample1.psmc sample2.psmc This says that sample1 has 10% false negative rate (FNR) on hets and sample2 has 20%. The plotting script does not correct FNR for bootstrapping. If you want to plot the result with your own scripts, you can increase \theta_0 to \theta_0/(1-FNR). Unfortunately, I haven't found a reliable way to estimate the background FNR relevant to PSMC. The simple and unscientific approach is to align the PSMC curves by eye. Probably a better solution is to downsample a high-coverage sample to a certain coverage and measures FNR. I have not done this. Not correcting for low coverage is the most common pitfall when using PSMC.
minimap2
A versatile pairwise aligner for genomic and spliced nucleotide sequencesbwa
Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)seqtk
Toolkit for processing sequences in FASTA/Q formatsbioawk
BWK awk modified for biological dataminigraph
Sequence-to-graph mapper and graph generatorminiprot
Align proteins to genomes with splicing and frameshiftminiasm
Ultrafast de novo assembly for long noisy reads (though having no consensus step)wgsim
Reads simulatorgfatools
Tools for manipulating sequence graphs in the GFA and rGFA formatsbiofast
Benchmarking programming languages/implementations for common tasks in Bioinformaticsreadfq
Fast multi-line FASTA/Q reader in several programming languagescgranges
A C/C++ library for fast interval overlap queries (with a "bedtools coverage" example)kmer-cnt
Code examples of fast and simple k-mer counters for tutorial purposespangene
Constructing a pangenome gene graphbedtk
A simple toolset for BED files (warning: CLI may change before bedtk becomes stable)ksw2
Global alignment and alignment extensionyak
Yet another k-mer analyzerfermikit
De novo assembly based variant calling pipeline for Illumina short readsminimap
This repo is DEPRECATED. Please use minimap2, the successor of minimap.hickit
TAD calling, phase imputation, 3D modeling and more for diploid single-cell Hi-C (Dip-C) and general Hi-Cbgt
Flexible genotype query among 30,000+ samples whole-genomedipcall
Reference-based variant calling pipeline for a pair of phased haplotype assembliessrf
SRF: Satellite Repeat Finderunimap
A EXPERIMENTAL fork of minimap2 optimized for assembly-to-reference alignmentminipileup
Simple pileup-based variant callerfermi
A WGS de novo assembler based on the FMD-index for large genomesdna-nn
Model and predict short DNA sequence features with neural networksfermi-lite
Standalone C library for assembling Illumina short reads in small regionsbfc
High-performance error correction for Illumina resequencing dataropebwt2
Incremental construction of FM-index for DNA sequencestabtk
Toolkit for processing TAB-delimited formatgwfa
Proof-of-concept implementation of GWFA for sequence-to-graph alignmentCHM-eval
miniwfa
A reimplementation of the WaveFront Alignment algorithm at low memoryjstreeview
Interactive phylogenetic tree viewer/editorsamtools
This is *NOT* the official repository of samtools.etrf
Exact Tandem Repeat Finder (not a TRF replacement)ref-gen
Human reference genome analysis setsbioseq-js
For live demo, see http://lh3lh3.users.sourceforge.net/bioseq.shtmllv89
C implementation of the Landau-Vishkin algorithmpartig
An experimental tool to estimate the similarity between all pairs of contigsasub
A unified array job submitter for LSF, SGE/UGE and Slurmklib.nim
Experimental getopt, gzip reader, FASTA/Q parser and interval queries in nim-langcalN50
Compute N50/NG50 and auN/auNGsdust
Symmetric DUST for finding low-complexity regions in DNA sequencesgffio
pre-pe
Preprocessing paired-end reads produced with experiment-specific protocolshapdip
The CHM1-NA12878 benchmark for single-sample SNP/INDEL calling from WGS Illumina datafermi2
misc
Useful small programsvarcmp
The first CHM1 paper (Li, 2014)minisv
Lightweight mosaic/somatic SV caller for long reads (WIP)lianti
Tools to process LIANTI sequence datartgeval
Wrapper for RTG's vcfeval; DEPRECATED!nasw
Dynamic programming for aa-to-nt alignment with affine gap, splicing and frameshiftsgdp-fermi
FermiKit small variant calls for public SGDP samplesgfa1
This repo is deprecated. Please use gfatools instead.pubLRasm
PortableCrystal
Portable Crystal binary distributions for Linux on x86_64foreign
Modified or extracted from other programstrimadap
Fast but inaccurate adapter trimmer for Illumina readslh3-snippets
ropebwt3
Construction and utility of BWT for DNA string setstreebest
TreeBeST: Tree Building guided by Species Treeunicall
A wrapper for calling small variants from human germline high-coverage single-sample Illumina datafastARG
Fast heuristic ARG constructionproot-wrapper
Demonstrating the PRoot programrmaxcut
An experimental tool to find approximate max-cuts in a large graphbwa-docker
Minimal docker image for bwa. Not developed any more.sdg
EXPERIMENTAL implementation of side graphnaivepca
Naive PCA for genotype datamdust
mdust from DFCI Gene Indices Software Tools (archived for a historical record only)editdist-U85
Fast implementation of Ukkenon's O(ND) algorithm for computing edit distancelh3.github.com
libdivsufsort
Automatically exported from code.google.com/p/libdivsufsortmem-paper
Manuscript for BWA-MEMbcf2
Experimental bcftools port to support BCF2; DEPRECATED by htslib and htsboxthesis
PhD thesisropebwt
fermi-paper
The first fermi paper (Li, 2012)crlf
Concise Run-Length Format for small alphabets; DEPRECATEDpsnw
prototypecentos5-vm
Instructions on how to deploy CentOS 5 virtual machinesmag2gfa
DEPRECATED. Code has been moved to lh3/gfa1/miscibsget
Download files from Illumina BaseSpace (*OUTDATED* as BaseSpace has changed APIs)smtl-paper
Samtools statistics paper (Li, 2011)mssa-bench
Evaluating the performance of multi-string SA constructionsamtools-legacy
For testing only. DON'T USE!Love Open Source and this site? Check out how you can help us