• Stars
    star
    147
  • Rank 251,347 (Top 5 %)
  • Language
    C
  • License
    Other
  • Created almost 14 years ago
  • Updated almost 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Implementation of the Pairwise Sequentially Markovian Coalescent (PSMC) model
This software package infers population size history from a diploid sequence
using the Pairwise Sequentially Markovian Coalescent (PSMC) model. The
detailed model is described in file `psmc.tex'.

To compile the binaries, you may run

    make; (cd utils; make)

After that, you may try

    utils/fq2psmcfa -q20 diploid.fq.gz > diploid.psmcfa
    psmc -N25 -t15 -r5 -p "4+25*2+4+6" -o diploid.psmc diploid.psmcfa
    utils/psmc2history.pl diploid.psmc | utils/history2ms.pl > ms-cmd.sh
    utils/psmc_plot.pl diploid diploid.psmc

where `diploid.fq.gz' is typically the whole-genome diploid consensus sequence
of one human individual, which can be generated by, for example:

    samtools mpileup -C50 -uf ref.fa aln.bam | bcftools view -c - \
      | vcfutils.pl vcf2fq -d 10 -D 100 | gzip > diploid.fq.gz

Here option -d sets and minimum read depth and -D sets the maximum. It is
recommended to set -d to a third of the average depth and -D to twice.  Program
`fq2psmcfa' transforms the consensus sequence into a fasta-like format where
the i-th character in the output sequence indicates whether there is at least
one heterozygote in the bin [100i, 100i+100).

For dipcall output, you may use the following to generate psmcfa:

    seqtk mutfa ref.fa <(gzip -dc prefix.dip.vcf.gz|utils/vcf2snp.pl -) \
      | seqtk seq -cM prefix.dip.bed -l80 | utils/fq2psmcfa - > prefix.psmcfa

Program `psmc' infers the population size history. In particular, the `-p'
option specifies that there are 64 atomic time intervals and 28 (=1+25+1+1)
free interval parameters. The first parameter spans the first 4 atomic time
intervals, each of the next 25 parameters spans 2 intervals, the 27th spans 4
intervals and the last parameter spans the last 6 time intervals. The `-p' and
`-t' options are manually chosen such that after 20 rounds of iterations, at
least ~10 recombinations are inferred to occur in the intervals each parameter
spans. Impropriate settings may lead to overfitting. The command line in the
example above has been shown to be suitable for modern humans.

The `psmc' program infers the scaled mutation rate, the recombination rate and
the free population size parameters. All these parameters are scaled to 2N0. You
may run `psmc2history.pl' combined with `history2ms.pl' to generate the ms
command line that simulates the history inferred by PSMC, or visualize the result
with `psmc_plot.pl'.

To perform bootstrapping, one has to run splitfa first to split long chromosome
sequences to shorter segments. When the `-b' option is applied, psmc will then
randomly sample with replacement from these segments. As an example, the
following command lines perform 100 rounds of bootstrapping:

    utils/fq2psmcfa -q20 diploid.fq.gz > diploid.psmcfa
	utils/splitfa diploid.psmcfa > split.psmcfa
    psmc -N25 -t15 -r5 -p "4+25*2+4+6" -o diploid.psmc diploid.psmcfa
	seq 100 | xargs -i echo psmc -N25 -t15 -r5 -b -p "4+25*2+4+6" \
	    -o round-{}.psmc split.fa | sh
    cat diploid.psmc round-*.psmc > combined.psmc
	utils/psmc_plot.pl -pY50000 combined combined.psmc

One probably wants to modify the "xargs" command-line to parallelize PSMC.

If you have questions about PSMC, please ask at <http://hengli.uservoice.com/>.
You do not need to register unless you also want to modify your own questions.
You may also post comments at github (if you have a github account). I want to
make the question and the answer public such that others can see them and I do
not need to answer the same question multiple times. Thank you for using PSMC.


APPENDIX I: Scaling the PSMC output
===================================

The PSMC output is scaled to the 2N_0. There are two ways of rescaling the time
and the popuation size more meaningfully.

Firstly, suppose we know the per-site per-generation mutation rate \mu, we can
compute N_0 as:

  N_0 = \theta_0 / (4\mu) / s

where \theta_0 is given at the 2nd column of "TR" lines, and s is the bin size
we use for generating the PSMC input. Knowing N_0, we can scale time to
generations and relative population size to effective size by

  T_k = 2N_0 * t_k
  N_k = N_0 * \lambda_k

where t_k and \lambda_k are given at the 3rd and 4th columns of "RS" lines,
respectively.

A problem with the above strategy is that we do not know a definite answer of
\mu and in fact it various with regions and mutation types. An alternative way
is to use per-site pairwise sequence divergence to represent time:

  d_k = 2\mu * T_k = t_k * \theta_0 / s

and use scaled mutation rate to represent population size:

  \theta_k = 4N_k * \mu = \lambda_k * \theta_0 / s

where, again, t_k and \lambda_k are given at the "RS" line, \theta_0 at the
"TR" line and s is the bin size, which defaults to 100 in fq2psmcfa.


APPENDIX II: Correcting for low coverage
========================================

For diploid genomes sequenced to low coverage, heterozygotes will be randomly
lost due to the lack of coverage of both alleles. This has the same effect as
smaller mutation rate and can be corrected. If you know the fraction of hets
missed due to low coverage, you can generate the PSMC plot with:

  psmc_plot.pl -M "sample1=0.1,sample2=0.2" prefix sample1.psmc sample2.psmc

This says that sample1 has 10% false negative rate (FNR) on hets and sample2
has 20%. The plotting script does not correct FNR for bootstrapping. If you
want to plot the result with your own scripts, you can increase \theta_0 to
\theta_0/(1-FNR).

Unfortunately, I haven't found a reliable way to estimate the background FNR
relevant to PSMC. The simple and unscientific approach is to align the PSMC
curves by eye. Probably a better solution is to downsample a high-coverage
sample to a certain coverage and measures FNR. I have not done this.

Not correcting for low coverage is the most common pitfall when using PSMC.

More Repositories

1

minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
C
1,754
star
2

bwa

Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)
C
1,504
star
3

seqtk

Toolkit for processing sequences in FASTA/Q formats
C
1,363
star
4

bioawk

BWK awk modified for biological data
C
590
star
5

minigraph

Sequence-to-graph mapper and graph generator
C
402
star
6

miniprot

Align proteins to genomes with splicing and frameshift
C
316
star
7

miniasm

Ultrafast de novo assembly for long noisy reads (though having no consensus step)
TeX
302
star
8

wgsim

Reads simulator
C
258
star
9

gfatools

Tools for manipulating sequence graphs in the GFA and rGFA formats
C
208
star
10

biofast

Benchmarking programming languages/implementations for common tasks in Bioinformatics
C
175
star
11

readfq

Fast multi-line FASTA/Q reader in several programming languages
C
170
star
12

cgranges

A C/C++ library for fast interval overlap queries (with a "bedtools coverage" example)
C
165
star
13

kmer-cnt

Code examples of fast and simple k-mer counters for tutorial purposes
C++
162
star
14

pangene

Constructing a pangenome gene graph
C
158
star
15

bedtk

A simple toolset for BED files (warning: CLI may change before bedtk becomes stable)
C
134
star
16

ksw2

Global alignment and alignment extension
C
127
star
17

yak

Yet another k-mer analyzer
C
111
star
18

fermikit

De novo assembly based variant calling pipeline for Illumina short reads
TeX
108
star
19

minimap

This repo is DEPRECATED. Please use minimap2, the successor of minimap.
C
106
star
20

hickit

TAD calling, phase imputation, 3D modeling and more for diploid single-cell Hi-C (Dip-C) and general Hi-C
C
105
star
21

bgt

Flexible genotype query among 30,000+ samples whole-genome
C
96
star
22

dipcall

Reference-based variant calling pipeline for a pair of phased haplotype assemblies
JavaScript
92
star
23

srf

SRF: Satellite Repeat Finder
TeX
86
star
24

unimap

A EXPERIMENTAL fork of minimap2 optimized for assembly-to-reference alignment
C
85
star
25

minipileup

Simple pileup-based variant caller
C
79
star
26

fermi

A WGS de novo assembler based on the FMD-index for large genomes
C
74
star
27

dna-nn

Model and predict short DNA sequence features with neural networks
C
72
star
28

fermi-lite

Standalone C library for assembling Illumina short reads in small regions
C
72
star
29

bfc

High-performance error correction for Illumina resequencing data
TeX
68
star
30

ropebwt2

Incremental construction of FM-index for DNA sequences
TeX
67
star
31

tabtk

Toolkit for processing TAB-delimited format
C
59
star
32

gwfa

Proof-of-concept implementation of GWFA for sequence-to-graph alignment
C
56
star
33

CHM-eval

TeX
49
star
34

miniwfa

A reimplementation of the WaveFront Alignment algorithm at low memory
C
49
star
35

jstreeview

Interactive phylogenetic tree viewer/editor
JavaScript
46
star
36

samtools

This is *NOT* the official repository of samtools.
C
46
star
37

etrf

Exact Tandem Repeat Finder (not a TRF replacement)
C
45
star
38

ref-gen

Human reference genome analysis sets
Makefile
44
star
39

bioseq-js

For live demo, see http://lh3lh3.users.sourceforge.net/bioseq.shtml
HTML
37
star
40

lv89

C implementation of the Landau-Vishkin algorithm
C++
35
star
41

partig

An experimental tool to estimate the similarity between all pairs of contigs
C
35
star
42

asub

A unified array job submitter for LSF, SGE/UGE and Slurm
Perl
32
star
43

klib.nim

Experimental getopt, gzip reader, FASTA/Q parser and interval queries in nim-lang
Nim
32
star
44

calN50

Compute N50/NG50 and auN/auNG
JavaScript
31
star
45

sdust

Symmetric DUST for finding low-complexity regions in DNA sequences
C
31
star
46

gffio

C
31
star
47

pre-pe

Preprocessing paired-end reads produced with experiment-specific protocols
C
31
star
48

hapdip

The CHM1-NA12878 benchmark for single-sample SNP/INDEL calling from WGS Illumina data
JavaScript
30
star
49

fermi2

C
26
star
50

misc

Useful small programs
C
26
star
51

varcmp

The first CHM1 paper (Li, 2014)
TeX
25
star
52

minisv

Lightweight mosaic/somatic SV caller for long reads (WIP)
JavaScript
24
star
53

lianti

Tools to process LIANTI sequence data
C
23
star
54

rtgeval

Wrapper for RTG's vcfeval; DEPRECATED!
Shell
21
star
55

nasw

Dynamic programming for aa-to-nt alignment with affine gap, splicing and frameshift
C
18
star
56

sgdp-fermi

FermiKit small variant calls for public SGDP samples
17
star
57

gfa1

This repo is deprecated. Please use gfatools instead.
C
16
star
58

pubLRasm

16
star
59

PortableCrystal

Portable Crystal binary distributions for Linux on x86_64
15
star
60

foreign

Modified or extracted from other programs
C
15
star
61

trimadap

Fast but inaccurate adapter trimmer for Illumina reads
C
14
star
62

lh3-snippets

C
14
star
63

ropebwt3

Construction and utility of BWT for DNA string sets
C
13
star
64

treebest

TreeBeST: Tree Building guided by Species Tree
C
13
star
65

unicall

A wrapper for calling small variants from human germline high-coverage single-sample Illumina data
Perl
12
star
66

fastARG

Fast heuristic ARG construction
C
12
star
67

proot-wrapper

Demonstrating the PRoot program
Perl
12
star
68

rmaxcut

An experimental tool to find approximate max-cuts in a large graph
C
11
star
69

bwa-docker

Minimal docker image for bwa. Not developed any more.
11
star
70

sdg

EXPERIMENTAL implementation of side graph
C
10
star
71

naivepca

Naive PCA for genotype data
C
10
star
72

mdust

mdust from DFCI Gene Indices Software Tools (archived for a historical record only)
C
10
star
73

editdist-U85

Fast implementation of Ukkenon's O(ND) algorithm for computing edit distance
C
9
star
74

lh3.github.com

TeX
9
star
75

libdivsufsort

Automatically exported from code.google.com/p/libdivsufsort
C
8
star
76

mem-paper

Manuscript for BWA-MEM
6
star
77

bcf2

Experimental bcftools port to support BCF2; DEPRECATED by htslib and htsbox
C
6
star
78

thesis

PhD thesis
TeX
5
star
79

ropebwt

C
4
star
80

fermi-paper

The first fermi paper (Li, 2012)
3
star
81

crlf

Concise Run-Length Format for small alphabets; DEPRECATED
C
3
star
82

psnw

prototype
C
2
star
83

centos5-vm

Instructions on how to deploy CentOS 5 virtual machines
2
star
84

mag2gfa

DEPRECATED. Code has been moved to lh3/gfa1/misc
C
2
star
85

ibsget

Download files from Illumina BaseSpace (*OUTDATED* as BaseSpace has changed APIs)
C
2
star
86

smtl-paper

Samtools statistics paper (Li, 2011)
Lua
1
star
87

mssa-bench

Evaluating the performance of multi-string SA construction
C
1
star
88

samtools-legacy

For testing only. DON'T USE!
C
1
star