DNA-seq
Databases for variants
- Disease Variant Store
- The ExAC Browser: Displaying reference data information from over 60,000 exomes
- Pathogenic Germline Variants in 10,389 Adult Cancers
Important paper: DNA damage is a major cause of sequencing errors, directly confounding variant identification
However, in this study we show that false positive variants can account for more than 70% of identified somatic variations, rendering conventional detection methods inadequate for accurate determination of low allelic variants. Interestingly, these false positive variants primarily originate from mutagenic DNA damage which directly confounds determination of genuine somatic mutations. Furthermore, we developed and validated a simple metric to measure mutagenic DNA damage and demonstrated that mutagenic DNA damage is the leading cause of sequencing errors in widely-used resources including the 1000 Genomes Project and The Cancer Genome Atlas.
How to represent sequence variants
Sequence Variant Nomenclature from Human Genome Variation Society
dbSNP IDs are not unique?
Oh God, why are people still using dbSNP IDs as though they're unique identifiers?
— Daniel MacArthur (@dgmacarthur) July 27, 2016
The Evolving Utility of dbSNP
See a post: dbSNP (build 147) exceeds a ridiculous 150 million variants
In the early days of next-generation sequencing, dbSNP provided a vital discriminatory tool. In exome sequencing studies of Mendelian disorders, any variant already present in dbSNP was usually common, and therefore unlikely to cause rare genetic diseases. Some of the first high-profile disease gene studies therefore used dbSNP as a filter. Similarly, in cancer genomics, a candidate somatic mutation observed at the position of a known polymorphism typically indicated a germline variant that was under-called in the normal sample. Again, dbSNP provided an important filter.
Now, the presence or absence of a variant in dbSNP carries very little meaning. The database includes over 100,000 variants from disease mutation databases such as OMIM or HGMD. It also contains some appreciable number of somatic mutations that were submitted there before databases like COSMIC became available. And, like any biological database, dbSNP undoubtedly includes false positives.
Thus, while the mere presence of a variant in dbSNP is a blunt tool for variant filtering, dbSNP’s deep allele frequency data make it incredibly powerful for genetics studies: it can rule out variants that are too prevalent to be disease-causing, and prioritize ones that are rarely observed in human populations. This discriminatory power will only increase as ambitious large-scale sequencing projects like CCDG make their data publicly available.
Tips and lessons learned during my DNA-seq data analysis journey.
-
Allele frequency (AF)
Allele frequency, or gene frequency, is the proportion of a particular allele (variant of a gene) among all allele copies being considered. It can be formally defined as the percentage of all alleles at a given locus on a chromosome in a population gene pool represented by a particular allele. AF is affected by copy-number variation, which is common in cancers. Tools such as PyClone take tumor purity and copy-number data into account to calculate cancer cell fractions (CCFs).
-
"For SNVs, we are interested in genotypes 0/1 and 1/1 for tumor and 0/0 for normal. The 1/1 genotype is very rare.
It requires the same mutation to occur at the same place in two sister chromosomes, which is very rare. One possible way to get 1/1 is deletion of one chromosome and duplication of the mutated chromosome". Quote from Siyuan Zheng.
-
"Mutect analysis on the TCGA samples finds around 5000 ~ 8000 SNVs per sample." Quote from Siyuan Zheng.
-
Cell lines might be contaminated or mislabeled. The Great Big Clean-Up
-
Tumor samples are not pure: you will always have stromal cells and infiltrating immune cells in the tumor bulk. Keep this in mind when you analyze the data.
-
The devil: 0-based and 1-based coordinate systems! Make sure you know which system your file is using:
credit from Vince Buffalo.
Also, read this post and this post
Also read The UCSC Genome Browser Coordinate Counting Systems
- Which human reference genome to use? by Heng Li
TL;DR: If you map reads to GRCh37 or hg19, use hs37-1kg:
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
If you map to GRCh37 and believe decoy sequences help with better variant calling, use hs37d5:
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
If you map reads to GRCh38 or hg38, use the following:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
-
Reference Genome Components by GATK team.
-
Human genome reference builds - GRCh38/hg38 - b37 - hg19 by GATK team.
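Returning to the coordinate-system tip above: the usual trap is mixing 0-based, half-open intervals (BED style) with 1-based, closed intervals (VCF and GFF style). The following is a minimal sketch of the conversion; the function names are made up for illustration and are not taken from any of the linked posts.

```python
def bed_to_one_based(start, end):
    """Convert a 0-based, half-open interval (BED style) to 1-based, closed."""
    # Only the start shifts; the end value is numerically unchanged.
    return start + 1, end


def one_based_to_bed(start, end):
    """Convert a 1-based, closed interval to 0-based, half-open (BED style)."""
    return start - 1, end


def interval_length(start, end, zero_based_half_open=True):
    """Number of bases covered by the interval in either convention."""
    if zero_based_half_open:
        return end - start
    return end - start + 1


# The first 100 bases of a chromosome in both conventions.
bed_interval = (0, 100)
one_based_interval = bed_to_one_based(*bed_interval)
print(bed_interval, "->", one_based_interval)
print(interval_length(*bed_interval),
      interval_length(*one_based_interval, zero_based_half_open=False))
```

Both length calls describe the same 100 bases, which is a quick sanity check when a pipeline moves between BED and VCF files.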
get the reference files and mapping index programmatically
- Go Get Data from Aaron's lab.
- Refgenie: a reference genome resource manager
- genomepy
some useful tools for preprocessing
- FastqPuri fastq quality assessment and filtering tool.
- fastp A tool designed to provide fast all-in-one preprocessing for FastQ files. It is developed in C++ with multithreading support for high performance. Really promising, take a look!
- A new tool bazam A read extraction and realignment tool for next generation sequencing data. Take a look!
- bwa-mem2 gives exactly the same results as bwa-mem, 80% faster!
check sample swapping
- somalier sample-swap checking directly on BAMs/CRAMs for cancer data
Mutation caller, structural variant caller
-
paper Making the difference: integrating structural variation detection tools
-
Mapping and characterization of structural variation in 17,795 deeply sequenced human genomes
-
GATK HaplotypeCaller Analysis of BWA (mem) mapped Illumina reads
-
An ensemble approach to accurately detect somatic mutations using SomaticSeq tool github page
-
A synthetic-diploid benchmark for accurate variant-calling evaluation A benchmark dataset from Heng Li. github repo
-
Strelka2: fast and accurate calling of germline and somatic variants paper: https://www.nature.com/articles/s41592-018-0051-x
-
lancet is a somatic variant caller (SNVs and indels) for short read data. Lancet uses a localized micro-assembly strategy to detect somatic mutation with high sensitivity and accuracy on a tumor/normal pair. paper: https://www.nature.com/articles/s42003-018-0023-9
-
needlestack an ultra-sensitive variant caller for multi-sample next generation sequencing data. This tool seems to be very useful for multi-region tumor sample analysis. paper
-
PerSVade: personalized structural variant detection in any species of interest
Delly is the best SV caller in the DREAM challenge https://www.synapse.org/#!Synapse:syn312572/wiki/70726
-
Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1720-5
-
Genotyping structural variants in pangenome graphs using the vg toolkit https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1949-z
-
SVAFotate Annotate a (lumpy) structural variant (SV) VCF with allele frequencies (AFs) from large population SV cohorts (currently CCDG and/or gnomAD) with a simple command line tool.
-
Comprehensively benchmarking applications for detecting copy number variation Our results show that the sequencing depth can strongly affect CNV detection. Among the ten applications benchmarked, LUMPY performs best for both high sensitivity and specificity for each sequencing depth.
-
minigraph from Heng Li to call complex SVs.
-
Parliament2: Accurate structural variant calling at scale. by Fritz group in BCM. https://academic.oup.com/gigascience/article/9/12/giaa145/6042728
-
Brent Pedersen works on smoove, which improves upon lumpy.
-
COSMOS: Somatic Large Structural Variation Detector
-
Fusion And Chromosomal Translocation Enumeration and Recovery Algorithm (FACTERA)
-
VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. We demonstrated that VarDict has improved sensitivity over Manta and equivalent sensitivity to Lumpy. SNP call rates are on par with MuTect, and VarDict is more sensitive and precise than Scalpel and other callers for insertions and deletions. See a post by Brad Chapman. Looks very promising.
-
Weaver: Allele-Specific Quantification of Structural Variations in Cancer Genomes. Paper
-
Prioritisation of Structural Variant Calls in Cancer Genomes simple_sv_annotation.py to annotate Lumpy and Manta SV calls.
-
Genome-wide profiling of heritable and de novo STR variations short tandem repeats.
SNV filtering
- paper: Using high-resolution variant frequencies to empower clinical genome interpretation shiny App
Whole exome and genome sequencing have transformed the discovery of genetic variants that cause human Mendelian disease, but discriminating pathogenic from benign variants remains a daunting challenge. Rarity is recognised as a necessary, although not sufficient, criterion for pathogenicity, but frequency cutoffs used in Mendelian analysis are often arbitrary and overly lenient. Recent very large reference datasets, such as the Exome Aggregation Consortium (ExAC), provide an unprecedented opportunity to obtain robust frequency estimates even for very rare variants. Here we present a statistical framework for the frequency-based filtering of candidate disease-causing variants, accounting for disease prevalence, genetic and allelic heterogeneity, inheritance mode, penetrance, and sampling variance in reference datasets.
-
a new database called dbDSM A database of Deleterious Synonymous Mutation, a continually updated database that collects, curates and manages available human disease-related SM data obtained from published literature.
-
LncVar: a database of genetic variation associated with long non-coding genes
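The frequency-based filtering idea in the paper above can be illustrated with a small calculation. This is a simplified sketch for a dominant disorder only, based on my reading of the framework; the function names and all input values are assumptions chosen for illustration, and the published method and its shiny App should be treated as the reference.

```python
import math


def max_credible_af(prevalence, max_allelic_contribution, penetrance):
    """Upper bound on the population AF of a single causal variant (dominant model).

    The factor 2 reflects that each affected person carries the causal
    allele on one of two chromosomes.
    """
    return prevalence * max_allelic_contribution / (2 * penetrance)


def max_allele_count(af, n_chromosomes, confidence=0.95):
    """Smallest count k whose Poisson CDF (mean af * n_chromosomes) reaches `confidence`.

    This accounts for sampling variance in the reference dataset.
    """
    lam = af * n_chromosomes
    term = math.exp(-lam)  # P(X = 0)
    cumulative = term
    k = 0
    while cumulative < confidence:
        k += 1
        term *= lam / k  # P(X = k) from P(X = k - 1)
        cumulative += term
    return k


# Illustrative inputs (assumptions): prevalence 1 in 500, no single variant
# explains more than 2% of cases, penetrance 50%.
af = max_credible_af(prevalence=1 / 500, max_allelic_contribution=0.02, penetrance=0.5)
# Roughly 60,000 diploid exomes, i.e. about 120,000 chromosomes.
print(af, max_allele_count(af, n_chromosomes=120_000))
```

A candidate variant observed more often than the returned allele count in the reference dataset would be considered too common to cause the disease under these assumptions.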
Annotation of the variants
- VEP
- ANNOVAR
- VCFanno
- Personal Cancer Genome Reporter (PCGR)
- awesome-cancer-variant-databases
Manual review of the called variants with IGV
Third generation sequencing for Structural variants (works on short reads as well!)
- beautiful “Ribbon” viewer to visualize complicated SVs revealed by PacBio reads github page
- Sniffles: Structural variation caller using third generation sequencing is a structural variation caller using third generation sequencing (PacBio or Oxford Nanopore). It detects all types of SVs using evidence from split-read alignments, high-mismatch regions, and coverage analysis.
- splitThreader for visualizing structural variants. Finally a good visualizer!
- New Genome Browser (NGB) - a Web - based NGS data viewer with unique Structural Variations (SVs) visualization capabilities, high performance, scalability, and cloud data support. Looks very promising.
tools useful for everyday bioinformatics
- bedtools one must know how to use it!
- bedops as useful as bedtools.
- valr provides tools to read and manipulate genome intervals and signals. (dplyr friendly!)
- tidygenomics similar to GRanges but operate on dataframes!
- InteractionSet useful for Hi-C, ChIA-PET. I used it for Breakpoints clustering for structural variants
- Paired Genomic Loci Tool Suite: pgltools intersect can do breakpoint merging.
- svtools Tools for processing and analyzing structural variants.
- sveval Functions to compare an SV call set against a truth set.
- Teaser A tool to benchmark mappers and different parameters within minutes.
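Most of the interval tools above boil down to one primitive: deciding whether two genomic intervals overlap and by how much. Here is a conceptual sketch for BED-style (0-based, half-open) intervals; it only illustrates the idea and is not how bedtools or any other listed tool is implemented.

```python
def overlap(a, b):
    """Number of overlapping bases between two (chrom, start, end) intervals.

    Intervals are assumed to be 0-based and half-open, as in BED files.
    """
    chrom_a, start_a, end_a = a
    chrom_b, start_b, end_b = b
    if chrom_a != chrom_b:
        return 0
    return max(0, min(end_a, end_b) - max(start_a, start_b))


peak = ("chr1", 100, 200)
gene = ("chr1", 150, 300)
neighbor = ("chr1", 200, 260)

print(overlap(peak, gene))      # shared bases between peak and gene
print(overlap(peak, neighbor))  # adjacent half-open intervals share nothing
```

With half-open intervals, two features that merely touch (one ends at 200, the next starts at 200) report zero overlap, which is one reason the BED convention is convenient for interval arithmetic.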
A series of posts from Brad Chapman
- Validating multiple cancer variant callers and prioritization in tumor-only samples
- Benchmarking variation and RNA-seq analyses on Amazon Web Services with Docker
- Validating generalized incremental joint variant calling with GATK HaplotypeCaller, FreeBayes, Platypus and samtools
- Validated whole genome structural variation detection using multiple callers
- Validated variant calling with human genome build 38
Copy number variants
- Interactive analysis and assessment of single-cell copy-number variations: Ginkgo
- Copynumber Viewer
- paper: Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives
- bioconductor copy number work flow
- paper: Assessing the reproducibility of exome copy number variations predictions
- CNVkit A command-line toolkit and Python library for detecting copy number variants and alterations genome-wide from targeted DNA sequencing.
- SavvyCNV: genome-wide CNV calling from off-target reads
- dryclean Robust foreground detection in somatic copy number data https://www.biorxiv.org/content/10.1101/847681v2
Tools for visualization
-
New app gene.iobio
App here. I will definitely give it a try.
-
ASCIIGenome is a command-line genome browser running from terminal window and solely based on ASCII characters. Since ASCIIGenome does not require a graphical interface it is particularly useful for quickly visualizing genomic data on remote servers. The idea is to make ASCIIGenome the Vim of genome viewers.
Tools for vcf files
- tools for pedigree files. It can determine sex from PED and VCF files. Developed by Brent Pedersen. I really like tools from Aaron Quinlan's lab.
- cyvcf2 is a cython wrapper around htslib built for fast parsing of Variant Call Format (VCF) files
- PyVCF - A Variant Call Format Parser for Python
- VcfR: an R package to manipulate and visualize VCF format data
- Varapp is an application to filter genetic variants, with a reactive graphical user interface. Powered by GEMINI.
- varmatch: robust matching of small variant datasets using flexible scoring schemes
- vcf-validator validate your VCF files!
- BrowseVCF: a web-based application and workflow to quickly prioritize disease-causative variants in VCF files
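The genotype rule quoted in the tips section (tumor 0/1 or 1/1, normal 0/0) can be checked directly on a VCF record. Below is a standard-library-only sketch; it assumes a two-sample VCF with the normal sample in the first sample column and the tumor in the second, which is an assumption about the file layout and not a VCF requirement. For real work, one of the parsers listed above is a better choice.

```python
def genotype(sample_field, format_field):
    """Extract the GT value of one sample and normalize it (unphased, sorted alleles)."""
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    alleles = values["GT"].replace("|", "/").split("/")
    return "/".join(sorted(alleles))


def is_somatic_candidate(vcf_line, normal_col=9, tumor_col=10):
    """True if the normal is 0/0 and the tumor is 0/1 or 1/1.

    Column indices are 0-based positions in the tab-separated record;
    the default normal/tumor order is an assumption for this example.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    format_field = fields[8]
    normal_gt = genotype(fields[normal_col], format_field)
    tumor_gt = genotype(fields[tumor_col], format_field)
    return normal_gt == "0/0" and tumor_gt in {"0/1", "1/1"}


# A made-up record: heterozygous in the tumor, reference in the normal.
record = "chr1\t12345\t.\tC\tT\t.\tPASS\t.\tGT:AD\t0/0:30,0\t0/1:20,10"
print(is_somatic_candidate(record))
```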
mutation signature
- signeR
- deconstructSigs
- MutationalPatterns
- sigminer: an easy-to-use and scalable toolkit for genomic alteration signature analysis and visualization in R
Tools for MAF files
TCGA has all the variant calls in MAF format. Please read a post by Cyriac Kandoth.
- convert vcf to MAF: perl script by Cyriac Kandoth.
- Once converted to MAF, one can use maftools for visualization: oncoprint (wraps ComplexHeatmap), lollipop plots, mutational signatures, etc. Very cool, I just found it...
- MutationalPatterns: an integrative R package for studying patterns in base substitution catalogues
Tools for bam files
- VariantBam: Filtering and profiling of next-generational sequencing data using region-specific rules
Annotate and explore variants
- Variant Effect Predictor: VEP
- SNPEFF
- vcfanno
- myvariant.info tutorial
- FunSeq2- A flexible framework to prioritize regulatory mutations from cancer genome sequencing
- ClinVar
- ExAC
- vcf2db and GEMINI: a flexible framework for exploring genome variation from Quinlan lab.
Plotting
1. oncoprint
2. deconstructSigs aims to determine the contribution of known mutational processes to a tumor sample. By using deconstructSigs, one can: determine the weights of each mutational signature contributing to an individual tumor sample; plot the reconstructed mutational profile (using the calculated weights) and compare it to the original input sample.
3. Fast Principal Component Analysis of Large-Scale Genome-Wide Data
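Signature tools such as deconstructSigs typically start from a mutational profile: counts of single-base substitutions in their trinucleotide context, reported relative to the pyrimidine (C or T) of the mutated base pair, which gives 96 channels. The sketch below shows only that counting step, with hypothetical helper names; it does not fit signature weights.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def mutation_channel(context, alt):
    """Label an SNV by its pyrimidine-centered trinucleotide channel.

    `context` is the 3-base reference sequence centered on the mutated
    base (forward strand); `alt` is the alternate base.
    """
    context, alt = context.upper(), alt.upper()
    if context[1] in "AG":
        # Purine reference base: describe the mutation on the opposite strand.
        context, alt = revcomp(context), revcomp(alt)
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


# Made-up SNVs as (reference trinucleotide, alternate base) pairs.
snvs = [("ACG", "T"), ("TGA", "A"), ("CGT", "A"), ("TTA", "C")]
profile = Counter(mutation_channel(ctx, alt) for ctx, alt in snvs)
print(profile)
```

Note that the second and third example SNVs are G>A changes on the forward strand and are folded into C>T channels on the reverse strand.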
Identify driver genes
Intra-tumor heterogeneity
- ESTIMATE
- ABSOLUTE
- THetA: Tumor Heterogeneity Analysis is an algorithm that estimates the tumor purity and clonal/subclonal copy number aberrations directly from high-throughput DNA sequencing data. The latest release is called THetA2 and includes a number of improvements over previous versions.
- CIBERSORT is an analytical tool developed by Newman et al. to provide an estimation of the abundances of member cell types in a mixed cell population, using gene expression data
- xcell is a webtool that performs cell type enrichment analysis from gene expression data for 64 immune and stroma cell types. xCell is a gene signatures-based method learned from thousands of pure cell types from various sources.
- paper: Digitally deconvolving the tumor microenvironment
- Comprehensive analyses of tumor immunity: implications for cancer immunotherapy by Shirley Liu's lab. TIMER: Tumor IMmune Estimation Resource, a comprehensive resource for the clinical relevance of tumor-immune infiltrations
- Reference-free deconvolution of DNA methylation data and mediation by cell composition effects. The R package's documentation is minimal... see tutorial here from the author. Brent Pedersen has a tool implementing the same method used by Houseman: celltypes450.
- paper: Toward understanding and exploiting tumor heterogeneity
- paper: The prognostic landscape of genes and infiltrating immune cells across human cancers from Alizadeh lab.
- Robust enumeration of cell subsets from tissue expression profiles from Alizadeh lab, and the CIBERSORT tool
- A series of posts on tumor evolution
- mapscape bioc package MapScape integrates clonal prevalence, clonal hierarchy, anatomic and mutational information to provide interactive visualization of spatial clonal evolution.
- cellscape bioc package Explores single cell copy number profiles in the context of a single cell tree
tumor clonality and evolution
- A step-by-step guide to estimate tumor clonality/purity from variant allele frequency data
- densityCut: an efficient and versatile topological approach for automatic clustering of biological data; can be used to cluster allele frequencies.
- phyC: Clustering cancer evolutionary trees
- CloneCNA: detecting subclonal somatic copy number alterations in heterogeneous tumor samples from whole-exome sequencing data
- paper: Distinct evolution and dynamics of epigenetic and genetic heterogeneity in acute myeloid leukemia
- paper: Visualizing Clonal Evolution in Cancer
- An R package for inferring the subclonal architecture of tumors:sciclone
- Inferring and visualizing clonal evolution in multi-sample cancer sequencing: clonevol
- fishplot: Create timecourse "fish plots" that show changes in the clonal architecture of tumors
- tools from OMICS tools website
- PhyloWGS: Reconstructing subclonal composition and evolution from whole-genome sequencing of tumors.
- SCHISM SubClonal Hierarchy Inference from Somatic Mutation
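The clonality tools above all build on the relationship between variant allele frequency (VAF), tumor purity and copy number mentioned in the tips section. The following is a deliberately simplified point estimate of the cancer cell fraction (CCF); it is not the statistical model used by PyClone, SciClone or PhyloWGS, and the function names and example numbers are assumptions for illustration.

```python
def variant_allele_frequency(alt_reads, ref_reads):
    """Fraction of reads at a site that support the variant allele."""
    return alt_reads / (alt_reads + ref_reads)


def cancer_cell_fraction(vaf, purity, tumor_cn, multiplicity=1, normal_cn=2):
    """Point estimate of the fraction of cancer cells carrying the mutation.

    Rearranges the expected VAF of a mutation present on `multiplicity`
    copies in a fraction CCF of tumor cells:
        VAF = purity * multiplicity * CCF
              / (purity * tumor_cn + (1 - purity) * normal_cn)
    Sampling noise can push the estimate above 1.
    """
    total_copies = purity * tumor_cn + (1 - purity) * normal_cn
    return vaf * total_copies / (purity * multiplicity)


# Made-up example: 30 variant reads out of 100 in a 60% pure, diploid tumor.
vaf = variant_allele_frequency(alt_reads=30, ref_reads=70)
print(vaf, cancer_cell_fraction(vaf, purity=0.6, tumor_cn=2))
```

In this example a VAF of 0.3 corresponds to a clonal heterozygous mutation, because only 60% of the sequenced cells are tumor cells and each carries the variant on one of two copies.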
mutual exclusiveness of mutations
- MEGSA: A powerful and flexible framework for analyzing mutual exclusivity of tumor mutations.
- CoMet
- DISCOVER co-occurrence and mutual exclusivity analysis for cancer genomics data.
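As a baseline for what these tools test, mutual exclusivity between two genes can be framed as "fewer co-mutated samples than expected by chance". The sketch below uses a one-sided hypergeometric (Fisher-type) tail probability; it is a simple baseline under an independence assumption and is not the method implemented by MEGSA, CoMet or DISCOVER.

```python
from math import comb


def mutual_exclusivity_pvalue(n_samples, n_a, n_b, n_both):
    """P(co-mutated samples <= n_both) if mutations in A and B were independent.

    n_a and n_b are the numbers of samples mutated in gene A and gene B;
    n_both is the number mutated in both. Small values suggest the two
    genes are mutated together less often than expected.
    """
    lowest = max(0, n_a + n_b - n_samples)  # smallest possible overlap
    favorable = sum(
        comb(n_a, k) * comb(n_samples - n_a, n_b - k)
        for k in range(lowest, n_both + 1)
    )
    return favorable / comb(n_samples, n_b)


# Made-up cohort: 10 samples, each gene mutated in 5, never together.
print(mutual_exclusivity_pvalue(n_samples=10, n_a=5, n_b=5, n_both=0))
```

The dedicated tools go further, for example by handling groups of more than two genes or sample-specific mutation rates, which this baseline ignores.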
mutation enrichment in pathways
- PathScore: a web tool for identifying altered pathways in cancer data
Non-coding mutations
CRISPR
- The caRpools package - Analysis of pooled CRISPR Screens
- CRISPR Library Designer (CLD): a software for the multispecies design of sgRNA libraries
long reads
- Quality Assessment Tools for Oxford Nanopore MinION data
- Signal-level algorithms for MinION data
Single-cell DNA sequencing
- A review paper 2016: Single-cell genome sequencing: current state of the science
- Monovar: single-nucleotide variant detection in single cells
- R2C2: Improving nanopore read accuracy enables the sequencing of highly-multiplexed full-length single-cell cDNA
- sci-LIANTI, a high-throughput, high-coverage single-cell DNA sequencing method that combines single-cell combinatorial indexing (sci) and linear amplification via transposon insertion (LIANTI)