

DNA sequencing analysis notes from Ming Tang

DNA-seq

Databases for variants

Important paper: DNA damage is a major cause of sequencing errors, directly confounding variant identification

However, in this study we show that false positive variants can account for more than 70% of identified somatic variations, rendering conventional detection methods inadequate for accurate determination of low allelic variants. Interestingly, these false positive variants primarily originate from mutagenic DNA damage which directly confounds determination of genuine somatic mutations. Furthermore, we developed and validated a simple metric to measure mutagenic DNA damage and demonstrated that mutagenic DNA damage is the leading cause of sequencing errors in widely-used resources including the 1000 Genomes Project and The Cancer Genome Atlas.

Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects

How to represent sequence variants

Sequence Variant Nomenclature from Human Genome Variation Society

dbSNP IDs are not unique?

Oh God, why are people still using dbSNP IDs as though they're unique identifiers?

— Daniel MacArthur (@dgmacarthur) July 27, 2016

The Evolving Utility of dbSNP

See the post: dbSNP (build 147) exceeds a ridiculous 150 million variants

In the early days of next-generation sequencing, dbSNP provided a vital discriminatory tool. In exome sequencing studies of Mendelian disorders, any variant already present in dbSNP was usually common, and therefore unlikely to cause rare genetic diseases. Some of the first high-profile disease gene studies therefore used dbSNP as a filter. Similarly, in cancer genomics, a candidate somatic mutation observed at the position of a known polymorphism typically indicated a germline variant that was under-called in the normal sample. Again, dbSNP provided an important filter.

Now, the presence or absence of a variant in dbSNP carries very little meaning. The database includes over 100,000 variants from disease mutation databases such as OMIM or HGMD. It also contains some appreciable number of somatic mutations that were submitted there before databases like COSMIC became available. And, like any biological database, dbSNP undoubtedly includes false positives.

Thus, while the mere presence of a variant in dbSNP is a blunt tool for variant filtering, dbSNP’s deep allele frequency data make it incredibly powerful for genetics studies: it can rule out variants that are too prevalent to be disease-causing, and prioritize ones that are rarely observed in human populations. This discriminatory power will only increase as ambitious large-scale sequencing projects like CCDG make their data publicly available.
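The contrast between the two filtering strategies in the passage above can be sketched in a few lines of Python. This is only an illustration: the variant records, the `in_dbsnp` and `pop_af` field names, and the 0.01 cutoff are all hypothetical and do not come from dbSNP itself.

```python
# Hypothetical variant records; field names and values are made up for illustration.
variants = [
    {"id": "var1", "in_dbsnp": True, "pop_af": 0.25},     # common polymorphism
    {"id": "var2", "in_dbsnp": True, "pop_af": 0.00002},  # rare, but catalogued
    {"id": "var3", "in_dbsnp": False, "pop_af": None},    # never observed
]


def filter_by_presence(records):
    """Blunt filter: drop every variant that is catalogued at all."""
    return [r["id"] for r in records if not r["in_dbsnp"]]


def filter_by_frequency(records, max_af=0.01):
    """Keep variants that are unobserved or rarer than max_af (an assumed cutoff)."""
    return [r["id"] for r in records if r["pop_af"] is None or r["pop_af"] < max_af]


print(filter_by_presence(variants))   # ['var3']
print(filter_by_frequency(variants))  # ['var2', 'var3']
```

The presence filter throws away `var2` even though it is extremely rare, which is exactly the failure mode the post describes.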

Tips and lessons learned during my DNA-seq data analysis journey.

  1. Allele frequency (AF)
    Allele frequency, or gene frequency, is the proportion of a particular allele (variant of a gene) among all allele copies being considered. It can be formally defined as the percentage of all alleles at a given locus on a chromosome in a population gene pool represented by a particular allele. AF is affected by copy-number variation, which is common in cancers. Tools such as PyClone take tumor purity and copy-number data into account to calculate Cancer Cell Fractions (CCFs).

  2. "For SNVs, we are interested in genotype 0/1, 1/1 for tumor and 0/0 for normal. 1/1 genotype is very rare.
    It requires that the same mutation occurs at the same place in two sister chromosomes, which is very rare. One possible way to get 1/1 is deletion of one chromosome and duplication of the mutated chromosome". Quote from Siyuan Zheng.

  3. "Mutect analysis on the TCGA samples finds around 5000 ~ 8000 SNVs per sample." Quote from Siyuan Zheng.

  4. Cell lines might be contaminated or mislabeled. The Great Big Clean-Up

  5. Tumor samples are not pure: you will always have stromal cells and infiltrating immune cells in the tumor bulk. Keep this in mind when you analyze the data.

  6. The devil is in the 0-based and 1-based coordinate systems! Make sure you know which system your file is using:

Credit: Vince Buffalo. Also, read this post and this post.
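As a small illustration of tip 6, here is a sketch in Python of converting between a 0-based, half-open interval (the convention used by BED files) and a 1-based, closed interval (the convention used by VCF and GFF). The function names are my own and are not taken from any library.

```python
def bed_to_one_based(start, end):
    """Convert a 0-based, half-open interval (BED style) to 1-based, closed."""
    return start + 1, end


def one_based_to_bed(start, end):
    """Convert a 1-based, closed interval to 0-based, half-open (BED style)."""
    return start - 1, end


def interval_length(start, end, zero_based=True):
    """Length in bases; the formula depends on the coordinate system."""
    return end - start if zero_based else end - start + 1


# The first 100 bases of a chromosome in both systems:
print(bed_to_one_based(0, 100))                   # (1, 100)
print(one_based_to_bed(1, 100))                   # (0, 100)
print(interval_length(0, 100))                    # 100
print(interval_length(1, 100, zero_based=False))  # 100
```

Only the start coordinate shifts; the end stays the same, which is why off-by-one errors here are so easy to miss.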

Also read The UCSC Genome Browser Coordinate Counting Systems
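Returning to tips 1, 2 and 5 above: a rough Python sketch of how tumor purity and copy number shift the allele frequency you expect to observe. The formula is a common simplification for a clonal mutation and is an assumption on my part; it is not necessarily what PyClone implements, and the function and parameter names are hypothetical.

```python
def observed_vaf(alt_reads, ref_reads):
    """Variant allele frequency from read counts at one site."""
    depth = alt_reads + ref_reads
    if depth == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / depth


def expected_clonal_vaf(purity, tumor_copy_number, mutated_copies=1,
                        normal_copy_number=2):
    """Expected VAF of a mutation present in every tumor cell.

    Assumes normal cells carry no mutated copies and that reads are
    sampled in proportion to the number of allele copies.
    """
    tumor_alleles = purity * tumor_copy_number
    normal_alleles = (1 - purity) * normal_copy_number
    return purity * mutated_copies / (tumor_alleles + normal_alleles)


print(observed_vaf(30, 70))                        # 0.3
# Heterozygous mutation, diploid tumor, 60% purity:
print(round(expected_clonal_vaf(0.6, 2), 3))       # 0.3
# Pure tumor, one copy lost and the mutated copy duplicated (the 1/1 case):
print(expected_clonal_vaf(1.0, 2, mutated_copies=2))  # 1.0
```

Under these assumptions, a heterozygous mutation in a 60% pure tumor shows up at about 30% VAF rather than 50%, which is why purity matters when interpreting calls.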

TL;DR: If you map reads to GRCh37 or hg19, use hs37-1kg:

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz

If you map to GRCh37 and believe decoy sequences help with better variant calling, use hs37d5:

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

If you map reads to GRCh38 or hg38, use the following:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
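The TL;DR above can be encoded as a small lookup, sketched here in Python. The URLs are copied from the text; the `pick_reference` function, its `use_decoy` flag, and the `no_alt_analysis_set` dictionary key are hypothetical names introduced for this example. The function only returns the URL and leaves downloading to whichever tool you prefer.

```python
# URLs are copied verbatim from the recommendations above.
REFERENCE_URLS = {
    "hs37-1kg": "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/"
                "reference/human_g1k_v37.fasta.gz",
    "hs37d5": "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/"
              "reference/phase2_reference_assembly_sequence/hs37d5.fa.gz",
    "no_alt_analysis_set": "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/"
                           "GCA_000001405.15_GRCh38/"
                           "seqs_for_alignment_pipelines.ucsc_ids/"
                           "GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz",
}


def pick_reference(build, use_decoy=False):
    """Return the recommended reference FASTA URL for a genome build."""
    name = build.lower()
    if name in ("grch37", "hg19"):
        key = "hs37d5" if use_decoy else "hs37-1kg"
    elif name in ("grch38", "hg38"):
        key = "no_alt_analysis_set"
    else:
        raise ValueError(f"unknown build: {build}")
    return REFERENCE_URLS[key]


print(pick_reference("hg19"))
print(pick_reference("GRCh37", use_decoy=True))
print(pick_reference("hg38"))
```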

get the reference files and mapping index programmatically

some useful tools for preprocessing

  • FastqPuri: a fastq quality assessment and filtering tool.
  • fastp: a tool designed to provide fast all-in-one preprocessing for FastQ files. It is developed in C++ with multithreading support for high performance. Really promising, take a look!
  • bazam: a new read extraction and realignment tool for next generation sequencing data. Take a look!
  • bwa-mem2: exactly the same results as bwa-mem, 80% faster!

check sample swapping

  • somalier sample-swap checking directly on BAMs/CRAMs for cancer data

Mutation caller, structural variant caller

Delly was the best SV caller in the DREAM challenge: https://www.synapse.org/#!Synapse:syn312572/wiki/70726

SNV filtering

Whole exome and genome sequencing have transformed the discovery of genetic variants that cause human Mendelian disease, but discriminating pathogenic from benign variants remains a daunting challenge. Rarity is recognised as a necessary, although not sufficient, criterion for pathogenicity, but frequency cutoffs used in Mendelian analysis are often arbitrary and overly lenient. Recent very large reference datasets, such as the Exome Aggregation Consortium (ExAC), provide an unprecedented opportunity to obtain robust frequency estimates even for very rare variants. Here we present a statistical framework for the frequency-based filtering of candidate disease-causing variants, accounting for disease prevalence, genetic and allelic heterogeneity, inheritance mode, penetrance, and sampling variance in reference datasets.
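To make the idea of a frequency cutoff derived from disease parameters concrete, here is a hedged Python sketch for a dominant disorder. The way the factors are combined below is my reading of how such a cutoff is commonly computed and should be checked against the paper before use. It ignores the sampling-variance step the abstract mentions, and the function name and example values are illustrative only.

```python
def max_credible_af_dominant(prevalence, max_allelic_contribution,
                             max_genetic_contribution, penetrance):
    """Upper bound on the population AF of a causal variant (dominant model).

    Assumed form: prevalence * allelic * genetic / (2 * penetrance).
    The factor 2 reflects two alleles per person.
    """
    if not 0 < penetrance <= 1:
        raise ValueError("penetrance must be in (0, 1]")
    return (prevalence * max_allelic_contribution * max_genetic_contribution
            / (2 * penetrance))


# Illustrative inputs: 1-in-500 prevalence, no single variant explains more
# than 2% of cases, one gene considered, 50% penetrance.
cutoff = max_credible_af_dominant(1 / 500, 0.02, 1.0, 0.5)
print(f"{cutoff:.1e}")  # 4.0e-05
```

A variant observed in a reference dataset well above this bound would be too common to cause the disease under the stated assumptions.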

  • dbDSM: a new Database of Deleterious Synonymous Mutation, continually updated, that collects, curates and manages available human disease-related SM data obtained from published literature.

  • LncVar: a database of genetic variation associated with long non-coding genes

Annotation of the variants

Manual review of the called variants in IGV

Third generation sequencing for Structural variants (works on short reads as well!)

tools useful for everyday bioinformatics

A series of posts from Brad Chapman

  1. Validating multiple cancer variant callers and prioritization in tumor-only samples
  2. Benchmarking variation and RNA-seq analyses on Amazon Web Services with Docker
  3. Validating generalized incremental joint variant calling with GATK HaplotypeCaller, FreeBayes, Platypus and samtools
  4. Validated whole genome structural variation detection using multiple callers
  5. Validated variant calling with human genome build 38

Copy number variants

Tools for visualization

  1. New app gene.iobio
    App here. I will definitely give it a try.

  2. ASCIIGenome is a command-line genome browser running from terminal window and solely based on ASCII characters. Since ASCIIGenome does not require a graphical interface it is particularly useful for quickly visualizing genomic data on remote servers. The idea is to make ASCIIGenome the Vim of genome viewers.

Tools for vcf files

  1. tools for pedigree files. It can determine sex from PED and VCF files. Developed by Brent Pedersen. I really like tools from Aaron Quinlan's lab.
  2. cyvcf2 is a cython wrapper around htslib built for fast parsing of Variant Call Format (VCF) files
  3. PyVCF - A Variant Call Format Parser for Python
  4. VcfR: an R package to manipulate and visualize VCF format data
  5. Varapp is an application to filter genetic variants, with a reactive graphical user interface. Powered by GEMINI.
  6. varmatch: robust matching of small variant datasets using flexible scoring schemes
  7. vcf-validator validate your VCF files!
  8. BrowseVCF: a web-based application and workflow to quickly prioritize disease-causative variants in VCF files

mutation signature

Tools for MAF files

TCGA has all the variant calls in MAF format. Please read a post by Cyriac Kandoth.

  1. Convert VCF to MAF: perl script by Cyriac Kandoth.
  2. Once converted to MAF, one can use MAFtools for visualization: oncoprints (wrapping complexHeatmap), lollipop plots, mutational signatures, etc. Very cool, I just found it...
  3. MutationalPatterns: an integrative R package for studying patterns in base substitution catalogues

Tools for bam files

  1. VariantBam: Filtering and profiling of next-generational sequencing data using region-specific rules

Annotate and explore variants

  1. Variant Effect Predictor: VEP
  2. SNPEFF
  3. vcfanno
  4. myvariant.info tutorial
  5. FunSeq2- A flexible framework to prioritize regulatory mutations from cancer genome sequencing
  6. ClinVar
  7. ExAC
  8. vcf2db and GEMINI: a flexible framework for exploring genome variation from the Quinlan lab.

Plotting

  1. oncoprint
  2. deconstructSigs aims to determine the contribution of known mutational processes to a tumor sample. By using deconstructSigs, one can: determine the weights of each mutational signature contributing to an individual tumor sample; plot the reconstructed mutational profile (using the calculated weights) and compare it to the original input sample.
  3. Fast Principal Component Analysis of Large-Scale Genome-Wide Data
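The "reconstruct and compare" step described for deconstructSigs can be sketched generically in Python. This is not deconstructSigs code (that is an R package) and it does not fit the weights; it only shows how a profile is rebuilt from given weights and compared to the observed one. The toy signatures use 4 mutation categories to stay readable, and all names and numbers are made up.

```python
def reconstruct_profile(weights, signatures):
    """Weighted sum of signature profiles, one value per mutation category."""
    n_categories = len(next(iter(signatures.values())))
    profile = [0.0] * n_categories
    for name, weight in weights.items():
        for i, fraction in enumerate(signatures[name]):
            profile[i] += weight * fraction
    return profile


def sum_squared_error(observed, reconstructed):
    """How far the reconstructed profile is from the observed one."""
    return sum((o - r) ** 2 for o, r in zip(observed, reconstructed))


# Toy signatures: each is a probability distribution over mutation categories.
signatures = {
    "sigA": [0.7, 0.1, 0.1, 0.1],
    "sigB": [0.1, 0.1, 0.1, 0.7],
}
weights = {"sigA": 0.5, "sigB": 0.5}   # assumed to be already estimated
observed = [0.45, 0.10, 0.10, 0.35]    # made-up tumor profile

reconstructed = reconstruct_profile(weights, signatures)
print([round(x, 2) for x in reconstructed])                    # [0.4, 0.1, 0.1, 0.4]
print(round(sum_squared_error(observed, reconstructed), 4))    # 0.005
```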

Identify driver genes

Intra-tumor heterogeneity

Tumor clonality and evolution

mutual exclusiveness of mutations

  • MEGSA: A powerful and flexible framework for analyzing mutual exclusivity of tumor mutations.
  • CoMet
  • DISCOVER co-occurrence and mutual exclusivity analysis for cancer genomics data.

Mutation enrichment in pathways

  • PathScore: a web tool for identifying altered pathways in cancer data

Non-coding mutations

CRISPR

long reads

  • Quality Assessment Tools for Oxford Nanopore MinION data
  • Signal-level algorithms for MinION data

Single-cell DNA sequencing
