Applied Computational Genomics Course at UU: Spring 2022
- Faculty: Aaron Quinlan (aquinlan at genetics.utah.edu)
- Location: HSEB 3515B; on Zoom (https://utah.zoom.us/j/95686980443) for the first two weeks
- Teaching assistants:
- Holly Thorpe
- Jason Kunisaki
- Casey Sederman
- Isabelle Cooperstein
- Meets Tu and Th from 10:30-11:50, starting January 11, 2022.
- TA Hours (TBD):
- Wednesday 12PM - 1PM (https://utah.zoom.us/j/95737998575, pw: 314025)
- Monday 2PM - 3PM (https://utah.zoom.us/j/95805178811, pw: 278239)
- Homework Submission Link
Overview
This course provides a comprehensive introduction to fundamental concepts and experimental approaches in the analysis and interpretation of experimental genomics data. It is structured as a series of lectures covering key concepts and analytical strategies. We will explore a diverse range of analyses enabled by modern DNA sequencing technologies, including sequence alignment, the identification of genetic and structural variation, and ChIP-seq and RNA-seq analysis. Students will learn and apply the fundamental data formats and analysis strategies that underlie computational genomics research. The primary goal is for students to be grounded in theory and to leave the course empowered to conduct independent genomic analyses.
Important notes
- Class participation is expected. Ask a question if you have one!
- When on Zoom, cameras must be on.
Grading policy
All assignments are due on the date stated in class. Ten percent of the grade will be deducted for each 24 hours that the assignment is late.
Course lecture slides
- reading assignment: 01-Brief-History-of-Bioinformatics.pdf
- Jan 11, 2022: Course overview and Intro to UNIX
- Jan 13, 2022: Intro to UNIX, Part 2
- reading assignment: 02-Human-Genome-Review.pdf
- Jan 18, 2022: Intro to UNIX, Part 3 and Intro to the Human Genome
- slides
- youtube
- Homework #1: https://gist.github.com/arq5x/c0eb84bce2086fbfbe9184668ef87b31#file-hw1-md
- due Jan 25 at 11:59PM
- post answers as
UNID.hw1.txt
to this link
- Jan 20, 2022: Pattern searching in the human genome
- Jan 25, 2022: Pattern searching in the human genome and Intro to Data Analysis in RStudio
- Jan 27, 2022: Data frames and Importing Data
- slides
- Homework #2: https://gist.github.com/arq5x/c0eb84bce2086fbfbe9184668ef87b31#file-hw2-md
- due Feb 3 at 11:59PM
- post answers as
UNID.hw2.txt
to this link
- Feb 1, 2022: More with data frames, precision v. accuracy, very basic RNA-seq analysis
- Feb 3, 2022: Intro to the tidyverse (guest lecturer: Charlie Murtaugh)
- Feb 8, 2022: DNA sequencing technologies
- slides
- youtube
- Homework #3: https://gist.github.com/arq5x/c0eb84bce2086fbfbe9184668ef87b31#file-hw3-md
- due Feb 17 at 11:59PM
- post answers as
UNID.hw3.html
to this link
- Feb 10, 2022: FASTQ format and tools
- Feb 15, 2022: Sequence mapping and alignment
- Feb 17, 2022: Sequence alignment, SAM/BAM format, samtools, and IGV
- Feb 22, 2022: Samtools and IGV
- Feb 24, 2022: Poisson Processes in Biology
- March 1, 2022: Uncertainty in RNA-seq data
- March 3, 2022: An introduction to awk and bioawk
- Homework #4: https://gist.github.com/arq5x/c0eb84bce2086fbfbe9184668ef87b31#file-hw4-v3-md
- due Mar 24 at 11:59PM
- post answers as
UNID.hw4.txt
to this link
- Mar 15, 2022: Genetic Variation
- Mar 17, 2022: SNP and INDEL discovery (part 1)
- Mar 22, 2022: Rates and patterns of human germline variation
- Mar 24, 2022: VCF format, Hardy Weinberg Equilibrium, VCF toolkits
- Mar 29, 2022: VCF annotation and interpretation
- Homework #5: https://gist.github.com/arq5x/c0eb84bce2086fbfbe9184668ef87b31#file-hw5-2022-md
- due Mar 18 at 11:59PM
- post answers as
UNID.hw5.txt
to this link
- Mar 31, 2022: Genome Annotation and Resources
- April 5, 2022: Genome Annotation Formats.
- April 7, 2022: Genome arithmetic with bedtools
- Apr 12, 2022: Real world analyses with bedtools.
- Homework #6: solve all 10 puzzles at the end of the bedtools tutorial: http://quinlanlab.org/tutorials/bedtools/bedtools.html
- due April 26 (last day of classes) at 11:59PM
- post answers as
UNID.hw6.txt
to this link
- Apr 14, 2022: Monte Carlo simulations and more on UNIX
- Apr 19, 2022: The Normal Distribution
- Apr 21, 2022: Descriptive plots. The Central Limit Theorem
- April 26, 2022: The t-statistic, t-distribution, t-tests, and p-values
Not covered in 2022's course, but available for reference.
- Apr 13, 2020: Q-Q plots
- April 22, 2020: Introduction to Linear Regression
- April 27, 2020: Introduction to tidyverse
- slides
- youtube
- The Central Limit Theorem and Confidence Intervals
- Structural and copy number variation
- Patterns of Mutation in the Human Genome
Homework
- Homework 1: Basic Unix analysis
- Homework 2: DNA Pattern exploration in a FASTA file
- Homework 3: Working with the FASTQ format
- Homework 4: BAM files, samtools, IGV
- Homework 5: Exploring genetic variation in VCF files
- Homework 6: Bedtools analysis problems. Bottom of page
- Homework 7: Probability and R
Syllabus
-
Class 1 (Tu Jan 9; Layer): Course overview and Intro to UNIX
- Class 1 Slides
- Required Reading Prior to Lecture:
- Part 1 of Unix and Perl Primer for Biologists
- Topics covered
- Brief history of computational biology
- Course computing environment
- Intro. to UNIX: Part 1
- Logging in
- The "shell"
- "Home"
- Navigation
- File system
- Files
- Basic commands: ls, pwd, cd, mkdir, head
-
Class 2 (Th Jan 11; Layer): Intro to UNIX Part 2
- Class 2 Slides
- Required Reading Prior to Lecture:
- Part 2 (Advanced UNIX) of Unix and Perl Primer for Biologists
- Topics covered
- Intro. to UNIX: Part 2
- grep
- cut
- redirects
- Homework 1 assigned. (due by start of class, Jan 17)
-
Class 3 (Tu Jan 16; Quinlan): The human genome
- Class 3 Slides
- Required Reading Prior to Lecture:
- Topics covered
- Karyotype
- Chromosome structure
- Centromeres
- Banding
- Chromatin
- How was the genome sequenced?
- sequencing technology
- assembly strategy
- Chromosomes
- size
- gene content
- centromeres
- Haplotypes
- Genes and transcripts
- Repeat content
- mobile elements
- simple repeats
- GC content, banding
- CpG islands
-
Class 4 (Th Jan 18; Quinlan): Using UNIX to find patterns in a genome
- Required Reading Prior to Lecture:
- None.
- Topics covered
- The UNIX PATH
- Environment variables
- Basic regular expressions with grep
- sort
- uniq
- Homework 2 (finding biological patterns in FASTA files with UNIX) assigned
-
Class 5 (Tu Jan 23; Quinlan): Genetic variation: mutations, polymorphisms, and haplotypes
- Required Reading Prior to Lecture:
- Topics covered
- Genetic variation: what, why, etc.
- Mutation vs. polymorphism
- De novo mutation
- Human mutation rates
- Polymorphism
- SNPs, INDELs
- abundance
- frequency
- examples
- 1000 Genomes
- Site frequency spectrum
- Population stratification
- Intro to haplotypes and recombination
-
Class 6 (Th Jan 25; Quinlan): Modern DNA sequencing technologies
- Required Reading Prior to Lecture:
- Topics covered
- Illumina sequencing
- Overview of technology
- Paired-end vs. single-end
- PacBio
- Oxford Nanopore
- Base calling
- FASTQ format
- seqtk, fastx toolkit
- Homework 3 (working with the FASTQ format) assigned
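As a small illustration of the FASTQ and base-calling topics above, the following Python sketch reads one four-line FASTQ record and converts its quality string into Phred scores and error probabilities. It assumes the Phred+33 encoding (Sanger / Illumina 1.8+); the example record and the function names are made up for illustration and are not taken from the course materials.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'

def phred_scores(qual, offset=33):
    """Convert an ASCII quality string to integer Phred scores (assumes Phred+33)."""
    return [ord(c) - offset for c in qual]

def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)

# A hypothetical single-read FASTQ file held in memory.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
name, seq, qual = next(read_fastq(example))
scores = phred_scores(qual)
for base, q in zip(seq, scores):
    print(base, q, error_probability(q))
```

Tools such as seqtk do this far more efficiently; the point here is only to show how the quality line maps onto per-base error estimates.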
-
Class 7 (Tu Jan 30; Quinlan): DNA sequence mapping and alignment (https://docs.google.com/presentation/d/1RskyGhXx4Lc6wSvvb_ZuCUJGUiP2RAr9X8bGh9Kz77I/edit?usp=sharing)
- Required Reading Prior to Lecture:
- Topics covered
- Sequence alignment
- Theory
- Mapping versus alignment
- Local versus global alignment
- Smith-Waterman
- Needleman-Wunsch
- Advanced algorithms
- Alignment for RNA-seq
- Alignment for SV detection.
- Tools
- BWA, etc.
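To make the local-alignment topic above concrete, here is a minimal Python sketch of the Smith-Waterman scoring recurrence. It returns only the best local alignment score (no traceback) and uses a linear gap penalty; the scoring values (match +2, mismatch -1, gap -2) are arbitrary assumptions chosen for illustration, not values from the lecture.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]  # first column is always 0 in local alignment
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            score = max(0, diag, up, left)  # the 0 lets an alignment restart anywhere
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Dropping the `0` from the `max` and initializing the first row and column with cumulative gap penalties would turn this into the global (Needleman-Wunsch) recurrence, in which case the score is read from the bottom-right cell rather than taken as the matrix maximum.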
-
Class 8 (Th Feb 1; Quinlan): SAM/BAM format, samtools, and IGV (https://docs.google.com/presentation/d/1_iT3btOZqjPmVb8Ryk5ssMBCMxoQ0MVmasZ6G0luA-c/edit?usp=sharing)
- The SAM/BAM format
- Samtools
- IGV
- Homework 4 (creating and working with SAM/BAM files with samtools and IGV) assigned
-
Class 9 (Tu Feb 6; Quinlan): SNP and INDEL discovery (part 1) (https://docs.google.com/presentation/d/1D4XY9XxQiyYcwwhomRRONxCPr_bJvcC0WM4sb8vouZM/edit?usp=sharing)
- Required Reading Prior to Lecture:
- Optional Reading Prior to Lecture:
- TODO FOR NEXT TIME: INTRODUCE POISSON MODEL OF COVERAGE. 30X IS A DISTRIBUTION
- Topics covered
- SNP and INDEL calling
- Theory
- Basic concept
- Sequencing error
- Bayes' theorem and priors
- Assigning a genotype
- Common problems and artifacts
- paralogy
- low depth
- high error rate
- ambiguous alignment
- Single sample variant detection
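The note above that "30X is a distribution" can be sketched with a Poisson model of per-base coverage. The Python example below assumes reads land uniformly at random across the genome, so depth at a site is Poisson-distributed around the mean; this is an idealization (real coverage is usually more variable), and the threshold of 15 reads is an arbitrary choice for illustration.

```python
import math

def poisson_pmf(k, mean):
    """P(depth == k) when depth ~ Poisson(mean)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def prob_depth_below(threshold, mean):
    """P(depth < threshold): the expected fraction of sites with low coverage."""
    return sum(poisson_pmf(k, mean) for k in range(threshold))

mean_depth = 30  # a nominal "30X" genome
p_low = prob_depth_below(15, mean_depth)
print(f"Fraction of sites expected below 15X at mean {mean_depth}X: {p_low:.6f}")
```

Even under this optimistic model, some sites fall well below the mean, which is one reason low depth appears in the list of common problems for variant calling.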
-
Class 10 (Th Feb 8; Quinlan): SNP and INDEL discovery (part 2) (https://docs.google.com/presentation/d/12jeJQPbntPPPGYszIH1l9u83mXFVU1XdJw-bNgbFu28/edit?usp=sharing)
- Required Reading Prior to Lecture:
- Topics covered
- VCF format
- Attributes
- Genotypes
- Population calling
- Basic annotations
- Landscape of human genetic variation
- Alleles and genotypes
- Allele frequency spectrum
- Hardy-Weinberg equilibrium
- More on haplotypes and recombination
- Exploring the format
- examples
- IGV
- Manipulating VCF with bcftools
- Homework 5 (variant calling and working with VCF files with bcftools and UNIX) assigned
-
Class 11 (Tu Feb 13; Quinlan): VCF format, Hardy Weinberg Equilibrium, VCF toolkits
- Topics covered
- VCF Format
- Allele frequencies
- Genotype frequencies
- Hardy Weinberg Equilibrium
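The relationship between allele frequencies, genotype frequencies, and Hardy-Weinberg equilibrium listed above can be sketched in a few lines of Python. Given observed genotype counts at a biallelic site, the code estimates the allele frequencies p and q, computes the expected counts p², 2pq, and q² times the sample size, and reports a chi-square statistic for the deviation. The genotype counts and function names are hypothetical.

```python
def allele_frequencies(n_AA, n_Aa, n_aa):
    """Return (p, q): frequencies of alleles A and a from genotype counts."""
    total_alleles = 2 * (n_AA + n_Aa + n_aa)
    p = (2 * n_AA + n_Aa) / total_alleles
    return p, 1 - p

def hwe_expected_counts(n_AA, n_Aa, n_aa):
    """Expected genotype counts (AA, Aa, aa) under Hardy-Weinberg equilibrium."""
    n = n_AA + n_Aa + n_aa
    p, q = allele_frequencies(n_AA, n_Aa, n_aa)
    return p * p * n, 2 * p * q * n, q * q * n

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square statistic comparing observed and expected genotype counts."""
    observed = (n_AA, n_Aa, n_aa)
    expected = hwe_expected_counts(n_AA, n_Aa, n_aa)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

# Hypothetical genotype counts for 100 individuals.
p, q = allele_frequencies(25, 50, 25)
expected = hwe_expected_counts(25, 50, 25)
chi2 = hwe_chi_square(25, 50, 25)
print(p, q, expected, chi2)
```

A large statistic (compared against a chi-square distribution with one degree of freedom) suggests the site departs from equilibrium, which in practice is often a sign of genotyping error rather than biology.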
-
Class 12 (Th Feb 15; Quinlan): VCF annotation and interpretation
- Required Reading Prior to Lecture:
- Topics covered
- Concepts
- e.g., synonymous, non-synonymous
- frameshift
- stopgain
- constraint
- impact of transcript model
- Tools
- PolyPhen
- SIFT
- vcfanno
- CADD
- VEP
- SnpEff
-
Class 13 (Tu Feb 20; Quinlan): Variation in genome structure
- Required Reading Prior to Lecture:
- Topics covered
- The genome is repetitive
- Segmental duplication
- SV versus CNV
- SV Mechanisms
- NAHR / ectopic recombination
- NHEJ
- Replication mechanisms
- SV detection
- Examples
-
Class 14 (Th Feb 22; Quinlan): Somatic mutation in cancer
- Required Reading Prior to Lecture:
- Topics covered
- Sources of mutation
- Mutational landscape
- Tumor heterogeneity
- Somatic mutation detection
- why is it so hard?
- Using mutation to track cancer evolution
- Mosaicism and disease
-
Class 15 (Tu Feb 27; Quinlan): Genome annotation
- Required Reading Prior to Lecture:
- None
- Topics covered
- How and why do we annotate a genome?
- Conservation
- CpG islands
- RepeatMasker
- Chromatin modifications
- DNA methylation
- Linkage blocks
-
Class 16 (Th Mar 1; Quinlan): Genome data formats and genome arithmetic
- Required Reading Prior to Lecture:
- None
- Topics covered
- The genome as a coordinate system
- BED format
- GFF format
- VCF format
- UCSC and BioMart to retrieve genome annotations
- UCSC and IGV to visualize
- a bit of awk
-
Class 17 (Tu Mar 6; Quinlan): Applied genome arithmetic with bedtools; part 1
- Required Reading Prior to Lecture:
- Topics covered
- The genome as a coordinate system revisited
- Basic concepts of genome arithmetic
- Introduction to bedtools
- Homework 9 (basic genome arithmetic with bedtools) assigned (due Mar 7)
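The basic idea of genome arithmetic, treating the genome as a coordinate system and asking where two sets of intervals overlap, can be sketched in plain Python. The example below uses BED-style coordinates (0-based start, exclusive end) and a naive all-pairs comparison; it only illustrates the concept and is not how bedtools is implemented. The interval names and coordinates are invented.

```python
from typing import NamedTuple

class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive

def overlap(a, b):
    """Return the shared portion of two intervals, or None if they do not overlap."""
    if a.chrom != b.chrom:
        return None
    start = max(a.start, b.start)
    end = min(a.end, b.end)
    return Interval(a.chrom, start, end) if start < end else None

def intersect(a_intervals, b_intervals):
    """Naive O(n*m) intersection: every overlapping segment between the two sets."""
    hits = []
    for a in a_intervals:
        for b in b_intervals:
            shared = overlap(a, b)
            if shared is not None:
                hits.append(shared)
    return hits

# Hypothetical annotations.
genes = [Interval("chr1", 100, 200), Interval("chr1", 500, 600)]
peaks = [Interval("chr1", 150, 250), Interval("chr1", 200, 300), Interval("chr2", 100, 200)]
shared = intersect(genes, peaks)
print(shared)
```

Because the end coordinate is exclusive, the gene at 100-200 and the peak at 200-300 touch but do not overlap, which is a common source of off-by-one confusion when moving between BED and 1-based formats such as GFF and VCF.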
-
Class 18 (Th Mar 8; Quinlan): Applied genome arithmetic with bedtools; part 2
-
Class 19 (Tu Mar 13; Quinlan): Digging deeper into UNIX, part 1
- awk
- sed
- tr
- PATH
- .bashrc
-
Class 20 (Th Mar 15; Quinlan): ChIP-seq analysis
- experimental design
- protocols
- examples
-
Spring Break March 18-25
-
Class 21 (Tu Mar 27; Quinlan): RNA-seq analysis
- analyses
- toolsets
- Class project assignment
-
Class 22 (Th Mar 29; Quinlan): Basic probability
- Probability with coins and dice
- Probability with DNA
- Conditional probabilities
- Use R for examples
-
Class 23 (Tu Apr 3; Quinlan): Statistical tests
- Gaussian
- Z scores
- Chi-squared
- Fisher
- KS test
- Rank tests
- Applications
-
Class 24 (Th Apr 5; Quinlan): How do I know if my observation is significant?
- Models
- Expectation
- Tests for significance
-
Class 25 (Tu Apr 10; Quinlan): Data visualization, part 1
- Why
- Pattern recognition
- Detect problems
- Anscombe’s quartet
- Introduce class projects
-
Class 26 (Th Apr 12; Quinlan): Data visualization, part 2
- http://www.nature.com/collections/qghhqm/pointsofsignificance
- Scatter plots
- Histograms
- Box-and-whisker plots
-
Class 27 (Tu Apr 17; Quinlan): Advanced topics
- loops
- shuffling
- randomization
- advanced commands
- basic scripts and pipelines
-
Class 28 (Th Apr 19; Quinlan): Group Presentations, part 1
-
Class 29 (Tu Apr 24; Quinlan): Group Presentations, part 2