
Workshop on measuring, analyzing, and visualizing the 3D genome with Hi-C data.

Hi-C Data Analysis Bootcamp

A tutorial on measuring, analyzing, and visualizing the 3D genome with Hi-C provided by Harvard, MIT, and UMassMed.

Slides, code, and data are available for you to rerun the analyses!

Introduction

The 4D Nucleome Data Coordination and Integration Center and the Center for 3D Structure and Physics of the Genome hosted a Hi-C data analysis bootcamp at Harvard Medical School on May 8th, 2018. This repo contains the material for this bootcamp. Below, you can find more information on how to walk through the hands-on sessions offline.

Files in this repository

  • Tutorial Part 1 (Hi-C Protocol): Slides PDF | PPTX
  • Tutorial Part 2 (From fastqs to contact matrices): Slides HTML
  • Tutorial Part 3 (From contact matrices to biology): Slides PDF | PPTX
  • Tutorial Part 4 (Hi-C Data Visualization - HiGlass): Slides HTML
  • Tutorial Part 5 (Hi-C Data Visualization - HiPiler): Slides PDF | HTML

Presenters

  • Johan Gibcus, Research Instructor, University of Massachusetts Medical School
  • Nezar Abdennur, PhD student, MIT
  • Soo Lee, Senior Bioinformatics Scientist, Harvard Medical School
  • Peter Kerpedjiev, Postdoctoral Research Fellow, Harvard Medical School
  • Fritz Lekschas, PhD Student, Harvard University
  • Leonid Mirny, Professor, MIT

Motivation and Objectives

Due in large part to the explanatory power of chromosome organization in gene regulation, its association with disease and disorder, and the unanswered questions regarding the mechanisms behind its maintenance and function, the 3D structure and function of the genome are increasingly the target of scientific scrutiny. With efforts such as the 4D Nucleome Project and ENCODE 4 already beginning to generate large amounts of data, the ability to analyze and visualize these data will be a valuable asset to any computational biologist tasked with interpreting experimental results.

The objectives of this tutorial are

  • To introduce the theoretical concepts related to 3D genome data analysis
  • To familiarize participants with the data types, analysis pipeline, and common tools for analysis and visualization of 3D genome data
  • To provide hands-on experience in data analysis by walking through some common use cases of existing tools for data analysis and visualization.

After the workshop, participants should be able to obtain, process, analyze, and visualize 3D genome data on their own, and to understand some of the logic, motivation, and pitfalls associated with common operations such as matrix balancing and multi-resolution visualization.

The subject matter and practical exercises presented in this tutorial will be accessible to a broad audience. Prior experience with next generation sequencing and the data it produces will be helpful for understanding the subsequent processing steps used to derive contact maps as well as some of the artifacts that can arise during data processing.

The material will be most useful to computational biologists and biologists working on genomics-related topics.

Student Requirements

  • A server will be set up for students with all the required software.
  • Windows users, please install Putty (for ssh).

Agenda

09:00 - 09:10 Introduction and Overview (Peter Park and Burak Alver, Harvard)

09:10 - 10:30 Hi-C Protocol (Johan Gibcus, UMass)

10:30 - 10:45 Break

10:45 - 12:15 From fastqs to contact matrices (Soohyun Lee, Harvard)

12:15 - 13:00 Lunch

13:00 - 14:00 From contact matrices to biology (Nezar Abdennur, MIT)

14:00 - 15:00 Hi-C Data Visualization - HiGlass (Peter Kerpedjiev, Harvard)

15:00 - 15:15 Break

15:15 - 16:00 Hi-C Data Visualization - HiPiler (Fritz Lekschas, Harvard)

16:00 - 17:00 Keynote Speaker - Leonid Mirny, MIT

Instructor Bios

Johan Gibcus

Johan Gibcus is a Research Instructor at the University of Massachusetts Medical School. He has not only used but also refined the Hi-C protocol to answer important biological questions about chromosome organization and replication. Web: http://www.dekkerlab.org/

Soo Lee

Soo Lee is a Senior Bioinformatics Scientist in the Department of Biomedical Informatics at Harvard Medical School. She is creating cloud-based pipelines for Hi-C and other genomic data and developing infrastructure for automation of such pipelines as part of the 4D Nucleome Data Coordination and Integration Center. Web: compbio.hms.harvard.edu/people/soohyun-lee

Nezar Abdennur

Nezar Abdennur is a PhD candidate in Computational and Systems Biology at MIT. His research focuses on the determinants of 3D genome organization and the development of tools for dealing with large Hi-C datasets. Twitter: @nv1ctus Web: nvictus.me

Peter Kerpedjiev

Peter Kerpedjiev is a postdoctoral researcher working on creating tools (such as HiGlass) for visualizing large genomic data sets. Twitter: @pkerpedjiev Web: emptypipes.org

Fritz Lekschas

Fritz Lekschas is a PhD student working on biomedical information visualization with a focus on large multiscale genomic data sets. He created tools such as HiPiler and Scalable Insets. Twitter: @flekschas Web: lekschas.de

Leonid Mirny

Leonid Mirny is a professor at MIT's Institute for Medical Engineering & Science. His lab studies the three dimensional organization of chromosomes using a combination of computational analysis and simulation. Twitter: @leonidmirny Web: mirnylab.mit.edu

Pointers for Offline Walk-through

During the bootcamp, users were given access to Linux servers where

  • Docker was installed,
  • conda was installed,
  • a conda environment was set up with a number of dependencies installed, including Jupyter Notebook,
  • higlass-manage was installed,
  • and sample data was downloaded.

You can set up a similar environment and walk through the hands-on sessions of the bootcamp by following the instructions below. Allow 30G of storage for all files used in the tutorial.
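
Before downloading anything, it can help to confirm that the target filesystem has the 30G mentioned above. The following is a minimal sketch, assuming a POSIX `df`; the function name `enough_space` and the choice of the current directory are illustrative and not part of the bootcamp material.

```shell
#!/bin/sh
# Hypothetical helper: succeed if at least $1 GiB are free in directory $2
# (defaults to the current directory).
enough_space() {
    required_gib=$1
    target_dir=${2:-.}
    # df -Pk prints one POSIX-format line per filesystem;
    # column 4 of the second line is the free space in KiB.
    available_kib=$(df -Pk "$target_dir" | awk 'NR == 2 { print $4 }')
    required_kib=$((required_gib * 1024 * 1024))
    [ "$available_kib" -ge "$required_kib" ]
}

if enough_space 30 .; then
    echo "OK: at least 30G free"
else
    echo "Warning: less than 30G free in $(pwd)"
fi
```

Run it from the directory where you plan to store the tutorial files, since different mount points can have very different amounts of free space.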

From fastqs to contact matrices

  1. Install docker, if you have not already done so. (Docker is a lighter alternative to virtual machines.)
  2. Pull the docker image: docker pull duplexa/4dn-hic:v42. This docker image contains several software packages that have been pre-installed for Hi-C data processing.
  3. Download the sample data for this session under your home directory to "~/data/" (or edit the commands on the slides accordingly, if you prefer a different directory).
    # create the data directory under your home directory
    mkdir -p ~/data
    cd ~/data/
    # input fastq files
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/input_R1.fastq.gz
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/input_R2.fastq.gz
    gunzip input_R1.fastq.gz
    gunzip input_R2.fastq.gz
    # bwa genome index
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/hg38.bwaIndex.tgz
    tar -xzf hg38.bwaIndex.tgz
    rm hg38.bwaIndex.tgz
    # chromsizes
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/hg38.mainonly.chrom.size
    # prebaked output files
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/prebaked.tgz
    tar -xzf prebaked.tgz
    rm prebaked.tgz
    # move back a directory
    cd ..
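
After the downloads finish, a quick check that the expected files are in place can save time later. This is a minimal sketch: the helper name `report_missing` is hypothetical, and it only checks the three files whose final names are known from the commands above (the contents of the two archives are not listed here, so they are not checked).

```shell
#!/bin/sh
# Hypothetical helper: print each expected file that is missing from
# directory $1 and return non-zero if any are missing.
report_missing() {
    dir=$1
    shift
    missing=0
    for name in "$@"; do
        if [ ! -e "$dir/$name" ]; then
            echo "missing: $dir/$name"
            missing=1
        fi
    done
    return "$missing"
}

# File names taken from the download commands above (fastqs are gunzipped).
report_missing "$HOME/data" \
    input_R1.fastq input_R2.fastq hg38.mainonly.chrom.size \
    || echo "Some files are missing; rerun the corresponding wget commands."
```

If you chose a directory other than "~/data/", pass that directory as the first argument instead.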

Now, you should be able to follow slides 1 through 23 of the tutorial. When you are finished, exit the docker container with Ctrl-d before proceeding to the next part.

Working in a cluster without docker

If you are working on a high-performance computing cluster, you may not be allowed to install Docker. Instead, you can find the recipe for the docker image used above here. The exact configuration of the docker image can be seen in the dockerfile. You can get information on the bioinformatics software installed inside the docker image in the download.sh file.

From contact matrices to biology

  1. Install conda, if you have not already done so. Conda is an open source package management tool that allows you to create separate environments.
  2. Clone this repo and set up the environment.
    git clone https://github.com/hms-dbmi/hic-data-analysis-bootcamp
    cd hic-data-analysis-bootcamp
    git pull
    #you may need some of the following in case you have an issue creating an environment
    #conda update --all -y
    #sudo yum install -y hg
    #conda install gcc
    conda env create -n bootcamp -f environment.yml
    
  3. Download the sample data for this session into the pre-existing "notebooks/data" directory (or edit the commands on the slides accordingly, if you prefer a different directory).
    # from the hic-data-analysis-bootcamp directory we just made
    cd notebooks/data
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/NIPBL.1000.mcool
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/NIPBL.10000.cool
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/NIPBL.20000.cool
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/NIPBL.40000.cool
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/NIPBL.100000.cool
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/TAM.1000.mcool
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/TAM.10000.cool
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/TAM.20000.cool
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/TAM.40000.cool
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/TAM.100000.cool
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/UNTR.1000.mcool
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/UNTR.10000.cool
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/UNTR.20000.cool
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/UNTR.40000.cool
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/UNTR.100000.cool
    
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/CtcfCtrl.mm9__VS__InputCtrl.mm9.narrowPeak_with_motif.txt.gz
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/GSM1551552_HIC003_merged_nodups.txt.subset.gz
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/NIPBL_R1.nodups.pairs.gz
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/NIPBL_R1.nodups.pairs.gz.px2
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/Rao2014-GM12878-MboI-allreps-filtered.1000kb.cool
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/Rao2014-GM12878-MboI-allreps-filtered.5kb.cool
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/UNTR_R1.nodups.pairs.gz
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/UNTR_R1.nodups.pairs.gz.px2
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/b37.chrom.sizes.reduced
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/ctcf-sites.paired.300kb_flank10kb.tsv
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/hg19.chrom.sizes.reduced
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/mm9.chrom.sizes.reduced
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/mm9.fa
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/ranked_TSS.tsv
    
  4. Go back to the "notebooks" directory and activate the environment to run the jupyter notebook.
    cd ..
    source activate bootcamp
    jupyter notebook
    

If you're running it on your local machine, the notebook will open at http://localhost:8888. You may have to enter the token displayed when Jupyter starts up. Follow the steps in the notebooks starting with the top one, named "00_intro_cooler-cli".
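
The fifteen per-condition cooler downloads in step 3 follow a regular naming pattern, so the list of URLs can be generated with a loop instead of being typed out. This is a minimal sketch: the function name `cooler_urls` is hypothetical, and it covers only the NIPBL, TAM, and UNTR cooler files, not the other sample files in that step.

```shell
#!/bin/sh
# Base URL copied from the wget commands in step 3.
BASE_URL="https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp"

# Hypothetical helper: print the URL of every per-condition cooler file.
cooler_urls() {
    for condition in NIPBL TAM UNTR; do
        # The first file per condition uses the .mcool extension;
        # the remaining four use .cool.
        echo "$BASE_URL/$condition.1000.mcool"
        for resolution in 10000 20000 40000 100000; do
            echo "$BASE_URL/$condition.$resolution.cool"
        done
    done
}

# Print the list. To download instead, pipe it to wget, e.g.
#   cooler_urls | wget -i -
cooler_urls
```

Run the download variant from the "notebooks/data" directory so the files land where the notebooks expect them.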

HiGlass

  1. Install and start Docker on your machine, then pull the HiGlass image and install higlass-manage.
    docker pull gehlenborglab/higlass:v0.2.63  # higlass
    pip install higlass-manage --upgrade
    
  2. Download the sample data.
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/Schwarzer-et-al-2017-NIPBL.multi.cool
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/Schwarzer-et-al-2017-RNAseq-minus.bw
    wget https://s3.amazonaws.com/4dn-dcic-public/hic-data-analysis-bootcamp/Schwarzer-et-al-2017-UNTR.multi.cool
    

Now, you should be able to follow slides 24 through 59 of the tutorial.

Resources

Papers

  • Imakaev, Maxim, et al. "Iterative correction of Hi-C data reveals hallmarks of chromosome organization." Nature methods 9.10 (2012): 999-1003. doi:10.1038/nmeth.2148
  • Lajoie, Bryan R., Job Dekker, and Noam Kaplan. "The Hitchhiker's guide to Hi-C analysis: practical guidelines." Methods 72 (2015): 65-75. doi:10.1016/j.ymeth.2014.10.031
  • Kerpedjiev, Peter, et al. "HiGlass: Web-based Visual Comparison And Exploration Of Genome Interaction Maps" bioRxiv. doi:10.1101/121889
  • Lekschas, Fritz et al. "HiPiler: Visual Exploration Of Large Genome Interaction Matrices With Interactive Small Multiples" IEEE Transactions on Visualization and Computer Graphics, 24(1), 522-531. doi:10.1109/TVCG.2017.2745978
  • Belaghzal H, et al. "Hi-C 2.0: An optimized Hi-C procedure for high-resolution genome-wide mapping of chromosome conformation." Methods. 2017 https://doi.org/10.1016/j.ymeth.2017.04.004
  • Golloshi R, et al. "Iteratively improving Hi-C experiments one step at a time." Methods. 2018 https://doi.org/10.1016/j.ymeth.2018.04.033
  • Oddes, Sivan, et al. "Three invariant Hi-C interaction patterns: applications to genome assembly". bioRxiv 306076. https://doi.org/10.1101/306076
