  • Stars: 141
  • Rank 258,445 (Top 6 %)
  • Language: Jupyter Notebook
  • License: MIT License
  • Created about 5 years ago
  • Updated about 2 months ago

Repository Details

Toolkit to train base-resolution deep neural networks on functional genomics data and to interpret them

BPNet

BPNet is a python package with a CLI to train and interpret base-resolution deep neural networks trained on functional genomics data such as ChIP-nexus or ChIP-seq. It addresses the problem of pinpointing the regulatory elements in the genome:

[Figure: BPNet]

Specifically, it aims to answer the following questions:

  • What are the sequence motifs?
  • Where are they located in the genome?
  • How do they interact?

For more information, see the BPNet manuscript:

Deep learning at base-resolution reveals motif syntax of the cis-regulatory code (http://dx.doi.org/10.1101/737981).

Overview

[Figure: BPNet overview]

Getting started

The main documentation of the bpnet package and an end-to-end example highlighting its main features are contained in the following Colab notebook: https://colab.research.google.com/drive/1VNsNBfugPJfJ02LBgvPwj-gPK0L_djsD. You can run this notebook yourself by clicking on 'Open in playground'. Individual cells of the notebook can be executed by pressing the Shift+Enter keyboard shortcut.

To learn more about colab, visit https://colab.research.google.com and follow the 'Welcome To Colaboratory' notebook.

Main commands

Compute data statistics to inform hyper-parameter selection such as choosing to trade off profile vs total count loss (lambda hyper-parameter):

bpnet dataspec-stats dataspec.yml
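
The trade-off mentioned above can be pictured as a weighted sum of two terms. The sketch below assumes the combined objective has the form "profile loss plus lambda times total count loss"; the function name combined_loss and the numbers are purely illustrative and are not part of the bpnet API.

```python
def combined_loss(profile_loss: float, count_loss: float, lam: float) -> float:
    """Weighted sum of the two loss terms (assumed form, not bpnet's own code)."""
    return profile_loss + lam * count_loss


# A larger lambda gives the total count term more weight in the objective.
for lam in (1.0, 10.0, 100.0):
    print(lam, combined_loss(profile_loss=2.0, count_loss=0.5, lam=lam))
```

The statistics reported by bpnet dataspec-stats are what inform an actual choice of lambda; the values used here are placeholders.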

Train a model on BigWig tracks specified in dataspec.yml using an existing architecture bpnet9 on 200 bp sequences with 6 dilated convolutional layers:

bpnet train --premade=bpnet9 dataspec.yml --override='seq_width=200;n_dil_layers=6' .
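
The --override value in the command above is a list of key=value pairs separated by semicolons. A minimal sketch of assembling such a string from a Python dictionary follows; build_override is a hypothetical helper written for this example, not a bpnet function.

```python
def build_override(params: dict) -> str:
    """Join key=value pairs with ';', mirroring the --override example above."""
    return ";".join(f"{key}={value}" for key, value in params.items())


override = build_override({"seq_width": 200, "n_dil_layers": 6})

# Passed as an argument list (e.g. to subprocess.run), the value needs no shell quoting.
command = ["bpnet", "train", "--premade=bpnet9", "dataspec.yml",
           f"--override={override}", "."]
print(command)
```

Building the string programmatically can help when the same training command is launched for several hyper-parameter settings.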

Compute contribution scores for the regions specified in the dataspec.yml file and store them in contrib.scores.h5:

bpnet contrib . --method=deeplift contrib.scores.h5

Export BigWig tracks containing model predictions and contribution scores:

bpnet export-bw . --regions=intervals.bed --scale-contribution bigwigs/

Discover motifs with TF-MoDISco using contribution scores stored in contrib.scores.h5, premade configuration modisco-50k and restricting the number of seqlets per metacluster to 20k:

bpnet modisco-run contrib.scores.h5 --premade=modisco-50k --override='TfModiscoWorkflow.max_seqlets_per_metacluster=20000' modisco/

Determine motif instances with CWM scanning and store them in motif-instances.tsv.gz:

bpnet cwm-scan modisco/ --contrib-file=contrib.scores.h5 modisco/motif-instances.tsv.gz

Generate additional reports suitable for ChIP-nexus or ChIP-seq data:

bpnet chip-nexus-analysis modisco/
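
The commands above are listed in the order in which they would typically be run. As a sketch, they can be collected into a small driver script; the wrapper function run_pipeline and its dry_run flag are assumptions made for this example, while the command lines themselves are copied from this section.

```python
import shlex
import subprocess

# Commands copied from the section above, in the order they are listed.
PIPELINE = [
    ["bpnet", "dataspec-stats", "dataspec.yml"],
    ["bpnet", "train", "--premade=bpnet9", "dataspec.yml",
     "--override=seq_width=200;n_dil_layers=6", "."],
    ["bpnet", "contrib", ".", "--method=deeplift", "contrib.scores.h5"],
    ["bpnet", "export-bw", ".", "--regions=intervals.bed",
     "--scale-contribution", "bigwigs/"],
    ["bpnet", "modisco-run", "contrib.scores.h5", "--premade=modisco-50k",
     "--override=TfModiscoWorkflow.max_seqlets_per_metacluster=20000",
     "modisco/"],
    ["bpnet", "cwm-scan", "modisco/", "--contrib-file=contrib.scores.h5",
     "modisco/motif-instances.tsv.gz"],
    ["bpnet", "chip-nexus-analysis", "modisco/"],
]


def run_pipeline(dry_run: bool = True) -> list:
    """Print each command; execute it only when dry_run is False."""
    shown = []
    for argv in PIPELINE:
        line = shlex.join(argv)  # shell-quoted form, for display only
        print(line)
        shown.append(line)
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing step
    return shown


run_pipeline(dry_run=True)
```

With dry_run=True nothing is executed, so the script can be checked before committing to a long training run.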

Note: these commands are also accessible as python functions:

  • bpnet.cli.train.bpnet_train
  • bpnet.cli.train.dataspec_stats
  • bpnet.cli.contrib.bpnet_contrib
  • bpnet.cli.export_bw.bpnet_export_bw
  • bpnet.cli.modisco.bpnet_modisco_run
  • bpnet.cli.modisco.cwm_scan
  • bpnet.cli.modisco.chip_nexus_analysis
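
Since the signatures of these functions are not given here, the sketch below only checks whether each dotted path from the list can be imported in the current environment. The helper resolve is hypothetical; inspect the returned function (for example with help()) before calling it.

```python
import importlib

# Dotted paths copied from the list above.
CLI_FUNCTIONS = [
    "bpnet.cli.train.bpnet_train",
    "bpnet.cli.train.dataspec_stats",
    "bpnet.cli.contrib.bpnet_contrib",
    "bpnet.cli.export_bw.bpnet_export_bw",
    "bpnet.cli.modisco.bpnet_modisco_run",
    "bpnet.cli.modisco.cwm_scan",
    "bpnet.cli.modisco.chip_nexus_analysis",
]


def resolve(dotted_path: str):
    """Return the object named by 'package.module.name', or None if unavailable."""
    module_name, _, attr = dotted_path.rpartition(".")
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr, None)


for path in CLI_FUNCTIONS:
    status = "available" if resolve(path) is not None else "not importable here"
    print(f"{path}: {status}")
```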

Main python classes

  • bpnet.seqmodel.SeqModel - Keras model container specified by implementing output 'heads' and a common 'body'. It contains methods to compute the contribution scores of the input sequence w.r.t. different output heads.
  • bpnet.BPNet.BPNetSeqModel - Wrapper around SeqModel consolidating profile and total count predictions into a single output per task. It provides methods to export predictions and contribution scores to BigWig files as well as methods to simulate the spacing between two motifs.
  • bpnet.cli.contrib.ContribFile - File handle to the HDF5 file containing the contribution scores.
  • bpnet.modisco.files.ModiscoFile - File handle to the HDF5 file produced by TF-MoDISco.
    • bpnet.modisco.core.Pattern - Object containing the PFM, CWM and optionally the signal footprint
    • bpnet.modisco.core.Seqlet - Object containing the seqlet coordinates.
    • bpnet.modisco.core.StackedSeqletContrib - Object containing the sequence, contribution scores and raw data at seqlet locations.
  • bpnet.dataspecs.DataSpec - File handle to the dataspec.yml file
  • dfi - Frequently used alias for a pandas DataFrame containing motif instance coordinates produced by bpnet cwm-scan. See the colab notebook for the column description.
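
The motif instances written by bpnet cwm-scan are stored as a gzipped TSV file. A minimal sketch of peeking at the first rows using only the standard library follows; it assumes the file has a header row and makes no assumption about the column names. The function read_motif_instances is hypothetical. With pandas installed, pandas.read_csv(path, sep="\t") would give the DataFrame usually referred to as dfi.

```python
import csv
import gzip
import itertools


def read_motif_instances(path: str, limit: int = 5) -> list:
    """Return the first `limit` rows of a gzipped TSV file as dictionaries."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return list(itertools.islice(reader, limit))
```

For example, read_motif_instances("modisco/motif-instances.tsv.gz") would return up to five rows keyed by whatever column names the file contains.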

Installation

The supported Python version is 3.6. After installing Anaconda (download page) or Miniconda (download page), create a new bpnet environment by executing the following code:

# Clone this repository
git clone git@github.com:kundajelab/bpnet.git
cd bpnet

# create 'bpnet' conda environment
conda env create -f conda-env.yml

# Disable HDF5 file locking to prevent issues with Keras (https://github.com/h5py/h5py/issues/1082)
echo 'export HDF5_USE_FILE_LOCKING=FALSE' >> ~/.bashrc

# Activate the conda environment
source activate bpnet

Alternatively, you can start from a fresh conda environment by running the following:

conda create -n bpnet python=3.6
source activate bpnet
conda install -c bioconda pybedtools bedtools pybigwig pysam genomelake
pip install git+https://github.com/kundajelab/DeepExplain.git
pip install tensorflow~=1.0 # or tensorflow-gpu if you are using a GPU
pip install bpnet
echo 'export HDF5_USE_FILE_LOCKING=FALSE' >> ~/.bashrc
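
The export line above only takes effect in shells that read ~/.bashrc. In a notebook or a standalone script, the same variable can be set from Python instead. This is a sketch; placing it before any import of h5py or Keras is the cautious ordering, so that the HDF5 library sees the setting.

```python
import os

# Set the variable only if the surrounding environment has not already defined it.
os.environ.setdefault("HDF5_USE_FILE_LOCKING", "FALSE")

# Import h5py / Keras / bpnet only after this point.
print(os.environ["HDF5_USE_FILE_LOCKING"])
```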

When using bpnet from the command line, remember to activate the bpnet conda environment first:

# activate the bpnet conda environment
source activate bpnet

# run bpnet
bpnet <command> ...

(Optional) Install vmtouch to use bpnet train --vmtouch

To use the --vmtouch flag of the bpnet train command and thereby speed up data loading, install vmtouch. vmtouch loads the bigWig files into the system memory cache, which allows multiple processes to access them from memory.

Here's how to build and install vmtouch:

# ~/bin = directory for locally compiled binaries
mkdir -p ~/bin
cd ~/bin
# Clone and build
git clone https://github.com/hoytech/vmtouch.git vmtouch_src
cd vmtouch_src
make
# Move the binary to ~/bin
cp vmtouch ../
# Add ~/bin to $PATH
echo 'export PATH=$PATH:~/bin' >> ~/.bashrc
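
After opening a new shell, it can be worth confirming that the vmtouch binary is actually found on PATH before relying on --vmtouch. A small sketch using the standard library follows; the helper name vmtouch_path is made up for this example.

```python
import shutil


def vmtouch_path():
    """Return the full path of the vmtouch binary, or None if it is not on PATH."""
    return shutil.which("vmtouch")


path = vmtouch_path()
print(path if path else "vmtouch not found on PATH")
```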
