• Stars
    121
  • Rank 292,206 (Top 6 %)
  • Language
    Jupyter Notebook
  • License
    MIT License
  • Created about 8 years ago
  • Updated 8 months ago

Repository Details

TF MOtif Discovery from Importance SCOres

TF-MoDISco: Transcription-Factor Motif Discovery from Importance Scores

This repository contains the code developed for the associated manuscript, "Distilling consolidated DNA sequence motifs and cooperative motif syntax from neural-network models of in vivo transcription-factor binding profiles". The analysis scripts and notebooks used to reproduce the results in this manuscript can be found at this repository.

General users should visit the TF-MoDISco-lite repository for a more efficient, actively maintained, and easier-to-use version of the same algorithm.

Structure of TF-MoDISco

The TF-MoDISco algorithm starts with a set of importance scores on genomic sequences, and can perform the following tasks:

  1. Identify high-importance windows of the sequences, termed "seqlets"
  2. Cluster recurring similar seqlets into motifs
  3. Scan through importance scores across the genome to call motif instances (AKA "hit scoring")
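To make the idea of a "high-importance window" in step 1 concrete, here is a deliberately simplified sketch. It is not the procedure TF-MoDISco itself uses for seqlet identification; it only sums the contribution scores at each position and reports the fixed-width window with the largest total in each sequence. The function name top_window and the window_size parameter are hypothetical and chosen for illustration.

```python
import numpy as np

def top_window(contrib_scores, window_size):
    """Find the highest-scoring fixed-width window in each sequence.

    contrib_scores: N x L x 4 array of contribution scores.
    Returns (starts, totals), each of length N.
    Illustrative only; NOT TF-MoDISco's actual seqlet-calling procedure.
    """
    # Collapse the base axis to get one importance value per position (N x L).
    per_position = contrib_scores.sum(axis=-1)
    # Prefix sums with a leading zero column, so any window total is a difference.
    csum = np.cumsum(per_position, axis=1)
    csum = np.concatenate([np.zeros((csum.shape[0], 1)), csum], axis=1)
    # window_totals[n, i] is the sum of per_position[n, i:i + window_size].
    window_totals = csum[:, window_size:] - csum[:, :-window_size]
    starts = window_totals.argmax(axis=1)
    totals = window_totals[np.arange(len(starts)), starts]
    return starts, totals

# Toy input: one sequence of length 10 with importance at positions 4, 5 and 6.
scores = np.zeros((1, 10, 4))
scores[0, 4:7, 2] = 1.0

starts, totals = top_window(scores, window_size=3)
print(starts, totals)  # window starting at position 4, with total score 3.0
```

A real run would also need a rule for deciding which windows are important enough to keep and how to handle overlapping windows, both of which this sketch leaves out.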

Installing TF-MoDISco

pip install modisco

Alternatively, for a specific tagged version or commit, install from source code by cloning this repository, checking out the desired version, and running pip install -e /path/to/cloned/repo.

Required inputs to run the algorithm

To run the TF-MoDISco algorithm, the following inputs are required:

  • An N x L x 4 NumPy array of one-hot encoded genomic sequences, where N is the number of sequences and L is the sequence length (the 4 bases are in A, C, G, T order); this denotes the identity of the sequence
  • A parallel N x L x 4 NumPy array of contribution scores; each position contains the importance of the base specified in the corresponding one-hot encoded sequence (i.e. each base position should have at most one nonzero entry out of the 4, which measures importance at the base in the sequence)
  • An optional parallel N x L x 4 NumPy array of hypothetical contribution scores, which measures the hypothetical contribution of every base (not just the one that is present in the sequence); equivalently, the element-wise product of this array with the one-hot encoded genomic sequences should be identical to the array of contribution scores
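The relationship between the three arrays above can be sketched with NumPy. This is a minimal illustration, assuming the hypothetical contribution scores come from some attribution method; here they are filled with random numbers as a stand-in. The helper one_hot_encode and all variable names are hypothetical and not part of the modisco package.

```python
import numpy as np

def one_hot_encode(seqs):
    """Encode equal-length DNA strings as an N x L x 4 array (A, C, G, T order)."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    onehot = np.zeros((len(seqs), len(seqs[0]), 4), dtype=np.float32)
    for n, seq in enumerate(seqs):
        for pos, base in enumerate(seq.upper()):
            if base in index:  # any other character (e.g. N) stays all-zero
                onehot[n, pos, index[base]] = 1.0
    return onehot

seqs = ["ACGTAC", "TTGACA"]
onehot = one_hot_encode(seqs)  # N x L x 4

# Stand-in for hypothetical contribution scores; real values would come from
# an attribution method, not a random number generator.
rng = np.random.default_rng(0)
hyp_scores = rng.normal(size=onehot.shape).astype(np.float32)

# Contribution scores keep only the score of the base actually present.
contrib_scores = hyp_scores * onehot

# Sanity checks that mirror the input requirements listed above.
assert onehot.shape == contrib_scores.shape == hyp_scores.shape
assert (np.count_nonzero(contrib_scores, axis=-1) <= 1).all()
```

Running checks like these before starting the algorithm is a cheap way to catch a wrong base order or a mismatch between the sequence and score arrays.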

Other resources

A technical note describing version 0.5.6.5 is available at https://arxiv.org/abs/1811.00416.

Video of talk at NeurIPS MLCB 2017

Example notebooks for running the algorithm:

  • TF MoDISco TAL GATA: a self-contained example notebook that uses pre-computed importance scores (generated by a neural network) as input. Scores were generated using deeplift as illustrated in this notebook. If deeplift doesn't work with your architecture, you could alternatively generate scores using DeepSHAP (an extension of DeepLIFT that can work with more diverse architectures) as illustrated in this notebook (heads-up: that notebook uses a custom branch of the DeepSHAP repository).
  • TF MoDISco Nanog: a self-contained example notebook that uses pre-computed importance scores and an empirically-generated null distribution (generated by a gkm-SVM) as input. Scores were generated using gkmexplain as illustrated in this notebook. This notebook also illustrates how to use a MEME-based initialization to potentially boost the performance of TF-MoDISco.
