TF-MoDISco: Transcription-Factor Motif Discovery from Importance Scores
This repository contains the code developed for the associated manuscript, Distilling consolidated DNA sequence motifs and cooperative motif syntax from neural-network models of in vivo transcription-factor binding profiles. The analysis scripts and notebooks used to reproduce the results in this manuscript can be found at this repository.
General users should visit the TF-MoDISco-lite repository for a more efficient, actively maintained, and easier-to-use version of the same algorithm.
Structure of TF-MoDISco
The TF-MoDISco algorithm starts with a set of importance scores on genomic sequences, and can perform the following tasks:
- Identify high-importance windows of the sequences, termed "seqlets"
- Cluster recurring similar seqlets into motifs
- Scan through importance scores across the genome to call motif instances (AKA "hit scoring")
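To give a feel for the first task, here is a toy sketch of picking one high-importance window per sequence. It is not the TF-MoDISco seqlet-identification procedure: the function name `top_window_per_sequence`, the fixed `window` width, and the use of summed absolute scores are all assumptions made purely for illustration.

```python
import numpy as np

def top_window_per_sequence(contrib_scores, window=5):
    """Toy illustration (hypothetical helper, not part of TF-MoDISco).

    For each sequence in an N x L x 4 array of contribution scores, return
    (sequence_index, start, end) of the fixed-width window with the largest
    total absolute importance.
    """
    per_position = contrib_scores.sum(axis=-1)  # N x L: one score per position
    kernel = np.ones(window)
    windows = []
    for i, track in enumerate(per_position):
        # Sum of absolute importance in every window of the given width
        window_sums = np.convolve(np.abs(track), kernel, mode="valid")
        start = int(np.argmax(window_sums))
        windows.append((i, start, start + window))
    return windows

# Synthetic scores: one planted high-importance stretch per sequence
scores = np.zeros((2, 20, 4))
scores[0, 7:12, 0] = 1.0
scores[1, 3:8, 2] = -2.0
toy_seqlets = top_window_per_sequence(scores, window=5)
print(toy_seqlets)
```

The actual algorithm additionally decides how many windows to keep per sequence and compares them against a null distribution; this sketch only shows the idea of a window of concentrated importance.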
Installing TF-MoDISco
pip install modisco
Alternatively, for a specific tagged version or commit, install from source code by cloning this repository, checking out the desired version, and running pip install -e /path/to/cloned/repo.
Required inputs to run the algorithm
To run the TF-MoDISco algorithm, the following input arrays are needed (the first two are required; the third is optional):
- An N x L x 4 NumPy array of one-hot encoded genomic sequences, where N is the number of sequences and L is the sequence length (the 4 bases are in A, C, G, T order); this denotes the identity of the sequence
- A parallel N x L x 4 NumPy array of contribution scores; each position contains the importance of the base specified in the corresponding one-hot encoded sequence (i.e. each base position should have at most one nonzero entry out of the 4, which measures importance at the base in the sequence)
- An optional parallel N x L x 4 NumPy array of hypothetical contribution scores, which measure the hypothetical contribution of every base (not just the one that is present in the sequence); that is, the element-wise product of this array with the one-hot encoded genomic sequences should be identical to the array of contribution scores
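The relationship between the three arrays can be sketched with NumPy alone. The example below is a minimal illustration, not code from this repository: the helper `one_hot_encode` is hypothetical, and the random `hyp_scores` array merely stands in for hypothetical contribution scores that would normally come from an attribution method applied to a trained model.

```python
import numpy as np

BASES = "ACGT"  # channel order expected for the last axis

def one_hot_encode(seqs):
    """Hypothetical helper: encode equal-length DNA strings as N x L x 4."""
    n, length = len(seqs), len(seqs[0])
    onehot = np.zeros((n, length, 4), dtype=np.float32)
    for i, seq in enumerate(seqs):
        for j, base in enumerate(seq.upper()):
            idx = BASES.find(base)
            if idx >= 0:  # ambiguous bases such as N stay all-zero
                onehot[i, j, idx] = 1.0
    return onehot

seqs = ["ACGTAC", "GGNTCA"]
onehot = one_hot_encode(seqs)

# Placeholder for model-derived hypothetical scores (assumption: random data)
rng = np.random.default_rng(0)
hyp_scores = rng.normal(size=onehot.shape).astype(np.float32)

# Actual contribution scores: keep only the score of the base that is present
contrib_scores = hyp_scores * onehot

# Sanity checks matching the input description above
assert onehot.shape == hyp_scores.shape == contrib_scores.shape
assert (np.count_nonzero(contrib_scores, axis=-1) <= 1).all()
```

Running checks like the two assertions at the end before handing arrays to the algorithm is a cheap way to catch a wrong axis order (for example N x 4 x L) or mismatched shapes.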
Other resources
A technical note describing version 0.5.6.5 is available at https://arxiv.org/abs/1811.00416.
Video of talk at NeurIPS MLCB 2017
Example notebooks for running the algorithm:
- TF MoDISco TAL GATA: a self-contained example notebook that uses pre-computed importance scores (generated by a neural network) as input. Scores were generated using deeplift as illustrated in this notebook. If deeplift doesn't work with your architecture, you could alternatively generate scores using DeepSHAP (an extension of DeepLIFT that can work with more diverse architectures) as illustrated in this notebook (heads-up: that notebook uses a custom branch of the DeepSHAP repository).
- TF MoDISco Nanog: a self-contained example notebook that uses pre-computed importance scores and an empirically generated null distribution (generated by a gkm-SVM) as input. Scores were generated using gkmexplain as illustrated in this notebook. This notebook also illustrates how to use a MEME-based initialization to potentially boost the performance of TF-MoDISco.