  • Stars: 395
  • Rank: 109,040 (Top 3%)
  • Language: Python
  • License: Apache License 2.0
  • Created over 7 years ago
  • Updated 6 months ago

Repository Details

Sequential regulatory activity predictions with deep convolutional neural networks.

Basenji

Sequential regulatory activity predictions with deep convolutional neural networks.

Basenji provides researchers with tools to:

  1. Train deep convolutional neural networks to predict regulatory activity along very long chromosome-scale DNA sequences.
  2. Score variants according to their predicted influence on regulatory activity across the sequence and/or for specific genes.
  3. Annotate the distal regulatory elements that influence gene activity.
  4. Annotate the specific nucleotides that drive regulatory element function.
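
To make the variant scoring in item 2 concrete, here is a minimal sketch of the general idea, not Basenji's actual API. It assumes a model can be treated as a function from a DNA string to a list of per-bin predictions. The names `score_variant` and `toy_predict` are hypothetical, and the toy model (GC count per bin) only stands in for a trained network.

```python
from typing import Callable, List, Tuple


def toy_predict(seq: str, bin_size: int = 4) -> List[float]:
    """Stand-in model (an assumption, not Basenji's network): GC count per bin."""
    return [
        float(sum(base in "GC" for base in seq[i:i + bin_size]))
        for i in range(0, len(seq), bin_size)
    ]


def score_variant(
    predict: Callable[[str], List[float]],
    ref_seq: str,
    pos: int,
    alt: str,
) -> Tuple[List[float], float]:
    """Score a single-nucleotide variant as (alt prediction - ref prediction).

    Returns the per-bin differences and their sum across the sequence.
    """
    if not 0 <= pos < len(ref_seq):
        raise ValueError("pos must fall inside ref_seq")
    if len(alt) != 1 or alt not in "ACGT":
        raise ValueError("alt must be a single nucleotide")

    # Build the alternate sequence by substituting one base.
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]

    ref_pred = predict(ref_seq)
    alt_pred = predict(alt_seq)
    diffs = [a - r for a, r in zip(alt_pred, ref_pred)]
    return diffs, sum(diffs)


if __name__ == "__main__":
    per_bin, total = score_variant(toy_predict, "ACGTACGTAAAA", pos=9, alt="G")
    print(per_bin, total)
```

The per-bin differences show where along the sequence the predicted change lands, and the sum gives a single score for the variant.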

Basset successor

This codebase offers numerous improvements and generalizations to its predecessor Basset, and I'll be using it for all of my ongoing work. Here are the salient changes.

  1. Basenji makes predictions in bins across the sequences you provide. You could replicate Basset's peak classification by simply providing smaller sequences and binning the target for the entire sequence.
  2. Basenji intends to predict quantitative signal using regression loss functions, rather than binary signal using classification loss functions.
  3. Basenji is built on TensorFlow, which offers myriad benefits, including distributed computing and a large and adaptive developer community.
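
Items 1 and 2 can be illustrated with a small, self-contained sketch. It is not taken from the codebase: the helper names `bin_signal` and `poisson_nll` are hypothetical, summing within each bin is an assumed pooling choice, and the Poisson negative log-likelihood is shown as one common regression loss for count-like signal rather than as the loss Basenji necessarily uses.

```python
import math
from typing import List, Sequence


def bin_signal(signal: Sequence[float], bin_size: int) -> List[float]:
    """Pool a per-base signal into fixed-width bins by summing (assumed pooling)."""
    if bin_size <= 0 or len(signal) % bin_size != 0:
        raise ValueError("signal length must be a positive multiple of bin_size")
    return [sum(signal[i:i + bin_size]) for i in range(0, len(signal), bin_size)]


def poisson_nll(
    targets: Sequence[float],
    predictions: Sequence[float],
    eps: float = 1e-8,
) -> float:
    """Mean Poisson negative log-likelihood, omitting the log(target!) constant."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    return sum(
        p - t * math.log(p + eps) for t, p in zip(targets, predictions)
    ) / len(targets)


if __name__ == "__main__":
    coverage = [0, 1, 2, 3, 4, 5, 6, 7]

    # Several bins across the sequence, in the style of item 1.
    print(bin_signal(coverage, bin_size=4))

    # One bin spanning the whole sequence, which mimics a single
    # per-sequence target in the spirit of Basset's peak setup.
    print(bin_signal(coverage, bin_size=len(coverage)))

    targets = bin_signal(coverage, bin_size=4)
    print(poisson_nll(targets, predictions=[5.0, 20.0]))
```

Setting `bin_size` to the full sequence length is the case described in item 1: a shorter input with one target for the entire sequence.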

However, this codebase is general enough to implement the Basset model, too. I have instructions for how to do that here.


Akita

3D genome folding predictions with deep convolutional neural networks.

Akita provides researchers with tools to:

  1. Train deep convolutional neural networks to predict 2D contact maps along very long chromosome-scale DNA sequences.
  2. Score variants according to their predicted influence on contact maps across the sequence and/or for specific genes.
  3. Annotate the specific nucleotides that drive genome folding.

Saluki

mRNA half-life predictions with a hybrid convolutional and recurrent deep neural network.

Saluki provides researchers with tools to:

  1. Train deep convolutional and recurrent neural networks to predict mRNA half-life from an mRNA sequence annotated with the first frame of each codon and splice site positions.
  2. Score variants according to their predicted influence on mRNA half-life, on full-length mRNAs or for a set of pre-defined variants.
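
The annotated input described in item 1 might be encoded along the following lines. This is a sketch under stated assumptions, not Saluki's actual preprocessing: the function name `encode_mrna`, the six-column layout (four nucleotide columns, one codon-start column, one splice-site column), the 0-based coordinates, and the choice of which base marks a splice site are all illustrative.

```python
from typing import Iterable, List


def encode_mrna(
    seq: str,
    cds_start: int,
    cds_end: int,
    splice_sites: Iterable[int],
) -> List[List[int]]:
    """Encode an mRNA as one row per base with six columns.

    Columns 0-3: one-hot A, C, G, T/U.
    Column 4: 1 at the first position of each codon in [cds_start, cds_end).
    Column 5: 1 at each annotated splice site position.
    Coordinates are 0-based and cds_end is exclusive (assumed conventions).
    """
    seq = seq.upper().replace("U", "T")
    splice_set = set(splice_sites)
    encoded = []
    for i, base in enumerate(seq):
        row = [0] * 6
        idx = "ACGT".find(base)
        if idx >= 0:
            row[idx] = 1  # unknown bases such as N stay all-zero
        if cds_start <= i < cds_end and (i - cds_start) % 3 == 0:
            row[4] = 1  # first frame of a codon
        if i in splice_set:
            row[5] = 1  # splice site annotation
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    # 5' UTR "GG", coding sequence "AUG GCC UAA", 3' UTR "GG".
    rows = encode_mrna("GGAUGGCCUAAGG", cds_start=2, cds_end=11, splice_sites=[6])
    for base, row in zip("GGAUGGCCUAAGG", rows):
        print(base, row)
```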

A full reproduction of the results presented in the paper, involving variant prediction, motif discovery, and insertional motif analysis, can be found here.


Installation

Basenji/Akita were developed with Python 3 and a variety of scientific computing dependencies, which you can see and install via requirements.txt for pip and environment.yml for Anaconda. In each case, we kept TensorFlow separate to allow you to choose the install method that works best for you. The codebase is compatible with the latest TensorFlow 2, but should also work with 1.15.

Run the following to install dependencies and Basenji with Anaconda.

    conda env create -f environment.yml
    conda install tensorflow  # or: conda install tensorflow-gpu
    python setup.py develop --no-deps

Alternatively, if you want to guarantee working versions of each dependency, you can install via a fully pre-specified environment.

    conda env create -f prespecified.yml
    conda install tensorflow  # or: conda install tensorflow-gpu
    python setup.py develop --no-deps

Or run the following to install dependencies and Basenji with pip and setuptools.

    python setup.py develop
    pip install tensorflow  # or: pip install tensorflow-gpu

Then we recommend setting the following environment variables.

    export BASENJIDIR=~/code/Basenji
    export PATH=$BASENJIDIR/bin:$PATH
    export PYTHONPATH=$BASENJIDIR/bin:$PYTHONPATH

To verify the install, launch Python and run:

    import basenji
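
For a slightly fuller check than a bare import, something like the following sketch could report on the package and the recommended environment variables. The function name `check_install` is hypothetical and not part of Basenji. It uses only the Python standard library and does not import TensorFlow or Basenji itself.

```python
import importlib.util
import os


def check_install(package: str = "basenji", env_var: str = "BASENJIDIR") -> dict:
    """Report whether the package is importable and the environment is set up.

    Checks three things: that `package` can be found by the import system,
    that `env_var` is set, and that its bin/ subdirectory is on PATH.
    """
    base_dir = os.environ.get(env_var)

    bin_on_path = False
    if base_dir:
        bin_dir = os.path.normpath(
            os.path.join(os.path.expanduser(base_dir), "bin")
        )
        path_entries = [
            os.path.normpath(os.path.expanduser(entry))
            for entry in os.environ.get("PATH", "").split(os.pathsep)
            if entry
        ]
        bin_on_path = bin_dir in path_entries

    return {
        "package_importable": importlib.util.find_spec(package) is not None,
        "env_var_set": base_dir is not None,
        "bin_on_path": bin_on_path,
    }


if __name__ == "__main__":
    for check, passed in check_install().items():
        print(f"{check}: {'ok' if passed else 'missing'}")
```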

Manuscripts

Models and (links to) data studied in various manuscripts are available in the manuscripts directory.


Documentation

At this stage, Basenji is something in between personal research code and accessible software for wide use. The primary challenge is uncertainty about what the best role for this type of toolkit is going to be in functional genomics and statistical genetics. The computational requirements don't make it easy either. Thus, this package is under active development, and I encourage you to get in touch to share your experience and request clarifications or additional features, documentation, or tutorials.


Tutorials

These are a work in progress, so forgive incompleteness for the moment. If there's a task that you're interested in that I haven't included, feel free to post it as an Issue at the top.

More Repositories

  1. scBasset (Jupyter Notebook, 95 stars): Sequence-based Modeling of single-cell ATAC-seq using Convolutional Neural Networks.
  2. solo (Python, 85 stars): software to detect doublets
  3. borzoi (Python, 75 stars): RNA-seq prediction with deep convolutional neural networks.
  4. scnym (Python, 73 stars): Semi-supervised adversarial neural networks for classification of single cell transcriptomics data
  5. baskerville (Python, 31 stars): Machine learning methods for DNA sequence analysis.
  6. spatial_lda (Python, 26 stars): Probabilistic topic model for identifying cellular micro-environments.
  7. velodyn (Python, 23 stars): Dynamical systems methods for RNA velocity analysis
  8. scmmd (Python, 14 stars): Maximum mean discrepancy comparisons for single cell profiling experiments
  9. ukbb-mri-sseg (Python, 13 stars): UKBB MRI semantic segmentation for Abdominal Dixon and other modalities
  10. impulse (R, 10 stars): Fit phenomenological sigmoid and impulse curves with priors to improve interpretability
  11. msTrawler (R, 8 stars)
  12. romic (R, 5 stars): romic represents high-dimensional measurements by persistently tracking features, samples, and measurements as a dataset is modified and reformatted. It provides a suite of functions that build on this framework to support data analysis and visualization. These functions are combined into shiny apps tailored to exploratory data analysis.
  13. AAC_scoring (Python, 4 stars): Map DEXA images from UKBB dataset to abdominal aortic calcification scores
  14. 2019_murine_cell_aging (Python, 4 stars): Tools associated with the Calico Murine Aging Cell Atlas, Kimmel et al. 2019, Genome Research
  15. rDNAcn (Jupyter Notebook, 3 stars)
  16. rescan_line_sted (Python, 3 stars): A scientific publication, describing a way to improve microscopy. This repository hosts everything you need to reproduce our results. Read the publication here: https://calico.github.io/rescan_line_sted
  17. remote_refocus (Python, 3 stars): A scientific publication, describing a way to improve microscopy. This repository hosts everything you need to reproduce our results. Read the publication here: https://calico.github.io/remote_refocus/
  18. catnap (Python, 2 stars): CATNAP-related code
  19. claman (R, 2 stars): Calico Lipidomics and Metabolomics (CLaM) Analysis provides a suite of functions which support data manipulation, statistical analysis and visualizations of mzrollDB files created by MAVEN.
  20. Senescence_CD45 (Jupyter Notebook, 2 stars)
  21. borzoi-paper (Jupyter Notebook, 2 stars): Analyses related to the Borzoi paper.
  22. do_qtl (Python, 2 stars)
  23. clamr (R, 1 star): Calico Lipidomics and Metabolomics R Core Functions
  24. hendrickson_2017_supplement (Python, 1 star): Supplemental material for the Hendrickson et al. 2017 paper.
  25. testrepo1234 (1 star): testing 123
  26. DISH (Python, 1 star)
  27. hackett-doomics (R, 1 star): Plasma multiomics (metabolites + lipids + proteins) from the SHOCK cohort of Diverse Outcross (DO) mice. 110 mice were longitudinally profiled at 8, 14, and 20 months and molecule abundances were compared to their age and natural lifespan. This repository supports a TO-BE-RELEASED publication interrogating this dataset.