Deep learning infrastructure for genomics

Janggu - Deep learning for Genomics

Janggu is a python package that facilitates deep learning in the context of genomics. The package is freely available under a GPL-3.0 license.

In particular, the package allows for easy access to typical genomics data formats and out-of-the-box evaluation (for keras models specifically), so that you can concentrate on designing the neural network architecture and quickly test biological hypotheses. Comprehensive documentation is available here.

Hallmarks of Janggu:

  1. Janggu provides special Genomics datasets that allow you to access raw data in FASTA, BAM, BIGWIG, BED and GFF file format.
  2. Various normalization procedures are supported for the genomics datasets, including 'TPM', 'zscore' or custom normalizers.
  3. Biological features can be represented in terms of higher-order sequence features, e.g. di-nucleotide based features.
  4. The dataset objects can be consumed directly by neural networks, for example those implemented using keras, or by scikit-learn models (see src/examples in this repository).
  5. Numpy format output of a keras model can be converted to represent genomic coverage tracks, which allows exporting the predictions as BIGWIG files and visualization of genome browser-like plots.
  6. Genomic datasets can be stored in various ways, including as numpy array, sparse dataset or in hdf5 format.
  7. Caching of genomic datasets avoids time-consuming preprocessing steps and facilitates fast reloading.
  8. Janggu provides a wrapper for keras models with built-in logging functionality and automated result evaluation.
  9. Janggu supports input feature importance attribution using the integrated gradients method and variant effect prediction assessment.
  10. Janggu provides utilities such as a keras layer for scanning both DNA strands for motif occurrences.
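
To illustrate what a 'zscore' style normalization of a coverage track means, here is a minimal NumPy sketch. It is not Janggu's internal implementation: the function name `zscore`, the `eps` parameter and the example counts are assumptions made purely for illustration.

```python
import numpy as np


def zscore(coverage, eps=1e-8):
    """Standardize a coverage array to zero mean and (approximately) unit variance.

    Hypothetical helper for illustration only, not part of the Janggu API.
    """
    coverage = np.asarray(coverage, dtype=float)
    # eps guards against division by zero for a constant track
    return (coverage - coverage.mean()) / (coverage.std() + eps)


# Made-up read counts in six genomic bins
counts = np.array([0, 2, 4, 6, 8, 10])
normalized = zscore(counts)
print(normalized)
```

After normalization, bins with above-average coverage have positive values and bins with below-average coverage have negative values, which makes tracks with different sequencing depths easier to compare.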

Getting started

Janggu makes it easy to access data from genomic file formats and utilize it for machine learning purposes.

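from janggu.data import Bioseq, Cover

# Replace the placeholder paths with your own reference genome, ROI and label files.
dna = Bioseq.create_from_genome('dna', refgenome='<refgenome.fa>', roi='<roi.bed>')
labels = Cover.create_from_bed('labels', bedfiles='<labels.bed>', roi='<roi.bed>')

# kerasmodel is a compiled keras model defined elsewhere.
kerasmodel.fit(dna, labels)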

A range of examples can be found in './src/examples' of this repository, which includes jupyter notebooks that illustrate Janggu's functionality and how it can be used with popular deep learning frameworks, including keras, sklearn or pytorch.

Why the name Janggu?

Janggu is a Korean percussion instrument that looks like an hourglass.

Like the two ends of the instrument, the philosophy of the Janggu package is to help with the two ends of a deep learning application in genomics, namely data acquisition and evaluation.

Installation

A list of Python dependencies is defined in setup.py. Additionally, bedtools is required by pybedtools, which janggu depends on.

Janggu depends on tensorflow and keras. To install janggu with tensorflow version 1 or 2, use:

# to install with tensorflow==1.14 and keras==2.2
pip install janggu[tf] # or janggu[tf_gpu]

# to install with tensorflow==2.2 and keras==2.4.3
pip install janggu[tf2] # or janggu[tf2_gpu]

Depending on the pip version (e.g. 20.2.2), some package dependencies may not be resolved correctly, so that incompatible package versions are installed. If this is the case, try pip install ... --use-feature=2020-resolver or install the required package versions manually.

Alternatively, you can install tensorflow and keras in a conda environment using:

# tensorflow v1
conda install tensorflow==1.14 keras==2.2  # or tensorflow-gpu

# tensorflow v2
conda install tensorflow==2.2 keras==2.4.3  # or tensorflow-gpu

Further information regarding the installation of tensorflow can be found on the official tensorflow webpage.

To verify that the installation works, run the example contained in the janggu package as follows:

git clone https://github.com/BIMSBbioinfo/janggu
cd janggu
python ./src/examples/classify_fasta.py single

A model is then trained to predict the class labels of two sets of toy sequences by scanning the forward strand for sequence patterns, using an ordinary mono-nucleotide one-hot sequence encoding. The entire training process takes a few minutes on a CPU backend. At the end, some example prediction scores are shown for Oct4 and Mafk sequences. The accuracy should be around 85%, and the individual example prediction scores should tend to be higher for Oct4 than for Mafk.
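
For readers unfamiliar with mono-nucleotide one-hot encoding, the following plain NumPy sketch shows the general idea. It is an illustration under stated assumptions, not the code Janggu uses internally: the helper `one_hot`, the `ALPHABET` constant and the choice to encode unknown letters as all-zero rows are all hypothetical.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed column order for this illustration


def one_hot(sequence):
    """Return a (len(sequence), 4) matrix with a single 1 per known nucleotide.

    Letters outside ALPHABET (e.g. N) are left as all-zero rows.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(sequence), len(ALPHABET)), dtype=np.int8)
    for pos, base in enumerate(sequence.upper()):
        if base in index:
            encoded[pos, index[base]] = 1
    return encoded


encoded = one_hot("ACGTN")
print(encoded)
```

Each row corresponds to one sequence position and each column to one nucleotide, so a convolutional layer sliding over the rows can act as a motif scanner.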

You may also rerun the training while evaluating sequence features on both strands and using a higher-order sequence encoding, e.g. with the command-line arguments: dnaconv -order 2. Accuracies and prediction scores for the individual example sequences should improve compared to the previous example.
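
The two ideas in this step, higher-order encoding and using both strands, can be sketched in plain NumPy as follows. This is a hedged illustration rather than Janggu's implementation: the helpers `reverse_complement` and `higher_order_one_hot` and the k-mer column ordering are assumptions introduced here.

```python
from itertools import product

import numpy as np

ALPHABET = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement, i.e. the opposite strand read 5' to 3'."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def higher_order_one_hot(sequence, order=2):
    """One-hot encode overlapping k-mers of length `order`.

    With order=2 each position is one of 16 di-nucleotides, so the result
    has shape (len(sequence) - order + 1, 4 ** order).
    """
    kmers = ["".join(p) for p in product(ALPHABET, repeat=order)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    n_positions = max(len(sequence) - order + 1, 0)
    encoded = np.zeros((n_positions, len(kmers)), dtype=np.int8)
    for pos in range(n_positions):
        kmer = sequence[pos:pos + order].upper()
        if kmer in index:  # k-mers containing e.g. N stay all-zero
            encoded[pos, index[kmer]] = 1
    return encoded


seq = "ACGTT"
forward = higher_order_one_hot(seq, order=2)
reverse = higher_order_one_hot(reverse_complement(seq), order=2)
print(forward.shape, reverse.shape)
```

Encoding both `seq` and its reverse complement gives a model access to motifs regardless of which strand they occur on, and the di-nucleotide columns let a single filter weight capture dependencies between neighbouring bases.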

Citation

Kopp, W., Monti, R., Tamburrini, A., Ohler, U., Akalin, A. Deep learning for genomics using Janggu. Nat Commun 11, 3488 (2020). https://doi.org/10.1038/s41467-020-17155-y
