🧬 Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics

Welcome to the InstaDeep GitHub repository of the Nucleotide Transformer project.

We are thrilled to open-source this work and provide the community with access to the code and pre-trained weights for eight genomics language models. This project was a collaboration with Nvidia and TUM, and the models were trained on DGX A100 nodes on Cambridge-1.

Description 🧬

We present a comprehensive examination of foundational language models that were pre-trained on DNA sequences from whole genomes. Compared to other approaches, our models do not only integrate information from single reference genomes, but also leverage DNA sequences from over 3,200 diverse human genomes, as well as 850 genomes from a wide range of species, including model and non-model organisms. Through robust and extensive evaluation, we show that these large models provide extremely accurate molecular phenotype prediction compared to existing methods.

Performance on downstream tasks

Fig. 1: After fine-tuning, the Nucleotide Transformer model matches or outperforms the state of the art on 15 out of 18 downstream tasks. We show the performance results across downstream tasks for fine-tuned transformer models. Error bars represent 2 SDs derived from 10-fold cross-validation. The performance metrics for the state-of-the-art (SOTA) models are shown as horizontal dotted lines.

Overall, our work provides novel insights into the training and application of foundational language models to genomics, with ample opportunities for their application in the field.

In this repository, you will find the following:

  • Inference code for our models
  • Pre-trained weights for all eight models
  • Instructions for using the code and pre-trained models

Get started 🚀

To use the code and pre-trained models, simply:

  1. Clone the repository to your local machine.
  2. Install the package by running pip install . from the repository root.

You can then download any of our eight models and run inference with it in only a few lines of code:

import haiku as hk
import jax
import jax.numpy as jnp
from nucleotide_transformer.pretrained import get_pretrained_model

# Get pretrained model
parameters, forward_fn, tokenizer, config = get_pretrained_model(
    model_name="50M_multi_species_v2",
    embeddings_layers_to_save=(20,),
    max_positions=32,
)
forward_fn = hk.transform(forward_fn)

# Get data and tokenize it
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens_ids = [b[1] for b in tokenizer.batch_tokenize(sequences)]
tokens_str = [b[0] for b in tokenizer.batch_tokenize(sequences)]
tokens = jnp.asarray(tokens_ids, dtype=jnp.int32)

# Initialize random key
random_key = jax.random.PRNGKey(0)

# Infer
outs = forward_fn.apply(parameters, random_key, tokens)

# Get embeddings at layer 20
print(outs["embeddings_20"].shape)

Supported model names are:

  • 500M_human_ref
  • 500M_1000G
  • 2B5_1000G
  • 2B5_multi_species
  • 50M_multi_species_v2
  • 100M_multi_species_v2
  • 250M_multi_species_v2
  • 500M_multi_species_v2

You can also run our models and find more example code in Google Colab.

The code runs on both GPU and TPU thanks to JAX!

Embeddings retrieval

The transformer layers are 1-indexed, which means that calling get_pretrained_model with the arguments model_name="50M_multi_species_v2" and embeddings_layers_to_save=(1, 20,) extracts the embeddings after the first and 20th transformer layers. For transformers using the RoBERTa LM head, it is common practice to extract the final embeddings after the first layer norm of the LM head rather than after the last transformer block. Therefore, if get_pretrained_model is called with embeddings_layers_to_save=(24,), the embeddings are extracted not after the final transformer layer but after the first layer norm of the LM head.
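
As a rough illustration of the 1-indexed convention, the sketch below maps requested layer numbers to the keys under which the embeddings would be looked up in the output dictionary. The key pattern embeddings_<layer> is an assumption generalized from the outs["embeddings_20"] call in the inference example. The helper embedding_output_keys and its num_layers parameter are hypothetical and not part of the nucleotide_transformer package; num_layers=24 only mirrors the (24,) example in the text.

```python
from typing import List, Sequence


def embedding_output_keys(layers: Sequence[int], num_layers: int) -> List[str]:
    """Return the assumed output keys for the requested 1-indexed layers.

    Hypothetical helper: the "embeddings_<layer>" pattern is inferred from
    the outs["embeddings_20"] example, not taken from the library itself.
    """
    keys = []
    for layer in layers:
        # Layers are 1-indexed, so 0 is never a valid request.
        if not 1 <= layer <= num_layers:
            raise ValueError(
                f"layer {layer} is outside the valid range 1..{num_layers}"
            )
        keys.append(f"embeddings_{layer}")
    return keys


# Requesting the first and the 20th layer of an assumed 24-layer model.
print(embedding_output_keys((1, 20), num_layers=24))
```

Checking the requested layers before loading a model can catch off-by-one mistakes, such as asking for layer 0, early.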

Tokenization 🔤

The models are trained on sequences of up to 1000 tokens, including the <CLS> token that is automatically prepended to the sequence. The tokenizer works from left to right, grouping the letters "A", "C", "G" and "T" into 6-mers. The letter "N" is never grouped inside a k-mer. Whenever the tokenizer encounters an "N", or when the nucleotides left in the sequence do not fill a complete 6-mer, it tokenizes those leftover nucleotides individually. Examples are given below:

dna_sequence_1 = "ACGTGTACGTGCACGGACGACTAGTCAGCA" 
tokenized_dna_sequence_1 = [<CLS>,<ACGTGT>,<ACGTGC>,<ACGGAC>,<GACTAG>,<TCAGCA>]

dna_sequence_2 = "ACGTGTACNTGCACGGANCGACTAGTCTGA" 
tokenized_dna_sequence_2 = [<CLS>,<ACGTGT>,<A>,<C>,<N>,<TGCACG>,<G>,<A>,<N>,<CGACTA>,<GTCTGA>]
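
The grouping rule can be sketched in a few lines of plain Python. This is a minimal re-implementation written only to reproduce the two examples above, not the tokenizer shipped with the package. The function name tokenize_kmers is hypothetical, and tokens are returned as plain strings rather than in the <...> notation used above.

```python
from typing import List


def tokenize_kmers(sequence: str, k: int = 6) -> List[str]:
    """Greedy left-to-right k-mer tokenization (illustrative sketch).

    "N" is always emitted as its own token, and nucleotides that do not
    fill a complete k-mer before an "N" or the end of the sequence are
    emitted one by one.
    """
    tokens = ["<CLS>"]  # prepended automatically, as described above
    for i, segment in enumerate(sequence.split("N")):
        if i > 0:
            tokens.append("N")  # each split point corresponds to one "N"
        full = len(segment) - len(segment) % k
        # Complete k-mers first ...
        tokens.extend(segment[j:j + k] for j in range(0, full, k))
        # ... then the leftover nucleotides as single-letter tokens.
        tokens.extend(segment[full:])
    return tokens


print(tokenize_kmers("ACGTGTACGTGCACGGACGACTAGTCAGCA"))
# ['<CLS>', 'ACGTGT', 'ACGTGC', 'ACGGAC', 'GACTAG', 'TCAGCA']
print(tokenize_kmers("ACGTGTACNTGCACGGANCGACTAGTCTGA"))
# ['<CLS>', 'ACGTGT', 'A', 'C', 'N', 'TGCACG', 'G', 'A', 'N', 'CGACTA', 'GTCTGA']
```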

All the transformers can therefore take sequences of up to 5994 nucleotides (999 6-mer tokens plus the <CLS> token) if there are no "N" inside.
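
To check whether a given sequence fits in the 1000-token budget, the token count can be computed without tokenizing, under the grouping rule described above. The sketch below is an illustration only; count_tokens and fits_in_context are hypothetical names, not functions of the package.

```python
MAX_TOKENS = 1000  # limit stated above, <CLS> token included
K = 6              # k-mer size used by the tokenizer


def count_tokens(sequence: str, k: int = K) -> int:
    """Count tokens under the rule above: <CLS>, k-mers, single letters, N."""
    segments = sequence.split("N")
    n_tokens = 1                   # the <CLS> token
    n_tokens += len(segments) - 1  # one token per "N"
    for segment in segments:
        # Complete k-mers plus one token per leftover nucleotide.
        n_tokens += len(segment) // k + len(segment) % k
    return n_tokens


def fits_in_context(sequence: str, max_tokens: int = MAX_TOKENS) -> bool:
    """Return True if the tokenized sequence stays within the budget."""
    return count_tokens(sequence) <= max_tokens


print(count_tokens("A" * 5994))     # 1000: 999 six-mers plus <CLS>
print(fits_in_context("A" * 5995))  # False: the extra nucleotide adds a token
```

A single "N" can cost several tokens, because the nucleotides before it that do not fill a 6-mer are counted one by one.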

Acknowledgments 🙏

We thank Maša Roller, as well as members of the Rostlab, particularly Tobias Olenyi, Ivan Koludarov, and Burkhard Rost, for constructive discussions that helped identify interesting research directions. Furthermore, we extend our gratitude to all those who deposit experimental data in public databases, to those who maintain these databases, and to those who make analytical and predictive methods freely available. We also thank the JAX development team.

Citing the Nucleotide Transformer 📚

If you find this repository useful in your work, please add the following citation to our associated paper:

@article{dalla2023nucleotide,
  title={The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics},
  author={Dalla-Torre, Hugo and Gonzalez, Liam and Mendoza Revilla, Javier and Lopez Carranza, Nicolas and Henryk Grywaczewski, Adam and Oteri, Francesco and Dallago, Christian and Trop, Evan and Sirelkhatim, Hassan and Richard, Guillaume and others},
  journal={bioRxiv},
  pages={2023--01},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}

If you have any questions or feedback on the code and models, please feel free to reach out to us.

Thank you for your interest in our work!
