Evolutionary Scale Modeling (esm): Pretrained language models for proteins

Update April 2023: Code for the two simultaneous preprints on protein design is now released! Code for "Language models generalize beyond natural proteins" is under examples/lm-design/. Code for "A high-level programming language for generative protein design" is under examples/protein-programming-language/.

This repository contains code and pre-trained weights for Transformer protein language models from the Meta Fundamental AI Research Protein Team (FAIR), including our state-of-the-art ESM-2 and ESMFold, as well as MSA Transformer, ESM-1v for predicting variant effects and ESM-IF1 for inverse folding. Transformer protein language models were introduced in the 2019 preprint of the paper "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences". ESM-2 outperforms all tested single-sequence protein language models across a range of structure prediction tasks. ESMFold harnesses the ESM-2 language model to generate accurate structure predictions end to end directly from the sequence of a protein.

In November 2022, we released v0 of the ESM Metagenomic Atlas, an open atlas of 617 million predicted metagenomic protein structures. The Atlas was updated in March 2023 in collaboration with EBI. The new v2023_02 adds another 150 million predicted structures to the Atlas, as well as pre-computed ESM2 embeddings. Bulk download instructions, the blog post, and the resources provided on the Atlas website are documented in this README.

In December 2022, we released two simultaneous preprints on protein design.

  • "Language models generalize beyond natural proteins" (PAPER, CODE) uses ESM2 to design de novo proteins. The code and data associated with the preprint can be found here.
  • "A high-level programming language for generative protein design" (PAPER, CODE) uses ESMFold to design proteins according to a high-level programming language.

Citation

For ESM2, ESMFold and ESM Atlas:

```bibtex
@article{lin2023evolutionary,
  title = {Evolutionary-scale prediction of atomic-level protein structure with a language model},
  author = {Zeming Lin and Halil Akin and Roshan Rao and Brian Hie and Zhongkai Zhu and Wenting Lu and Nikita Smetanin and Robert Verkuil and Ori Kabeli and Yaniv Shmueli and Allan dos Santos Costa and Maryam Fazel-Zarandi and Tom Sercu and Salvatore Candido and Alexander Rives},
  journal = {Science},
  volume = {379},
  number = {6637},
  pages = {1123-1130},
  year = {2023},
  doi = {10.1126/science.ade2574},
  URL = {https://www.science.org/doi/abs/10.1126/science.ade2574},
  note = {Earlier versions as preprint: bioRxiv 2022.07.20.500902},
}
```

For transformer protein language models:

@article{rives2021biological,
  title={Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences},
  author={Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C Lawrence and Ma, Jerry and others},
  journal={Proceedings of the National Academy of Sciences},
  volume={118},
  number={15},
  pages={e2016239118},
  year={2021},
  publisher={National Acad Sciences},
  note={bioRxiv 10.1101/622803},
  doi={10.1073/pnas.2016239118},
  url={https://www.pnas.org/doi/full/10.1073/pnas.2016239118},
}

Main models you should use

| Shorthand | esm.pretrained. | Dataset | Description |
|---|---|---|---|
| ESM-2 | esm2_t36_3B_UR50D() esm2_t48_15B_UR50D() | UR50 (sample UR90) | SOTA general-purpose protein language model. Can be used to predict structure, function and other protein properties directly from individual sequences. Released with Lin et al. 2022 (Aug 2022 update). |
| ESMFold | esmfold_v1() | PDB + UR50 | End-to-end single sequence 3D structure predictor (Nov 2022 update). |
| ESM-MSA-1b | esm_msa1b_t12_100M_UR50S() | UR50 + MSA | MSA Transformer language model. Can be used to extract embeddings from an MSA. Enables SOTA inference of structure. Released with Rao et al. 2021 (ICML'21 version, June 2021). |
| ESM-1v | esm1v_t33_650M_UR90S_1() ... esm1v_t33_650M_UR90S_5() | UR90 | Language model specialized for prediction of variant effects. Enables SOTA zero-shot prediction of the functional effects of sequence variations. Same architecture as ESM-1b, but trained on UniRef90. Released with Meier et al. 2021. |
| ESM-IF1 | esm_if1_gvp4_t16_142M_UR50() | CATH + UR50 | Inverse folding model. Can be used to design sequences for given structures, or to predict functional effects of sequence variation for given structures. Enables SOTA fixed backbone sequence design. Released with Hsu et al. 2022. |

For a complete list of available models, with details and release notes, see Pre-trained Models.

Usage

Quick start

An easy way to get started is to load ESM or ESMFold through the HuggingFace transformers library, which has simplified the ESMFold dependencies and provides a standardized API and tools to work with state-of-the-art pretrained models.

Alternatively, ColabFold has integrated ESMFold so that you can easily run it directly in the browser on a Google Colab instance.

We also provide an API which you can access through curl or on the ESM Metagenomic Atlas web page.

curl -X POST --data "KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL" https://api.esmatlas.com/foldSequence/v1/pdb/
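
The same request can be issued from Python. The following is a minimal sketch using only the standard library. It assumes, based on the curl command above, that the endpoint accepts the raw sequence as the POST body and answers with PDB text; `build_fold_request` and `fold_sequence` are hypothetical helper names, not part of the esm package.

```python
import urllib.request

# Endpoint taken from the curl example above.
ESM_ATLAS_FOLD_URL = "https://api.esmatlas.com/foldSequence/v1/pdb/"


def build_fold_request(sequence: str) -> urllib.request.Request:
    """Build a POST request whose body is the raw amino-acid sequence."""
    cleaned = "".join(sequence.split()).upper()  # drop whitespace and line breaks
    if not (cleaned.isascii() and cleaned.isalpha()):
        raise ValueError("sequence should contain only amino-acid letters")
    return urllib.request.Request(
        ESM_ATLAS_FOLD_URL,
        data=cleaned.encode("ascii"),
        method="POST",
    )


def fold_sequence(sequence: str, timeout: float = 120.0) -> str:
    """Send the request and return the response body (assumed to be PDB text)."""
    request = build_fold_request(sequence)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fold_sequence(...)` with a sequence string performs the network request; the returned text can then be written to a `.pdb` file.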

For ESM-MSA-1b, ESM-IF1, or any of the other models you can use the original implementation from our repo directly via the instructions below.

Getting started with this repo

As a prerequisite, you must have PyTorch installed to use this repository.

You can use this one-liner for installation, using the latest release of esm:

pip install fair-esm  # latest release, OR:
pip install git+https://github.com/facebookresearch/esm.git  # bleeding edge, current repo main branch

To use the ESMFold model, make sure you start from an environment with python <= 3.9 and pytorch installed. Then add the [esmfold] option to your pip install, which will install the dependencies for OpenFold automatically. OpenFold installation requires nvcc.

pip install "fair-esm[esmfold]"
# OpenFold and its remaining dependency
pip install 'dllogger @ git+https://github.com/NVIDIA/dllogger.git'
pip install 'openfold @ git+https://github.com/aqlaboratory/openfold.git@4b41059694619831a7db195b7e0988fc4ff3a307'

NOTE: If the OpenFold installation fails, please double check that nvcc is available and that a CUDA-compatible version of PyTorch has been installed.

Alternatively, we provide the esmfold conda environment, which can be built via conda env create -f environment.yml.

We also support PyTorch Hub, which removes the need to clone and/or install this repository yourself:

import torch
model, alphabet = torch.hub.load("facebookresearch/esm:main", "esm2_t33_650M_UR50D")

After pip install, you can load and use a pretrained model as follows:

import torch
import esm

# Load ESM-2 model
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()  # disables dropout for deterministic results

# Prepare data (first 2 sequences from ESMStructuralSplitDataset superfamily / 4)
data = [
    ("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
    ("protein2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
    ("protein2 with mask","KALTARQQEVFDLIRD<mask>ISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
    ("protein3",  "K A <mask> I S Q"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
batch_lens = (batch_tokens != alphabet.padding_idx).sum(1)

# Extract per-residue representations (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33], return_contacts=True)
token_representations = results["representations"][33]

# Generate per-sequence representations via averaging
# NOTE: token 0 is always a beginning-of-sequence token, so the first residue is token 1.
sequence_representations = []
for i, tokens_len in enumerate(batch_lens):
    sequence_representations.append(token_representations[i, 1 : tokens_len - 1].mean(0))

# Look at the unsupervised self-attention map contact predictions
import matplotlib.pyplot as plt
for (_, seq), tokens_len, attention_contacts in zip(data, batch_lens, results["contacts"]):
    plt.matshow(attention_contacts[: tokens_len, : tokens_len])
    plt.title(seq)
    plt.show()

ESMFold Structure Prediction

After installing with the [esmfold] option, you can use the ESMFold structure prediction model as follows:

import torch
import esm

model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

# Optionally, uncomment to set a chunk size for axial attention. This can help reduce memory.
# Lower sizes will have lower memory requirements at the cost of speed.
# model.set_chunk_size(128)

sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
# Multimer prediction can be done with chains separated by ':'

with torch.no_grad():
    output = model.infer_pdb(sequence)

with open("result.pdb", "w") as f:
    f.write(output)

import biotite.structure.io as bsio
struct = bsio.load_structure("result.pdb", extra_fields=["b_factor"])
print(struct.b_factor.mean())  # this will be the pLDDT
# 88.3
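
If biotite is not available, the same average can be read straight from the PDB text. This is a minimal sketch that assumes standard fixed-width ATOM records, where the B-factor (used here to carry pLDDT) occupies columns 61-66; `mean_plddt` is a hypothetical helper name.

```python
def mean_plddt(pdb_text: str) -> float:
    """Average the B-factor column over all ATOM records of a PDB string."""
    values = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # PDB is a fixed-width format: the temperature factor sits in columns 61-66.
            values.append(float(line[60:66]))
    if not values:
        raise ValueError("no ATOM records found")
    return sum(values) / len(values)
```

For example, `mean_plddt(output)` can be applied to the string returned by `model.infer_pdb(sequence)`. Like the biotite version, this averages over atoms rather than residues.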

We recommend esm.pretrained.esmfold_v1(), which is the best-performing model. We also provide esm.pretrained.esmfold_v0(), which was used for the experiments in Lin et al. 2022.

We also provide a command line interface (esm-fold) that efficiently predicts structures in bulk from a FASTA file using ESMFold:

usage: esm-fold [-h] -i FASTA -o PDB [--num-recycles NUM_RECYCLES]
                [--max-tokens-per-batch MAX_TOKENS_PER_BATCH]
                [--chunk-size CHUNK_SIZE] [--cpu-only] [--cpu-offload]

optional arguments:
  -h, --help            show this help message and exit
  -i FASTA, --fasta FASTA
                        Path to input FASTA file
  -o PDB, --pdb PDB     Path to output PDB directory
  --num-recycles NUM_RECYCLES
                        Number of recycles to run. Defaults to number used in
                        training (4).
  --max-tokens-per-batch MAX_TOKENS_PER_BATCH
                        Maximum number of tokens per gpu forward-pass. This
                        will group shorter sequences together for batched
                        prediction. Lowering this can help with out of memory
                        issues, if these occur on short sequences.
  --chunk-size CHUNK_SIZE
                        Chunks axial attention computation to reduce memory
                        usage from O(L^2) to O(L). Equivalent to running a for
                        loop over chunks of of each dimension. Lower values
                        will result in lower memory usage at the cost of
                        speed. Recommended values: 128, 64, 32. Default: None.
  --cpu-only            CPU only
  --cpu-offload         Enable CPU offloading

The command will make one prediction for every sequence in the FASTA file. Multimers can be predicted and should be entered in the FASTA file as a single sequence, with chains separated by a ":" character.
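
As an illustration of that input convention, the sketch below writes a FASTA file in which each complex is one record with its chains joined by ":". The helper names `multimer_record` and `write_fasta` are hypothetical and not part of the esm package.

```python
from typing import Dict, List


def multimer_record(name: str, chains: List[str]) -> str:
    """Format one FASTA record with all chains of a complex joined by ':'."""
    if not chains or any(not chain for chain in chains):
        raise ValueError("every chain needs a non-empty sequence")
    return f">{name}\n{':'.join(chains)}\n"


def write_fasta(path: str, complexes: Dict[str, List[str]]) -> None:
    """Write one record per complex; a monomer is a list with a single chain."""
    with open(path, "w") as handle:
        for name, chains in complexes.items():
            handle.write(multimer_record(name, chains))
```

The resulting file can then be passed to esm-fold with `-i`.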

By default, predictions will be batched together so that shorter sequences are predicted simultaneously. This can be disabled by setting --max-tokens-per-batch=0. Batching can significantly improve prediction speed on shorter sequences.
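
The idea behind a token budget can be sketched as follows. This is not the esm-fold implementation. It is a simplified greedy grouping that assumes the cost of a batch is the sum of its sequence lengths, and `batch_by_token_budget` is a hypothetical name.

```python
from typing import List, Tuple


def batch_by_token_budget(
    records: List[Tuple[str, str]], max_tokens: int
) -> List[List[Tuple[str, str]]]:
    """Greedily group (name, sequence) pairs, shortest first, so that each
    batch holds at most max_tokens residues. A sequence longer than the
    budget gets a batch of its own; with max_tokens=0 every non-empty
    sequence ends up alone."""
    batches, current, used = [], [], 0
    for name, seq in sorted(records, key=lambda record: len(record[1])):
        if current and used + len(seq) > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append((name, seq))
        used += len(seq)
    if current:
        batches.append(current)
    return batches
```

Sorting by length first keeps short sequences together, which is where batching is described as helping the most.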

The --cpu-offload flag can be useful for making predictions on longer sequences. It will attempt to offload some parameters to the CPU RAM, rather than storing on GPU.

Finally, the models from the ablation experiments with LMs of varying sizes (Lin et al. 2022, Table S1) are released as esm.pretrained.esmfold_structure_module_only_*(). We don't recommend using these models for structure prediction.

Compute embeddings in bulk from FASTA

We provide a command line interface (esm-extract) that efficiently extracts embeddings in bulk for a FASTA file from the ESM models:

usage: esm-extract [-h] [--toks_per_batch TOKS_PER_BATCH]
                   [--repr_layers REPR_LAYERS [REPR_LAYERS ...]] --include
                   {mean,per_tok,bos,contacts}
                   [{mean,per_tok,bos,contacts} ...]
                   [--truncation_seq_length TRUNCATION_SEQ_LENGTH]
                   model_location fasta_file output_dir

Extract per-token representations and model outputs for sequences in a FASTA
file

positional arguments:
  model_location        PyTorch model file OR name of pretrained model to
                        download (see README for models)
  fasta_file            FASTA file on which to extract representations
  output_dir            output directory for extracted representations

optional arguments:
  -h, --help            show this help message and exit
  --toks_per_batch TOKS_PER_BATCH
                        maximum batch size
  --repr_layers REPR_LAYERS [REPR_LAYERS ...]
                        layers indices from which to extract representations
                        (0 to num_layers, inclusive)
  --include {mean,per_tok,bos,contacts} [{mean,per_tok,bos,contacts} ...]
                        specify which representations to return
  --truncation_seq_length TRUNCATION_SEQ_LENGTH
                        truncate sequences longer than the given value

The following commands extract embeddings from layers 0, 32, and 33 (the final layer) of the ESM-2 model for a FASTA file. The installed esm-extract command and the scripts/extract.py script take the same arguments, so use either one:

esm-extract esm2_t33_650M_UR50D examples/data/some_proteins.fasta \
  examples/data/some_proteins_emb_esm2 --repr_layers 0 32 33 --include mean per_tok
python scripts/extract.py esm2_t33_650M_UR50D examples/data/some_proteins.fasta \
  examples/data/some_proteins_emb_esm2 --repr_layers 0 32 33 --include mean per_tok

A cuda device is optional and will be auto-detected.

Directory some_proteins_emb_esm2/ now contains one .pt file per FASTA sequence; use torch.load() to load them. scripts/extract.py has flags that determine what's included in the .pt file:

  • --repr_layers (default: final only) selects which layers to include embeddings from.
  • --include specifies what embeddings to save. You can use the following:
    • per_tok includes the full sequence, with an embedding per amino acid (seq_len x hidden_dim).
    • mean includes the embeddings averaged over the full sequence, per layer.
    • bos includes the embeddings from the beginning-of-sequence token. (NOTE: Don't use with the pre-trained models - we trained without bos-token supervision)
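
To collect the saved embeddings, something like the following sketch may help. It assumes each .pt file holds a dictionary with a "label" entry and a "mean_representations" entry that maps layer index to tensor, which is how the output of `--include mean` is understood here; check one of your own files before relying on it. `load_mean_embeddings` and its `load_fn` parameter are hypothetical.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Optional


def load_mean_embeddings(
    emb_dir: str, layer: int, load_fn: Optional[Callable[[str], Any]] = None
) -> Dict[str, Any]:
    """Map each sequence label to its mean embedding for one layer."""
    if load_fn is None:
        import torch  # imported lazily so the helper can be exercised without PyTorch

        load_fn = torch.load
    embeddings = {}
    for path in sorted(Path(emb_dir).glob("*.pt")):
        record = load_fn(str(path))
        # Assumed layout: {"label": str, "mean_representations": {layer: tensor}, ...}
        means = record.get("mean_representations", {})
        if layer not in means:
            raise KeyError(f"{path.name} has no mean embedding for layer {layer}")
        embeddings[record.get("label", path.stem)] = means[layer]
    return embeddings
```

For the extraction command above, `load_mean_embeddings("examples/data/some_proteins_emb_esm2", 33)` would return the final-layer vectors keyed by FASTA label.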

CPU offloading for inference with large models

If you want to load very large models like 15B and/or do inference on long sequences on your machine, regular GPU inference may lead to OOM errors. We show how to load the model with Fairscale's Fully Sharded Data Parallel (FSDP) and use its CPU offloading feature. This makes inference with large models possible on a single GPU. Please check out examples/esm2_infer_fairscale_fsdp_cpu_offloading.py for more details.

Zero-shot variant prediction

See "examples/variant-prediction/" for code and pre-trained weights for the ESM-1v models described in Language models enable zero-shot prediction of the effects of mutations on protein function. (Meier et al. 2021).

Note that ESM-2 could be used for variant prediction as well, and is expected to have similar performance to ESM-1v.

Inverse folding

See "examples/inverse_folding/" for detailed user guide. The ESM-IF1 model is described as GVPTransformer in Learning inverse folding from millions of predicted structures. (Hsu et al. 2022).

We also provide a colab notebook for the sequence design and sequence scoring functionalities.

The ESM-IF1 inverse folding model is built for predicting protein sequences from their backbone atom coordinates. We provide scripts here 1) to sample sequence designs for a given structure and 2) to score sequences for a given structure.

Trained with 12M protein structures predicted by AlphaFold2, the ESM-IF1 model consists of invariant geometric input processing layers followed by a sequence-to-sequence transformer, and achieves 51% native sequence recovery on structurally held-out backbones with 72% recovery for buried residues. The model is also trained with span masking to tolerate missing backbone coordinates and therefore can predict sequences for partially masked structures.

Sample sequence designs for a given structure

The environment setup is described in this subsection of examples/inverse_folding.

To sample sequences for a given structure in PDB or mmCIF format, use the sample_sequences.py script. The input file can have either .pdb or .cif as suffix.

For example, to sample 3 sequence designs for the golgi casein kinase structure (PDB 5YH2; PDB Molecule of the Month from January 2022), we can run the following command from the esm root directory:

python examples/inverse_folding/sample_sequences.py examples/inverse_folding/data/5YH2.pdb \
  --chain C --temperature 1 --num-samples 3 --outpath examples/inverse_folding/output/sampled_sequences.fasta

The sampled sequences will be saved in a fasta format to the specified output file.

The temperature parameter controls the sharpness of the probability distribution for sequence sampling. Higher sampling temperatures yield more diverse sequences but likely with lower native sequence recovery. The default sampling temperature is 1. To optimize for native sequence recovery, we recommend sampling with low temperature such as 1e-6.
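
The effect of temperature can be seen with a generic temperature-scaled softmax. This sketch does not reproduce the sampling code in sample_sequences.py; the logits are made-up values and `softmax_with_temperature` is a hypothetical helper.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(logits)
    # Subtracting the maximum keeps exp() from overflowing at low temperatures.
    weights = [math.exp((x - top) / temperature) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate amino acids
for t in (10.0, 1.0, 1e-6):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At a high temperature the probabilities move toward uniform, while at 1e-6 essentially all of the mass sits on the top-scoring residue, which is why low temperatures favor native sequence recovery over diversity.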

Scoring sequences

To score the conditional log-likelihoods for sequences conditioned on a given structure, use the score_log_likelihoods.py script.

For example, to score the sequences in examples/inverse_folding/data/5YH2_mutated_seqs.fasta according to the structure in examples/inverse_folding/data/5YH2.pdb, we can run the following command from the esm root directory:

python examples/inverse_folding/score_log_likelihoods.py examples/inverse_folding/data/5YH2.pdb \
  examples/inverse_folding/data/5YH2_mutated_seqs.fasta --chain C \
  --outpath examples/inverse_folding/output/5YH2_mutated_seqs_scores.csv

The conditional log-likelihoods are saved in CSV format at the specified output path. Each output value is the log-likelihood averaged over all amino acids in a sequence.
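
A short sketch for ranking the scored sequences is shown below. The column names `seqid` and `log_likelihood` are assumptions about the CSV header, so adjust them to match your file; `rank_scored_sequences` is a hypothetical helper. Perplexity is derived as the exponential of the negative average log-likelihood.

```python
import csv
import math
from typing import List, Tuple


def rank_scored_sequences(
    csv_path: str, id_column: str = "seqid", score_column: str = "log_likelihood"
) -> List[Tuple[str, float, float]]:
    """Return (id, average log-likelihood, perplexity) sorted from best to worst."""
    rows = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            avg_ll = float(row[score_column])
            # Perplexity is exp of the negative average log-likelihood per residue.
            rows.append((row[id_column], avg_ll, math.exp(-avg_ll)))
    return sorted(rows, key=lambda r: r[1], reverse=True)
```

Higher average log-likelihood (lower perplexity) means the model considers the sequence more compatible with the given backbone.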

For more information, see "./examples/inverse_folding/" for detailed user guide.

ESM Metagenomic Atlas

Please visit the ESM Metagenomic Atlas website, and see our blog post to learn more.

Bulk download instructions are available in a separate README here.

The Atlas resources include a page to fold a sequence using ESMFold, a search over a subset of the ESM Atlas by structure or sequence, and an API to access those resources programmatically.

Foldseek provides search against the Atlas without the length limitation here.

Notebooks

Inverse folding - predicting or scoring sequences based on backbone structures

The ESM-IF1 inverse folding model predicts protein sequences from their backbone atom coordinates, trained with 12M protein structures predicted by AlphaFold2. This notebook guides you through examples of sampling sequences, calculating conditional log-likelihoods, and extracting encoder output as structure representation.

Supervised variant prediction - training a classifier on the embeddings

To help you get started with using the embeddings, this jupyter notebook tutorial shows how to train a supervised variant predictor using embeddings from ESM-1. You can adopt a similar protocol to train a model for any downstream task, even with limited data. First you can obtain the embeddings for examples/data/P62593.fasta either by downloading the precomputed embeddings as instructed in the notebook or by running the following:

# Obtain the embeddings
python scripts/extract.py esm1v_t33_650M_UR90S_1 examples/data/P62593.fasta \
  examples/data/P62593_emb_esm1v --repr_layers 33 --include mean

Then, follow the remaining instructions in the tutorial. You can also run the tutorial in a colab notebook.

Alternatively, see the newer instructions for zero-shot variant prediction, which predicts mutational effects without any supervised training.

Unsupervised contact prediction

This jupyter notebook tutorial demonstrates contact prediction with both the ESM-2 and MSA Transformer (ESM-MSA-1) models. Contact prediction is based on a logistic regression over the model's attention maps. This methodology is based on our ICLR 2021 paper, Transformer protein language models are unsupervised structure learners. (Rao et al. 2020) The MSA Transformer (ESM-MSA-1) takes a multiple sequence alignment (MSA) as input, and uses the tied row self-attention maps in the same way. See MSA Transformer. (Rao et al. 2021).

To get unsupervised attention-based contacts, call model.predict_contacts(tokens) or model(tokens, return_contacts=True).

ESMStructuralSplitDataset and self-attention contact prediction

This jupyter notebook tutorial shows how to load and index the ESMStructuralSplitDataset, and how to compute unsupervised contact predictions from the self-attention maps using ESM-2.

Available Models and Datasets

Pre-trained Models

| Shorthand | esm.pretrained. | #layers | #params | Dataset | Embedding Dim | Model URL (automatically downloaded to ~/.cache/torch/hub/checkpoints) |
|---|---|---|---|---|---|---|
| ESM-2 | esm2_t48_15B_UR50D | 48 | 15B | UR50/D 2021_04 | 5120 | https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t48_15B_UR50D.pt |
| | esm2_t36_3B_UR50D | 36 | 3B | UR50/D 2021_04 | 2560 | https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t36_3B_UR50D.pt |
| | esm2_t33_650M_UR50D | 33 | 650M | UR50/D 2021_04 | 1280 | https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t33_650M_UR50D.pt |
| | esm2_t30_150M_UR50D | 30 | 150M | UR50/D 2021_04 | 640 | https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t30_150M_UR50D.pt |
| | esm2_t12_35M_UR50D | 12 | 35M | UR50/D 2021_04 | 480 | https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t12_35M_UR50D.pt |
| | esm2_t6_8M_UR50D | 6 | 8M | UR50/D 2021_04 | 320 | https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t6_8M_UR50D.pt |
| ESMFold | esmfold_v1 | 48 (+36) | 690M (+3B) | UR50/D 2021_04 | - | https://dl.fbaipublicfiles.com/fair-esm/models/esmfold_3B_v1.pt |
| | esmfold_v0 | 48 (+36) | 690M (+3B) | UR50/D 2021_04 | - | https://dl.fbaipublicfiles.com/fair-esm/models/esmfold_3B_v0.pt |
| | esmfold_structure_module_only_* | 0 (+various) | various | UR50/D 2021_04 | - | https://dl.fbaipublicfiles.com/fair-esm/models/esmfold_structure_module_only_* |
| ESM-IF1 | esm_if1_gvp4_t16_142M_UR50 | 20 | 124M | CATH 4.3 + predicted structures for UR50 | 512 | https://dl.fbaipublicfiles.com/fair-esm/models/esm_if1_gvp4_t16_142M_UR50.pt |
| ESM-1v | esm1v_t33_650M_UR90S_[1-5] | 33 | 650M | UR90/S 2020_03 | 1280 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1v_t33_650M_UR90S_1.pt |
| ESM-MSA-1b | esm_msa1b_t12_100M_UR50S | 12 | 100M | UR50/S + MSA 2018_03 | 768 | https://dl.fbaipublicfiles.com/fair-esm/models/esm_msa1b_t12_100M_UR50S.pt |
| ESM-MSA-1 | esm_msa1_t12_100M_UR50S | 12 | 100M | UR50/S + MSA 2018_03 | 768 | https://dl.fbaipublicfiles.com/fair-esm/models/esm_msa1_t12_100M_UR50S.pt |
| ESM-1b | esm1b_t33_650M_UR50S | 33 | 650M | UR50/S 2018_03 | 1280 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt |
| ESM-1 | esm1_t34_670M_UR50S | 34 | 670M | UR50/S 2018_03 | 1280 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50S.pt |
| | esm1_t34_670M_UR50D | 34 | 670M | UR50/D 2018_03 | 1280 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR50D.pt |
| | esm1_t34_670M_UR100 | 34 | 670M | UR100 2018_03 | 1280 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t34_670M_UR100.pt |
| | esm1_t12_85M_UR50S | 12 | 85M | UR50/S 2018_03 | 768 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t12_85M_UR50S.pt |
| | esm1_t6_43M_UR50S | 6 | 43M | UR50/S 2018_03 | 768 | https://dl.fbaipublicfiles.com/fair-esm/models/esm1_t6_43M_UR50S.pt |

Here is a chronological list of the released models and the paper they were introduced in:

| Shorthand | Release Notes |
|---|---|
| ESM-1 | Released with Rives et al. 2019 (Aug 2020 update). |
| ESM-1b | Released with Rives et al. 2019 (Dec 2020 update). See Appendix B. |
| ESM-MSA-1 | Released with Rao et al. 2021 (Preprint v1). |
| ESM-MSA-1b | Released with Rao et al. 2021 (ICML'21 version, June 2021). |
| ESM-1v | Released with Meier et al. 2021. |
| ESM-IF1 | Released with Hsu et al. 2022. |
| ESM-2 | Released with Lin et al. 2022. |

ESM Structural Split Dataset

This is a five-fold cross validation dataset of protein domain structures that can be used to measure generalization of representations across different levels of structural dissimilarity. The dataset implements structural holdouts at the family, superfamily, and fold level. The SCOPe database is used to classify domains. Independently for each level of structural hold-out, the domains are split into 5 equal sets, i.e. five sets of folds, superfamilies, or families. This ensures that for each of the five partitions, structures having the same classification do not appear in both the train and test sets. For a given classification level each structure appears in a test set once, so that in the cross validation experiment each of the structures will be evaluated exactly once.
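
The hold-out logic described above can be sketched as a group-wise split. This is not the code used to build the dataset. It is a minimal illustration that balances partitions by number of domains, and `grouped_folds` together with its label mapping are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def grouped_folds(labels: Dict[str, str], k: int = 5) -> List[Tuple[List[str], List[str]]]:
    """Split domains into k (train, test) pairs so that all domains sharing a
    classification label (e.g. a superfamily) land in the same test set."""
    groups = defaultdict(list)
    for domain, label in labels.items():
        groups[label].append(domain)
    # Deal whole groups to the currently smallest partition, largest groups first.
    partitions: List[List[str]] = [[] for _ in range(k)]
    for label in sorted(groups, key=lambda g: (-len(groups[g]), g)):
        min(partitions, key=len).extend(groups[label])
    folds = []
    for i, test in enumerate(partitions):
        train = [d for j, part in enumerate(partitions) if j != i for d in part]
        folds.append((sorted(train), sorted(test)))
    return folds
```

Because whole groups are assigned to a single partition, no classification label is shared between the train and test sets of a fold, and every domain is tested exactly once across the k folds.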

The dataset provides 3d coordinates, distance maps, and secondary structure labels. For further details on the construction of the dataset see Rives et al. 2019 Appendix A.10.

This jupyter notebook tutorial shows how to load and index the ESMStructuralSplitDataset.

ESMStructuralSplitDataset, upon initializing, will download splits and pkl. We also provide msas for each of the domains. The data can be directly downloaded below.

| Name | Description | URL |
|---|---|---|
| splits | train/valid splits | https://dl.fbaipublicfiles.com/fair-esm/structural-data/splits.tar.gz |
| pkl | pkl objects containing sequence, SSP labels, distance map, and 3d coordinates | https://dl.fbaipublicfiles.com/fair-esm/structural-data/pkl.tar.gz |
| msas | a3m files containing MSA for each domain | https://dl.fbaipublicfiles.com/fair-esm/structural-data/msas.tar.gz |

Pre-training Dataset Split

The split files establishing which UniRef50 clusters were used as held-out evaluation set for pre-training in Rives et al. 2019 and Rao et al. 2021 can be found here:

These files contain only the UniRef50 IDs and UniRef100 IDs corresponding to the UniRef database, 2018-03 release, which is released by the UniProt Consortium under a Creative Commons Attribution (CC BY 4.0) License.

Comparison to related works

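| Model | Unsupervised contact prediction: Large valid | Unsupervised contact prediction: CASP14 | Unsupervised contact prediction: CAMEO (Apr-Jun 2022) | Structure prediction: CASP14 | Structure prediction: CAMEO (Apr-Jun 2022) |
|---|---|---|---|---|---|
| Gremlin (Potts) | 39.3 | | | | |
| TAPE | 11.2 | | | | |
| ProtBert-BFD | 34.1 | | | | |
| Prot-T5-XL-BFD | 35.6 | | | 46.1 | 62.6 |
| Prot-T5-XL-Ur50 (3B) | 47.9 | | | 49.8 | 69.4 |
| ESM-1 | 33.7 | | | | |
| ESM-1b | 41.1 | 24.4 | 39 | 41.6 | 64.5 |
| ESM-1v | 35.3 | | | | |
| ESM-MSA-1b | 57.4 | | | | |
| ESM-2 (8M) | 15.9 | 9.8 | 15.7 | 36.7 | 48.1 |
| ESM-2 (35M) | 28.8 | 16.4 | 28.4 | 41.4 | 56.4 |
| ESM-2 (150M) | 42.2 | 26.8 | 40.1 | 49.0 | 64.9 |
| ESM-2 (700M) | 50.1 | 32.5 | 47.6 | 51.3 | 70.1 |
| ESM-2 (3B) | 52.7 | 34.0 | 49.9 | 52.5 | 71.8 |
| ESM-2 (15B) | 54.5 | 37.0 | 51.7 | 55.4 | 72.1 |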

Comparison to related protein language models on structure prediction tasks.

  • All contact numbers are the top-L,LR precision metric, where long range means sequence separation of at least 24 residues
  • For unsupervised contact prediction, a sparse linear combination of the attention heads is used to directly predict protein contacts, fitted with logistic regression on 20 structures. For more details on the method, see Rao et al. 2020.
  • For structure prediction, an AlphaFold2 structure module is trained directly from the frozen language model embeddings. For more details on the method, see Lin et al. 2022.
  • Direct coupling analysis methods (Gremlin, mfDCA, Psicov) and ESM-MSA-1 use the trRosetta MSAs, while other methods predict from single sequence.
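
For readers unfamiliar with the contact metric, here is a minimal sketch of top-L long-range precision on plain nested lists. It follows the definition given above (sequence separation of at least 24) and is not the evaluation code behind the table; `top_l_long_range_precision` is a hypothetical helper.

```python
from typing import Sequence


def top_l_long_range_precision(
    probs: Sequence[Sequence[float]],
    contacts: Sequence[Sequence[bool]],
    min_separation: int = 24,
) -> float:
    """Precision of the L highest-scoring residue pairs (i, j) with
    j - i >= min_separation, where L is the sequence length."""
    length = len(probs)
    pairs = [
        (probs[i][j], bool(contacts[i][j]))
        for i in range(length)
        for j in range(i + min_separation, length)
    ]
    if not pairs:
        raise ValueError("sequence too short for the requested separation")
    # Keep the L most confident predictions and count how many are true contacts.
    top = sorted(pairs, key=lambda p: p[0], reverse=True)[:length]
    return sum(hit for _, hit in top) / len(top)
```

Here `probs` would be a predicted contact map, such as one entry of `results["contacts"]` converted to lists, and `contacts` the ground-truth contact matrix for the same sequence.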

Citations

If you find the models useful in your research, we ask that you cite the relevant paper:

@article{rives2019biological,
  author={Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C. Lawrence and Ma, Jerry and Fergus, Rob},
  title={Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences},
  year={2019},
  doi={10.1101/622803},
  url={https://www.biorxiv.org/content/10.1101/622803v4},
  journal={PNAS}
}

For the self-attention contact prediction:

@article{rao2020transformer,
  author = {Rao, Roshan M and Meier, Joshua and Sercu, Tom and Ovchinnikov, Sergey and Rives, Alexander},
  title={Transformer protein language models are unsupervised structure learners},
  year={2020},
  doi={10.1101/2020.12.15.422761},
  url={https://www.biorxiv.org/content/10.1101/2020.12.15.422761v1},
  journal={bioRxiv}
}

For the MSA Transformer:

@article{rao2021msa,
  author = {Rao, Roshan and Liu, Jason and Verkuil, Robert and Meier, Joshua and Canny, John F. and Abbeel, Pieter and Sercu, Tom and Rives, Alexander},
  title={MSA Transformer},
  year={2021},
  doi={10.1101/2021.02.12.430858},
  url={https://www.biorxiv.org/content/10.1101/2021.02.12.430858v1},
  journal={bioRxiv}
}

For variant prediction using ESM-1v:

@article{meier2021language,
  author = {Meier, Joshua and Rao, Roshan and Verkuil, Robert and Liu, Jason and Sercu, Tom and Rives, Alexander},
  title = {Language models enable zero-shot prediction of the effects of mutations on protein function},
  year={2021},
  doi={10.1101/2021.07.09.450648},
  url={https://www.biorxiv.org/content/10.1101/2021.07.09.450648v1},
  journal={bioRxiv}
}

For inverse folding using ESM-IF1:

@article{hsu2022learning,
	author = {Hsu, Chloe and Verkuil, Robert and Liu, Jason and Lin, Zeming and Hie, Brian and Sercu, Tom and Lerer, Adam and Rives, Alexander},
	title = {Learning inverse folding from millions of predicted structures},
	year = {2022},
	doi = {10.1101/2022.04.10.487779},
	url = {https://www.biorxiv.org/content/early/2022/04/10/2022.04.10.487779},
	journal = {ICML}
}

For the ESM-2 language model and ESMFold:

@article{lin2022language,
  title={Language models of protein sequences at the scale of evolution enable accurate structure prediction},
  author={Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and dos Santos Costa, Allan and Fazel-Zarandi, Maryam and Sercu, Tom and Candido, Sal and others},
  journal={bioRxiv},
  year={2022},
  publisher={Cold Spring Harbor Laboratory}
}

Much of this code builds on the fairseq sequence modeling framework. We use fairseq internally for our protein language modeling research. We highly recommend trying it out if you'd like to pre-train protein language models from scratch.

Additionally, if you would like to use the variant prediction benchmark from Meier et al. (2021), we provide a bibtex file with citations for all data in ./examples/variant-prediction/mutation_data.bib. You can cite each paper individually, or add all citations in bulk using the LaTeX command:

\nocite{wrenbeck2017deep,klesmith2015comprehensive,haddox2018mapping,romero2015dissecting,firnberg2014comprehensive,deng2012deep,stiffler2015evolvability,jacquier2013capturing,findlay2018comprehensive,mclaughlin2012spatial,kitzman2015massively,doud2016accurate,pokusaeva2019experimental,mishra2016systematic,kelsic2016rna,melnikov2014comprehensive,brenan2016phenotypic,rockah2015systematic,wu2015functional,aakre2015evolving,qi2014quantitative,matreyek2018multiplex,bandaru2017deconstruction,roscoe2013analyses,roscoe2014systematic,mavor2016determination,chan2017correlation,melamed2013deep,starita2013activity,araya2012fundamental}

License

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

ESM Metagenomic Atlas (also referred to as "ESM Metagenomic Structure Atlas" or "ESM Atlas") data is available under a CC BY 4.0 license for academic and commercial use. Copyright (c) Meta Platforms, Inc. All Rights Reserved. Use of the ESM Metagenomic Atlas data is subject to the Meta Open Source Terms of Use and Privacy Policy.
