ProtTrans
ProtTrans provides state-of-the-art pre-trained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using various Transformer models.
2023/07/14: FineTuning with LoRA provides notebooks on how to fine-tune ProtT5 on both per-residue and per-protein tasks, using Low-Rank Adaptation (LoRA) for efficient fine-tuning (thanks @0syrys!).
2022/11/18: Availability: LambdaPP offers a simple web service for ProtT5-based predictions, and UniProt now offers pre-computed ProtT5 embeddings for download for a subset of selected organisms.
🚀 Quick Start
Example of how to derive embeddings from our best-performing protein language model, ProtT5-XL-U50 (aka ProtT5); also available as a Colab notebook:
from transformers import T5Tokenizer, T5EncoderModel
import torch
import re

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer (the tokenizer is not a torch module, so it is not moved to a device)
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)

# Load the model
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
model.float() if device.type == 'cpu' else model.half()

# prepare your protein sequences as a list
sequence_examples = ["PRTEINO", "SEQWENCE"]

# replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

# extract residue embeddings for the first ([0,:]) sequence in the batch and remove padded & special tokens ([0,:7])
emb_0 = embedding_repr.last_hidden_state[0, :7]  # shape (7 x 1024)
# same for the second ([1,:]) sequence but taking into account different sequence lengths ([1,:8])
emb_1 = embedding_repr.last_hidden_state[1, :8]  # shape (8 x 1024)

# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0)  # shape (1024)
We also have a script which simplifies deriving per-residue and per-protein embeddings from ProtT5 for a given FASTA file:
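To illustrate what such a pipeline looks like, here is a minimal sketch that parses a FASTA file and writes per-protein embeddings to an HDF5 file. This is not the repository's script: the file names, the plain FASTA parser, the one-sequence-at-a-time batching, and the h5py output layout are illustrative assumptions.

# Minimal sketch of embedding a FASTA file with ProtT5 (illustrative, not the repo's script).
import re
import h5py
import torch
from transformers import T5Tokenizer, T5EncoderModel

def read_fasta(path):
    """Parse a FASTA file into a dict of {identifier: sequence}."""
    seqs, header = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                header = line[1:].split()[0]
                seqs[header] = ""
            elif header:
                seqs[header] += line
    return seqs

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)
model = T5EncoderModel.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc').to(device).eval()

sequences = read_fasta("proteins.fasta")                  # hypothetical input file
with h5py.File("protein_embeddings.h5", "w") as h5_out:   # hypothetical output file
    for identifier, seq in sequences.items():
        prepped = " ".join(list(re.sub(r"[UZOB]", "X", seq)))
        ids = tokenizer.batch_encode_plus([prepped], add_special_tokens=True, padding="longest")
        input_ids = torch.tensor(ids['input_ids']).to(device)
        attention_mask = torch.tensor(ids['attention_mask']).to(device)
        with torch.no_grad():
            emb = model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        residue_emb = emb[0, :len(seq)]           # per-residue embedding: (L x 1024)
        protein_emb = residue_emb.mean(dim=0)     # per-protein embedding: (1024)
        h5_out.create_dataset(identifier, data=protein_emb.cpu().numpy())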
💥 Fine Tuning (FT):
Please check:
Fine Tuning Section. More information coming soon.
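Until then, here is a rough sketch of the LoRA approach mentioned in the news above, using the PEFT library to wrap the ProtT5 encoder so that only small low-rank adapter matrices are trained while the base weights stay frozen. The rank, scaling, dropout, and targeted attention projections below are illustrative defaults, not the notebook's exact settings.

# Hedged sketch: adding LoRA adapters to the ProtT5 encoder with the PEFT library.
import torch
from transformers import T5EncoderModel
from peft import LoraConfig, get_peft_model

base_model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc")

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices (illustrative)
    lora_alpha=16,              # scaling factor (illustrative)
    lora_dropout=0.05,
    bias="none",
    target_modules=["q", "v"],  # T5 attention query/value projections
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable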
🧠 Prediction:
Please check:
Prediction Section. A Colab example for secondary structure prediction via ProtT5-XL-U50, and a Colab example for subcellular localization prediction as well as differentiation between membrane-bound and water-soluble proteins via ProtT5-XL-U50.
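For orientation, below is a minimal sketch of the kind of downstream model used for per-residue tasks: a small convolutional head mapping frozen ProtT5 residue embeddings (1024-d) to 3-state secondary structure classes. The architecture and hyper-parameters are generic illustrations, not necessarily those of the linked Colab.

# Hedged sketch: a small CNN head for 3-state secondary structure prediction
# on top of frozen ProtT5 residue embeddings (illustrative architecture).
import torch
import torch.nn as nn

class SecStructHead(nn.Module):
    def __init__(self, embedding_dim=1024, num_classes=3):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(embedding_dim, 32, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Conv1d(32, num_classes, kernel_size=7, padding=3),
        )

    def forward(self, residue_embeddings):
        # residue_embeddings: (batch, seq_len, 1024) -> logits: (batch, seq_len, num_classes)
        x = residue_embeddings.permute(0, 2, 1)  # Conv1d expects (batch, channels, seq_len)
        return self.cnn(x).permute(0, 2, 1)

head = SecStructHead()
dummy = torch.randn(1, 7, 1024)   # e.g. the 7-residue embedding from the Quick Start example
print(head(dummy).shape)          # torch.Size([1, 7, 3])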
⚗️ Protein Sequence Generation:
Please check:
Generate Section. More information coming soon.
📊 Comparison to other protein language models (pLMs)
While developing the use cases, we compared ProtTrans models to other protein language models, for instance the ESM models. To focus on the effect of changing input representations, the following comparisons use the same architectures on top of different embedding inputs.
Important note on ProtT5-XL-UniRef50 (dubbed ProtT5-XL-U50): all performances were measured using only embeddings extracted from the encoder-side of the underlying T5 model as described here. Also, experiments were run in half-precision mode (model.half()) to speed up embedding generation. No performance degradation was observed in any of the experiments when running in half-precision.
❤️ Community and Contributions
The ProtTrans project is an open-source project supported by various partner companies and research institutions. We are committed to sharing all our pre-trained models and knowledge. We would be more than happy if you could help us by sharing new pre-trained models, fixing bugs, proposing new features, improving our documentation, spreading the word, or supporting our project.
📫 Have a question?
We are happy to answer your questions on the ProtTrans issues page! If you have a private question or want to cooperate with us, you can always reach out to us directly via our RostLab email.
🤝 Found a bug?
Feel free to file a new issue with a respective title and description on the ProtTrans repository. If you already found a solution to your problem, we would love to review your pull request!
✅ Requirements
For protein feature extraction or fine-tuning our pre-trained models, the PyTorch and Transformers libraries from Hugging Face are needed. For model visualization, you need to install the BertViz library.
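As an illustration, a minimal installation could look like the following (versions are not pinned here; sentencepiece is listed because the T5 tokenizer depends on it, and h5py only if you store embeddings in HDF5):

pip install torch transformers sentencepiece
pip install h5py       # only needed for storing embeddings
pip install bertviz    # only needed for model visualization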
If you use this code or our pretrained models for your publication, please cite the original paper:
@ARTICLE{9477085,
author={Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Yu, Wang and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and Bhowmik, Debsindhu and Rost, Burkhard},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing},
year={2021},
volume={},
number={},
pages={1-1},
doi={10.1109/TPAMI.2021.3095381}}