  • Stars: 571
  • Rank: 78,127 (Top 2%)
  • Language: Python
  • License: Apache License 2.0
  • Created: over 1 year ago
  • Updated: 10 months ago

Repository Details

Repository that contains LLM fine-tuning and deployment scripts along with our research findings.

LLM Finetuning Hub

LLM Finetuning Hub contains code and insights to finetune various large language models for your use case.

We stress-test both open-source and closed-source LLMs through our Evaluation Framework to check their applicability for real-life business use cases. Finetuning LLMs has never been easier.

Evaluation Framework • Getting Started • LLM Roadmap • Benchmarks • Contributing

Evaluation Framework

For a holistic evaluation, we make use of the Evaluation Framework, which consists of four pillars:

  • Performance
  • Time to Train
  • Cost to Train
  • Inferencing

For each of the above four pillars, we are sharing our codebase and insights to:

  • Help you leverage LLMs for your business needs and challenges
  • Decide which LLM suits your needs from a performance and cost perspective
  • Boost reproducibility efforts, which are becoming increasingly difficult with LLMs

We provide ready-to-use scripts for:

  • Finetuning LLMs on your proprietary dataset via PEFT methodologies such as LoRA and Prefix Tuning (a minimal LoRA sketch follows this list)
  • Performing hyperparameter optimization to get the maximum performance out of these models
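
As a point of reference, the LoRA setup behind these scripts looks roughly like the sketch below, built with the Hugging Face peft library. The checkpoint name and the r=8 / dropout 0.1 values mirror the example commands further down; lora_alpha is an assumed value, and this is an illustration rather than the repository's exact training code.

    # Minimal LoRA setup sketch (illustrative, not the repo's exact training code).
    import torch
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, TaskType, get_peft_model

    base_ckpt = "NousResearch/Llama-2-7b-hf"  # same checkpoint as the Llama2 example below
    model = AutoModelForCausalLM.from_pretrained(base_ckpt, torch_dtype=torch.float16)

    # Roughly matches --lora_r 8 --dropout 0.1 in the finetuning scripts.
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,
        lora_alpha=32,    # assumed value; the scripts may use a different alpha
        lora_dropout=0.1,
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only the small LoRA adapters are trainable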

Getting Started

You can start finetuning the LLM of your choice in four easy steps:

  1. Setup conda environment

    wget https://repo.anaconda.com/miniconda/Miniconda3-py38_4.11.0-Linux-x86_64.sh
    bash Miniconda3-py38_4.11.0-Linux-x86_64.sh
    source ~/.bashrc
    conda create --name llm_finetuning python=3.9
    conda activate llm_finetuning
  2. Install relevant packages

    git clone https://github.com/georgian-io/LLM-Finetuning-Hub.git
    cd LLM-Finetuning-Hub/
    pip install -r requirements.txt
  3. Finetune your LLM of choice

    For instance, to finetune Llama2-7B or Llama2-13B, do the following:

    cd llama2/ # navigate to Llama2 folder
    python llama2_classification.py --lora_r 8 --epochs 5 --dropout 0.1 --pretrained_ckpt NousResearch/Llama-2-7b-hf # finetune Llama2-7B on newsgroup classification dataset
    python llama2_classification_inference.py --experiment <experiment folder> # evaluate finetuned Llama2 7B version
    python llama2_summarization.py --lora_r 8 --epochs 1 --dropout 0.1 --pretrained_ckpt NousResearch/Llama-2-13b-hf # finetune Llama2-13B on samsum chat dataset
    python llama2_summarization_inference.py --experiment <experiment folder> # evaluate finetuned Llama2 13B version

    For instance, to finetune Falcon-7B, do the following:

    cd falcon/ # navigate to Falcon folder
    python falcon_classification.py --lora_r 8 --epochs 5 --dropout 0.1 # finetune Falcon-7B on newsgroup classification dataset
    python falcon_classification_inference.py --experiment <experiment folder> # evaluate finetuned Falcon
    python falcon_summarization.py --lora_r 8 --epochs 1 --dropout 0.1 # finetune Falcon-7B on samsum chat dataset
    python falcon_summarization_inference.py --experiment <experiment folder> # evaluate finetuned Falcon

    For instance, to finetune Flan-T5-Large, do the following:

    cd flan-t5/ # navigate to Flan-T5 folder
    python flan_classification.py --peft_method prefix --prefix_tokens 20 --epochs 5 # finetune Flan-T5 on newsgroup dataset
    python flan_classification_inference.py --experiment <experiment folder> # evaluate finetuned Flan-T5
    python flan_summarization.py --peft_method lora --lora_r 8 --epochs 1 # finetune Flan-T5 on samsum chat dataset
    python flan_summarization_inference.py --experiment <experiment folder> # evaluate finetuned Flan-T5
  4. Zero-shot and Few-shot your LLM of choice

    For instance, to use Falcon-7B on the newsgroup classification task, run the following (a prompt-construction sketch appears at the end of this step):

    python falcon_baseline_inference.py --task_type classification --prompt_type zero-shot
    python falcon_baseline_inference.py --task_type classification --prompt_type few-shot

    To use Falcon-7B on the samsum summarization task, run the following:

    python falcon_baseline_inference.py --task_type summarization --prompt_type zero-shot
    python falcon_baseline_inference.py --task_type summarization --prompt_type few-shot
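
    Zero-shot prompting asks the model to classify with instructions only, while few-shot prompting prepends a handful of labelled examples. The baseline scripts build prompts along these lines; the template below is purely illustrative, not the repository's exact prompt.

    # Illustrative zero-shot vs. few-shot prompts for newsgroup classification.
    # The real templates live in the *_baseline_inference.py scripts and may differ.
    labels = ["sci.space", "rec.autos", "talk.politics.misc"]  # subset of the 20 newsgroups
    document = "NASA announced a new launch window for the crewed mission next spring."

    zero_shot_prompt = (
        f"Classify the document into one of these newsgroups: {', '.join(labels)}.\n"
        f"Document: {document}\nNewsgroup:"
    )

    few_shot_examples = [
        ("The new sedan gets 40 mpg on the highway.", "rec.autos"),
        ("The senate vote on the budget was postponed again.", "talk.politics.misc"),
    ]
    few_shot_prompt = "\n\n".join(
        f"Document: {text}\nNewsgroup: {label}" for text, label in few_shot_examples
    ) + f"\n\nDocument: {document}\nNewsgroup:"

    print(zero_shot_prompt)
    print(few_shot_prompt)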

NOTE: All of our experiments were conducted on an AWS EC2 g5.2xlarge instance. It has a single 24 GB NVIDIA GPU, which is sufficient to finetune the LLMs in this repository.
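
A big reason a 24 GB card suffices is that the benchmarked finetuning runs load the base model in 4-bit (the "QLoRA" setting reported in the benchmark tables) before attaching LoRA adapters. A minimal sketch of that loading step is below; the exact quantization settings used by the scripts may differ.

    # Sketch of QLoRA-style 4-bit loading that keeps a 7B model within a 24 GB GPU.
    # The quantization settings are common defaults, not necessarily the repo's exact config.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import prepare_model_for_kbit_training

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "NousResearch/Llama-2-7b-hf",
        quantization_config=bnb_config,
        device_map="auto",
    )
    # Prepares the quantized model for training (gradient checkpointing, fp32 norms, etc.).
    model = prepare_model_for_kbit_training(model)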

LLM Roadmap

Our plan is to perform these experiments on all the LLMs below. To that end, here is a tentative roadmap of the LLMs we aim to cover, along with their corresponding README and codebase links:

LLM Benchmarked? Open-Source? README Codebase
Flan-T5 ✅ ✅ Link Folder
Falcon ✅ ✅ Link Folder
RedPajama ✅ ✅ Link Folder
Llama-2 ✅ ✅ Link Folder
Mistral ✅ ✅ Link Folder
Zephyr ✅ ✅ Link Folder
OpenLlama ✅
SalesForce XGen ✅
Mosaic MPT ✅ ✅ Link Folder
Cerebras ✅
Writer Palmyra ✅ ❌ Link Folder
AI21 Jurassic-2 ✅ ❌ Link Folder
OpenAI GPT-3.5 ✅ ❌ Link Folder
Cohere Command ❌
Google PaLM ❌
Inflection Pi ❌

Benchmarks

We benchmark LLMs on classification and summarization tasks, reporting metrics for finetuned models alongside zero-shot and few-shot prompting. Additionally, we compare cost estimates and load-testing results for inference.

Classification: Zero-shot prompting VS Few-shot prompting VS Fine-Tuning

We use the Newsgroup dataset, a 20-way classification problem in which each document must be assigned to one of 20 possible newsgroups. To check how quickly LLMs can learn from a small number of samples, we also compare them against BERT and DistilBERT. The following tables capture how the models perform as we increase the number of training samples (a data-subsampling sketch appears after the tables).

Model Open-Source? Zero-shot Accuracy (in %) Few-shot Accuracy (in %) Fine-Tuning + QLoRA (in %)
Falcon 7B ✅ 1.08 ❌ 76.37
RedPajama 3B ✅ 0.00 ❌ 72.34
RedPajama 7B ✅ 0.00 ❌ 75.52
Llama2 7B ✅ 0.00 ❌ 75.30
Llama2 13B ✅ 0.00 ❌ 77.93
Mosaic MPT 7B ✅ 0.00 ❌ 0.00
Mistral 7B ✅ 0.00 ❌ 74.36
Zephyr-7B-β ✅ ❌ ❌ 74.90
Palmyra 30B ❌ 15.23 ❌ ❌
Jurassic J2-Light ❌ 1.82 ❌ ❌
Jurassic J2-Mid ❌ 22.93 ❌ ❌
Jurassic J2-Ultra ❌ 43.62 ❌ ❌
OpenAI GPT-3.5-Turbo ❌ 60.22 ❌ 79.41
  • Few-shot accuracy could not be computed because including the few-shot examples makes the prompt too long to fit within the model's context window.
  • Palmyra does not have finetuning capabilities.
  • Jurassic J2 models' finetuning capabilities on the classification task were not evaluated.
Classification: Sample efficiency VS Accuracy
Model / # samples (fraction) 266 (2.5%) 533 (5%) 1066 (10%) 2666 (25%) 5332 (50%) 10664 (100%)
Distilbert 36.24 46.65 54.15 67.07 72.00 71.91
Bert 16.91 30.75 53.73 68.41 72.46 74.15
Flan-T5-Large 59.86 68.84 73.38 75.45 75.43 72.31
Falcon-7B 61.85 64.02 67.52 70.32 72.42 76.37
RedPajama-3B 55.32 57.49 65.45 67.18 70.58 72.34
RedPajama-7B 58.17 60.31 67.22 69.53 70.96 75.52
Llama2-7B 52.10 54.72 55.97 69.20 69.09 75.30
Llama2-13B 66.23 67.45 71.69 73.50 77.87 77.93
Mosaic MPT-7B ❌ ❌ ❌ ❌ ❌ 0.0
Mistral-7B 49.30 48.14 58.41 64.89 73.10 74.36
Zephyr-7B-β 46.05 55.66 66.48 66.73 69.54 74.90
Palmyra 30B ❌ ❌ ❌ ❌ ❌ ❌
Jurassic J2-Light ❌ ❌ ❌ ❌ ❌ ❌
Jurassic J2-Mid ❌ ❌ ❌ ❌ ❌ ❌
Jurassic J2-Ultra ❌ ❌ ❌ ❌ ❌ ❌
OpenAI GPT-3.5-Turbo 73.81 56.17 47.32 49.15 78.84 79.41
  • Palmyra does not have finetuning capabilities.
  • Jurassic J2 models' finetuning capabilities on the classification task were not evaluated.
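
The fractional training subsets in the sample-efficiency table can be approximated with a stratified subsample of the 20 Newsgroups training split, as in the sketch below. This uses scikit-learn's version of the dataset; the repository's own preprocessing and splits (10,664 samples at 100%) may differ.

    # Sketch of building fractional training subsets for the sample-efficiency comparison.
    # Uses scikit-learn's 20 Newsgroups loader; the repo's own splits may differ.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.model_selection import train_test_split

    train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))

    fractions = [0.025, 0.05, 0.10, 0.25, 0.50, 1.0]
    subsets = {}
    for frac in fractions:
        if frac == 1.0:
            texts, labels = train.data, train.target
        else:
            # Stratified subsample so all 20 classes stay represented at every fraction.
            texts, _, labels, _ = train_test_split(
                train.data, train.target, train_size=frac,
                stratify=train.target, random_state=42,
            )
        subsets[frac] = (texts, labels)
        print(f"{frac:.1%}: {len(texts)} training samples")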
Summarization: Zero-shot prompting VS Few-shot prompting VS Fine-Tuning

We use the samsum dataset, which contains chat conversations and their summaries. The task is for LLMs to learn how best to summarize conversations from pairs of conversations and their corresponding summaries. The following table captures how the LLMs perform on this task (a minimal ROUGE computation sketch follows the table).

  • ZS = Zero-shot
  • FS = Few-shot
  • FT = Fine-Tuning
Model ZS Rouge-1 ZS Rouge-2 FS Rouge-1 FS Rouge-2 FT Rouge-1 FT Rouge-2
Flan-T5-Base Full FT ❌ ❌ ❌ ❌ 47.23 21.01
Flan-T5-Large ❌ ❌ ❌ ❌ 49.21 23.39
Falcon-7B 32.21 10.08 34.12 11.9 52.18 27.84
RedPajama-3B 30.09 10.48 29.16 10.05 47.75 23.53
RedPajama-7B 30.85 11.30 23.22 8.24 49.96 25.94
Llama2-7B 30.06 8.61 35.57 14.23 51.71 26.86
Llama2-13B 11.02 3.38 22.50 9.25 52.97 28.32
Mosaic MPT-7B 32.86 10.41 34.71 12.26 23.5 9.67
Mistral Base-7B 32.77 10.64 38.87 16.71 53.61 29.28
Zephyr-7B-β 33.93 11.21 35.99 12.97 52.84 27.75
Writer Palmyra 30B 33.68 12.18 39.28 16.19 ❌ ❌
Jurassic J2-Light 38.21 14.78 40.73 17.09 44.69 20.15
Jurassic J2-Mid 39.11 15.59 43.39 18.34 48.38 23.90
Jurassic J2-Ultra 41.63 17.27 45.31 19.27 ❌ ❌
OpenAI GPT-3.5-Turbo 36.41 13.31 39.08 15.83 55.91 31.88
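
For reference, ROUGE-1/ROUGE-2 on the samsum test set can be computed with the Hugging Face datasets and evaluate libraries as sketched below. The summarize() placeholder stands in for a call to your finetuned model; this is an illustration, not necessarily the exact evaluation code behind the numbers above.

    # Sketch of computing ROUGE-1 / ROUGE-2 on the samsum test split (illustrative).
    from datasets import load_dataset  # newer datasets versions may need trust_remote_code=True for samsum
    import evaluate

    samsum = load_dataset("samsum", split="test")
    rouge = evaluate.load("rouge")

    def summarize(dialogue: str) -> str:
        # Placeholder: replace with generation from your finetuned model.
        return dialogue.split("\n")[0]

    predictions = [summarize(example["dialogue"]) for example in samsum]
    references = [example["summary"] for example in samsum]

    scores = rouge.compute(predictions=predictions, references=references)
    print(f"ROUGE-1: {100 * scores['rouge1']:.2f}  ROUGE-2: {100 * scores['rouge2']:.2f}")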
Cost estimation and load testing

We deployed the models mentioned above behind two servers: a custom FastAPI server and the Hugging Face Text Generation Inference (TGI) server. The goal was to compare cost and latency between our custom FastAPI server and TGI, which comes with many built-in optimizations.
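
Once a model is running behind TGI, clients send requests to its /generate endpoint. A minimal call looks roughly like the sketch below; the host, port, prompt, and generation parameters are illustrative and depend on how the server was launched.

    # Minimal client request against a running Text Generation Inference server (illustrative).
    # Assumes TGI is listening on localhost:8080; adjust to your deployment.
    import requests

    payload = {
        "inputs": "Summarize the following conversation:\nA: Are we still on for lunch?\nB: Yes, 12:30 works.",
        "parameters": {"max_new_tokens": 64},
    }

    response = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
    response.raise_for_status()
    print(response.json()["generated_text"])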

All servers ran and received inference requests on an AWS g5.4xlarge instance with an NVIDIA A10 GPU. For load testing, we used Vegeta to see how the system copes with a high volume of requests. Our objective was to identify the maximum requests per second (RPS) each model could handle, along with its throughput, latency, and cost per 1,000 tokens. We created a set of sample sentences, each roughly 100 tokens long, to generate the requests; during load testing a random sentence was chosen for each request, which keeps results comparable across runs. This method allowed us to identify the typical RPS range each model and server could handle for each task.
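
One way to arrive at a cost-per-1,000-tokens figure is from the instance's hourly price and the throughput sustained under load; a back-of-the-envelope sketch of that calculation follows. Both input numbers are illustrative placeholders, not the measured values behind the tables, and the repository may compute cost differently.

    # Back-of-the-envelope cost per 1K tokens from instance price and sustained throughput.
    # Both constants are illustrative placeholders, not the repo's measured values.
    HOURLY_INSTANCE_COST_USD = 1.62     # assumed on-demand price for a g5.4xlarge-class instance
    THROUGHPUT_TOKENS_PER_SEC = 500.0   # assumed sustained token throughput under load

    tokens_per_hour = THROUGHPUT_TOKENS_PER_SEC * 3600
    cost_per_1k_tokens = HOURLY_INSTANCE_COST_USD / tokens_per_hour * 1000
    print(f"${cost_per_1k_tokens:.5f} per 1K tokens")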

Below, two tables summarize our observations for all the models, tasks, and most-used deployment options explored in this repository (we also tried LLama on an Nvidia A100 using the Ray server; more details can be found here). Generally, the TGI server is more cost-effective than the custom server and simpler to set up: it provided higher RPS and throughput and lower latency. A different inference server, vLLM, can offer an even higher maximum RPS than TGI (you can find more details about our load-testing experiments with it for LLama-2 here). Finally, models serving classification are slower than those serving summarization, and a model's size (number of parameters) does not significantly impact its performance.

Text Generation Inference

Classification

Model | Inference cost (per 1K tokens) | RPS | Throughput | Latency 90% (seconds)
Flan-T5 Large | $0.00001 | 145 | 78.5 | 1.5
Falcon-7B | $0.00005 | 125 | 30.3 | 2.7
RP-3B | $0.00003 | 135 | 57.3 | 1.44
RP-7B | $0.00003 | 125 | 26.13 | 3.98
LLama2-7B | $0.00003 | 125 | 19.81 | 4.8
LLama2-13B | $0.00003 | 125 | 9.60 | 12.04

Summarization

Model | Inference cost (per 1K tokens) | RPS | Throughput | Latency 90% (seconds)
Flan-T5 Large | $0.00001 | 120 | 45.5 | 2.03
Falcon-7B | $0.00004 | 145 | 53.8 | 1.82
RP-3B | $0.00001 | 195 | 96.06 | 0.7139
RP-7B | $0.00002 | 145 | 41.5 | 2.5
LLama2-7B | $0.00002 | 135 | 36.10 | 2.6
LLama2-13B | $0.00002 | 125 | 22.16 | 5.15

FastAPI

Classification

Model | Inference cost (per 1K tokens) | RPS | Throughput | Latency 90% (seconds)
Flan-T5 Large | $0.00001 | 180 | 5.84 | 28.01
Falcon-7B | - | - | - | -
RP-3B | $0.001 | 4 | 0.15 | 26.4
RP-7B | $0.001 | 4 | 0.14 | 28.1
LLama2-7B | $0.001 | 4 | 0.11 | 27.3
LLama2-13B | $0.001 | 4 | 0.14 | 27.9

Summarization

Model | Inference cost (per 1K tokens) | RPS | Throughput | Latency 90% (seconds)
Flan-T5 Large | $0.00007 | 30 | 1.5 | 18.27
Falcon-7B | - | - | - | -
RP-3B | $0.00002 | 160 | 5.46 | 28.4
RP-7B | $0.00002 | 160 | 5.27 | 29.527
LLama2-7B | $0.00003 | 100 | 3.43 | 28.1
LLama2-13B | $0.0003 | 10 | 1.73 | 5.1

In conclusion, the TGI server offers a more cost-efficient and streamlined approach than a custom server, delivering better RPS, throughput, and latency. While models serving classification tend to be slower, model size (in parameters) does not notably affect serving efficiency. Choosing the right server and model type is crucial for optimizing cost and latency.

Contributing

If you would like to contribute to this project, we recommend following the "fork-and-pull" Git workflow.

  1. Fork the repo on GitHub
  2. Clone the project to your own machine
  3. Commit changes to your own branch
  4. Push your work back up to your fork
  5. Submit a Pull request so that we can review your changes

NOTE: Be sure to merge the latest from "upstream" before making a pull request!

Correspondence

If you have any questions, please reach out to:

More Repositories

  1. Multimodal-Toolkit: Multimodal model for text and tabular data with HuggingFace transformers as building block for text data (Python, 479 stars)
  2. Knowledge-Distillation-Toolkit: A knowledge distillation toolkit based on PyTorch and PyTorch Lightning (Python, 125 stars)
  3. pyoats: Quick and Easy Time Series Outlier Detection (Python, 90 stars)
  4. Transformers-Domain-Adaptation: Adapt Transformer-based language models to new text domains (Jupyter Notebook, 79 stars)
  5. hydra: A cloud-agnostic ML platform that enables data scientists to run multiple experiments, perform hyperparameter optimization, evaluate results, and serve models (batch/realtime) while maintaining a uniform development UX across cloud environments (HCL, 39 stars)
  6. automl_benchmark: Distributed, large-scale benchmarking framework for rigorous assessment of automatic machine learning repositories, projects, and libraries (Python, 30 stars)
  7. foreshadow: An automatic machine learning system (Python, 29 stars)
  8. genai-bootcamp: Georgian GenAI 2023 Bootcamps Codebase (Jupyter Notebook, 19 stars)
  9. GAL: Georgian AI Library (GAL), a repository that contains research notes on Georgian applied research areas (Jupyter Notebook, 10 stars)
  10. newscrawl (Jupyter Notebook, 5 stars)
  11. Diffbot-Graph-Learning: Heterogeneous graph representation learning on Diffbot's Knowledge Graph. Includes asynchronous BFS traversal of Diffbot's Knowledge Graph, Diffbot's Enhance API to build custom graph datasets, and models to run on these graphs (Python, 5 stars)
  12. dp-roundtable (Jupyter Notebook, 4 stars)
  13. annotation_tool (JavaScript, 3 stars)
  14. azure-ml-demo (Jupyter Notebook, 1 star)
  15. gpt-neo-sagemaker: Using SageMaker to train and serve GPT-Neo models from HuggingFace (Jupyter Notebook, 1 star)