• Stars
    star
    3,845
  • Rank 10,920 (Top 0.3 %)
  • Language
  • Created 12 months ago
  • Updated 4 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Numbers every LLM developer should know

Numbers every LLM Developer should know

At Google, there was a document put together by Jeff Dean, the legendary engineer, called Numbers every Engineer should know. It’s really useful to have a similar set of numbers for LLM developers to know that are useful for back-of-the envelope calculations. Here we share particular numbers we at Anyscale use, why the number is important and how to use it to your advantage.

Notes on the Github version

Last updates: 2023-05-17

If you feel there's an issue with the accuracy of the numbers, please file an issue. Think there are more numbers that should be in this doc? Let us know or file a PR.

We are thinking the next thing we should add here is some stats on tokens per second of different models.

Prompts

40-90%1: Amount saved by appending “Be Concise” to your prompt

It’s important to remember that you pay by the token for responses. This means that asking an LLM to be concise can save you a lot of money. This can be broadened beyond simply appending “be concise” to your prompt: if you are using GPT-4 to come up with 10 alternatives, maybe ask it for 5 and keep the other half of the money.

1.3:1 -- Average tokens per word

LLMs operate on tokens. Tokens are words or sub-parts of words, so “eating” might be broken into two tokens “eat” and “ing”. A 750 word document in English will be about 1000 tokens. For languages other than English, the tokens per word increases depending on their commonality in the LLM's embedding corpus.

Knowing this ratio is important because most billing is done in tokens, and the LLM’s context window size is also defined in tokens.

Prices2

Prices are of course subject to change, but given how expensive LLMs are to operate, the numbers in this section are critical. We use OpenAI for the numbers here, but prices from other providers you should check out (Anthropic, Cohere) are in the same ballpark.

~50:1 -- Cost Ratio of GPT-4 to GPT-3.5 Turbo3

What this means is that for many practical applications, it’s much better to use GPT-4 for things like generating high quality fine tuning data, or for automated evaluation of other models -- things you might only do once instead of it living in the middle of your inference cycle. It is roughly 50 times cheaper to use GPT-3.5-Turbo than GPT-4 (the “roughly” is because GPT-4 charges differently for the prompt and the generated output) – so you really need to check on how far you can get with GPT-3.5-Turbo. GPT-3.5-Turbo is more than enough for tasks like summarization for example.

5:1 -- Cost Ratio of generation of text using GPT-3.5-Turbo vs OpenAI embedding

This means it is way cheaper to look something up in a vector store than to ask an LLM to generate it. E.g. “What is the capital of Delaware?” when looked up in an neural information retrieval system costs about 5x4 less than if you asked GPT-3.5-Turbo. The cost difference compared to GPT-4 is a whopping 250x!

10:1 -- Cost Ratio of OpenAI embedding to Self-Hosted embedding

Note: this number is sensitive to load and embedding batch size, so please consider this approximate.

In our blog post, we noted that using a g4dn.4xlarge (on-demand price: $1.20/hr) we were able to embed at about 9000 tokens per second using Hugging Face’s SentenceTransformers (which are pretty much as good as OpenAI’s embeddings). Doing some basic math of that rate and that node type indicates it is considerably cheaper (factor of 10 cheaper) to self-host embeddings (and that is before you start to think about things like ingress and egress fees).

6:1 -- Cost Ratio of OpenAI fine tuned vs base model queries

It costs you 6 times as much to serve a fine tuned model as it does the base model on OpenAI. This is pretty exorbitant, but might make sense because of the possible multi-tenancy of base models. It also means it is far more cost effective to tweak the prompt for a base model than to fine tune a customized model.

1:1 -- Cost Ratio of Self-Hosted base vs fine-tuned model queries

If you’re self hosting a model, then it more or less costs the same amount to serve a fine tuned model as it does to serve a base one: the models have the same number of parameters.

Training and Fine Tuning

~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens

The LLaMa paper mentions it took them 21 days to train LLaMa using 2048 GPUs A100 80GB GPUs. We considered training our own model on the Red Pajama training set, then we ran the numbers. The above is assuming everything goes right, nothing crashes, and the calculation succeeds on the first time, etc. Plus it involves the coordination of 2048 GPUs. That’s not something most companies can do (shameless plug time: of course, we at Anyscale can – that’s our bread and butter! Contact us if you’d like to learn more). The point is that training your own LLM is possible, but it’s not cheap. And it will literally take days to complete each run. Much cheaper to use a pre-trained model.

< 0.001: Cost ratio of fine tuning vs training from scratch

This is a bit of a generalization, but the cost of fine tuning is negligible. We showed for example that you can fine tune a 6B parameter model for about $7. Even at OpenAI’s rate for its most expensive fine-tunable model, Davinci, it is 3c per 1000 tokens. That means to fine tune on the entire works of Shakespeare (about 1 million words), you’re looking at $405. However, fine tuning is one thing and training from scratch is another …

GPU Memory

If you’re self-hosting a model, it’s really important to understand GPU memory because LLMs push your GPU’s memory to the limit. The following statistics are specifically about inference. You need considerably more memory for training or fine tuning.

V100: 16GB, A10G: 24GB, A100: 40/80GB: GPU Memory Capacities

It may seem strange, but it’s important to know the amount of memory different types of GPUs have. This will cap the number of parameters your LLM can have. Generally, we like to use A10Gs because they cost $1.50 to $2 per hour each at AWS on-demand prices and have 24G of GPU memory, vs the A100s which will run you about $5 each at AWS on-demand prices.

2x number of parameters: Typical GPU memory requirements of an LLM for serving

For example, if you have a 7 billion parameter model, it takes about 14GB of GPU space. This is because most of the time, one 16-bit float (or 2 bytes) is required per parameter. There’s usually no need to go beyond 16-bit accuracy, and most of the time when you go to 8-bit accuracy you start to lose resolution (though that may be acceptable in some cases). Of course there are efforts to reduce this, notably llama.cpp which runs a 13 billion parameter model on a 6GB GPU by quantizing aggressively down to 4 bits (and 8 bits without too much impact), but that’s atypical.

~1GB: Typical GPU memory requirements of an embedding model

Whenever you are doing sentence embedding (a very typical thing you do for clustering, semantic search and classification tasks), you need an embedding model like sentence transformers. OpenAI also has its own embeddings that they provide commercially.

You typically don’t have to worry about how much memory embeddings take on the GPU, they’re fairly small. We’ve even had the embedding and the LLM on the same GPU.

>10x: Throughput improvement from batching LLM requests

Running an LLM query through a GPU is very high latency: it may take, say, 5 seconds, with a throughput of 0.2 queries per second. The funny thing is, though, if you run two tasks, it might only take 5.2 seconds. This means that if you can bundle 25 queries together, it would take about 10 seconds, and our throughput has improved to 2.5 queries per second. However, see the next point.

~1 MB: GPU Memory required for 1 token of output with a 13B parameter model

The amount of memory you need is directly proportional to the maximum number of tokens you want to generate. So for example, if you want to generate outputs of up to 512 tokens (about 380 words), you need 512MB. No big deal you might say – I have 24GB to spare, what’s 512MB? Well, if you want to run bigger batches it starts to add up. So if you want to do batches of 16, you need 8GB of space. There are some techniques being developed that overcome this, but it’s still a real issue.

Cheatsheet

Screenshot 2023-05-17 at 1 46 09 PM

Next Steps

See our earlier blog series on solving Generative AI infrastructure and using LangChain with Ray.

If you are interested in learning more about Ray, see Ray.io and Docs.Ray.io.

To connect with the Ray community join #LLM on the Ray Slack or our Discuss forum.

If you are interested in our Ray hosted service for ML Training and Serving, see Anyscale.com/Platform and click the 'Try it now' button

Ray Summit 2023: If you are interested to learn much more about how Ray can be used to build performant and scalable LLM applications and fine-tune/train/serve LLMs on Ray, join Ray Summit on September 18-20th! We have a set of great keynote speakers including John Schulman from OpenAI and Aidan Gomez from Cohere, community and tech talks about Ray as well as practical training focused on LLMs.

Notes

Footnotes

  1. Based on experimentation with GPT-3.5-Turbo using a suite of prompts on 2023-05-08. ↩

  2. Retrieved from http://openai.com/pricing on 2023-05-08. ↩

  3. GPT-4: 6c/1k tokens for the prompt, 12c/1k tokens for the generation (32,000 window version, 8,000 window version is half that). GPT-3.5 Turbo: 0.2c/1k tokens. ↩

  4. This assumes the vector lookup is “free.” It’s not, but it uses CPUs (much cheaper) and is fairly fast. ↩

  5. 1 million words / 0.75 tokens/word / 1000*0.03 = $40. ↩

More Repositories

1

ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
Python
30,993
star
2

ray-llm

RayLLM - LLMs on Ray
Python
1,029
star
3

kuberay

A toolkit to run Ray applications on Kubernetes
Go
861
star
4

tutorial

Jupyter Notebook
772
star
5

tune-sklearn

A drop-in replacement for Scikit-Learn’s GridSearchCV / RandomizedSearchCV -- but with cutting edge hyperparameter tuning techniques.
Python
464
star
6

llmperf

LLMPerf is a library for validating and benchmarking LLMs
Python
366
star
7

llmperf-leaderboard

358
star
8

ray-educational-materials

This is suite of the hands-on training materials that shows how to scale CV, NLP, time-series forecasting workloads with Ray.
Jupyter Notebook
272
star
9

ray_lightning

Pytorch Lightning Distributed Accelerators using Ray
Python
204
star
10

langchain-ray

Examples on how to use LangChain and Ray
Python
202
star
11

rl-experiments

Keeping track of RL experiments
148
star
12

xgboost_ray

Distributed XGBoost on Ray
Python
132
star
13

deltacat

A portable Pythonic Data Catalog API powered by Ray that brings exabyte-level scalability and fast, ACID-compliant, change-data-capture to your big data workloads.
Python
97
star
14

rayfed

A multiple parties joint, distributed execution engine based on Ray, to help build your own federated learning frameworks in minutes.
Python
81
star
15

mobius

Mobius is an AI infrastructure platform for distributed online learning, including online sample processing, training and serving.
Java
78
star
16

plasma

A minimal shared memory object store design
C
40
star
17

enhancements

Tracking Ray Enhancement Proposals
40
star
18

lightgbm_ray

LightGBM on Ray
Python
40
star
19

ray_beam_runner

Ray-based Apache Beam runner
Python
37
star
20

mlflow-ray-serve

MLFlow Deployment Plugin for Ray Serve
Python
35
star
21

distml

Distributed ML Optimizer
Python
29
star
22

llms-in-prod-workshop-2023

Deploy and Scale LLM-based applications
Jupyter Notebook
23
star
23

ray-legacy

An experimental distributed execution engine
Python
21
star
24

ray_shuffling_data_loader

A Ray-based data loader with per-epoch shuffling and configurable pipelining, for shuffling and loading training data for distributed training of machine learning models.
Python
18
star
25

pygloo

Pygloo provides Python bindings for Gloo.
C++
15
star
26

contrib-workflow-dag

Python
11
star
27

anyscale-berkeley-ai-hackathon

Ray and Anyscale for UC Berkeley AI Hackathon!
Jupyter Notebook
11
star
28

credis

C++
9
star
29

ray-acm-workshop-2023

Scalable/Distributed Computer Vision with Ray
Jupyter Notebook
9
star
30

spark-ray-example

A simple demonstration of embedding Ray in a Spark UDF. For Spark + AI Summit 2020.
Jupyter Notebook
8
star
31

community

Artifacts intended to support the Ray Developer Community: SIGs, RFC overviews, and governance. We're very glad you're here! ✨
8
star
32

llm-application

Jupyter Notebook
6
star
33

releaser

Python
5
star
34

scalable-learning

Scaling multi-node multi-GPU workloads
5
star
35

raynomics

Experimental genomics algorithms in Ray
Python
5
star
36

air-reference-arch

Jupyter Notebook
5
star
37

serve-movie-rec-demo

Python
5
star
38

maze-raylit

Hackathon 2020! Max Archit Zhe
Python
5
star
39

ray-serve-arize-observe

Building Real-Time Inference Pipelines with Ray Serve
Jupyter Notebook
5
star
40

anyscale-workshop-nyc-2023

Scalable NLP model fine-tuning and batch inference with Ray and Anyscale
Jupyter Notebook
5
star
41

kuberay-helm

Helm charts for the KubeRay project
Mustache
4
star
42

ray-saturday-dec-2022

Ray Saturday Dec 2022 edition
Jupyter Notebook
4
star
43

RFC

Community Documents
4
star
44

sandbox

Ray repository sandbox
Python
4
star
45

ray-demos

Collection of demos build with Ray
Jupyter Notebook
4
star
46

prototype_gpu_buffer

Python
3
star
47

arrow-build

Queue for building arrow
3
star
48

numbuf

Serializing primitive Python types in Arrow
C++
3
star
49

odsc-west-workshop-2023

Jupyter Notebook
3
star
50

2022_04_13_ray_serve_meetup_demo

Code samples for Ray Serve Meetup on 04/13/2022
Python
2
star
51

q4-2021-docs-hackathon

HTML
2
star
52

ray-scripts

Experimental scripts for deploying and using Ray
Shell
2
star
53

raytracer

Polymer WebUI for Ray
HTML
2
star
54

travis-tracker-v2

Python
2
star
55

scipy-ray-scalable-ml-tutorial-2023

Jupyter Notebook
2
star
56

rllib-contrib

Python
2
star
57

serve_workloads

Python
2
star
58

qcon-workshop-2023

Jupyter Notebook
2
star
59

travis-tracker

Dashboard for Tracking Travis Python Test Result.
TypeScript
1
star
60

common

Code that is shared between Ray projects
C
1
star
61

photon

A local scheduler and node manager for Ray
C
1
star
62

spmd_grid

Grid-style gang-scheduling and collective communication for Ray
Python
1
star
63

checkstyle_java

Python
1
star
64

raylibs

Libraries for Ray
1
star
65

issues-to-airtable

JavaScript
1
star
66

ray-docs-zh

Chinese translation of Ray documentation. This may not be update to date.
1
star
67

ray-project.github.io

The Ray project website
HTML
1
star
68

streaming

Streaming processing engine based on ray platform.
1
star
69

train-serve-primer

Jupyter Notebook
1
star
70

serve_config_examples

Python
1
star
71

Ray-Forward

Some resources about Ray Forward Meetup
1
star
72

ray-summit-2022

Website for Ray Summit 2022
HTML
1
star