  • Stars: 152
  • Rank: 244,685 (Top 5%)
  • Language: Rust
  • Created: almost 2 years ago
  • Updated: over 1 year ago

Repository Details

fast_gpt2

Experiment to run ML from load to finish almost 5x faster; it works mostly by optimizing load time.

Fast gpt2 on a real cluster is 3x faster to run

This is an experimental test to remove the need for PyTorch: a highly specific runtime that loads much faster than regular PyTorch + transformers, by using safetensors and direct memory mapping.
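As a rough illustration of that loading trick (not this repo's exact code; the crate choices memmap2 and safetensors, the file path, and the tensor name are assumptions), the weights file is memory-mapped so that only the small safetensors header is parsed eagerly and the tensor bytes are paged in lazily by the OS:

use std::fs::File;

use memmap2::Mmap;            // memmap2 = "0.9"
use safetensors::SafeTensors; // safetensors = "0.4"

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path: in this repo the weights are downloaded at runtime.
    let file = File::open("model.safetensors")?;
    // Map the file instead of reading it: only the JSON header gets parsed now,
    // the tensor bytes are faulted in by the OS when they are first touched.
    let buffer = unsafe { Mmap::map(&file)? };
    let tensors = SafeTensors::deserialize(&buffer)?;

    // Tensor name is illustrative; GPT-2 checkpoints use names like "wte.weight".
    let embeddings = tensors.tensor("wte.weight")?;
    println!("shape = {:?}, dtype = {:?}", embeddings.shape(), embeddings.dtype());
    Ok(())
}

The actual loader in this repository may differ; the point is that deserialization cost stays tiny and independent of model size.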

Overview

  • Written in Rust
  • Almost no dependencies (intel-mkl/blas)
  • Has a webserver (used to demonstrate differences on real clusters)
  • Implements Gpt2 text-generation (greedy mode only) with past key values, which is the only way to be on par in performance (see the sketch after this list).
  • Docker build (optimized for intel-mkl).
  • Docker image is 42MB (excluding the model + tokenizer, which get downloaded at runtime since that's faster than pulling them from the registry).
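To make the past-key-values point concrete, here is a schematic greedy loop with hypothetical Model, KvCache, and DummyModel types (not this repo's actual API): only the first pass sees the whole prompt, every later step feeds just the newest token and reuses the cached keys/values.

// Hypothetical types, for illustration only: the real repo implements GPT-2 directly.
struct KvCache {
    seen: usize, // how many tokens already have cached keys/values
}

trait Model {
    // Returns logits over the vocabulary for the last position.
    // `tokens` holds only the tokens not yet present in `cache`.
    fn forward(&self, tokens: &[u32], cache: &mut KvCache) -> Vec<f32>;
}

fn greedy_generate<M: Model>(model: &M, prompt: &[u32], new_tokens: usize) -> Vec<u32> {
    let mut cache = KvCache { seen: 0 };
    let mut output = prompt.to_vec();

    // First pass: the whole prompt goes through (slower, fills the cache).
    let mut logits = model.forward(prompt, &mut cache);

    for _ in 0..new_tokens {
        // Greedy mode: always pick the highest-scoring token.
        let next = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .map(|(i, _)| i as u32)
            .unwrap();
        output.push(next);
        // Later passes: feed only the newest token, reuse cached keys/values.
        logits = model.forward(&[next], &mut cache);
    }
    output
}

// Dummy implementation so the sketch actually runs; the numbers are meaningless.
struct DummyModel {
    vocab: usize,
}

impl Model for DummyModel {
    fn forward(&self, tokens: &[u32], cache: &mut KvCache) -> Vec<f32> {
        cache.seen += tokens.len();
        let mut logits = vec![0.0f32; self.vocab];
        logits[cache.seen % self.vocab] = 1.0;
        logits
    }
}

fn main() {
    let model = DummyModel { vocab: 50257 };
    let out = greedy_generate(&model, &[40, 716], 8); // placeholder token ids
    println!("{out:?}");
}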

Use

cargo run --example run --release --features intel-mkl # for better runtime performance mkl helps

Caveat: the first run will actually download the models, so it will definitely be much slower than this. Below is the speed to load and run 20 forward passes of gpt2:

Safetensors 251.041µs
Tokenizer 43.468349ms
Loaded & encoded 43.681588ms
Loop in 172.272045ms # First loop is slower, no past key values + mmap needs to finish
Loop in 36.165002ms
Loop in 36.269518ms
Loop in 36.311927ms
Loop in 36.329951ms
Loop in 36.477757ms
Loop in 34.368017ms
Loop in 32.67637ms
Loop in 32.67117ms
Loop in 32.909676ms
Result Ok("My name is John. I'm a man of God. I")
Total Inference 530.36737ms

This basically loads the model instantly and runs the first forward pass in 56ms, versus ~30ms for the subsequent passes.

Comparison

Here is a reference with the same code in Python (of course Python is much more feature-complete, so I included just the import times for reference):

TRANSFORMERS_OFFLINE=1 python test.py (TRANSFORMERS_OFFLINE=1 to remove potential network slowdown)
Loaded torch 0:00:00.992501
Loaded transformers 0:00:02.095964
Loaded in 0:00:03.444400
/home/nicolas/src/transformers/src/transformers/generation/utils.py:1134: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Tokens: 0:00:00.081493/tokens
Inference took: 0:00:00.814981
[{'generated_text': "My name is John. I'm a man of God. I"}]
Ran in 0:00:04.259426

So this is almost 5x faster than the naive PyTorch version. Both use safetensors' fast loading. As the logs show, most of the "slow" part is loading torch and transformers; the runtime itself is then mostly the same (not in this particular run, but that depends on the machine; on most machines I could try, runtime performance was much closer, to the point that I think they are the same).

Keep in mind this is very naïve PyTorch; there are ways to shrink all the libs and make things faster still. The core number to remember is that this lib is somehow able to load in ~181ms (172 + 43 for the full load + pass, minus 32ms which is a single pass) compared to ~3.4s for transformers + PyTorch.

More Repositories

1. rdev - Simple library to listen and send events to keyboard and mouse (MacOS, Windows, Linux) (Rust, 480 stars)
2. smelte-rs (Rust, 57 stars)
3. alphagozero - Unofficial attempt to rebuild AlphaGo Zero (Python, 57 stars)
4. bloomserver (Rust, 36 stars)
5. django-documentation - Provides a way to integrate a sphinx based documentation into your app (Python, 31 stars)
6. ggblas (Rust, 25 stars)
7. hf-chat (TypeScript, 25 stars)
8. django-userpreferences - Store user preferences for other django apps (Python, 15 stars)
9. bindgen_cuda (Rust, 14 stars)
10. operational-transform-go - An operational transform for go (Go, 11 stars)
11. go-euler - A trial at project euler in Go! (golang) (Go, 10 stars)
12. safetensors (Python, 9 stars)
13. rl-baselines (Python, 7 stars)
14. django-simple-feedback - Simple feedback for django (Python, 7 stars)
15. fast_bert (Rust, 7 stars)
16. zandle - Testing zig comptime out for complex tensor typing thing (Zig, 6 stars)
17. stable-diffusion-webui-hub (Python, 4 stars)
18. django-badges - Fork from https://bitbucket.org/jiaaro/django-badges (Python, 3 stars)
19. bitstamp-go - An api written in go (golang) for bitstamp (Go, 3 stars)
20. hf_transfer (Rust, 3 stars)
21. probability_tree - Probability Tree for the web (JavaScript, 3 stars)
22. nccl-rs (Rust, 3 stars)
23. static_typing_tch (Rust, 2 stars)
24. serde_pyo3 (Rust, 2 stars)
25. ort_test (Rust, 2 stars)
26. multigo - Html5 experiment (JavaScript, 2 stars)
27. kernels_triton (Python, 2 stars)
28. axum_cudarc (Rust, 2 stars)
29. mkl-sys (Rust, 1 star)
30. haskell-euler - A trial at euler problems in Haskell (Haskell, 1 star)
31. hf-hub-rs (Rust, 1 star)
32. smsportal-app - SMS Portal App (Java, 1 star)
33. gohighcharts - Library to display graphics using highcharts on a local server; supports dynamic data through channels (Go, 1 star)
34. euler-rust (Rust, 1 star)
35. rcfiles - Personal files (Nix, 1 star)
36. QWave - Clone of QWave hosted on google code (C++, 1 star)
37. custom_kernel (Cuda, 1 star)
38. esaxx-rs - Bindings to a copy of the SentencePiece esaxx library (fast suffix array and frequent substrings) (C++, 1 star)
39. awesomeTempWidget - A small temperature widget for awesome-WM using ACPI (Lua, 1 star)
40. smsportal-server - SMS Portal Server (so you can send SMS via your web server) (Go, 1 star)