• Stars: 805
• Rank: 54,712 (Top 2%)
• Language: C
• License: MIT License
• Created: about 1 year ago
• Updated: about 1 year ago


Repository Details

C++ implementation for BLOOM

bloomz.cpp

Inference of HuggingFace's BLOOM-like models in pure C/C++.

The repo was built on top of the amazing llama.cpp project by @ggerganov to support BLOOM models. It supports all models that can be loaded using BloomForCausalLM.from_pretrained().
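
If you are not sure whether a given checkpoint qualifies, a quick sanity check is to try loading it with BloomForCausalLM.from_pretrained() before converting it. The snippet below is a minimal sketch of that check; the bigscience/bloomz-560m model id is only an example and any BLOOM checkpoint on the Hub can be substituted.

# sanity check (sketch): if this loads without error, the checkpoint can be
# converted by this repo; "bigscience/bloomz-560m" is just an example model id
from transformers import BloomForCausalLM

model_id = "bigscience/bloomz-560m"
model = BloomForCausalLM.from_pretrained(model_id)
print(f"loaded {model_id}: {sum(p.numel() for p in model.parameters()):,} parameters")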

Demo

[demo GIF: bloomz-7b1 generating text in the terminal]

Usage

First, you need to clone the repo and build it:

git clone https://github.com/NouamaneTazi/bloomz.cpp
cd bloomz.cpp
make

Convert weights

Then, you must convert the model weights to the ggml format. Any BLOOM model can be converted.

Some weights hosted on the Hub are already converted. You can find the list here.

Otherwise, the quickest way to convert weights is to use this converter tool, a Space hosted on the Hugging Face Hub that converts and quantizes the weights for you and uploads them to the repository of your choice.

If you prefer, you can manually convert the weights on your machine:

# install required libraries
python3 -m pip install torch numpy transformers accelerate

# download and convert the 7B1 model to ggml FP16 format
python3 convert-hf-to-ggml.py bigscience/bloomz-7b1 ./models 
# Note: you can add --use-f32 to convert to FP32 instead of FP16
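
To give a feel for what the conversion step does, the sketch below walks a Hugging Face state dict and serializes each tensor in FP16 with a small name/shape header. It is an illustration of the general idea only; it does not reproduce the actual ggml header, hyperparameters, or vocabulary sections, so use convert-hf-to-ggml.py above to produce files that ./main can load.

# illustration only: this is NOT the ggml format expected by ./main; it merely
# shows the kind of work the converter performs (cast to fp16, write raw tensors)
import struct
import torch
from transformers import BloomForCausalLM

model = BloomForCausalLM.from_pretrained("bigscience/bloomz-560m")  # example checkpoint
with open("weights-sketch.bin", "wb") as f:
    for name, tensor in model.state_dict().items():
        data = tensor.to(torch.float16).numpy()
        encoded = name.encode("utf-8")
        f.write(struct.pack("i", len(encoded)))            # length of the tensor name
        f.write(encoded)                                    # tensor name
        f.write(struct.pack("i", data.ndim))                # number of dimensions
        f.write(struct.pack(f"{data.ndim}i", *data.shape))  # tensor shape
        f.write(data.tobytes())                             # raw fp16 payload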

Optionally, you can quantize the model to 4 bits (q4_0):

./quantize ./models/ggml-model-bloomz-7b1-f16.bin ./models/ggml-model-bloomz-7b1-f16-q4_0.bin 2
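
For readers wondering what "4-bit" means here, the sketch below illustrates the basic idea behind block-wise quantization: each small block of weights is scaled by its absolute maximum and rounded to a 4-bit integer grid, so only the 4-bit codes plus one scale per block need to be stored. This is a conceptual illustration in Python, not the exact q4_0 storage layout that ./quantize writes.

# conceptual sketch of 4-bit block quantization (absmax scaling per block of 32
# weights); the real ggml q4_0 format differs in its storage details
import numpy as np

def quantize_block_4bit(block):
    """Quantize one block of 32 float weights to 4-bit integers plus one scale."""
    scale = np.abs(block).max() / 7.0            # map values into roughly [-7, 7]
    if scale == 0.0:
        return 0.0, np.zeros_like(block, dtype=np.int8)
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return scale, q

def dequantize_block(scale, q):
    return scale * q.astype(np.float32)

weights = np.random.randn(32).astype(np.float32)  # one example block of weights
scale, q = quantize_block_4bit(weights)
error = np.abs(weights - dequantize_block(scale, q)).mean()
print(f"scale={scale:.4f}, mean absolute rounding error={error:.4f}")

Storing roughly half a byte per weight (plus a small per-block scale) instead of two bytes is what brings the 7B1 model down to the disk size reported in the Memory usage section below.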

Run inference

Finally, you can run inference:

./main -m ./models/ggml-model-bloomz-7b1-f16-q4_0.bin -t 8 -n 128

Your output should look like this:

make && ./main -m models/ggml-model-bloomz-7b1-f16-q4_0.bin  -p 'Translate "Hi, how are you?" in French:' -t 8 -n 256

I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 13.1.6 (clang-1316.0.21.2.5)
I CXX:      Apple clang version 13.1.6 (clang-1316.0.21.2.5)

make: Nothing to be done for `default'.
main: seed = 1678899845
llama_model_load: loading model from 'models/ggml-model-bloomz-7b1-f16-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 250880
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 1
llama_model_load: n_head  = 32
llama_model_load: n_layer = 30
llama_model_load: f16     = 2
llama_model_load: n_ff    = 16384
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 5312.64 MB
llama_model_load: memory_size =   480.00 MB, n_mem = 15360
llama_model_load: loading model part 1/1 from 'models/ggml-model-bloomz-7b1-f16-q4_0.bin'
llama_model_load: ............................................. done
llama_model_load: model size =  4831.16 MB / num tensors = 366

main: prompt: 'Translate "Hi, how are you?" in French:'
main: number of tokens in prompt = 11
153772 -> 'Translate'
 17959 -> ' "H'
    76 -> 'i'
 98257 -> ', '
 20263 -> 'how'
  1306 -> ' are'
  1152 -> ' you'
  2040 -> '?'
     5 -> '"'
   361 -> ' in'
196427 -> ' French:'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Translate "Hi, how are you?" in French: Bonjour, comment Γ§a va?</s> [end of text]


main: mem per token = 24017564 bytes
main:     load time =  3092.29 ms
main:   sample time =     2.40 ms
main:  predict time =  1003.04 ms / 59.00 ms per token
main:    total time =  5307.23 ms

Advanced usage

Here's a list of the available options:

usage: ./main [options]

options:
  -h, --help            show this help message and exit
  -s SEED, --seed SEED  RNG seed (default: -1)
  -t N, --threads N     number of threads to use during computation (default: 4)
  -p PROMPT, --prompt PROMPT
                        prompt to start generation with (default: random)
  -n N, --n_predict N   number of tokens to predict (default: 128)
  --top_k N             top-k sampling (default: 40)
  --top_p N             top-p sampling (default: 0.9)
  --repeat_last_n N     last n tokens to consider for penalize (default: 64)
  --repeat_penalty N    penalize repeat sequence of tokens (default: 1.3)
  --temp N              temperature (default: 0.8)
  -b N, --batch_size N  batch size for prompt processing (default: 8)
  -m FNAME, --model FNAME
                        model path (default: models/ggml-model-bloomz-7b1-f16-q4_0.bin)
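
If you would rather drive the binary from a script than from the shell, a thin wrapper like the sketch below is enough: it assembles the flags documented above and captures stdout. The run_bloomz name, paths, and parameter values are illustrative choices, not part of the repo.

# minimal sketch: invoke the compiled ./main binary from Python with the flags
# documented above; adjust paths and sampling parameters to taste
import subprocess

def run_bloomz(prompt,
               model_path="./models/ggml-model-bloomz-7b1-f16-q4_0.bin",
               n_predict=128,
               threads=8,
               temp=0.8):
    cmd = [
        "./main",
        "-m", model_path,
        "-p", prompt,
        "-n", str(n_predict),
        "-t", str(threads),
        "--temp", str(temp),
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

print(run_bloomz('Translate "Hi, how are you?" in French:'))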

Memory usage

Model                  Disk     Mem
bloomz-7b1-f16-q4_0    4.7 GB   5.3 GB

iOS App

The repo includes a proof-of-concept iOS app in the Bloomer directory. You need to provide the converted model weights, placing a file called ggml-model-bloomz-560m-f16.bin inside that folder. This is what it looks like on an iPhone:

[screenshot: the Bloomer iOS app running on an iPhone]

More Repositories

1. website-monitor (Go, 11 stars): A tool written in Go that helps you monitor a collection of websites using various metrics.
2. Pictionary (JavaScript, 7 stars): A multiplayer drawing game.
3. mathemaroc (TypeScript, 6 stars)
4. NouamaneTazi (6 stars)
5. hf_search (Jupyter Notebook, 6 stars): Huggingface Semantic Search Engine.
6. nouamanetazi.github.io (HTML, 6 stars): Nouamane Tazi's personal website.
7. Cherokey4WD_Python (Python, 5 stars): Automated Driving Robot With an Arduino, a Raspberry Pi and a Pi Camera.
8. docker-kafka-quickstart (Python, 5 stars): A minimal example for a docker - kafka project.
9. q-learning-self-driving-car (4 stars)
10. NLP (Jupyter Notebook, 4 stars)
11. arabicocr (4 stars)
12. pyToolbox (Jupyter Notebook, 4 stars)
13. paris-traffic-forecast (Jupyter Notebook, 4 stars): Forecasting paris traffic.
14. python_PDF_parsing (Jupyter Notebook, 4 stars): PDF Parsing using Python.
15. arabic-sentiment-analysis (Python, 4 stars)
16. ml_project_example (Jupyter Notebook, 4 stars): Example ML Project with a Hugging Face Space demo.
17. gameoflife-vuejs (Vue, 4 stars): Conway's Game of Life built with Vue.js.
18. awesome-combinatorial-optimization (4 stars): A curated list of resources for combinatorial optimization and its applications.
19. Connect4 (Python, 3 stars)
20. python-metaheuristics (Jupyter Notebook, 3 stars): Various metaheuristic algorithms implemented in Python.
21. WC_illusions (Jupyter Notebook, 3 stars): My research project.
22. Deep-Q-car (Python, 3 stars)
23. search-engine (3 stars)
24. competitive-programming (C++, 2 stars): A curated list of competitive-programming solutions.
25. ML-Toolbox (Jupyter Notebook, 2 stars): A Machine learning toolbox containing snippets of code to help realise a wide range of useful ML techniques.
26. kubernetes-kafka-quickstart (Python, 2 stars)
27. hello-github-actions (2 stars)
28. number-plate-recognition (Jupyter Notebook, 2 stars)
29. multiagent-systems (Jupyter Notebook, 1 star)