ggml
Tensor library for machine learning
Note that this project is under active development.
Some of the development is currently happening in the llama.cpp and whisper.cpp repos.
Features
- Written in C
- 16-bit float support
- Integer quantization support (4-bit, 5-bit, 8-bit, etc.)
- Automatic differentiation
- ADAM and L-BFGS optimizers
- Optimized for Apple Silicon
- Uses AVX / AVX2 intrinsics on x86 architectures
- Uses VSX intrinsics on ppc64 architectures
- No third-party dependencies
- Zero memory allocations during runtime
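To make the "Integer quantization support" item above more concrete, here is a minimal Python sketch of the general idea behind block-wise 4-bit quantization: a block of floats shares one scale, and each value is stored as a small signed integer. This is only an illustration under simplifying assumptions; it does not reproduce ggml's actual quantization formats or block layout, and the function names `quantize_block` and `dequantize_block` are hypothetical.

```python
def quantize_block(values):
    """Quantize one block of floats to signed 4-bit integers plus a shared scale.

    Illustrative only: this is NOT ggml's on-disk or in-memory format.
    """
    amax = max(abs(v) for v in values)
    # Map the largest magnitude to 7, the top of the signed 4-bit range [-8, 7].
    scale = amax / 7.0 if amax > 0 else 1.0
    quants = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the shared scale and the 4-bit integers."""
    return [scale * q for q in quants]


block = [0.12, -0.5, 0.33, 0.0, 0.91, -0.77, 0.05, -0.02]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)

# Each restored value differs from the original by at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.4f} max_error={max_error:.4f}")
```

The trade-off shown here is the usual one: fewer bits per weight means less memory, at the cost of a bounded rounding error per value.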
Updates
- Example of GPT-2 inference examples/gpt-2
- Example of GPT-J inference examples/gpt-j
- Example of Whisper inference examples/whisper
- Support 4-bit integer quantization #27
- Example of Cerebras-GPT inference examples/gpt-2
- Example of FLAN-T5 inference #12
- Example of LLaMA inference ggerganov/llama.cpp
- Example of LLaMA training ggerganov/llama.cpp/examples/baby-llama
- Example of Falcon inference cmp-nct/ggllm.cpp
- Example of BLOOM inference NouamaneTazi/bloomz.cpp
- Example of RWKV inference saharNooby/rwkv.cpp
- Example of SAM inference
- Idea for GPU support: ggerganov/llama.cpp#915
- Example of StableLM (GPT-NeoX) inference examples/gpt-neox
- Example of BERT inference skeskinen/bert.cpp
- Example of 💫 StarCoder inference examples/starcoder
- Example of MPT inference examples/mpt
- Example of Replit inference examples/replit
- Example of BioGPT inference PABannier/biogpt.cpp
- Example of Encodec inference PABannier/encodec.cpp
- Example of CLIP inference monatis/clip.cpp
- Example of MiniGPT4 inference Maknee/minigpt4.cpp
- Example of ChatGLM inference li-plus/chatglm.cpp
Whisper inference (example)
With ggml you can efficiently run Whisper inference on the CPU.
Memory requirements:
Model | Disk | Mem |
---|---|---|
tiny | 75 MB | ~280 MB |
base | 142 MB | ~430 MB |
small | 466 MB | ~1.0 GB |
medium | 1.5 GB | ~2.6 GB |
large | 2.9 GB | ~4.7 GB |
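As a rough aid for reading the table, the following Python sketch picks the largest Whisper model whose approximate memory use fits a given RAM budget. The numbers are copied from the table (with "~1.0 GB" taken as 1000 MB), so they are estimates; the helper name `largest_model_that_fits` is hypothetical and not part of ggml.

```python
# Approximate memory use in MB, copied from the table (~1.0 GB taken as 1000 MB).
WHISPER_MEM_MB = {
    "tiny": 280,
    "base": 430,
    "small": 1000,
    "medium": 2600,
    "large": 4700,
}


def largest_model_that_fits(budget_mb):
    """Return the largest model whose approximate memory use fits the budget, or None."""
    fitting = [(mem, name) for name, mem in WHISPER_MEM_MB.items() if mem <= budget_mb]
    return max(fitting)[1] if fitting else None


print(largest_model_that_fits(2048))  # small
```

Since the memory figures are approximate, it is probably wise to leave some headroom rather than match the budget exactly.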
GPT inference (example)
With ggml you can efficiently run GPT-2 and GPT-J inference on the CPU.
Here is how to run the example programs:
# Build ggml + examples
git clone https://github.com/ggerganov/ggml
cd ggml
mkdir build && cd build
cmake ..
make -j4 gpt-2 gpt-j
# Run the GPT-2 small 117M model
../examples/gpt-2/download-ggml-model.sh 117M
./bin/gpt-2 -m models/gpt-2-117M/ggml-model.bin -p "This is an example"
# Run the GPT-J 6B model (requires 12GB disk space and 16GB CPU RAM)
../examples/gpt-j/download-ggml-model.sh 6B
./bin/gpt-j -m models/gpt-j-6B/ggml-model.bin -p "This is an example"
# Install Python dependencies
python3 -m pip install -r ../requirements.txt
# Run the Cerebras-GPT 111M model
# Download from: https://huggingface.co/cerebras
python3 ../examples/gpt-2/convert-cerebras-to-ggml.py /path/to/Cerebras-GPT-111M/
./bin/gpt-2 -m /path/to/Cerebras-GPT-111M/ggml-model-f16.bin -p "This is an example"
The inference speeds that I get for the different models on my 32GB MacBook M1 Pro are as follows:
Model | Size | Time / Token |
---|---|---|
GPT-2 | 117M | 5 ms |
GPT-2 | 345M | 12 ms |
GPT-2 | 774M | 23 ms |
GPT-2 | 1558M | 42 ms |
--- | --- | --- |
GPT-J | 6B | 125 ms |
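The per-token times in the table can be converted to throughput with simple arithmetic: tokens per second is 1000 divided by milliseconds per token. The short Python sketch below does this for the figures listed; the dictionary and function names are only illustrative.

```python
# Time per token in milliseconds, copied from the table.
MS_PER_TOKEN = {
    "GPT-2 117M": 5,
    "GPT-2 345M": 12,
    "GPT-2 774M": 23,
    "GPT-2 1558M": 42,
    "GPT-J 6B": 125,
}


def tokens_per_second(ms_per_token):
    """Convert a per-token latency in milliseconds to tokens per second."""
    return 1000.0 / ms_per_token


for model, ms in MS_PER_TOKEN.items():
    print(f"{model}: {tokens_per_second(ms):.1f} tokens/s")
```

For example, 5 ms per token corresponds to 200 tokens per second, and 125 ms per token to 8 tokens per second.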
For more information, check out the corresponding programs in the examples folder.
Using cuBLAS
# fix the path to point to your CUDA compiler
cmake -DGGML_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.1/bin/nvcc ..
Using CLBlast
cmake -DGGML_CLBLAST=ON ..
Resources
- GGML - Large Language Models for Everyone: a description of the GGML format provided by the maintainers of the llm Rust crate, which provides Rust bindings for GGML
- marella/ctransformers: Python bindings for GGML models.
- go-skynet/go-ggml-transformers.cpp: Golang bindings for GGML models
- smspillaz/ggml-gobject: GObject-introspectable wrapper for use of GGML on the GNOME platform.