  • Stars: 212
  • Rank: 186,122 (top 4%)
  • Language: Rust
  • License: MIT License
  • Created: over 1 year ago
  • Updated: about 1 year ago

Repository Details

OpenAI-compatible API for serving a LLaMA-2 model

Cria - Local llama OpenAI-compatible API

The objective is to serve a local LLaMA-2 model by mimicking an OpenAI API service. The LLaMA-2 model runs on the GPU using the ggml-sys crate with specific compilation flags.

Get started:

Using Docker (recommended way)

The easiest way of getting started is using the official Docker container. Make sure you have Docker and docker-compose installed on your machine (example install for Ubuntu 20.04).

cria provides two Docker images: one for CPU-only deployments and a second GPU-accelerated image. To use the GPU image, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 11.7 or higher.

To deploy the cria GPU version using docker-compose:

  1. Clone the repository:
git clone [email protected]:AmineDiro/cria.git
cd cria/docker
  2. The API will load the model located at /app/model.bin by default. You should edit the docker-compose file with your GGML model path so Docker can bind-mount it. You can also change environment variables for your specific config. Alternatively, the easiest way is to set CRIA_MODEL_PATH in docker/.env:
# .env
CRIA_MODEL_PATH=/path/to/ggml/model

# Other environment variables to set
CRIA_SERVICE_NAME=cria
CRIA_HOST=0.0.0.0
CRIA_PORT=3000
CRIA_MODEL_ARCHITECTURE=llama
CRIA_USE_GPU=true
CRIA_GPU_LAYERS=32
CRIA_ZIPKIN_ENDPOINT=http://zipkin-server:9411/api/v2/spans
  3. Run docker-compose to start up the cria API server and the Zipkin server:
docker compose -f docker-compose-gpu.yaml up -d
  4. Enjoy using your local LLM API server 🤟!
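
Once the containers are up, you can run a quick smoke test against the API. This is a minimal sketch using Python and urllib3; it assumes the server is reachable at the default http://localhost:3000 and that the /models route from the roadmap is enabled in your build.

import json

import urllib3

# Query the OpenAI-compatible /v1/models route to check that the server is up.
http = urllib3.PoolManager()
resp = http.request("GET", "http://localhost:3000/v1/models")
print(json.dumps(json.loads(resp.data), indent=2))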

Local Install

  1. Clone the project:

    git clone [email protected]:AmineDiro/cria.git
    cd cria/
  2. Build the project (I ❤️ cargo!):

    cargo b --release
    • For cuBLAS (NVIDIA GPU) acceleration, use:
      cargo b --release --features cublas
    • For Metal acceleration, use:
      cargo b --release --features metal

      ❗ NOTE: If you have issues building for GPU, check out the building issues section.

  3. Download a quantized GGML .bin LLaMA-2 model (for example llama-2-7b).

  4. Run the API, using the --use-gpu flag to offload model layers to your GPU:

    ./target/release/cria -a llama --model {MODEL_BIN_PATH} --use-gpu --gpu-layers 32

Command line arguments reference

All the parameters can be passed as environment variables or command line arguments. Here is the reference for the command line arguments:

./target/cria --help

Usage: cria [OPTIONS]

Options:
  -a, --model-architecture <MODEL_ARCHITECTURE>      [default: llama]
      --model <MODEL_PATH>
  -v, --tokenizer-path <TOKENIZER_PATH>
  -r, --tokenizer-repository <TOKENIZER_REPOSITORY>
  -H, --host <HOST>                                  [default: 0.0.0.0]
  -p, --port <PORT>                                  [default: 3000]
  -m, --prefer-mmap
  -c, --context-size <CONTEXT_SIZE>                  [default: 2048]
  -l, --lora-adapters <LORA_ADAPTERS>
  -u, --use-gpu
  -g, --gpu-layers <GPU_LAYERS>
  --n-gqa <N_GQA>
      Grouped Query Attention: specify --n-gqa 8 for 70B models to work
  -z, --zipkin-endpoint <ZIPKIN_ENDPOINT>
  -h, --help                                         Print help

For environment variables, just prefix the argument name with CRIA_ and use uppercase letters. For example, to set the model path, you can use the CRIA_MODEL environment variable.

There is an example docker/.env.sample file in the project root directory.

Prometheus Metrics

We are exporting Prometheus metrics via the /metrics endpoint.
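
For example, you can scrape the endpoint directly to inspect what is exported (a minimal sketch in Python; it assumes the server is running locally on the default port 3000):

import urllib3

# Fetch the exported metrics in Prometheus text format from the /metrics endpoint.
http = urllib3.PoolManager()
resp = http.request("GET", "http://localhost:3000/metrics")
print(resp.data.decode())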

Tracing

We are tracing performance metrics using the tracing and tracing-opentelemetry crates.

You can use the --zipkin-endpoint flag to export traces to a Zipkin endpoint.

There is a docker-compose file in the project root directory to run a local Zipkin server on port 9411.

Completion Example

You can use the openai Python client or directly use the sseclient Python library to stream messages. Here is an example using the sseclient library:

import json
import sys
import time

import sseclient
import urllib3

url = "http://localhost:3000/v1/completions"

http = urllib3.PoolManager()
# preload_content=False keeps the connection open so the SSE chunks can be streamed.
response = http.request(
    "POST",
    url,
    preload_content=False,
    headers={
        "Content-Type": "application/json",
    },
    body=json.dumps(
        {
            "prompt": "Morocco is a beautiful country situated in north africa.",
            "temperature": 0.1,
        }
    ),
)

# Wrap the streaming HTTP response in an SSE client and print tokens as they arrive.
client = sseclient.SSEClient(response)

s = time.perf_counter()
for event in client.events():
    chunk = json.loads(event.data)
    sys.stdout.write(chunk["choices"][0]["text"])
    sys.stdout.flush()
e = time.perf_counter()

print(f"\nGeneration from completion took {e - s:.2f}s!")

You can clearly see the generation speed using my M1 GPU.
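
Alternatively, because the API mimics OpenAI's, you can point the official openai Python client at the local server. This is a minimal sketch assuming openai>=1.0; the model name is illustrative (cria serves the model it was started with) and the API key is not checked locally.

from openai import OpenAI

# Point the client at the local cria server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed")

stream = client.completions.create(
    model="llama-2",  # illustrative name; cria serves the model loaded at startup
    prompt="Morocco is a beautiful country situated in north africa.",
    temperature=0.1,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)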

TODO / Roadmap:

  • Run Llama.cpp on CPU using llm-chain
  • Run Llama.cpp on GPU using llm-chain
  • Implement /models route
  • Implement basic /completions route
  • Implement streaming completions SSE
  • Cleanup cargo features with llm
  • Support MacOS Metal
  • Merge completions / completion_streaming routes in same endpoint
  • Implement /embeddings route
  • Implement route /chat/completions
  • Set up good tracing (thanks to @aparo)
  • Docker deployment on CPU/GPU
  • Metrics : Prometheus (Thanks to @aparo)
  • Implement a global request queue
    • For each request, put an entry in a queue
    • Spawn the model in a separate task that reads from the ring buffer, gets an entry, and puts each token in the response
    • Construct a stream from the flume resp_rx channel and stream responses to the user.
  • Implement streaming chat completions SSE
  • Set up CI/CD (thanks to @Benjamint22)
  • BETTER ERRORS and http responses (deal with all the unwrapping)
  • Implement request batching
  • Implement request continuous batching
  • Maybe support the Hugging Face candle lib for a full Rust integration 🤔?

API routes

Details are available in the OpenAI API docs: https://platform.openai.com/docs/api-reference/
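
As a quick illustration, a chat request follows the same shape as OpenAI's chat API. This is a hedged sketch: it assumes the /chat/completions route from the roadmap above is enabled in your build, that the server runs on the default port, and that the response follows the OpenAI (non-streaming) format.

import json

import urllib3

# Send a minimal OpenAI-style chat request to the local server.
http = urllib3.PoolManager()
resp = http.request(
    "POST",
    "http://localhost:3000/v1/chat/completions",
    headers={"Content-Type": "application/json"},
    body=json.dumps(
        {
            "messages": [
                {"role": "user", "content": "Give me three facts about Morocco."}
            ],
            "temperature": 0.1,
        }
    ),
)
print(json.loads(resp.data)["choices"][0]["message"]["content"])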

More Repositories

1. daskqueue - Distributed task queue based on Dask (Python, 35 stars)
2. Adversarial-Attacks - FGSM and L-BFGS implementation (Jupyter Notebook, 11 stars)
3. docvec - Semantic search WebAssembly module (HTML, 10 stars)
4. OT-GAN - "Improving GANs Using Optimal Transport" paper implementation (Jupyter Notebook, 6 stars)
5. UFC-fighting-styles - A clustering approach to analyzing the fighting styles of MMA fighters (Jupyter Notebook, 4 stars)
6. Chatbot-kafka - DialoGPT-based chatbot using Kafka streams (CSS, 3 stars)
7. CascadeBandit - Implementation of the Cascading Bandits UCB algorithm (Jupyter Notebook, 3 stars)
8. ane_gte_small - ANE inference (Python, 3 stars)
9. transform-rs - Transformers implementation in Rust (Python, 2 stars)
10. Powerlifting-meets-analysis - Powerlifting US tested project (Jupyter Notebook, 2 stars)
11. Prediction-COVID19-France- - A data science approach to predicting the spread of the COVID-19 virus in France (Jupyter Notebook, 2 stars)
12. Churn-analysis - Churn analysis (HTML, 1 star)
13. Hotel-Bookings - (HTML, 1 star)
14. data-challenge-iapau - (Jupyter Notebook, 1 star)
15. pghamming - Postgres extension to compute Hamming distance (Rust, 1 star)
16. GPU_NIPALS_GS_PCA - PyCUDA implementation of the PCA algorithm (Jupyter Notebook, 1 star)
17. Matplotlib-Vizualization---Weather-data - Broken decade high/low temperatures in 2015 (Jupyter Notebook, 1 star)
18. Proximal-Policy-Gradient - PyTorch implementation of the famous Proximal Policy Gradient, inspired by OpenAI Spinning Up (Jupyter Notebook, 1 star)
19. fastwalker - Concurrent file system walker written in Go (Go, 1 star)
20. Breast-cancer-classification---K-neighbors - Accuracy for malignant and benign cells with the K-neighbors classifier (Jupyter Notebook, 1 star)