  • Stars: 1,780
  • Rank: 26,144 (Top 0.6%)
  • Language: C
  • License: MIT License
  • Created: over 1 year ago
  • Updated: 10 months ago

Repository Details

C Transformers

Python bindings for the Transformer models implemented in C/C++ using the GGML library.

Also see ChatDocs

Supported Models

Models                  Model Type
GPT-2                   gpt2
GPT-J, GPT4All-J        gptj
GPT-NeoX, StableLM      gpt_neox
LLaMA, LLaMA 2          llama
MPT                     mpt
Dolly V2                dolly-v2
Replit                  replit
StarCoder, StarChat     starcoder
Falcon (Experimental)   falcon

Installation

pip install ctransformers

Usage

It provides a unified interface for all models:

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained('/path/to/ggml-gpt-2.bin', model_type='gpt2')

print(llm('AI is going to'))

Run in Google Colab

If you are getting an illegal instruction error, try using lib='avx' or lib='basic':

llm = AutoModelForCausalLM.from_pretrained('/path/to/ggml-gpt-2.bin', model_type='gpt2', lib='avx')

It provides a generator interface for more control:

tokens = llm.tokenize('AI is going to')

for token in llm.generate(tokens):
    print(llm.detokenize(token))

It can be used with a custom or Hugging Face tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')

tokens = tokenizer.encode('AI is going to')

for token in llm.generate(tokens):
    print(tokenizer.decode(token))

It also provides access to the low-level C API. See the Documentation section below.

Hugging Face Hub

It can be used with models hosted on the Hub:

llm = AutoModelForCausalLM.from_pretrained('marella/gpt-2-ggml')

If a model repo has multiple model files (.bin files), specify a model file using:

llm = AutoModelForCausalLM.from_pretrained('marella/gpt-2-ggml', model_file='ggml-model.bin')

It can also be used with your own models uploaded to the Hub. For a better user experience, upload only one model per repo.

To use it with your own model, add a config.json file to your model repo specifying the model_type:

{
  "model_type": "gpt2"
}

You can also specify additional parameters under task_specific_params.text-generation.
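For example, a config.json that also sets generation defaults might look like the following sketch; the parameter names under text-generation are taken from the Config options documented below, and the values are only illustrative:

{
  "model_type": "gpt2",
  "task_specific_params": {
    "text-generation": {
      "temperature": 0.8,
      "top_k": 40,
      "top_p": 0.95,
      "max_new_tokens": 256
    }
  }
}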

See marella/gpt-2-ggml for a minimal example and marella/gpt-2-ggml-example for a full example.

LangChain

It is integrated into LangChain. See LangChain docs.
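A minimal sketch of the LangChain side, assuming the CTransformers LLM wrapper shipped with LangChain (the import path and arguments can differ between LangChain versions, so treat this as illustrative and check the LangChain docs):

# Illustrative only: import path and arguments depend on your LangChain version.
from langchain.llms import CTransformers

llm = CTransformers(model='marella/gpt-2-ggml')  # Hub repo id or a local model file
print(llm('AI is going to'))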

GPU

Note: Currently only LLaMA and Falcon models have GPU support.

To run some of the model layers on GPU, set the gpu_layers parameter:

llm = AutoModelForCausalLM.from_pretrained('/path/to/ggml-llama.bin', model_type='llama', gpu_layers=50)

CUDA

Make sure you have installed CUDA 12 and the latest NVIDIA drivers.

To use with CUDA 11, install the ctransformers package using:

CT_CUBLAS=1 pip install ctransformers --no-binary ctransformers

On Windows PowerShell run:

$env:CT_CUBLAS=1
pip install ctransformers --no-binary ctransformers

On Windows Command Prompt run:

set CT_CUBLAS=1
pip install ctransformers --no-binary ctransformers

Run in Google Colab

Metal

To enable Metal support, install the ctransformers package using:

CT_METAL=1 pip install ctransformers --no-binary ctransformers

Documentation

Config

Parameter            Type        Description                                                 Default
top_k                int         The top-k value to use for sampling.                        40
top_p                float       The top-p value to use for sampling.                        0.95
temperature          float       The temperature to use for sampling.                        0.8
repetition_penalty   float       The repetition penalty to use for sampling.                 1.1
last_n_tokens        int         The number of last tokens to use for repetition penalty.   64
seed                 int         The seed value to use for sampling tokens.                  -1
max_new_tokens       int         The maximum number of new tokens to generate.               256
stop                 List[str]   A list of sequences to stop generation when encountered.    None
stream               bool        Whether to stream the generated text.                       False
reset                bool        Whether to reset the model state before generating text.    True
batch_size           int         The batch size to use for evaluating tokens.                8
threads              int         The number of threads to use for evaluating tokens.         -1
context_length       int         The maximum context length to use.                          -1
gpu_layers           int         The number of layers to run on GPU.                         0

Note: Currently only LLaMA, MPT, Falcon models support the context_length parameter and only LLaMA, Falcon models support the gpu_layers parameter.
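These values can also be overridden per call using the keyword arguments documented for LLM.__call__ below; a short sketch with illustrative values:

# Illustrative values; any omitted parameter falls back to the config defaults above.
text = llm(
    'AI is going to',
    max_new_tokens=128,
    temperature=0.7,
    top_k=40,
    stop=['\n\n'],
)
print(text)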

class AutoModelForCausalLM


classmethod AutoModelForCausalLM.from_pretrained

from_pretrained(
    model_path_or_repo_id: str,
    model_type: Optional[str] = None,
    model_file: Optional[str] = None,
    config: Optional[ctransformers.hub.AutoConfig] = None,
    lib: Optional[str] = None,
    local_files_only: bool = False,
    **kwargs
) → LLM

Loads the language model from a local file or remote repo.

Args:

  • model_path_or_repo_id: The path to a model file or directory or the name of a Hugging Face Hub model repo.
  • model_type: The model type.
  • model_file: The name of the model file in repo or directory.
  • config: AutoConfig object.
  • lib: The path to a shared library or one of avx2, avx, basic.
  • local_files_only: Whether or not to only look at local files (i.e., do not try to download the model).

Returns: LLM object.
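For example, a load that combines several of these options (the repo id, file name, and lib choice here are placeholders):

from ctransformers import AutoModelForCausalLM

# Placeholder values; combine only the options you need.
llm = AutoModelForCausalLM.from_pretrained(
    'marella/gpt-2-ggml',         # Hub repo id or a local path
    model_type='gpt2',
    model_file='ggml-model.bin',  # pick a specific file if the repo has several
    lib='avx',                    # or 'avx2', 'basic', or a path to a shared library
    local_files_only=False,
)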

class LLM

method LLM.__init__

__init__(
    model_path: str,
    model_type: str,
    config: Optional[ctransformers.llm.Config] = None,
    lib: Optional[str] = None
)

Loads the language model from a local file.

Args:

  • model_path: The path to a model file.
  • model_type: The model type.
  • config: Config object.
  • lib: The path to a shared library or one of avx2, avx, basic.
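A direct-construction sketch, assuming LLM and Config can be imported from ctransformers.llm as the type hints above suggest (normally you would go through AutoModelForCausalLM.from_pretrained instead):

# Sketch only: module path inferred from the signatures above.
from ctransformers.llm import LLM, Config

config = Config()  # defaults listed in the Config table
llm = LLM(model_path='/path/to/ggml-model.bin', model_type='gpt2', config=config)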

property LLM.config

The config object.


property LLM.context_length

The context length of the model.


property LLM.embeddings

The input embeddings.


property LLM.eos_token_id

The end-of-sequence token.


property LLM.logits

The unnormalized log probabilities.


property LLM.model_path

The path to the model file.


property LLM.model_type

The model type.


property LLM.vocab_size

The number of tokens in the vocabulary.


method LLM.detokenize

detokenize(tokens: Sequence[int], decode: bool = True) → Union[str, bytes]

Converts a list of tokens to text.

Args:

  • tokens: The list of tokens.
  • decode: Whether to decode the text as a UTF-8 string.

Returns: The combined text of all tokens.


method LLM.embed

embed(
    input: Union[str, Sequence[int]],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → List[float]

Computes embeddings for a text or list of tokens.

Note: Currently only LLaMA and Falcon models support embeddings.

Args:

  • input: The input text or list of tokens to get embeddings for.
  • batch_size: The batch size to use for evaluating tokens. Default: 8
  • threads: The number of threads to use for evaluating tokens. Default: -1

Returns: The input embeddings.
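For example (a sketch; as noted above this requires a LLaMA or Falcon model):

# Works with a raw string or with a pre-tokenized input.
embeddings = llm.embed('AI is going to')
token_embeddings = llm.embed(llm.tokenize('AI is going to'))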


method LLM.eval

eval(
    tokens: Sequence[int],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → None

Evaluates a list of tokens.

Args:

  • tokens: The list of tokens to evaluate.
  • batch_size: The batch size to use for evaluating tokens. Default: 8
  • threads: The number of threads to use for evaluating tokens. Default: -1

method LLM.generate

generate(
    tokens: Sequence[int],
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    reset: Optional[bool] = None
) → Generator[int, NoneType, NoneType]

Generates new tokens from a list of tokens.

Args:

  • tokens: The list of tokens to generate tokens from.
  • top_k: The top-k value to use for sampling. Default: 40
  • top_p: The top-p value to use for sampling. Default: 0.95
  • temperature: The temperature to use for sampling. Default: 0.8
  • repetition_penalty: The repetition penalty to use for sampling. Default: 1.1
  • last_n_tokens: The number of last tokens to use for repetition penalty. Default: 64
  • seed: The seed value to use for sampling tokens. Default: -1
  • batch_size: The batch size to use for evaluating tokens. Default: 8
  • threads: The number of threads to use for evaluating tokens. Default: -1
  • reset: Whether to reset the model state before generating text. Default: True

Returns: The generated tokens.


method LLM.is_eos_token

is_eos_token(token: int) → bool

Checks if a token is an end-of-sequence token.

Args:

  • token: The token to check.

Returns: True if the token is an end-of-sequence token else False.


method LLM.reset

reset() → None

Resets the model state.


method LLM.sample

sample(
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None
) → int

Samples a token from the model.

Args:

  • top_k: The top-k value to use for sampling. Default: 40
  • top_p: The top-p value to use for sampling. Default: 0.95
  • temperature: The temperature to use for sampling. Default: 0.8
  • repetition_penalty: The repetition penalty to use for sampling. Default: 1.1
  • last_n_tokens: The number of last tokens to use for repetition penalty. Default: 64
  • seed: The seed value to use for sampling tokens. Default: -1

Returns: The sampled token.
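Combined with eval, is_eos_token, and detokenize, this allows a manual generation loop; a minimal sketch (the prompt and token limit are illustrative):

# Sketch of a low-level generation loop using the methods documented here.
tokens = llm.tokenize('AI is going to')
llm.reset()
llm.eval(tokens)

generated = []
for _ in range(32):            # illustrative cap on new tokens
    token = llm.sample()
    if llm.is_eos_token(token):
        break
    generated.append(token)
    llm.eval([token])          # feed the sampled token back into the model

print(llm.detokenize(generated))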


method LLM.tokenize

tokenize(text: str) → List[int]

Converts a text into a list of tokens.

Args:

  • text: The text to tokenize.

Returns: The list of tokens.


method LLM.__call__

__call__(
    prompt: str,
    max_new_tokens: Optional[int] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    stop: Optional[Sequence[str]] = None,
    stream: Optional[bool] = None,
    reset: Optional[bool] = None
) → Union[str, Generator[str, NoneType, NoneType]]

Generates text from a prompt.

Args:

  • prompt: The prompt to generate text from.
  • max_new_tokens: The maximum number of new tokens to generate. Default: 256
  • top_k: The top-k value to use for sampling. Default: 40
  • top_p: The top-p value to use for sampling. Default: 0.95
  • temperature: The temperature to use for sampling. Default: 0.8
  • repetition_penalty: The repetition penalty to use for sampling. Default: 1.1
  • last_n_tokens: The number of last tokens to use for repetition penalty. Default: 64
  • seed: The seed value to use for sampling tokens. Default: -1
  • batch_size: The batch size to use for evaluating tokens. Default: 8
  • threads: The number of threads to use for evaluating tokens. Default: -1
  • stop: A list of sequences to stop generation when encountered. Default: None
  • stream: Whether to stream the generated text. Default: False
  • reset: Whether to reset the model state before generating text. Default: True

Returns: The generated text.
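For example, streaming the output instead of waiting for the full string (the prompt and stop sequences are illustrative):

# With stream=True, the call yields text chunks as they are generated.
for chunk in llm('AI is going to', stream=True, stop=['\n\n'], max_new_tokens=128):
    print(chunk, end='', flush=True)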

License

MIT

More Repositories

1. chatdocs: Chat with your documents offline using AI. (Python, 683 stars)
2. material-icons: Latest icon fonts and CSS for self-hosting material design icons. (CSS, 311 stars)
3. material-symbols: Latest variable icon fonts and optimized SVGs for Material Symbols. (CSS, 163 stars)
4. material-design-icons: Latest icon fonts and optimized SVGs for material design icons. (JavaScript, 158 stars)
5. gpt4all-j: Python bindings for the C++ port of GPT4All-J model. (Python, 38 stars)
6. jekyll-theme-documentation: A Jekyll theme for hosting documentation on GitHub Pages. (HTML, 17 stars)
7. gptj.cpp: Port of GPT-J model in C/C++. (C++, 11 stars)
8. new-url-loader: A tiny alternative to url-loader and file-loader for webpack 5. (JavaScript, 8 stars)
9. node-grpc-web: gRPC Web proxy and Express / Connect middleware for Node.js. (JavaScript, 7 stars)
10. svgv: Transform SVGs into Vue components. (JavaScript, 3 stars)
11. evaluate: A tool to evaluate the performance of various machine learning algorithms and preprocessing steps to find a good baseline for a given task. (Python, 2 stars)
12. exllama (Python, 2 stars)
13. phd: PHP Database library. (PHP, 2 stars)
14. redux-reflex: Reduce boilerplate code by automatically creating action creators and action types from reducers. (JavaScript, 2 stars)
15. jekyll-theme-github: A Jekyll theme for GitHub Pages based on GitHub's Primer styles. (SCSS, 2 stars)
16. nbimport: An IPython magic command to import and run external notebooks using public URLs. (Python, 2 stars)
17. nn: A neural network library built on top of TensorFlow for quickly building deep learning models. (Python, 2 stars)
18. phython: Call Python modules and functions from PHP. (PHP, 2 stars)
19. webpack-setup: [DEPRECATED] Webpack config simplified. (JavaScript, 1 star)
20. react-redux-async: Load react components and redux reducers asynchronously. Useful for code splitting and lazy loading. (JavaScript, 1 star)
21. godb: A simple key-value store server written in Go. (Go, 1 star)
22. code-guidelines (1 star)
23. pwk: Sample scripts and manifests for playing with Kubernetes. (Shell, 1 star)
24. train: A library to build and train reinforcement learning agents in OpenAI Gym environments. (Python, 1 star)
25. guides: General contributing guidelines and coding standards. (1 star)
26. external: Run tasks on external processes to overcome Python's global interpreter lock. (Python, 1 star)
27. shr: Simple HTTP requests for browser. "Simple requests" don't trigger a CORS preflight. (JavaScript, 1 star)
28. test-lumen: Test monolith and microservices implementation in Lumen. (PHP, 1 star)
29. modernize: normalize.css with useful defaults for modern browsers. (CSS, 1 star)