C Transformers
Python bindings for Transformer models implemented in C/C++ using the GGML library.
Also see ChatDocs
Supported Models
Models | Model Type |
---|---|
GPT-2 | gpt2 |
GPT-J, GPT4All-J | gptj |
GPT-NeoX, StableLM | gpt_neox |
LLaMA, LLaMA 2 | llama |
MPT | mpt |
Dolly V2 | dolly-v2 |
Replit | replit |
StarCoder, StarChat | starcoder |
Falcon (Experimental) | falcon |
Installation
pip install ctransformers
Usage
It provides a unified interface for all models:
from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained('/path/to/ggml-gpt-2.bin', model_type='gpt2')
print(llm('AI is going to'))
If you get an illegal instruction error, try using lib='avx' or lib='basic':
llm = AutoModelForCausalLM.from_pretrained('/path/to/ggml-gpt-2.bin', model_type='gpt2', lib='avx')
It provides a generator interface for more control:
tokens = llm.tokenize('AI is going to')
for token in llm.generate(tokens):
    print(llm.detokenize(token))
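The generator interface keeps yielding tokens until the caller stops iterating, so a cap on the number of new tokens is often useful. The following is a minimal sketch, not part of the library: `generate_capped` is a hypothetical helper that only relies on the `tokenize`, `generate` and `detokenize` methods documented below.

```python
from itertools import islice


def generate_capped(llm, prompt, max_new_tokens=32):
    """Generate at most `max_new_tokens` tokens and return them as text.

    `llm` is assumed to expose tokenize(), generate() and detokenize()
    as described in the Documentation section.
    """
    prompt_tokens = llm.tokenize(prompt)
    # islice stops pulling from the generator once the cap is reached.
    new_tokens = list(islice(llm.generate(prompt_tokens), max_new_tokens))
    # detokenize() accepts a sequence of token ids.
    return llm.detokenize(new_tokens)
```

For example, `generate_capped(llm, 'AI is going to', max_new_tokens=16)` would return at most 16 tokens of continuation.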
It can be used with a custom or Hugging Face tokenizer:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokens = tokenizer.encode('AI is going to')
for token in llm.generate(tokens):
    print(tokenizer.decode(token))
It also provides access to the low-level C API. See the Documentation section below.
Hugging Face Hub
It can be used with models hosted on the Hub:
llm = AutoModelForCausalLM.from_pretrained('marella/gpt-2-ggml')
If a model repo has multiple model files (.bin files), specify a model file using:
llm = AutoModelForCausalLM.from_pretrained('marella/gpt-2-ggml', model_file='ggml-model.bin')
It can be used with your own models uploaded on the Hub. For better user experience, upload only one model per repo.
To use it with your own model, add a config.json file to your model repo specifying the model_type:
{
"model_type": "gpt2"
}
You can also specify additional parameters under task_specific_params.text-generation.
See marella/gpt-2-ggml for a minimal example and marella/gpt-2-ggml-example for a full example.
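As a rough illustration of the nesting described above, the sketch below builds a config.json body in Python. The parameter names and values are taken from the Config table in the Documentation section; which of them a given repo should set is an assumption, and only `model_type` is stated as required.

```python
import json

# Hypothetical example values; only "model_type" is required by the text above.
config = {
    "model_type": "gpt2",
    "task_specific_params": {
        "text-generation": {
            "top_k": 40,
            "temperature": 0.8,
            "max_new_tokens": 256,
        }
    },
}

# Serialize to the text that would be saved as config.json in the model repo.
config_json = json.dumps(config, indent=2)
print(config_json)
```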
LangChain
It is integrated into LangChain. See LangChain docs.
GPU
Note: Currently only LLaMA and Falcon models have GPU support.
To run some of the model layers on GPU, set the gpu_layers parameter:
llm = AutoModelForCausalLM.from_pretrained('/path/to/ggml-llama.bin', model_type='llama', gpu_layers=50)
CUDA
Make sure you have installed CUDA 12 and the latest NVIDIA drivers.
To use with CUDA 11, install the ctransformers package using:
CT_CUBLAS=1 pip install ctransformers --no-binary ctransformers
On Windows PowerShell run:
$env:CT_CUBLAS=1
pip install ctransformers --no-binary ctransformers
On Windows Command Prompt run:
set CT_CUBLAS=1
pip install ctransformers --no-binary ctransformers
Metal
To enable Metal support, install the ctransformers package using:
CT_METAL=1 pip install ctransformers --no-binary ctransformers
Documentation
Config
Parameter | Type | Description | Default |
---|---|---|---|
top_k | int | The top-k value to use for sampling. | 40 |
top_p | float | The top-p value to use for sampling. | 0.95 |
temperature | float | The temperature to use for sampling. | 0.8 |
repetition_penalty | float | The repetition penalty to use for sampling. | 1.1 |
last_n_tokens | int | The number of last tokens to use for repetition penalty. | 64 |
seed | int | The seed value to use for sampling tokens. | -1 |
max_new_tokens | int | The maximum number of new tokens to generate. | 256 |
stop | List[str] | A list of sequences to stop generation when encountered. | None |
stream | bool | Whether to stream the generated text. | False |
reset | bool | Whether to reset the model state before generating text. | True |
batch_size | int | The batch size to use for evaluating tokens. | 8 |
threads | int | The number of threads to use for evaluating tokens. | -1 |
context_length | int | The maximum context length to use. | -1 |
gpu_layers | int | The number of layers to run on GPU. | 0 |
Note: Currently only LLaMA, MPT, and Falcon models support the context_length parameter, and only LLaMA and Falcon models support the gpu_layers parameter.
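The defaults in the table can be mirrored in plain Python to check a set of overrides before passing them on. This is a sketch only: `DOCUMENTED_DEFAULTS` and `resolve_config` are hypothetical names, not part of ctransformers, and the values are copied from the table above.

```python
# Defaults copied from the Config table; this dict is illustrative only.
DOCUMENTED_DEFAULTS = {
    "top_k": 40,
    "top_p": 0.95,
    "temperature": 0.8,
    "repetition_penalty": 1.1,
    "last_n_tokens": 64,
    "seed": -1,
    "max_new_tokens": 256,
    "stop": None,
    "stream": False,
    "reset": True,
    "batch_size": 8,
    "threads": -1,
    "context_length": -1,
    "gpu_layers": 0,
}


def resolve_config(**overrides):
    """Return the documented defaults with `overrides` applied.

    Raises ValueError for names that are not in the Config table,
    which catches typos such as `temprature`.
    """
    unknown = set(overrides) - set(DOCUMENTED_DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown config parameters: {sorted(unknown)}")
    return {**DOCUMENTED_DEFAULTS, **overrides}
```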
AutoModelForCausalLM
class AutoModelForCausalLM.from_pretrained
classmethod from_pretrained(
model_path_or_repo_id: str,
model_type: Optional[str] = None,
model_file: Optional[str] = None,
config: Optional[ctransformers.hub.AutoConfig] = None,
lib: Optional[str] = None,
local_files_only: bool = False,
**kwargs
) → LLM
Loads the language model from a local file or remote repo.
Args:
model_path_or_repo_id: The path to a model file or directory or the name of a Hugging Face Hub model repo.
model_type: The model type.
model_file: The name of the model file in repo or directory.
config: AutoConfig object.
lib: The path to a shared library or one of avx2, avx, basic.
local_files_only: Whether or not to only look at local files (i.e., do not try to download the model).
Returns: LLM object.
LLM
class LLM.__init__
method __init__(
model_path: str,
model_type: str,
config: Optional[ctransformers.llm.Config] = None,
lib: Optional[str] = None
)
Loads the language model from a local file.
Args:
model_path: The path to a model file.
model_type: The model type.
config: Config object.
lib: The path to a shared library or one of avx2, avx, basic.
property LLM.config
The config object.
property LLM.context_length
The context length of the model.
property LLM.embeddings
The input embeddings.
property LLM.eos_token_id
The end-of-sequence token.
property LLM.logits
The unnormalized log probabilities.
property LLM.model_path
The path to the model file.
property LLM.model_type
The model type.
property LLM.vocab_size
The number of tokens in the vocabulary.
LLM.detokenize
method detokenize(tokens: Sequence[int], decode: bool = True) → Union[str, bytes]
Converts a list of tokens to text.
Args:
tokens: The list of tokens.
decode: Whether to decode the text as UTF-8 string.
Returns: The combined text of all tokens.
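One possible use of `decode=False` is streaming output token by token. With some tokenizers a single multi-byte UTF-8 character may be split across tokens (an assumption about the tokenizer, not something stated above), in which case decoding each token on its own would fail or produce replacement characters. The sketch below feeds the raw bytes to an incremental decoder instead; `stream_text` is a hypothetical helper.

```python
import codecs


def stream_text(llm, prompt_tokens):
    """Yield decoded text pieces, buffering incomplete UTF-8 sequences.

    `llm` is assumed to expose generate() and detokenize(tokens, decode=False)
    as documented in this section.
    """
    decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')
    for token in llm.generate(prompt_tokens):
        raw = llm.detokenize([token], decode=False)  # bytes, not str
        piece = decoder.decode(raw)  # '' while a character is incomplete
        if piece:
            yield piece
    # Flush anything still buffered at the end of generation.
    tail = decoder.decode(b'', final=True)
    if tail:
        yield tail
```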
LLM.embed
method embed(
input: Union[str, Sequence[int]],
batch_size: Optional[int] = None,
threads: Optional[int] = None
) → List[float]
Computes embeddings for a text or list of tokens.
Note: Currently only LLaMA and Falcon models support embeddings.
Args:
input: The input text or list of tokens to get embeddings for.
batch_size: The batch size to use for evaluating tokens. Default: 8
threads: The number of threads to use for evaluating tokens. Default: -1
Returns: The input embeddings.
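Since `embed` returns a plain `List[float]`, two embeddings can be compared without extra dependencies. The helper below is a minimal sketch of cosine similarity in pure Python; `cosine_similarity` is not part of the library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

With a loaded model that supports embeddings, this could be called as `cosine_similarity(llm.embed('first text'), llm.embed('second text'))`.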
LLM.eval
method eval(
tokens: Sequence[int],
batch_size: Optional[int] = None,
threads: Optional[int] = None
) → None
Evaluates a list of tokens.
Args:
tokens: The list of tokens to evaluate.
batch_size: The batch size to use for evaluating tokens. Default: 8
threads: The number of threads to use for evaluating tokens. Default: -1
LLM.generate
method generate(
tokens: Sequence[int],
top_k: Optional[int] = None,
top_p: Optional[float] = None,
temperature: Optional[float] = None,
repetition_penalty: Optional[float] = None,
last_n_tokens: Optional[int] = None,
seed: Optional[int] = None,
batch_size: Optional[int] = None,
threads: Optional[int] = None,
reset: Optional[bool] = None
) → Generator[int, NoneType, NoneType]
Generates new tokens from a list of tokens.
Args:
tokens: The list of tokens to generate tokens from.
top_k: The top-k value to use for sampling. Default: 40
top_p: The top-p value to use for sampling. Default: 0.95
temperature: The temperature to use for sampling. Default: 0.8
repetition_penalty: The repetition penalty to use for sampling. Default: 1.1
last_n_tokens: The number of last tokens to use for repetition penalty. Default: 64
seed: The seed value to use for sampling tokens. Default: -1
batch_size: The batch size to use for evaluating tokens. Default: 8
threads: The number of threads to use for evaluating tokens. Default: -1
reset: Whether to reset the model state before generating text. Default: True
Returns: The generated tokens.
LLM.is_eos_token
method is_eos_token(token: int) → bool
Checks if a token is an end-of-sequence token.
Args:
token: The token to check.
Returns: True if the token is an end-of-sequence token, else False.
LLM.reset
method reset() → None
Resets the model state.
LLM.sample
method sample(
top_k: Optional[int] = None,
top_p: Optional[float] = None,
temperature: Optional[float] = None,
repetition_penalty: Optional[float] = None,
last_n_tokens: Optional[int] = None,
seed: Optional[int] = None
) → int
Samples a token from the model.
Args:
top_k: The top-k value to use for sampling. Default: 40
top_p: The top-p value to use for sampling. Default: 0.95
temperature: The temperature to use for sampling. Default: 0.8
repetition_penalty: The repetition penalty to use for sampling. Default: 1.1
last_n_tokens: The number of last tokens to use for repetition penalty. Default: 64
seed: The seed value to use for sampling tokens. Default: -1
Returns: The sampled token.
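The low-level methods `reset`, `eval`, `sample` and `is_eos_token` can be combined into a hand-written generation loop. The sketch below shows one plausible way to do so; it is not claimed to match how `generate` is implemented internally, and `manual_generate` is a hypothetical name.

```python
def manual_generate(llm, prompt, max_new_tokens=16):
    """Generate text by alternating sample() and eval() calls.

    `llm` is assumed to expose the methods documented in this section.
    """
    llm.reset()                     # start from a clean model state
    llm.eval(llm.tokenize(prompt))  # feed the prompt tokens to the model
    generated = []
    for _ in range(max_new_tokens):
        token = llm.sample()        # pick the next token from the logits
        if llm.is_eos_token(token):
            break                   # stop at end-of-sequence
        generated.append(token)
        llm.eval([token])           # feed the new token back in
    return llm.detokenize(generated)
```

Writing the loop by hand leaves room for custom logic between steps, such as inspecting `llm.logits` before each `sample` call.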
LLM.tokenize
method tokenize(text: str) → List[int]
Converts text into a list of tokens.
Args:
text: The text to tokenize.
Returns: The list of tokens.
LLM.__call__
method __call__(
prompt: str,
max_new_tokens: Optional[int] = None,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
temperature: Optional[float] = None,
repetition_penalty: Optional[float] = None,
last_n_tokens: Optional[int] = None,
seed: Optional[int] = None,
batch_size: Optional[int] = None,
threads: Optional[int] = None,
stop: Optional[Sequence[str]] = None,
stream: Optional[bool] = None,
reset: Optional[bool] = None
) → Union[str, Generator[str, NoneType, NoneType]]
Generates text from a prompt.
Args:
prompt: The prompt to generate text from.
max_new_tokens: The maximum number of new tokens to generate. Default: 256
top_k: The top-k value to use for sampling. Default: 40
top_p: The top-p value to use for sampling. Default: 0.95
temperature: The temperature to use for sampling. Default: 0.8
repetition_penalty: The repetition penalty to use for sampling. Default: 1.1
last_n_tokens: The number of last tokens to use for repetition penalty. Default: 64
seed: The seed value to use for sampling tokens. Default: -1
batch_size: The batch size to use for evaluating tokens. Default: 8
threads: The number of threads to use for evaluating tokens. Default: -1
stop: A list of sequences to stop generation when encountered. Default: None
stream: Whether to stream the generated text. Default: False
reset: Whether to reset the model state before generating text. Default: True
Returns: The generated text.
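Per the signature above, passing `stream=True` makes the call return a generator of text chunks instead of a single string. A minimal sketch of consuming that generator follows; `stream_to_string` and its `on_chunk` callback are hypothetical names introduced for illustration.

```python
def stream_to_string(llm, prompt, on_chunk=None, **params):
    """Consume a streamed generation and return the full text.

    `on_chunk`, if given, is called with each chunk as it arrives
    (for example, to print it). Extra keyword arguments such as
    `max_new_tokens` or `stop` are forwarded to the model call.
    """
    chunks = []
    for chunk in llm(prompt, stream=True, **params):
        if on_chunk is not None:
            on_chunk(chunk)
        chunks.append(chunk)
    return ''.join(chunks)
```

For instance, `stream_to_string(llm, 'AI is going to', on_chunk=lambda c: print(c, end='', flush=True))` would print text as it is produced and still return the complete result.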