
gpt-tokenizer

JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT-2 / GPT-3 / GPT-4. Port of OpenAI's tiktoken with additional features.

Play with gpt-tokenizer

gpt-tokenizer is a highly optimized Token Byte Pair Encoder/Decoder for all of OpenAI's models (including those used by GPT-2, GPT-3, GPT-3.5 and GPT-4). It's written in TypeScript and is fully compatible with all modern JavaScript environments.

This package is a port of OpenAI's tiktoken, with some additional features sprinkled on top.

OpenAI's GPT models utilize byte pair encoding to transform text into a sequence of integers before feeding them into the model.
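
To build intuition for what byte pair encoding does, here is a toy, self-contained sketch (this is an illustration only, not the library's actual implementation, which operates on bytes and uses a pretrained merge table): it repeatedly merges the most frequent adjacent pair of symbols into a single new symbol.

```typescript
// Toy byte pair encoding sketch (illustration only, NOT the real tokenizer):
// repeatedly replace the most frequent adjacent pair with a merged symbol.
function toyBpe(symbols: string[], merges: number): string[] {
  let current = [...symbols]
  for (let i = 0; i < merges; i++) {
    // Count adjacent pairs (joined with a separator unlikely to appear in text)
    const counts = new Map<string, number>()
    for (let j = 0; j < current.length - 1; j++) {
      const pair = current[j] + '\u0000' + current[j + 1]
      counts.set(pair, (counts.get(pair) ?? 0) + 1)
    }
    if (counts.size === 0) break
    // Find the most frequent pair
    let best = ''
    let bestCount = 0
    for (const [pair, count] of counts) {
      if (count > bestCount) {
        best = pair
        bestCount = count
      }
    }
    // Stop when no pair repeats
    if (bestCount < 2) break
    const [a, b] = best.split('\u0000')
    // Merge every occurrence of the winning pair
    const next: string[] = []
    for (let j = 0; j < current.length; j++) {
      if (j < current.length - 1 && current[j] === a && current[j + 1] === b) {
        next.push(a + b)
        j++ // skip the second half of the merged pair
      } else {
        next.push(current[j])
      }
    }
    current = next
  }
  return current
}

console.log(toyBpe('aaabdaaabac'.split(''), 3))
```

Real tokenizers learn the merge order once from a large corpus and then apply those fixed merges at encode time; the merged symbols are what get mapped to the integer token ids the model consumes.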

As of 2023, it is the most feature-complete, open-source GPT tokenizer on NPM. It implements some unique features, such as:

  • Support for easily tokenizing chats thanks to the encodeChat function
  • Support for all current OpenAI models (available encodings: r50k_base, p50k_base, p50k_edit and cl100k_base)
  • Generator function versions of both the decoder and encoder functions
  • Provides the ability to decode an asynchronous stream of data (using decodeAsyncGenerator and decodeGenerator with any iterable input)
  • No global cache (no accidental memory leaks, as with the original GPT-3-Encoder implementation)
  • Includes a highly performant isWithinTokenLimit function to assess token limit without encoding the entire text/chat
  • Improves overall performance by eliminating transitive arrays
  • Type-safe (written in TypeScript)
  • Works in the browser out-of-the-box
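
The early-exit idea behind isWithinTokenLimit can be sketched with plain generators (a simplified illustration only; fakeEncodeGenerator below is a stand-in that yields one "token" per word, not real BPE):

```typescript
// Simplified illustration of early-exit token counting: consume a token
// generator and stop as soon as the limit is exceeded, instead of
// encoding the entire input first.
function* fakeEncodeGenerator(text: string): Generator<number[]> {
  // Stand-in tokenizer: one token per whitespace-separated word (NOT real BPE)
  for (const word of text.split(/\s+/)) {
    yield [word.length]
  }
}

function withinLimit(text: string, limit: number): false | number {
  let count = 0
  for (const chunk of fakeEncodeGenerator(text)) {
    count += chunk.length
    if (count > limit) return false // stop early, skip the rest of the input
  }
  return count
}

console.log(withinLimit('one two three', 5)) // → 3
console.log(withinLimit('one two three', 2)) // → false
```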

Thanks to @dmitry-brazhenko's SharpToken, whose code served as a reference for the port.

Historical note: This package started off as a fork of latitudegames/GPT-3-Encoder, but version 2.0 was rewritten from scratch.

Installation

As NPM package

npm install gpt-tokenizer

As a UMD module

<script src="https://unpkg.com/gpt-tokenizer"></script>

<script>
  // the package is now available as a global:
  const { encode, decode } = GPTTokenizer_cl100k_base
</script>

If you wish to use a custom encoding, fetch the relevant script.

The global name is a concatenation: GPTTokenizer_${encoding}.

Refer to the supported models and their encodings section for more information.

Playground

You can play with the package in the browser using the playground, published under a memorable URL: https://gpt-tokenizer.dev/

The playground mimics the official OpenAI Tokenizer.

Usage

import {
  encode,
  encodeChat,
  decode,
  isWithinTokenLimit,
  encodeGenerator,
  decodeGenerator,
  decodeAsyncGenerator,
} from 'gpt-tokenizer'

const text = 'Hello, world!'
const tokenLimit = 10

// Encode text into tokens
const tokens = encode(text)

// Decode tokens back into text
const decodedText = decode(tokens)

// Check if text is within the token limit
// returns false if the limit is exceeded, otherwise returns the actual number of tokens (truthy value)
const withinTokenLimit = isWithinTokenLimit(text, tokenLimit)

// Example chat:
const chat = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'assistant', content: 'gpt-tokenizer is awesome.' },
]

// Encode chat into tokens
const chatTokens = encodeChat(chat)

// Check if chat is within the token limit
const chatWithinTokenLimit = isWithinTokenLimit(chat, tokenLimit)

// Encode text using generator
for (const tokenChunk of encodeGenerator(text)) {
  console.log(tokenChunk)
}

// Decode tokens using generator
for (const textChunk of decodeGenerator(tokens)) {
  console.log(textChunk)
}

// Decode tokens using async generator
// (run inside an async function or with top-level await;
// assumes `asyncTokens` is an AsyncIterableIterator<number>)
for await (const textChunk of decodeAsyncGenerator(asyncTokens)) {
  console.log(textChunk)
}

By default, importing from gpt-tokenizer uses cl100k_base encoding, used by gpt-3.5-turbo and gpt-4.

To get a tokenizer for a different model, import it directly, for example:

import {
  encode,
  decode,
  isWithinTokenLimit,
} from 'gpt-tokenizer/model/text-davinci-003'

If you're dealing with a resolver that doesn't support package.json exports resolution, you might need to import from the respective cjs or esm directory, e.g.:

import {
  encode,
  decode,
  isWithinTokenLimit,
} from 'gpt-tokenizer/cjs/model/text-davinci-003'

Supported models and their encodings

chat:

  • gpt-4-32k (cl100k_base)
  • gpt-4-0314 (cl100k_base)
  • gpt-4-32k-0314 (cl100k_base)
  • gpt-3.5-turbo (cl100k_base)
  • gpt-3.5-turbo-0301 (cl100k_base)

text-only:

  • text-davinci-003 (p50k_base)
  • text-davinci-002 (p50k_base)
  • text-davinci-001 (r50k_base)
  • text-curie-001 (r50k_base)
  • text-babbage-001 (r50k_base)
  • text-ada-001 (r50k_base)
  • davinci (r50k_base)
  • curie (r50k_base)
  • babbage (r50k_base)
  • ada (r50k_base)

code:

  • code-davinci-002 (p50k_base)
  • code-davinci-001 (p50k_base)
  • code-cushman-002 (p50k_base)
  • code-cushman-001 (p50k_base)
  • davinci-codex (p50k_base)
  • cushman-codex (p50k_base)

edit:

  • text-davinci-edit-001 (p50k_edit)
  • code-davinci-edit-001 (p50k_edit)

embeddings:

  • text-embedding-ada-002 (cl100k_base)

old embeddings:

  • text-similarity-davinci-001 (r50k_base)
  • text-similarity-curie-001 (r50k_base)
  • text-similarity-babbage-001 (r50k_base)
  • text-similarity-ada-001 (r50k_base)
  • text-search-davinci-doc-001 (r50k_base)
  • text-search-curie-doc-001 (r50k_base)
  • text-search-babbage-doc-001 (r50k_base)
  • text-search-ada-doc-001 (r50k_base)
  • code-search-babbage-code-001 (r50k_base)
  • code-search-ada-code-001 (r50k_base)

API

encode(text: string): number[]

Encodes the given text into a sequence of tokens. Use this method when you need to transform a piece of text into the token format that the GPT models can process.

Example:

import { encode } from 'gpt-tokenizer'

const text = 'Hello, world!'
const tokens = encode(text)

decode(tokens: number[]): string

Decodes a sequence of tokens back into text. Use this method when you want to convert the output tokens from GPT models back into human-readable text.

Example:

import { decode } from 'gpt-tokenizer'

const tokens = [18435, 198, 23132, 328]
const text = decode(tokens)

isWithinTokenLimit(text: string, tokenLimit: number): false | number

Checks if the text is within the token limit. Returns false if the limit is exceeded, otherwise returns the number of tokens. Use this method to quickly check if a given text is within the token limit imposed by GPT models, without encoding the entire text.

Example:

import { isWithinTokenLimit } from 'gpt-tokenizer'

const text = 'Hello, world!'
const tokenLimit = 10
const withinTokenLimit = isWithinTokenLimit(text, tokenLimit)

encodeChat(chat: ChatMessage[], model?: ModelName): number[]

Encodes the given chat into a sequence of tokens.

If you did not import a model-specific encoder directly, the model parameter must be provided here so the chat is tokenized correctly for that model. Use this method when you need to transform a chat into the token format that the GPT models can process.

Example:

import { encodeChat } from 'gpt-tokenizer'

const chat = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'assistant', content: 'gpt-tokenizer is awesome.' },
]
const tokens = encodeChat(chat)

encodeGenerator(text: string): Generator<number[], void, undefined>

Encodes the given text using a generator, yielding chunks of tokens. Use this method when you want to encode text in chunks, which can be useful for processing large texts or streaming data.

Example:

import { encodeGenerator } from 'gpt-tokenizer'

const text = 'Hello, world!'
const tokens = []
for (const tokenChunk of encodeGenerator(text)) {
  tokens.push(...tokenChunk)
}

encodeChatGenerator(chat: Iterator<ChatMessage>, model?: ModelName): Generator<number[], void, undefined>

Same as encodeChat, but returns a generator of token chunks instead of an array, and accepts any iterator of chat messages as input.
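
For intuition, the message framing that chat encoding is built around can be sketched as follows. This is a rough illustration of the <|im_start|>/<|im_end|> chat format documented by OpenAI for token counting; the real encodeChat/encodeChatGenerator map these markers to dedicated special token ids and handle per-model differences, rather than encoding them as literal text:

```typescript
// Rough sketch of chat-message framing (illustration only; the real
// encoder emits special token ids for the markers instead of text).
interface ChatMessage {
  role: string
  content: string
}

function* frameChat(chat: Iterable<ChatMessage>): Generator<string> {
  for (const message of chat) {
    yield `<|im_start|>${message.role}\n${message.content}<|im_end|>`
  }
  // Every reply is primed with a final start marker for the assistant.
  yield '<|im_start|>assistant'
}

const framed = [...frameChat([
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Hello!' },
])]
console.log(framed.length) // → 3 (two framed messages + the reply primer)
```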

decodeGenerator(tokens: Iterable<number>): Generator<string, void, undefined>

Decodes a sequence of tokens using a generator, yielding chunks of decoded text. Use this method when you want to decode tokens in chunks, which can be useful for processing large outputs or streaming data.

Example:

import { decodeGenerator } from 'gpt-tokenizer'

const tokens = [18435, 198, 23132, 328]
let decodedText = ''
for (const textChunk of decodeGenerator(tokens)) {
  decodedText += textChunk
}

decodeAsyncGenerator(tokens: AsyncIterable<number>): AsyncGenerator<string, void, undefined>

Decodes a sequence of tokens asynchronously using a generator, yielding chunks of decoded text. Use this method when you want to decode tokens in chunks asynchronously, which can be useful for processing large outputs or streaming data in an asynchronous context.

Example:

import { decodeAsyncGenerator } from 'gpt-tokenizer'

async function processTokens(asyncTokensIterator) {
  let decodedText = ''
  for await (const textChunk of decodeAsyncGenerator(asyncTokensIterator)) {
    decodedText += textChunk
  }
}

Special tokens

There are a few special tokens that are used by the GPT models. Not all models support all of these tokens.

Custom Allowed Sets

gpt-tokenizer allows you to specify custom sets of allowed special tokens when encoding text. To do this, pass a Set containing the allowed special tokens as a parameter to the encode function:

import {
  EndOfPrompt,
  EndOfText,
  FimMiddle,
  FimPrefix,
  FimSuffix,
  ImStart,
  ImEnd,
  ImSep,
  encode,
} from 'gpt-tokenizer'

const inputText = `Some Text ${EndOfPrompt}`
const allowedSpecialTokens = new Set([EndOfPrompt])
const encoded = encode(inputText, allowedSpecialTokens)
const expectedEncoded = [8538, 2991, 220, 100276]

expect(encoded).toStrictEqual(expectedEncoded)

Custom Disallowed Sets

Similarly, you can specify custom sets of disallowed special tokens when encoding text. Pass a Set containing the disallowed special tokens as a parameter to the encode function:

import { encode } from 'gpt-tokenizer'

const inputText = `Some Text`
const disallowedSpecial = new Set(['Some'])
// throws an error:
const encoded = encode(inputText, undefined, disallowedSpecial)

In this example, an Error is thrown, because the input text contains a disallowed special token.

Testing and Validation

gpt-tokenizer includes a set of test cases in the TestPlans.txt file to ensure its compatibility with OpenAI's Python tiktoken library. These test cases validate the functionality and behavior of gpt-tokenizer, providing a reliable reference for developers.

Running the unit tests and verifying the test cases helps maintain consistency between the library and the original Python implementation.

License

MIT

Contributing

Contributions are welcome! Please open an issue or a pull request for bug reports, or use the discussions feature for ideas and any other inquiries.

Hope you find gpt-tokenizer useful in your projects!
