Web AI
Web AI is a TypeScript library that allows you to run modern deep learning models directly in your web browser. You can easily add AI capabilities to your web applications without the need for complex server-side infrastructure.
Features:
- Easy to use. Create a model with one line of code, get the result with another one.
- Powered by ONNX runtime. Web AI runs the models using ONNX Runtime Web, which has rich support for all kinds of operators. This means that almost any model will work.
- Compatible with Hugging Face hub. Web AI utilizes model configuration files in the same format as the hub, which makes it even easier to integrate existing models.
- Built-in caching. Web AI stores the downloaded models in IndexedDB using localforage. You can configure the size of the cache dynamically.
- Web worker support. All heavy operations - model creation and inference - are offloaded to a separate thread so the UI does not freeze.
Status
The library is under active development. If something does not work correctly, please file an issue on GitHub. Contributions are very welcome.
Sponsors
- Continuing work on this project is sponsored by Reflect - an awesome app for taking notes.
- Thanks to AlgoveraAI for the grant under their AI project financing program.
Model types
Text models
- Sequence-to-sequence (TextModelType.Seq2Seq). These models transform one text into another. Examples of such transformations are translation, summarization, and grammar correction.
- Feature extraction (TextModelType.FeatureExtraction). These models transform text into an array of numbers - an embedding. Generated vectors are useful for semantic search or cluster analysis because embeddings of semantically similar texts are similar and can be compared using cosine similarity.
Image models
- Semantic segmentation (ImageModelType.Segmentation). These models cluster images into parts that belong to the same object class. In other words, segmentation models detect the exact shape of the objects in the image and classify them.
- Object detection (ImageModelType.ObjectDetection). These models find objects in images, classify them, and generate bounding boxes for them.
- Classification (ImageModelType.Classification). These models do not locate individual objects in the image; they only determine what type of object is most likely present. Because of that, this type of model is most useful when there is only one distinct class of objects in the image (for example, an image of a cat classified as "Egyptian cat").
- Image-to-image (ImageModelType.Img2Img). These models produce images from other images. There are many use cases for this kind of model, for example super-resolution (resizing an image from 256x256 to 1024x1024) combined with image restoration.
Multimodal models
These models combine several types of data (e.g., image and text) to produce the desired output.
- Zero-shot image classification (MultimodalModelType.ZeroShotClassification). These models classify images based on arbitrary classes without additional training.
Installation
The library can be installed via npm:
npm install @visheratin/web-ai
If you plan to use image models, you also need to install jimp:
npm install jimp
Create model instance
Create model from ID
The first way of creating a model is using the model identifier. This method works only for the built-in models.
For text models:
import { TextModel } from "@visheratin/web-ai";
const result = await TextModel.create("grammar-t5-efficient-tiny")
console.log(result.elapsed)
const model = result.model
For image models:
import { ImageModel } from "@visheratin/web-ai";
const result = await ImageModel.create("yolos-tiny-quant")
console.log(result.elapsed)
const model = result.model
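For reference, the two steps above can be combined into a single helper. This is a minimal sketch for a text model, wrapped in an async function for environments without top-level await (the exact TypeScript types of the returned model and output may require narrowing):
import { TextModel } from "@visheratin/web-ai";

async function correctGrammar(text: string): Promise<string> {
  // create the built-in grammar correction model by its identifier
  const result = await TextModel.create("grammar-t5-efficient-tiny");
  const model = result.model;
  // run inference; the output fields are described in the "Data processing" section below
  const output = await model.process(text);
  return output.text;
}

correctGrammar("This sentence have some errors.").then(console.log);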
Create model from metadata
The second way to create a model is via the model metadata. This method allows the use of custom ONNX models. In this case, you need to use a specific model class. Please note that when creating a model from metadata, you must call the init() method before using the model. This is needed to create inference sessions, download configuration files, and set up internal structures.
Text models
The metadata for text models is defined by the TextMetadata class. Not all fields are required for model creation. The minimal example for the Seq2Seq model is:
import { Seq2SeqModel, TextMetadata } from "@visheratin/web-ai";
const metadata: TextMetadata = {
modelPaths: new Map<string, string>([
[
"encoder",
"https://huggingface.co/visheratin/t5-efficient-tiny-grammar-correction/resolve/main/encoder_model.onnx",
],
[
"decoder",
"https://huggingface.co/visheratin/t5-efficient-tiny-grammar-correction/resolve/main/decoder_with_past_model.onnx",
],
]),
tokenizerPath: "https://huggingface.co/visheratin/t5-efficient-tiny-grammar-correction/resolve/main/tokenizer.json",
}
const model = new Seq2SeqModel(metadata);
const elapsed = await model.init();
console.log(elapsed);
The minimal example for the FeatureExtraction model is:
import { TextFeatureExtractionModel, TextMetadata } from "@visheratin/web-ai";
const metadata: TextMetadata = {
modelPaths: new Map<string, string>([
[
"encoder",
"https://huggingface.co/visheratin/t5-efficient-tiny-grammar-correction/resolve/main/encoder_model.onnx",
],
]),
tokenizerPath: "https://huggingface.co/visheratin/t5-efficient-tiny-grammar-correction/resolve/main/tokenizer.json",
}
const model = new TextFeatureExtractionModel(metadata);
const elapsed = await model.init();
console.log(elapsed);
Image models
The metadata for image models is defined by the ImageMetadata
class. Not all fields are required for the model creation. The minimal example for all image models is:
import { ImageMetadata } from "@visheratin/web-ai";
const metadata: ImageMetadata = {
modelPath: "https://huggingface.co/visheratin/segformer-b0-finetuned-ade-512-512/resolve/main/b0.onnx.gz",
configPath: "https://huggingface.co/visheratin/segformer-b0-finetuned-ade-512-512/resolve/main/config.json",
preprocessorPath: "https://huggingface.co/visheratin/segformer-b0-finetuned-ade-512-512/resolve/main/preprocessor_config.json",
}
Then, the model can be created:
import { ClassificationModel, ObjectDetectionModel, SegmentationModel } from "@visheratin/web-ai";
const model = new ClassificationModel(metadata);
// or
const model = new ObjectDetectionModel(metadata);
// or
const model = new SegmentationModel(metadata);
const elapsed = await model.init();
console.log(elapsed);
Additional parameters
You can configure cache size and execution mode for the model using the following parameters:
- cache_size_mb - cache size, in megabytes. The default value is 500.
- proxy - a flag specifying whether to offload the model to a web worker. The default value is true.
These parameters are available in the create() and init() methods.
Data processing
Text models
The processing is done using the process() method.
Seq2Seq models output text:
const input = "Test text input"
const output = await model.process(input)
console.log(output.text)
console.log(`Sentence of length ${input.length} (${output.tokensNum} tokens) was processed in ${output.elapsed} seconds`)
Seq2Seq models also support output streaming via the processStream() method:
const input = "Test text input"
let output = "";
for await (const piece of model.processStream(input)) {
output = output.concat(piece);
}
console.log(output)
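Streaming is handy for updating the UI while the text is being generated. A small sketch, assuming the page contains an element with id "output":
const input = "Test text input"
const outputElement = document.getElementById("output")!;
outputElement.textContent = "";
for await (const piece of model.processStream(input)) {
  // append each generated piece as soon as it arrives
  outputElement.textContent += piece;
}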
If a Seq2Seq model supports task-specific prefixes (e.g., summarize or translate), you can use them to specify what kind of processing is needed:
const input = "Test text input"
const output = await model.process(input, "summarize")
console.log(output.text)
console.log(`Sentence of length ${input.length} (${output.tokensNum} tokens) was processed in ${output.elapsed} seconds`)
If the model does not support the specified prefix, an error will be thrown.
FeatureExtraction models output a numeric array:
const input = "Test text input"
const output = await model.process(input)
console.log(output.result)
console.log(`Sentence of length ${input.length} (${output.tokensNum} tokens) was processed in ${output.elapsed} seconds`)
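Because embeddings of semantically similar texts are close to each other, two outputs can be compared with cosine similarity. A small sketch, assuming output.result is a plain array of numbers as shown above:
const first = await model.process("The weather is nice today")
const second = await model.process("It is sunny outside")

// cosine similarity between two equal-length vectors
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// values close to 1 indicate semantically similar texts
console.log(cosineSimilarity(first.result, second.result));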
Image models
For image models, the processing is also done using the process() method.
Segmentation models output an HTML canvas, which can be overlaid on the original image:
const input = "https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/Georgia5and120loop.jpg/640px-Georgia5and120loop.jpg"
const output = await model.process(input)
// canvas is the destination canvas element on the page, sized to the original image
var destCtx = canvas.getContext("2d");
destCtx.globalAlpha = 0.4;
destCtx.drawImage(output.canvas, 0, 0, output.canvas.width, output.canvas.height,
  0, 0, canvas.width, canvas.height);
console.log(output.elapsed)
If you want to determine the class from the output canvas, you can use the getClass() method:
// xCoord and yCoord are coordinates of the target pixel on the canvas
const rect = canvas.getBoundingClientRect();
const ctx = canvas.getContext("2d");
const x = xCoord - rect.left;
const y = yCoord - rect.top;
const c = ctx!.getImageData(x, y, 1, 1).data;
const className = model.instance.getClass(c);
console.log(className);
Object detection models output a list of bounding box predictions along with their classes and colors. The bounding boxes can be drawn over the original image:
const input = "https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/Georgia5and120loop.jpg/640px-Georgia5and120loop.jpg"
const output = await model.process(input)
// sizes holds the rendered [width, height] of the image, used to scale the normalized box coordinates
for (let object of output.objects) {
var rect = document.createElementNS("http://www.w3.org/2000/svg", "rect");
rect.setAttributeNS(null, "x", (sizes[0] * object.x).toString());
rect.setAttributeNS(null, "y", (sizes[1] * object.y).toString());
rect.setAttributeNS(null, "width", (sizes[0] * object.width).toString());
rect.setAttributeNS(
null,
"height",
(sizes[1] * object.height).toString()
);
const color = object.color;
rect.setAttributeNS(null, "fill", color);
rect.setAttributeNS(null, "stroke", color);
rect.setAttributeNS(null, "stroke-width", "2");
rect.setAttributeNS(null, "fill-opacity", "0.35");
// svgRoot is a root SVG element on the page
svgRoot.appendChild(rect);
}
Classification models output an array of predicted classes along with confidence scores in the range [0, 1], sorted by confidence in descending order. When calling the process() method, you can specify the number of returned predictions (the default is 3):
const input = "https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/Georgia5and120loop.jpg/640px-Georgia5and120loop.jpg"
const output = await model.process(input, 5)
for (let item of output.results) {
console.log(item.class, item.confidence)
}
Multimodal models
ZeroShotClassification models output an array of predicted classes along with confidence scores in the range [0, 1], sorted by confidence in descending order. The output also includes feature vectors for the input image and texts. These vectors are useful for analyzing the similarity between images and classes. When calling the process() method, you must specify the image and the list of classes:
const input = "https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/Georgia5and120loop.jpg/640px-Georgia5and120loop.jpg"
const output = await model.process(input, ["road", "street", "car", "forest"])
for (let item of output.results) {
console.log(item.class, item.confidence)
}
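If you only need the most likely class, you can take the first item in output.results (it is sorted by confidence) and optionally discard low-confidence predictions. The 0.5 threshold below is an arbitrary value for illustration, not something prescribed by the library:
const best = output.results[0];
if (best && best.confidence >= 0.5) {
  console.log(`Most likely class: ${best.class}`);
} else {
  console.log("No confident prediction");
}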
Img2Text models extract image features and then use them to generate text. One useful example of such processing is image captioning. You can also specify a prefix to set the beginning of the output text. The output for this type of model is a string:
const input = "https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/Georgia5and120loop.jpg/640px-Georgia5and120loop.jpg"
const output = await model.process(input, "The image shows")
console.log(output.text)
Built-in models
Text models
Grammar correction
- grammar-t5-efficient-mini - larger model for grammar correction (197 MB). Works the best overall.
- grammar-t5-efficient-mini-quant - minified (quantized) version of the grammar-t5-efficient-mini model. Quantization makes the performance slightly worse, but the size is 5 times smaller than the original.
- grammar-t5-efficient-tiny - small model for grammar correction (113 MB). Works a bit worse than the larger model but is almost half the size.
- grammar-t5-efficient-tiny-quant - minified (quantized) version of the grammar-t5-efficient-tiny model. Quantization makes the performance slightly worse, but the size is 4 times smaller than the original. It is the smallest model, only 24 MB in total.
Summarization
- summarization-cnn-dailymail - a model for summarization that was trained on CNN and Daily Mail news articles.
- summarization-cnn-dailymail-quant - minified (quantized) version of the summarization-t5 model. Quantization makes the performance slightly worse, but the size is 5 times smaller than the original - 63 MB.
Feature extraction
- gtr-t5-large-quant - larger model for feature extraction (242 MB). Works the best overall.
- gtr-t5 - smaller model (185 MB) that still works very well.
- gtr-t5-quant - minified version (78 MB) of the gtr-t5 model.
- sentence-t5-large-quant - larger model for feature extraction (242 MB).
- sentence-t5 - smaller model (185 MB) that still works quite well.
- sentence-t5-quant - minified version (78 MB) of the sentence-t5 model.
Image models
Semantic segmentation
- segformer-b0-segmentation-quant - the smallest model for indoor and outdoor scenes (3 MB). Provides decent quality, but the object borders are not always correct.
- segformer-b1-segmentation-quant - larger model for indoor and outdoor scenes (9 MB). Provides good quality and better object borders.
- segformer-b4-segmentation-quant - the largest model for indoor and outdoor scenes (41 MB). Provides the best quality.
Classification
- mobilevit-small - small model (19 MB) for classification of a large range of classes - people, animals, indoor and outdoor objects.
- mobilevit-xsmall - even smaller model (8 MB) for classification of a large range of classes - people, animals, indoor and outdoor objects.
- mobilevit-xxsmall - the smallest model (5 MB) for classification of a large range of classes - people, animals, indoor and outdoor objects.
- segformer-b2-classification - larger model for indoor and outdoor scenes (88 MB).
- segformer-b2-classification-quant - minified (quantized) version of the segformer-b2-classification model (17 MB). Provides results comparable to the original.
- segformer-b1-classification - smaller model for indoor and outdoor scenes (48 MB).
- segformer-b1-classification-quant - minified (quantized) version of the segformer-b1-classification model (9 MB). Provides results comparable to the original.
- segformer-b0-classification - the smallest model for indoor and outdoor scenes (13 MB).
- segformer-b0-classification-quant - minified (quantized) version of the segformer-b0-classification model (3 MB). Provides results comparable to the original.
Object detection
- yolos-tiny - small (23 MB) but powerful model for finding a large range of classes - people, animals, indoor and outdoor objects.
- yolos-tiny-quant - minified (quantized) version of the yolos-tiny model. The borders are slightly off compared to the original, but the size is 3 times smaller (7 MB).
Feature extraction
- efficientformer-l1-feature - small model (43 MB) for feature extraction. Works well for similar objects.
- efficientformer-l1-feature-quant - minified (11 MB) version of the efficientformer-l1-feature model.
- efficientformer-l3-feature - medium model (116 MB) for feature extraction. The best balance between size and quality.
- efficientformer-l3-feature-quant - minified (30 MB) version of the efficientformer-l3-feature model.
- efficientformer-l7-feature - large model (308 MB) for feature extraction.
- efficientformer-l7-feature-quant - minified (78 MB) version of the efficientformer-l7-feature model.
Image-to-image
- superres-standard - the model for 2x super-resolution (43 MB). Be aware that image generation with this model is quite slow - 50+ seconds, depending on hardware and image size.
- superres-standard-quant - minified (quantized) version of the superres-standard model (10 MB).
- superres-small - tiny model for 2x super-resolution (4 MB). The model runs much faster than superres-standard - 10+ seconds, depending on hardware and image size.
- superres-small-quant - minified (quantized) version of the superres-small model (1.5 MB).
- superres-standard-x4 - the model for 4x super-resolution (43 MB). Be aware that image generation with this model is quite slow - 50+ seconds, depending on hardware and image size.
- superres-standard-x4-quant - minified (quantized) version of the superres-standard-x4 model (10 MB).
- superres-compressed-x4 - the model for 4x super-resolution of compressed images (43 MB). The model not only increases the image resolution but also improves its quality. Be aware that image generation with this model is quite slow - 50+ seconds, depending on hardware and image size.
- superres-compressed-x4-quant - minified (quantized) version of the superres-compressed-x4 model (10 MB).
Multimodal models
- clip-base - high-quality model for zero-shot classification (370 MB).
- clip-base-quant - minified (quantized) version of the clip-base model (102 MB).
- blip-base - high-quality model for image captioning (876 MB).
- blip-base-quant - minified (quantized) version of the blip-base model (161 MB).
- vit-gpt2 - less resource-intensive model for image captioning (980 MB).
- vit-gpt2-quant - minified (quantized) version of the vit-gpt2 model (183 MB).
Future development
- Add examples for popular web frameworks.
- Add audio models (Whisper-small).