Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️

UForm

Pocket-Sized Multi-Modal AI
For Semantic Search & Recommendation Systems


Discord | LinkedIn | Twitter | Blog | GitHub


UForm + USearch + UCall Demo

Welcome to UForm, a multi-modal AI library that's as versatile as it is efficient. Imagine encoding text, images, and soon, audio, video, and JSON documents into a shared Semantic Vector Space. With compact custom pre-trained transformer models, all of this can run anywhere, from your server farm down to your smartphone. Check them out on HuggingFace!

🌟 Key Features

⚡ Speed & Efficiency

  • Tiny Embeddings: With just 256 dimensions, our embeddings are lean and fast to work with, making your search operations 1.5-3x quicker compared to other CLIP-like models with 512-1024 dimensions.

  • Quantization Magic: Our models are trained to be quantization-aware, letting you downcast embeddings from f32 to i8 without losing much accuracy. Supported by USearch, this leads to a further 3x reduction in index size and up to a 5x higher performance, especially on IoT devices with low floating-point performance.
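
As a minimal sketch of that downcasting (assuming the embeddings are L2-normalized, so every component lies in [-1, 1]), the f32-to-i8 conversion can be as simple as scaling by 127:

import numpy as np

def quantize_i8(embedding: np.ndarray) -> np.ndarray:
    # For L2-normalized vectors every component lies in [-1, 1],
    # so a fixed scale of 127 maps them onto the full i8 range.
    return np.clip(np.round(embedding * 127), -128, 127).astype(np.int8)

# Dot products over the i8 vectors approximate the f32 cosine similarity,
# since the shared scale cancels out after normalization.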

๐ŸŒ Global Reach

🎛 Versatility

  • Mid-Fusion: Our models use mid-fusion to align multiple transformer towers, enabling database-like operations on multi-modal data.

  • Bi-Modal Features: Thanks to mid-fusion, our models can produce combined vision & language features, perfect for recommendation systems.

  • Cheap Inference: Our models have under 1 Billion parameters, meaning substantially higher throughput and lower inference costs than even tiny models, like the famous distilbert.

  • Hardware Friendly: Whether it's CoreML, ONNX, or specialized AI hardware like Graphcore IPUs, we've got you covered.

🎓 Architectural Improvements

Inspired by the ALBEF paper by Salesforce, we've pushed the boundaries of pre-training objectives to squeeze more language-vision understanding into smaller models. Some UForm models were trained on just 4 million samples across 10 consumer-grade GPUs: a 100x reduction in both dataset size and compute budget compared to OpenAI's CLIP. While they may not be suited for zero-shot classification tasks, they are your go-to choice for processing large image datasets or even petabytes of video frame-by-frame.


Fusion Models

  • Late-Fusion Models: Great for capturing the big picture but might miss the details. Ideal for large-scale retrieval. OpenAI CLIP is one such model.

  • Early-Fusion Models: These are detail-oriented models that capture fine-grained features. They're usually employed for re-ranking smaller retrieval results.

  • Mid-Fusion Models: The balanced diet of models. They offer an unimodal and a multimodal part, capturing both the forest and the trees. The multimodal part enhances the unimodal features with a cross-attention mechanism.
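
To make the cross-attention idea concrete, here is a toy sketch (not UForm's actual implementation; the dimensions and single attention layer are illustrative assumptions) of text features attending to image features:

import torch
import torch.nn as nn

# Toy dimensions; UForm's real towers differ.
dim, text_len, image_patches = 256, 16, 197

cross_attention = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

text_features = torch.randn(1, text_len, dim)        # unimodal text tower output
image_features = torch.randn(1, image_patches, dim)  # unimodal image tower output

# Text tokens query the image patches, producing multimodal features
# that keep the text sequence layout but absorb visual context.
multimodal_features, _ = cross_attention(
    query=text_features, key=image_features, value=image_features,
)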

Broad Training Objectives

We adopt the following training objectives, in line with methodologies presented in the ALBEF and ViCHA papers:

  • Image-Text Matching (ITM): Uses a loss function to gauge how well the image complements the text.
  • Masked Language Modeling (MLM): Stacked on the multimodal encoder to improve language understanding.
  • Hierarchical Image-Text Contrastive (H-ITC): Compares representations across layers for better alignment.
  • Masked Image Modeling (SSL): Applied to the image encoder to enhance visual data interpretation.
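
For intuition, here is a minimal sketch of the plain image-text contrastive term; the hierarchical variant compares representations across several layers, and the exact losses follow the ALBEF/ViCHA recipes rather than this simplification:

import torch
import torch.nn.functional as F

def itc_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    # Normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature
    # The i-th image matches the i-th text within the batch.
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2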

🛠 Installation

Install UForm via pip:

pip install uform

Note: For versions below 0.3.0, dependencies include transformers and timm. Newer versions only require PyTorch and utility libraries. For optimal performance, use PyTorch v2.0.0 or above.

🚀 Quick Start

Loading a Model

import uform

model = uform.get_model('unum-cloud/uform-vl-english') # Just English
model = uform.get_model('unum-cloud/uform-vl-multilingual-v2') # 21 Languages

The multilingual model is much heavier due to a roughly 10x larger vocabulary, so if you only expect English data, take the former for efficiency. You can also load your own mid-fusion model: just upload it to HuggingFace and pass the model name to get_model.
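
For instance, continuing the session above (the repository name below is a hypothetical placeholder for your own upload):

model = uform.get_model('your-username/your-mid-fusion-model') # hypothetical HuggingFace repo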

Encoding Data

from PIL import Image

text = 'a small red panda in a zoo'
image = Image.open('red_panda.jpg')

image_data = model.preprocess_image(image)
text_data = model.preprocess_text(text)

image_embedding = model.encode_image(image_data)
text_embedding = model.encode_text(text_data)
joint_embedding = model.encode_multimodal(image=image_data, text=text_data)

Retrieving Features

image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)

These features can later be used to produce joint multimodal encodings faster, as the first layers of the transformer can be skipped. Such reuse is handy for re-ranking search results and for recommendation systems.

joint_embedding = model.encode_multimodal(
    image_features=image_features,
    text_features=text_features,
    attention_mask=text_data['attention_mask']
)

Graphcore IPUs

To run on Graphcore IPUs, you must set up PopTorch first. Follow the user guide on their website. Once complete, our example would need a couple of adjustments to best leverage the Graphcore platform's available data and model-parallelism.

import poptorch
from uform import get_model_ipu  # assuming the IPU loader is exported by the uform package
from PIL import Image

options = poptorch.Options()
options.replicationFactor(1)
options.deviceIterations(4)

model = get_model_ipu('unum-cloud/uform-vl-english').parallelize()
model = poptorch.inferenceModel(model, options=options)

text = 'a small red panda in a zoo'
image = Image.open('red_panda.jpg')
image_data = model.preprocess_image(image)
text_data = model.preprocess_text(text)

image_data = image_data.repeat(4, 1, 1, 1)
text_data = {k: v.repeat(4, 1) for k,v in text_data.items()}

image_features, text_features = model(image_data, text_data)

Cloud API

You can also use our larger, faster, better proprietary models deployed in optimized cloud environments. For that, please choose the cloud of your liking, search the marketplace for "Unum UForm", and reinstall UForm with optional dependencies:

$ pip install uform[remote]

model = uform.get_client('0.0.0.0:7000')

The only thing that changes after that is calling get_client with the IP address of your instance instead of using get_model for local usage.

Please, join our Discord for early access!

📊 Models

Architecture

| Model                               | Language Tower | Image Tower | Multimodal Part | Languages | URL        |
| :---------------------------------- | :------------- | :---------- | :-------------- | :-------- | :--------- |
| unum-cloud/uform-vl-english         | BERT, 2 layers | ViT-B/16    | 2 layers        | 1         | weights.pt |
| unum-cloud/uform-vl-multilingual    | BERT, 8 layers | ViT-B/16    | 4 layers        | 12        | weights.pt |
| unum-cloud/uform-vl-multilingual-v2 | BERT, 8 layers | ViT-B/16    | 4 layers        | 21        | weights.pt |

The multilingual models were trained on a language-balanced dataset. The missing captions were augmented with NLLB, effectively distilling multi-lingual capabilities from a large NMT model into our tiny multi-modal encoder.

Accuracy

Evaluating the unum-cloud/uform-vl-multilingual-v2 model, one can expect the following text-to-image search metrics, compared against the xlm-roberta-base-ViT-B-32 OpenCLIP model. The @ 1, @ 5, and @ 10 columns show the quality of the top-1, top-5, and top-10 search results against a human-annotated dataset. Higher is better.

| Language             | OpenCLIP @ 1 | UForm @ 1 | OpenCLIP @ 5 | UForm @ 5 | OpenCLIP @ 10 | UForm @ 10 | Speakers |
| :------------------- | -----------: | --------: | -----------: | --------: | ------------: | ---------: | -------: |
| Arabic               | 22.7 | 31.7 | 44.9 | 57.8 | 55.8 | 69.2 | 274 M |
| Armenian             | 5.6 | 22.0 | 14.3 | 44.7 | 20.2 | 56.0 | 4 M |
| Chinese              | 27.3 | 32.2 | 51.3 | 59.0 | 62.1 | 70.5 | 1'118 M |
| English              | 37.8 | 37.7 | 63.5 | 65.0 | 73.5 | 75.9 | 1'452 M |
| French               | 31.3 | 35.4 | 56.5 | 62.6 | 67.4 | 73.3 | 274 M |
| German               | 31.7 | 35.1 | 56.9 | 62.2 | 67.4 | 73.3 | 134 M |
| Hebrew               | 23.7 | 26.7 | 46.3 | 51.8 | 57.0 | 63.5 | 9 M |
| Hindi                | 20.7 | 31.3 | 42.5 | 57.9 | 53.7 | 69.6 | 602 M |
| Indonesian           | 26.9 | 30.7 | 51.4 | 57.0 | 62.7 | 68.6 | 199 M |
| Italian              | 31.3 | 34.9 | 56.7 | 62.1 | 67.1 | 73.1 | 67 M |
| Japanese             | 27.4 | 32.6 | 51.5 | 59.2 | 62.6 | 70.6 | 125 M |
| Korean               | 24.4 | 31.5 | 48.1 | 57.8 | 59.2 | 69.2 | 81 M |
| Persian              | 24.0 | 28.8 | 47.0 | 54.6 | 57.8 | 66.2 | 77 M |
| Polish               | 29.2 | 33.6 | 53.9 | 60.1 | 64.7 | 71.3 | 41 M |
| Portuguese           | 31.6 | 32.7 | 57.1 | 59.6 | 67.9 | 71.0 | 257 M |
| Russian              | 29.9 | 33.9 | 54.8 | 60.9 | 65.8 | 72.0 | 258 M |
| Spanish              | 32.6 | 35.6 | 58.0 | 62.8 | 68.8 | 73.7 | 548 M |
| Thai                 | 21.5 | 28.7 | 43.0 | 54.6 | 53.7 | 66.0 | 61 M |
| Turkish              | 25.5 | 33.0 | 49.1 | 59.6 | 60.3 | 70.8 | 88 M |
| Ukrainian            | 26.0 | 30.6 | 49.9 | 56.7 | 60.9 | 68.1 | 41 M |
| Vietnamese           | 25.4 | 28.3 | 49.2 | 53.9 | 60.3 | 65.5 | 85 M |
| Mean                 | 26.5±6.4 | 31.8±3.5 | 49.8±9.8 | 58.1±4.5 | 60.4±10.6 | 69.4±4.3 | - |
| Google Translate     | 27.4±6.3 | 31.5±3.5 | 51.1±9.5 | 57.8±4.4 | 61.7±10.3 | 69.1±4.3 | - |
| Microsoft Translator | 27.2±6.4 | 31.4±3.6 | 50.8±9.8 | 57.7±4.7 | 61.4±10.6 | 68.9±4.6 | - |
| Meta NLLB            | 24.9±6.7 | 32.4±3.5 | 47.5±10.3 | 58.9±4.5 | 58.2±11.2 | 70.2±4.3 | - |

Lacking a broad enough evaluation dataset, we translated the COCO Karpathy test split with multiple public and proprietary translation services, averaging the scores across all sets and breaking them down per service in the bottom rows of the table. Check out the unum-cloud/coco-sm repository for details.

Speed

On an RTX 3090, the following text-encoding performance is expected from UForm.

| Model                                   | Multilingual | Parameters  | Speed        | Speedup |
| :-------------------------------------- | :----------- | ----------: | -----------: | ------: |
| bert-base-uncased                       | No           | 109'482'240 | 1'612 seqs/s | -       |
| distilbert-base-uncased                 | No           | 66'362'880  | 3'174 seqs/s | x 1.96  |
| sentence-transformers/all-MiniLM-L12-v2 | Yes          | 33'360'000  | 3'604 seqs/s | x 2.24  |
| sentence-transformers/all-MiniLM-L6-v2  | No           | 22'713'216  | 6'107 seqs/s | x 3.79  |
| unum-cloud/uform-vl-multilingual-v2     | Yes          | 120'090'242 | 6'809 seqs/s | x 4.22  |
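
For context, here is a rough sketch of how one might measure such text-encoding throughput; the batch size, repetition count, and list-based preprocess_text call are assumptions rather than the original benchmark setup:

import time
import uform

model = uform.get_model('unum-cloud/uform-vl-multilingual-v2')

# A batch of 256 identical captions; assuming preprocess_text accepts a list.
texts = ['a small red panda in a zoo'] * 256
text_data = model.preprocess_text(texts)

model.encode_text(text_data)  # warm-up, so one-time initialization isn't timed

start = time.perf_counter()
for _ in range(10):
    model.encode_text(text_data)
elapsed = time.perf_counter() - start
print(f'{10 * len(texts) / elapsed:.0f} seqs/s')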

🧰 Additional Tooling

There are two options to calculate semantic compatibility between an image and a text: Cosine Similarity and Matching Score.

Cosine Similarity

import torch.nn.functional as F

similarity = F.cosine_similarity(image_embedding, text_embedding)

The similarity will fall in the [-1, 1] range, with 1 meaning a perfect match.

Pros:

  • Computationally cheap.
  • Only unimodal embeddings are required. Unimodal encoding is faster than joint encoding.
  • Suitable for retrieval in large collections.

Cons:

  • Takes into account only coarse-grained features.

Matching Score

Unlike cosine similarity, unimodal embeddings are not enough here: the joint embedding is needed, and the resulting score falls in the [0, 1] range, with 1 meaning a perfect match.

score = model.get_matching_scores(joint_embedding)

Pros:

  • Joint embedding captures fine-grained features.
  • Suitable for re-ranking - sorting retrieval results.

Cons:

  • Resource-intensive.
  • Not suitable for retrieval in large collections.
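
The two scores compose naturally into a retrieve-then-rerank pipeline. Here is a hedged sketch, reusing the model loaded in the Quick Start and assuming the image collection was pre-encoded into an (N, 256) tensor of unimodal embeddings plus a list of per-image feature tensors (the function and argument names are illustrative):

import torch.nn.functional as F

def search(model, query: str, image_embeddings, image_features, top_k: int = 10):
    text_data = model.preprocess_text(query)
    text_features, text_embedding = model.encode_text(text_data, return_features=True)

    # Stage 1: cheap cosine retrieval over the entire collection.
    similarities = F.cosine_similarity(image_embeddings, text_embedding)
    candidates = similarities.topk(top_k).indices.tolist()

    # Stage 2: expensive joint re-ranking of the short candidate list.
    reranked = []
    for idx in candidates:
        joint_embedding = model.encode_multimodal(
            image_features=image_features[idx],
            text_features=text_features,
            attention_mask=text_data['attention_mask'],
        )
        score = model.get_matching_scores(joint_embedding).item()
        reranked.append((score, idx))

    return sorted(reranked, reverse=True)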
