  • Language
    Python
  • License
    MIT License
  • Created over 1 year ago
  • Updated over 1 year ago

Repository Details

A tiny nearest-neighbor embedding database built with SQLite and PyTorch. (In development!)

tinyvector - the tiny, least-dumb, speedy vector embedding database.
No, you don't need a vector database. You need tinyvector.

In pre-release: prod-ready by late-July. Still in development, not ready!

Features

  • Tiny: It's in the name. It's just a Flask server, SQLite DB, and NumPy indexes. Extremely easy to customize, under 500 lines of code.
  • Fast: Tinyvector will have speed comparable to advanced vector databases on small to medium datasets.
  • Vertically Scales: Tinyvector stores all indexes in memory for fast querying. Very easy to scale up to 100 million+ vector dimensions without issue.
  • Open Source: MIT Licensed, free forever.
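
To make the "NumPy indexes in memory" idea concrete, here is a minimal sketch of an exact (brute-force) cosine-similarity index. The class name `BruteForceIndex` and its `query` method are hypothetical and are not taken from tinyvector's code. The sketch only shows the general technique: normalize once, then score every stored vector with a single matrix-vector product.

```python
import numpy as np


class BruteForceIndex:
    """Hypothetical in-memory index: exact cosine-similarity search over all rows."""

    def __init__(self, vectors: np.ndarray) -> None:
        vectors = np.asarray(vectors, dtype=np.float32)
        # Normalize once so a dot product equals cosine similarity at query time.
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        self.matrix = vectors / np.clip(norms, 1e-12, None)

    def query(self, vector: np.ndarray, k: int = 3) -> list[tuple[int, float]]:
        q = np.asarray(vector, dtype=np.float32)
        q = q / max(float(np.linalg.norm(q)), 1e-12)
        scores = self.matrix @ q  # one matrix-vector product scores every row
        k = min(k, len(scores))
        # argpartition selects the top-k without a full sort; only those k get sorted.
        top = np.argpartition(-scores, k - 1)[:k]
        top = top[np.argsort(-scores[top])]
        return [(int(i), float(scores[i])) for i in top]


index = BruteForceIndex(np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]))
results = index.query(np.array([1.0, 0.1]), k=2)
print(results)  # list of (row number, similarity), best match first
```

Because the whole index is one contiguous array, scaling up is mostly a question of how much RAM the machine has.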

Soon

  • Powerful Queries: Tinyvector is being upgraded with full SQL querying functionality, something missing from most other databases.
  • Integrated Models: Soon you won't have to bring your own vectors, just generate them on the server automatically. Will support SBert, Hugging Face models, OpenAI, Cohere, etc.
  • Python/JS Client: We'll add a comprehensive Python and JavaScript package for easy integration with tinyvector in the next two weeks.

Versions

🦀 tinyvector in Rust: tinyvector-rs
🐍 tinyvector in Python: tinyvector

We're better than ...

Most vector databases are overkill for something simple like:

  1. Using embeddings to chat with your documents. Most document search is nowhere close to what you'd need to justify accelerating search speed with HNSW or FAISS.
  2. Doing search for your website or store. Unless you're selling 1,000,000 items, you don't need Pinecone.
  3. Performing complex search queries on a very large database. Even if you have 2 million embeddings, this might still be the better option due to vector databases struggling with complex filtering. Tinyvector doesn't support metadata/filtering just yet, but it's very easy for you to add that yourself.
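
As a rough illustration of point 3, here is one way metadata filtering could be added on top of SQLite: let a plain `WHERE` clause narrow the rows, then rank only the survivors by cosine similarity. This is a sketch under assumptions, not tinyvector's implementation. The `items` schema, the float32 blob layout, and the `filtered_search` helper are all made up for the example.

```python
import sqlite3

import numpy as np

conn = sqlite3.connect(":memory:")
# Hypothetical schema: metadata columns next to the raw float32 bytes of each vector.
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, category TEXT, price REAL, embedding BLOB)"
)
rows = [
    (1, "shoes", 40.0, [1.0, 0.0]),
    (2, "shoes", 95.0, [0.9, 0.1]),
    (3, "hats", 15.0, [1.0, 0.0]),
    (4, "shoes", 30.0, [0.0, 1.0]),
]
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?)",
    [(i, c, p, np.asarray(v, dtype=np.float32).tobytes()) for i, c, p, v in rows],
)


def filtered_search(query, where_sql, params=(), k=2):
    """Apply an SQL filter first, then rank only the surviving rows by cosine similarity."""
    # where_sql is treated as trusted code in this sketch; only values go through params.
    found = conn.execute(
        f"SELECT id, embedding FROM items WHERE {where_sql}", params
    ).fetchall()
    if not found:
        return []
    ids = [row[0] for row in found]
    matrix = np.stack([np.frombuffer(row[1], dtype=np.float32) for row in found])
    q = np.asarray(query, dtype=np.float32)
    scores = (matrix @ q) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
    order = np.argsort(-scores)[:k]
    return [(ids[i], float(scores[i])) for i in order]


# Only shoes under 50 are candidates, however similar the other rows are.
hits = filtered_search([1.0, 0.0], "category = ? AND price < ?", ("shoes", 50.0))
print(hits)
```

Filtering before scoring keeps the ranking step small, and the filter itself can be as complex as SQLite allows.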

Usage

# Run the server manually (assumes the dependency file is named requirements.txt):
pip install -r requirements.txt
python -m server

# Run tests:
pip install pytest pytest-mock
pytest

Embeddings?

What are embeddings?

As simple as possible: Embeddings are a way to compare similar things, in the same way humans compare similar things, by converting text into a small list of numbers. Similar pieces of text will have similar numbers, different ones have very different numbers.

Read OpenAI's explanation.
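
A toy example of "similar text, similar numbers", using cosine similarity as the comparison. The three vectors below are invented by hand purely for illustration. Real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up 3-dimensional "embeddings", not the output of any real model.
cat = np.array([0.9, 0.1, 0.0])
kitten = np.array([0.8, 0.2, 0.1])
invoice = np.array([0.0, 0.1, 0.9])

similar = cosine_similarity(cat, kitten)
different = cosine_similarity(cat, invoice)
print(f"cat vs kitten:  {similar:.3f}")
print(f"cat vs invoice: {different:.3f}")
```

The first score comes out close to 1 and the second close to 0, which is all a nearest-neighbor search needs in order to rank results.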

Get involved

tinyvector is going to be growing a lot (don't worry, will still be tiny). Feel free to make a PR and contribute. If you have questions, just mention @willdepue.

Some ideas for first pulls:

  • Add metadata and allow querying/filtering. This is especially important since a lot of vector databases literally don't have a WHERE clause lol (or just an extremely weak one). Not a problem here. Read more about this.
  • Rethink SQLite and choose something else. NoSQL feels fitting for embeddings?
  • Add embedding functions to make adding text easy (sentence transformers, OpenAI, Cohere, etc.)
  • Let's start GPU accelerating with a PyTorch index. GPUs are great at matmuls -> NN search with a fused kernel. Let's put 32 million vectors on a single GPU.
  • Help write unit and integration tests.
  • See all active issues!

Known Issues

# Major bugs:
Data corruption / SQLite error: stored vectors end up changing. To reproduce, create a table, insert vectors, create an index, and then keep using the database until an error happens. Dims end up mismatched (most likely the blob functions or the norm functions, but that doesn't explain why the database itself is changing).
PCA is not tested, and neither is the immutable brute-force index.
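
For anyone digging into the blob side of this, here is a defensive sketch of a vector-to-blob round trip that pins the dtype and checks the dimension in both directions. It is not a diagnosis of the bug above and not tinyvector's actual code. The helper names and the little-endian float32 layout are assumptions. One general pitfall it guards against: bytes written as float64 and read back as float32 decode to twice as many values.

```python
import numpy as np

DTYPE = np.dtype("<f4")  # assumption: vectors are stored as little-endian float32


def vector_to_blob(vector, dim: int) -> bytes:
    """Serialize a vector, refusing anything that is not exactly `dim` long."""
    arr = np.asarray(vector, dtype=DTYPE)
    if arr.shape != (dim,):
        raise ValueError(f"expected shape ({dim},), got {arr.shape}")
    return arr.tobytes()


def blob_to_vector(blob: bytes, dim: int) -> np.ndarray:
    """Deserialize a blob and verify its length before it reaches an index."""
    expected = dim * DTYPE.itemsize
    if len(blob) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(blob)}")
    # Copy so the result is writable and does not alias the database buffer.
    return np.frombuffer(blob, dtype=DTYPE).copy()


original = np.array([0.25, -1.5, 3.0], dtype=np.float32)
restored = blob_to_vector(vector_to_blob(original, dim=3), dim=3)
print(restored)
```

Failing loudly at the read/write boundary at least narrows down where a mismatch first appears.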

License

MIT