Simbase: A vector similarity database

Simbase is a Redis-like vector similarity database. You can add, get, and delete vectors, and then retrieve the most similar vectors within one vector set or between two vector sets.

Release

Current version is v0.1.0-beta1.

Concepts

Simbase uses the following concept model:

                   + - - - +
      +----------->| Basis |<------------------+
      |  belongs   + _ _ _ +      belongs      |
      |                                        |
      |                                        |
+ - - - - - +        source           + - - - - - - - -+ 
| VectorSet |<------------------------| Recommendation |
+ - - - - - +                         + - - - - - - - -+
      ^              target                    |
      |________________________________________|
  • Vector set: a set of vectors
  • Basis: the basis for vectors; all vectors in one vector set share the same basis
  • Recommendation: a one-directional binary relationship between two vector sets that have the same basis
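
The relationships in the diagram can be sketched as plain Python data classes. This is only an illustration of the concept model, not how Simbase stores its data; the class and field names are assumptions chosen to mirror the diagram.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Basis:
    """A named list of components (hypothetical representation)."""
    name: str
    components: tuple


@dataclass
class VectorSet:
    """A set of vectors; every vector in it belongs to the same basis."""
    name: str
    basis: Basis
    vectors: dict = field(default_factory=dict)  # vecid -> list of floats


@dataclass
class Recommendation:
    """A one-directional relation from a source vector set to a target one."""
    source: VectorSet
    target: VectorSet

    def __post_init__(self):
        # The model only allows recommendations between sets on the same basis.
        if self.source.basis != self.target.basis:
            raise ValueError("source and target must share the same basis")


basis = Basis("ba", ("a", "b", "c"))
articles = VectorSet("article", basis)
profiles = VectorSet("userprofile", basis)

# A vector set may be both source and target (article to article).
article_by_article = Recommendation(source=articles, target=articles)
article_by_profile = Recommendation(source=profiles, target=articles)
```

Because the relationship is directional, `Recommendation(profiles, articles)` recommends articles for a user profile, not the other way around.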

A real-world example follows this model:

     + - - - - - +                 + - - - - - - - -+ 
+--->|  Articles |<----------------|  User Profiles |
|    + - - - - - +                 + - - - - - - - -+
|          |
+----------+

This graph shows two recommendations:

  • recommend articles by article (from article to article)
  • recommend articles by user profile (from user profile to article)

How to build and start

To build the project, install Leiningen first, then run:

cd SIMBASE_HOME

lein uberjar

After the uberjar is created, you can start the system:

cd SIMBASE_HOME

bin/start

How to connect to Simbase

You can use redis-cli directly for administration tasks.

Alternatively, you can connect programmatically with the Redis client bindings available for various languages.

Python example

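import redis

# Connect with a regular Redis client; this example uses port 7654
dest = redis.Redis(host='localhost', port=7654)
# Create basis 'ba' with components a, b, c
dest.execute_command('bmk', 'ba', 'a', 'b', 'c')
# Create vector set 'va' on basis 'ba'
dest.execute_command('vmk', 'ba', 'va')
# Create a recommendation from 'va' to itself, scored with cosinesq
dest.execute_command('rmk', 'va', 'va', 'cosinesq')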

Node.js example

var redis = require("redis"), client = redis.createClient(7654, 'localhost');

// Create basis 'ba', vector set 'va', and a recommendation from 'va' to itself
client.send_command('bmk', ['ba', 'a', 'b', 'c']);
client.send_command('vmk', ['ba', 'va']);
client.send_command('rmk', ['va', 'va', 'cosinesq']);

A general application case

For example, to recommend articles to users, we can follow the steps below:

Setup

> bmk b2048 t1 t2 t3 ... t2047 t2048
> vmk b2048 article
> vmk b2048 userprofile
> rmk userprofile article cosinesq

Fill data

> vadd article 1 0.11 0.112 0.1123...
> vadd article 2 0.21 0.212 0.2123...
...    

> vadd userprofile 1 0.11 0.112 0.1123...
> vadd userprofile 2 0.21 0.212 0.2123...
...

Query

> rrec userprofile 2 article
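
The same three steps can be scripted from Python. The sketch below is one possible arrangement, not part of Simbase: `build_commands` and `recommend_articles` are hypothetical helper names, and `client` is assumed to be any object with an `execute_command` method, such as the `redis.Redis(host='localhost', port=7654)` connection from the Python example above.

```python
def build_commands(basis, components, articles, profiles):
    """Return the setup and fill commands, in the order they must be sent."""
    commands = [
        ("bmk", basis, *components),                    # create the basis
        ("vmk", basis, "article"),                      # create both vector sets
        ("vmk", basis, "userprofile"),
        ("rmk", "userprofile", "article", "cosinesq"),  # userprofile -> article
    ]
    # Fill data: one vadd per vector, components passed as separate arguments.
    for vecset, vectors in (("article", articles), ("userprofile", profiles)):
        for vecid, values in vectors.items():
            commands.append(("vadd", vecset, vecid, *values))
    return commands


def recommend_articles(client, commands, user_id):
    """Send every command, then query recommendations for one user profile."""
    for command in commands:
        client.execute_command(*command)
    return client.execute_command("rrec", "userprofile", user_id, "article")
```

Keeping command construction separate from sending makes it easy to inspect or log the exact commands before they reach the server.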

All commands are explained in the next section.

Core commands

You can use redis-cli to connect to Simbase directly and issue the commands below.

Basis related

  • blist

    blist

    List all bases in the system

  • bmk basisname components...

    bmk b512 universe time space human animal plant...

    Create a basis

  • brev basisname components...

    brev b512 plant animal human space time universe...

    Revise a basis

Vector set related

  • vlist basisname

    vlist b512

    List all vector sets with the given basis

  • vmk basisname vecsetname

    vmk b512 article

    Create a vector set

  • vget vecsetname vecid

    vget article 12345678

    Get the vector for the article with id 12345678

  • vadd vecsetname vecid components...

    vadd article 12345678 0.1 0.12 0.123 0.1234 0.12345 0.123456...

    Add the value of the article vector with id 12345678

  • vset vecsetname vecid components...

    vset article 12345678 0.1 0.12 0.123 0.1234 0.12345 0.123456...

    Set the value of the article vector with id 12345678

  • vacc vecsetname vecid components...

    vacc article 12345678 0.1 0.12 0.123 0.1234 0.12345 0.123456...

    Accumulate onto the value of the article vector with id 12345678

  • vrem vecsetname vecid

    vrem article 12345678

    Remove the vector with id 12345678 from the article vector set

Recommendation related

  • rlist vecsetname

    rlist article

    List all recommendation targets that have the given vector set as source

  • rmk vecsetname1 vecsetname2 funcscore

    rmk userprofile article cosinesq

    Create a recommendation from userprofile to article that uses cosinesq as its score function. The score functions currently available are 'cosinesq' and 'jensenshannon'.

  • rrec vecsetname1 vecid vecsetname2

    rrec userprofile 87654321 article

    Recommend articles for user 87654321
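
To make the idea of a score function concrete, here is a brute-force sketch of what a recommendation query computes conceptually. It assumes that cosinesq means the squared cosine of the angle between two vectors, which the name suggests but this document does not define; `cosine_squared`, `recommend`, and the `limit` parameter are hypothetical and do not reflect Simbase's internal implementation.

```python
def cosine_squared(u, v):
    """Squared cosine similarity (assumed meaning of 'cosinesq')."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u)  # squared length of u
    norm_v = sum(b * b for b in v)  # squared length of v
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot * dot / (norm_u * norm_v)


def recommend(source_vector, target_vectors, limit=10):
    """Rank target vectors by score against one source vector, best first."""
    scored = [
        (vecid, cosine_squared(source_vector, vector))
        for vecid, vector in target_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


articles = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
top = recommend([1.0, 0.0], articles, limit=2)
```

Here `top` ranks article "a" first (same direction as the source vector) and "c" second.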

Limitations

Assumptions on vectors

Although Simbase can store arbitrary vectors, score functions may impose constraints on them.

For example, if you adopt "jensenshannon" as your score function, you should ensure that each vector is a probability distribution, i.e. the sum of all its components equals one.
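
One way to meet this constraint is to normalize vectors on the client before writing them. The sketch below shows such a normalization together with the standard Jensen-Shannon divergence (natural logarithm) for reference. The function names are hypothetical, and how Simbase turns the divergence into a score is not specified in this document.

```python
import math


def normalize(values):
    """Scale non-negative components so that they sum to one."""
    total = sum(values)
    if total <= 0 or any(v < 0 for v in values):
        raise ValueError("need non-negative components with a positive sum")
    return [v / total for v in values]


def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two probability distributions."""
    def kl(a, b):
        # Kullback-Leibler divergence; terms with a zero weight contribute 0.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]  # midpoint distribution
    return (kl(p, m) + kl(q, m)) / 2


raw_counts = [3, 1, 4]
distribution = normalize(raw_counts)  # safe to send with vadd or vset
```

Identical distributions have a divergence of zero, and two distributions with no overlap reach the maximum of ln 2.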

Performance consideration

Write operations are handled in a single thread per basis, and each write requires comparing the written vector with the other vectors, so the cost of a write scales as O(n).

We ran a preliminary performance test with dense vectors on an i7 MacBook: Simbase easily handles 100k 1k-dimensional vectors with each write operation taking under 0.14 sec. If the linear scaling holds, Simbase can handle 700k dense vectors with each write operation taking under 1 sec.
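
The extrapolation is simple proportional arithmetic, sketched below. It rests entirely on the assumption that the O(n) scaling holds beyond the measured point; the two constants come from the test above, and the function names are hypothetical.

```python
MEASURED_VECTORS = 100_000   # dense 1k-dimensional vectors in the test
MEASURED_SECONDS = 0.14      # upper bound on one write at that size

SECONDS_PER_VECTOR = MEASURED_SECONDS / MEASURED_VECTORS


def estimated_write_seconds(n_vectors):
    """Estimated time of one write when n_vectors are stored (linear model)."""
    return SECONDS_PER_VECTOR * n_vectors


def max_vectors_within(budget_seconds):
    """Largest vector count whose estimated write time fits the budget."""
    return int(budget_seconds / SECONDS_PER_VECTOR)


estimate = estimated_write_seconds(700_000)  # about 0.98 seconds
```

With these numbers, 700k vectors give an estimated write time of about 0.98 sec, which is where the "under 1 sec" figure comes from.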

Since the data is all in memory, the read operation is pretty fast.

We are still in the process of tuning the performance of the sparse vectors.

Licenses

Simbase is dual licensed under the Apache License 2.0 and Eclipse Public License 1.0. Simbase is free for commercial use and distribution under the terms of either license.

Special thanks

Special thanks to Feng Sheng; we borrowed lots of code from his great project http-kit ( https://github.com/http-kit/http-kit/ ).

Thanks also to Kunwei Zhang from Tsinghua Univ. for his smart idea.
