vellum

NOTE: active development of the vellum library has moved to https://github.com/blevesearch/vellum

This repository will remain as is to support previous Couchbase builds.

A Go library implementing an FST (finite state transducer) capable of:

  • mapping between keys ([]byte) and a value (uint64)
  • enumerating keys in lexicographic order

Some additional goals of this implementation:

  • bounded memory use while building the FST
  • streaming out FST data while building
  • mmap FST runtime to support very large FSTs (optional)

Usage

Building an FST

To build an FST, create a new builder using the New() method. This method takes an io.Writer as an argument. As the FST is being built, data will be streamed to the writer as soon as possible. With this builder you MUST insert keys in lexicographic order. Inserting keys out of order will result in an error. After inserting the last key into the builder, you MUST call Close() on the builder. This will flush all remaining data to the underlying writer.

In memory:

  var buf bytes.Buffer
  builder, err := vellum.New(&buf, nil)
  if err != nil {
    log.Fatal(err)
  }

To disk:

  f, err := os.Create("/tmp/vellum.fst")
  if err != nil {
    log.Fatal(err)
  }
  builder, err := vellum.New(f, nil)
  if err != nil {
    log.Fatal(err)
  }

MUST insert keys in lexicographic order:

  err = builder.Insert([]byte("cat"), 1)
  if err != nil {
    log.Fatal(err)
  }

  err = builder.Insert([]byte("dog"), 2)
  if err != nil {
    log.Fatal(err)
  }

  err = builder.Insert([]byte("fish"), 3)
  if err != nil {
    log.Fatal(err)
  }

  err = builder.Close()
  if err != nil {
    log.Fatal(err)
  }
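
Because the builder rejects out-of-order keys, input that arrives unsorted has to be sorted byte-wise before it is inserted. The following is a minimal sketch of that preparation step using only the standard library. The `pair` type and the `sortPairs` helper are hypothetical names introduced for illustration; they are not part of vellum.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// pair is a hypothetical holder for one key/value entry awaiting insertion.
type pair struct {
	key []byte
	val uint64
}

// sortPairs orders entries by raw byte comparison of their keys,
// which is the lexicographic order the builder expects.
func sortPairs(pairs []pair) []pair {
	sort.Slice(pairs, func(i, j int) bool {
		return bytes.Compare(pairs[i].key, pairs[j].key) < 0
	})
	return pairs
}

func main() {
	pairs := sortPairs([]pair{
		{[]byte("fish"), 3},
		{[]byte("cat"), 1},
		{[]byte("dog"), 2},
	})
	for _, p := range pairs {
		// In real use, each pair would be passed to builder.Insert(p.key, p.val).
		fmt.Printf("%s -> %d\n", p.key, p.val)
	}
}
```

Note that byte-wise order is not the same as dictionary order: uppercase ASCII letters sort before lowercase ones, so "Zebra" comes before "apple".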

Using an FST

After closing the builder, the data can be used to instantiate an FST. If the data was written to disk, you can use the Open() method to mmap the file. If the data is already in memory, or you wish to load/mmap the data yourself, you can instantiate the FST with the Load() method.

Load in memory:

  fst, err := vellum.Load(buf.Bytes())
  if err != nil {
    log.Fatal(err)
  }

Open from disk:

  fst, err := vellum.Open("/tmp/vellum.fst")
  if err != nil {
    log.Fatal(err)
  }

Get key/value:

  val, exists, err := fst.Get([]byte("dog"))
  if err != nil {
    log.Fatal(err)
  }
  if exists {
    fmt.Printf("contains dog with val: %d\n", val)
  } else {
    fmt.Println("does not contain dog")
  }

Iterate key/values:

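  itr, err := fst.Iterator(startKeyInclusive, endKeyExclusive)
  for err == nil {
    key, val := itr.Current()
    fmt.Printf("contains key: %s val: %d\n", key, val)
    err = itr.Next()
  }
  // Reaching the end of the range is reported as vellum.ErrIteratorDone,
  // which is not a failure.
  if err != nil && err != vellum.ErrIteratorDone {
    log.Fatal(err)
  }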

How does the FST get built?

A full explanation of the implementation is beyond the scope of this README, but let's consider a small example where we want to insert 3 key/value pairs.

First we insert "are" with the value 4.

step1

Next, we insert "ate" with the value 2.

step2

Notice how the values associated with the transitions were adjusted so that by summing them while traversing we still get the expected value.

At this point, we see that state 5 looks like state 3, and state 4 looks like state 2. But, we cannot yet combine them because future inserts could change this.

Now, we insert "see" with value 3. Once it has been added, we know that states 5 and 4 can no longer change. Since they are identical to 3 and 2, we replace them.

step3

Again, we see that states 7 and 8 appear to be identical to 2 and 3.

Having inserted our last key, we call Close() on the builder.

step4

Now, states 7 and 8 can safely be replaced with 2 and 3.
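
To make the "summing while traversing" idea concrete, here is a minimal sketch of a hand-built toy transducer holding the same three keys. It is not vellum's data structure or its builder. The state numbers, the `transition` type, the `toyFST` table and the `lookup` function are all assumptions made for illustration, and the numbering does not follow the diagrams above.

```go
package main

import "fmt"

// transition is one labeled edge: consume a byte, add out to the
// running total, and move to state `to`.
type transition struct {
	to  int
	out uint64
}

// toyFST encodes "are"=4, "ate"=2 and "see"=3 after shared suffix
// states have been merged. State 3 is reached from three different
// paths, and state 4 is the single final state.
var toyFST = map[int]map[byte]transition{
	0: {'a': {1, 2}, 's': {2, 3}},
	1: {'r': {3, 2}, 't': {3, 0}},
	2: {'e': {3, 0}},
	3: {'e': {4, 0}},
}

const finalState = 4

// lookup walks the key byte by byte, summing the output of every
// transition taken. The key is present only if the walk ends in the
// final state.
func lookup(key []byte) (uint64, bool) {
	state := 0
	var sum uint64
	for _, b := range key {
		t, ok := toyFST[state][b]
		if !ok {
			return 0, false
		}
		sum += t.out
		state = t.to
	}
	if state != finalState {
		return 0, false
	}
	return sum, true
}

func main() {
	for _, k := range []string{"are", "ate", "see", "ar"} {
		val, ok := lookup([]byte(k))
		fmt.Printf("%s: val=%d found=%v\n", k, val, ok)
	}
}
```

In this sketch the shared prefix "a" carries the output 2, and "are" picks up the remaining 2 on the "r" transition, so both "are" and "ate" still sum to their expected values.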

For additional information, see the references at the bottom of this document.

What does the serialized format look like?

We've broken out a separate document on the vellum disk format v1.

What if I want to use this on a system that doesn't have mmap?

The mmap library itself is guarded with system/architecture build tags, but we've also added an additional build tag in vellum. If you'd like to Open() a file-based representation of an FST, but not use mmap, you can build the library with the nommap build tag. NOTE: if you do this, the entire FST will be read into memory.

Can I use this with Unicode strings?

Yes, however this implementation is only aware of the byte representation you choose. In order to find matches, you must work with some canonical byte representation of the string. In the future, some encoding-aware traversals may be possible on top of the lower-level byte transitions.
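
The need for a canonical byte representation can be seen with the standard library alone. In the sketch below, the same visible character "é" is written once as a single precomposed code point and once as "e" followed by a combining accent; the two encodings are different byte sequences, so they would be different keys. The `sameKey` helper is a hypothetical name used only for this illustration.

```go
package main

import (
	"bytes"
	"fmt"
)

// sameKey reports whether two strings have identical byte
// representations, which is the only notion of equality a
// byte-oriented FST can apply.
func sameKey(a, b string) bool {
	return bytes.Equal([]byte(a), []byte(b))
}

func main() {
	precomposed := "\u00e9" // é as one code point
	decomposed := "e\u0301" // e followed by a combining acute accent

	fmt.Printf("precomposed bytes: % x\n", []byte(precomposed))
	fmt.Printf("decomposed bytes:  % x\n", []byte(decomposed))
	fmt.Println("same key:", sameKey(precomposed, decomposed))
}
```

One option is to normalize every key to a single form (for example NFC, using the golang.org/x/text/unicode/norm package) both when building the FST and when querying it.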

How did this library come to be?

In my work on the Bleve project I became aware of the power of the FST for many search-related tasks. The obvious starting point for such a thing in Go was the mafsa project. While working with mafsa I encountered some issues. First, it did not stream data to disk while building. Second, it chose to use a rune as the fundamental unit of transition in the FST, but I felt using a byte would be more powerful in the end. My hope is that higher-level encoding-aware traversals will be possible when necessary. Finally, as I reported bugs and submitted PRs, I learned that the mafsa project was mainly a research project and was no longer being maintained. I wanted to build something that could be used in production. As the project advanced, more and more techniques from the BurntSushi/fst library were adapted to our implementation.

Are there tools to work with vellum files?

Under the cmd/vellum subdirectory, there's a command-line tool which features subcommands that can allow you to create, inspect and query vellum files.

How can I generate a state transition diagram from a vellum file?

The vellum command-line tool has a "dot" subcommand that can emit graphviz dot output data from an input vellum file. The dot file can in turn be converted into an image using graphviz tools. For example:

$ vellum dot myFile.vellum > output.dot
$ dot -Tpng output.dot -o output.png

Related Work

Much credit goes to two existing projects:

  • mafsa
  • BurntSushi/fst

Most of the original implementation here started with my digging into the internals of mafsa. As the implementation progressed, I continued to borrow ideas/approaches from the BurntSushi/fst library as well.

For a great introduction to this topic, please read the blog post "Index 1,600,000,000 Keys with Automata and Rust".
