Vasto
A distributed high-performance key-value store. On Disk. Eventual consistent. HA. Able to grow or shrink without service interruption.
Vasto scales embedded RocksDB into a distributed key-value store, adding sharding, replication, and support operations to
- create a new keyspace
- delele an existing keyspace
- grow a keyspace
- shrink a keyspace
- replace a node in a keyspace
Why
A key-value store is often re-invented. Why there is another one?
Vasto enables developers to setup a distributed key-value store as simple as creating a map object.
The operations, such as creating/deleting the store, partitioning, replications, seamlessly adding/removing servers, etc, are managed by a few commands. Client connection configurations are managed automatically.
In a sense, Vasto is an in-house cloud providing distributed key-value stores as a service, minus the need to balance performance and cloud service costs, plus consistent and low latency.
Architecture
There are one Vasto master and N number of Vasto stores, plus Vasto clients or Vasto proxies/gateways.
- The Vasto stores are basically simple wrapper of RocksDB.
- The Vasto master manages all the Vasto stores and Vasto clients.
- Vasto clients rely on the master to connect to the right Vasto stores.
- Vasto gateways use Vasto client libraries to support different APIs.
The master is the brain of the system. Vasto does not use gossip protocols, or other consensus algorithms. Instead, Vasto uses a single master for simple setup, fast failure detection, fast topology changes, and precise coordinations. The master only contains soft states and is only required when topology changes. So even if it ever crashes, a simple restart will recover everything.
The Vasto stores simply pass get/put/delete/scan requests to RocksDB. One Vasto store can host multiple db instances.
Go applications can use the client library directly.
Applications in other languages can talk to the Vasto gateway, which uses the client library and reverse proxy the requests to the Vasto stores. The number of Vasto gateways are unlimited. They can be installed on any application machines to reduce one network hop. Or can be on its dedicated machine to reduce number of connections to the Vasto stores if both the number of stores and the number of clients are very high.
Life cycle
One Vasto cluster has one master and multiple Vasto stores. When the store joins the cluster, it is just empty.
When the master receives a request to create a keyspace with x shards and replication factor = y, the master would
- find x stores that meet the requirement and assign it to the keyspace
- ask the stores to create the shards, including replicas.
- inform the clients of the store locations
When the master receives a request to resize the keyspace from m shards to n shards, the master would
- if size increased, find n-m stores that meet the requirement and assign it to the keyspace
- ask the stores to create the shards, including replicas.
- prepare the data to the new stores
- direct the clients traffic to the new stores
- remove retiring shards
Hashing algorithm
Vasto used Jumping Consistent Hash to allocate data. This algorithm
- requires no storage. The master only need soft state to manage all store servers. It is OK to restart master.
- evenly distribute the data into buckets.
- when the number of bucket changes, it can also evenly dividing the workload.
With this jumping hash, the cluster resizing is rather simple, flexible, and efficient:
- Cluster can resize up or down freely.
- Resizing is well coordinated.
- Data can be moved via the most efficient SSTable writes.
- Clients aware of the cluster change and can redirect traffic only when the new whole new server are ready.
Eventual Consistency and Active-Active Replication
All Vasto stores can be used to read and write. The changes will be propagated to other replicas within a few milliseconds. Only the primary replica participate in the normal operations. The replica are participating when the primary replica is down, or in a different data center.
Vasto assumes the data already has the event time. It should be the time when the event really happens, not the time when the data is feed into Vasto system. If the system fails over to the replica partition, and there are multiple changes to one key, the one with latest event times will win.
Client APIs
See https://godoc.org/github.com/chrislusf/vasto/goclient/vs
Example
// create a vasto client talking to master at localhost:8278
vc := vs.NewVastoClient(context.Background(), "client_name", "localhost:8278")
// create a cluster for keyspace ks1, with one server, and one copy of data.
vc.CreateCluster("ks1", 1, 1)
// get a cluster client for ks1
cc := vc.NewClusterClient("ks1")
// operate with the cluster client
var key, value []byte
cc.Put(key, value)
cc.Get(vs.Key(key))
...
// change cluster size to 3 servers
vc.ResizeCluster("ks1", 3)
// operate with the existing cluster client
cc.Put(key, value)
cc.Get(vs.Key(key))
...
Currently only basic go library is provided. The gateway is not ready yet.