• This repository has been archived on 12/Nov/2019
  • Stars: 400
  • Rank: 107,843 (Top 3%)
  • Language: Go
  • License: Apache License 2.0
  • Created: over 8 years ago
  • Updated: about 5 years ago

Repository Details

A distributed sync package.

This project has moved to https://github.com/minio/minio/tree/master/pkg/dsync

A distributed locking and syncing package for Go.

Introduction

dsync is a package for doing distributed locks over a network of n nodes. It is designed with simplicity in mind and hence offers limited scalability (n <= 32). Each node is connected to all other nodes, and lock requests from any node are broadcast to all connected nodes. A node succeeds in getting the lock if n/2 + 1 nodes (whether or not including itself) respond positively. Once acquired, the lock can be held for as long as the client desires and needs to be released afterwards; the release is then broadcast to all nodes, after which the lock becomes available again.

Motivation

This package was developed for the distributed server version of Minio Object Storage. For this we needed a distributed locking mechanism for up to 32 servers, each running minio server. The locking mechanism itself should be a reader/writer mutual exclusion lock, meaning that it can be held by either a single writer or an arbitrary number of readers.

For minio the distributed version is started as follows (for a 6-server system):

$ minio server http://server1/disk http://server2/disk http://server3/disk http://server4/disk http://server5/disk http://server6/disk

(note that the identical command should be run on each of the servers server1 through server6)

Design goals

  • Simple design: by keeping the design simple, many tricky edge cases can be avoided.
  • No master node: there is no concept of a master node; with a master, locking would grind to a complete halt whenever the master went down (unless a failover slave were added, which brings yet more complexity).
  • Resilient: if one or more nodes go down, the other nodes should not be affected and can continue to acquire locks (provided not more than n/2 - 1 nodes are down).
  • Drop-in replacement for sync.RWMutex; supports the sync.Locker interface.
  • Automatically reconnect to (restarted) nodes.

Restrictions

  • Limited scalability: up to 32 nodes.
  • Fixed configuration: changes in the number of nodes and/or their network names/IP addresses require a restart of all nodes in order to take effect.
  • If a down node comes up, it will not try to (re)acquire any locks that it may have held.
  • Not designed for high performance applications such as key/value stores.

Performance

  • Supports up to a total of 7,500 locks/second for 16 nodes (consuming 10% CPU usage per server) on moderately powerful server hardware.
  • Successful lock requests should not take longer than 1 ms (given a decent network connection of 1 Gbit/s or more between the nodes).

The tables below show detailed performance numbers.

Performance with varying number of nodes

This table shows test performance on the same (EC2) instance type but with a varying number of nodes:

EC2 Instance Type | Nodes | Locks/server/sec     | Total Locks/sec | CPU Usage
c3.2xlarge        | 4     | (min=3110, max=3376) | 12972           | 25%
c3.2xlarge        | 8     | (min=1884, max=2096) | 15920           | 25%
c3.2xlarge        | 12    | (min=1239, max=1558) | 16782           | 25%
c3.2xlarge        | 16    | (min=996, max=1391)  | 19096           | 25%

The min and max locks/server/sec gradually decline, but due to the larger number of nodes the overall total number of locks per second rises steadily (at the same CPU usage level).

Performance with different instance types

This table shows test performance for a fixed number of 8 nodes on different EC2 instance types:

EC2 Instance Type    | Nodes | Locks/server/sec     | Total Locks/sec | CPU Usage
c3.large (2 vCPU)    | 8     | (min=823, max=896)   | 6876            | 75%
c3.2xlarge (8 vCPU)  | 8     | (min=1884, max=2096) | 15920           | 25%
c3.8xlarge (32 vCPU) | 8     | (min=2601, max=2898) | 21996           | 10%

With the rise in the number of cores the CPU load decreases and overall performance increases.

Stress test

Stress test on a c3.8xlarge (32 vCPU) instance type:

EC2 Instance Type | Nodes | Locks/server/sec     | Total Locks/sec | CPU Usage
c3.8xlarge        | 8     | (min=2601, max=2898) | 21996           | 10%
c3.8xlarge        | 8     | (min=4756, max=5227) | 39932           | 20%
c3.8xlarge        | 8     | (min=7979, max=8517) | 65984           | 40%
c3.8xlarge        | 8     | (min=9267, max=9469) | 74944           | 50%

The system can be pushed to 75K locks/sec at 50% CPU load.

Usage

NOTE: The previous initialization call dsync.Init([]NetLocker, nodeIndex) has been replaced by dsync.New([]NetLocker, nodeIndex), which returns a *Dsync object that is passed to every NewDRWMutex("test", *Dsync) call.
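For completeness, here is a minimal sketch of how the shared ds (*Dsync) object used in the examples below could be constructed, following the signature mentioned in the note above. The newClient helper is a hypothetical placeholder for your NetLocker implementation, and the exact signature and error return of dsync.New may differ between versions:

// Sketch only: build one NetLocker client per node and create the shared *Dsync
// instance that the examples below refer to as ds.
// newClient is a hypothetical helper returning a dsync.NetLocker for an endpoint;
// nodeIndex is the position of this node's own entry in the clients slice.
func initDsync(endpoints []string, nodeIndex int) (*dsync.Dsync, error) {
	var clients []dsync.NetLocker
	for _, endpoint := range endpoints {
		clients = append(clients, newClient(endpoint))
	}
	return dsync.New(clients, nodeIndex)
}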

Exclusive lock

Here is a simple example showing how to protect a single resource (drop-in replacement for sync.Mutex):

import (
	"context"
	"log"
	"time"

	"github.com/minio/dsync/v3"
)

func lockSameResource() {

	// Create distributed mutex to protect resource 'test'
	dm := dsync.NewDRWMutex(context.Background(), "test", ds)

	dm.Lock("lock-1", "example.go:505:lockSameResource()")
	log.Println("first lock granted")

	// Release 1st lock after 5 seconds
	go func() {
		time.Sleep(5 * time.Second)
		log.Println("first lock unlocked")
		dm.Unlock()
	}()

	// Try to acquire lock again, will block until initial lock is released
	log.Println("about to lock same resource again...")
	dm.Lock("lock-1", "example.go:515:lockSameResource()")
	log.Println("second lock granted")

	time.Sleep(2 * time.Second)
	dm.Unlock()
}

which gives the following output:

2016/09/02 14:50:00 first lock granted
2016/09/02 14:50:00 about to lock same resource again...
2016/09/02 14:50:05 first lock unlocked
2016/09/02 14:50:05 second lock granted

Read locks

DRWMutex also supports multiple simultaneous read locks, analogous to sync.RWMutex, as shown below:

func twoReadLocksAndSingleWriteLock() {

	drwm := dsync.NewDRWMutex(context.Background(), "resource", ds)

	drwm.RLock("RLock-1", "example.go:416:twoReadLocksAndSingleWriteLock()")
	log.Println("1st read lock acquired, waiting...")

	drwm.RLock("RLock-2", "example.go:420:twoReadLocksAndSingleWriteLock()")
	log.Println("2nd read lock acquired, waiting...")

	go func() {
		time.Sleep(1 * time.Second)
		drwm.RUnlock()
		log.Println("1st read lock released, waiting...")
	}()

	go func() {
		time.Sleep(2 * time.Second)
		drwm.RUnlock()
		log.Println("2nd read lock released, waiting...")
	}()

	log.Println("Trying to acquire write lock, waiting...")
	drwm.Lock("Lock-1", "example.go:445:twoReadLocksAndSingleWriteLock()")
	log.Println("Write lock acquired, waiting...")

	time.Sleep(3 * time.Second)

	drwm.Unlock()
}

which gives the following output:

2016/09/02 15:05:20 1st read lock acquired, waiting...
2016/09/02 15:05:20 2nd read lock acquired, waiting...
2016/09/02 15:05:20 Trying to acquire write lock, waiting...
2016/09/02 15:05:22 1st read lock released, waiting...
2016/09/02 15:05:24 2nd read lock released, waiting...
2016/09/02 15:05:24 Write lock acquired, waiting...

Basic architecture

Lock process

The basic steps in the lock process are as follows (a schematic quorum check is sketched after the list):

  • broadcast lock message to all n nodes
  • collect all responses within certain time-out window
    • if quorum met (minimally n/2 + 1 responded positively) then grant lock
    • otherwise release all underlying locks and try again after a (semi-)random delay
  • release any locks that (still) came in after the time-out window
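The quorum test in the second step boils down to counting positive responses. A schematic sketch of that check (illustrative only, not the actual dsync code):

// quorumMet reports whether enough nodes granted the lock.
// Illustrative sketch of the quorum rule described above, not the dsync implementation.
func quorumMet(granted []bool) bool {
	count := 0
	for _, ok := range granted {
		if ok {
			count++
		}
	}
	// The lock is granted when at least n/2 + 1 of the n contacted nodes respond positively.
	return count >= len(granted)/2+1
}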

Unlock process

The unlock process is really simple (a back-off sketch follows the list):

  • broadcast unlock message to all nodes that granted lock
  • if a destination is not available, retry with gradually longer back-off window to still deliver
  • ignore the 'result' (cover for cases where destination node has gone down and came back up)
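To illustrate the back-off behavior in the second bullet, a simplified redelivery loop might look like the sketch below. sendUnlock is a hypothetical function standing in for the actual RPC call; the real dsync retry logic differs in detail:

// Sketch of redelivering an unlock message with a gradually longer back-off window.
// sendUnlock is a hypothetical helper that returns an error while the node is unreachable.
func redeliverUnlock(node, name string) {
	backoff := 100 * time.Millisecond
	for attempt := 0; attempt < 10; attempt++ {
		if err := sendUnlock(node, name); err == nil {
			return // delivered; the result itself is ignored (see the last bullet above)
		}
		time.Sleep(backoff)
		backoff *= 2 // gradually longer back-off window
	}
}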

Dealing with Stale Locks

A 'stale' lock is a lock that is left behind at a node when the client that originally acquired the lock either:

  • never released the lock (due to eg a crash) or
  • is disconnected from the network and henceforth not able to deliver the unlock message.

Too many stale locks can prevent a new lock on a resource from being acquired, namely when the sum of the stale locks and the number of down nodes is greater than n/2 - 1 (for example, with n = 8 a combined total of 4 or more blocks new locks). In dsync a recovery mechanism is implemented to remove stale locks (see here for the details).

Known deficiencies

Known deficiencies can be divided into two categories, namely a) more than one write lock granted and b) lock not becoming available anymore.

More than one write lock

So far we have identified one case in which this can happen (example for an 8-node system):

  • 3 nodes are down (say 6, 7, and 8)
  • node 1 acquires a lock on "test" (nodes 1 through to 5 giving quorum)
  • node 4 and 5 crash (dropping the lock)
  • nodes 4 through to 8 restart
  • node 4 acquires a lock on "test" (nodes 4 through to 8 giving quorum)

Now we have two concurrent locks on the same resource name, which violates the core requirement. Note that if just a single one of nodes 4 and 5 crashes, we are still fine because the second lock cannot reach quorum.

This table summarizes the conditions for different configurations under which this can happen:

Nodes | Down nodes | Crashed nodes | Total nodes down + crashed
4     | 1          | 2             | 3
8     | 3          | 2             | 5
12    | 5          | 2             | 7
16    | 7          | 2             | 9

(for more info see testMultipleServersOverQuorumDownDuringLockKnownError in chaos.go)

Lock not available anymore

This would be due to too many stale locks and/or too many servers being down (a combined total of more than n/2 - 1). The following table shows the maximum tolerable number for different node sizes:

Nodes | Max tolerable (stale locks + down nodes)
4     | 1
8     | 3
12    | 5
16    | 7

If you see any other shortcomings, we would be interested in hearing about them.

Tackled issues

  • When two nodes want to acquire the same lock at precisely the same time, it is possible for each to acquire just n/2 locks, leaving no majority winner. Both will report failure back to their clients and will retry later after a semi-randomized delay.

Server side logic

On the server side just the following logic needs to be added (barring some extra error checking):

const WriteLock = -1

type lockServer struct {
	mutex   sync.Mutex
	lockMap map[string]int64 // Map of locks, with negative value indicating (exclusive) write lock
	                         // and positive values indicating number of read locks
}

func (l *lockServer) Lock(args *LockArgs, reply *bool) error {
	l.mutex.Lock()
	defer l.mutex.Unlock()
	if _, *reply = l.lockMap[args.Name]; !*reply {
		l.lockMap[args.Name] = WriteLock // No locks held on the given name, so claim write lock
	}
	*reply = !*reply // Negate *reply to return true when lock is granted or false otherwise
	return nil
}

func (l *lockServer) Unlock(args *LockArgs, reply *bool) error {
	l.mutex.Lock()
	defer l.mutex.Unlock()
	var locksHeld int64
	if locksHeld, *reply = l.lockMap[args.Name]; !*reply { // No lock is held on the given name
		return fmt.Errorf("Unlock attempted on an unlocked entity: %s", args.Name)
	}
	if *reply = locksHeld == WriteLock; !*reply { // Unless it is a write lock
		return fmt.Errorf("Unlock attempted on a read locked entity: %s (%d read locks active)", args.Name, locksHeld)
	}
	delete(l.lockMap, args.Name) // Remove the write lock
	return nil
}

If you also want RLock()/RUnlock() functionality, then add this as well:

const ReadLock = 1

func (l *lockServer) RLock(args *LockArgs, reply *bool) error {
	l.mutex.Lock()
	defer l.mutex.Unlock()
	var locksHeld int64
	if locksHeld, *reply = l.lockMap[args.Name]; !*reply {
		l.lockMap[args.Name] = ReadLock // No locks held on the given name, so claim (first) read lock
		*reply = true
	} else {
		if *reply = locksHeld != WriteLock; *reply { // Unless there is a write lock
			l.lockMap[args.Name] = locksHeld + ReadLock // Grant another read lock
		}
	}
	return nil
}

func (l *lockServer) RUnlock(args *LockArgs, reply *bool) error {
	l.mutex.Lock()
	defer l.mutex.Unlock()
	var locksHeld int64
	if locksHeld, *reply = l.lockMap[args.Name]; !*reply { // No lock is held on the given name
		return fmt.Errorf("RUnlock attempted on an unlocked entity: %s", args.Name)
	}
	if *reply = locksHeld != WriteLock; !*reply { // A write-lock is held, cannot release a read lock
		return fmt.Errorf("RUnlock attempted on a write locked entity: %s", args.Name)
	}
	if locksHeld > ReadLock {
		l.lockMap[args.Name] = locksHeld - ReadLock // Remove one of the read locks held
	} else {
		delete(l.lockMap, args.Name) // Remove the (last) read lock
	}
	return nil
}

See dsync-server_test.go for a full implementation.

Sub projects

  • See performance directory for performance measurements
  • See chaos directory for some edge cases

Testing

The full test code (including benchmarks) from sync/rwmutex_test.go is used for testing purposes.

Extensions / Other use cases

Robustness vs Performance

It is possible to trade some robustness for overall performance by not contacting each node for every Lock()/Unlock() cycle. In the normal case (example for n = 16 nodes) a total of 32 RPC messages is sent per cycle (16 for the lock plus 16 for the unlock), and the lock is granted if at least a quorum of n/2 + 1 nodes respond positively. When all nodes are functioning normally this means n = 16 positive responses, that is, n/2 - 1 = 7 responses over the (minimum) quorum of n/2 + 1 = 9. So you could say that this is some overkill: even if 6 nodes are down you still have an extra node over the quorum.

For this case it is possible to reduce the number of nodes to be contacted to, for example, 12. Instead of 32 RPC messages, now 24 messages will be sent, which is 25% less. As the performance largely depends on the number of RPC messages sent, the total locks/second handled by all nodes would increase by about 33% (at the same CPU load).

You do however want to make sure that you have some sort of 'random' selection of which 12 out of the 16 nodes will participate in every lock. See here for some sample code that could help with this.
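As an additional illustration, one way to get such a selection is to derive a deterministic pseudo-random subset from the lock name, so that every node independently picks the same 12-out-of-16 set for a given resource. This is a hypothetical sketch, not part of dsync:

import (
	"hash/fnv"
	"math/rand"
)

// selectNodes deterministically picks k of the given nodes for a lock name,
// so that all participants agree on the subset without extra coordination.
// Hypothetical sketch; dsync itself does not ship this helper.
func selectNodes(nodes []string, name string, k int) []string {
	h := fnv.New64a()
	h.Write([]byte(name))
	r := rand.New(rand.NewSource(int64(h.Sum64()))) // seeded by the lock name

	selected := make([]string, 0, k)
	for _, idx := range r.Perm(len(nodes))[:k] {
		selected = append(selected, nodes[idx])
	}
	return selected
}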

Scale beyond 32 nodes?

Building on the previous example and depending on how resilient you want to be for outages of nodes, you can also go the other way, namely to increase the total number of nodes while keeping the number of nodes contacted per lock the same.

For instance you could imagine a system of 64 nodes where only a quorum majority of 17 would be needed out of 28 nodes. Again this requires some sort of pseudo-random 'deterministic' selection of 28 nodes out of the total of 64 servers (same example as above).

Other techniques

We are well aware that there are more sophisticated systems such as ZooKeeper, Raft, etc. However, we found that for our limited use case they added too much complexity. So if dsync does not meet your requirements, then you are probably better off using one of those systems.

Other links that you may find interesting:

Performance of net/rpc vs grpc

We did an analysis of the performance of net/rpc vs grpc, see here, so we'll stick with net/rpc for now.

License

Released under the Apache License v2.0. You can find the complete text in the file LICENSE.

Contributing

Contributions are welcome, please send PRs for any enhancements.
