Accelerate aggregated MD5 hashing performance up to 8x for AVX512 and 4x for AVX2. Useful for server applications that need to compute many MD5 sums in parallel.

md5-simd

This is a SIMD accelerated MD5 package, allowing up to either 8 (AVX2) or 16 (AVX512) independent MD5 sums to be calculated on a single CPU core.

It was originally based on the md5vec repository by Igneous Systems, but has been made more flexible by, among other things, supporting different message sizes per lane and adding AVX512.

md5-simd integrates a mechanism similar to the one described in minio/sha256-simd, making it easy for clients to take advantage of the parallel nature of the MD5 calculation. This will result in reduced overall CPU load.

It is important to understand that md5-simd does not speed up a single-threaded MD5 hash sum. Rather, it allows multiple independent MD5 sums to be computed in parallel on the same CPU core, thereby making more efficient use of the computing resources.

Usage

Documentation

To use md5-simd, you must first create a Server, which can then be used to instantiate one or more objects for MD5 hashing.

These objects conform to the regular hash.Hash interface and as such the normal Write/Reset/Sum functionality works as expected.

As an example:

    // Create server
    server := md5simd.NewServer()
    defer server.Close()

    // Create hashing object (conforming to hash.Hash)
    md5Hash := server.NewHash()
    defer md5Hash.Close()

    // Write one (or more) blocks
    md5Hash.Write(block)
    
    // Return digest
    digest := md5Hash.Sum([]byte{})

To maintain performance, both the Server and each individual Hasher should be closed with the Close() function when no longer needed.

A Hasher can be reused efficiently by calling Reset().
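As a minimal sketch of this reuse pattern, the program below hashes several messages with one hasher and calls Reset() between them. It relies only on the hash.Hash interface; `newHasher` and `sumAll` are hypothetical helper names, and crypto/md5 stands in for the object that server.NewHash() would return.

```go
package main

import (
	"crypto/md5"
	"fmt"
	"hash"
)

// newHasher is a stand-in (assumption): with md5-simd this would return
// server.NewHash() instead of a crypto/md5 hasher.
func newHasher() hash.Hash { return md5.New() }

// sumAll hashes every message with a single hasher, calling Reset between
// messages instead of creating a new hasher each time.
func sumAll(h hash.Hash, messages [][]byte) [][md5.Size]byte {
	digests := make([][md5.Size]byte, 0, len(messages))
	for _, msg := range messages {
		h.Reset()    // discard state left over from the previous message
		h.Write(msg) // hash.Hash documents that Write never returns an error
		var d [md5.Size]byte
		h.Sum(d[:0]) // append the digest into d's backing array
		digests = append(digests, d)
	}
	return digests
}

func main() {
	messages := [][]byte{[]byte("first"), []byte("second"), []byte("third")}
	for i, d := range sumAll(newHasher(), messages) {
		fmt.Printf("%s: %x\n", messages[i], d)
	}
}
```

With a real md5-simd hasher, the same loop would be followed by a single Close() once the hasher is no longer needed.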

If your system does not support the required instructions, md5-simd falls back to crypto/md5 for hashing.

Limitations

As explained above, md5-simd does not speed up an individual MD5 hash sum computation, unless some hierarchical tree construct is used, but this would produce different outcomes. Running a single hash on a server results in approximately half the throughput.

Instead, it allows running multiple MD5 calculations in parallel on a single CPU core. This can be beneficial in, for example, multi-threaded server applications where many goroutines are dealing with many requests and multiple MD5 calculations can be packed/scheduled for parallel execution on a single core.

This will result in a lower overall CPU usage as compared to using the standard crypto/md5 functionality where each MD5 hash computation will consume a single thread (core).

It is best to test and measure the overall CPU usage in a representative usage scenario in your application to get an overall understanding of the benefits of md5-simd as compared to crypto/md5, ideally under heavy CPU load.

Also note that md5-simd works best with large objects, so if your application only hashes small objects of a few kilobytes you may be better off using crypto/md5.
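The multi-goroutine scenario described in this section can be sketched as follows. This is an illustration under assumptions, not a benchmark: `newHasher` and `hashConcurrently` are hypothetical names, and crypto/md5 stands in for the hasher so the program runs with the standard library alone. With md5-simd, every goroutine would obtain its hasher from one shared Server and close it when done.

```go
package main

import (
	"crypto/md5"
	"fmt"
	"hash"
	"sync"
)

// newHasher is a stand-in (assumption): with md5-simd this would return
// server.NewHash() from a Server shared by all goroutines.
func newHasher() hash.Hash { return md5.New() }

// hashConcurrently computes one digest per input, each in its own goroutine.
func hashConcurrently(inputs [][]byte) [][md5.Size]byte {
	digests := make([][md5.Size]byte, len(inputs))
	var wg sync.WaitGroup
	for i, in := range inputs {
		wg.Add(1)
		go func(i int, in []byte) {
			defer wg.Done()
			h := newHasher()
			h.Write(in)
			h.Sum(digests[i][:0]) // each goroutine writes only its own slot
		}(i, in)
	}
	wg.Wait()
	return digests
}

func main() {
	// 16 independent streams of 64KB each, one per lane of a single server.
	inputs := make([][]byte, 16)
	for i := range inputs {
		inputs[i] = make([]byte, 64<<10)
		inputs[i][0] = byte(i) // make every input distinct
	}
	for i, d := range hashConcurrently(inputs) {
		fmt.Printf("stream %2d: %x\n", i, d)
	}
}
```

Measuring total CPU time for this kind of workload with both hashers, under realistic load, is the comparison the section recommends.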

Performance

For the best performance, writes should be a multiple of 64 bytes, ideally a multiple of 32KB. To help with that, a buffered := bufio.NewWriterSize(hasher, 32<<10) can be inserted if you are unsure of the sizes of the writes. Remember to flush buffered before reading the hash.
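A minimal sketch of the buffering advice above: many small writes pass through a 32KB bufio.Writer, so the hasher itself only sees large writes, and the buffer is flushed before the digest is read. `newHasher` and `hashInSmallWrites` are hypothetical names, and crypto/md5 stands in for the md5-simd hasher.

```go
package main

import (
	"bufio"
	"crypto/md5"
	"fmt"
	"hash"
)

// newHasher is a stand-in (assumption): with md5-simd this would return
// server.NewHash().
func newHasher() hash.Hash { return md5.New() }

// hashInSmallWrites feeds data to h in pieces of chunkSize bytes through a
// 32KB bufio.Writer. chunkSize must be greater than zero.
func hashInSmallWrites(h hash.Hash, data []byte, chunkSize int) ([md5.Size]byte, error) {
	var digest [md5.Size]byte
	buffered := bufio.NewWriterSize(h, 32<<10)
	for len(data) > 0 {
		n := chunkSize
		if n > len(data) {
			n = len(data)
		}
		if _, err := buffered.Write(data[:n]); err != nil {
			return digest, err
		}
		data = data[n:]
	}
	// Flush before reading the hash, otherwise buffered bytes are missing.
	if err := buffered.Flush(); err != nil {
		return digest, err
	}
	h.Sum(digest[:0])
	return digest, nil
}

func main() {
	data := make([]byte, 100000)
	for i := range data {
		data[i] = byte(i)
	}
	digest, err := hashInSmallWrites(newHasher(), data, 100)
	if err != nil {
		fmt.Println("hashing failed:", err)
		return
	}
	fmt.Printf("%x\n", digest)
}
```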

A single 'server' can process 16 streams concurrently with 1 core (AVX-512) or 2 cores (AVX2). In situations where it is likely that more than 16 streams are fully loaded it may be beneficial to use multiple servers.

The following chart compares the multi-core performance between crypto/md5 vs the AVX2 vs the AVX512 code:

md5-performance-overview

Compared to crypto/md5, the AVX2 version is up to 4x faster:

$ benchcmp crypto-md5.txt avx2.txt 
benchmark                     old MB/s     new MB/s     speedup
BenchmarkParallel/32KB-4      2229.22      7370.50      3.31x
BenchmarkParallel/64KB-4      2233.61      8248.46      3.69x
BenchmarkParallel/128KB-4     2235.43      8660.74      3.87x
BenchmarkParallel/256KB-4     2236.39      8863.87      3.96x
BenchmarkParallel/512KB-4     2238.05      8985.39      4.01x
BenchmarkParallel/1MB-4       2233.56      9042.62      4.05x
BenchmarkParallel/2MB-4       2224.11      9014.46      4.05x
BenchmarkParallel/4MB-4       2199.78      8993.61      4.09x
BenchmarkParallel/8MB-4       2182.48      8748.22      4.01x

Compared to crypto/md5, the AVX512 is up to 8x faster (for larger block sizes):

$ benchcmp crypto-md5.txt avx512.txt
benchmark                     old MB/s     new MB/s     speedup
BenchmarkParallel/32KB-4      2229.22      11605.78     5.21x
BenchmarkParallel/64KB-4      2233.61      14329.65     6.42x
BenchmarkParallel/128KB-4     2235.43      16166.39     7.23x
BenchmarkParallel/256KB-4     2236.39      15570.09     6.96x
BenchmarkParallel/512KB-4     2238.05      16705.83     7.46x
BenchmarkParallel/1MB-4       2233.56      16941.95     7.59x
BenchmarkParallel/2MB-4       2224.11      17136.01     7.70x
BenchmarkParallel/4MB-4       2199.78      17218.61     7.83x
BenchmarkParallel/8MB-4       2182.48      17252.88     7.91x

These measurements were performed on an AWS EC2 instance of type c5.xlarge equipped with a Xeon Platinum 8124M CPU at 3.0 GHz.

If only one or two inputs are available, the scalar calculation method is used, since it is the faster option in these cases.

Operation

To make operation as easy as possible there is a “Server” coordinating everything. The server keeps track of individual hash states and updates them as new data comes in. This can be visualized as follows:

server-architecture

The data is sent to the server from each hash input in blocks of up to 32KB per round. In our testing we found this to be the block size that yielded the best results.

Whenever there is data available, the server will collect data for up to 16 hashes and process all 16 lanes in parallel. This means that if 16 hashes have data available, all the lanes will be filled. However, since that may not be the case, the server will fill fewer lanes and do a round anyway. Lanes can also be partially filled if less than 32KB of data is written.

server-lanes-example

In this example, 4 lanes are fully filled and 2 lanes are partially filled. In this case the black areas will simply be masked out from the results and ignored. This is also why calculating a single hash on a server will not result in any speedup, and why hash writes should be a multiple of 32KB for the best performance.
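To make the lane-filling behaviour concrete, here is a rough sketch that plans a single round from the number of bytes pending for each hash. It is not the server's actual scheduling code: `planRound`, `maxLanes` and `laneBytes` are hypothetical names, and measuring utilization against all 16 lanes is an assumption made for illustration.

```go
package main

import "fmt"

const (
	maxLanes  = 16       // lanes processed per round, per the description above
	laneBytes = 32 << 10 // at most 32KB per lane per round
)

// planRound decides how many bytes each of up to 16 lanes consumes in one
// round, given the bytes pending for each hash. It returns the per-lane
// sizes and the fraction of the round's capacity that carries real data.
func planRound(pending []int) (lanes []int, utilization float64) {
	total := 0
	for _, p := range pending {
		if len(lanes) == maxLanes {
			break // all lanes taken, the rest waits for the next round
		}
		if p <= 0 {
			continue // nothing to hash for this input
		}
		n := p
		if n > laneBytes {
			n = laneBytes
		}
		lanes = append(lanes, n)
		total += n
	}
	return lanes, float64(total) / float64(maxLanes*laneBytes)
}

func main() {
	// Four hashes with at least 32KB available and two with less.
	pending := []int{32 << 10, 64 << 10, 32 << 10, 1 << 20, 8 << 10, 16 << 10}
	lanes, util := planRound(pending)
	fmt.Println("bytes per lane:", lanes)
	fmt.Printf("utilization: %.1f%%\n", util*100)
}
```

The unused remainder of the round corresponds to the masked-out areas: work the SIMD unit performs but whose results are discarded.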

With AVX512, all 16 calculations are done on a single core; with AVX2, they are spread over 2 cores if there is data for more than 8 lanes. So for optimal usage there should be data available for all 16 hashes. It may be perfectly reasonable to use more than 16 concurrent hashes.

Design & Tech

md5-simd has both an AVX2 (8-lane parallel) and an AVX512 (16-lane parallel) algorithm to accelerate the computation, with the following function definitions:

    //go:noescape
    func block8(state *uint32, base uintptr, bufs *int32, cache *byte, n int)

    //go:noescape
    func block16(state *uint32, ptrs *int64, mask uint64, n int)

The AVX2 version is based on the md5vec repository and is essentially unchanged except for minor (cosmetic) changes.

The AVX512 version is derived from the AVX2 version but adds some further optimizations and simplifications.

Caching in upper ZMM registers

The AVX2 version passes in a cache8 block of memory (about 0.5 KB) for temporary storage of intermediate results during ROUND1 which are subsequently used during ROUND2 through to ROUND4.

Since AVX512 has double the number of registers (32 ZMM registers as compared to 16 YMM registers), it is possible to use the upper 16 ZMM registers for keeping the intermediate states on the CPU. As such, there is no need to pass in a corresponding cache16 into the AVX512 block function.

Direct loading using 64-bit pointers

The AVX2 version uses the VPGATHERDD instruction (for YMM) to do a parallel load of 8 lanes using 8 independent 32-bit offsets. Since there is no control over how the 8 slices that are passed into the (Golang) blockMd5 function are laid out in memory, it is not possible to derive a "base" address and corresponding offsets (all within 32 bits) for all 8 slices.

As such, the AVX2 version uses an interim buffer to collect the byte slices to be hashed from all 8 input slices and passes this buffer, along with (fixed) 32-bit offsets, into the assembly code.
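The interim-buffer idea can be sketched in plain Go as follows. This only illustrates the technique; `packLanes` is a hypothetical helper, and the one-region-per-lane layout with zero padding is an assumption, not necessarily the layout the package uses.

```go
package main

import "fmt"

const lanes = 8 // the AVX2 code path handles 8 lanes

// packLanes copies up to blockSize bytes from each input slice into one
// contiguous buffer and returns the buffer together with the fixed 32-bit
// offset of each lane. Lanes shorter than blockSize are zero padded.
func packLanes(inputs [lanes][]byte, blockSize int) ([]byte, [lanes]int32) {
	buf := make([]byte, lanes*blockSize)
	var offsets [lanes]int32
	for i, in := range inputs {
		offsets[i] = int32(i * blockSize)
		// copy stops at the shorter of the two slices, so nil or short
		// inputs are handled without extra checks.
		copy(buf[i*blockSize:(i+1)*blockSize], in)
	}
	return buf, offsets
}

func main() {
	var inputs [lanes][]byte
	inputs[0] = []byte("lane zero data")
	inputs[3] = []byte("lane three")
	buf, offsets := packLanes(inputs, 64)
	fmt.Println("buffer size:", len(buf))
	fmt.Println("lane offsets:", offsets)
}
```

Because every lane now lives at a known offset from a single base address, a gather with 32-bit indices can reach all of them, at the cost of the extra copy.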

For the AVX512 version this interim buffer is not needed since the AVX512 code uses a pair of VPGATHERQD instructions to directly dereference 64-bit pointers (from a base register address that is initialized to zero).

Note that two load (gather) instructions are needed because the AVX512 version processes 16 lanes in parallel, requiring 16 times 64-bit = 1024 bits in total to be loaded. A simple VALIGND and VPORD are subsequently used to merge the lower and upper halves together into a single ZMM register (that contains 16 lanes of 32-bit DWORDS).

Masking support

Because pointers are passed directly from the Golang slices, we need to protect against NULL pointers. For this, a 16-bit mask is passed into the AVX512 assembly code, which is used during the VPGATHERQD instructions to mask out lanes that could otherwise result in segmentation violations.
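A scalar emulation may help to picture the masked gather. The sketch below derives a 16-bit mask from 16 input slices and reads one 32-bit word per enabled lane, never touching masked-out lanes. `laneMask` and `gatherDwords` are hypothetical names, and the rule that a lane is enabled only when it holds a full 64-byte MD5 block is an assumption made for the example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const lanes = 16 // the AVX512 code path handles 16 lanes

// laneMask sets bit i when lane i holds at least one full 64-byte block.
// Nil or short slices leave their bit cleared.
func laneMask(inputs [lanes][]byte) uint16 {
	var mask uint16
	for i, in := range inputs {
		if len(in) >= 64 {
			mask |= 1 << uint(i)
		}
	}
	return mask
}

// gatherDwords emulates a masked gather: for every lane enabled in mask it
// reads the little-endian 32-bit word at byte offset off. Masked-out lanes
// are never dereferenced and yield zero. The caller must keep off+4 <= 64.
func gatherDwords(inputs [lanes][]byte, mask uint16, off int) [lanes]uint32 {
	var out [lanes]uint32
	for i := range inputs {
		if mask&(1<<uint(i)) == 0 {
			continue // skip the lane, as the hardware mask would
		}
		out[i] = binary.LittleEndian.Uint32(inputs[i][off:])
	}
	return out
}

func main() {
	var inputs [lanes][]byte
	inputs[0] = make([]byte, 64)
	inputs[5] = make([]byte, 64)
	copy(inputs[5], []byte{0x78, 0x56, 0x34, 0x12})
	mask := laneMask(inputs)
	fmt.Printf("mask: %016b\n", mask)
	fmt.Printf("lane 5 word: %#x\n", gatherDwords(inputs, mask, 0)[5])
}
```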

Minor optimizations

The roll macro (three instructions on AVX2) is no longer needed for AVX512 and is replaced by a single VPROLD instruction.
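For readers less familiar with rotates, the difference can be illustrated in Go. The first function spells out a rotate-left as three operations (shift left, shift right, OR), analogous to the AVX2 macro; the second uses math/bits.RotateLeft32, which plays the role of the single-instruction form. This is only an analogy on scalar values, and both function names are hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

// rolThreeOps rotates x left by n bits using two shifts and an OR.
func rolThreeOps(x uint32, n uint) uint32 {
	left := x << n
	right := x >> (32 - n)
	return left | right
}

// rolSingleOp rotates x left by n bits with one library call.
func rolSingleOp(x uint32, n uint) uint32 {
	return bits.RotateLeft32(x, int(n))
}

func main() {
	// The per-round rotation amounts used by MD5.
	shifts := []uint{7, 12, 17, 22, 5, 9, 14, 20, 4, 11, 16, 23, 6, 10, 15, 21}
	x := uint32(0x80000001)
	for _, n := range shifts {
		a, b := rolThreeOps(x, n), rolSingleOp(x, n)
		fmt.Printf("rol(%#08x, %2d) = %#08x (forms agree: %v)\n", x, n, a, a == b)
	}
}
```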

Also, several logical operations from the various ROUNDS of the AVX2 version could be combined into a single instruction using ternary logic (with the VPTERNLOGD instruction), resulting in a further simplification and speed-up.
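A ternary-logic instruction selects its result from an 8-bit truth table supplied as an immediate, so any three-input boolean function collapses into one operation. The sketch below computes that table for the four MD5 round functions. It assumes the inputs map to the truth-table index as (b, c, d) from most to least significant bit; the actual assembly may order its operands differently and therefore use other immediates. All names are hypothetical.

```go
package main

import "fmt"

// ternaryImm8 builds the 8-bit truth table of a three-input boolean
// function. Bit (a<<2 | b<<1 | c) of the result holds f(a, b, c).
func ternaryImm8(f func(a, b, c uint8) uint8) uint8 {
	var imm uint8
	for idx := uint8(0); idx < 8; idx++ {
		a, b, c := (idx>>2)&1, (idx>>1)&1, idx&1
		if f(a, b, c)&1 == 1 { // only the lowest bit is meaningful
			imm |= 1 << idx
		}
	}
	return imm
}

// The four MD5 round functions, applied bitwise.
func roundF(b, c, d uint8) uint8 { return (b & c) | (^b & d) }
func roundG(b, c, d uint8) uint8 { return (b & d) | (c & ^d) }
func roundH(b, c, d uint8) uint8 { return b ^ c ^ d }
func roundI(b, c, d uint8) uint8 { return c ^ (b | ^d) }

func main() {
	fmt.Printf("F: %#02x\n", ternaryImm8(roundF))
	fmt.Printf("G: %#02x\n", ternaryImm8(roundG))
	fmt.Printf("H: %#02x\n", ternaryImm8(roundH))
	fmt.Printf("I: %#02x\n", ternaryImm8(roundI))
}
```

Each of these functions takes two or three separate logical instructions when written with AND, OR, XOR and NOT, which is where the simplification comes from.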

Low level block function performance

The benchmark below shows the (single thread) maximum performance of the block() function for AVX2 (having 8 lanes) and AVX512 (having 16 lanes). Also the baseline single-core performance from the standard crypto/md5 package is shown for comparison.

BenchmarkCryptoMd5-4                     687.66 MB/s           0 B/op          0 allocs/op
BenchmarkBlock8-4                       4144.80 MB/s           0 B/op          0 allocs/op
BenchmarkBlock16-4                      8228.88 MB/s           0 B/op          0 allocs/op
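As a small worked calculation on the figures above, the program below derives the aggregate speedup over crypto/md5 and the throughput each lane would see if the aggregate were split evenly across lanes. The even split is an assumption made for illustration, and the function names are hypothetical.

```go
package main

import "fmt"

// Figures copied from the benchmark output above, in MB/s.
const (
	cryptoMD5 = 687.66
	block8    = 4144.80
	block16   = 8228.88
)

// speedup returns how many times faster aggregate is than baseline.
func speedup(aggregate, baseline float64) float64 {
	return aggregate / baseline
}

// perLane returns the throughput per lane, assuming an even split.
func perLane(aggregate float64, lanes int) float64 {
	return aggregate / float64(lanes)
}

func main() {
	fmt.Printf("AVX2:   %.2fx aggregate, %.1f MB/s per lane\n",
		speedup(block8, cryptoMD5), perLane(block8, 8))
	fmt.Printf("AVX512: %.2fx aggregate, %.1f MB/s per lane\n",
		speedup(block16, cryptoMD5), perLane(block16, 16))
}
```

Under that assumption each individual lane runs slower than a single crypto/md5 stream, which is consistent with the earlier point that the gain comes from hashing many streams at once rather than from speeding up any one of them.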

License

md5-simd is released under the Apache License v2.0. You can find the complete text in the file LICENSE.

Contributing

Contributions are welcome; please send PRs for any enhancements.
